Topic modeling on scientific literature
How can we automatically discover trends in research literature using topic modeling?
January, 2022
Topic modeling algorithms, according to Blei (2012), are statistical techniques that examine the words of the original texts to find the themes that run through them, how those themes are connected, and how they change over time.
One reason for their wide use is that these algorithms can be applied to a corpus of unlabeled texts and discover latent features that reflect the themes or topics occurring in that corpus.
The most widely used topic modeling algorithms today include Latent Dirichlet Allocation (LDA; Blei et al., 2003), Probabilistic Latent Semantic Analysis (pLSA; Hofmann, 1999), and the closely related Non-negative Matrix Factorization (NMF; Lee and Seung, 1999).
Topic modeling is applied in a variety of fields. In computational linguistics and natural language processing, it is used to discover hidden topics in documents; in text mining, the interest lies in clustering text documents into groups with similar topics.
In an ambitious scenario, one might even use topic modeling to uncover hidden or latent social structures in the text of online forums. More modestly, this notebook walks through the introductory steps needed to run a topic modeling analysis on scientific texts.
With the definition above in mind, we chose to investigate the following questions:
- What are some of the literature's most popular topics?
- What links those topics together?
- How frequently were those topics covered over time?
Approach
The code below gives a gist of how to apply topic modeling to scientific literature (PDFs).
I recommend running this analysis before conducting a systematic or scoping literature review, as it gives you an early overview of the topics you should expect to find in the papers under investigation.
To illustrate the procedure, I collected papers from three well-known search engines for scientific articles, namely Web of Science, Scopus and IEEE. In /data, you can find the queries used in each search engine as well as the exported metadata. I cannot share the resulting PDFs due to copyright restrictions.
Query phrase: ( ( algorithm* OR automat* OR bot$ OR "artificial intelligence" ) AND ( financ* OR trading ) AND ( "social media" OR twitter OR facebook OR youtube OR reddit OR telegram ) AND ( manipulat* OR disinformation ) )
If you use this code or find it interesting, please cite us!
@article{pohl2022social,
  title={Social bots spreading disinformation about finance: research trends, and ethical challenges},
  author={Pohl, Janina and Griesbach, Marie and Samiei, Alireza and Araujo, Adelson},
  journal={In press},
  volume={0},
  number={0},
  pages={0},
  year={2022},
  publisher={In press}
}
Feel free to leave a comment with a suggestion.
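The code cells below assume the following imports. The helper functions read_pdfs and get_topic_words are project-specific; here they are assumed to live in the local utils package, but the exact module path is an assumption that you may need to adjust.

import pandas as pd
import seaborn as sns
import spacy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Project-specific helpers; the module name 'utils.helpers' is an assumption
from utils.helpers import read_pdfs, get_topic_words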
PDF_FOLDER = './data/pdf'                # folder with the collected paper PDFs
MASK_IMAGE_PATH = './images/cloud.png'   # mask image used to shape the word clouds
OUTPUT_FOLDER = './outputs'              # folder for generated outputs
SEED = 0                                 # random seed for reproducibility
N_TOPICS = 6                             # number of topics to extract with NMF
N_WORDS_PER_TOPIC = 50                   # top words kept per topic
N_WORDS_HIGHLIGHT_WORDCLOUD = 5          # words highlighted in each topic's word cloud
with open('./utils/stop_words_EN.txt', 'r') as f:
    STOP_WORDS = f.read().split('\n')
documents, filenames, years = read_pdfs(PDF_FOLDER)
print(f'{len(documents)} PDFs loaded as text.')
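read_pdfs is a project-specific helper that is not shown here. A minimal sketch of what it could look like, assuming the pypdf library is available and that each file is named '<year>_<title>.pdf' (both are assumptions; adapt them to your own setup):

import os
from pypdf import PdfReader

def read_pdfs(folder):
    """Read every PDF in `folder` and return its text, filename and year."""
    documents, filenames, years = [], [], []
    for fname in sorted(os.listdir(folder)):
        if not fname.lower().endswith('.pdf'):
            continue
        reader = PdfReader(os.path.join(folder, fname))
        # Concatenate the extracted text of all pages into one string per document
        documents.append(' '.join(page.extract_text() or '' for page in reader.pages))
        filenames.append(fname)
        # Assumption: the publication year is encoded at the start of the filename
        years.append(int(fname.split('_')[0]))
    return documents, filenames, years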
%%time
nlp = spacy.load("en_core_web_sm")
docs = []
# Filter lemmas for tokens in ['ADJ', 'VERB', 'NOUN', 'PROPN']
for doc in nlp.pipe(documents):
    filtered = ' '.join([token.lemma_ for token in doc if token.pos_ in ['ADJ', 'VERB', 'NOUN', 'PROPN']])
    docs.append(filtered)
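Note that full paper texts can exceed spaCy's default limit of 1,000,000 characters per document; if nlp.pipe complains about text length, the limit can be raised before processing (at the cost of more memory):

nlp.max_length = 2_000_000  # only needed for unusually long PDFs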
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_df=0.90, min_df=2,
                              max_features=1000,
                              stop_words=STOP_WORDS,
                              token_pattern=r'(?u)\b[A-Za-z]+\b',
                              ngram_range=(1, 3))),
    ('topic_model', NMF(N_TOPICS,
                        max_iter=1000,
                        random_state=1))
])
pipe.fit(docs)
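get_topic_words is the second project-specific helper. A plausible sketch of how it could extract the top words and their weights from the fitted pipeline (its output format is an assumption, chosen to match how the result is used below):

import numpy as np

def get_topic_words(pipe, n_words):
    """Return the top `n_words` terms and their NMF weights for each topic."""
    vocab = np.array(pipe.named_steps['tfidf'].get_feature_names_out())
    components = pipe.named_steps['topic_model'].components_
    topic_words, words_importance = {}, {}
    for topic_id, weights in enumerate(components):
        top = np.argsort(weights)[::-1][:n_words]  # indices of the highest-weighted terms
        topic_words[topic_id] = vocab[top].tolist()
        words_importance[topic_id] = weights[top].tolist()
    return topic_words, words_importance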
topic_words, words_importance = get_topic_words(pipe, N_WORDS_PER_TOPIC)
tw = pd.DataFrame(topic_words).melt(var_name='topic_id', value_name='word')
wi = pd.DataFrame(words_importance).melt(var_name='topic_id', value_name='importance')
topic_word_importance = pd.concat([tw,wi['importance']], axis=1)
topic_word_importance = topic_word_importance.set_index(['topic_id', 'word'])
topic_word_importance.sample(5)
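The constants MASK_IMAGE_PATH and N_WORDS_HIGHLIGHT_WORDCLOUD suggest that each topic is also rendered as a word cloud. That step is not reproduced above; a minimal sketch using the wordcloud package could look like this (the mask image and styling choices are assumptions):

from PIL import Image
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np

mask = np.array(Image.open(MASK_IMAGE_PATH))
for topic_id, group in topic_word_importance.groupby('topic_id'):
    # Map each of the topic's words to its NMF importance
    freqs = group['importance'].droplevel('topic_id').to_dict()
    wc = WordCloud(mask=mask, background_color='white').generate_from_frequencies(freqs)
    plt.figure()
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Topic {topic_id}')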
TOPIC_LABELS = {
    1: 'Sentiment analysis',
    2: 'Bot detection',
    3: 'Campaign promotion',
    4: 'Disinformation and privacy',
    5: 'Pumps and dumps',
    6: 'Rumor spreading'
}
docs_topics = pd.DataFrame(pipe.transform(docs),
                           columns=list(TOPIC_LABELS.values()),
                           index=filenames)
docs_topics
sns.boxplot(data=docs_topics.melt(), y='variable', x='value')
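The third question, how often the topics were covered over time, can be addressed with the publication years returned by read_pdfs. The exact figure is not reproduced here, but one possible sketch aggregates the document-topic weights per year:

# Sum the topic weights of all documents published in the same year
topics_over_time = docs_topics.assign(year=years).groupby('year').sum()
topics_over_time.plot(kind='line', figsize=(10, 5), ylabel='Aggregated topic weight')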
References
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
- Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
- Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 50-57).
- Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788-791.