January, 2022


Topic modeling algorithms, according to Blei (2012), are statistical techniques that examine the words of the original texts to find the themes that run through them, how those themes are connected, and how they change over time.  

One reason for their wide use is that these algorithms can be applied to a corpus of unlabeled texts and discover latent features that reflect the themes or topics occurring in that corpus.

The most widely used topic modeling algorithms today include Latent Dirichlet Allocation (LDA; Blei et al., 2003), Probabilistic Latent Semantic Analysis (pLSA; Hofmann, 1999), and the closely related Non-negative Matrix Factorization (NMF; Lee and Seung, 1999).

Topic modeling is currently used in a variety of fields. In computational linguistics and natural language processing, it is used to discover hidden topics in documents. In text mining, the interest lies in clustering text documents into groups with similar topics.

In an ambitious scenario, some might suggest that topic modeling can be used to find hidden or latent social structures in the text of online forums. More modestly, I think we can use this notebook to walk together through the introductory steps needed to implement a topic modeling analysis of scientific texts.

With the help of the definition stated at the beginning, we chose to investigate the following questions:

  • What are some of the literature's most popular topics? 
  • What links those topics together? 
  • How frequently were those topics covered over time?

Approach

The code below provides a gist of how to leverage topic modeling on scientific literature (PDFs).

I recommend applying this approach before conducting a systematic or scoping literature review, as it gives you, the researcher, a clearer picture of the topics you should expect to find in the papers under investigation.

To illustrate the procedure, I collected some papers from three well-known search engines for scientific articles, namely Web of Science, Scopus, and IEEE. In /data, you can find the queries used in each search engine, as well as the exported metadata. I cannot provide the resulting PDFs due to copyright restrictions.

Query phrase: ( ( algorithm* OR automat* OR bot$ OR "artificial intelligence" ) AND ( financ* OR trading ) AND ( "social media" OR twitter OR facebook OR youtube OR reddit OR telegram ) AND ( manipulat* OR disinformation ) )

If you use this code or find it interesting, please cite us!

@article{pohl2022social,
  title={Social bots spreading disinformation about finance: research trends, and ethical challenges},
  author={Pohl, Janina and Griesbach, Marie and Samiei, Alireza and Araujo, Adelson},
  journal={In press},
  volume={0},
  number={0},
  pages={0},
  year={2022},
  publisher={In press}
}

Feel free to leave a comment with a suggestion.

Imports
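The import cell itself is not reproduced in this excerpt. A minimal set of imports that the cells below rely on might look like the following; the helper functions read_pdfs and get_topic_words are assumed to live in a local utils module (the exact path here is a guess), and rough sketches of both are given further down.

import pandas as pd
import seaborn as sns
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline

# Project helpers (hypothetical path; adjust to your own utils module)
from utils.helpers import read_pdfs, get_topic_words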

# Paths and analysis parameters
PDF_FOLDER = './data/pdf'
MASK_IMAGE_PATH = './images/cloud.png'
OUTPUT_FOLDER = './outputs'

SEED = 0
N_TOPICS = 6
N_WORDS_PER_TOPIC = 50
N_WORDS_HIGHLIGHT_WORDCLOUD = 5

# English stop words list (one word per line)
with open('./utils/stop_words_EN.txt', 'r') as f:
    STOP_WORDS = f.read().split('\n')

Load documents

documents, filenames, years = read_pdfs(PDF_FOLDER)
    
print(f'{len(documents)} PDFs loaded as text.')
19 PDFs loaded as text.
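read_pdfs is a small project helper that is not shown in this excerpt. A minimal sketch of what it could do, assuming pdfminer.six for text extraction and the 'Author - YEAR - Title.pdf' filename convention used below to recover publication years, is:

import os
import re
from pdfminer.high_level import extract_text  # assumes pdfminer.six is installed

def read_pdfs(pdf_folder):
    """Read every PDF in pdf_folder; return raw texts, filenames, and publication years."""
    documents, filenames, years = [], [], []
    for fname in sorted(os.listdir(pdf_folder)):
        if not fname.lower().endswith('.pdf'):
            continue
        documents.append(extract_text(os.path.join(pdf_folder, fname)))
        filenames.append(fname)
        # Filenames follow 'Author - YEAR - Title.pdf'; grab the first 4-digit year
        match = re.search(r'\b(19|20)\d{2}\b', fname)
        years.append(int(match.group()) if match else None)
    return documents, filenames, years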

Preprocessing

%%time

nlp = spacy.load("en_core_web_sm")

docs = []

# Filter lemmas for tokens in ['ADJ', 'VERB', 'NOUN', 'PROPN']
for doc in nlp.pipe(documents):
    filtered = ' '.join([token.lemma_ for token in doc if token.pos_ in ['ADJ', 'VERB', 'NOUN', 'PROPN']])
    docs.append(filtered)
CPU times: user 37.4 s, sys: 10.5 s, total: 47.9 s
Wall time: 1min 9s

Topic model

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_df=0.90, min_df=2,
                              max_features=1000,
                              stop_words=STOP_WORDS,
                              token_pattern=r'(?u)\b[A-Za-z]+\b',
                              ngram_range=(1, 3))),
    ('topic_model', NMF(N_TOPICS, 
                        max_iter=1000,
                        random_state=1))
])
pipe.fit(docs)
Pipeline(steps=[('tfidf',
                 TfidfVectorizer(max_df=0.9, max_features=1000, min_df=2,
                                 ngram_range=(1, 3),
                                 stop_words=['ll', 'tis', 'twas', 've', '000',
                                             '00', 'a', 'as', 'able',
                                             'ableabout', 'about', 'above',
                                             'abroad', 'abst', 'accordance',
                                             'according', 'accordingly',
                                             'across', 'act', 'actually', 'ad',
                                             'added', 'adj', 'adopted', 'ae',
                                             'af', 'affected', 'affecting',
                                             'affects', 'after', ...],
                                 token_pattern='(?u)\\b[A-Za-z]+\\b')),
                ('topic_model',
                 NMF(max_iter=1000, n_components=6, random_state=1))])

Analysis

a. What are some prominent topics in the literature?
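The cell below uses get_topic_words, another project helper not included in this excerpt. A rough sketch of how it could be implemented, assuming scikit-learn >= 1.0 for get_feature_names_out, is:

import numpy as np

def get_topic_words(pipe, n_words):
    """For each NMF topic, return its top-n words and the corresponding weights."""
    vocab = np.array(pipe.named_steps['tfidf'].get_feature_names_out())
    topic_words, words_importance = {}, {}
    for topic_id, weights in enumerate(pipe.named_steps['topic_model'].components_):
        top = weights.argsort()[::-1][:n_words]
        topic_words[topic_id] = vocab[top]
        words_importance[topic_id] = weights[top]
    return topic_words, words_importance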

topic_words, words_importance = get_topic_words(pipe, N_WORDS_PER_TOPIC)

tw = pd.DataFrame(topic_words).melt(var_name='topic_id', value_name='word')
wi = pd.DataFrame(words_importance).melt(var_name='topic_id', value_name='importance')

topic_word_importance = pd.concat([tw,wi['importance']], axis=1)
topic_word_importance = topic_word_importance.set_index(['topic_id', 'word'])
topic_word_importance.sample(5)
                     importance
topic_id  word
0         price        0.145109
2         detection    0.141785
3         human        0.149593
5         source       0.046277
3         download     0.089857
# Manual labels for the six NMF topics, keyed by topic_id
TOPIC_LABELS = {
    0: 'Sentiment analysis',
    1: 'Bot detection',
    2: 'Campaign promotion',
    3: 'Disinformation and privacy',
    4: 'Pumps and dumps',
    5: 'Rumor spreading'
}
docs_topics = pd.DataFrame(pipe.transform(docs), 
                           columns=list(TOPIC_LABELS.values()),
                           index=filenames)
docs_topics
Sentiment analysis Bot detection Campaign promotion Disinformation and privacy Pumps and dumps Rumor spreading
Ibrahim - 2021 - Forecasting the early market movement in bitcoin using twitter's sentiment analysis An ensemble-based prediction model.pdf 0.469492 0.000000 0.000000 0.041820 0.041896 0.000000
Lange, Kettani - 2019 - On Security Threats of Botnets to Cyber Systems.pdf 0.000000 0.251676 0.000000 0.267881 0.000000 0.000000
Sela et al - 2020 - Using connected accounts to enhance information spread in social networks.pdf 0.000000 0.140555 0.015975 0.000000 0.069932 0.620700
Kudugunta - 2018 - Deep neural networks for bot detection.pdf 0.076697 0.580714 0.072897 0.000000 0.000000 0.000000
Geckil et al. - 2018 - A clickbait detection method on news sites.pdf 0.085922 0.000000 0.183729 0.326146 0.035637 0.103153
Tardelli et al. - 2020 - Characterizing Social Bots Spreading Financial Disinformation.pdf 0.028358 0.750603 0.000000 0.000000 0.000000 0.060125
Khaund et al. - 2021 - Social Bots and Their Coordination During Online Campaigns A Survey.pdf 0.000000 0.726069 0.000000 0.095858 0.030996 0.000000
Mahmood - 2019 - Antidatamining framework - Better privacy on Online Social Networks and Beyond.pdf 0.000000 0.007458 0.000000 0.642140 0.005048 0.000000
Dogan et al. - 2020 - Speculator and Influencer Evaluation in Stock Market by Using Social Media.pdf 0.509662 0.000000 0.000000 0.078509 0.000000 0.057278
Nizzoli et al. - 2020 - Charting the Landscape of Online Cryptocurrency Manipulation.pdf 0.000000 0.054760 0.000000 0.000000 0.826464 0.000000
Varol et al. - 2017 - Early detection of promoted campaigns on social media.pdf 0.003892 0.009413 0.866455 0.000000 0.000049 0.048419
Golmohammadi, Zaiane - 2017 - Sentiment Analysis on Twitter to Improve Time Series Contextual Anomaly Detection for Detecting Stock Market Manipulation.pdf 0.427723 0.000000 0.151739 0.000000 0.000000 0.000000
Tardelli et al - 2022 - Detecting inorganic financial compaigns on Twitter.pdf 0.047495 0.492811 0.236611 0.000000 0.004616 0.168862
Fernandez Vilas, Diaz Redondo, Lorenzo Garcia - 2020 - The Irruption of Cryptocurrencies into Twitter Cashtags A Classifying Solution.pdf 0.410586 0.032230 0.000000 0.000000 0.064085 0.092800
Mirtaheri et al. - 2021 - Identifying and Analyzing Cryptocurrency Manipulations in Social Media.pdf 0.071860 0.000000 0.009478 0.014403 0.803376 0.000000
Ferrara et al. - 2016 - Detection of Promoted Social Media Campaigns.pdf 0.000000 0.000000 0.876062 0.000000 0.000000 0.000000
Isle, Smith - 2019 - Real World Examples Suggest a Path to Automated Mitigation of Disinformation.pdf 0.009876 0.000000 0.000000 0.699743 0.000000 0.000000
Land, Aronson - 2020 - Human Rights and Technology New Challenges for Justice and Accountability.pdf 0.000000 0.000000 0.000000 0.544316 0.000000 0.079105
Majumdar, Bose - 2018 - Detection of financial rumors using big data analytics the case of the Bombay Stock Exchange.pdf 0.000000 0.000000 0.000000 0.000000 0.000000 1.054615
sns.boxplot(data=docs_topics.melt(), y='variable', x='value')
[Output: boxplot of topic weight distributions across documents, one box per topic]

b. How are those themes connected to each other?

                            Sentiment analysis  Bot detection  Campaign promotion  Disinformation and privacy  Pumps and dumps  Rumor spreading
Sentiment analysis                         0.0            0.0                 2.0                         1.0              0.0              2.0
Bot detection                              0.0            0.0                 1.0                         2.0              0.0              2.0
Campaign promotion                         2.0            1.0                 0.0                         1.0              0.0              2.0
Disinformation and privacy                 1.0            2.0                 1.0                         0.0              0.0              1.0
Pumps and dumps                            0.0            0.0                 0.0                         0.0              0.0              0.0
Rumor spreading                            2.0            2.0                 2.0                         1.0              0.0              0.0
Node sizes: {'Sentiment analysis': 5000, 'Bot detection': 6000, 'Campaign promotion': 5000, 'Disinformation and privacy': 6000, 'Pumps and dumps': 2000, 'Rumor spreading': 5000}
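The code producing the matrix and node sizes above is not included in this excerpt. A minimal sketch of how such a topic co-occurrence network could be built from docs_topics, assuming topics "co-occur" in a document when both weights exceed a small threshold (the threshold and node-size scaling here are illustrative guesses, not necessarily what produced the numbers above), is:

import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# A topic counts as "present" in a document when its weight exceeds the threshold
THRESHOLD = 0.05  # illustrative value

presence = (docs_topics > THRESHOLD).astype(int)

# Topic-by-topic co-occurrence counts, with the diagonal zeroed out
cooccurrence = presence.T.dot(presence).astype(float)
for topic in cooccurrence.columns:
    cooccurrence.loc[topic, topic] = 0

# Weighted graph; node size scaled by how many documents feature each topic (illustrative scaling)
G = nx.from_pandas_adjacency(cooccurrence)
node_sizes = [1000 * (1 + presence[topic].sum()) for topic in cooccurrence.columns]

nx.draw_networkx(G, node_size=node_sizes, font_size=8)
plt.axis('off')
plt.show()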
Tip: draw this in another tool (e.g. draw.io) if you want a prettier representation.

c. How frequently were those themes addressed over time?
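
No code or output for this step appears in this excerpt. A minimal sketch, assuming the years list returned by read_pdfs lines up with filenames (and hence with the rows of docs_topics), could aggregate the dominant topic of each document per year and plot the counts:

# Dominant topic per document, paired with its publication year
dominant = docs_topics.idxmax(axis=1).rename('topic').to_frame()
dominant['year'] = years  # assumes years is in the same order as filenames

# Count papers per year for each dominant topic
topic_trends = dominant.groupby(['year', 'topic']).size().unstack(fill_value=0)

ax = topic_trends.plot(kind='bar', stacked=True, figsize=(10, 5))
ax.set_ylabel('Number of papers')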

References

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
  • Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
  • Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 50-57).
  • Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788-791.