Thematic analysis: how to uncover hidden patterns in text data

From social media analysis to news verification, thematic analysis is used in a variety
of fields. How does this method help journalists and researchers work with big data? Let’s look at examples.


Thematic analysis is a method of analyzing text data that helps identify the main themes, concepts, and patterns in a text. This method is especially useful when working with large volumes of textual information, as it allows for automated extraction of key themes and qualitative content analysis.

The main stages of thematic analysis

1. Data preparation: At the preparation stage, the data is cleaned of uninformative elements such as stop words and punctuation. The data can also be normalized to a consistent format and structure (for example, lowercased text in a single encoding).

2. Tokenization: To conduct thematic analysis, text must be broken down into individual words or “tokens.” This process is called tokenization and allows for analysis at the word level.

3. Creating a Term-Document Matrix (TDM): This is a key stage of thematic analysis, where a matrix is constructed in which the rows represent terms (words) and the columns represent documents (texts). Each cell of the matrix contains information about the frequency of occurrence of the term in the corresponding document.

4. Application of topic modeling methods: Topic modeling methods such as Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF) are applied to the TDM to identify the main topics in the text.

5. Interpretation of results: After applying topic modeling methods, the results obtained are interpreted. This includes analyzing the keywords in each topic, identifying topic clusters and determining their content.
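The five stages above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not a production pipeline: the stop-word list and sample documents are invented for the example, NMF is implemented with the classic multiplicative update rules (one of several possible solvers), and everything is written in pure Python so it runs without extra libraries. In practice a library implementation of LDA or NMF would normally be used instead.

```python
import random
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real projects use fuller lists).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "was"}

def tokenize(text):
    """Stages 1-2: lowercase, drop punctuation and stop words, split into tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_tdm(docs):
    """Stage 3: rows are terms, columns are documents, cells are raw counts."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[float(c[term]) for c in counts] for term in vocab]
    return vocab, matrix

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Stage 4: factor V (terms x docs) into W (terms x topics) and
    H (topics x docs) using multiplicative updates."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps)
              for j in range(len(h[0]))] for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps)
              for k in range(n_topics)] for i in range(len(w))]
    return w, h

def top_terms(vocab, w, topic, n=3):
    """Stage 5: the terms with the largest weight in one topic."""
    ranked = sorted(range(len(vocab)), key=lambda i: w[i][topic], reverse=True)
    return [vocab[i] for i in ranked[:n]]

# Invented sample documents covering two subjects.
docs = [
    "The vaccine trial reported new safety data.",
    "Doctors discussed vaccine safety and trial results.",
    "The election commission counted ballots.",
    "Observers watched the election ballots being counted.",
]
vocab, tdm = build_tdm(docs)
w, h = nmf(tdm, n_topics=2)
for k in range(2):
    print(f"topic {k}:", top_terms(vocab, w, k))
```

The number of topics (2 here) is chosen by hand. Interpreting the result still means reading the top terms of each topic and deciding what narrative, if any, they describe.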

Application of thematic analysis

Thematic analysis is actively used in fact-checking to identify key narratives, verify
the accuracy of information, and detect disinformation campaigns.

  • Analyzing news trends and identifying fakes: Thematic analysis allows you
    to identify recurring themes (for example, “vaccinations and autism,” “election fraud”) and check how well they correspond to real data.

In China, an analysis of the WeChat platform revealed that 31% of fakes were about healthcare, 19% were about consumer rights, and 18% were about politics.

Tools:

– LDA (Latent Dirichlet Allocation) algorithms for topic modeling.

– Keyword analysis in the Term-Document Matrix (TDM).
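Keyword analysis on a TDM is often done by weighting the raw counts so that words appearing in every document score low. The sketch below uses TF-IDF for this. That weighting is an assumption chosen for illustration, since the text does not name one, and the function name and sample posts are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_keywords(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for c in counts for term in c)
    result = []
    for c in counts:
        total = sum(c.values())
        scores = {t: (n / total) * math.log(n_docs / df[t]) for t, n in c.items()}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_n])
    return result

# Invented sample posts.
posts = [
    "vaccines cause autism claim spreads again",
    "election fraud claim spreads online",
    "new study finds vaccines safe",
]
for post, keywords in zip(posts, tfidf_keywords(posts)):
    print(keywords, "<-", post)
```

A term that occurs in every document gets a score of zero, so the ranking surfaces the words that distinguish one text from the rest.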

  • Identifying coordinated disinformation campaigns: Thematic analysis helps identify recurring language and themes in posts from bots or fake accounts to expose coordinated dissemination of false information on social media.

In 2020, a Bloomberg investigation showed how a coordinated disinformation campaign linked COVID-19 to 5G technology, which subsequently led to real attacks on cell towers. Thematic analysis showed that the same false arguments, memes, and claims appeared simultaneously on different social networks. This was evidence that the 5G and COVID-19 conspiracy theory did not spread spontaneously but was promoted in a coordinated manner, which is typical of disinformation attacks.
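One simple way to spot recurring language across accounts is to compare posts by the word sequences they share. The sketch below is an illustration only and does not describe the method used in the investigation mentioned above: it measures Jaccard similarity over three-word shingles, and the account names, sample posts, and the 0.5 threshold are all hypothetical.

```python
import re
from itertools import combinations

def shingles(text, size=3):
    """Set of overlapping word n-grams ("shingles") in a post."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Share of shingles two posts have in common, from 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def similar_pairs(posts, threshold=0.5):
    """Pairs of different accounts whose posts overlap at or above threshold.

    posts is a list of (account, text) tuples. The threshold is an
    arbitrary illustrative cutoff, not a validated value.
    """
    sets = [(account, shingles(text)) for account, text in posts]
    pairs = []
    for (acc_a, s_a), (acc_b, s_b) in combinations(sets, 2):
        score = jaccard(s_a, s_b)
        if acc_a != acc_b and score >= threshold:
            pairs.append((acc_a, acc_b, round(score, 2)))
    return pairs

# Invented sample posts.
posts = [
    ("account_a", "5G towers spread the virus, share before it is deleted"),
    ("account_b", "5G towers spread the virus share before it is deleted!!!"),
    ("account_c", "Our city opened a new library today"),
]
print(similar_pairs(posts))
```

High textual overlap alone does not prove coordination, because ordinary users also copy and repost. It only narrows down which accounts deserve a closer look, for example at posting times.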

  • Fact-checking statements by politicians and public figures. Fact-checkers analyze politicians’ speeches, highlighting key topics (e.g., “economic growth,” “migration”) to determine how often politicians make false or manipulative claims.

In 2024, the FactCheck.org team analyzed the candidates' statements in the debates held ahead of the US presidential election. Through thematic analysis, fact-checkers identified exaggerated, false, and misleading statements by both candidates on immigration, the economy, healthcare, taxes, and other socially significant topics.

  • Monitoring social media for rumors and false information that need
    to be refuted.

Since 2018, the Chinese platform WeChat has had a “Rumor Debunking Assistant” that automatically analyzes message topics and flags potential fakes. Similar systems exist on other social networks and messaging apps, such as X and Telegram. They scan user posts for alarmist and potentially false messages (for example, about “chemtrails” or “microchipping”) and mark them as potentially unreliable.

In addition, thematic analysis is used in the following areas:

  • Scientific research: to analyze the texts of scientific articles and identify key topics and research trends.
  • Social research: to analyze interview transcripts, survey responses, and social media posts and identify key topics and patterns in behavior and opinions.
  • Marketing research: to analyze customer reviews, social media comments, and other sources and identify trends and patterns in consumer preferences.
  • Journalism and media: to identify key themes and stories in news articles, television programs, and other sources.

Challenges and prospects

Thematic analysis has its own challenges: choosing a modeling method, interpreting the results, and preprocessing the data carefully. However, as machine learning methods and topic modeling algorithms develop, thematic analysis remains an important tool for exploring text data and identifying key themes and patterns, and it helps combat misinformation in real time.

From identifying fake narratives to monitoring social networks, thematic analysis allows you to systematically examine large volumes of data, find hidden connections, and quickly respond to information threats. It not only automates routine processes, but also helps you see what escapes superficial analysis: patterns, trends, and manipulations. In the era of Big Data, this approach is becoming not just a useful tool but a necessary one for combating disinformation.