Tracing the 2024 U.S. Election Debate on Telegram

This repository contains the code and analysis pipeline accompanying the paper:

Paoletti, G., Ferreira, C.H.G., Vassio, L. et al.
Tracing the 2024 U.S. election debate on Telegram with LLMs and graph analysis.
Social Network Analysis and Mining, 15, 91 (2025).
https://doi.org/10.1007/s13278-025-01504-0

Overview

The project studies how political discourse around the 2024 U.S. Presidential Election unfolded on Telegram. It combines network analysis, topic modelling (BERTopic), and Large Language Models (LLMs) to trace how narratives emerged, spread, and evolved across channels and communities over time.

Repository Structure

tracing-us-election-telegram-main/
│
├── 00_extract_files/           # Dataset loading & language filtering
├── 01a_network_analysis/       # Forward-network construction & community detection
├── 01b_BertTopic_extraction/   # Topic modelling with BERTopic
├── 02_message_summarization/   # LLM-based topic labelling & discussion flow
├── 03_Groups_and_Communities/  # Community-level analysis
├── 03_coordination/            # Detection of coordinated behaviour
└── utils/                      # Shared preprocessing utilities

Pipeline

The analysis is organised as a sequential pipeline in which each stage produces outputs consumed by later stages.

Stage 0 — Data Extraction & Language Filtering (`00_extract_files/`)

File	Description
`decompress_db.py`	Decompresses the raw SQLite `.db` (zlib-compressed columns). Warning: decompressed DBs are ~3× the compressed size.
`extract_tsv_files.ipynb`	Exports messages and metadata from SQLite to TSV files for downstream processing.
`explore_the_dataset.ipynb`	Exploratory analysis: message volume over time, channel statistics.
`check_languages.py`	Distributed language detection using PySpark + FastText; flags non-English messages.
`check_dominant_language.ipynb`	Aggregates language-detection results per channel to identify dominant language.
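As an illustration of the per-channel aggregation step, the sketch below picks a dominant language for each channel from per-message detection results. It is a minimal example, not the notebook's code: the `(channel_id, language)` input shape, the `dominant_language` name, and the `min_share` threshold are all assumptions.

```python
from collections import Counter, defaultdict


def dominant_language(rows, min_share=0.5):
    """Return {channel_id: (language or None, share)}.

    rows is an iterable of (channel_id, language_code) pairs, one per message.
    A channel gets None when no language reaches min_share of its messages.
    The input shape and the threshold are assumptions made for this sketch.
    """
    counts = defaultdict(Counter)
    for channel_id, lang in rows:
        counts[channel_id][lang] += 1

    result = {}
    for channel_id, counter in counts.items():
        lang, n = counter.most_common(1)[0]
        share = n / sum(counter.values())
        result[channel_id] = (lang if share >= min_share else None, share)
    return result


# Toy data: hypothetical channel ids and language codes.
rows = [
    ("chan_a", "en"), ("chan_a", "en"), ("chan_a", "es"),
    ("chan_b", "ru"), ("chan_b", "ru"), ("chan_b", "ru"), ("chan_b", "en"),
    ("chan_c", "en"), ("chan_c", "de"),
]
dominant = dominant_language(rows, min_share=0.6)
```

Channels whose dominant language is not English (or is undetermined) could then be dropped before the later stages.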

Stage 1a — Network Analysis (`01a_network_analysis/`)

Builds a bipartite network of message forwards (reposts) between channels and analyses its community structure.

File	Description
`Extract_Forword_Bipartite_Network.ipynb`	Extracts channel-to-channel forwarding edges from the dataset.
`Discovery_Graph.ipynb`	Exploratory visualisation of the raw forwarding graph.
`Common_Forwards.ipynb`	Computes edge weights based on shared forwarded messages between channels.
`spectral_clustering.py`	Spectral clustering utilities: Laplacian construction, eigenvector decomposition, K-Means assignment, and modularity-based homophily evaluation.
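The common-forwards weighting can be sketched as follows: two channels are linked with a weight equal to the number of distinct messages both of them forwarded. This is a hedged illustration under assumed inputs (`(channel_id, forwarded_message_id)` pairs); the notebook may define the weight differently, for example with a normalisation.

```python
from collections import defaultdict
from itertools import combinations


def common_forward_weights(forwards):
    """Return {(channel_a, channel_b): number of shared forwarded messages}.

    forwards is an iterable of (channel_id, forwarded_message_id) pairs.
    Both the input shape and the raw-count weight are assumptions.
    """
    # Bipartite view: forwarded message -> set of channels that forwarded it.
    channels_by_message = defaultdict(set)
    for channel_id, message_id in forwards:
        channels_by_message[message_id].add(channel_id)

    # Project onto channels: every pair sharing a message gains one unit of weight.
    weights = defaultdict(int)
    for channels in channels_by_message.values():
        for a, b in combinations(sorted(channels), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Toy data with hypothetical channel and message identifiers.
forwards = [
    ("A", "m1"), ("B", "m1"),
    ("A", "m2"), ("B", "m2"), ("C", "m2"),
    ("C", "m3"),
]
weights = common_forward_weights(forwards)
```

The resulting weighted edge list is the kind of input a spectral clustering step would consume.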

Stage 1b — Topic Modelling (`01b_BertTopic_extraction/`)

Applies BERTopic to identify recurring themes in the message corpus.

File	Description
`00_clean_text.ipynb`	Text cleaning pipeline (URL removal, emoji stripping, ASCII normalisation).
`06_generate_embeddings.py`	Encodes cleaned messages into sentence embeddings.
`01_Topic_Modeling_grid_search.py`	Grid search over BERTopic hyperparameters.
`02_Find_best_model.ipynb`	Evaluates grid-search results and selects the best configuration.
`03_Visualize_Results.ipynb`	Visualises topic distributions and topic-word clouds.
`04_Results_best_model.ipynb`	Full results of the best BERTopic model.
`05_model_selection.ipynb`	Additional model selection and validation.
`07_label_july_messages.py`	Assigns BERTopic labels to messages from July (early campaign period).
`08_Compute_group_analysis.ipynb`	Per-community breakdown of topic prevalence.
`Topic_Modeling.ipynb` / `Topic_Modeling-CPU.ipynb`	End-to-end topic modelling notebooks (GPU and CPU variants).
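A grid search of this kind enumerates every combination of candidate hyperparameter values and fits one model per combination. The sketch below shows only the enumeration step. The parameter names are real UMAP (`n_neighbors`, `n_components`) and HDBSCAN (`min_cluster_size`) parameters, but which parameters and values the repository actually searches is not stated here, so the grid is purely illustrative.

```python
from itertools import product


def iter_configs(grid):
    """Yield one dict per combination of the candidate values in grid."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


# Illustrative grid: the values are placeholders, not the ones used in the paper.
grid = {
    "n_neighbors": [10, 30],
    "n_components": [5, 10],
    "min_cluster_size": [50, 100, 200],
}
configs = list(iter_configs(grid))
```

Each configuration would then be passed to a model-fitting function and scored, with the best-scoring one kept for the later notebooks.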

Stage 2 — LLM-Based Summarisation & Topic Detection (`02_message_summarization/`)

Uses LLMs (LLaMA, GPT) to produce human-readable topic labels and trace discussion flows.

File	Description
`01_extract_samples_of_data.ipynb`	Samples representative messages per topic cluster for LLM annotation.
`02_chatgpt_summaries.ipynb`	Generates topic summaries via the OpenAI API (GPT).
`03_LLama_Summarization/`	LLaMA-based summarisation scripts.
`04_Topic_Detection_with_LLama.ipynb`	Topic detection and zero-shot labelling with LLaMA.
`04a_UMAP_parameter_selection.py`	Selects UMAP parameters for topic-space dimensionality reduction.
`04b_Merging_Similar_Topics.ipynb`	Merges semantically redundant topics using LLM judgements.
`05_Discussion_Flow.ipynb`	Reconstructs the temporal flow of topics across the election timeline.
`06_Sparking_events_effect.ipynb`	Analyses the impact of key real-world events (debates, announcements) on discussion volume.
`Give_a_label_to_the_topic_with_ChatGPT.ipynb`	Assigns concise labels to topics using ChatGPT.
`annotate_merging_topic.py`	Helper script for manual annotation of topic merging decisions.
`Cost_Calculation.ipynb`	Estimates API usage costs for LLM-based steps.
`LLama_time_estimation.ipynb`	Benchmarks LLaMA inference time.
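Before a topic can be summarised by an LLM, a bounded number of its messages has to be selected for the prompt. The sketch below draws a reproducible uniform random sample per topic. Uniform sampling, the `(topic_id, text)` input shape, and the function name are assumptions; the sampling notebook may select representative messages differently. BERTopic assigns outliers to topic `-1`, and the sketch skips them by default.

```python
import random
from collections import defaultdict


def sample_per_topic(messages, k, seed=0, skip_outliers=True):
    """Return {topic_id: up to k sampled texts}.

    messages is an iterable of (topic_id, text) pairs (assumed shape).
    Topic -1 is BERTopic's outlier topic and is skipped unless requested.
    """
    by_topic = defaultdict(list)
    for topic_id, text in messages:
        if skip_outliers and topic_id == -1:
            continue
        by_topic[topic_id].append(text)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {
        topic_id: rng.sample(texts, min(k, len(texts)))
        for topic_id, texts in by_topic.items()
    }


# Toy data with hypothetical topic ids and texts.
messages = [(0, "a"), (0, "b"), (0, "c"), (1, "d"), (-1, "noise")]
samples = sample_per_topic(messages, k=2)
```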

Stage 3 — Community Analysis (`03_Groups_and_Communities/`)

File	Description
`Find_Communities.ipynb`	Applies spectral clustering to detect network communities; characterises each cluster.
`Into_the_communities.ipynb`	Deep-dive into individual community behaviour, dominant topics, and posting patterns.

Stage 3 — Coordination Detection (`03_coordination/`)

File	Description
`Find_Repeated_messages.ipynb`	Detects coordinated inauthentic behaviour by identifying channels that share identical or near-identical messages.
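One simple way to find messages repeated across channels is to normalise each text, hash it, and group channels by hash. The sketch below does that. It is an assumption-laden illustration rather than the notebook's method: it only catches texts that match after lowercasing and stripping URLs and punctuation, whereas true near-duplicate detection would need something like shingling or MinHash.

```python
import hashlib
import re
from collections import defaultdict


def normalise(text):
    """Lowercase, drop URLs and punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def repeated_messages(messages, min_channels=2):
    """Return {text_hash: set of channels} for texts posted by several channels.

    messages is an iterable of (channel_id, text) pairs (assumed shape).
    """
    channels_by_hash = defaultdict(set)
    for channel_id, text in messages:
        key = normalise(text)
        if not key:
            continue  # nothing left after cleaning, e.g. a URL-only message
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        channels_by_hash[digest].add(channel_id)
    return {
        digest: channels
        for digest, channels in channels_by_hash.items()
        if len(channels) >= min_channels
    }


# Toy data with hypothetical channels and texts.
messages = [
    ("A", "Vote NOW! https://t.me/x"),
    ("B", "vote now"),
    ("C", "Something else"),
]
groups = repeated_messages(messages)
```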

Utilities (`utils/`)

File	Description
`preprocess_text.py`	`clean_text()` function (URL/emoji/mention removal, ASCII normalisation) and a `PreProcessing` class for extended NLP pre-processing (language detection via `langdetect`).
`decompress.py`	`decompress_db()` function for decompressing zlib-compressed SQLite columns.
`__init__.py`	Exposes `decompress_db`, `clean_text`, and `PreProcessing` at package level.
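To give a feel for the cleaning steps listed for `clean_text()`, here is a standard-library approximation. It is not the repository's implementation: the function name `clean_text_sketch` and the regular expressions are assumptions, and it relies on `unicodedata` where the dependency list suggests the real code uses `emoji` and `unidecode`.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_text_sketch(text):
    """Remove URLs and @mentions, then reduce the text to plain ASCII."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # NFKD splits accented letters into base letter + combining mark; encoding
    # to ASCII with errors="ignore" then drops the marks, emoji, and any other
    # non-ASCII character.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    return " ".join(text.split())


cleaned = clean_text_sketch("Café debate 🔥 via @newsbot https://example.com/x")
```

Unlike `unidecode`, this approach discards characters that have no ASCII decomposition instead of transliterating them, so results on non-Latin scripts will differ.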

Dependencies

The project uses a mix of standard data-science libraries and NLP-specific tools.

Core dependencies:

Python ≥ 3.10
pandas, numpy, scipy, scikit-learn
networkx
bertopic
sentence-transformers
umap-learn, hdbscan
emoji, unidecode, langdetect

For distributed processing (Stage 0):

Apache Spark / PySpark
fasttext (language detection)

For LLM steps (Stage 2):

openai (GPT-based summarisation)
LLaMA (local inference; see 03_LLama_Summarization/)

For graph visualisation:

matplotlib

Data

The raw data consists of Telegram channel messages collected via the Telegram API and stored in compressed SQLite databases. The data is not distributed in this repository. To reproduce the analysis, place the compressed databases at the path the Stage 0 scripts expect, then run Stage 0 to decompress them and export the contents to TSV.

Note: decompressed databases are approximately three times larger than their compressed counterparts.
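Decompressing a zlib-compressed SQLite column can be sketched as below. The table name `messages`, the column name `body`, and the in-place update are assumptions made for the example; the actual schema and the behaviour of `decompress_db()` are not described in this README.

```python
import sqlite3
import zlib


def decompress_column(conn, table, column):
    """Replace zlib-compressed BLOBs in table.column with UTF-8 text.

    Returns the number of rows updated. The table and column names are
    interpolated into the SQL, so they must come from trusted code.
    """
    rows = conn.execute(f"SELECT rowid, {column} FROM {table}").fetchall()
    updated = 0
    for rowid, blob in rows:
        if blob is None:
            continue
        text = zlib.decompress(blob).decode("utf-8")
        conn.execute(
            f"UPDATE {table} SET {column} = ? WHERE rowid = ?", (text, rowid)
        )
        updated += 1
    conn.commit()
    return updated


# Self-contained demo on an in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body BLOB)")
conn.execute(
    "INSERT INTO messages (body) VALUES (?)",
    (zlib.compress("hello telegram".encode("utf-8")),),
)
n_updated = decompress_column(conn, "messages", "body")
body = conn.execute("SELECT body FROM messages").fetchone()[0]
```

Because the decompressed text is written back into the database, the file grows accordingly, which is worth checking against available disk space before running on the full dataset.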

How to Cite

If you use this code or build on this work, please cite the original paper:

@article{paoletti2025tracing,
  author    = {Paoletti, Giulio and Ferreira, Caio H. G. and Vassio, Luca and others},
  title     = {Tracing the 2024 {U.S.} election debate on {Telegram} with {LLMs} and graph analysis},
  journal   = {Social Network Analysis and Mining},
  volume    = {15},
  pages     = {91},
  year      = {2025},
  doi       = {10.1007/s13278-025-01504-0},
  url       = {https://doi.org/10.1007/s13278-025-01504-0}
}

License

Please refer to the original publication and the repository's license file for terms of use.
