GiordanoPaoletti/tracing-us-election-telegram

Tracing the 2024 U.S. Election Debate on Telegram

Paper

This repository contains the code and analysis pipeline accompanying the paper:

Paoletti, G., Ferreira, C.H.G., Vassio, L. et al.
Tracing the 2024 U.S. election debate on Telegram with LLMs and graph analysis.
Social Network Analysis and Mining, 15, 91 (2025).
https://doi.org/10.1007/s13278-025-01504-0


Overview

The project studies how political discourse around the 2024 U.S. Presidential Election unfolded on Telegram. It combines network analysis, topic modelling (BERTopic), and Large Language Models (LLMs) to trace how narratives emerged, spread, and evolved across channels and communities over time.


Repository Structure

tracing-us-election-telegram-main/
│
├── 00_extract_files/           # Dataset loading & language filtering
├── 01a_network_analysis/       # Forward-network construction & community detection
├── 01b_BertTopic_extraction/   # Topic modelling with BERTopic
├── 02_message_summarization/   # LLM-based topic labelling & discussion flow
├── 03_Groups_and_Communities/  # Community-level analysis
├── 03_coordination/            # Detection of coordinated behaviour
└── utils/                      # Shared preprocessing utilities

Pipeline

The analysis is organised as a sequential pipeline. Each stage produces outputs consumed by the next.

Stage 0 — Data Extraction & Language Filtering (00_extract_files/)

File | Description
decompress_db.py | Decompresses the raw SQLite .db (zlib-compressed columns). Warning: decompressed DBs are ~3× the compressed size.
extract_tsv_files.ipynb | Exports messages and metadata from SQLite to TSV files for downstream processing.
explore_the_dataset.ipynb | Exploratory analysis: message volume over time, channel statistics.
check_languages.py | Distributed language detection using PySpark + FastText; flags non-English messages.
check_dominant_language.ipynb | Aggregates language-detection results per channel to identify the dominant language.
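
The per-channel aggregation step can be illustrated with a small sketch. It is not the code in check_dominant_language.ipynb: the function name, the (channel, language) input format, and the min_share threshold are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def dominant_language(detections, min_share=0.5):
    """Return {channel: (language, share)} for channels whose most frequent
    language covers at least `min_share` of their messages.

    `detections` is an iterable of (channel, language_code) pairs, one per
    message (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for channel, lang in detections:
        counts[channel][lang] += 1

    result = {}
    for channel, counter in counts.items():
        lang, n = counter.most_common(1)[0]
        share = n / sum(counter.values())
        if share >= min_share:
            result[channel] = (lang, share)
    return result


# Toy input: one detected language per message.
rows = [
    ("chan_a", "en"), ("chan_a", "en"), ("chan_a", "es"),
    ("chan_b", "ru"), ("chan_b", "ru"),
]
dominant = dominant_language(rows)
```

Channels whose top language falls below the threshold are left out of the result, which makes mixed-language channels easy to inspect separately.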

Stage 1a — Network Analysis (01a_network_analysis/)

Builds a forward (repost) bipartite network of channels and analyses its community structure.

File | Description
Extract_Forword_Bipartite_Network.ipynb | Extracts channel-to-channel forwarding edges from the dataset.
Discovery_Graph.ipynb | Exploratory visualisation of the raw forwarding graph.
Common_Forwards.ipynb | Computes edge weights based on shared forwarded messages between channels.
spectral_clustering.py | Spectral clustering utilities: Laplacian construction, eigenvector decomposition, K-Means assignment, and modularity-based homophily evaluation.
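
The spectral clustering steps listed above (Laplacian, eigenvectors, K-Means, modularity) can be sketched on a toy weighted graph. This is a minimal illustration, not the contents of spectral_clustering.py: the function names are hypothetical, the symmetric normalised Laplacian is one common choice among several, the small K-Means loop is a dependency-free stand-in for a library implementation, and plain Newman modularity is used as a proxy for the homophily evaluation.

```python
import numpy as np


def spectral_embedding(adj, k):
    """Row-normalised embedding built from the k eigenvectors with the smallest
    eigenvalues of the normalised Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg, dtype=float)
    inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    lap = np.eye(len(adj)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)  # eigenvalues are returned in ascending order
    emb = vecs[:, :k]
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.where(norms > 0, norms, 1.0)


def kmeans(points, k, n_iter=50):
    """Tiny Lloyd's K-Means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def modularity(adj, labels):
    """Newman modularity of a partition of an undirected weighted graph."""
    two_m = adj.sum()  # every undirected edge is counted twice
    deg = adj.sum(axis=1)
    same = labels[:, None] == labels[None, :]
    return float(((adj - np.outer(deg, deg) / two_m) * same).sum() / two_m)


# Toy graph: two triangles of channels joined by one weak forwarding edge.
adj = np.zeros((6, 6))
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 0.1)]
for i, j, w in edges:
    adj[i, j] = adj[j, i] = w

labels = kmeans(spectral_embedding(adj, k=2), k=2)
score = modularity(adj, labels)
```

On this toy graph the two triangles end up in separate clusters. In practice the number of clusters k is a modelling choice, and the modularity score is one way to compare candidate values.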

Stage 1b — Topic Modelling (01b_BertTopic_extraction/)

Applies BERTopic to identify recurring themes in the message corpus.

File | Description
00_clean_text.ipynb | Text cleaning pipeline (URL removal, emoji stripping, ASCII normalisation).
06_generate_embeddings.py | Encodes cleaned messages into sentence embeddings.
01_Topic_Modeling_grid_search.py | Grid search over BERTopic hyperparameters.
02_Find_best_model.ipynb | Evaluates grid-search results and selects the best configuration.
03_Visualize_Results.ipynb | Visualises topic distributions and topic-word clouds.
04_Results_best_model.ipynb | Full results of the best BERTopic model.
05_model_selection.ipynb | Additional model selection and validation.
07_label_july_messages.py | Assigns BERTopic labels to messages from July (early campaign period).
08_Compute_group_analysis.ipynb | Per-community breakdown of topic prevalence.
Topic_Modeling.ipynb / Topic_Modeling-CPU.ipynb | End-to-end topic modelling notebooks (GPU and CPU variants).

Stage 2 — LLM-Based Summarisation & Topic Detection (02_message_summarization/)

Uses LLMs (LLaMA, GPT) to produce human-readable topic labels and trace discussion flows.

File | Description
01_extract_samples_of_data.ipynb | Samples representative messages per topic cluster for LLM annotation.
02_chatgpt_summaries.ipynb | Generates topic summaries via the OpenAI API (GPT).
03_LLama_Summarization/ | LLaMA-based summarisation scripts.
04_Topic_Detection_with_LLama.ipynb | Topic detection and zero-shot labelling with LLaMA.
04a_UMAP_parameter_selection.py | Selects UMAP parameters for topic-space dimensionality reduction.
04b_Merging_Similar_Topics.ipynb | Merges semantically redundant topics using LLM judgements.
05_Discussion_Flow.ipynb | Reconstructs the temporal flow of topics across the election timeline.
06_Sparking_events_effect.ipynb | Analyses the impact of key real-world events (debates, announcements) on discussion volume.
Give_a_label_to_the_topic_with_ChatGPT.ipynb | Assigns concise labels to topics using ChatGPT.
annotate_merging_topic.py | Helper script for manual annotation of topic merging decisions.
Cost_Calculation.ipynb | Estimates API usage costs for LLM-based steps.
LLama_time_estimation.ipynb | Benchmarks LLaMA inference time.
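
As an illustration of the first step in this stage, the sketch below draws a fixed number of messages per topic so that each topic can be shown to an LLM with a bounded prompt size. It assumes uniform random sampling with a fixed seed; the repository may select "representative" messages differently, and the function name and input format are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_topic(messages, n_per_topic=3, seed=0):
    """Draw up to `n_per_topic` messages for each topic.

    `messages` is an iterable of (topic_id, text) pairs (hypothetical format).
    A seeded generator keeps the sample reproducible across runs.
    """
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for topic, text in messages:
        by_topic[topic].append(text)
    return {
        topic: rng.sample(texts, min(n_per_topic, len(texts)))
        for topic, texts in sorted(by_topic.items())
    }


# Toy corpus: topic 0 has five messages, topic 1 has two.
corpus = [(0, f"message {i} about topic 0") for i in range(5)]
corpus += [(1, "first message about topic 1"), (1, "second message about topic 1")]
samples = sample_per_topic(corpus, n_per_topic=3)
```

Topics with fewer messages than the requested sample size simply contribute everything they have.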

Stage 3 — Community Analysis (03_Groups_and_Communities/)

File | Description
Find_Communities.ipynb | Applies spectral clustering to detect network communities; characterises each cluster.
Into_the_communities.ipynb | Deep-dive into individual community behaviour, dominant topics, and posting patterns.

Stage 3 — Coordination Detection (03_coordination/)

File | Description
Find_Repeated_messages.ipynb | Detects coordinated inauthentic behaviour by identifying channels that share identical or near-identical messages.
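
One simple way to find channels that share identical or near-identical messages is to fingerprint a normalised version of each text and group channels by fingerprint. The sketch below takes that approach; it is an assumption about how such detection could work, not the logic of Find_Repeated_messages.ipynb, and all function names are hypothetical. Shared messages are only a signal of possible coordination and need further inspection.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def normalise(text):
    """Lowercase, drop URLs and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def shared_messages(posts, min_channels=2):
    """Map each message fingerprint to the sorted list of channels that posted
    it, keeping only fingerprints seen in at least `min_channels` channels."""
    channels_by_hash = defaultdict(set)
    for channel, text in posts:
        norm = normalise(text)
        if norm:
            digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
            channels_by_hash[digest].add(channel)
    return {h: sorted(c) for h, c in channels_by_hash.items() if len(c) >= min_channels}


def co_posting_counts(shared):
    """Count, for every channel pair, how many distinct messages they share."""
    pairs = defaultdict(int)
    for channels in shared.values():
        for a, b in combinations(channels, 2):
            pairs[(a, b)] += 1
    return dict(pairs)


posts = [
    ("chan_a", "BREAKING: Vote now!! https://t.me/x"),
    ("chan_b", "breaking   vote now"),
    ("chan_c", "Unrelated message"),
    ("chan_a", "Unrelated message"),
    ("chan_d", "Something nobody else posted"),
]
pair_counts = co_posting_counts(shared_messages(posts))
```

Exact hashing after normalisation catches copies that differ only in case, links, or punctuation. Detecting looser paraphrases would need a similarity measure instead of a hash.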

Utilities (utils/)

File | Description
preprocess_text.py | clean_text() function (URL/emoji/mention removal, ASCII normalisation) and a PreProcessing class for extended NLP pre-processing (language detection via langdetect).
decompress.py | decompress_db() function for decompressing zlib-compressed SQLite columns.
__init__.py | Exposes decompress_db, clean_text, and PreProcessing at package level.
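
The kind of cleaning that clean_text() is described as performing can be sketched with the standard library alone. The function below is a stand-in with a deliberately different name (clean_text_sketch); the repository's version relies on the emoji and unidecode packages and may behave differently, for example by transliterating non-Latin characters rather than dropping them.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_text_sketch(text):
    """Remove URLs and @mentions, reduce the text to ASCII, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # NFKD splits accented letters into base letter + combining mark; encoding
    # to ASCII with errors="ignore" then drops the marks, emoji, and any other
    # non-ASCII character.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return " ".join(text.split())


cleaned = clean_text_sketch("Café vote 🗳️ @user https://t.me/abc now")
```

Here cleaned is "Cafe vote now". Dropping non-ASCII characters is acceptable for an English-only corpus but would destroy text in other scripts.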

Dependencies

The project uses a mix of standard data-science libraries and NLP-specific tools.

Core dependencies:

  • Python ≥ 3.10
  • pandas, numpy, scipy, scikit-learn
  • networkx
  • bertopic
  • sentence-transformers
  • umap-learn, hdbscan
  • emoji, unidecode, langdetect

For distributed processing (Stage 0):

  • Apache Spark / PySpark
  • fasttext (language detection)

For LLM steps (Stage 2):

  • openai (GPT-based summarisation)
  • LLaMA (local inference; see 03_LLama_Summarization/)

For graph visualisation:

  • matplotlib

Data

The raw data consists of Telegram channel messages collected via the Telegram API and stored in compressed SQLite databases. The data is not distributed in this repository. To reproduce the analysis, obtain the data separately, place the compressed database at the path expected by the Stage 0 scripts, and run Stage 0 to decompress it and export it to TSV.

Note: decompressed databases are approximately three times larger than their compressed counterparts, so check available disk space before running Stage 0.
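
For readers unfamiliar with zlib-compressed SQLite columns, the sketch below shows the general idea on an in-memory database. It is not the repository's decompress_db(): the table name, column names, and the choice to overwrite the column in place are assumptions for illustration.

```python
import sqlite3
import zlib


def decompress_column(conn, table, column, key="id"):
    """Replace zlib-compressed BLOB values in `column` with UTF-8 text.

    `table`, `column`, and `key` are interpolated into SQL, so they must come
    from trusted code, never from user input.
    """
    rows = conn.execute(f"SELECT {key}, {column} FROM {table}").fetchall()
    for row_id, blob in rows:
        if isinstance(blob, (bytes, memoryview)):
            text = zlib.decompress(bytes(blob)).decode("utf-8")
            conn.execute(
                f"UPDATE {table} SET {column} = ? WHERE {key} = ?",
                (text, row_id),
            )
    conn.commit()


# Toy database with one compressed message (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, text BLOB)")
conn.execute(
    "INSERT INTO messages (text) VALUES (?)",
    (zlib.compress("Election day".encode("utf-8")),),
)
decompress_column(conn, "messages", "text")
restored = conn.execute("SELECT text FROM messages").fetchone()[0]
```

Rows that are already stored as text are skipped, so running the function twice on the same table is harmless.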


How to Cite

If you use this code or build on this work, please cite the original paper:

@article{paoletti2025tracing,
  author    = {Paoletti, Giordano and Ferreira, Caio H. G. and Vassio, Luca and others},
  title     = {Tracing the 2024 {U.S.} election debate on {Telegram} with {LLMs} and graph analysis},
  journal   = {Social Network Analysis and Mining},
  volume    = {15},
  pages     = {91},
  year      = {2025},
  doi       = {10.1007/s13278-025-01504-0},
  url       = {https://doi.org/10.1007/s13278-025-01504-0}
}

License

Please refer to the original publication and the repository's license file for terms of use.
