Digital tribes

Reddit is one of the largest social networks in the world, and its vast and diverse communities make it the perfect environment to analyze how tribal behavior manifests in online interactions.

This project presents a statistical analysis of cross-community hostility patterns on Reddit, specifically comparing political and sports communities. Our main goal is to discover whether online hostility follows universal linguistic patterns across different domains, examining how both the frequency and the vocabulary of conflict differ between political discourse and sports rivalry.

Datastory — Final notebook

❓ Research questions

We focus on understanding the nature of tribal hostility across domains:

Political vs Sports Hostility
- How do baseline hostility rates differ between political and sports communities?
- Which camps (ideological groups or team fandoms) generate the most hostile interactions?
- What role do "observer" communities (like r/subredditdrama) play in hostility patterns?
Universal Hostility Signature
- Do hostile posts use the same linguistic markers across domains?
- Can classifiers trained on one domain detect hostility in another?
- What LIWC (Linguistic Inquiry and Word Count) features characterize hostile interactions?
Event Impact Analysis
- How do major events (elections, championships) affect hostility patterns?
- Are these effects temporary spikes or lasting behavioral changes?

📊 Additional datasets

For our project, we searched online for several additional datasets that could support richer statistics, but after the EDA we decided that the following one was sufficient.

Embedding vectors of subreddits (communities on Reddit) source

For practical reasons, we host the datasets on Kaggle so that everyone can download them automatically. Hosting them on GitHub was not possible because the files are too large (> 200 MB).

🗂️ Methods

Data Preprocessing and Filtering

Load all datasets into clean Pandas DataFrames (286,561 body posts, 571,927 title posts)
Restrict all posts to the main dataset period (Jan 2014 to Apr 2017)
Extract 86 features per post: 65 LIWC psycholinguistic markers, 18 text structure features, 3 VADER sentiment scores
Group subreddits into meaningful "camps" using embedding-based clustering

Clustering and Camp Definitions

Apply PCA to 300-dimensional subreddit embeddings (retaining 94% variance at 50 components)
Evaluate K-means clustering using silhouette and Davies-Bouldin scores
Combine data-driven clustering with domain expertise for interpretable camp definitions
Define political camps: trump_conservative, anti_trump, progressive, alt_right, meta_drama, etc.
Define sports camps: NFL divisions, NBA, MLB, NHL, soccer leagues, etc.

Baseline Hostility Analysis

Calculate negative interaction rates for each domain
Compare hostility rates using chi-square tests and Cohen's h effect size
Build camp-to-camp interaction matrices to identify hostile relationships
Analyze which camps are aggressors vs targets
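
As a rough illustration of the rate comparison described above, the sketch below computes Cohen's h from the two rates reported in the key results (17.3% and 5.2%) and a Yates-corrected chi-square statistic for a 2x2 table. The counts are hypothetical (1,000 interactions per domain, chosen only to match the reported rates), and the function names are illustrative rather than taken from the project's modules.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Effect size for two proportions (difference of arcsine transforms)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def chi2_yates(a: int, b: int, c: int, d: int) -> float:
    """Yates-corrected chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    # The continuity correction shrinks |ad - bc| by n/2, never below zero.
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    return n * corrected**2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1dof(chi2: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Rates reported in the key results.
h = cohens_h(0.173, 0.052)

# Hypothetical counts: rows are domains, columns are (negative, non-negative).
chi2 = chi2_yates(173, 827, 52, 948)
p = p_value_1dof(chi2)

print(f"Cohen's h = {h:.2f}, chi2 = {chi2:.1f}, p = {p:.2e}")
```

With the real interaction counts in place of the toy ones, the same two numbers summarize the comparison: the chi-square test says whether the gap is distinguishable from noise, and Cohen's h says how large it is regardless of sample size.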

LIWC Signature Analysis

Compute hostility signatures: mean LIWC values in hostile posts minus friendly posts
Correlate signatures across domains (Pearson and Spearman)
Identify universal hostility markers: NEGEMO, ANGER, SWEAR, THEY pronouns
Analyze pronoun patterns (we/they dynamics) in tribal communication
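
The signature computation above can be sketched in a few lines. This is a minimal example on made-up numbers, assuming each post is a dict of LIWC feature values; the feature keys (`negemo`, `anger`, `swear`, `they`) and all values are hypothetical placeholders, not project data.

```python
import math
from statistics import mean


def hostility_signature(hostile, friendly):
    """Per-feature mean in hostile posts minus mean in friendly posts."""
    return {
        feature: mean(row[feature] for row in hostile)
        - mean(row[feature] for row in friendly)
        for feature in hostile[0]
    }


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy posts (hypothetical values), two per group.
politics_hostile = [{"negemo": 4.0, "anger": 2.0, "swear": 1.0, "they": 1.5},
                    {"negemo": 5.0, "anger": 3.0, "swear": 1.4, "they": 1.7}]
politics_friendly = [{"negemo": 2.0, "anger": 0.8, "swear": 0.4, "they": 1.0},
                     {"negemo": 2.4, "anger": 1.0, "swear": 0.4, "they": 1.2}]
sports_hostile = [{"negemo": 3.5, "anger": 1.8, "swear": 1.6, "they": 0.9},
                  {"negemo": 3.9, "anger": 2.0, "swear": 1.8, "they": 1.1}]
sports_friendly = [{"negemo": 1.9, "anger": 0.7, "swear": 0.6, "they": 0.8},
                   {"negemo": 2.1, "anger": 0.9, "swear": 0.8, "they": 0.8}]

sig_politics = hostility_signature(politics_hostile, politics_friendly)
sig_sports = hostility_signature(sports_hostile, sports_friendly)

features = list(sig_politics)
x = [sig_politics[f] for f in features]
y = [sig_sports[f] for f in features]
r = pearson(x, y)
rho = spearman(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

A high correlation between the two signature vectors means the same features rise (or fall) in hostile posts in both domains, which is what the "universal signature" question asks.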

Event Impact Analysis

Define 14 major events: 8 political (2016 Election, Brexit, debates) and 6 sports (Super Bowls, NBA Finals)
Compare hostility rates in 7-day windows before, during, and after events
Visualize weekly hostility trends with event markers
Run difference-in-differences analysis for causal inference
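
One way to read the window comparison and the difference-in-differences step is sketched below. The exact window boundaries are an assumption (7 days ending the day before the event, 7 days starting on the event day, then the following 7 days), the posts are synthetic, and the function names are illustrative rather than those of event_analysis.py.

```python
from datetime import date, timedelta


def window_rates(posts, event_day, days=7):
    """Hostility rate in the before/during/after windows around an event.

    posts is an iterable of (day, is_hostile) pairs.
    """
    windows = {
        "before": (event_day - timedelta(days=days), event_day - timedelta(days=1)),
        "during": (event_day, event_day + timedelta(days=days - 1)),
        "after": (event_day + timedelta(days=days), event_day + timedelta(days=2 * days - 1)),
    }
    rates = {}
    for name, (start, end) in windows.items():
        flags = [hostile for day, hostile in posts if start <= day <= end]
        rates[name] = sum(flags) / len(flags) if flags else float("nan")
    return rates


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Synthetic data: 10 posts per day, 3 hostile during the event week, 1 otherwise.
event_day = date(2016, 11, 8)  # 2016 US presidential election
posts = []
for offset in range(-7, 14):
    day = event_day + timedelta(days=offset)
    hostile = 3 if 0 <= offset < 7 else 1
    posts += [(day, True)] * hostile + [(day, False)] * (10 - hostile)

rates = window_rates(posts, event_day)
print(rates)

# Hypothetical control-group rates for the same two periods.
effect = diff_in_diff(rates["before"], rates["during"], 0.05, 0.06)
print(f"DiD estimate: {effect:.2f}")
```

In this toy setup the rate rises during the event week and falls back afterwards, which is the "temporary spike" shape; the difference-in-differences estimate subtracts whatever change the control group shows over the same period.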

Statistical Validation

Proportion tests: Chi-square with Yates correction, Cohen's h effect sizes
Cross-domain transfer: Train logistic classifiers on one domain, test on another (AUC evaluation)
Coefficient comparison: Correlate logistic regression coefficients across domains
Bootstrap/Permutation tests: Robust confidence intervals and null hypothesis testing
Network analysis: Compare interaction network structures (density, reciprocity, clustering)
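
To make the bootstrap and permutation items above concrete, here is a minimal sketch for the difference between two hostility rates. The samples are synthetic (500 interactions per domain, with counts picked to sit near the reported rates), the resample counts are kept small for speed, and the function names are hypothetical rather than those used in the project's statistics module.

```python
import random


def rate(flags):
    """Share of hostile interactions in a list of 0/1 flags."""
    return sum(flags) / len(flags)


def permutation_p_value(a, b, n_perm=1000, seed=0):
    """Two-sided permutation test for a difference in rates."""
    rng = random.Random(seed)
    observed = abs(rate(a) - rate(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break the link between label and outcome
        diff = abs(rate(pooled[:len(a)]) - rate(pooled[len(a):]))
        if diff >= observed - 1e-12:
            hits += 1
    # The +1 terms count the observed split as one of the permutations.
    return (hits + 1) / (n_perm + 1)


def bootstrap_ci(a, b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for rate(a) - rate(b)."""
    rng = random.Random(seed)
    diffs = sorted(
        rate(rng.choices(a, k=len(a))) - rate(rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    low = diffs[int(alpha / 2 * n_boot)]
    high = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Synthetic samples: 1 = hostile interaction, 0 = non-hostile.
politics = [1] * 87 + [0] * 413   # 17.4% hostile
sports = [1] * 26 + [0] * 474     # 5.2% hostile

observed = rate(politics) - rate(sports)
p_value = permutation_p_value(politics, sports)
ci_low, ci_high = bootstrap_ci(politics, sports)
print(f"diff = {observed:.3f}, p = {p_value:.4f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

Neither procedure assumes a particular distribution for the test statistic, which is why they serve as a robustness check alongside the chi-square results.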

📈 Key results

Finding	Evidence
Politics is 3.3× more hostile than sports	17.3% vs 5.2% negative interaction rate (Cohen's h = 0.40)
Hostility "sounds" the same across domains	r = 0.937 LIWC signature correlation
Cross-domain classifiers transfer successfully	AUC = 0.91 for politics→sports transfer
Same features predict hostility in both domains	73.5% coefficient sign agreement
Events cause temporary spikes, not permanent change	Behavior returns to baseline within days
Observer communities are universally hostile	meta_drama shows ~25% hostility in both domains

⏰ Proposed timeline

Week 0 (before Nov 5)

Obtain, preprocess, and clean the main Reddit datasets (title and body)
Find, filter, and integrate subreddit embeddings and external event datasets
Clustering of the subreddits to identify the main groups for each category
Clean the results.ipynb for P2

Week 1 (before Nov 12)

Perform Exploratory Data Analysis (EDA)

Week 2 (before Nov 19)

Generate additional visualizations
Identify massive interactions and camp-to-camp patterns
Analyze interaction matrices and hostility profiles

Week 3 (before Nov 26)

Analyze LIWC signatures and cross-domain correlations
Implement cross-domain classifier transfer experiments
Complete event impact analysis with statistical tests

Week 4 (before Dec 3)

Clean up the code in the notebook and Python files
Implement statistical validation (bootstrap, permutation, DiD)
Network structure analysis and comparison

Week 5 (before Dec 10)

Final polishing of notebooks, README.md, and visualizations
Ensure the GitHub repository is complete and organized
Finalize the data story website

Week 6 (before Dec 17)

Final review and submission

🤝 Team organization

We would like to point out that Yuri did not participate in P3 of this project, as she chose to drop the course. Please take this into consideration when grading the project.

Badr
- Mathematical sections of results.ipynb
- digital_tribes_analysis_2.ipynb notebook with interactions exploration using existing modules
Daniel
- End-to-end analysis pipeline from data preparation to statistical validation
- Clustering and camp definitions (clustering.py, data_prep.py)
- Interaction matrices and LIWC signature analysis (interaction_analysis.py), full module
- Event impact analysis with before/during/after comparisons (event_analysis.py), full module
- Statistical validation: classifier transfer, coefficient comparison, bootstrap, permutation, DiD (statistical_analysis.py), full module
- Final results notebook and wrapper functions (results.ipynb, results_helpers.py), except mathematical markdown cells.
Arnaud
- Datastory (pages branch): shockwaves and universal signature
- First EDA on Sports and results.ipynb with Yuri
Louis
- Writing of README.md
- Datasets integration with Kaggle, cleaning and import (data_prep.py)
- Datastory (pages branch): plots, template, structure and content
- First EDA on Politics and results.ipynb with Yuri

⚙️ Quickstart

Clone the repository
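Open a terminal and execute python -m venv /PATH/TO/PROJECT/.venv to create a virtual environment
Activate the virtual environment (for example, source /PATH/TO/PROJECT/.venv/bin/activate on Linux or macOS)
Execute pip install -r requirements.txt from the repository root to install the project dependencies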

Note: the datasets will be downloaded automatically from Kaggle when running the Notebook.

🗄️ Project structure

├── ... cache                   # Cached datasets files
├── src                         # Source code
│   ├── clustering.py
│   ├── consts.py
│   ├── data_prep.py
│   ├── event_analysis.py
│   ├── interaction_analysis.py
│   ├── results_helpers.py
│   └── statistical_analysis.py
├── .gitignore
├── results.ipynb               # Notebook showing the results
├── requirements.txt            # File for installing python dependencies
└── README.md
