Bad_Swiw: Humor.exe : Reverse-Engineering Laughter from the Reddit Ruins

Image generated using Google Gemini (text-to-image), December 2025.

The datasets come from https://snap.stanford.edu/data/soc-RedditHyperlinks.html. To load and preprocess them, run dataloader.py.

Abstract

For this project, we take on the roles of rebel scientists in a dystopian future where the government has banned humor. In our quest to analyse how humor affected humans in the past, we work with the Reddit hyperlink dataset. We first categorize which types of posts are "funny". With this information, we then analyse the effect of humor in the Reddit network: how humor changes over time in the dataset, which types of subreddits post humorous content, and how these subreddits interact.

Our interest lies in understanding the role of humor in a complex environment such as Reddit and how it helped link different communities together.

Data Story

Link to our story.

Research Questions

Can we relate the textual characteristics of a post to whether it is humorous?
How does humor on Reddit change over the observed time period?
Is humor universal or community-specific, and which communities use it the most?
What role does humor play in the conflict dynamics of the network? How does humor propagate through the network?

Methods

Additional data: web scraping

We scraped Reddit to retrieve additional information about the posts in the provided dataset via their post_id. This gave us access to the body content, title and comments of the posts in the body dataset.

Pipelines in place:

Data preprocessing

We set up automatic import and storage (in the 'data/' directory) of the hyperlink data (the "body" and "title" datasets) and the embedding data (user and subreddit embeddings) from the two datasets provided. Running the data_loader.py file triggers the import and preprocesses the data.
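As a rough illustration of what the loading step has to handle, the sketch below parses rows in the tab-separated layout we assume the SNAP hyperlink files use. The column names reflect our reading of the dataset page; the sample rows, the shortened PROPERTIES vectors and the `parse_hyperlinks` helper are hypothetical and are not taken from data_loader.py.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample in the assumed SNAP layout (tab-separated, header row).
# Real PROPERTIES vectors are much longer; three values are shown for brevity.
SAMPLE = (
    "SOURCE_SUBREDDIT\tTARGET_SUBREDDIT\tPOST_ID\tTIMESTAMP\tLINK_SENTIMENT\tPROPERTIES\n"
    "funny\tpics\tabc123s\t2014-01-01 12:00:00\t1\t345.0,298.0,0.75\n"
    "news\tworldnews\tdef456s\t2014-01-02 08:30:00\t-1\t120.0,101.0,0.80\n"
)


def parse_hyperlinks(handle):
    """Yield one dict per hyperlink row, with typed fields."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield {
            "source": row["SOURCE_SUBREDDIT"],
            "target": row["TARGET_SUBREDDIT"],
            "post_id": row["POST_ID"],
            "timestamp": datetime.strptime(row["TIMESTAMP"], "%Y-%m-%d %H:%M:%S"),
            "sentiment": int(row["LINK_SENTIMENT"]),  # +1 or -1 in the sample
            "properties": [float(x) for x in row["PROPERTIES"].split(",")],
        }


links = list(parse_hyperlinks(io.StringIO(SAMPLE)))
```

To read a real file, the same function can be given an open file handle instead of the in-memory sample.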

Humor quantification

To quantify textual humor, we decided to use a pre-existing model which is based on a popular linguistic theory of humor.

Before applying the model, we tested it on an arbitrary dataset of 5000 sentences, each labelled as funny or not (Boolean). The results were very positive in terms of overall model reliability and performance, which encouraged us to use the model's confidence scores to classify posts as funny or not funny. Finally, we applied the model to the posts of both datasets (title and body) that we were able to scrape, categorizing each one as funny or not.
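The benchmarking idea can be sketched as follows: turn confidence scores into Boolean labels with a threshold, then compare them with the known labels. This is a minimal sketch, not the project's benchmark code; the 0.5 threshold, the toy scores and the `classify` and `evaluate` helpers are assumptions made for illustration.

```python
def classify(scores, threshold=0.5):
    """Label a post funny when its confidence score reaches the threshold.

    The 0.5 default is an assumed value, not the one used in the project.
    """
    return [score >= threshold for score in scores]


def evaluate(predicted, actual):
    """Compute accuracy, precision and recall for Boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy confidence scores and ground-truth labels (hypothetical values).
scores = [0.91, 0.12, 0.67, 0.40, 0.05, 0.58]
labels = [True, False, True, True, False, False]

metrics = evaluate(classify(scores), labels)
```

Sweeping the threshold over a range of values and re-running `evaluate` is one simple way to see how sensitive the funny / not funny split is to that choice.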

LIWC relation with humor

Our initial hypothesis was that a general relationship could be established between the humorousness of a post and some LIWC features, which capture linguistic and psychological markers such as emotional tone, self-references and cognitive processes. To test whether this was feasible, we ran a logistic regression model, which proved quite inconclusive.

Further scraping for feature correlation

Once we had finished scraping the content of all source posts, we set up a pipeline to scrape the target posts, whose links appear in the content of the source posts. We then used this data for a feature correlation analysis to extract text characteristics that could correlate with the humor classification.

Temporal and general analysis

We analysed how humor evolved over time across Reddit by visualizing temporal trends and shifts in humor prevalence and by examining correlations between humor and external events. We expected this to reveal whether humor intensity or style fluctuated during the observed period or stayed roughly constant. We found that humor prevalence shows irregular spikes and dips that are likely related to external events.
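One simple way to measure humor prevalence over time is the share of posts flagged as funny in each calendar month. The sketch below assumes each post is available as a (timestamp, is_funny) pair; the monthly granularity, the toy data and the `monthly_humor_share` helper are illustrative assumptions rather than the project's actual pipeline.

```python
from collections import defaultdict
from datetime import datetime


def monthly_humor_share(posts):
    """Return {"YYYY-MM": fraction of posts flagged funny}, ordered by month."""
    counts = defaultdict(lambda: [0, 0])  # month -> [funny posts, all posts]
    for timestamp, is_funny in posts:
        key = timestamp.strftime("%Y-%m")
        counts[key][0] += int(is_funny)
        counts[key][1] += 1
    return {month: funny / total for month, (funny, total) in sorted(counts.items())}


# Hypothetical posts: (timestamp, classified as funny?)
posts = [
    (datetime(2015, 1, 3), True),
    (datetime(2015, 1, 20), False),
    (datetime(2015, 2, 1), True),
    (datetime(2015, 2, 9), False),
    (datetime(2015, 2, 14), False),
    (datetime(2015, 2, 28), False),
]

share = monthly_humor_share(posts)
```

Plotting `share` as a time series makes spikes and dips visible, which can then be compared against the dates of external events.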

Clustering

We ran HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to cluster the subreddits based on their embeddings, thus grouping similar subreddits together. This revealed a general structure of the platform.

Inter-Cluster Network Analysis

For each cluster of subreddits, we then computed a general "funny" score and drew a network (with NetworkX (nx) and PyVis) to analyse the connections between clusters on Reddit based on their link sentiment and each cluster's funny score. This revealed the non-uniform distribution of humor on the platform as well as interesting cross-community dynamics.
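The aggregation behind such a cluster-level network can be sketched in plain Python. The version below assumes the cluster "funny" score is the mean of the per-subreddit funny shares and that an inter-cluster edge carries a link count and a mean link sentiment; both definitions, the toy data and the helper names are assumptions, and the project's actual scoring may differ.

```python
from collections import defaultdict


def cluster_funny_scores(cluster_of, funny_share):
    """Average the per-subreddit funny share within each cluster."""
    buckets = defaultdict(list)
    for subreddit, share in funny_share.items():
        buckets[cluster_of[subreddit]].append(share)
    return {cluster: sum(vals) / len(vals) for cluster, vals in buckets.items()}


def inter_cluster_edges(links, cluster_of):
    """Aggregate subreddit links into directed cluster-level edges.

    Returns {(source_cluster, target_cluster): (link_count, mean_sentiment)}.
    Links inside a single cluster are skipped.
    """
    totals = defaultdict(lambda: [0, 0])  # edge -> [count, sentiment sum]
    for source, target, sentiment in links:
        src, dst = cluster_of[source], cluster_of[target]
        if src == dst:
            continue
        totals[(src, dst)][0] += 1
        totals[(src, dst)][1] += sentiment
    return {edge: (n, s / n) for edge, (n, s) in totals.items()}


# Hypothetical cluster assignment, funny shares and links (sentiment is +1 or -1).
cluster_of = {"funny": 0, "memes": 0, "news": 1, "politics": 1}
funny_share = {"funny": 0.6, "memes": 0.4, "news": 0.1, "politics": 0.3}
links = [
    ("funny", "news", 1),
    ("memes", "politics", -1),
    ("funny", "memes", 1),
    ("news", "funny", 1),
]

scores = cluster_funny_scores(cluster_of, funny_share)
edges = inter_cluster_edges(links, cluster_of)
```

The resulting `scores` and `edges` dictionaries map naturally onto node and edge attributes of a directed graph, which is the form a graph library would need for drawing.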

Conclusions

Humor is inherently multi-dimensional: Capturing humor requires a joint analysis of temporal dynamics, linguistic patterns, and community structure rather than treating them in isolation.
Context outweighs content: Humor appears to emerge more from how language is used than from the specific words themselves, particularly within constrained formats such as titles.
External events influence humor: Periods marked by major global events align with noticeable declines in humor prevalence, suggesting that humor may reflect shifts in collective emotional states.
Platform constraints shape expression: The attention-driven nature of titles, contrasted with the expressive freedom of post bodies, leads to fundamentally different humor dynamics.
Network structure facilitates propagation: Humor tends to diffuse through highly connected entertainment-oriented communities, which highlights the importance of network connectivity for viral spread.

Timeline

Phase 1: Data Collection and Preprocessing Setup

Timeline: 5th November – 15th November 2025

Phase 2: Descriptive Data Exploration, Humor Detection and Quantification

Timeline: 16th November – 20th November 2025

Phase 3: Temporal Dynamics and Subreddit Clustering with Humor Analysis

Timeline: 21st November – 30th November 2025

Phase 4: In-Depth Analysis and Contingency Management

Timeline: 1st December – 9th December 2025

Phase 5: Refinement, Reporting and Storytelling

Timeline: 9th December – 21st December 2025

Organisation within the team

Aïda: Initial preprocessing of the data, temporal analysis of humor.
Amine: Benchmarking of the humor model for post classification, pipelines for clustering and network dynamics analysis.
Dámaso: Initial preprocessing of the data, temporal analysis of humor.
Mokhtar: Implemented complex and optimized web scraping pipelines for source and target posts, characterization of target posts.
Raki: Research and implementation of humor model for post classification, creation of website and writing of data story.

Repository structure

├── data                                   <- Project data files
│
├── src                                    <- Source code
│   │
│   ├── initial plots                      <- Initial / exploratory generated plots
│   ├── scripts                            <- Contains cached scraped data (.json); empty if not scraped
│   └── utils                              <- Python helper functions and methods
│
├── plots                                  <- Final interactive plots of the analysis
│
├── .gitignore                             <- List of files ignored by git
│
├── README.md                              <- Project description and instructions
│
├── requirements.txt                       <- Python dependencies
│
├── Initial_analysis.ipynb                 <- Initial notebook results for Milestone 2
│
└── Final_results.ipynb                    <- Notebook showing final results

Quickstart

# clone project
git clone https://github.com/epfl-ada/ada-2025-project-bad_swiw

# create conda environment 
conda create -n <env_name> python=3.11
conda activate <env_name>

# install requirements
pip install -r requirements.txt
