Image generated using Google Gemini (text-to-image), December 2025.
The datasets come from https://snap.stanford.edu/data/soc-RedditHyperlinks.html. To load and preprocess them, run dataloader.py.
For this project, we take on the role of rebel scientists in a dystopian future where the government has banned humor. To study how humor affected people in the past, we work with the Reddit Hyperlinks dataset. We first classify which posts are "funny". With these labels, we then analyse the effect of humor in the Reddit network: how humor changes over time, which types of subreddits post humorous content, and how those subreddits interact.
Our interest lies in understanding the role of humor in a complex environment such as Reddit and how it helped link different communities together.
- Can we relate the textual characteristics of a post to whether it is humorous?
- How does humor on Reddit change over the observed time period?
- Is humor universal or community-specific, and which communities use it the most?
- What role does humor play in the conflict dynamics of the network, and how does humor propagate through it?
To proceed, we scraped Reddit to retrieve information about the posts in the provided dataset via their post_id. This gave us access to additional information, such as the body content, title and comments for the posts in the body dataset.
We set up automatic import and storage (in the 'data/' directory) of the hyperlink data (the "body" and "title" datasets) and the embedding data (user and subreddit embeddings) from the two datasets provided. Running the data_loader.py file triggers the import and preprocesses the data.
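As a rough illustration of the loading step, the sketch below parses rows in the tab-separated layout that the SNAP page describes for the hyperlink files. The column names (SOURCE_SUBREDDIT, TARGET_SUBREDDIT, POST_ID, TIMESTAMP, LINK_SENTIMENT, PROPERTIES) and the timestamp format are assumptions about the raw files, and `parse_hyperlinks` and the sample rows are hypothetical; this is not the code in the repository's loader.

```python
import csv
import io
from datetime import datetime


def parse_hyperlinks(tsv_file):
    """Yield one dict per hyperlink row of a tab-separated file with a header."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        yield {
            "source": row["SOURCE_SUBREDDIT"],
            "target": row["TARGET_SUBREDDIT"],
            "post_id": row["POST_ID"],
            "timestamp": datetime.strptime(row["TIMESTAMP"], "%Y-%m-%d %H:%M:%S"),
            "sentiment": int(row["LINK_SENTIMENT"]),  # assumed to be 1 or -1
            "properties": [float(x) for x in row["PROPERTIES"].split(",")],
        }


# Made-up sample rows in the assumed layout (real rows carry many more properties).
SAMPLE = (
    "SOURCE_SUBREDDIT\tTARGET_SUBREDDIT\tPOST_ID\tTIMESTAMP\tLINK_SENTIMENT\tPROPERTIES\n"
    "subreddit_a\tsubreddit_b\tabc123\t2014-01-01 12:00:00\t1\t345.0,298.0,0.75\n"
    "subreddit_b\tsubreddit_c\tdef456\t2014-01-02 08:30:00\t-1\t101.0,98.0,0.5\n"
)

rows = list(parse_hyperlinks(io.StringIO(SAMPLE)))
print(len(rows), rows[0]["source"], rows[1]["sentiment"])
```

For a file on disk, the same function can be given an open text handle, for example `open(path, newline="", encoding="utf-8")`.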
To quantify textual humor, we used a pre-existing model based on a popular linguistic theory of humor.
Before applying the model, we tested it on an arbitrary dataset of 5000 sentences, each labelled as funny or not (Boolean). The results were very positive in terms of overall reliability and performance, which encouraged us to use the model's confidence scores to classify posts as funny or not funny. Finally, we applied the model to the posts we were able to scrape from both datasets (title and body) to categorize them as funny or not.
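The benchmarking step can be sketched as follows: given confidence scores and known Boolean labels, count how often a thresholded score agrees with the label. The 0.5 threshold, the function name `evaluate_threshold` and the toy scores are assumptions for illustration only; the cut-off and metrics actually used in the project are not stated here.

```python
def evaluate_threshold(scores, labels, threshold=0.5):
    """Return accuracy, precision and recall when score >= threshold means funny."""
    tp = fp = tn = fn = 0
    for score, is_funny in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_funny:
            tp += 1
        elif predicted and not is_funny:
            fp += 1
        elif not predicted and not is_funny:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy confidence scores and ground-truth labels (not project data).
scores = [0.9, 0.8, 0.3, 0.6, 0.1, 0.4]
labels = [True, True, False, False, False, True]
metrics = evaluate_threshold(scores, labels, threshold=0.5)
print(metrics)
```

Sweeping `threshold` over a range of values is one simple way to see how sensitive the funny / not funny split is to the chosen cut-off.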
Our initial hypothesis was that a general relationship could be established between the humorousness of a post and some LIWC features, which capture linguistic and psychological markers such as emotional tone, self-references and cognitive processes. To test whether this was feasible, we ran a logistic regression model, which proved quite inconclusive.
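To make the logistic regression step concrete, here is a minimal sketch that fits a logistic model by batch gradient descent on two made-up feature columns. The feature values, learning rate and epoch count are assumptions, and the project may well have used a library implementation instead; the point is only to show how feature weights relate to the predicted probability of a post being funny.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Probability that feature vector x belongs to the positive (funny) class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data: two hypothetical LIWC-style feature columns; label 1 = funny.
X = [[0.1, 0.5], [0.2, 0.4], [0.3, 0.6], [0.7, 0.5], [0.8, 0.4], [0.9, 0.6]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
print(w, b, predict_proba(w, b, [0.9, 0.5]))
```

On real data, weights close to zero and predictions close to the base rate would be one sign of the inconclusive outcome described above.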
Once we had finished scraping the content of all source posts, we set up a pipeline to scrape the target posts, whose links appear in the content of the source posts. The scraped text was then used for feature correlation analysis, to extract text characteristics that could correlate with the humor classification.
We analysed how humor evolved over time across Reddit by visualizing temporal trends and shifts in humor prevalence and by examining correlations between humor and external events. We expected this to reveal whether humor intensity or style fluctuated during the observed period or stayed roughly constant. We found that humor prevalence shows irregular spikes and dips that are likely related to external events.
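One simple way to obtain a humor prevalence curve is to compute, per calendar month, the share of posts labelled funny. The sketch below does this with the standard library; the monthly granularity, the function name `monthly_humor_share` and the sample posts are assumptions, not necessarily what the notebooks do.

```python
from collections import defaultdict
from datetime import datetime


def monthly_humor_share(posts):
    """Map 'YYYY-MM' to the fraction of posts labelled funny in that month."""
    counts = defaultdict(lambda: [0, 0])  # month -> [funny posts, all posts]
    for timestamp, is_funny in posts:
        key = timestamp.strftime("%Y-%m")
        counts[key][0] += int(is_funny)
        counts[key][1] += 1
    return {month: funny / total for month, (funny, total) in sorted(counts.items())}


# Toy (timestamp, is_funny) pairs, not project data.
posts = [
    (datetime(2014, 1, 3), True),
    (datetime(2014, 1, 20), False),
    (datetime(2014, 2, 1), False),
    (datetime(2014, 2, 14), False),
    (datetime(2014, 2, 28), True),
    (datetime(2014, 2, 28), False),
]
share = monthly_humor_share(posts)
print(share)
```

Plotting `share` against the month keys gives a curve on which spikes and dips can be compared with the dates of external events.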
We ran HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to cluster the subreddits based on their embeddings, grouping similar subreddits together. This revealed a general structure of the platform.
For each cluster of subreddits, we then computed an overall "funny" score and drew a connected network (with NetworkX (nx) and PyVis) to analyse the connections between clusters based on their link sentiment and each cluster's funny score. This revealed the non-uniform distribution of humor across the platform as well as interesting cross-community dynamics.
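The aggregation behind such a cluster network can be sketched in plain Python. Here the cluster "funny" score is assumed to be the share of funny posts among the cluster's source posts, and an edge weight is assumed to be the sum of link sentiments between two distinct clusters; both definitions, the function names and the toy data are illustrative guesses rather than the project's exact choices.

```python
from collections import defaultdict


def cluster_funny_scores(posts, cluster_of):
    """Share of funny posts per cluster, given (subreddit, is_funny) pairs."""
    funny, total = defaultdict(int), defaultdict(int)
    for subreddit, is_funny in posts:
        cluster = cluster_of[subreddit]
        funny[cluster] += int(is_funny)
        total[cluster] += 1
    return {cluster: funny[cluster] / total[cluster] for cluster in total}


def cluster_edges(links, cluster_of):
    """Sum link sentiment for each ordered pair of distinct clusters."""
    weights = defaultdict(int)
    for source, target, sentiment in links:
        src, dst = cluster_of[source], cluster_of[target]
        if src != dst:  # links inside a cluster are skipped
            weights[(src, dst)] += sentiment
    return dict(weights)


# Toy cluster assignment, posts and links (not project data).
cluster_of = {"jokes": 0, "funny": 0, "news": 1, "politics": 1}
posts = [("jokes", True), ("funny", True), ("funny", False),
         ("news", False), ("politics", False), ("news", True)]
links = [("jokes", "news", 1), ("funny", "politics", -1), ("funny", "news", 1),
         ("news", "jokes", -1), ("jokes", "funny", 1)]

scores = cluster_funny_scores(posts, cluster_of)
edges = cluster_edges(links, cluster_of)
print(scores, edges)
```

The two dictionaries map naturally onto node and edge attributes of a directed graph, which is the form a NetworkX and PyVis visualization would consume.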
- Humor is inherently multi-dimensional: Capturing humor requires a joint analysis of temporal dynamics, linguistic patterns, and community structure rather than treating them in isolation.
- Context outweighs content: Humor appears to emerge more from how language is used than from the specific words themselves, particularly within constrained formats such as titles.
- External events influence humor: Periods marked by major global events align with noticeable declines in humor prevalence, suggesting that humor may reflect shifts in collective emotional states.
- Platform constraints shape expression: The attention-driven nature of titles, contrasted with the expressive freedom of post bodies, leads to fundamentally different humor dynamics.
- Network structure facilitates propagation: Humor tends to diffuse through highly connected entertainment-oriented communities, which highlights the importance of network connectivity for viral spread.
- Timeline: 5th November – 15th November 2025
- Timeline: 16th November – 20th November 2025
- Timeline: 21st November – 30th November 2025
- Timeline: 1st December – 9th December 2025
- Timeline: 9th December – 21st December 2025
- Aïda: Initial preprocessing of the data, temporal analysis of humor.
- Amine: Benchmarking of the humor model for post classification, pipelines for clustering and network dynamics analysis.
- Dámaso: Initial preprocessing of the data, temporal analysis of humor.
- Mokhtar: Implemented complex and optimized web scraping pipelines for source and target posts, characterization of target posts.
- Raki: Research and implementation of humor model for post classification, creation of website and writing of data story.
├── data <- Project data files
│
├── src <- Source code
│ │
│ ├── initial plots <- Initial / exploratory generated plots
│ ├── scripts <- Contains cached scraped data (.json); empty if not scraped
│ └── utils <- Python helper functions and methods
│
├── plots <- Final interactive plots of the analysis
│
├── .gitignore <- List of files ignored by git
│
├── README.md <- Project description and instructions
│
├── requirements.txt <- Python dependencies
│
├── Initial_analysis.ipynb <- Initial notebook results for Milestone 2
│
└── Final_results.ipynb <- Notebook showing final results
# clone project
git clone https://github.com/epfl-ada/ada-2025-project-bad_swiw
cd ada-2025-project-bad_swiw
# create conda environment
conda create -n <env_name> python=3.11
conda activate <env_name>
# install requirements
pip install -r requirements.txt