Created by Maliha Binte Mamun · Independent Research · 2025
📄 Code: MIT License · Dataset: CC BY 4.0
SarcasmExplain-5K is a balanced dataset of 5,000 Reddit sarcasm instances annotated with five complementary natural language explanation types, generated via a systematic GPT-4 pipeline and validated through crowd-sourced human evaluation.
Unlike existing sarcasm datasets that provide only binary labels, this dataset provides rich, multi-perspective explanations — enabling research in explainable AI, pragmatic language understanding, and human-AI communication.
| Statistic | Value |
|---|---|
| Total instances | 5,000 |
| Sarcastic | 2,500 |
| Non-sarcastic | 2,500 |
| Explanation types | 5 |
| Evaluation forms | 50 per type (COG + INT) |
| Source | Reddit conversations |
| Generation model | OpenAI GPT-4 |
The full 5,000-instance dataset is hosted on HuggingFace with gated access.
Access is free — you contribute a small amount of annotation work in exchange.
| Step | Action |
|---|---|
| 1. Annotate | Visit annotate.html and choose any open Cognitive (COG) or Intent-based (INT) form |
| 2. Rate | Rate 10 sarcasm explanations for clarity (1–5) and optionally suggest improvements (~8 min) |
| 3. Get your code | After submitting, enter your Form ID (e.g. COG014) at the annotate page to receive your unique completion code (e.g. SE5K-COG014-1EBAD543) |
| 4. Request access | Visit access.html, verify your code, then paste it into the HuggingFace access request form |
Access is approved within 24–48 hours after submission.
💡 Preview: an 8-instance sample is freely available without registration:
data/sample_data.csv
This contribute-to-access model sustains community-driven quality validation of the dataset at no monetary cost — your annotations directly feed the human evaluation study for our EMNLP 2026 submission.
Each sarcastic instance includes five complementary explanations:
| Type | Description | Human Evaluated |
|---|---|---|
| Cognitive | Why the mind recognises sarcasm — the belief or knowledge the speaker invokes | ✅ Active (COG001–COG050) |
| Intent-Based | Speaker's communicative goal — what they are trying to achieve socially or emotionally | ✅ Active (INT001–INT050) |
| Contrastive | Sarcastic vs. sincere comparison — what a genuine version would look like | 🔜 Planned |
| Textual | Linguistic features that signal sarcasm — word choice, tone, exaggeration | — |
| Rule-Based | Formal linguistic markers — punctuation, register shift, hyperbole | — |
sarcasm-explain-5k/
├── README.md
├── LICENSE
├── index.html ← dataset landing page (GitHub Pages)
├── annotate.html ← annotation forms + completion code lookup
├── access.html ← code verification + HuggingFace access guide
├── data/
│ └── sample_data.csv ← 8-instance preview (freely available)
└── code/
└── ParaphraseSarcasm.ipynb ← full data generation pipeline
| Column | Description |
|---|---|
| `label` | 0 = non-sarcastic, 1 = sarcastic |
| `label_name` | "sarcastic" or "non_sarcastic" |
| `comment` | The original Reddit comment |
| `parent_comment` | Conversational context |
| `rephrased_comment` | Non-sarcastic paraphrase of the comment |
| `cognitive_explanation` | Mental reasoning perspective |
| `intent_based_explanation` | Speaker's communicative goal |
| `contrastive_explanation` | Sarcastic vs. sincere comparison |
| `textual_explanation` | Linguistic analysis perspective |
| `rule_based_explanation` | Linguistic markers identified |
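To make the schema concrete, here is a minimal sketch that parses one CSV row with the column layout above. The row content is fabricated for illustration only — it is not real dataset text, and the explanation fields are collapsed to placeholders.

```python
import csv
import io

# Fabricated one-row CSV matching the schema above (placeholder content).
SAMPLE = io.StringIO(
    "label,label_name,comment,parent_comment,rephrased_comment,"
    "cognitive_explanation,intent_based_explanation,contrastive_explanation,"
    "textual_explanation,rule_based_explanation\n"
    '1,sarcastic,"Oh great, more rain.","Forecast says storms all week.",'
    '"I am unhappy about more rain.",cog,intent,contrast,textual,rule\n'
)

reader = csv.DictReader(SAMPLE)
row = next(reader)

# label is stored as 0/1, and label_name mirrors it as a string.
assert (row["label"] == "1") == (row["label_name"] == "sarcastic")
print(row["comment"])  # → Oh great, more rain.
```

The same reader works unchanged on `data/sample_data.csv`, since the preview file uses this column layout.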
Comment: "Yeah, like the president is a big deal!"
Parent Comment: "And even a prominent democrat defended him."
Label: Sarcastic
| Explanation Type | Content |
|---|---|
| Cognitive | The speaker invokes the common knowledge that the presidency is a position of immense power — using that as a foil to mock someone who downplays the president's importance. |
| Intent-Based | The speaker is mocking whoever minimizes the president's significance. Social goal: highlight the absurdity of the counterpart's position. |
| Contrastive | A sincere version: "The president is indeed a significant figure." The sarcastic comment inverts this through dismissive phrasing. |
| Textual | The word "like" and the exclamation mark signal insincerity. Downplaying an obviously powerful position creates the ironic gap. |
| Rule-Based | Linguistic markers: informal minimiser ("like"), exclamatory punctuation, contradiction with common knowledge. |
Human evaluation focuses on Cognitive and Intent-Based explanation types, with 50 evaluation forms per type (10 instances per form).
- Rate clarity of each explanation (1–5 Likert scale)
- Agree or disagree with the generated explanation
- Write a correction if the explanation is unclear or inaccurate (optional)
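The three response fields above can be aggregated per form roughly as follows. This is an illustrative sketch only — the ratings are fabricated, and the actual analysis scripts are not part of this repository.

```python
from statistics import mean

# One annotator response per explanation: a 1-5 clarity rating, an
# agree/disagree flag, and an optional free-text correction.
responses = [
    {"clarity": 4, "agree": True,  "correction": None},
    {"clarity": 2, "agree": False, "correction": "Explanation misses the context."},
    {"clarity": 5, "agree": True,  "correction": None},
]

mean_clarity = mean(r["clarity"] for r in responses)
agreement_rate = sum(r["agree"] for r in responses) / len(responses)

print(f"mean clarity {mean_clarity:.2f}, agreement {agreement_rate:.0%}")
```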
| Pool | Form IDs | Forms | Instances |
|---|---|---|---|
| Cognitive | COG001 – COG050 | 50 | 500 |
| Intent-Based | INT001 – INT050 | 50 | 500 |
Forms are available at annotate.html. Each completed form earns one completion code for dataset access.
Codes follow the format SE5K-[FORMID]-[HASH], for example:
SE5K-COG014-1EBAD543
SE5K-INT031-3200CCCB
Enter your Form ID at the annotate page to retrieve your code at any time.
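A quick client-side sanity check of the code format can be sketched with a regular expression. Judging from the two examples above, the form ID is COG or INT plus three digits and the hash is eight uppercase hex characters — the exact hash length and charset are an assumption inferred from the examples, not a published specification.

```python
import re

# Assumed pattern: SE5K-<COG|INT + 3 digits>-<8 uppercase hex chars>.
CODE_RE = re.compile(r"^SE5K-(COG|INT)\d{3}-[0-9A-F]{8}$")

def looks_like_completion_code(code: str) -> bool:
    """Cheap format check before pasting a code into the access request form."""
    return CODE_RE.fullmatch(code) is not None

print(looks_like_completion_code("SE5K-COG014-1EBAD543"))  # → True
```

This only checks the shape of a code; actual validity is verified at access.html.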
Explanations were generated using OpenAI GPT-4 with carefully engineered prompts for each explanation type. The pipeline:
- Source sarcastic comments from Reddit (r/sarcasm corpus)
- Balance dataset: 2,500 sarcastic + 2,500 non-sarcastic
- For each sarcastic instance, generate 5 explanation types via GPT-4
- Post-process for consistency and quality
- Create human evaluation forms for validation
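The explanation-generation step of the pipeline can be sketched as below. `call_model` stands in for the actual GPT-4 API call, and the prompt templates are illustrative placeholders, not the project's engineered prompts (those live in `code/ParaphraseSarcasm.ipynb`).

```python
# Illustrative prompt templates, one per explanation type.
PROMPTS = {
    "cognitive": "Explain what knowledge a reader uses to recognise sarcasm in: {comment!r} (context: {parent!r})",
    "intent_based": "Explain the speaker's social or emotional goal in: {comment!r} (context: {parent!r})",
    "contrastive": "Contrast {comment!r} with a sincere version of the same message.",
    "textual": "List the wording and tone cues that signal sarcasm in: {comment!r}",
    "rule_based": "List formal linguistic markers of sarcasm in: {comment!r}",
}

def explain(comment: str, parent: str, call_model):
    """Generate all five explanation types for one sarcastic instance."""
    return {
        f"{etype}_explanation": call_model(tmpl.format(comment=comment, parent=parent))
        for etype, tmpl in PROMPTS.items()
    }

# Usage with a stub model (replace the lambda with a real GPT-4 client call):
row = explain("Yeah, like the president is a big deal!",
              "And even a prominent democrat defended him.",
              call_model=lambda prompt: "<generated explanation>")
```

Keeping `call_model` as a plain callable makes the loop model-agnostic, so the same scaffolding works for any chat-completion backend.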
This dataset supports research in:
- Explainable AI (XAI): Multi-perspective explanation generation for NLP models
- Sarcasm Detection: Training models with richer contextual understanding
- Pragmatic NLP: Computational approaches to non-literal language
- Cognitive Modelling: Understanding how humans recognise irony and sarcasm
- Human-AI Interaction: Improving model awareness of speaker intent
- Generate 5,000-instance dataset with 5 explanation types
- Publish sample (100 instances) to GitHub
- Host full dataset on HuggingFace (gated)
- Create contribute-to-access annotation system (COG + INT forms)
- Launch annotate.html + access.html on GitHub Pages
- Collect 100–200 human evaluations per type
- Publish inter-annotator agreement analysis
- Baseline experiments: do explanations improve sarcasm detection?
- Submit to EMNLP 2026
- Cross-lingual extension (Japanese, multilingual)
If you use this dataset or pipeline in your research, please cite:
@misc{mamun2025sarcasmexplain,
  author    = {Mamun, Maliha Binte},
  title     = {SarcasmExplain-5K: A Multi-Perspective Sarcasm Explanation Dataset},
  year      = {2025},
  publisher = {GitHub / HuggingFace},
  url       = {https://huggingface.co/datasets/maliha/sarcasm-explain-5k},
  note      = {Independent research. Contact: bintemaliha19@gmail.com}
}

Maliha Binte Mamun
PhD, Computer and Information Science — Shizuoka University (2024)
Product Development Engineer — Pi Photonics, Hamamatsu, Japan
- 📮 Email: bintemaliha19@gmail.com
- 🐙 GitHub: @maliha-usui
- 💼 LinkedIn: Maliha Binte Mamun
- 🤗 HuggingFace: maliha
Independent research project, 2025. Builds on publicly available Reddit data.