[Dataset] Contextual Abuse Dataset

# Contextual Abuse Dataset (CAD) Corpus

This corpus contains around 26,500 annotated Reddit entries (1,394 post titles, 1,394 post bodies, and 23,762 comments). Each entry is assigned one or more of six primary categories: Identity-directed abuse, Affiliation-directed abuse, Person-directed abuse, Counter Speech, Non-hateful Slurs, and Neutral. Abusive entries also carry secondary subcategories such as Derogation, Animosity, Threatening, Dehumanization, and Glorification.

Attribution: Bertie Vidgen, Dong Nguyen, Helen Margetts, Patricia Rossini, and Rebekah Tromble. 2021. Introducing CAD: the Contextual Abuse Dataset. In *Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pages 2289–2303, Online. Association for Computational Linguistics. Available: https://aclanthology.org/2021.naacl-main.182/


## Dataset Details

### Speaker-level information
Speakers in this dataset correspond to Reddit users. Each `Speaker` object is created from the `meta_author` field. If the author value is missing, NA, or deleted, the speaker ID is substituted with `[deleted]`.
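
The substitution rule above could be sketched roughly as follows. This is only an illustration, not the conversion script itself: the helper name `speaker_id_from_author` is hypothetical, and the exact set of values treated as missing (here `None`, NaN, the empty string, `NA`, and `[deleted]`) is an assumption.

```python
import math

DELETED = "[deleted]"

def speaker_id_from_author(author):
    """Return a speaker ID, falling back to "[deleted]" for unusable authors."""
    # Missing values may arrive as None or as a float NaN (e.g. from a CSV reader).
    if author is None:
        return DELETED
    if isinstance(author, float) and math.isnan(author):
        return DELETED
    name = str(author).strip()
    # Assumption: empty strings, the literal "NA", and "[deleted]" all count as missing.
    if name in ("", "NA", DELETED):
        return DELETED
    return name
```

With this rule, all utterances whose author is unavailable end up attributed to a single shared `[deleted]` speaker, which is worth keeping in mind for any per-speaker analysis.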



### Utterance-level information
Each utterance corresponds to one Reddit entry (title, post body, or comment).

**Utterance fields**:
* **id**: an identifier for the utterance (taken from `info_id`). 
* **conversation_id**: an identifier for the Reddit thread from which the utterance was taken.
* **reply_to**: id of the parent post/comment (`info_id.parent`), or `None` if no valid parent exists.
* **speaker**: Reddit username of the author of the utterance.  
* **timestamp**: time the utterance was created (Unix timestamp in seconds).
* **text**: the cleaned textual content of the utterance, with `[linebreak]` markers replaced by `\n`.
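
As a rough sketch of how these fields might be derived from one raw CAD row: the function below is hypothetical, the text column name `meta_text` is an assumption (it is not listed in this description), and "valid parent" is interpreted here as "the parent ID exists in the dataset", which may differ from the actual conversion logic.

```python
def row_to_utterance_fields(row, known_ids):
    """Map one raw row (a dict of strings) to the utterance fields listed above."""
    parent = row.get("info_id.parent")
    # Assumption: a parent is "valid" only if it is itself an entry in the dataset.
    reply_to = parent if parent in known_ids else None
    return {
        "id": row["info_id"],
        "conversation_id": row["info_thread.id"],
        "reply_to": reply_to,
        "speaker": row["meta_author"],
        # Unix timestamp in seconds; the raw value may be stored as a float string.
        "timestamp": int(float(row["meta_created_utc"])),
        # "meta_text" is an assumed column name for the raw text.
        "text": row["meta_text"].replace("[linebreak]", "\n"),
    }
```

The row dictionaries could come from `csv.DictReader` or a similar reader; `known_ids` would be the set of all `info_id` values collected in a first pass over the file.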

**Utterance metadata**:
* **annotation_Primary:** the main abuse category assigned by trained experts. Possible values: `Identity-directed abuse`, `Affiliation-directed abuse`, `Person-directed abuse`, `Counter Speech`, `Non-hateful Slurs`, `Neutral`.
* **annotation_Secondary:** subtype of abuse. Examples include: `Derogation`, `Animosity`, `Threatening`, `Dehumanization`, `Glorification`.
* **annotation_Context:** whether the utterance requires additional context to interpret the label (`Yes` / `No` / `NA`).
* **annotation_Target:** the specific individual or group targeted by abuse. Examples include: `Women`, `Men`, `Immigrants`, `Political groups`.
* **annotation_Target_top.level.category:** a higher-level category of the target. Examples include: `Identity`, `Group`, `Other`.
* **annotation_highlighted:** text span(s) highlighted by annotators as containing abusive or offensive content. `"NA"` if none.
* **meta_date:** UTC date of the creation of the utterance (YYYY-MM-DD).  
* **meta_created_utc:** UNIX timestamp of the creation of the utterance.  
* **meta_day:** day of the creation of the utterance (YYYY-MM-DD).  
* **meta_permalink:** Reddit permalink to the original post or comment.  
* **info_subreddit:** name of the subreddit where the utterance was posted.  
* **info_subreddit_id:** Reddit’s internal ID for that subreddit.  
* **id:** original CAD-assigned ID (e.g., `cad_1`, `cad_2`).
* **info_id:** original identifier for the utterance (including any `-title` or `-post` suffix).
* **info_id.parent:** identifier of the parent utterance.  
* **info_id.link:** identifier of the original submission that started the thread.  
* **info_thread.id:** identifier grouping all utterances in the same Reddit thread.  
* **info_order:** order of the utterance within its thread.
* **info_image.saved:** whether the utterance had an image saved with it (`0` = no, `1` = yes). 
* **split:** the dataset split in the original project, including `train`, `dev`, `test`, `exclude_empty`, `exclude_bot`, `exclude_lang`, and `exclude_image`.
* **subreddit_seen:** indicator of whether the subreddit was included in the annotation set (`1`) or not (`0`).
* **entry_type:** type of the utterance: `title`, `post`, or `comment`.
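
A typical use of this metadata is to restrict an analysis to the usable splits and tally the primary labels. The sketch below works on plain dictionaries of metadata so that it stays self-contained; the function name is hypothetical, and the choice to keep only `train`, `dev`, and `test` is an assumption about what a given analysis needs.

```python
from collections import Counter

def primary_label_counts(utterance_metas, splits=("train", "dev", "test")):
    """Count `annotation_Primary` values over utterances in the given splits."""
    counts = Counter()
    for meta in utterance_metas:
        # Skip entries from excluded splits such as `exclude_bot` or `exclude_lang`.
        if meta.get("split") in splits:
            counts[meta.get("annotation_Primary")] += 1
    return counts

# Toy metadata for illustration only; these are not real corpus entries.
toy_metas = [
    {"split": "train", "annotation_Primary": "Neutral"},
    {"split": "test", "annotation_Primary": "Neutral"},
    {"split": "dev", "annotation_Primary": "Person-directed abuse"},
    {"split": "exclude_bot", "annotation_Primary": "Neutral"},
]
toy_counts = primary_label_counts(toy_metas)
```

When the corpus is loaded as objects rather than dictionaries, the same loop can be run over each utterance's metadata mapping.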



### Conversational-level information
Each Reddit thread (grouped by `info_thread.id`) is treated as a conversation. Within each thread, `reply_to` relations establish the comment tree structure.  
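
To make the tree structure concrete, here is a minimal sketch that rebuilds a thread's reply tree from `id` and `reply_to` pairs and walks it depth-first. The function names and the toy thread are hypothetical, and the assumption that a thread's root is the entry whose `reply_to` is `None` is an illustration rather than a documented guarantee.

```python
from collections import defaultdict

def build_reply_tree(utterances):
    """Return (root_ids, children), where children maps a parent ID to its reply IDs."""
    children = defaultdict(list)
    roots = []
    for utt in utterances:
        if utt["reply_to"] is None:
            roots.append(utt["id"])
        else:
            children[utt["reply_to"]].append(utt["id"])
    return roots, dict(children)

def depth_first(root_id, children):
    """List utterance IDs in depth-first order, visiting replies in input order."""
    order = []
    stack = [root_id]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push replies reversed so that the first reply is visited first.
        stack.extend(reversed(children.get(node, [])))
    return order

# Toy thread for illustration only: a title, a post body, and three comments.
toy_thread = [
    {"id": "t", "reply_to": None},
    {"id": "p", "reply_to": "t"},
    {"id": "c1", "reply_to": "p"},
    {"id": "c2", "reply_to": "p"},
    {"id": "c3", "reply_to": "c1"},
]
roots, children = build_reply_tree(toy_thread)
walk = depth_first(roots[0], children)
```

An iterative walk is used instead of recursion so that very deep comment chains do not hit Python's recursion limit.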


## Basic Statistics
Number of Speakers: 11123,
Number of Utterances: 26550,
Number of Conversations: 1395.

Number of titles: 1394,
Number of posts: 1394,
Number of comments: 23762.

Primary labels: 'Neutral': 21935, 'IdentityDirectedAbuse': 2216, 'AffiliationDirectedAbuse': 1111, 'PersonDirectedAbuse': 951, 'CounterSpeech': 210, 'Slur': 127.
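
As a quick consistency check, the figures above can be cross-checked against each other: both the entry-type counts and the primary-label counts should add up to the number of utterances. The snippet only restates the numbers listed in this section.

```python
# Figures copied from the statistics above.
num_utterances = 26550
entry_types = {"title": 1394, "post": 1394, "comment": 23762}
primary_labels = {
    "Neutral": 21935,
    "IdentityDirectedAbuse": 2216,
    "AffiliationDirectedAbuse": 1111,
    "PersonDirectedAbuse": 951,
    "CounterSpeech": 210,
    "Slur": 127,
}

# Both breakdowns partition the same set of utterances.
entry_total = sum(entry_types.values())
label_total = sum(primary_labels.values())

# Share of each primary label, as a fraction of all utterances.
label_shares = {name: count / num_utterances for name, count in primary_labels.items()}
```

Since the label counts sum to the utterance count, each utterance appears to carry exactly one primary label in this tally, and the corpus is heavily skewed toward `Neutral`.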



## Contact
The original Contextual Abuse Dataset was distributed in the paper [Introducing CAD: the Contextual Abuse Dataset](https://aclanthology.org/2021.naacl-main.182/) (Vidgen et al., NAACL 2021). Corresponding Author: Bertie Vidgen (bvidgen@turing.ac.uk).

The dataset was formatted for ConvoKit by Hao Wan (hw799@cornell.edu).
The demo on transformer usage and analysis was provided by Jadon Geathers (jag569@cornell.edu).


## Data Access
Dataset with example script: https://drive.google.com/drive/folders/1biuTPwpuvCWDbmlwZMZN4iAQWoZ8tfty?usp=sharing

