Description
Contextual Abuse Dataset (CAD) Corpus
This corpus contains around 26,500 annotated Reddit entries (1,394 post titles, 1,394 post bodies, and 23,762 comments). Each entry is labeled with one or more of six primary categories: Identity-directed abuse, Affiliation-directed abuse, Person-directed abuse, Counter Speech, Non-hateful Slurs, and Neutral. Abusive entries also carry secondary subcategories such as Derogation, Animosity, Threatening, Dehumanization, and Glorification.
Attribution: Bertie Vidgen, Dong Nguyen, Helen Margetts, Patricia Rossini, and Rebekah Tromble. 2021. Introducing CAD: the Contextual Abuse Dataset. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2289–2303, Online. Association for Computational Linguistics. Available: https://aclanthology.org/2021.naacl-main.182/
Dataset Details
Speaker-level information
Speakers in this dataset correspond to Reddit users. Each Speaker object is created from the meta_author field. If the author value is missing, NA, or deleted, the speaker ID is substituted with [deleted].
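The substitution rule above can be sketched as a small helper. This is an illustration only, not the conversion script itself: the function name `speaker_id_from_author` is hypothetical, and the exact spellings treated as "missing" (empty string, `NA`, `deleted`, `[deleted]`) are assumptions.

```python
DELETED_SPEAKER_ID = "[deleted]"

# Assumed spellings of an unusable author value; the real data may use others.
MISSING_AUTHOR_VALUES = {"", "NA", "deleted", "[deleted]"}


def speaker_id_from_author(meta_author):
    """Return the speaker ID for a meta_author value, or "[deleted]" if unusable."""
    if meta_author is None:
        return DELETED_SPEAKER_ID
    author = str(meta_author).strip()
    if author in MISSING_AUTHOR_VALUES:
        return DELETED_SPEAKER_ID
    return author
```

With this rule, all entries by removed or unknown authors collapse into the single `[deleted]` speaker, so per-speaker statistics for that ID should be read with care.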
Utterance-level information
Each utterance corresponds to one Reddit entry (title, post body, or comment).
Utterance fields:
- id: an identifier for the utterance (taken from info_id).
- conversation_id: an identifier for the Reddit thread where the utterance was taken.
- reply_to: id of the parent post/comment (info_id.parent), or None if no valid parent exists.
- speaker: Reddit username of the author of the utterance.
- timestamp: time the utterance was created (Unix timestamp in seconds).
- text: the cleaned textual content of the utterance, with [linebreak] markers replaced by \n.
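As a rough sketch of how one raw CAD row might map onto these fields: the function names are hypothetical, and the source column names used for `text`, `timestamp`, and `conversation_id` (`text`, `meta_created_utc`, `info_thread.id`) are assumptions rather than a description of the actual conversion script. The check for a "valid" parent is also simplified to "not empty and not NA".

```python
def clean_text(raw_text):
    """Replace the [linebreak] markers with real newlines."""
    return raw_text.replace("[linebreak]", "\n")


def row_to_utterance_fields(row):
    """Map one raw row (a dict) onto the utterance fields listed above."""
    parent = row.get("info_id.parent")
    return {
        "id": row["info_id"],
        # Assumption: the thread ID doubles as the conversation ID.
        "conversation_id": row["info_thread.id"],
        # Simplified validity check; the real rule may also require the parent to exist.
        "reply_to": parent if parent not in (None, "", "NA") else None,
        "speaker": row["meta_author"],
        # Assumption: the timestamp comes from meta_created_utc (seconds).
        "timestamp": int(float(row["meta_created_utc"])),
        "text": clean_text(row["text"]),
    }
```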
Utterance metadata:
- annotation_Primary: the main abuse category assigned by trained experts. Possible values: Identity-directed abuse, Affiliation-directed abuse, Person-directed abuse, Counter Speech, Non-hateful Slurs, Neutral.
- annotation_Secondary: subtype of abuse. Examples include: Derogation, Animosity, Threatening, Dehumanization, Glorification.
- annotation_Context: whether the utterance requires additional context to interpret the label (Yes/No/NA).
- annotation_Target: the specific individual or group targeted by abuse. Examples include: Women, Men, Immigrants, Political groups.
- annotation_Target_top.level.category: a higher-level category of the target. Examples include: Identity, Group, Other.
- annotation_highlighted: text span(s) highlighted by annotators as containing abusive or offensive content. "NA" if none.
- meta_date: UTC date of the creation of the utterance (YYYY-MM-DD).
- meta_created_utc: UNIX timestamp of the creation of the utterance.
- meta_day: day of the creation of the utterance (YYYY-MM-DD).
- meta_permalink: Reddit permalink to the original post or comment.
- info_subreddit: name of the subreddit where the utterance was posted.
- info_subreddit_id: Reddit’s internal ID for that subreddit.
- id: original cad assigned ID (e.g., cad_1, cad_2).
- info_id: original identifier for the utterance (with the -title, -post suffix).
- info_id.parent: identifier of the parent utterance.
- info_id.link: identifier of the original submission that started the thread.
- info_thread.id: identifier grouping all utterances in the same Reddit thread.
- info_order: order of the utterance within its thread.
- info_image.saved: whether the utterance had an image saved with it (0 = no, 1 = yes).
- split: the dataset split in the original project, including train, dev, test, exclude_empty, exclude_bot, exclude_lang, and exclude_image.
- subreddit_seen: indicator of whether the subreddit was included in the annotation set (1) or not (0).
- entry_type: type of the utterance, including title, post, and comment.
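The metadata fields above lend themselves to simple filtering and counting. The sketch below works on plain dictionaries shaped like `{"id": ..., "meta": {...}}`, which is an assumed stand-in for however the utterances are loaded; the helper names are hypothetical. Label strings should be taken from the data itself, since the stored spelling may differ from the display names in the list above.

```python
from collections import Counter


def select_split(utterances, split_name):
    """Keep only utterances whose 'split' metadata equals split_name."""
    return [u for u in utterances if u["meta"].get("split") == split_name]


def primary_label_counts(utterances):
    """Count utterances per annotation_Primary value."""
    return Counter(u["meta"].get("annotation_Primary") for u in utterances)
```

For example, `primary_label_counts(select_split(utterances, "train"))` would give the label distribution of the training split while leaving out the `exclude_*` entries.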
Conversational-level information
Each Reddit thread (grouped by info_thread.id) is treated as a conversation. Within each thread, reply_to relations establish the comment tree structure.
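To illustrate how reply_to relations yield a comment tree, here is a minimal sketch over plain dictionaries with `id` and `reply_to` keys. The function names are hypothetical, and treating an utterance whose parent is absent from the thread as a root is an assumption made for this example.

```python
from collections import defaultdict


def build_reply_tree(utterances):
    """Return (roots, children) for one thread, following reply_to links."""
    known_ids = {u["id"] for u in utterances}
    children = defaultdict(list)
    roots = []
    for u in utterances:
        parent = u.get("reply_to")
        if parent is None or parent not in known_ids:
            # No parent, or the parent is missing from this thread: treat as a root.
            roots.append(u["id"])
        else:
            children[parent].append(u["id"])
    return roots, dict(children)


def depth_first_order(roots, children):
    """List utterance IDs in depth-first (reading) order."""
    order = []
    stack = list(reversed(roots))
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(children.get(node, [])))
    return order
```

Children are kept in input order here; sorting each list by info_order first would be one way to recover the original thread ordering.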
Basic Statistics
Number of Speakers: 11123
Number of Utterances: 26550
Number of Conversations: 1395
Number of titles: 1394
Number of posts: 1394
Number of comments: 23762
Primary labels: 'Neutral': 21935, 'IdentityDirectedAbuse': 2216, 'AffiliationDirectedAbuse': 1111, 'PersonDirectedAbuse': 951, 'CounterSpeech': 210, 'Slur': 127.
Contact
The original Contextual Abuse Dataset was introduced in the paper Introducing CAD: the Contextual Abuse Dataset (Vidgen et al., NAACL 2021). Corresponding Author: Bertie Vidgen ([email protected]).
The dataset was formatted for ConvoKit by Hao Wan ([email protected]).
The demo on transformer usage and analysis was provided by Jadon Geathers ([email protected]).
Data Access
Dataset with example script: https://drive.google.com/drive/folders/1biuTPwpuvCWDbmlwZMZN4iAQWoZ8tfty?usp=sharing