
[Dataset] Contextual Abuse Dataset #310

@haowanhw

Description

Contextual Abuse Dataset (CAD) Corpus

This corpus contains 26,550 annotated Reddit entries (1,394 post titles, 1,394 post bodies, and 23,762 comments). Each entry is labeled with one or more of six primary categories: Identity-directed abuse, Affiliation-directed abuse, Person-directed abuse, Counter Speech, Non-hateful Slurs, and Neutral, with additional secondary subcategories such as Derogation, Animosity, Threatening, Dehumanization, and Glorification.

Attribution: Bertie Vidgen, Dong Nguyen, Helen Margetts, Patricia Rossini, and Rebekah Tromble. 2021. Introducing CAD: the Contextual Abuse Dataset. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2289–2303, Online. Association for Computational Linguistics. Available: https://aclanthology.org/2021.naacl-main.182/

Dataset Details

Speaker-level information

Speakers in this dataset correspond to Reddit users. Each Speaker object is created from the meta_author field. If the author value is missing, NA, or deleted, the speaker ID is substituted with [deleted].
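The substitution rule above can be sketched as a small helper. The function name is hypothetical, and the exact set of values treated as missing (beyond the NA/deleted cases named above) is an assumption:

```python
def speaker_id(meta_author):
    """Return the speaker ID for a raw meta_author value, substituting
    '[deleted]' when the author is missing, NA, or deleted.
    Hypothetical helper mirroring the convention described above."""
    if meta_author is None:
        return "[deleted]"
    author = str(meta_author).strip()
    # Assumed set of sentinel values; the dataset names NA and deleted.
    if author in ("", "NA", "[deleted]"):
        return "[deleted]"
    return author
```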

Utterance-level information

Each utterance corresponds to one Reddit entry (title, post body, or comment).

Utterance fields:

  • id: an identifier for the utterance (taken from info_id).
  • conversation_id: an identifier for the Reddit thread from which the utterance was taken.
  • reply_to: id of the parent post/comment (info_id.parent), or None if no valid parent exists.
  • speaker: Reddit username of the author of the utterance.
  • timestamp: time the utterance was created (Unix timestamp in seconds).
  • text: the cleaned textual content of the utterance, with [linebreak] markers replaced by \n.
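The [linebreak] replacement described for the text field amounts to a one-line transformation; clean_text is a hypothetical name for it:

```python
def clean_text(raw):
    # Replace the corpus's "[linebreak]" markers with real newlines.
    return raw.replace("[linebreak]", "\n")

cleaned = clean_text("First paragraph[linebreak]Second paragraph")
```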

Utterance metadata:

  • annotation_Primary: the main abuse category assigned by trained experts. Possible values: Identity-directed abuse, Affiliation-directed abuse, Person-directed abuse, Counter Speech, Non-hateful Slurs, Neutral.
  • annotation_Secondary: subtype of abuse. Examples include: Derogation, Animosity, Threatening, Dehumanization, Glorification.
  • annotation_Context: whether the utterance requires additional context to interpret the label (Yes / No / NA).
  • annotation_Target: the specific individual or group targeted by abuse. Examples include: Women, Men, Immigrants, Political groups.
  • annotation_Target_top.level.category: a higher-level category of the target. Examples include: Identity, Group, Other.
  • annotation_highlighted: text span(s) highlighted by annotators as containing abusive or offensive content. "NA" if none.
  • meta_date: UTC date of the creation of the utterance (YYYY-MM-DD).
  • meta_created_utc: UNIX timestamp of the creation of the utterance.
  • meta_day: day of the creation of the utterance (YYYY-MM-DD).
  • meta_permalink: Reddit permalink to the original post or comment.
  • info_subreddit: name of the subreddit where the utterance was posted.
  • info_subreddit_id: Reddit’s internal ID for that subreddit.
  • id: the original CAD-assigned ID (e.g., cad_1, cad_2).
  • info_id: original identifier for the utterance (with the -title or -post suffix where applicable).
  • info_id.parent: identifier of the parent utterance.
  • info_id.link: identifier of the original submission that started the thread.
  • info_thread.id: identifier grouping all utterances in the same Reddit thread.
  • info_order: order of the utterance within its thread.
  • info_image.saved: whether the utterance had an image saved with it (0 = no, 1 = yes).
  • split: the dataset split from the original project: one of train, dev, test, exclude_empty, exclude_bot, exclude_lang, or exclude_image.
  • subreddit_seen: indicator of whether the subreddit was included in the annotation set (1) or not (0).
  • entry_type: the type of the utterance: title, post, or comment.
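A common use of these fields is filtering utterances by annotation_Primary. A minimal sketch with plain dicts standing in for utterance metadata (the dict layout is illustrative, not ConvoKit's actual Utterance class; label values are from the set documented above):

```python
# Two stand-in utterances with the metadata field documented above.
utterances = [
    {"id": "cad_1-title", "meta": {"annotation_Primary": "Neutral"}},
    {"id": "cad_2", "meta": {"annotation_Primary": "IdentityDirectedAbuse"}},
]

# Keep only the entries whose primary label is not Neutral.
abusive_ids = [u["id"] for u in utterances
               if u["meta"]["annotation_Primary"] != "Neutral"]
```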

Conversational-level information

Each Reddit thread (grouped by info_thread.id) is treated as a conversation. Within each thread, reply_to relations establish the comment tree structure.
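The reply_to relations described above can be turned into a comment tree by grouping each utterance under its parent. A minimal sketch, with hypothetical IDs (not actual CAD identifiers):

```python
from collections import defaultdict

def build_reply_tree(utterances):
    """Group utterance IDs by their reply_to parent; top-level
    entries (reply_to is None) end up under the None key."""
    children = defaultdict(list)
    for utt in utterances:
        children[utt["reply_to"]].append(utt["id"])
    return children

thread = [
    {"id": "root", "reply_to": None},
    {"id": "c1", "reply_to": "root"},
    {"id": "c2", "reply_to": "c1"},
    {"id": "c3", "reply_to": "root"},
]
tree = build_reply_tree(thread)
```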

Basic Statistics

  • Number of Speakers: 11,123
  • Number of Utterances: 26,550
  • Number of Conversations: 1,395

  • Number of titles: 1,394
  • Number of posts: 1,394
  • Number of comments: 23,762

Primary labels: Neutral: 21,935; IdentityDirectedAbuse: 2,216; AffiliationDirectedAbuse: 1,111; PersonDirectedAbuse: 951; CounterSpeech: 210; Slur: 127.
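As a quick sanity check, the per-label counts above sum exactly to the 26,550 utterances in the corpus:

```python
from collections import Counter

# Primary-label counts as reported above.
primary = Counter({
    "Neutral": 21935,
    "IdentityDirectedAbuse": 2216,
    "AffiliationDirectedAbuse": 1111,
    "PersonDirectedAbuse": 951,
    "CounterSpeech": 210,
    "Slur": 127,
})

total = sum(primary.values())  # 26550, matching the utterance count above
```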

Contact

The original Contextual Abuse Dataset was introduced in the paper Introducing CAD: the Contextual Abuse Dataset (Vidgen et al., NAACL 2021). Corresponding Author: Bertie Vidgen ([email protected]).

The dataset was formatted for Convokit by Hao Wan ([email protected]).
The demo on transformer usage and analysis was provided by Jadon Geathers ([email protected]).

Data Access

Dataset with example script: https://drive.google.com/drive/folders/1biuTPwpuvCWDbmlwZMZN4iAQWoZ8tfty?usp=sharing
