v2 Dataset Overview (Issue #194)

@KennethEnevoldsen

Description

An overview of datasets to add in the new version

New datasets

Danish:

Norwegian:

Multilingual:

Multimodal:

Remove

  • Da Political Comments: Quality is questionable and there is no clear paper attached to it. Similar to DKHate
  • Massive Intent: Translated dataset (MUNI is a strictly more realistic evaluation set)
  • Massive Scenario: Translated dataset
  • Potentially remove
    • LCC: Few samples, and labelled by only a single annotator.
    • Twitterhjerne: Questionable quality; we can probably replace it with reasonable retrieval alternatives from MTEB

Other Updates

  • Replace datasets with their improved/faster variants in MTEB
  • Downsample datasets where needed (at least the largest ones)
  • See if there are relevant datasets within MTEB that can be added
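
For the downsampling item, one possible approach is a seeded, label-stratified sample, so the smaller set stays reproducible and keeps roughly the same class balance. The sketch below is only an illustration, not the project's or MTEB's implementation: the function name `downsample`, the `label_key` parameter, and the list-of-dicts data layout are all assumptions.

```python
import random
from collections import defaultdict


def downsample(examples, max_size, label_key="label", seed=42):
    """Return at most max_size examples, roughly preserving label proportions.

    Hypothetical helper: `examples` is assumed to be a list of dicts that
    each carry a class label under `label_key`.
    """
    if len(examples) <= max_size:
        # Small datasets are returned unchanged.
        return list(examples)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible

    # Group examples by label so each class can be sampled separately.
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[label_key]].append(example)

    sampled = []
    for group in by_label.values():
        # Proportional share of the budget, with at least one example per label.
        k = max(1, round(len(group) * max_size / len(examples)))
        sampled.extend(rng.sample(group, min(k, len(group))))

    # Rounding can overshoot the budget slightly, so shuffle and truncate.
    rng.shuffle(sampled)
    return sampled[:max_size]
```

Calling `downsample(examples, 2048)` would leave datasets with 2048 or fewer examples untouched and shrink larger ones; the cap of 2048 is just an example value, not a decision made in this issue.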
