An overview of datasets to add in the new version
New datasets
Danish:
- Applied:
- Municipal chatbot: MUNI (already added)
- Historical
- historical clustering (already added). Potentially also Author style clustering? #144
- historical dataset which I discussed with Alie
- Potential bilingual English-Danish parallel corpus within the medical domain #186
- Linguistic Acceptability:
- Potential datasets to add from danish-semantic-reasoning-benchmark #172
- Add DDisco #169 (already added in mteb)
- Add 1000 talemaader
- Could we frame this as a pair classification dataset, where the correct definition is labelled correct and the abstract and random definitions are labelled wrong? (The "konkrete fejlfortolkninger", i.e. concrete misinterpretations, we will probably have to check.)
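A minimal sketch of the pair classification framing for 1000 talemaader, assuming each entry carries an idiom plus a correct, an abstract, and a random definition. The field names (`idiom`, `correct_def`, `abstract_def`, `random_def`), the output keys, and the example entry are all hypothetical and not taken from the actual dataset:

```python
from dataclasses import dataclass


@dataclass
class IdiomEntry:
    # Hypothetical schema; the real column names may differ.
    idiom: str
    correct_def: str
    abstract_def: str
    random_def: str


def to_pairs(entries):
    """Turn each idiom into one positive pair and two negative pairs."""
    pairs = []
    for e in entries:
        # The correct definition is the only positive (label 1).
        pairs.append({"sentence1": e.idiom, "sentence2": e.correct_def, "label": 1})
        # The abstract and random definitions are both negatives (label 0).
        for wrong in (e.abstract_def, e.random_def):
            pairs.append({"sentence1": e.idiom, "sentence2": wrong, "label": 0})
    return pairs


# Invented example entry, for illustration only.
entries = [
    IdiomEntry(
        idiom="placeholder idiom",
        correct_def="placeholder correct definition",
        abstract_def="placeholder abstract definition",
        random_def="placeholder random definition",
    )
]
pairs = to_pairs(entries)
```

This gives a 1:2 ratio of positives to negatives. If the concrete misinterpretations turn out to be usable, they could be added as a third negative in the same loop.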
Norwegian:
- Legal dataset: Mail communication with Hans
- Check whether there are datasets to add from here: #142
Multilingual:
- Add ScandiSent #151 (language ids might not be great)
- Potentially FAQs from: fix: Add WebFAQ Retrieval dataset embeddings-benchmark/mteb#2236
Multimodal:
- dan:
- text-image: https://huggingface.co/datasets/alexandrainst/nordjylland-news-image-captioning
- audio-text:
- https://huggingface.co/datasets/alexandrainst/coral
- ftspeech (speaker clustering?)
- nob
- swe
- audio-text:
- images:
- multilingual
- audio-text: Mozilla common voice
- ft
Remove
- Da Political Comments: Quality is questionable and there is no clear paper attached to it. Similar to DKHate
- Massive Intent: Translated dataset (MUNI is a strictly more realistic evaluation set)
- Massive Scenario: Translated dataset
- Potentially remove
- LCC: few samples, and labelled by only one person.
- Twitterhjerne: Questionable quality, we can probably replace it with reasonable retrieval alternatives from MTEB
Other Updates
- Replace datasets with their improved/faster variants in MTEB
- Downsample datasets where needed (at least the largest ones)
- See if there are relevant datasets within MTEB that can be added
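For the downsampling item, a rough sketch of a seeded, reproducible downsample could look like the following. The helper name `downsample`, the cap of 2048, and the seed are placeholders chosen for illustration, not values decided in this issue, and this is not necessarily how MTEB handles it:

```python
import random


def downsample(rows, max_size, seed=42):
    """Return at most max_size rows, chosen reproducibly and kept in original order."""
    if len(rows) <= max_size:
        # Small datasets are returned unchanged.
        return list(rows)
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    keep = sorted(rng.sample(range(len(rows)), max_size))
    return [rows[i] for i in keep]


rows = list(range(10_000))  # stand-in for a large dataset split
subset = downsample(rows, max_size=2048)
```

Fixing the seed keeps scores comparable across runs. For classification sets, a stratified variant that samples per label may be preferable so that rare classes are not lost.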