add tags information - url #1691

Merged · 2 commits · Mar 19, 2025

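Every change in this PR follows one pattern: the card's __tags__ dict gains a "url" entry pointing at the dataset's Hugging Face page, first in the Python prepare script and again in the catalog JSON generated from it. A minimal sketch of the pattern in Python, using the bioasq card as the example (the loader below is a stub for illustration; the real prepare scripts configure loaders, splits, and preprocessing in full):

# Sketch of the change applied across this PR: a "url" key is added to
# __tags__ next to the existing "license" key. The loader is a stub.
from unitxt.card import TaskCard
from unitxt.loaders import LoadHF

card = TaskCard(
    loader=LoadHF(path="enelpol/rag-mini-bioasq"),  # stub; real cards set more options
    task="tasks.rag.end_to_end",
    templates={"default": "templates.rag.end_to_end.json_predictions"},
    __tags__={
        "license": "cc-by-2.5",
        "url": "https://huggingface.co/datasets/enelpol/rag-mini-bioasq",
    },
)
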
prepare/cards/rag/end_to_end/bioasq.py (2 additions, 2 deletions)

@@ -36,7 +36,7 @@
     ],
     task="tasks.rag.end_to_end",
     templates={"default": "templates.rag.end_to_end.json_predictions"},
-    __tags__={"license": "cc-by-2.5"},
+    __tags__={"license": "cc-by-2.5", "url": "https://huggingface.co/datasets/enelpol/rag-mini-bioasq"},
     __description__="""This dataset is a subset of a training dataset by the BioASQ Challenge, which is available here.

 It is derived from rag-datasets/rag-mini-bioasq.
@@ -88,7 +88,7 @@
             output_format="",
         ),
     },
-    __tags__={"license": "cc-by-2.5"},
+    __tags__={"license": "cc-by-2.5", "url": "https://huggingface.co/datasets/enelpol/rag-mini-bioasq"},
     __description__="""This dataset is a subset of a training dataset by the BioASQ Challenge, which is available here.

 It is derived from rag-datasets/rag-mini-bioasq.

prepare/cards/rag/end_to_end/clapnq.py (4 additions, 0 deletions)

@@ -42,6 +42,8 @@ class ClapNqBenchmark:
         ),
     ],
     task="tasks.rag.end_to_end",
+    __tags__={"license": "Apache License 2.0", "url": "https://huggingface.co/datasets/PrimeQA/clapnq"},
+    __description__="""CLAP NQ is created from the subset of Natural Questions (NQ) that have a long answer but no short answer. NQ consists of ~380k examples. There are ~30k questions that are long answers without short answers, excluding tables and lists. To increase the likelihood of longer answers, we only explored ones that have more than 5 sentences in the passage. The subset that was annotated consists of ~12k examples. All examples where cohesion of non-consecutive sentences was required for the answer were annotated a second time. The final dataset is made up of all data that went through two rounds of annotation. (We provide the single-round annotations as well; they are training data only.) An equal number of unanswerable questions has also been added from the original NQ train/dev sets. Details about the annotation task and unanswerables can be found at https://github.com/primeqa/clapnq/blob/main/annotated_data.""",
     # templates=["templates.empty"],
     templates={"default": "templates.rag.end_to_end.json_predictions"},
 )
@@ -87,6 +89,8 @@ class ClapNqBenchmark:
         ),
     ],
     task="tasks.rag.corpora",
+    __tags__={"license": "Apache License 2.0", "url": "https://huggingface.co/datasets/PrimeQA/clapnq"},
+    __description__="""CLAP NQ is created from the subset of Natural Questions (NQ) that have a long answer but no short answer. NQ consists of ~380k examples. There are ~30k questions that are long answers without short answers, excluding tables and lists. To increase the likelihood of longer answers, we only explored ones that have more than 5 sentences in the passage. The subset that was annotated consists of ~12k examples. All examples where cohesion of non-consecutive sentences was required for the answer were annotated a second time. The final dataset is made up of all data that went through two rounds of annotation. (We provide the single-round annotations as well; they are training data only.) An equal number of unanswerable questions has also been added from the original NQ train/dev sets. Details about the annotation task and unanswerables can be found at https://github.com/primeqa/clapnq/blob/main/annotated_data.""",
     templates={
         "empty": InputOutputTemplate(
             input_format="",

prepare/cards/rag/end_to_end/hotpotqa.py (2 additions, 2 deletions)

@@ -58,7 +58,7 @@
     ],
     task="tasks.rag.end_to_end",
     templates={"default": "templates.rag.end_to_end.json_predictions"},
-    __tags__={"license": "CC BY-SA 4.0"},
+    __tags__={"license": "CC BY-SA 4.0", "url": "https://huggingface.co/datasets/BeIR/hotpotqa"},
     __description__="""HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.
 HotpotQA is a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison.
 """,
@@ -118,7 +118,7 @@
             output_format="",
         ),
     },
-    __tags__={"license": "CC BY-SA 4.0"},
+    __tags__={"license": "CC BY-SA 4.0", "url": "https://huggingface.co/datasets/BeIR/hotpotqa"},
     __description__="""HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.
 HotpotQA is a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison.
 """,

prepare/cards/rag/end_to_end/miniwikipedia.py (2 additions, 2 deletions)

@@ -27,7 +27,7 @@
     ],
     task="tasks.rag.end_to_end",
     templates={"default": "templates.rag.end_to_end.json_predictions"},
-    __tags__={"license": "cc-by-sa-3.0"},
+    __tags__={"license": "cc-by-2.5", "url": "https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia/"},
     __description__="""This dataset, a subset generated by the RAG-Datasets team, supports research in question answering by providing questions and answers derived from Wikipedia articles, along with difficulty ratings assigned by both question writers and answerers. It includes files for questions from three student cohorts (S08, S09, and S10) and 690,000 words of cleaned Wikipedia text, facilitating exploration of question generation and answering tasks.""",
 )

@@ -72,7 +72,7 @@
             output_format="",
         ),
     },
-    __tags__={"license": "cc-by-2.5"},
+    __tags__={"license": "cc-by-2.5", "url": "https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia/"},
     __description__="""This dataset, a subset generated by the RAG-Datasets team, supports research in question answering by providing questions and answers derived from Wikipedia articles, along with difficulty ratings assigned by both question writers and answerers. It includes files for questions from three student cohorts (S08, S09, and S10) and 690,000 words of cleaned Wikipedia text, facilitating exploration of question generation and answering tasks.""",
 )

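The JSON files that follow are the generated catalog entries for these same cards. They are not edited by hand: each prepare script finishes by writing its card into the catalog, which serializes __tags__ and __description__ along with the rest of the card, so each Python change above is mirrored one-to-one in a JSON change below. A rough sketch of that publishing step, assuming the standard add_to_catalog helper; "card" here is the TaskCard from the sketch near the top of this page, and the catalog name mirrors the JSON path under src/unitxt/catalog/cards/:

# Sketch: publishing a card rewrites its catalog JSON, including the new
# "url" tag. Catalog names map to file paths, so "cards.rag.benchmark.bioasq.en"
# lands at src/unitxt/catalog/cards/rag/benchmark/bioasq/en.json.
from unitxt import add_to_catalog

add_to_catalog(card, "cards.rag.benchmark.bioasq.en", overwrite=True)
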
src/unitxt/catalog/cards/rag/benchmark/bioasq/en.json (2 additions, 1 deletion)

@@ -35,7 +35,8 @@
         "default": "templates.rag.end_to_end.json_predictions"
     },
     "__tags__": {
-        "license": "cc-by-2.5"
+        "license": "cc-by-2.5",
+        "url": "https://huggingface.co/datasets/enelpol/rag-mini-bioasq"
     },
     "__description__": "This dataset is a subset of a training dataset by the BioASQ Challenge, which is available here.\n\nIt is derived from rag-datasets/rag-mini-bioasq.\n\nModifications include:\n\nfilling in missing passages (some of them contained \"nan\" instead of actual text),\nchanging relevant_passage_ids' type from string to sequence of ints,\ndeduplicating the passages (removed 40 duplicates) and fixing the relevant_passage_ids in QAP triplets to point to the corrected, deduplicated passages' ids,\nsplitting QAP triplets into train and test splits.\n"
 }

src/unitxt/catalog/cards/rag/benchmark/clap_nq/en.json (5 additions, 0 deletions)

@@ -35,6 +35,11 @@
         }
     ],
     "task": "tasks.rag.end_to_end",
+    "__tags__": {
+        "license": "Apache License 2.0",
+        "url": "https://huggingface.co/datasets/PrimeQA/clapnq"
+    },
+    "__description__": "CLAP NQ is created from the subset of Natural Questions (NQ) that have a long answer but no short answer. NQ consists of ~380k examples. There are ~30k questions that are long answers without short answers, excluding tables and lists. To increase the likelihood of longer answers, we only explored ones that have more than 5 sentences in the passage. The subset that was annotated consists of ~12k examples. All examples where cohesion of non-consecutive sentences was required for the answer were annotated a second time. The final dataset is made up of all data that went through two rounds of annotation. (We provide the single-round annotations as well; they are training data only.) An equal number of unanswerable questions has also been added from the original NQ train/dev sets. Details about the annotation task and unanswerables can be found at https://github.com/primeqa/clapnq/blob/main/annotated_data.",
     "templates": {
         "default": "templates.rag.end_to_end.json_predictions"
     }

src/unitxt/catalog/cards/rag/benchmark/hotpotqa/en.json (2 additions, 1 deletion)

@@ -54,7 +54,8 @@
         "default": "templates.rag.end_to_end.json_predictions"
     },
     "__tags__": {
-        "license": "CC BY-SA 4.0"
+        "license": "CC BY-SA 4.0",
+        "url": "https://huggingface.co/datasets/BeIR/hotpotqa"
     },
     "__description__": "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.\nHotpotQA is a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison.\n "
 }

src/unitxt/catalog/cards/rag/benchmark/miniwiki/en.json (2 additions, 1 deletion)

@@ -33,7 +33,8 @@
         "default": "templates.rag.end_to_end.json_predictions"
     },
     "__tags__": {
-        "license": "cc-by-sa-3.0"
+        "license": "cc-by-2.5",
+        "url": "https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia/"
     },
     "__description__": "This dataset, a subset generated by the RAG-Datasets team, supports research in question answering by providing questions and answers derived from Wikipedia articles, along with difficulty ratings assigned by both question writers and answerers. It includes files for questions from three student cohorts (S08, S09, and S10) and 690,000 words of cleaned Wikipedia text, facilitating exploration of question generation and answering tasks."
 }

src/unitxt/catalog/cards/rag/documents/bioasq/en.json (2 additions, 1 deletion)

@@ -41,7 +41,8 @@
         }
     },
     "__tags__": {
-        "license": "cc-by-2.5"
+        "license": "cc-by-2.5",
+        "url": "https://huggingface.co/datasets/enelpol/rag-mini-bioasq"
     },
     "__description__": "This dataset is a subset of a training dataset by the BioASQ Challenge, which is available here.\n\nIt is derived from rag-datasets/rag-mini-bioasq.\n\nModifications include:\n\nfilling in missing passages (some of them contained \"nan\" instead of actual text),\nchanging relevant_passage_ids' type from string to sequence of ints,\ndeduplicating the passages (removed 40 duplicates) and fixing the relevant_passage_ids in QAP triplets to point to the corrected, deduplicated passages' ids,\nsplitting QAP triplets into train and test splits.\n"
 }

src/unitxt/catalog/cards/rag/documents/clap_nq/en.json (5 additions, 0 deletions)

@@ -30,6 +30,11 @@
         }
     ],
     "task": "tasks.rag.corpora",
+    "__tags__": {
+        "license": "Apache License 2.0",
+        "url": "https://huggingface.co/datasets/PrimeQA/clapnq"
+    },
+    "__description__": "CLAP NQ is created from the subset of Natural Questions (NQ) that have a long answer but no short answer. NQ consists of ~380k examples. There are ~30k questions that are long answers without short answers, excluding tables and lists. To increase the likelihood of longer answers, we only explored ones that have more than 5 sentences in the passage. The subset that was annotated consists of ~12k examples. All examples where cohesion of non-consecutive sentences was required for the answer were annotated a second time. The final dataset is made up of all data that went through two rounds of annotation. (We provide the single-round annotations as well; they are training data only.) An equal number of unanswerable questions has also been added from the original NQ train/dev sets. Details about the annotation task and unanswerables can be found at https://github.com/primeqa/clapnq/blob/main/annotated_data.",
     "templates": {
         "empty": {
             "__type__": "input_output_template",

src/unitxt/catalog/cards/rag/documents/hotpotqa/en.json (2 additions, 1 deletion)

@@ -67,7 +67,8 @@
         }
     },
     "__tags__": {
-        "license": "CC BY-SA 4.0"
+        "license": "CC BY-SA 4.0",
+        "url": "https://huggingface.co/datasets/BeIR/hotpotqa"
     },
     "__description__": "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering.\nHotpotQA is a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison.\n"
 }

src/unitxt/catalog/cards/rag/documents/miniwiki/en.json (2 additions, 1 deletion)

@@ -37,7 +37,8 @@
         }
     },
     "__tags__": {
-        "license": "cc-by-2.5"
+        "license": "cc-by-2.5",
+        "url": "https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia/"
     },
     "__description__": "This dataset, a subset generated by the RAG-Datasets team, supports research in question answering by providing questions and answers derived from Wikipedia articles, along with difficulty ratings assigned by both question writers and answerers. It includes files for questions from three student cohorts (S08, S09, and S10) and 690,000 words of cleaned Wikipedia text, facilitating exploration of question generation and answering tasks."
 }

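Once merged, the new tags are plain JSON in the catalog, so the change can be verified without any unitxt machinery. A small stdlib check, assuming it runs from the repository root and that the en.json files under cards/rag are the ones shown in this diff:

# Verify that every RAG catalog entry touched by this PR now carries a
# "url" tag; prints each card's URL or fails on the first missing one.
import json
from pathlib import Path

for path in sorted(Path("src/unitxt/catalog/cards/rag").rglob("en.json")):
    tags = json.loads(path.read_text()).get("__tags__", {})
    assert "url" in tags, f"missing url tag in {path}"
    print(f"{path}: {tags['url']}")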