Description
Feature request
Some datasets contain an id-like field, for example the id field in wikimedia/wikipedia and the _id field in BeIR/dbpedia-entity. HF datasets supports efficient random access by row number, but not by these kinds of id fields. I wonder whether it would be possible to add support for indexing by a custom "id-like" field, to enable random access via such ids. The ids may be numbers or strings.
Motivation
In some cases, especially during inference/evaluation, I may want to look up the item with a specific id, where the id is defined by the dataset itself.
For example, in a typical re-ranking setting in information retrieval, the user may want to re-rank the set of candidate documents of each query. The input is usually presented in a TREC-style run file, with the following format:
<qid> Q0 <docno> <rank> <score> <tag>
The re-ranking program should be able to fetch the queries and documents according to the <qid> and <docno>, which are the original ids defined in the query/document datasets. To accomplish this, I currently have to iterate over the whole HF dataset to build the mapping from real ids to row ids every time I start the program, which is time-consuming. Thus I would like HF datasets to provide an option for users to index by a custom id column instead of by row.
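To make the current workaround concrete, here is a minimal sketch of what I do today, written in plain Python so it runs on its own. The names parse_run_line, build_id_index and RunEntry are hypothetical helpers of mine, not part of the datasets API, and the docs list is made-up data standing in for a real document dataset.

```python
from typing import Dict, Hashable, Iterable, NamedTuple


class RunEntry(NamedTuple):
    qid: str
    docno: str
    rank: int
    score: float
    tag: str


def parse_run_line(line: str) -> RunEntry:
    """Parse one '<qid> Q0 <docno> <rank> <score> <tag>' line."""
    qid, _q0, docno, rank, score, tag = line.split()
    return RunEntry(qid, docno, int(rank), float(score), tag)


def build_id_index(ids: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Map each dataset-defined id to its row number (one full pass)."""
    index: Dict[Hashable, int] = {}
    for row, key in enumerate(ids):
        if key in index:
            raise ValueError(f"duplicate id {key!r} at rows {index[key]} and {row}")
        index[key] = row
    return index


# Stand-in for a document dataset; the rows are made up for illustration.
docs = [
    {"_id": "d7", "text": "first document"},
    {"_id": "d2", "text": "second document"},
    {"_id": "d9", "text": "third document"},
]

# This full pass is the step that has to be repeated on every program start.
doc_index = build_id_index(row["_id"] for row in docs)

entry = parse_run_line("q1 Q0 d2 1 12.5 bm25")
doc = docs[doc_index[entry.docno]]  # random access by id, via the row number
print(doc["text"])
```

With a real HF dataset I would expect to pass the id column (for example ds["_id"]) to build_id_index and then fetch rows with ds[row], but the full pass over the column is exactly the cost I would like the library to avoid or cache for me.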
Your contribution
I'm not an expert in this project, and I'm afraid I'm not able to contribute code.