[Feature request] Indexing datasets by a custom id field to enable random access to dataset items via the id

### Feature request

Some datasets contain an id-like field, for example the `id` field in [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) and the `_id` field in [BeIR/dbpedia-entity](https://huggingface.co/datasets/BeIR/dbpedia-entity). HF datasets support efficient random access by row index, but not by these kinds of id fields. I wonder if it is possible to add support for indexing by a custom "id-like" field to enable random access via such ids. The ids may be numbers or strings.

### Motivation

In some cases, especially during inference/evaluation, I may want to look up the item with a specified id, where the id is defined by the dataset itself.

For example, in a typical re-ranking setting in information retrieval, the user may want to re-rank the set of candidate documents for each query. The input is usually presented in a TREC-style run file, with the following format:

```
<qid> Q0 <docno> <rank> <score> <tag>
```
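To make the format concrete, here is a small sketch of how one line of such a run file could be parsed. The names `RunEntry` and `parse_run_line` and the sample line are hypothetical, made up for illustration; the sketch assumes the six fields are whitespace-separated and that ids contain no spaces.

```python
from typing import NamedTuple


class RunEntry(NamedTuple):
    """One line of a TREC-style run file (hypothetical helper type)."""
    qid: str
    docno: str
    rank: int
    score: float
    tag: str


def parse_run_line(line: str) -> RunEntry:
    # Fields: <qid> Q0 <docno> <rank> <score> <tag>; the literal "Q0" column is ignored.
    qid, _q0, docno, rank, score, tag = line.split()
    return RunEntry(qid, docno, int(rank), float(score), tag)


# Made-up sample line, only to show the shape of the result.
entry = parse_run_line("q1 Q0 doc42 1 13.5 bm25")
print(entry.qid, entry.docno)
```

The `qid` and `docno` values obtained this way are the dataset-defined ids that need to be resolved to actual rows.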

The re-ranking program should be able to fetch the queries and documents by `<qid>` and `<docno>`, which are the original ids defined in the query/document datasets. To accomplish this, I currently have to iterate over the whole HF dataset to build the mapping from real ids to row ids every time I start the program, which is time-consuming. Thus I would like HF datasets to provide an option for users to index by a custom id column, rather than only by row.
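For reference, the workaround I mean looks roughly like the sketch below. It is not an existing `datasets` API: `build_id_index` and `load_or_build_index` are hypothetical helpers, and a plain list of dicts stands in for the dataset so the snippet runs on its own. With a real HF dataset I would expect to pass the id column (e.g. `ds["_id"]`) as `ids` and use `ds[row]` for the lookup. Caching the mapping as JSON is just one assumed way to avoid the scan on every start.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable


def build_id_index(ids: Iterable[Any]) -> Dict[str, int]:
    """Map each id (converted to str) to its row number; reject duplicates."""
    index: Dict[str, int] = {}
    for row, raw_id in enumerate(ids):
        key = str(raw_id)
        if key in index:
            raise ValueError(f"duplicate id {key!r} at rows {index[key]} and {row}")
        index[key] = row
    return index


def load_or_build_index(ids: Iterable[Any], cache_path: Path) -> Dict[str, int]:
    """Reuse a cached mapping if present, otherwise scan once and save it."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    index = build_id_index(ids)
    cache_path.write_text(json.dumps(index))
    return index


# Stand-in for a dataset: rows are dicts with a dataset-defined id field.
rows = [{"_id": "doc7", "text": "first"}, {"_id": "doc42", "text": "second"}]
index = build_id_index(row["_id"] for row in rows)
doc = rows[index["doc42"]]  # random access by the dataset's own id
print(doc["text"])
```

The cache is only valid while the dataset's row order stays the same, and numeric ids become strings in this sketch. Built-in support would avoid both caveats.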

### Your contribution

I'm not an expert in this project, and I'm afraid I'm not able to contribute code.

[Feature request] Indexing datasets by a custom id field to enable random access to dataset items via the id #6532
