feat(api): add /shard endpoint for row-to-shard mapping #3276
## Summary

Adds a `/shard` endpoint that maps a row index to its corresponding shard information. This enables data provenance tracking: users can determine which original input file a specific row came from.

This is the "next step" mentioned in huggingface/datasets#7897, which added `original_shard_lengths` to split info.

## API
Response:

```json
{
  "row_index": 150,
  "original_shard_index": 1,
  "original_shard_start_row": 100,
  "original_shard_end_row": 199,
  "parquet_shard_index": 0,
  "parquet_shard_file": "train-00000-of-00002.parquet"
}
```

For legacy datasets without `original_shard_lengths`, the original shard fields return `null` with an explanatory message.
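For concreteness, here is a minimal sketch of the row-to-shard mapping: a prefix sum over the per-shard row counts, then a binary search. The helper name, signature, and return shape are illustrations, not necessarily what `shard_utils.py` implements.

```python
from bisect import bisect_right
from itertools import accumulate


def map_row_to_shard(row_index: int, shard_lengths: list[int]) -> dict:
    """Locate the shard containing row_index, given per-shard row counts."""
    # Cumulative end offsets, e.g. lengths [100, 100] -> [100, 200].
    ends = list(accumulate(shard_lengths))
    if not 0 <= row_index < ends[-1]:
        raise IndexError(f"row {row_index} is out of range [0, {ends[-1]})")
    # The first shard whose cumulative end exceeds row_index contains it.
    shard_index = bisect_right(ends, row_index)
    start = ends[shard_index - 1] if shard_index > 0 else 0
    return {
        "row_index": row_index,
        "original_shard_index": shard_index,
        "original_shard_start_row": start,
        "original_shard_end_row": ends[shard_index] - 1,
    }
```

With the example above, `map_row_to_shard(150, [100, 100])` yields the original-shard fields shown in the sample response: shard 1, rows 100 to 199.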
## Implementation

- Reuses the cached `config-parquet-and-info` response for efficiency
- Follows the existing pattern in `duckdb.py`
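A rough sketch of what the route could look like, assuming a Starlette-style endpoint like the other routes in the API service. The query parameter names, the `get_original_shard_lengths` helper, and the legacy-dataset message are assumptions for illustration; `map_row_to_shard` is reused from the sketch above, and the parquet-shard fields are omitted for brevity.

```python
from typing import Optional

from starlette.requests import Request
from starlette.responses import JSONResponse


def get_original_shard_lengths(dataset: str, config: str, split: str) -> Optional[list[int]]:
    """Stub for illustration: the real code would read the cached
    config-parquet-and-info response; None stands for a legacy dataset
    without original_shard_lengths."""
    raise NotImplementedError


async def shard_endpoint(request: Request) -> JSONResponse:
    # Query parameter names are assumptions for illustration.
    dataset = request.query_params.get("dataset")
    config = request.query_params.get("config")
    split = request.query_params.get("split")
    try:
        row_index = int(request.query_params.get("row_index", ""))
    except ValueError:
        return JSONResponse({"error": "row_index must be an integer"}, status_code=422)
    if not dataset or not config or not split:
        return JSONResponse({"error": "missing required parameter"}, status_code=422)

    shard_lengths = get_original_shard_lengths(dataset, config, split)
    if shard_lengths is None:
        # Legacy dataset: the original shard fields are null, with a message.
        return JSONResponse(
            {
                "row_index": row_index,
                "original_shard_index": None,
                "original_shard_start_row": None,
                "original_shard_end_row": None,
                "message": "original_shard_lengths is not available for this dataset",
            }
        )
    return JSONResponse(map_row_to_shard(row_index, shard_lengths))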
## Files Changed

- `libs/libapi/src/libapi/shard_utils.py`
- `services/api/src/api/routes/shard.py`
- `services/api/src/api/app.py`
- `docs/source/openapi.json`

## Test Plan
- Unit tests (`libs/libapi/tests/test_shard_utils.py`)
- API route tests (`services/api/tests/routes/test_shard.py`)
- End-to-end tests (`e2e/tests/test_56_shard.py`)
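A minimal pytest sketch of the kind of unit test described above, written against the hypothetical `map_row_to_shard` helper from the earlier sketch (the import path exists per the file list; the function name is an assumption):

```python
import pytest

from libapi.shard_utils import map_row_to_shard  # function name is an assumption


def test_row_maps_to_second_shard():
    # Two original shards of 100 rows each; row 150 falls in shard 1.
    info = map_row_to_shard(150, [100, 100])
    assert info["original_shard_index"] == 1
    assert info["original_shard_start_row"] == 100
    assert info["original_shard_end_row"] == 199


def test_out_of_range_row_raises():
    with pytest.raises(IndexError):
        map_row_to_shard(200, [100, 100])
```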