Conversation

@The-Obstacle-Is-The-Way

Summary

Adds a /shard endpoint that maps a row index to its corresponding shard information. This enables data provenance tracking: users can determine which original input file a specific row came from.

This is the "next step" mentioned in huggingface/datasets#7897, which added `original_shard_lengths` to split info.

API

GET /shard?dataset=X&config=Y&split=Z&row=N

Response:

{
  "row_index": 150,
  "original_shard_index": 1,
  "original_shard_start_row": 100,
  "original_shard_end_row": 199,
  "parquet_shard_index": 0,
  "parquet_shard_file": "train-00000-of-00002.parquet"
}

For legacy datasets without `original_shard_lengths`, the original shard fields return `null` with an explanatory message.
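
As a quick illustration, a client could call the endpoint along these lines (a minimal sketch; the base URL, function name, and error handling are placeholders, not part of this PR):

import requests

# Placeholder base URL for the dataset viewer API; substitute the real host.
BASE_URL = "https://datasets-server.huggingface.co"

def get_shard_info(dataset: str, config: str, split: str, row: int) -> dict:
    """Fetch the shard mapping for a single row via the /shard endpoint."""
    response = requests.get(
        f"{BASE_URL}/shard",
        params={"dataset": dataset, "config": config, "split": split, "row": row},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

info = get_shard_info("my_dataset", "default", "train", 150)
print(info["original_shard_index"], info["parquet_shard_file"])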

Implementation

  • Single cache call to config-parquet-and-info for efficiency
  • Cumulative sum algorithm to find shard boundaries (see the sketch after this list)
  • Proper error handling (no silent fallbacks - raises errors for corrupted metadata)
  • Follows existing codebase patterns (validated against duckdb.py)
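
The cumulative-sum lookup could be sketched roughly as follows (illustrative only; the function name and return shape are assumptions, not the exact code in shard_utils.py):

def find_original_shard(row: int, original_shard_lengths: list[int]) -> tuple[int, int, int]:
    """Map an absolute row index to (shard_index, start_row, end_row) using a running sum."""
    cumulative = 0
    for shard_index, length in enumerate(original_shard_lengths):
        start_row = cumulative
        cumulative += length
        if row < cumulative:
            return shard_index, start_row, cumulative - 1
    raise ValueError(f"row {row} is out of range for {cumulative} total rows")

# Example matching the response above: row 150 with shard lengths [100, 100, 50]
# falls in original shard 1, which covers rows 100-199.
print(find_original_shard(150, [100, 100, 50]))  # (1, 100, 199)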

Files Changed

File                                     Purpose
libs/libapi/src/libapi/shard_utils.py    Core shard lookup algorithm
services/api/src/api/routes/shard.py     Endpoint handler
services/api/src/api/app.py              Route registration
docs/source/openapi.json                 API specification

Test Plan

  • Unit tests (libs/libapi/tests/test_shard_utils.py)
  • Integration tests (services/api/tests/routes/test_shard.py)
  • E2E tests (e2e/tests/test_56_shard.py)
  • mypy passes
  • ruff formatting applied

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Implements Row-to-Shard API that maps a row index to its original input
shard and parquet output shard. This enables data provenance tracking
for datasets using the new `original_shard_lengths` field from
huggingface/datasets PR #7897.

API: GET /shard?dataset=X&config=Y&split=Z&row=N

- Adds core algorithm in libapi/shard_utils.py
- Registers /shard route in API service (no nginx changes needed)
- Uses single cache call to config-parquet-and-info for optimization
- Handles missing original_shard_lengths for legacy datasets
- Includes unit, integration, and E2E tests
- Consolidate single-line statements within 119 char limit
- Sort imports per isort rules (I001)
- Add trailing commas to dict literals
- Expand long dict entries to multi-line format

All CI quality checks verified locally:
- ruff check src/tests: PASS
- ruff format --check src/tests: PASS
- mypy src/tests: PASS
- bandit -r src --skip B615: PASS

Critical fixes:
- Add ResponseNotFoundError/ResponseNotReadyError to exception handling
  (fixes 404 being incorrectly returned as 500)
- Add shard_lengths validation to catch corrupted parquet metadata early
- Add headers to 400 response in OpenAPI spec for consistency
- Add 500 response to OpenAPI spec for completeness

Minor improvements:
- DRY: store sum(original_shard_lengths) in variable
- Improve test assertion specificity (assert error code value)

Add missing return type annotations and parameter type hints to
fixtures and test functions to satisfy mypy strict checking.

AI-generated code was fabricating filenames instead of raising errors:
- Empty parquet_files -> was returning fabricated "{split}.parquet"
- More shards than files -> was fabricating "{split}-{idx:05d}.parquet"

Now follows codebase pattern (duckdb.py:97-98): raise ValueError for
metadata inconsistencies instead of hiding data corruption.

Added tests for both error cases.
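
The guard described above might look like the following sketch (function and variable names are illustrative, not taken from the actual shard_utils.py code):

# Validate parquet metadata before building a response; never fabricate filenames.
def resolve_parquet_file(parquet_files: list[str], parquet_shard_index: int) -> str:
    if not parquet_files:
        raise ValueError("No parquet files found for this split; parquet metadata may be corrupted.")
    if parquet_shard_index >= len(parquet_files):
        raise ValueError(
            f"Parquet shard index {parquet_shard_index} exceeds the "
            f"{len(parquet_files)} available parquet files."
        )
    return parquet_files[parquet_shard_index]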