Conversation

@The-Obstacle-Is-The-Way

Summary

Adds a /shard endpoint that maps a row index to its corresponding shard information. This enables data provenance tracking: users can determine which original input file a specific row came from.

This is the "next step" mentioned in huggingface/datasets#7897, which added `original_shard_lengths` to split info.

API

GET /shard?dataset=X&config=Y&split=Z&row=N

Response:

{
  "row_index": 150,
  "original_shard_index": 1,
  "original_shard_start_row": 100,
  "original_shard_end_row": 199,
  "parquet_shard_index": 0,
  "parquet_shard_file": "train-00000-of-00002.parquet"
}

For legacy datasets without `original_shard_lengths`, the original shard fields return `null` with an explanatory message.
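
As a quick illustration, a client could call the endpoint along these lines (a minimal sketch; the base URL, function name, and error handling are placeholders, not part of this PR):

import requests

# Placeholder base URL for the dataset viewer API; substitute the real host.
BASE_URL = "https://datasets-server.huggingface.co"

def get_shard_info(dataset: str, config: str, split: str, row: int) -> dict:
    """Fetch the shard mapping for a single row via the /shard endpoint."""
    response = requests.get(
        f"{BASE_URL}/shard",
        params={"dataset": dataset, "config": config, "split": split, "row": row},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

info = get_shard_info("my_dataset", "default", "train", 150)
print(info["original_shard_index"], info["parquet_shard_file"])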

Implementation

  • Single cache call to config-parquet-and-info for efficiency
  • Cumulative sum algorithm to find shard boundaries (see the sketch after this list)
  • Proper error handling (no silent fallbacks - raises errors for corrupted metadata)
  • Follows existing codebase patterns (validated against duckdb.py)
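
The cumulative-sum lookup could be sketched roughly as follows (illustrative only; the function name and return shape are assumptions, not the exact code in shard_utils.py):

def find_original_shard(row: int, original_shard_lengths: list[int]) -> tuple[int, int, int]:
    """Map an absolute row index to (shard_index, start_row, end_row) using a running sum."""
    cumulative = 0
    for shard_index, length in enumerate(original_shard_lengths):
        start_row = cumulative
        cumulative += length
        if row < cumulative:
            return shard_index, start_row, cumulative - 1
    raise ValueError(f"row {row} is out of range for {cumulative} total rows")

# Example matching the response above: row 150 with shard lengths [100, 100, 50]
# falls in original shard 1, which covers rows 100-199.
print(find_original_shard(150, [100, 100, 50]))  # (1, 100, 199)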

Files Changed

File                                     Purpose
libs/libapi/src/libapi/shard_utils.py    Core shard lookup algorithm
services/api/src/api/routes/shard.py     Endpoint handler
services/api/src/api/app.py              Route registration
docs/source/openapi.json                 API specification

Test Plan

  • Unit tests (libs/libapi/tests/test_shard_utils.py)
  • Integration tests (services/api/tests/routes/test_shard.py)
  • E2E tests (e2e/tests/test_56_shard.py)
  • mypy passes
  • ruff formatting applied

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Implements Row-to-Shard API that maps a row index to its original input
shard and parquet output shard. This enables data provenance tracking
for datasets using the new `original_shard_lengths` field from
huggingface/datasets PR #7897.

API: GET /shard?dataset=X&config=Y&split=Z&row=N

- Adds core algorithm in libapi/shard_utils.py
- Registers /shard route in API service (no nginx changes needed)
- Uses single cache call to config-parquet-and-info for optimization
- Handles missing original_shard_lengths for legacy datasets
- Includes unit, integration, and E2E tests
- Consolidate single-line statements within 119 char limit
- Sort imports per isort rules (I001)
- Add trailing commas to dict literals
- Expand long dict entries to multi-line format

All CI quality checks verified locally:
- ruff check src/tests: PASS
- ruff format --check src/tests: PASS
- mypy src/tests: PASS
- bandit -r src --skip B615: PASS

Critical fixes:
- Add ResponseNotFoundError/ResponseNotReadyError to exception handling
  (fixes 404 being incorrectly returned as 500)
- Add shard_lengths validation to catch corrupted parquet metadata early
- Add headers to 400 response in OpenAPI spec for consistency
- Add 500 response to OpenAPI spec for completeness

Minor improvements:
- DRY: store sum(original_shard_lengths) in variable
- Improve test assertion specificity (assert error code value)

Add missing return type annotations and parameter type hints to
fixtures and test functions to satisfy mypy strict checking.

AI-generated code was fabricating filenames instead of raising errors:
- Empty parquet_files -> was returning fabricated "{split}.parquet"
- More shards than files -> was fabricating "{split}-{idx:05d}.parquet"

Now follows codebase pattern (duckdb.py:97-98): raise ValueError for
metadata inconsistencies instead of hiding data corruption.

Added tests for both error cases.
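
The guard described above might look like the following sketch (function and variable names are illustrative, not taken from the actual shard_utils.py code):

# Validate parquet metadata before building a response; never fabricate filenames.
def resolve_parquet_file(parquet_files: list[str], parquet_shard_index: int) -> str:
    if not parquet_files:
        raise ValueError("No parquet files found for this split; parquet metadata may be corrupted.")
    if parquet_shard_index >= len(parquet_files):
        raise ValueError(
            f"Parquet shard index {parquet_shard_index} exceeds the "
            f"{len(parquet_files)} available parquet files."
        )
    return parquet_files[parquet_shard_index]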