feat: add return_file_name support to CSV packaged builder #8019

Open
dhruvildarji wants to merge 1 commit into huggingface:main from dhruvildarji:feat/csv-file-name
@dhruvildarji
Summary

  • Adds an optional return_file_name: bool = False parameter to CsvConfig
  • When return_file_name=True, a file_name column is appended to every batch in _generate_tables, containing the basename of the source CSV file for each row
  • Default is False, preserving full backward compatibility
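The behavior described above can be illustrated with a minimal, standard-library-only sketch. This is not the code in this PR (the real builder works on Arrow tables inside `_generate_tables`); `generate_batches` and `batch_size` are hypothetical names used only to show the flag's effect on each batch.

```python
import csv
import os
from typing import Dict, Iterator, List


def generate_batches(
    files: List[str],
    return_file_name: bool = False,  # mirrors the default proposed for CsvConfig
    batch_size: int = 2,             # hypothetical; only for this illustration
) -> Iterator[List[Dict[str, str]]]:
    """Yield batches of rows, optionally tagging each row with its source file basename."""
    for path in files:
        base = os.path.basename(path)
        with open(path, newline="", encoding="utf-8") as f:
            batch: List[Dict[str, str]] = []
            for row in csv.DictReader(f):
                if return_file_name:
                    # Every row of a batch gets the same basename as its source file.
                    row["file_name"] = base
                batch.append(row)
                if len(batch) == batch_size:
                    yield batch
                    batch = []
            if batch:
                # Emit the final, possibly shorter, batch of this file.
                yield batch
```

With the flag left at its default, the rows contain only the original CSV columns, which is the backward-compatible path.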

Motivation

Part of #5806. Extends the file_name feature (already implemented for the JSON packaged builder in #7948) to the CSV packaged builder. This enables use cases such as resuming training from checkpoints by identifying which data shards have already been consumed.
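As a rough sketch of the checkpoint use case, assuming rows carry a `file_name` column as proposed here: the helper names below are hypothetical and not part of `datasets`. Note that a shard appearing in the seen set may have been only partially read, and that basenames can collide across directories, so a real resume strategy would need more bookkeeping.

```python
import os
from typing import Dict, Iterable, List, Set


def seen_file_names(rows: Iterable[Dict[str, str]]) -> Set[str]:
    """Collect the basenames of all shards that contributed at least one row."""
    return {row["file_name"] for row in rows}


def shards_not_started(all_files: List[str], seen: Set[str]) -> List[str]:
    """Return the shard paths whose basename has not been observed yet."""
    return [path for path in all_files if os.path.basename(path) not in seen]
```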

Changes

  • src/datasets/packaged_modules/csv/csv.py: Add return_file_name: bool = False field to CsvConfig; in _generate_tables, append a file_name column when the flag is True
  • tests/packaged_modules/test_csv.py: Add three tests covering default behavior (no column), enabled behavior (column present), and correct column values

Test plan

  • test_csv_no_file_name_by_default — verifies file_name column is absent by default
  • test_csv_return_file_name_enabled — verifies file_name column is present when return_file_name=True
  • test_csv_file_name_values — verifies each row's file_name value equals the source file basename

All three tests pass locally.

🤖 Generated with Claude Code

Add an optional `return_file_name` parameter to `CsvConfig` that, when
set to `True`, appends a `file_name` column containing the source file
basename to every batch yielded by `_generate_tables`. Default is
`False` to preserve backward compatibility.

Part of huggingface#5806. Extends the file_name feature (already implemented for
the JSON builder in huggingface#7948) to the CSV packaged builder.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>