feat: add return_file_name support to CSV packaged builder #8019
Open
dhruvildarji wants to merge 1 commit into huggingface:main
Add an optional `return_file_name` parameter to `CsvConfig` that, when set to `True`, appends a `file_name` column containing the source file basename to every batch yielded by `_generate_tables`. The default is `False` to preserve backward compatibility.

Part of huggingface#5806. Extends the `file_name` feature (already implemented for the JSON builder in huggingface#7948) to the CSV packaged builder.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
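For readers without the diff at hand, the behavior described above can be modeled with the standard library alone. This is a simplified stand-in, not the code in `csv.py`: `generate_batches` and `_to_batch` are hypothetical names, batches are plain dicts of columns rather than Arrow tables, and the batch size is an arbitrary choice for the demo.

```python
import csv
import os
import tempfile
from typing import Dict, Iterator, List

Batch = Dict[str, List[str]]


def _to_batch(rows: List[Dict[str, str]], fieldnames: List[str],
              base: str, return_file_name: bool) -> Batch:
    # Convert a list of row dicts into a column-oriented batch.
    batch: Batch = {name: [row[name] for row in rows] for name in fieldnames}
    if return_file_name:
        # One basename per row, appended as the last column.
        batch["file_name"] = [base] * len(rows)
    return batch


def generate_batches(path: str, batch_size: int = 2,
                     return_file_name: bool = False) -> Iterator[Batch]:
    """Yield batches from one CSV file (hypothetical stand-in for the builder)."""
    base = os.path.basename(path)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = list(reader.fieldnames or [])
        rows: List[Dict[str, str]] = []
        for row in reader:
            rows.append(row)
            if len(rows) == batch_size:
                yield _to_batch(rows, fieldnames, base, return_file_name)
                rows = []
        if rows:  # final, possibly shorter batch
            yield _to_batch(rows, fieldnames, base, return_file_name)


# Demo: three rows, batch size 2, so two batches per pass.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shard-0.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write("id,text\n1,a\n2,b\n3,c\n")
    plain = list(generate_batches(path))
    tagged = list(generate_batches(path, return_file_name=True))
```

With the flag off, the batches contain only the CSV's own columns. With it on, every batch gains a `file_name` column whose length matches the batch, which is the invariant the tests in this PR are described as checking.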
Summary
- Adds a `return_file_name: bool = False` parameter to `CsvConfig`.
- When `return_file_name=True`, a `file_name` column is appended to every batch in `_generate_tables`, containing the basename of the source CSV file for each row.
- The default is `False`, preserving full backward compatibility.

Motivation
Part of #5806. Extends the `file_name` feature (already implemented for the JSON packaged builder in #7948) to the CSV packaged builder. This enables use cases such as resuming training from checkpoints by identifying which data shards have already been consumed.

Changes
- `src/datasets/packaged_modules/csv/csv.py`: Add a `return_file_name: bool = False` field to `CsvConfig`; in `_generate_tables`, append a `file_name` column when the flag is `True`.
- `tests/packaged_modules/test_csv.py`: Add three tests covering default behavior (no column), enabled behavior (column present), and correct column values.

Test plan
- `test_csv_no_file_name_by_default`: verifies the `file_name` column is absent by default.
- `test_csv_return_file_name_enabled`: verifies the `file_name` column is present when `return_file_name=True`.
- `test_csv_file_name_values`: verifies each row's `file_name` value equals the source file basename.

All three tests pass locally.
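The checkpoint-resume use case mentioned in the motivation could look roughly like the sketch below. It assumes the training loop has recorded the `file_name` values it has seen; `consumed_shards` and `remaining_files` are hypothetical helpers, not part of `datasets`, and the shard paths are made up for the demo.

```python
import os
from typing import Iterable, List, Set


def consumed_shards(file_name_column: Iterable[str]) -> Set[str]:
    # Distinct basenames observed in the `file_name` column so far.
    return set(file_name_column)


def remaining_files(all_files: List[str], seen: Set[str]) -> List[str]:
    # Keep only the files whose basename has not been observed yet.
    return [p for p in all_files if os.path.basename(p) not in seen]


# Hypothetical values collected from already-processed batches.
seen_names = ["train-000.csv", "train-000.csv", "train-001.csv"]
all_files = ["data/train-000.csv", "data/train-001.csv", "data/train-002.csv"]

todo = remaining_files(all_files, consumed_shards(seen_names))
```

Two caveats apply to this sketch. A shard counts as seen once any of its rows has been read, so a partially consumed shard would be skipped on resume. And because the column holds basenames only, files with the same name in different directories cannot be told apart.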
🤖 Generated with Claude Code