feat: custom delimiter write csv #5430
Conversation
Greptile Overview
Summary
Added a `delimiter` parameter to `write_csv()` to support custom CSV delimiters. However, the implementation is incomplete:
- Critical Issue: The `delimiter` parameter only works for empty DataFrames. For non-empty DataFrames, the parameter is accepted but never passed through the execution pipeline (`LogicalPlanBuilder` → Rust layer → `ExecutionStep` → `recordbatch_io.write_tabular()`), so the default comma delimiter is always used.
- Test Still Skipped: The test `test_write_csv_with_delimiter` remains marked with `@pytest.mark.skip`, suggesting the feature wasn't actually verified to work.
To properly implement this feature, the delimiter needs to be:
- Passed from `DataFrame.write_csv()` to `LogicalPlanBuilder.write_tabular()`
- Stored in the Rust logical plan
- Passed through `ExecutionStep._handle_file_write()`
- Accepted by `recordbatch_io.write_tabular()` and used to create CSV write options
The implementation for empty DataFrames shows the right approach, with proper validation and PyArrow integration, but it needs to be extended to the full write pipeline.
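For illustration, a minimal sketch of how the parameter might be threaded through the Python layer. The signatures below are assumptions based on the names in this review (`LogicalPlanBuilder.write_tabular()`, `FileFormat.Csv`), not Daft's actual internals:

```python
# Hypothetical sketch only; real Daft signatures may differ.
def write_csv(self, root_dir: str, delimiter: str = ",") -> "DataFrame":
    if len(delimiter) != 1:
        raise ValueError("Delimiter must be a single character")
    # Thread the delimiter into the logical plan instead of dropping it here.
    builder = self._builder.write_tabular(
        root_dir=root_dir,
        file_format=FileFormat.Csv,
        delimiter=delimiter,  # assumed new keyword on the builder
    )
    return DataFrame(builder)
```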
Confidence Score: 1/5
- This PR has a critical implementation gap that breaks the advertised feature for non-empty DataFrames.
- The delimiter parameter is only functional for empty DataFrames. For the primary use case (non-empty DataFrames), the parameter is silently ignored and the default comma delimiter is always used. The test remains skipped, indicating the feature wasn't validated. This requires significant additional implementation across the Python and Rust layers.
- `daft/dataframe/dataframe.py` needs to pass the delimiter through the builder; `daft/recordbatch/recordbatch_io.py` needs delimiter support in `write_tabular()`; the Rust code needs updates to store the delimiter in the logical plan.
 
Important Files Changed
File Analysis
| Filename | Score | Overview | 
|---|---|---|
| daft/dataframe/dataframe.py | 1/5 | Added delimiter parameter but it's only passed to empty DataFrame path, not to non-empty DataFrames through the builder | 
| daft/recordbatch/recordbatch_io.py | 3/5 | Implemented delimiter handling for empty DataFrames with proper validation, but missing support for non-empty DataFrames in write_tabular() | 
| tests/connect/test_io.py | 2/5 | Added test for custom delimiter but test is still marked as skipped | 
Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant DataFrame
    participant LogicalPlanBuilder
    participant ExecutionStep
    participant recordbatch_io
    participant PyArrow

    User->>DataFrame: write_csv(path, delimiter="|")

    alt Non-empty DataFrame
        DataFrame->>LogicalPlanBuilder: write_tabular(root_dir, file_format=CSV)
        Note over DataFrame,LogicalPlanBuilder: ❌ delimiter NOT passed here
        LogicalPlanBuilder->>ExecutionStep: execute write operation
        ExecutionStep->>recordbatch_io: write_tabular(table, file_format)
        Note over ExecutionStep,recordbatch_io: ❌ delimiter NOT passed here
        recordbatch_io->>PyArrow: pads.write_dataset(opts=None)
        Note over recordbatch_io,PyArrow: ❌ Uses default comma delimiter
        PyArrow-->>User: CSV with comma delimiter
    else Empty DataFrame
        DataFrame->>recordbatch_io: write_empty_tabular(delimiter="|")
        Note over DataFrame,recordbatch_io: ✅ delimiter passed correctly
        recordbatch_io->>PyArrow: pacsv.write_csv(write_options=WriteOptions(delimiter="|"))
        PyArrow-->>User: CSV with custom delimiter
    end
```

Additional Comments (1)
- daft/dataframe/dataframe.py, lines 894-900 (link)

  logic: `delimiter` parameter is not passed through to the builder

  The `delimiter` parameter is accepted but never passed to `write_tabular()`. This means non-empty DataFrames will ignore the delimiter and always use the default comma separator. The parameter needs to be threaded through:
  - `LogicalPlanBuilder.write_tabular()`
  - the Rust `table_write` method
  - `ExecutionStep._handle_file_write()`
  - `recordbatch_io.write_tabular()`

  Currently only empty DataFrames (line 918) use the delimiter parameter.
 
3 files reviewed, 4 comments
        
          
tests/connect/test_io.py (Outdated)

```python
pass


@pytest.mark.skip(reason="https://github.com/Eventual-Inc/Daft/issues/3786")
```
logic: test is still marked as skipped

Since the feature was implemented, remove the `@pytest.mark.skip` decorator.
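As a hedged illustration, an un-skipped version of the test might look like the sketch below. The `make_df` fixture and the `.csv` file suffix are assumptions based on the snippets in this review:

```python
import pyarrow.csv as pacsv


def test_write_csv_with_delimiter(make_df, tmp_path):
    df = make_df({"id": [1, 2, 3], "name": ["alice", "bob", "charlie"]})
    csv_dir = tmp_path / "csv_custom_delimiter"
    df.write_csv(str(csv_dir), delimiter="|")

    # Read a written file back with the same delimiter to verify round-tripping.
    files = list(csv_dir.glob("*.csv"))  # assumes written files carry a .csv suffix
    assert files, "expected at least one CSV file to be written"
    table = pacsv.read_csv(files[0], parse_options=pacsv.ParseOptions(delimiter="|"))
    assert table.column_names == ["id", "name"]
```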
daft/recordbatch/recordbatch_io.py

```python
if delimiter is not None and len(delimiter) == 1:
    write_opts = pacsv.WriteOptions(delimiter=delimiter)
    pacsv.write_csv(table, output_file, write_options=write_opts)
elif delimiter is not None and len(delimiter) > 1:
    # The PyArrow CSV writer only supports single-character delimiters, see documentation below for more details
    # https://arrow.apache.org/docs/python/generated/pyarrow.csv.WriteOptions.html#pyarrow.csv.WriteOptions
    # https://arrow.apache.org/docs/python/generated/pyarrow.csv.write_csv.html
    raise ValueError("Delimiter must be a single character")
else:
    pacsv.write_csv(table, output_file)
```
style: validation logic could be simplified
```suggestion
            if delimiter is not None:
                if len(delimiter) != 1:
                    # The PyArrow CSV writer only supports single-character delimiters, see documentation below for more details
                    # https://arrow.apache.org/docs/python/generated/pyarrow.csv.WriteOptions.html#pyarrow.csv.WriteOptions
                    # https://arrow.apache.org/docs/python/generated/pyarrow.csv.write_csv.html
                    raise ValueError("Delimiter must be a single character")
                write_opts = pacsv.WriteOptions(delimiter=delimiter)
                pacsv.write_csv(table, output_file, write_options=write_opts)
            else:
                pacsv.write_csv(table, output_file)
```
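For context, a small self-contained example of the PyArrow API this suggestion relies on; the output path is illustrative:

```python
import pyarrow as pa
import pyarrow.csv as pacsv

table = pa.table({"id": [1, 2, 3], "name": ["alice", "bob", "charlie"]})

# WriteOptions.delimiter must be a single character, per the PyArrow docs linked above.
opts = pacsv.WriteOptions(delimiter="|")
pacsv.write_csv(table, "example.csv", write_options=opts)
# example.csv now uses "|" as the field separator instead of ",".
```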
daft/dataframe/dataframe.py

```python
write_mode: Literal["append", "overwrite", "overwrite-partitions"] = "append",
partition_cols: Optional[list[ColumnInputType]] = None,
io_config: Optional[IOConfig] = None,
delimiter: Optional[str] = None,
```
style: add validation for delimiter parameter

The parameter should be validated before being passed through. Check that it is either None or a single-character string, similar to the validation in `write_empty_tabular` at recordbatch_io.py:624-628.
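A minimal sketch of what that early validation could look like, assuming the parameter shape shown in this diff:

```python
# Hypothetical placement inside DataFrame.write_csv, before building the plan.
if delimiter is not None and len(delimiter) != 1:
    raise ValueError("Delimiter must be a single character")
```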
Very welcome change! Thanks for making a contribution <3 Overall I think the change here is mostly good, but I have a few requested changes. The delimiter should be `str` and just default to `","` instead of `Optional[str] = None` where `None` is interpreted as `","`. There are also a few other requested changes. Once those are in I can approve and we can merge this in! 🙌
tests/connect/test_io.py

```python
data = {"id": [1, 2, 3], "name": ["alice", "bob", "charlie"]}
df = make_df(data)

csv_dir = os.path.join(tmp_path, "csv_custom_delimiter")
```
So `tmp_path` here is injected by the pytest framework. Its type is a `pathlib.Path` object, not a `str` (https://docs.pytest.org/en/6.2.x/tmpdir.html#the-tmp-path-fixture). So, instead of using `os.path.join`, you can use the built-in methods of `Path`.
```suggestion
    csv_dir = tmp_path / "csv_custom_delimiter"
```
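For anyone unfamiliar with the fixture, a tiny self-contained illustration of `tmp_path` as a `pathlib.Path` (the test name is illustrative):

```python
def test_tmp_path_is_a_path(tmp_path):
    # tmp_path is a pathlib.Path pointing at a unique, pre-created temp directory.
    csv_dir = tmp_path / "csv_custom_delimiter"  # the / operator joins path segments
    csv_dir.mkdir()
    assert csv_dir.is_dir()
    assert str(csv_dir).endswith("csv_custom_delimiter")
```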
| "python.testing.pytestEnabled": true, | ||
| "makefile.configureOnOpen": false | ||
| "makefile.configureOnOpen": false, | ||
| "editor.insertSpaces": true | 
This is a pretty reasonable setting to have. But we'd prefer if any settings-related changes are isolated to their own PR, please.
```python
write_mode: Literal["append", "overwrite", "overwrite-partitions"] = "append",
partition_cols: Optional[list[ColumnInputType]] = None,
io_config: Optional[IOConfig] = None,
delimiter: Optional[str] = None,
```
```suggestion
delimiter: str = ",",
```
Nice idea here! But I do feel like we can use a `str` and default to `","` instead of using `None` and then defaulting to `","` on `None`. Since strings in Python are immutable, we don't have to worry about having a mutable default argument here :) And it makes the API just a little bit cleaner IMO.
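To illustrate the reviewer's point about default arguments (function names here are illustrative):

```python
def append_item(items=[]):  # mutable default: the same list is reused across calls
    items.append(1)
    return items

print(append_item())  # [1]
print(append_item())  # [1, 1]  <- surprising shared state

def split_line(line, delimiter=","):  # str is immutable, so this default is safe
    return line.split(delimiter)
```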
```python
schema: Schema,
compression: str | None = None,
io_config: IOConfig | None = None,
delimiter: str | None = None,
```
```suggestion
delimiter: str = ",",
```
```diff
 elif file_format == FileFormat.Csv:
     output_file = fs.open_output_stream(file_path)
-    pacsv.write_csv(table, output_file)
+    if delimiter is not None and len(delimiter) == 1:
```
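If the reviewer's `delimiter: str = ","` suggestion is adopted, this branch would need no `None` handling at all; a hedged fragment sketch:

```python
# Sketch: assumes delimiter is always a validated single-character str.
elif file_format == FileFormat.Csv:
    output_file = fs.open_output_stream(file_path)
    write_opts = pacsv.WriteOptions(delimiter=delimiter)
    pacsv.write_csv(table, output_file, write_options=write_opts)
```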
Great check here! I also really like the documentation about PyArrow's CSV writer constraining this to 1 character.
Codecov Report ❌

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #5430      +/-   ##
==========================================
+ Coverage   71.48%   71.56%   +0.08%     
==========================================
  Files         991      992       +1     
  Lines      125399   125954     +555     
==========================================
+ Hits        89639    90142     +503     
- Misses      35760    35812      +52     
```
Discussion in OSS Daft Slack! 🧵 https://dist-data.slack.com/archives/C041NA2RBFD/p1760716811883269

Co-authored-by: Malcolm Greaves <[email protected]>
Changes Made

The `daft.write_csv` function now has a `delimiter` option to set a custom delimiter.

Related Issues
Closes #3786
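For reference, a hedged usage sketch of the new option as described in this PR (paths and data are illustrative):

```python
import daft

df = daft.from_pydict({"id": [1, 2, 3], "name": ["alice", "bob", "charlie"]})

# Write pipe-delimited CSV files into the given directory.
df.write_csv("output_csvs/", delimiter="|")
```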
Checklist
docs/mkdocs.yml navigation