
Conversation

@destroyer22719
Contributor

Changes Made

The daft.write_csv function now has a delimiter option for setting a custom delimiter.

Related Issues

Closes #3786

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly

@github-actions github-actions bot added the feat label Oct 21, 2025
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Summary

Added delimiter parameter to write_csv() to support custom CSV delimiters. However, the implementation is incomplete:

  • Critical Issue: The delimiter parameter only works for empty DataFrames. For non-empty DataFrames, the parameter is accepted but never passed through the execution pipeline (LogicalPlanBuilder → Rust layer → ExecutionStep → recordbatch_io.write_tabular()), so it will always use the default comma delimiter.

  • Test Still Skipped: The test test_write_csv_with_delimiter remains marked with @pytest.mark.skip, suggesting the feature wasn't actually verified to work.

To properly implement this feature, the delimiter needs to be:

  1. Passed from DataFrame.write_csv() to LogicalPlanBuilder.write_tabular()
  2. Stored in the Rust logical plan
  3. Passed through ExecutionStep._handle_file_write()
  4. Accepted by recordbatch_io.write_tabular() and used to create CSV write options

The implementation for empty DataFrames shows the right approach with proper validation and PyArrow integration, but this needs to be extended to the full write pipeline.
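For illustration, here is a minimal Python sketch of that threading. All names below are simplified stand-ins for the layers listed above (builder, execution step, recordbatch_io), not Daft's actual internals; only the pyarrow.csv calls are real API.

```python
from typing import Any, Optional

import pyarrow as pa
import pyarrow.csv as pacsv


def build_write_plan(path: str, file_format: str, delimiter: Optional[str]) -> dict[str, Any]:
    # Step 2: the builder persists the option on the logical plan
    # (a plain dict here; a Rust plan node in Daft).
    return {"path": path, "format": file_format, "delimiter": delimiter}


def write_tabular(table: pa.Table, path: str, delimiter: Optional[str] = None) -> None:
    # Step 4: the option becomes PyArrow CSV write options.
    opts = pacsv.WriteOptions(delimiter=delimiter) if delimiter is not None else None
    pacsv.write_csv(table, path, write_options=opts)


def write_csv(table: pa.Table, path: str, delimiter: Optional[str] = None) -> None:
    # Step 1: the user-facing call forwards the option to the builder...
    plan = build_write_plan(path, file_format="csv", delimiter=delimiter)
    # Step 3: ...and the execution step reads it back off the plan.
    write_tabular(table, plan["path"], delimiter=plan["delimiter"])


# Usage: write_csv(pa.table({"a": [1, 2]}), "out.csv", delimiter="|")
```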

Confidence Score: 1/5

  • This PR has a critical implementation gap that breaks the advertised feature for non-empty DataFrames
  • The delimiter parameter is only functional for empty DataFrames. For the primary use case (non-empty DataFrames), the parameter is silently ignored and the default comma delimiter is always used. The test remains skipped, indicating the feature wasn't validated. This requires significant additional implementation across Python and Rust layers.
  • daft/dataframe/dataframe.py requires passing delimiter through the builder; daft/recordbatch/recordbatch_io.py needs delimiter support in write_tabular(); Rust code needs updates to support delimiter in the logical plan

Important Files Changed

File Analysis

Filename Score Overview
daft/dataframe/dataframe.py 1/5 Added delimiter parameter but it's only passed to empty DataFrame path, not to non-empty DataFrames through the builder
daft/recordbatch/recordbatch_io.py 3/5 Implemented delimiter handling for empty DataFrames with proper validation, but missing support for non-empty DataFrames in write_tabular()
tests/connect/test_io.py 2/5 Added test for custom delimiter but test is still marked as skipped

Sequence Diagram

sequenceDiagram
    participant User
    participant DataFrame
    participant LogicalPlanBuilder
    participant ExecutionStep
    participant recordbatch_io
    participant PyArrow
    
    User->>DataFrame: write_csv(path, delimiter="|")
    
    alt Non-empty DataFrame
        DataFrame->>LogicalPlanBuilder: write_tabular(root_dir, file_format=CSV)
        Note over DataFrame,LogicalPlanBuilder: ❌ delimiter NOT passed here
        LogicalPlanBuilder->>ExecutionStep: execute write operation
        ExecutionStep->>recordbatch_io: write_tabular(table, file_format)
        Note over ExecutionStep,recordbatch_io: ❌ delimiter NOT passed here
        recordbatch_io->>PyArrow: pads.write_dataset(opts=None)
        Note over recordbatch_io,PyArrow: ❌ Uses default comma delimiter
        PyArrow-->>User: CSV with comma delimiter
    else Empty DataFrame
        DataFrame->>recordbatch_io: write_empty_tabular(delimiter="|")
        Note over DataFrame,recordbatch_io: ✅ delimiter passed correctly
        recordbatch_io->>PyArrow: pacsv.write_csv(write_options=WriteOptions(delimiter="|"))
        PyArrow-->>User: CSV with custom delimiter
    end

Additional Comments (1)

  1. daft/dataframe/dataframe.py, lines 894-900

    logic: delimiter parameter is not passed through to the builder

    The delimiter parameter is accepted but never passed to write_tabular(). This means non-empty DataFrames will ignore the delimiter and always use the default comma separator.

    The parameter needs to be threaded through:

    1. LogicalPlanBuilder.write_tabular()
    2. The Rust table_write method
    3. ExecutionStep._handle_file_write()
    4. recordbatch_io.write_tabular()

    Currently only empty DataFrames (line 918) use the delimiter parameter.
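
    As a sketch of step 1 alone (hypothetical stand-in classes, not Daft's real signatures), the fix at the DataFrame layer is simply to stop dropping the argument:

    ```python
    from typing import Optional


    class LogicalPlanBuilder:  # stand-in, not Daft's real builder
        def write_tabular(self, root_dir: str, file_format: str,
                          delimiter: Optional[str] = None) -> "LogicalPlanBuilder":
            self.delimiter = delimiter  # step 2: persisted on the plan for later steps
            return self


    class DataFrame:  # stand-in
        def __init__(self) -> None:
            self._builder = LogicalPlanBuilder()

        def write_csv(self, root_dir: str, delimiter: Optional[str] = None) -> None:
            # Before: delimiter was accepted here but never forwarded, so
            # non-empty writes silently fell back to commas.
            self._builder.write_tabular(root_dir, "csv", delimiter=delimiter)
    ```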

3 files reviewed, 4 comments


pass


@pytest.mark.skip(reason="https://github.com/Eventual-Inc/Daft/issues/3786")
Contributor

logic: test is still marked as skipped

Since the feature was implemented, remove the @pytest.mark.skip decorator
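
For reference, the un-skipped test could take roughly this shape (hypothetical: it assumes a build where write_csv supports delimiter and that output files carry a .csv extension; the actual test lives in tests/connect/test_io.py):

```python
import csv

import daft


def test_write_csv_with_delimiter(tmp_path):
    df = daft.from_pydict({"id": [1, 2, 3], "name": ["alice", "bob", "charlie"]})
    out_dir = tmp_path / "csv_custom_delimiter"
    df.write_csv(str(out_dir), delimiter="|")  # the parameter under test

    # Read a written file back with the same delimiter and check the header.
    csv_file = next(out_dir.glob("*.csv"))
    with open(csv_file, newline="") as f:
        rows = list(csv.reader(f, delimiter="|"))
    assert rows[0] == ["id", "name"]
```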

Comment on lines 621 to 630
if delimiter is not None and len(delimiter) == 1:
    write_opts = pacsv.WriteOptions(delimiter=delimiter)
    pacsv.write_csv(table, output_file, write_options=write_opts)
elif delimiter is not None and len(delimiter) > 1:
    # The PyArrow CSV writer only supports single-character delimiters, see documentation below for more details
    # https://arrow.apache.org/docs/python/generated/pyarrow.csv.WriteOptions.html#pyarrow.csv.WriteOptions
    # https://arrow.apache.org/docs/python/generated/pyarrow.csv.write_csv.html
    raise ValueError("Delimiter must be a single character")
else:
    pacsv.write_csv(table, output_file)
Contributor

style: validation logic could be simplified

Suggested change
if delimiter is not None:
    if len(delimiter) != 1:
        # The PyArrow CSV writer only supports single-character delimiters, see documentation below for more details
        # https://arrow.apache.org/docs/python/generated/pyarrow.csv.WriteOptions.html#pyarrow.csv.WriteOptions
        # https://arrow.apache.org/docs/python/generated/pyarrow.csv.write_csv.html
        raise ValueError("Delimiter must be a single character")
    write_opts = pacsv.WriteOptions(delimiter=delimiter)
    pacsv.write_csv(table, output_file, write_options=write_opts)
else:
    pacsv.write_csv(table, output_file)

write_mode: Literal["append", "overwrite", "overwrite-partitions"] = "append",
partition_cols: Optional[list[ColumnInputType]] = None,
io_config: Optional[IOConfig] = None,
delimiter: Optional[str] = None,
Contributor

style: add validation for delimiter parameter

The parameter should be validated before being passed through. Check that it's either None or a single character string, similar to the validation in write_empty_tabular at recordbatch_io.py:624-628
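
A minimal sketch of that up-front check, mirroring the one in write_empty_tabular (helper name hypothetical):

```python
from typing import Optional


def _validate_delimiter(delimiter: Optional[str]) -> None:
    # PyArrow's CSV writer only supports single-character delimiters,
    # so reject anything else before it reaches the write pipeline.
    if delimiter is not None and len(delimiter) != 1:
        raise ValueError("Delimiter must be a single character")
```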

Contributor

@malcolmgreaves malcolmgreaves left a comment

Very welcome change! Thanks for making a contribution <3 Overall I think the change here is mostly good, but I have a few requested changes. The delimiter should be str and just default to "," instead of Optional[str] = None where None is interpreted as ",". There are a few other requested changes too. Once those are in I can approve and we can merge this in! 🙌

data = {"id": [1, 2, 3], "name": ["alice", "bob", "charlie"]}
df = make_df(data)

csv_dir = os.path.join(tmp_path, "csv_custom_delimiter")
Contributor

So tmp_path here is injected by the pytest framework. Its type is a pathlib.Path object, not a str (https://docs.pytest.org/en/6.2.x/tmpdir.html#the-tmp-path-fixture). So, instead of using os.path.join, you can use the built-in methods of Path.

Suggested change
csv_dir = os.path.join(tmp_path, "csv_custom_delimiter")
csv_dir = tmp_path / "csv_custom_delimiter"

"python.testing.pytestEnabled": true,
"makefile.configureOnOpen": false
"makefile.configureOnOpen": false,
"editor.insertSpaces": true
Contributor

This is a pretty reasonable setting to have. But, we'd prefer if any settings-related changes are isolated to their own PR, please.

write_mode: Literal["append", "overwrite", "overwrite-partitions"] = "append",
partition_cols: Optional[list[ColumnInputType]] = None,
io_config: Optional[IOConfig] = None,
delimiter: Optional[str] = None,
Contributor

Suggested change
delimiter: Optional[str] = None,
delimiter: str = ",",

Contributor

Nice idea here! But I do feel like we can use a str and default to "," instead of using None and then defaulting to "," on None. Since strings in Python are immutable, we don't have to worry about having a mutable default argument here :) And it makes the API just a little bit cleaner IMO.
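
With a plain str default the None branch disappears entirely; a minimal sketch under that assumption (helper name hypothetical, pacsv is pyarrow.csv):

```python
import pyarrow as pa
import pyarrow.csv as pacsv


def write_csv_table(table: pa.Table, output_file: str, delimiter: str = ",") -> None:
    # delimiter is always a concrete string now, so the only check left
    # is PyArrow's single-character constraint.
    if len(delimiter) != 1:
        raise ValueError("Delimiter must be a single character")
    pacsv.write_csv(table, output_file, write_options=pacsv.WriteOptions(delimiter=delimiter))
```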

schema: Schema,
compression: str | None = None,
io_config: IOConfig | None = None,
delimiter: str | None = None,
Contributor

Suggested change
delimiter: str | None = None,
delimiter: str = ",",

elif file_format == FileFormat.Csv:
    output_file = fs.open_output_stream(file_path)
    pacsv.write_csv(table, output_file)
    if delimiter is not None and len(delimiter) == 1:
Contributor

Great check here! I also really like the documentation here about PyArrow's CSV writer constraining this to 1 character.

@codecov

codecov bot commented Oct 21, 2025

Codecov Report

❌ Patch coverage is 57.14286% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.56%. Comparing base (bac90f7) to head (0c95e75).
⚠️ Report is 19 commits behind head on main.

Files with missing lines Patch % Lines
daft/recordbatch/recordbatch_io.py 50.00% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5430      +/-   ##
==========================================
+ Coverage   71.48%   71.56%   +0.08%     
==========================================
  Files         991      992       +1     
  Lines      125399   125954     +555     
==========================================
+ Hits        89639    90142     +503     
- Misses      35760    35812      +52     
Files with missing lines Coverage Δ
daft/dataframe/dataframe.py 78.33% <100.00%> (+0.26%) ⬆️
daft/recordbatch/recordbatch_io.py 35.95% <50.00%> (+0.08%) ⬆️

... and 52 files with indirect coverage changes


@malcolmgreaves
Contributor

Discussion in OSS Daft Slack! 🧵 https://dist-data.slack.com/archives/C041NA2RBFD/p1760716811883269

destroyer22719 and others added 2 commits October 23, 2025 14:38
