feat: custom delimiter write csv #5430
Conversation
Greptile Overview
Summary
Added a `delimiter` parameter to `write_csv()` to support custom CSV delimiters. However, the implementation is incomplete:
- Critical Issue: The `delimiter` parameter only works for empty DataFrames. For non-empty DataFrames, the parameter is accepted but never passed through the execution pipeline (`LogicalPlanBuilder` → Rust layer → `ExecutionStep` → `recordbatch_io.write_tabular()`), so the default comma delimiter is always used.
- Test Still Skipped: The test `test_write_csv_with_delimiter` remains marked with `@pytest.mark.skip`, suggesting the feature wasn't actually verified to work.
To properly implement this feature, the delimiter needs to be:
- Passed from `DataFrame.write_csv()` to `LogicalPlanBuilder.write_tabular()`
- Stored in the Rust logical plan
- Passed through `ExecutionStep._handle_file_write()`
- Accepted by `recordbatch_io.write_tabular()` and used to create CSV write options
The implementation for empty DataFrames shows the right approach, with proper validation and PyArrow integration, but it needs to be extended to the full write pipeline.
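For illustration, a minimal sketch of how the parameter might be threaded through the Python layer. The signatures below are assumptions based on the names in this review (`LogicalPlanBuilder.write_tabular()`, `FileFormat.Csv`), not Daft's actual internals:

```python
# Hypothetical sketch only; real Daft signatures may differ.
def write_csv(self, root_dir: str, delimiter: str = ",") -> "DataFrame":
    if len(delimiter) != 1:
        raise ValueError("Delimiter must be a single character")
    # Thread the delimiter into the logical plan instead of dropping it here.
    builder = self._builder.write_tabular(
        root_dir=root_dir,
        file_format=FileFormat.Csv,
        delimiter=delimiter,  # assumed new keyword on the builder
    )
    return DataFrame(builder)
```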
Confidence Score: 1/5
- This PR has a critical implementation gap that breaks the advertised feature for non-empty DataFrames.
- The delimiter parameter is only functional for empty DataFrames. For the primary use case (non-empty DataFrames), the parameter is silently ignored and the default comma delimiter is always used. The test remains skipped, indicating the feature wasn't validated. This requires significant additional implementation across the Python and Rust layers.
- `daft/dataframe/dataframe.py` needs to pass the delimiter through the builder; `daft/recordbatch/recordbatch_io.py` needs delimiter support in `write_tabular()`; the Rust code needs updates to store the delimiter in the logical plan.
 
Important Files Changed
File Analysis
| Filename | Score | Overview | 
|---|---|---|
| daft/dataframe/dataframe.py | 1/5 | Added delimiter parameter but it's only passed to empty DataFrame path, not to non-empty DataFrames through the builder | 
| daft/recordbatch/recordbatch_io.py | 3/5 | Implemented delimiter handling for empty DataFrames with proper validation, but missing support for non-empty DataFrames in write_tabular() | 
| tests/connect/test_io.py | 2/5 | Added test for custom delimiter but test is still marked as skipped | 
Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant DataFrame
    participant LogicalPlanBuilder
    participant ExecutionStep
    participant recordbatch_io
    participant PyArrow

    User->>DataFrame: write_csv(path, delimiter="|")

    alt Non-empty DataFrame
        DataFrame->>LogicalPlanBuilder: write_tabular(root_dir, file_format=CSV)
        Note over DataFrame,LogicalPlanBuilder: ❌ delimiter NOT passed here
        LogicalPlanBuilder->>ExecutionStep: execute write operation
        ExecutionStep->>recordbatch_io: write_tabular(table, file_format)
        Note over ExecutionStep,recordbatch_io: ❌ delimiter NOT passed here
        recordbatch_io->>PyArrow: pads.write_dataset(opts=None)
        Note over recordbatch_io,PyArrow: ❌ Uses default comma delimiter
        PyArrow-->>User: CSV with comma delimiter
    else Empty DataFrame
        DataFrame->>recordbatch_io: write_empty_tabular(delimiter="|")
        Note over DataFrame,recordbatch_io: ✅ delimiter passed correctly
        recordbatch_io->>PyArrow: pacsv.write_csv(write_options=WriteOptions(delimiter="|"))
        PyArrow-->>User: CSV with custom delimiter
    end
```

Additional Comments (1)
- daft/dataframe/dataframe.py, lines 894-900 (link)

  logic: `delimiter` parameter is not passed through to the builder

  The `delimiter` parameter is accepted but never passed to `write_tabular()`. This means non-empty DataFrames will ignore the delimiter and always use the default comma separator. The parameter needs to be threaded through:
  - `LogicalPlanBuilder.write_tabular()`
  - the Rust `table_write` method
  - `ExecutionStep._handle_file_write()`
  - `recordbatch_io.write_tabular()`

  Currently only empty DataFrames (line 918) use the delimiter parameter.
 
3 files reviewed, 4 comments
        
          
tests/connect/test_io.py (Outdated)

```python
pass


@pytest.mark.skip(reason="https://github.com/Eventual-Inc/Daft/issues/3786")
```
logic: test is still marked as skipped

Since the feature was implemented, remove the `@pytest.mark.skip` decorator.
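As a hedged illustration, an un-skipped version of the test might look like the sketch below. The `make_df` fixture and the `.csv` file suffix are assumptions based on the snippets in this review:

```python
import pyarrow.csv as pacsv


def test_write_csv_with_delimiter(make_df, tmp_path):
    df = make_df({"id": [1, 2, 3], "name": ["alice", "bob", "charlie"]})
    csv_dir = tmp_path / "csv_custom_delimiter"
    df.write_csv(str(csv_dir), delimiter="|")

    # Read a written file back with the same delimiter to verify round-tripping.
    files = list(csv_dir.glob("*.csv"))  # assumes written files carry a .csv suffix
    assert files, "expected at least one CSV file to be written"
    table = pacsv.read_csv(files[0], parse_options=pacsv.ParseOptions(delimiter="|"))
    assert table.column_names == ["id", "name"]
```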
daft/recordbatch/recordbatch_io.py

```python
if delimiter is not None and len(delimiter) == 1:
    write_opts = pacsv.WriteOptions(delimiter=delimiter)
    pacsv.write_csv(table, output_file, write_options=write_opts)
elif delimiter is not None and len(delimiter) > 1:
    # The PyArrow CSV writer only supports single-character delimiters, see documentation below for more details
    # https://arrow.apache.org/docs/python/generated/pyarrow.csv.WriteOptions.html#pyarrow.csv.WriteOptions
    # https://arrow.apache.org/docs/python/generated/pyarrow.csv.write_csv.html
    raise ValueError("Delimiter must be a single character")
else:
    pacsv.write_csv(table, output_file)
```
style: validation logic could be simplified
```suggestion
            if delimiter is not None:
                if len(delimiter) != 1:
                    # The PyArrow CSV writer only supports single-character delimiters, see documentation below for more details
                    # https://arrow.apache.org/docs/python/generated/pyarrow.csv.WriteOptions.html#pyarrow.csv.WriteOptions
                    # https://arrow.apache.org/docs/python/generated/pyarrow.csv.write_csv.html
                    raise ValueError("Delimiter must be a single character")
                write_opts = pacsv.WriteOptions(delimiter=delimiter)
                pacsv.write_csv(table, output_file, write_options=write_opts)
            else:
                pacsv.write_csv(table, output_file)
```
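For context, a small self-contained example of the PyArrow API this suggestion relies on; the output path is illustrative:

```python
import pyarrow as pa
import pyarrow.csv as pacsv

table = pa.table({"id": [1, 2, 3], "name": ["alice", "bob", "charlie"]})

# WriteOptions.delimiter must be a single character, per the PyArrow docs linked above.
opts = pacsv.WriteOptions(delimiter="|")
pacsv.write_csv(table, "example.csv", write_options=opts)
# example.csv now uses "|" as the field separator instead of ",".
```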
daft/dataframe/dataframe.py

```python
write_mode: Literal["append", "overwrite", "overwrite-partitions"] = "append",
partition_cols: Optional[list[ColumnInputType]] = None,
io_config: Optional[IOConfig] = None,
delimiter: Optional[str] = None,
```
style: add validation for delimiter parameter

The parameter should be validated before being passed through. Check that it is either None or a single-character string, similar to the validation in `write_empty_tabular` at recordbatch_io.py:624-628.
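A minimal sketch of what that early validation could look like, assuming the parameter shape shown in this diff:

```python
# Hypothetical placement inside DataFrame.write_csv, before building the plan.
if delimiter is not None and len(delimiter) != 1:
    raise ValueError("Delimiter must be a single character")
```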
Very welcome change! Thanks for making a contribution <3 Overall I think the change here is mostly good, but I have a few requested changes. The delimiter should be `str` and just default to `","` instead of `Optional[str] = None` where `None` is interpreted as `","`. There are also a few other requested changes. Once those are in I can approve and we can merge this in! 🙌
tests/connect/test_io.py

```python
data = {"id": [1, 2, 3], "name": ["alice", "bob", "charlie"]}
df = make_df(data)

csv_dir = os.path.join(tmp_path, "csv_custom_delimiter")
```
So `tmp_path` here is injected by the pytest framework. Its type is a `pathlib.Path` object, not a `str` (https://docs.pytest.org/en/6.2.x/tmpdir.html#the-tmp-path-fixture). So, instead of using `os.path.join`, you can use the built-in methods of `Path`.
```suggestion
    csv_dir = tmp_path / "csv_custom_delimiter"
```
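For anyone unfamiliar with the fixture, a tiny self-contained illustration of `tmp_path` as a `pathlib.Path` (the test name is illustrative):

```python
def test_tmp_path_is_a_path(tmp_path):
    # tmp_path is a pathlib.Path pointing at a unique, pre-created temp directory.
    csv_dir = tmp_path / "csv_custom_delimiter"  # the / operator joins path segments
    csv_dir.mkdir()
    assert csv_dir.is_dir()
    assert str(csv_dir).endswith("csv_custom_delimiter")
```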
| "python.testing.pytestEnabled": true, | ||
| "makefile.configureOnOpen": false | ||
| "makefile.configureOnOpen": false, | ||
| "editor.insertSpaces": true | 
This is a pretty reasonable setting to have. But we'd prefer if any settings-related changes are isolated to their own PR, please.
```python
write_mode: Literal["append", "overwrite", "overwrite-partitions"] = "append",
partition_cols: Optional[list[ColumnInputType]] = None,
io_config: Optional[IOConfig] = None,
delimiter: Optional[str] = None,
```
```suggestion
delimiter: str = ",",
```
Nice idea here! But I do feel like we can use a `str` and default to `","` instead of using `None` and then defaulting to `","` on `None`. Since strings in Python are immutable, we don't have to worry about having a mutable default argument here :) And it makes the API just a little bit cleaner IMO.
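To illustrate the reviewer's point about default arguments (function names here are illustrative):

```python
def append_item(items=[]):  # mutable default: the same list is reused across calls
    items.append(1)
    return items

print(append_item())  # [1]
print(append_item())  # [1, 1]  <- surprising shared state

def split_line(line, delimiter=","):  # str is immutable, so this default is safe
    return line.split(delimiter)
```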
```python
schema: Schema,
compression: str | None = None,
io_config: IOConfig | None = None,
delimiter: str | None = None,
```
```suggestion
delimiter: str = ",",
```
```diff
 elif file_format == FileFormat.Csv:
     output_file = fs.open_output_stream(file_path)
-    pacsv.write_csv(table, output_file)
+    if delimiter is not None and len(delimiter) == 1:
```
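If the reviewer's `delimiter: str = ","` suggestion is adopted, this branch would need no `None` handling at all; a hedged fragment sketch:

```python
# Sketch: assumes delimiter is always a validated single-character str.
elif file_format == FileFormat.Csv:
    output_file = fs.open_output_stream(file_path)
    write_opts = pacsv.WriteOptions(delimiter=delimiter)
    pacsv.write_csv(table, output_file, write_options=write_opts)
```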
Great check here! I also really like the documentation about PyArrow's CSV writer constraining this to 1 character.
Codecov Report ❌

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #5430      +/-   ##
==========================================
+ Coverage   71.48%   71.56%   +0.08%     
==========================================
  Files         991      992       +1     
  Lines      125399   125954     +555     
==========================================
+ Hits        89639    90142     +503     
- Misses      35760    35812      +52     
```
Discussion in OSS Daft Slack! 🧵 https://dist-data.slack.com/archives/C041NA2RBFD/p1760716811883269

Co-authored-by: Malcolm Greaves <[email protected]>
Changes Made

The `daft.write_csv` function now has a `delimiter` option to set a custom delimiter.

Related Issues
Closes #3786
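For reference, a hedged usage sketch of the new option as described in this PR (paths and data are illustrative):

```python
import daft

df = daft.from_pydict({"id": [1, 2, 3], "name": ["alice", "bob", "charlie"]})

# Write pipe-delimited CSV files into the given directory.
df.write_csv("output_csvs/", delimiter="|")
```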
Checklist
docs/mkdocs.yml navigation