feat(io): implement write_sql with SQLDataSink and explicit dtype support #5979
base: main
Conversation
Greptile Overview
Greptile Summary
Introduces DataFrame.write_sql() method and SQLDataSink to enable distributed SQL writes through the DataSink pattern, aligning with other connectors like ClickHouse and Bigtable.
Key Changes:
- Implemented `SQLDataSink` with driver-side table initialization (`start()`) that handles write mode semantics (append/overwrite/fail) before distributed workers begin writing
- Worker processes create isolated SQLAlchemy connections to avoid socket serialization issues across distributed workers
- Added optional `dtype` parameter for explicit SQLAlchemy type mapping, passed through to `pandas.DataFrame.to_sql()`
- Returns aggregate write metrics (`total_written_rows`, `total_written_bytes`) as a single-row DataFrame
- Comprehensive test coverage validates multiple data sources (pydict, CSV, JSON), write modes, dtype scenarios, and chunking options
- Tests properly verify correctness by reading back written data using both `daft.read_sql()` and SQLAlchemy's `inspect()` API
Minor Issue:
- Manual `write_mode` validation in `write_sql()` is redundant since the `@DataframePublicAPI` decorator already validates `Literal` type hints
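For orientation, here is a minimal sketch of how the call summarized above might look from user code. The parameter order follows the sequence diagram further down (`table_name`, `conn`, `write_mode`, `dtype`); the connection URI, table name, and argument values are assumptions for illustration, not taken from the PR.

```python
import daft
import sqlalchemy

# Hypothetical SQLAlchemy connection URI, for illustration only.
conn_uri = "sqlite:///example.db"

df = daft.from_pydict({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Sketch of the call this PR describes: write-mode semantics are resolved on
# the driver, then workers append partitions via pandas.DataFrame.to_sql().
metrics = df.write_sql(
    "users",
    conn_uri,
    write_mode="append",
    dtype={"name": sqlalchemy.types.Text()},  # optional explicit column types
)

# Per the summary, a single-row DataFrame with total_written_rows and
# total_written_bytes is returned.
print(metrics.to_pydict())
```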
Confidence Score: 4/5
- This PR is safe to merge with minimal risk - it follows established patterns and includes comprehensive testing
- Score reflects well-structured implementation following the DataSink pattern used by other connectors, thorough test coverage across multiple scenarios, and proper handling of distributed execution. The only issue found is a redundant validation check that doesn't affect functionality.
- No files require special attention - all changes follow established patterns and have appropriate test coverage
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| daft/dataframe/dataframe.py | 5/5 | Added write_sql method that creates SQLDataSink and calls write_sink; well-documented with clear examples showing dtype usage |
| daft/io/_sql.py | 4/5 | Implemented SQLDataSink following DataSink pattern; handles distributed writes with driver-side table setup and worker-side appends |
| tests/integration/sql/test_write_sql.py | 5/5 | Comprehensive test coverage for write_sql with multiple sources, modes, dtypes, and chunking; validates both write success and schema enforcement |
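The test file summarized in the last row of the table might, in spirit, contain round-trip checks along these lines. This is a hypothetical sketch rather than the PR's actual test code; the fixture usage, table name, and connection handling are assumptions.

```python
import daft
import sqlalchemy


def test_write_sql_roundtrip(tmp_path):
    # Hypothetical round-trip test: write with write_sql, read back with
    # daft.read_sql, and check the created schema via SQLAlchemy's inspector.
    conn_uri = f"sqlite:///{tmp_path}/roundtrip.db"
    df = daft.from_pydict({"id": [1, 2, 3], "value": ["x", "y", "z"]})

    df.write_sql("items", conn_uri, write_mode="overwrite")

    read_back = daft.read_sql("SELECT * FROM items", conn_uri)
    assert sorted(read_back.to_pydict()["id"]) == [1, 2, 3]

    engine = sqlalchemy.create_engine(conn_uri)
    cols = {c["name"] for c in sqlalchemy.inspect(engine).get_columns("items")}
    assert {"id", "value"} <= cols
```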
Sequence Diagram
sequenceDiagram
participant User
participant DataFrame
participant SQLDataSink
participant Driver
participant Workers
participant Database
User->>DataFrame: write_sql(table_name, conn, write_mode, dtype)
DataFrame->>DataFrame: limit(0).to_pandas() to get empty_pdf
DataFrame->>SQLDataSink: Create SQLDataSink(table_name, conn, write_mode, dtype, empty_pdf)
DataFrame->>SQLDataSink: write_sink(sink)
Note over SQLDataSink,Driver: Driver-side initialization
SQLDataSink->>Driver: start()
Driver->>Database: Connect to database
Driver->>Database: Check if table exists
alt write_mode == "fail"
alt Table exists
Database-->>Driver: Table exists
Driver-->>User: Raise ValueError
else Table does not exist
Driver->>Database: Create empty table with schema (empty_pdf.to_sql)
end
else write_mode == "overwrite"
Driver->>Database: Replace table with empty schema (empty_pdf.to_sql)
else write_mode == "append"
alt Table does not exist
Driver->>Database: Create empty table with schema (empty_pdf.to_sql)
end
end
Driver->>Database: Close connection
Note over SQLDataSink,Workers: Distributed write phase
loop For each micropartition
SQLDataSink->>Workers: write(micropartition)
Workers->>Database: Connect to database
Workers->>Workers: Convert micropartition.to_pandas()
Workers->>Database: pdf.to_sql(table_name, if_exists="append", dtype=dtype)
Database-->>Workers: Write complete
Workers-->>SQLDataSink: WriteResult(bytes_written, rows_written)
Workers->>Database: Close connection
end
Note over SQLDataSink,Driver: Finalization
SQLDataSink->>Driver: finalize(write_results)
Driver->>Driver: Aggregate total_written_rows and total_written_bytes
Driver-->>DataFrame: Return MicroPartition with metrics
DataFrame-->>User: Return DataFrame with write metrics
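The diagram can be read alongside a rough Python sketch of the same lifecycle. This is not Daft's actual DataSink interface; the class name, constructor arguments, and return types are assumptions, and only the start/write/finalize split and the pandas `to_sql` calls mirror the flow above.

```python
import pandas as pd
import sqlalchemy


class SQLDataSinkSketch:
    """Illustrative sketch of the diagram's flow, not Daft's actual SQLDataSink."""

    def __init__(self, table_name, conn_uri, write_mode, dtype, empty_pdf):
        self.table_name = table_name
        self.conn_uri = conn_uri
        self.write_mode = write_mode  # "append" | "overwrite" | "fail"
        self.dtype = dtype
        self.empty_pdf = empty_pdf  # zero-row pandas DataFrame carrying the schema

    def start(self):
        # Driver-side initialization: resolve write-mode semantics once,
        # before any distributed worker starts writing.
        engine = sqlalchemy.create_engine(self.conn_uri)
        try:
            exists = sqlalchemy.inspect(engine).has_table(self.table_name)
            if self.write_mode == "fail" and exists:
                raise ValueError(f"Table {self.table_name} already exists")
            if self.write_mode == "overwrite":
                self.empty_pdf.to_sql(self.table_name, engine, if_exists="replace",
                                      index=False, dtype=self.dtype)
            elif not exists:
                self.empty_pdf.to_sql(self.table_name, engine, if_exists="fail",
                                      index=False, dtype=self.dtype)
        finally:
            engine.dispose()

    def write(self, pdf: pd.DataFrame):
        # Worker-side: each worker builds its own engine (sockets are not
        # serializable across processes) and appends its partition. In Daft
        # the input would be a micropartition converted via to_pandas().
        engine = sqlalchemy.create_engine(self.conn_uri)
        try:
            pdf.to_sql(self.table_name, engine, if_exists="append",
                       index=False, dtype=self.dtype)
            return {"rows_written": len(pdf)}
        finally:
            engine.dispose()

    def finalize(self, write_results):
        # Driver-side: aggregate per-partition metrics into a single summary.
        return {"total_written_rows": sum(r["rows_written"] for r in write_results)}
```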
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #5979 +/- ##
==========================================
- Coverage 72.63% 72.59% -0.05%
==========================================
Files 970 970
Lines 126562 126636 +74
==========================================
+ Hits 91924 91927 +3
- Misses 34638 34709 +71
Thanks for working on this @huleilei! For reference, one of the reasons the original PR wasn't merged was that more verification was needed on handling Daft types. This will probably need some discretion based on how smoothly things go, but here's another PR that implemented a postgres catalog that could be useful: 670eecc
Thanks for the context and the reference to the Postgres catalog PR! I completely understand the concern regarding type verification, as ensuring correct schema mapping is critical for production use.
To address this, I've designed the implementation with a "safe by default, controllable when needed" approach:
- Explicit Type Control via `dtype`: I've exposed the `dtype` parameter in `write_sql`, which is passed directly to the underlying `to_sql` call. This serves as a robust "escape hatch," allowing users to explicitly define SQLAlchemy types for columns where default inference might be insufficient (e.g., specific precision for Decimals or JSON types).
- Leveraging the Pandas Ecosystem: By utilizing `micropartition.to_pandas()`, we benefit from Pandas' mature and battle-tested type inference logic for standard Daft/Arrow types before they hit the database.
- Verification Tests: I've added comprehensive integration tests (specifically `test_write_sql_dtype_basic_types` and `test_write_sql_dtype_empty_df_creates_table`) that not only write data but also use `sqlalchemy.inspect` to verify that the actual created table schema matches our expectations.
While a full Catalog integration (like 670eecc) is definitely a great direction for the future, I believe this DataSink implementation provides a solid, verified foundation for generic SQL writing capabilities.
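As a concrete, hypothetical illustration of the `dtype` escape hatch and the `inspect`-based verification described above: the table name, columns, and database URI below are made up, and the `write_sql` signature is assumed from this PR's description.

```python
import daft
import sqlalchemy
from sqlalchemy import types

conn_uri = "sqlite:///example.db"  # hypothetical target database

df = daft.from_pydict({"id": [1, 2], "price": [9.99, 19.99]})

# Pin an explicit SQL type where default inference might be insufficient,
# e.g. a fixed-precision decimal column.
df.write_sql(
    "orders",
    conn_uri,
    write_mode="overwrite",
    dtype={"price": types.Numeric(10, 2)},
)

# Verify the created schema, mirroring what the integration tests do with
# sqlalchemy.inspect.
engine = sqlalchemy.create_engine(conn_uri)
for col in sqlalchemy.inspect(engine).get_columns("orders"):
    print(col["name"], col["type"])
```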
@desmondcheongzx @kevinzwang @colin-ho please help review when it's convenient for you. Thanks!
Summary
This PR introduces `DataFrame.write_sql()`, enabling users to write Daft DataFrames to SQL databases (e.g., PostgreSQL, SQLite) via SQLAlchemy.
It implements a robust, distributed `SQLDataSink`, summarized in the key changes below.
Key Changes
1. New Public API: DataFrame.write_sql
2. Internal Implementation: SQLDataSink
3. Tests
Addressing Previous Concerns (Type Verification)
This implementation addresses concerns about type safety (raised in previous discussions) by exposing an explicit `dtype` parameter and verifying the created table schemas with `sqlalchemy.inspect` in the integration tests.
Checklist
Thank you very much for the idea provided in #5471.
Related Issues