
Conversation

@huleilei (Contributor) commented Jan 8, 2026

Summary

This PR introduces DataFrame.write_sql(), enabling users to write Daft DataFrames to SQL databases (e.g., PostgreSQL, SQLite) via SQLAlchemy.

It implements a robust, distributed SQLDataSink that:

  1. Handles Distributed Writes: Uses the DataSink pattern to manage driver-side table initialization and worker-side parallel writes.
  2. Supports Explicit Types: Exposes a dtype parameter to allow users to override default type inference with specific SQLAlchemy types (addressing type verification concerns).
  3. Ensures Connection Safety: Manages connection lifecycles properly across distributed workers to avoid socket serialization issues.

Key Changes

1. New Public API: DataFrame.write_sql

  • Location: daft/dataframe/dataframe.py
  • Signature:
    def write_sql(
        self,
        table_name: str,
        conn: str | Callable[[], "Connection"],
        write_mode: Literal["append", "overwrite", "fail"] = "append",
        chunk_size: int | None = None,
        dtype: dict[str, Any] | None = None,  # NEW: explicit type control
    ) -> DataFrame: ...
    
  • Behavior: Delegates to SQLDataSink and returns a DataFrame with write metrics (total_written_rows, total_written_bytes); see the usage sketch below.
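
A hypothetical usage sketch of the new API (the connection string, table name, and column names are illustrative, not taken from the PR):

    import daft
    import sqlalchemy

    df = daft.from_pydict({"id": [1, 2, 3], "name": ["a", "b", "c"]})

    # Append to the target table; dtype overrides pandas' default type
    # inference for the listed columns.
    metrics = df.write_sql(
        "users",
        "postgresql://user:pass@localhost:5432/mydb",  # or a connection factory callable
        write_mode="append",
        dtype={"name": sqlalchemy.types.String(64)},
    )
    metrics.show()  # single row with total_written_rows / total_written_bytes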

2. Internal Implementation: SQLDataSink

  • Location: daft/io/_sql.py
  • Architecture (a sketch of the worker-side write step follows this list):
    • start() (Driver): Handles write_mode logic.
      • fail: Raises an error if the table already exists; otherwise creates the table schema.
      • overwrite: Replaces the table with the new schema.
      • append: Creates the table if it does not exist, ensuring the schema is ready for workers.
    • write() (Workers):
      • Creates independent SQLAlchemy engines/connections per task (no pickling of connections).
      • Converts MicroPartitions to Pandas.
      • Uses pd.to_sql with the user-provided dtype to write data efficiently.
      • Ensures proper resource cleanup (engine.dispose()).
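
A minimal, self-contained sketch of the worker-side write step described above (the function name, WriteResult shape, and parameter handling are assumptions for illustration, not the PR's exact code):

    from dataclasses import dataclass
    from typing import Any

    import pandas as pd
    from sqlalchemy import create_engine

    @dataclass
    class WriteResult:  # stand-in for the sink's per-task result type
        rows_written: int
        bytes_written: int

    def write_partition(
        pdf: pd.DataFrame,            # in the sink, this comes from micropartition.to_pandas()
        table_name: str,
        conn_str: str,                # a connection-factory callable would be handled analogously
        dtype: dict[str, Any] | None = None,
        chunk_size: int | None = None,
    ) -> WriteResult:
        # Build the engine inside the task so nothing holding a live socket is
        # ever pickled and shipped to a worker.
        engine = create_engine(conn_str)
        try:
            pdf.to_sql(
                table_name,
                engine,
                if_exists="append",   # the driver already created/replaced the table in start()
                index=False,
                chunksize=chunk_size,
                dtype=dtype,          # user-provided SQLAlchemy types, if any
            )
            return WriteResult(
                rows_written=len(pdf),
                bytes_written=int(pdf.memory_usage(deep=True).sum()),
            )
        finally:
            engine.dispose()          # release pooled connections promptly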

3. Tests

  • Location: tests/integration/sql/test_write_sql.py
  • Coverage:
    • Sources: Verified with PyDict, CSV, and JSON sources.
    • Modes: Comprehensive tests for append, overwrite, and fail modes.
    • Type Verification: Added specific tests (test_write_sql_dtype_basic_types) that use sqlalchemy.inspect to verify that columns are created with the correct SQL types when dtype is provided (see the illustrative sketch below).
    • Connection Factory: Verified support for passing a connection factory function (crucial for pickling compatibility).
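
An illustrative schema-verification test along these lines (the database URL, table, and column names are assumptions, and SQLite is used only to keep the sketch self-contained):

    import daft
    import sqlalchemy

    def test_write_sql_dtype_basic_types(tmp_path):
        db_url = f"sqlite:///{tmp_path}/test.db"
        df = daft.from_pydict({"id": [1, 2], "name": ["a", "b"]})
        df.write_sql(
            "people",
            db_url,
            write_mode="overwrite",
            dtype={"name": sqlalchemy.types.String(32)},
        )

        # Inspect the created table rather than just counting rows.
        engine = sqlalchemy.create_engine(db_url)
        columns = {c["name"]: c["type"] for c in sqlalchemy.inspect(engine).get_columns("people")}
        assert isinstance(columns["name"], sqlalchemy.types.String)
        engine.dispose()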

Addressing Previous Concerns (Type Verification)

This implementation addresses concerns about type safety (raised in previous discussions) by:

  1. Leveraging Pandas' mature type inference for standard types.
  2. Providing the dtype "escape hatch" for complex scenarios, giving users full control over the target schema definition.
  3. Including integration tests that explicitly verify schema creation correctness.

Checklist

  • I have added comprehensive unit/integration tests.
  • I have updated the documentation (docstrings included).
  • I have verified that connections are properly closed and disposed of.

Thanks to #5471 for the original idea.

Related Issues

@github-actions bot added the feat label Jan 8, 2026
@greptile-apps bot left a comment


Greptile Overview

Greptile Summary

Introduces DataFrame.write_sql() method and SQLDataSink to enable distributed SQL writes through the DataSink pattern, aligning with other connectors like ClickHouse and Bigtable.

Key Changes:

  • Implemented SQLDataSink with driver-side table initialization (start()) that handles write mode semantics (append/overwrite/fail) before distributed workers begin writing
  • Worker processes create isolated SQLAlchemy connections to avoid socket serialization issues across distributed workers
  • Added optional dtype parameter for explicit SQLAlchemy type mapping, passed through to pandas.DataFrame.to_sql()
  • Returns aggregate write metrics (total_written_rows, total_written_bytes) as a single-row DataFrame
  • Comprehensive test coverage validates multiple data sources (pydict, CSV, JSON), write modes, dtype scenarios, and chunking options
  • Tests properly verify correctness by reading back written data using both daft.read_sql() and SQLAlchemy's inspect() API

Minor Issue:

  • Manual write_mode validation in write_sql() is redundant since the @DataframePublicAPI decorator already validates Literal type hints

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it follows established patterns and includes comprehensive testing
  • Score reflects well-structured implementation following the DataSink pattern used by other connectors, thorough test coverage across multiple scenarios, and proper handling of distributed execution. The only issue found is a redundant validation check that doesn't affect functionality.
  • No files require special attention - all changes follow established patterns and have appropriate test coverage

Important Files Changed

File Analysis

  • daft/dataframe/dataframe.py (score 5/5): Added write_sql method that creates SQLDataSink and calls write_sink; well-documented with clear examples showing dtype usage.
  • daft/io/_sql.py (score 4/5): Implemented SQLDataSink following the DataSink pattern; handles distributed writes with driver-side table setup and worker-side appends.
  • tests/integration/sql/test_write_sql.py (score 5/5): Comprehensive test coverage for write_sql with multiple sources, modes, dtypes, and chunking; validates both write success and schema enforcement.

Sequence Diagram

sequenceDiagram
    participant User
    participant DataFrame
    participant SQLDataSink
    participant Driver
    participant Workers
    participant Database

    User->>DataFrame: write_sql(table_name, conn, write_mode, dtype)
    DataFrame->>DataFrame: limit(0).to_pandas() to get empty_pdf
    DataFrame->>SQLDataSink: Create SQLDataSink(table_name, conn, write_mode, dtype, empty_pdf)
    DataFrame->>SQLDataSink: write_sink(sink)
    
    Note over SQLDataSink,Driver: Driver-side initialization
    SQLDataSink->>Driver: start()
    Driver->>Database: Connect to database
    Driver->>Database: Check if table exists
    
    alt write_mode == "fail"
        alt Table exists
            Database-->>Driver: Table exists
            Driver-->>User: Raise ValueError
        else Table does not exist
            Driver->>Database: Create empty table with schema (empty_pdf.to_sql)
        end
    else write_mode == "overwrite"
        Driver->>Database: Replace table with empty schema (empty_pdf.to_sql)
    else write_mode == "append"
        alt Table does not exist
            Driver->>Database: Create empty table with schema (empty_pdf.to_sql)
        end
    end
    Driver->>Database: Close connection
    
    Note over SQLDataSink,Workers: Distributed write phase
    loop For each micropartition
        SQLDataSink->>Workers: write(micropartition)
        Workers->>Database: Connect to database
        Workers->>Workers: Convert micropartition.to_pandas()
        Workers->>Database: pdf.to_sql(table_name, if_exists="append", dtype=dtype)
        Database-->>Workers: Write complete
        Workers-->>SQLDataSink: WriteResult(bytes_written, rows_written)
        Workers->>Database: Close connection
    end
    
    Note over SQLDataSink,Driver: Finalization
    SQLDataSink->>Driver: finalize(write_results)
    Driver->>Driver: Aggregate total_written_rows and total_written_bytes
    Driver-->>DataFrame: Return MicroPartition with metrics
    DataFrame-->>User: Return DataFrame with write metrics
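
A sketch of the finalization step shown in the diagram, aggregating per-task results into the metrics row (the exact result container used by the sink is an assumption):

    def finalize(write_results):
        # Sum the per-task results emitted by write(); the sink would wrap this
        # dict in a single-row partition to return it as the metrics DataFrame.
        total_rows = sum(r.rows_written for r in write_results)
        total_bytes = sum(r.bytes_written for r in write_results)
        return {
            "total_written_rows": [total_rows],
            "total_written_bytes": [total_bytes],
        }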

@huleilei huleilei changed the title feat(sql): add SQL DataSink and write_sql function feat(sql): add SQL DataSink and write_sql function to write DataFrames to SQL databases Jan 8, 2026
@huleilei huleilei changed the title feat(sql): add SQL DataSink and write_sql function to write DataFrames to SQL databases feat(sql): add write_sql function to write DataFrames to SQL databases Jan 8, 2026
@huleilei huleilei changed the title feat(sql): add write_sql function to write DataFrames to SQL databases feat(sql): add write_sql function for writing data to SQL databases Jan 8, 2026
@huleilei huleilei marked this pull request as draft January 8, 2026 03:34
codecov bot commented Jan 8, 2026

Codecov Report

❌ Patch coverage is 25.58140% with 64 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.59%. Comparing base (fb45faf) to head (42201e3).

Files with missing lines        Patch %   Lines
daft/io/_sql.py                 25.00%    60 Missing ⚠️
daft/dataframe/dataframe.py     33.33%    4 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5979      +/-   ##
==========================================
- Coverage   72.63%   72.59%   -0.05%     
==========================================
  Files         970      970              
  Lines      126562   126636      +74     
==========================================
+ Hits        91924    91927       +3     
- Misses      34638    34709      +71     
Files with missing lines        Coverage Δ
daft/dataframe/dataframe.py     77.11% <33.33%> (-0.23%) ⬇️
daft/io/_sql.py                 30.09% <25.00%> (-23.75%) ⬇️

... and 6 files with indirect coverage changes


@huleilei huleilei force-pushed the hll/write_sql_gitlab branch from e72da49 to 82bcd24 Compare January 8, 2026 07:46
Collaborator


Thanks for working on this @huleilei! For reference, one of the reasons the original PR wasn't merged was that more verification needed to be done on handling Daft types. This will probably need some discretion based on how smoothly things go, but here's another PR that implemented a Postgres catalog that could be useful: 670eecc

@huleilei (Contributor, Author) commented Jan 9, 2026

Thanks for the context and the reference to the Postgres catalog PR! I completely understand the concern regarding type verification, as ensuring correct schema mapping is critical for production use.

To address this, I've designed the implementation with a "safe by default, controllable when needed" approach:

  1. Explicit Type Control via dtype: I've exposed the dtype parameter in write_sql, which is passed directly to the underlying to_sql call. This serves as a robust "escape hatch," allowing users to explicitly define SQLAlchemy types for columns where default inference might be insufficient (e.g., specific precision for Decimals or JSON types); see the illustrative sketch below.
  2. Leveraging the Pandas Ecosystem: By utilizing micropartition.to_pandas(), we benefit from Pandas' mature and battle-tested type inference logic for standard Daft/Arrow types before they hit the database.
  3. Verification Tests: I've added comprehensive integration tests (specifically test_write_sql_dtype_basic_types and test_write_sql_dtype_empty_df_creates_table) that not only write data but also use sqlalchemy.inspect to verify that the actual created table schema matches our expectations.

While a full Catalog integration (like in #670eecc) is definitely a great direction for the future, I believe this DataSink implementation provides a solid, verified foundation for generic SQL writing capabilities.
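
For example, a hypothetical call pinning a Decimal column's precision and storing another column as database-native JSON (df, make_connection, and the column names are illustrative, not from the PR):

    import sqlalchemy

    df.write_sql(
        "orders",
        make_connection,  # a picklable factory returning a SQLAlchemy connection
        dtype={
            "total": sqlalchemy.types.Numeric(precision=18, scale=4),
            "payload": sqlalchemy.types.JSON,
        },
    )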

@huleilei huleilei marked this pull request as ready for review January 9, 2026 02:04
@huleilei huleilei changed the title feat(sql): add write_sql function for writing data to SQL databases feat(io): implement write_sql with SQLDataSink and explicit dtype support Jan 9, 2026
@huleilei (Contributor, Author)

@desmondcheongzx @kevinzwang @colin-ho could you help review when convenient? Thanks!
