Feature Request: Ducklake python API #419

lthimus-opto · 2025-09-05T23:34:16Z

Why do you want this feature?

I’ve worked extensively with Parquet and Delta Lake over the past few years, and one of the biggest advantages I’ve found with Delta Lake is its native Python API and the schema flexibility it provides.

A great example is the schema_mode option in Delta Lake. It’s very intuitive: I can simply tell Delta Lake how to handle schema evolution and then focus on the real task of collecting and consolidating data, instead of spending time forcing the data into a rigid schema (which can be extremely painful when the schema is unknown, messy, or changes multiple times).

For context: in the attached example, I’m extracting messy data from the SEC’s EDGAR database, transforming it into a tabular format, and then writing it out. With DuckDB’s Python API today, I need to manually alter the table to fit the data. With Delta Lake, I just set schema_mode and it handles it seamlessly.
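To make the manual step concrete, here is a minimal sketch of the bookkeeping involved in altering a table to fit incoming data. It is not part of any DuckDB or DuckLake API: the helper names `quote_ident` and `add_column_statements`, the table name `facts`, and the column types are all hypothetical, and the function only builds standard `ALTER TABLE ... ADD COLUMN` statements as strings, which you would then have to execute yourself.

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier so column names like "end" or "start" are safe."""
    return '"' + name.replace('"', '""') + '"'


def add_column_statements(
    table: str, existing: dict[str, str], incoming: dict[str, str]
) -> list[str]:
    """Build ALTER TABLE statements for columns in `incoming` missing from `existing`.

    Both dicts map column name -> SQL type name. The table name is used as given.
    A column present in both with a different type raises instead of being altered.
    """
    statements = []
    for name, sql_type in incoming.items():
        if name not in existing:
            statements.append(
                f"ALTER TABLE {table} ADD COLUMN {quote_ident(name)} {sql_type}"
            )
        elif existing[name].upper() != sql_type.upper():
            raise TypeError(
                f"column {name!r}: table has {existing[name]}, data has {sql_type}"
            )
    return statements


# Hypothetical schemas: the table so far, and a new batch that gained a column.
existing = {"end": "DATE", "val": "DOUBLE"}
incoming = {"end": "DATE", "val": "DOUBLE", "frame": "VARCHAR"}

statements = add_column_statements("facts", existing, incoming)
for stmt in statements:
    print(stmt)
```

A schema_mode-style option would fold this diffing and altering into the write call itself, which is the convenience being requested.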

I think adding this kind of capability would make DuckLake much more user-friendly for messy, real-world data ingestion scenarios.

I'd love to hear your thoughts!

schema-mode-example.docx

leorong-opto · 2025-09-06T03:34:57Z

Totally need this feature!

1 reply

noelamezaga Sep 8, 2025

Please please please!

lthimus-opto · 2025-09-17T14:45:33Z

Author

I’d like to share another example that highlights why this feature is necessary.

It took me a while to track this down:
First write: when I wrote my DataFrame to DuckLake for the first time, the column order was:

['end', 'val', 'accn', 'fy', 'fp', 'form', 'filed', 'extracted_date',
 'padded_cik', 'taxonomy', 'concept', 'label', 'description', 'unit',
 'frame', 'start']

→ Everything worked perfectly.

Second write: the next time I wrote to the same DuckLake table, the column order had changed slightly:

['end', 'val', 'accn', 'fy', 'fp', 'form', 'filed', 'frame',
 'extracted_date', 'padded_cik', 'taxonomy', 'concept', 'label',
 'description', 'unit', 'start']

→ This failed, because DuckDB tried to cast the frame column into the extracted_date column’s position.
The error was:

duckdb.duckdb.ConversionException: Conversion Error: invalid timestamp field format: "CY2008Q3I"
expected format is (YYYY-MM-DD HH:MM:SS[.US][±HH[:MM[:SS]]| ZONE]) when casting from source column frame

This happens because DuckDB aligns columns by position instead of name.

For reference, Delta Lake supports column name-based matching: “When you write a DataFrame to a Delta table, Delta Lake primarily matches columns based on their names, not their ordinal position. If the column names in your DataFrame match the column names in the Delta table’s schema, Delta Lake will correctly map the data.”
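One possible workaround for the positional alignment described above is to reorder each batch to the table's column order before writing. The sketch below is plain Python with a hypothetical helper name, `reorder_to_table`, and it represents a batch as a dict of column name to values purely for illustration. With a pandas DataFrame the equivalent is `df[table_columns]`. DuckDB's SQL also offers `INSERT INTO ... BY NAME`, though whether that fits the write path used here is an assumption that has not been checked.

```python
def reorder_to_table(batch: dict[str, list], table_columns: list[str]) -> dict[str, list]:
    """Return `batch` with its columns arranged in the table's column order.

    Raises ValueError when the two column sets differ, so a mismatch is
    reported by name instead of surfacing later as a confusing cast error.
    """
    missing = [c for c in table_columns if c not in batch]
    extra = [c for c in batch if c not in table_columns]
    if missing or extra:
        raise ValueError(f"column mismatch: missing={missing}, extra={extra}")
    return {name: batch[name] for name in table_columns}


# Column order of the table, taken from the first write.
table_columns = ['end', 'val', 'accn', 'fy', 'fp', 'form', 'filed', 'extracted_date',
                 'padded_cik', 'taxonomy', 'concept', 'label', 'description', 'unit',
                 'frame', 'start']

# Column order of the second write, where 'frame' moved ahead of 'extracted_date'.
second_order = ['end', 'val', 'accn', 'fy', 'fp', 'form', 'filed', 'frame',
                'extracted_date', 'padded_cik', 'taxonomy', 'concept', 'label',
                'description', 'unit', 'start']

# Placeholder batch: each column holds a single value naming the column.
batch = {name: [f"<{name}>"] for name in second_order}

reordered = reorder_to_table(batch, table_columns)
print(list(reordered)[7])   # column now sitting in position 7
```

After reordering, position 7 holds `extracted_date` again, so a positional insert would no longer try to cast `frame` values to timestamps.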

1 reply

guillesd Oct 14, 2025
Collaborator

Hey @lthimus-opto! Could you share the code you are using for writing? This is sidetracking a bit, but I'm curious!
