Skip to content

Implement support for duckdb_arrow_scan for ingesting data via Arrow#488

Closed
phillipleblanc wants to merge 1 commit into
duckdb:mainfrom
spiceai:phillip/250409-upstream-duckdb-scan
Closed

Implement support for duckdb_arrow_scan for ingesting data via Arrow#488
phillipleblanc wants to merge 1 commit into
duckdb:mainfrom
spiceai:phillip/250409-upstream-duckdb-scan

Conversation

@phillipleblanc
Copy link
Copy Markdown
Contributor

Adds support for calling the duckdb_arrow_scan when passed an FFI_ArrowArrayStream from the Arrow library that implements the https://arrow.apache.org/docs/format/CStreamInterface.html

This matches behavior that exists in the Go DuckDB library: https://github.com/marcboeker/go-duckdb/blob/c607539e645091c1e3f600f929229d8baa7166e4/arrow.go#L251

This also provides an alternative to the arrow-vtab module that is not zero-copy in all cases for Arrow -> DuckDB buffers, whereas the native C++ DuckDB code has better support for zero-copy Arrow -> DuckDB buffers.

#18)

* Implement support for `duckdb_arrow_scan` for ingesting data via Arrow

* Map the NulError
@Maxxen
Copy link
Copy Markdown
Member

Maxxen commented Apr 9, 2025

Hello! Thanks a lot for the PR! Im about to cut v1.2.2 so in order to expedite getting this merged I forked this off into #489 with formatting/clippy fixes

@Maxxen
Copy link
Copy Markdown
Member

Maxxen commented Apr 9, 2025

Actually, ASAN reports memory leaks related to the test. I've done some digging and asked around the office from people with more arrow-knowledge, and my impression is that the arrow_scan_array method is most likely partly broken duckdb-side in that it has some weird assumptions regarding ownership and doesn't always free schemas properly. This PR seems related.

I guess these functions are deprecated for a reason. I don't know exactly when we will get around to replace them with something better, but we're looking into trying to fix what we think is causing this issue in particular - and if we do I'll check against this PR again.

@phillipleblanc
Copy link
Copy Markdown
Contributor Author

Actually, ASAN reports memory leaks related to the test. I've done some digging and asked around the office from people with more arrow-knowledge, and my impression is that the arrow_scan_array method is most likely partly broken duckdb-side in that it has some weird assumptions regarding ownership and doesn't always free schemas properly. This PR seems related.

I guess these functions are deprecated for a reason. I don't know exactly when we will get around to replace them with something better, but we're looking into trying to fix what we think is causing this issue in particular - and if we do I'll check against this PR again.

Thanks for checking! I'll convert this PR to draft until the memory leak issue is fixed and/or a new C API is available that we can use.

@mlafeldt
Copy link
Copy Markdown
Member

duckdb/duckdb#18246 lays the groundwork for a new Arrow C API.

@phillipleblanc
Copy link
Copy Markdown
Contributor Author

Looks like the new C API will be coming out in DuckDB v1.4. I'll close this PR since we'll need a new one to integrate with the new C API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants