feat: Implement base scan_avro
#21700
base: main
Conversation
Force-pushed from 0e55e28 to 96919d4 (commit: "Small updates to AvroReader")
Hey @erikbrinkman, all of this work is amazing, and I really feel bad telling you that we will not be merging it. We decided a short while ago to extract AVRO from the main project and offer it as an IO plugin. We just haven't gotten around to it yet, which is why AVRO has been in a bit of a limbo state since then.

If you want to support AVRO better, I suggest instead porting the existing AVRO reader to an IO plugin. Those plugins should be able to do everything the other IO sources can do at minimal overhead. If anything is missing from that interface, we are of course willing to help out there.
Hey @coastalwhite,

In general, I'm happy to have this not go to waste and attempt a port out, but I have a few questions:
Hey @erikbrinkman,

What we do want is a plugin for both. This would be a whole new crate with a Rust interface and a Python IO plugin interface, and its own pypi release and crates.io release. There might be some parts still blocking this from the rust side. I think our …
@ritchie46 yeah, that's my plan. I'm probably going to use apache's avro crate instead of rolling my own. There could be issues with that, as it seems to require that records have a name, which the old interface didn't seem to always do.

However, one place I'm a little concerned, but haven't looked into much, is the ability to leverage other polars-io functionality to abstract away caching and cloud access. In terms of `read_avro`, is it something like

```python
def read_avro(..., columns):
    return scan_avro(...).select(*columns).collect()
```

or did you have something else in mind? Finally, can you address some of the questions at the end of my last comment?
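For context on the record-name point above: an Avro object container file opens with the magic bytes `Obj\x01`, then a metadata map whose `avro.schema` entry carries the JSON schema (where the record-name requirement comes from), then a 16-byte sync marker, with all lengths stored as zigzag/varint longs. A minimal stdlib-only sketch of that header layout (helper names are mine, not polars or avro-crate code):

```python
import io
import json

MAGIC = b"Obj\x01"  # Avro object container file magic


def zigzag_encode(n: int) -> bytes:
    # Avro longs: zigzag-map to non-negative, then varint-encode.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)


def zigzag_decode(buf: io.BytesIO) -> int:
    acc, shift = 0, 0
    while True:
        b = buf.read(1)[0]
        acc |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)


def build_header(schema: dict, sync: bytes) -> bytes:
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": b"null"}
    out = bytearray(MAGIC)
    out += zigzag_encode(len(meta))  # one map block of len(meta) pairs
    for key, value in meta.items():
        kb = key.encode()
        out += zigzag_encode(len(kb)) + kb
        out += zigzag_encode(len(value)) + value
    out += zigzag_encode(0)  # zero count terminates the map
    out += sync  # 16-byte sync marker separating header from data blocks
    return bytes(out)


def read_header(raw: bytes) -> tuple[dict, bytes]:
    buf = io.BytesIO(raw)
    assert buf.read(4) == MAGIC, "not an Avro container file"
    meta = {}
    while True:
        count = zigzag_decode(buf)
        if count == 0:
            break
        if count < 0:
            # Negative counts carry a byte-size prefix; skipped in this sketch.
            raise NotImplementedError("negative map block counts")
        for _ in range(count):
            key = buf.read(zigzag_decode(buf)).decode()
            meta[key] = buf.read(zigzag_decode(buf))
    sync = buf.read(16)
    return meta, sync
```

After the header, the file is a sequence of data blocks (record count, byte size, payload, sync marker), which is the part where a hand-rolled reader starts duplicating what the apache crate already does.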
Hey, will the future avro IO plugin support reading/writing to cloud storage?
@lisasgoh in the implementation I'm working on, I'm using the same cloud handling as some of the other scanners, so presumably it should work, although I might reach out here to see if you can test it on real data, since I think all the tests that I'm going to try and copy from the mainline branch simply mock out cloud reading. |
This implements a basic version of `scan_avro`; see below for details of what I mean.

This PR was larger than I initially expected it to be. Github doesn't really do stacked commits very well, but I still separated this PR into three logical commits if that makes it easier to review. I could similarly submit separate PRs, whatever is best.

Fixes #6903
Commits

- Small updates to `AvroReader` and update some utilities to make the `RowIndex` mutable, making it easier to pass down, and updated the types so they're `Arc`s and `PlSmallStr` instead of `Vec<String>`.
- Implement `scan_avro` on the rust side.
- Implement `scan_avro` on the python side and add tests.

Notes
- This was mostly copied from `scan_ndjson`; this might not be ideal, so I'm open to general comments.
- Other scanners have an `*Options` struct that's passed around. There weren't really any options available for avro, so I omitted that field entirely so there wasn't dead code. I could still pass it around, or make the struct non-terminating, but this seemed like the best choice.
- I changed some of the interfaces of `AvroReader`; these seem better and more consistent with what I've seen elsewhere, but maybe there was another reason for the existing interface, or maybe there are rust versioning constraints with changing it.
- Scans with `new_streaming` aren't implemented. This seemed to require implementing some optimizations in the ir, and seemed non-trivial. It does seem like in the past week this support was just added for ndjson and csv, making avro the odd format out. It seems possible to implement, but this is another aspect of what I meant by basic.
- There's a preexisting bug when writing avro with `rechunk=False`: when this is done, `read_avro` won't read the output of `write_avro`. Currently I have a somewhat hacky workaround in the test; I'm open to other solutions. Fixing the bug seems out of scope for this PR, but I'm open to thoughts here.
- The `ScanExec` implementation for `JsonExec` only reads the first file in sources for `num_unfiltered_rows`. This seems like a bug, but since most of this is undocumented, I've been mostly piecing together what everything is and, as a result, could be misinterpreting what values certain functions are supposed to return.