Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement base scan_avro #21700

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

erikbrinkman
Copy link
Contributor

This implements a basic version of scan_avro, see below for details of what I mean.

This PR was larger than I initially expected it to be. Github doesn't really do stacked commits very well, but I still separated this PR into three logical commits if that makes it easier to review. I could similar submit separate PRs, whatever is best.

Fixes #6903

Commits

  • Add options to the AvroReader and update some utilities to make the RowIndex mutable making it easier to pass down, and updated the types so they're Arcs and PlSmallStr insteat of Vec<String>.
  • Implement scan_avro on the rust side.
  • Implement scan_avro on the python side and add tests.

Notes

  • In general, I tried to copy what has been done before, e.g. usually from scan_ndjson, this might night be ideal so I'm open to general comments.
  • In order to get the row index surviving pre-predicate, I updated some of the generic utilities to make row_index mutable. It seems like many implements actually ignore this, read multi-threaded without the predicate, which allows files to be read multi-threaded, and then apply the predicate. I just wanted to get this implemented to start so this was the easiest route, but I'm open to thoughts here.
  • Many of the scan implementations have an extra *Options struct that's passed around. There weren't really any available for avro, so I omitted that field entirely, so there wasn't dead code. I could still pass it around, or make the struct non-terminating, but this seemed like the best choice.
  • See the above note about switching some of the types of the AvroReader these seem better and more consistent with what I've seen elsewhere, but maybe there was another reason that's not the interface, or maybe there are rust versioning constraints with changing it.
  • I didn't implement hive partitioning logic. This would probably be useful for avro, but I was lazy. It seems like it should be able to updated later. This is one instance of "basic".
  • Some of the multiscan tests that use new_streaming aren't implemented. This seemed to require implementing some optimizations in the ir, and seemed non-trivial. It does seem like in the past week this support was just added for ndjson and csv making avro the odd format out. It seems possible to implement, but this is another aspect I meant by basic.
  • There seems to be a bug in avro reading or writing. Some of the tests enforce a maximum chunk size by first slicing the frame, then concatenating it with rechunk=False. However, when this is done, read_avro won't read the output of write_avro. Currently I have a somewhat hacky workaround in the test. I'm open for other solutions. Fixing the bug seems out of scope for this PR, but I'm open to thoughts here.
  • Unrelated to this PR, the implement of ScanExec for JsonExec only reads the first file in sourced for num_unfiltered_rows. This seems like a bug, but since most of this is undocumented, I've been mostly piecing together what everything is, and as a result, could be misinterpreting what values certain functions are supposed to return.

Small updates to AvroReader


Summary:

Test Plan:
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Mar 12, 2025
@erikbrinkman erikbrinkman force-pushed the scan_avro branch 2 times, most recently from 0e55e28 to 96919d4 Compare March 12, 2025 02:32
Summary:

Test Plan:
@coastalwhite
Copy link
Collaborator

Hey @erikbrinkman, all of this work is amazing, and I really feel bad telling you that will not be merging it. We decided a short while ago to actually extract AVRO from the main project and offer it as an IO plugin. We just haven't gotten around to it yet, and that is why AVRO has been in a bit of a limbo state since then.

If you want to support AVRO better, I suggest instead porting the existing AVRO reader to an IO plugin. Those plugins should be able to do everything the other IO sources can do at minimal overhead. If anything is missing from that interface, we are of course willing to help out there.

@erikbrinkman
Copy link
Contributor Author

erikbrinkman commented Mar 12, 2025

Hey @coastalwhite

  1. You do say pretty explicitly to only work on approved issues so this is my own fault.
  2. I really thought this would be trivial, and it mostly spiraled.
  3. A plugin makes sense long term, and especially since avro is not super common and isn't backed by arrow (e.g. to polars it's basically a typed ndjson).
  4. Looking at the plugin definition, it seems to have much less overhead in terms of interface, but maybe slightly more in terms of

In general, happy to have this not go to waste and attempt a port out, but I have a few questions:

  1. Is there an issue for tracking it? Should there be one? Should Implement scan_avro #6903 be closed?
  2. What about sink_*?
  3. What about my random question about ndjson? Happy to open a separate issue, but it seems pretty trivial one way or the other, and from looking at the recent test changes, it seems like ndjson is going to be supported in core
  4. Is there a naming convention or discoverability path?
  5. Would it be possible to get a review from a maintainer, just to make sure I'm not making any egregious mistakes?

@ritchie46
Copy link
Member

Hey @erikbrinkman,

What we do want is a plugin for both read_avro, scan_avro The read_avro can dispatch to scan_avro internally.

This would be a whole new crate with a Rust interface and a Python IO plugin interface and it's own pypi release and crates.io release.

There might be some parts still blocking this from the rust side. I think our AnonymousScan interface might need to be slightly adapted but not much, I don't think it should be a blocker.

@erikbrinkman
Copy link
Contributor Author

@ritchie46 yeah, that's my plan. I'm probably going to use apache's avro crate instead of rolling my own. There could be issues with that as it seems to require that records have a name which the old interface didn't seem to always do.

However, one place I'm a little concerned, but haven't looked into much, is the ability to leverage other polars-io functionality to abstract away caching and cloud access.

I'm terms of read_avro I was just planning to do something pure python like:

def read_avro(..., columns):
    return scan_avro(...).select(*columns).collect()

or did you have something else in mind?

Finally, can you address since if the questions at the end of my last comment?

@lisasgoh
Copy link

Hey, will the future avro IO plugin support reading/writing to cloud storage?

@erikbrinkman
Copy link
Contributor Author

@lisasgoh in the implementation I'm working on, I'm using the same cloud handling as some of the other scanners, so presumably it should work, although I might reach out here to see if you can test it on real data, since I think all the tests that I'm going to try and copy from the mainline branch simply mock out cloud reading.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement scan_avro
4 participants