-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet, Arrow: Refactor vectorized reader #9772
base: main
Are you sure you want to change the base?
Conversation
5a0fe0f
to
8438906
Compare
I plan to resolve #7162 by adding vectorized readers for all v2 encodings. This is the 1st patch. Would you mind taking a look? @rdblue @nastra @Fokko @jackye1995 |
@wgtmac thanks so much for taking this issue on! I am just a user of iceberg (I have never contributed), but I was curious--is it also possible to include a resolution for #521 as an extension of this work? It would be a huge boon for iceberg performance since I often use complex/nested types, and I'm sure others do as well. |
@anthonysgro Yes, that is on my radar. I will work on it once v2 encodings of all primitive types have been supported. |
@wgtmac do you by any chance have all other changes that are required for v2 encodings? I'd like to see and understand how the changes in this PR fit into the bigger picture (It's been a while since I touched/reviewed the vectorized code paths) |
Not yet. My rough plan is to do following things:
This patch is the step 1 above and only added vectorized reading interfaces for float/double/int32/int64 physical types. It already shows the big picture that how following steps will be done. More interfaces will be added progressively for reading other physical types and logical types into arrow field vectors in order to have a better review experience. Does this make sense to you? @nastra |
08f4754
to
92ba90a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my inline comments for the main changes. All new classes/interfaces are package-private and most of this PR is refactoring, so I think this approach is not aggressive. Hope this can help review the PR. @nastra
import org.apache.parquet.io.api.Binary; | ||
|
||
/** Interface for value decoding that supports vectorized (aka batched) decoding. */ | ||
interface VectorizedValuesReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the new package-private vectorized reader for a single page. All v2 encodings will be added by implementing this interface and selected readXXX() methods.
@@ -47,35 +46,45 @@ public VectorizedParquetDefinitionLevelReader( | |||
super(bitWidth, maxDefLevel, readLength, setArrowValidityVector); | |||
} | |||
|
|||
abstract class NumericBaseReader { | |||
abstract class BaseReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have merged NumericBaseReader
into BaseReader
to eliminate the duplicate code. This refactoring contributes to the most part of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps I can split the refactoring work into a separate PR to make the review much easier.
int typeWidth, | ||
Mode mode); | ||
|
||
protected void nextValues( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is where we utilize the vectorized reading method from VectorizedValuesReader. You can see LongReader/IntegerReader/FloatReader/DoubleReader have overridden this method.
@@ -46,7 +45,7 @@ public VectorizedPageIterator( | |||
this.setArrowValidityVector = setValidityVector; | |||
} | |||
|
|||
private ValuesAsBytesReader plainValuesReader = null; | |||
private VectorizedValuesReader valuesReader = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have replaced plainValuesReader with generic values reader to support different encodings.
if (dataEncoding != Encoding.PLAIN) { | ||
if (dataEncoding == Encoding.PLAIN) { | ||
valuesReader = new VectorizedPlainValuesReader(); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More VectorizedValuesReader implementations will be added here for different encodings.
int numValues, | ||
VectorizedValuesReader valuesReader, | ||
int typeWidth) { | ||
valuesReader.readLongs(numValues, vector, bufferIdx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, LongReader
overrides nextValues to use VectorizedValuesReader.readLongs()
method for vectorized reading.
- Add VectorizedValuesReader interface for future extension. - Refactor VectorizedPageIterator to use VectorizedValuesReader. - Consolidate duplicate code in the VectorizedPageIterator.
92ba90a
to
f47157d
Compare
Since this feature has been requested in the community but the PR is not reviewed for a long time. Do you think this is the right direction here? Or should I directly support reading arrow vectors from the parquet-java library? @nastra @aokolnychyi @Fokko @szehon-ho |
hey @wgtmac sorry for the long wait here but I don't have any spare cycles atm to review this. |
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
Bump comment to keep the bot from closing this |
Hi @wgtmac , can we resume this PR? I can help with review. |
This patch is the preparation work of #7162 and it contains following changes: