Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet, Arrow: Refactor vectorized reader #9772

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Feb 22, 2024

This patch is the preparation work of #7162 and it contains following changes:

  • Add VectorizedValuesReader interface for future extension.
  • Refactor VectorizedPageIterator to use VectorizedValuesReader.
  • Consolidate duplicate code in the VectorizedPageIterator.

@github-actions github-actions bot added the arrow label Feb 22, 2024
@wgtmac wgtmac force-pushed the vectorized_plain_reader branch from 5a0fe0f to 8438906 Compare February 22, 2024 06:32
@wgtmac
Copy link
Member Author

wgtmac commented Feb 22, 2024

I plan to resolve #7162 by adding vectorized readers for all v2 encodings. This is the 1st patch. Would you mind taking a look? @rdblue @nastra @Fokko @jackye1995

@anthonysgro
Copy link

@wgtmac thanks so much for taking this issue on! I am just a user of iceberg (I have never contributed), but I was curious--is it also possible to include a resolution for #521 as an extension of this work? It would be a huge boon for iceberg performance since I often use complex/nested types, and I'm sure others do as well.

@Fokko Fokko requested a review from aokolnychyi March 11, 2024 19:41
@wgtmac
Copy link
Member Author

wgtmac commented Mar 12, 2024

@anthonysgro Yes, that is on my radar. I will work on it once v2 encodings of all primitive types have been supported.

@nastra
Copy link
Contributor

nastra commented Mar 15, 2024

I plan to resolve #7162 by adding vectorized readers for all v2 encodings. This is the 1st patch.

@wgtmac do you by any chance have all other changes that are required for v2 encodings? I'd like to see and understand how the changes in this PR fit into the bigger picture (It's been a while since I touched/reviewed the vectorized code paths)

@wgtmac
Copy link
Member Author

wgtmac commented Mar 15, 2024

Not yet. My rough plan is to do following things:

  1. add a new VectorizedValuesReader base class to supporting different encodings. This is similar to what spark does but reading into arrow field vector: :https://github.com/apache/spark/blob/b7aa9740249b50ad9db254626c530ff5bc33d385/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedValuesReader.java#L30
  2. extend VectorizedValuesReader to add v2 encodings one by one.
  3. support vectorized readers for nested types.

This patch is the step 1 above and only added vectorized reading interfaces for float/double/int32/int64 physical types. It already shows the big picture that how following steps will be done. More interfaces will be added progressively for reading other physical types and logical types into arrow field vectors in order to have a better review experience.

Does this make sense to you? @nastra

@wgtmac wgtmac force-pushed the vectorized_plain_reader branch from 08f4754 to 92ba90a Compare March 16, 2024 16:25
Copy link
Member Author

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my inline comments for the main changes. All new classes/interfaces are package-private and most of this PR is refactoring, so I think this approach is not aggressive. Hope this can help review the PR. @nastra

import org.apache.parquet.io.api.Binary;

/** Interface for value decoding that supports vectorized (aka batched) decoding. */
interface VectorizedValuesReader {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the new package-private vectorized reader for a single page. All v2 encodings will be added by implementing this interface and selected readXXX() methods.

@@ -47,35 +46,45 @@ public VectorizedParquetDefinitionLevelReader(
super(bitWidth, maxDefLevel, readLength, setArrowValidityVector);
}

abstract class NumericBaseReader {
abstract class BaseReader {
Copy link
Member Author

@wgtmac wgtmac Mar 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have merged NumericBaseReader into BaseReader to eliminate the duplicate code. This refactoring contributes to the most part of this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I can split the refactoring work into a separate PR to make the review much easier.

int typeWidth,
Mode mode);

protected void nextValues(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is where we utilize the vectorized reading method from VectorizedValuesReader. You can see LongReader/IntegerReader/FloatReader/DoubleReader have overridden this method.

@@ -46,7 +45,7 @@ public VectorizedPageIterator(
this.setArrowValidityVector = setValidityVector;
}

private ValuesAsBytesReader plainValuesReader = null;
private VectorizedValuesReader valuesReader = null;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have replaced plainValuesReader with generic values reader to support different encodings.

if (dataEncoding != Encoding.PLAIN) {
if (dataEncoding == Encoding.PLAIN) {
valuesReader = new VectorizedPlainValuesReader();
} else {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More VectorizedValuesReader implementations will be added here for different encodings.

int numValues,
VectorizedValuesReader valuesReader,
int typeWidth) {
valuesReader.readLongs(numValues, vector, bufferIdx);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, LongReader overrides nextValues to use VectorizedValuesReader.readLongs() method for vectorized reading.

- Add VectorizedValuesReader interface for future extension.
- Refactor VectorizedPageIterator to use VectorizedValuesReader.
- Consolidate duplicate code in the VectorizedPageIterator.
@wgtmac wgtmac force-pushed the vectorized_plain_reader branch from 92ba90a to f47157d Compare March 16, 2024 16:38
@wgtmac
Copy link
Member Author

wgtmac commented Jul 24, 2024

Since this feature has been requested in the community but the PR is not reviewed for a long time. Do you think this is the right direction here? Or should I directly support reading arrow vectors from the parquet-java library? @nastra @aokolnychyi @Fokko @szehon-ho

@nastra
Copy link
Contributor

nastra commented Jul 24, 2024

hey @wgtmac sorry for the long wait here but I don't have any spare cycles atm to review this.

Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 21, 2024
@PaulLiang1
Copy link

Bump comment to keep the bot from closing this

@nastra nastra added not-stale and removed stale labels Oct 21, 2024
@flyrain
Copy link
Contributor

flyrain commented Feb 6, 2025

Hi @wgtmac , can we resume this PR? I can help with review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants