GH-46371: [C++][Parquet] Parquet Variant decoding tools #46372

Open
wants to merge 30 commits into
base: main

Member

mapleFU commented May 9, 2025

Rationale for this change

This patch adds tools to decode the Parquet Variant type.

What changes are included in this PR?

This patch adds tools to decode the Parquet Variant type.

Are these changes tested?

Yes. I use the parquet-testing data. Some problems are listed here: apache/parquet-testing#79

I can also add some hand-written tests after the interface is agreed on.

Are there any user-facing changes?

Yes, this adds interfaces for decoding Variant values.

github-actions bot commented May 9, 2025

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

mapleFU changed the title from "[C++][Parquet] Parquet Variant decoding tools" to "GH-46371: [C++][Parquet] Parquet Variant decoding tools" on May 9, 2025

github-actions bot commented May 9, 2025

⚠️ GitHub issue #46371 has been automatically assigned in GitHub to PR creator.

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on May 13, 2025

/// \defgroup ValueAccessors
/// @{

// Note: Null doesn't need visitor.
Member Author

I'm not sure whether we should just return an Arrow Scalar; that would be easy to use but inefficient.

Comment on lines 156 to 160
int8_t getInt8() const;
int16_t getInt16() const;
int32_t getInt32() const;
int64_t getInt64() const;
Member Author

Currently, getInt64 only supports reading from int64, which is too strict for integers. I think we could also find a way to let getInt64 read "smaller" types such as int32, int16, and int8.
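
One possible shape for the widening read discussed above, as a minimal standalone sketch rather than the PR's implementation. `IntWidth` and `ReadIntegerAsInt64` are hypothetical names, and the sketch assumes the value bytes hold a little-endian two's-complement integer of 1, 2, 4, or 8 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical tag for the physical width of the stored integer, in bytes.
enum class IntWidth : uint8_t { kInt8 = 1, kInt16 = 2, kInt32 = 4, kInt64 = 8 };

// Reads a little-endian two's-complement integer of `width` bytes from `data`
// and sign-extends it to int64_t. Returns std::nullopt if the buffer is too short.
inline std::optional<int64_t> ReadIntegerAsInt64(const uint8_t* data, std::size_t size,
                                                 IntWidth width) {
  const std::size_t n = static_cast<std::size_t>(width);
  assert(n == 1 || n == 2 || n == 4 || n == 8);
  if (data == nullptr || size < n) return std::nullopt;

  // Assemble the bytes independently of the host byte order.
  uint64_t raw = 0;
  for (std::size_t i = 0; i < n; ++i) {
    raw |= static_cast<uint64_t>(data[i]) << (8 * i);
  }
  // Sign-extend narrower values by filling the upper bits with ones.
  if (n < 8 && (raw & (uint64_t{1} << (8 * n - 1))) != 0) {
    raw |= ~uint64_t{0} << (8 * n);
  }
  // Reinterpret the bit pattern as a signed value without relying on
  // implementation-defined conversions.
  int64_t out;
  std::memcpy(&out, &raw, sizeof(out));
  return out;
}
```

With a helper like this, a getInt64-style accessor could accept any stored integer width, while the narrower getters stay strict about their own width.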

int32_t getInt32() const;
int64_t getInt64() const;
/// Include short_string optimization and primitive string type
std::string_view getString() const;
Member Author

Currently, I don't validate UTF-8 here.

std::string_view getString() const;
std::string_view getBinary() const;
float getFloat() const;
double getDouble() const;
Member Author

Currently, getDouble is too strict about the type it reads from. Maybe we can also find a way to let getDouble read other types, such as float.

}

// checking the element is incremental.
// TODO(mwish): Remove this or encapsulate this range check to function
Member Author

Should we use an extra function here, like "Validate", or just do the checks inline?

mapleFU marked this pull request as ready for review on May 14, 2025 08:10
mapleFU requested a review from wgtmac as a code owner on May 14, 2025 08:10
mapleFU force-pushed the variant-cpp-decoder-tools branch from fb59842 to da142a6 on May 14, 2025 08:10

Member Author

mapleFU commented May 14, 2025

@emkornfield @wgtmac @pitrou @zeroshade

This patch adds some basic Variant decoding tools. Some thoughts:

What should the interface for visiting a Variant look like? The simplest way is to cast <metadata, value> pairs to a ptr<::arrow::Scalar>, but this is too slow and needs to read the whole data. We could also wrap a std::variant, but I think that is also slow and needs dynamic dispatch in the visitor. Here I just add a visitor for every type. Currently, getInt64 only supports reading int64. Any ideas are welcome.

mapleFU force-pushed the variant-cpp-decoder-tools branch from da142a6 to 54681c4 on May 14, 2025 08:20

xxubai left a comment

Thanks for the contribution. I have some questions; please help me out!

if (metadata.size() < 2) {
throw ParquetException("Invalid Variant metadata: too short: " +
std::to_string(metadata.size()));
}

I think some conditions should be checked first when creating the metadata:

  1. byte order
  2. whether the header version is 1

Member Author

For (1), I assume the little-endian (LSB) format here. (2) is a good point.

Member Author

I now check that the version is 1.
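
A minimal sketch of the kind of header check discussed above, not the PR's code. It assumes the metadata header byte layout from the Parquet Variant encoding specification as I understand it (version in the low 4 bits, a sorted_strings flag in bit 4, and offset_size_minus_one in the top 2 bits). `MetadataHeader` and `ParseMetadataHeader` are hypothetical names, and `std::invalid_argument` stands in for `ParquetException` so the example is standalone.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical decoded form of the first metadata byte.
struct MetadataHeader {
  uint8_t version;      // expected to be 1
  bool sorted_strings;  // dictionary strings are sorted and unique
  uint8_t offset_size;  // bytes per offset, 1 to 4
};

// Decodes and validates the header byte. Throws std::invalid_argument
// (a stand-in for ParquetException) on empty input or an unknown version.
inline MetadataHeader ParseMetadataHeader(std::string_view metadata) {
  if (metadata.empty()) {
    throw std::invalid_argument("Invalid Variant metadata: empty buffer");
  }
  const uint8_t header = static_cast<uint8_t>(metadata[0]);
  MetadataHeader out;
  out.version = static_cast<uint8_t>(header & 0x0F);
  out.sorted_strings = ((header >> 4) & 0x01) != 0;
  out.offset_size = static_cast<uint8_t>(((header >> 6) & 0x03) + 1);
  assert(out.offset_size >= 1 && out.offset_size <= 4);
  if (out.version != 1) {
    throw std::invalid_argument("Invalid Variant metadata: unsupported version " +
                                std::to_string(out.version));
  }
  return out;
}
```

Because the multi-byte fields that follow are read byte by byte in a fixed order, the byte-order question reduces to the decoder never copying multi-byte integers directly into host integers on a big-endian machine.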

if ((dict_size + 1) * offset_size > metadata_.size()) {
throw ParquetException("Invalid Variant metadata: offset out of range");
}
// TODO(mwish): This can be optimized by using binary search if the metadata is sorted.
Member Author

This can be optimized when the dictionary is sorted; I'll optimize it later.
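
A rough sketch of the optimization mentioned in the TODO, under simplifying assumptions: the dictionary strings are already decoded into a `std::vector<std::string_view>`, and "sorted" means byte-wise lexicographic order. `FindKeyId` is a hypothetical helper, not a function from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// Returns the dictionary id of `key`, or std::nullopt if it is absent.
// When `sorted` is true the keys must be in ascending order, which allows
// a binary search; otherwise the lookup falls back to a linear scan.
inline std::optional<uint32_t> FindKeyId(const std::vector<std::string_view>& keys,
                                         bool sorted, std::string_view key) {
  if (sorted) {
    assert(std::is_sorted(keys.begin(), keys.end()));
    const auto it = std::lower_bound(keys.begin(), keys.end(), key);
    if (it != keys.end() && *it == key) {
      return static_cast<uint32_t>(it - keys.begin());
    }
    return std::nullopt;
  }
  const auto it = std::find(keys.begin(), keys.end(), key);
  if (it == keys.end()) return std::nullopt;
  return static_cast<uint32_t>(it - keys.begin());
}
```

A real decoder would more likely search the offset table in place instead of materializing the keys first, but the comparison logic would be the same.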


uint32_t field_offset = std::numeric_limits<uint32_t>::max();
// Get the field offset
// TODO(mwish): Using binary search to optimize it.
Member Author

This can be optimized with binary search.

mapleFU force-pushed the variant-cpp-decoder-tools branch from 51f02d9 to bbce69c on May 14, 2025 12:16
mapleFU force-pushed the variant-cpp-decoder-tools branch 3 times, most recently from 99a59ee to c0526da on May 14, 2025 14:44
mapleFU force-pushed the variant-cpp-decoder-tools branch from c0526da to 6454bb7 on May 14, 2025 14:51

xxubai commented May 15, 2025

Is it possible to manually add unit tests covering scenarios such as null values, UUID, etc.?

Member Author

mapleFU commented May 15, 2025

> Is it possible to manually add unit tests covering scenarios such as null values, UUID, etc.?

Sure. I've left that out for now pending review: I think the interfaces might change a lot after review, and parquet-java (which can write the corresponding data) hasn't released this yet.

Member

wgtmac left a comment

I just took a quick look at the API. Let me know what you think.


enum class VariantBasicType {
/// One of the primitive types
Primitive = 0,
Member

Suggested change
Primitive = 0,
kPrimitive = 0,

We need to add the k-prefix to be consistent with this codebase.

Member Author

Actually, this codebase doesn't seem to use that style. For example:

enum class ParquetDataPageVersion { V1, V2 };

/// Controls the level of size statistics that are written to the file.
enum class SizeStatisticsLevel : uint8_t {
  // No size statistics are written.
  None = 0,
  // Only column chunk size statistics are written.
  ColumnChunk,
  // Both size statistics in the column chunk and page index are written.
  PageAndColumnChunk
};


enum class VariantPrimitiveType : int8_t {
/// Equivalent Parquet Type: UNKNOWN
NullType = 0,
Member

ditto

/// @{

// Note: Null doesn't need visitor.
bool getBool() const;
Member

What is the expectation if the basic type or primitive type does not match?

Member

Is it acceptable to use std::variant to replace these getters? The caveat is that we would need to define view types for objects and arrays as well.

Member Author

I don't think std::variant is a good idea. VariantValue is already a "variant" type; casting it to std::variant and reading from that is (1) not lazy, since it needs to deserialize the whole structure, and (2) decodes twice.

Member

Fair enough. std::variant pays for a full deserialization.

int64_t getTimestamp() const;
int64_t getTimestampNTZ() const;
// 16 bytes UUID
std::array<uint8_t, 16> getUuid() const;
Member

Why not std::string_view? Is it due to big endianness?

Member Author

Yes...

std::string toDebugString() const;
};
ObjectInfo getObjectInfo() const;
std::optional<VariantValue> getObjectValueByKey(std::string_view key) const;
Member

Do we need functions to iterate over each key/value pair?

};
ArrayInfo getArrayInfo() const;
// Would throw ParquetException if index is out of range.
VariantValue getArrayValueByIndex(uint32_t index) const;
Member

Same question here: do we need an iterator over the elements?


std::string toDebugString() const;
};
ObjectInfo getObjectInfo() const;
Member

Is it for reuse? Shouldn't we cache it internally?

Member Author

mapleFU commented May 16, 2025

@wgtmac I'm thinking of keeping some fields inside VariantValue. The layout below adds 24 B or 28 B of overhead to VariantValue. It adds some cost when constructing a VariantValue but may be friendlier for users.

    VariantBasicType type_;         // (maybe we don't need to store this)
    uint32_t num_elements_;         // exists when type_ is array or object
    uint32_t id_size_;              // exists when type_ is object
    uint32_t offset_size_;          // exists when type_ is array or object
    uint32_t id_start_offset_;      // exists when type_ is object
    uint32_t offset_start_offset_;  // exists when type_ is array or object
    uint32_t data_start_offset_;    // exists when type_ is array or object

Another problem is the error style for the decoding tools. If we leave VariantValue/VariantMetadata as tools in Parquet and use them in Arrow compute or an extension type, we also need to convert their error handling to Status...
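
To illustrate the error-style concern, here is a minimal standalone sketch of wrapping a throwing decoder call so that callers see a status value instead of an exception. `SimpleStatus`, `CatchAsStatus`, and `DecodeOrStatus` are hypothetical stand-ins, not Arrow or Parquet APIs; `std::runtime_error` plays the role of the exception the decoder would throw.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical, simplified status type: ok == true means success.
struct SimpleStatus {
  bool ok;
  std::string message;
};

// Runs `fn` and converts any std::exception it throws into a failed status.
template <typename Fn>
SimpleStatus CatchAsStatus(Fn&& fn) {
  try {
    std::forward<Fn>(fn)();
    return SimpleStatus{true, ""};
  } catch (const std::exception& e) {
    return SimpleStatus{false, e.what()};
  }
}

// Example caller: a decoding step that throws on corrupt input is exposed
// to status-based code through the wrapper.
inline SimpleStatus DecodeOrStatus(bool corrupt) {
  return CatchAsStatus([corrupt] {
    if (corrupt) throw std::runtime_error("Invalid Variant metadata: too short");
  });
}
```

The cost of this approach is a try/catch boundary at every call site that crosses from exception-based code into status-based code, which is the friction the comment above points at.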

Member Author

mapleFU commented May 16, 2025

Also cc @pitrou on the interfaces, since you're the interface expert here.

mapleFU force-pushed the variant-cpp-decoder-tools branch from 7bfb57d to 1b31d42 on May 16, 2025 20:11
Comment on lines +159 to +160
std::string_view metadata_;
uint32_t dictionary_size_{0};
Member Author

I wonder if it's worth changing this to the layout below, which would shrink it from 24 B to 16 B:

const uint8_t* metadata_ptr_;
const uint32_t metadata_size_;
uint32_t dictionary_size_;
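
The size difference can be checked with a small standalone sketch. `ViewLayout` and `CompactLayout` are hypothetical stand-ins for the two member layouts being compared; the 24 B and 16 B figures assume a typical 64-bit platform where `std::string_view` is a pointer plus a length and 8-byte alignment applies.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Current layout: 16-byte string_view + 4-byte count, padded to 24 bytes
// on a typical 64-bit platform.
struct ViewLayout {
  std::string_view metadata;
  uint32_t dictionary_size;
};

// Proposed layout: 8-byte pointer + two 4-byte fields, 16 bytes with no
// padding on a typical 64-bit platform.
struct CompactLayout {
  const uint8_t* metadata_ptr;
  uint32_t metadata_size;
  uint32_t dictionary_size;
};

// Portable check: the compact form is never larger than the view form.
static_assert(sizeof(CompactLayout) <= sizeof(ViewLayout),
              "compact layout should not exceed the string_view layout");
```

The trade-off is that a 32-bit `metadata_size` caps the metadata buffer at 4 GiB, and the pointer/size pair has to be converted back whenever a `std::string_view` is needed.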

3 participants