-
Notifications
You must be signed in to change notification settings - Fork 3.7k
GH-46371: [C++][Parquet] Parquet Variant decoding tools #46372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
See also: |
|
/// \defgroup ValueAccessors | ||
/// @{ | ||
|
||
// Note: Null doesn't need visitor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know should we just return an arrow's Scalar
, it would be easy to use but in-efficient.
cpp/src/parquet/variant.h
Outdated
int8_t getInt8() const; | ||
int16_t getInt16() const; | ||
int32_t getInt32() const; | ||
int64_t getInt64() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, getInt64
only supports read from int64, which is too strict for integer. I think we can also uses some way to allow getInt64 to get some "smaller types" like int32, int16, int8.
int32_t getInt32() const; | ||
int64_t getInt64() const; | ||
/// Include short_string optimization and primitive string type | ||
std::string_view getString() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently I didn't check utf-8 here.
std::string_view getString() const; | ||
std::string_view getBinary() const; | ||
float getFloat() const; | ||
double getDouble() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, getDouble
only supports read from getFloat
, which is too strict for. Maybe we can also uses some way to allow getDouble
get other types
cpp/src/parquet/variant.cc
Outdated
} | ||
|
||
// checking the element is incremental. | ||
// TODO(mwish): Remove this or encapsulate this range check to function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think should we use a extra function here like "Validate", or just checks them here?
fb59842
to
da142a6
Compare
@emkornfield @wgtmac @pitrou @zeroshade This patch add some basic variant decoding tools. Some thoughts: How would the interface for visiting variant like? The simplist way is cast |
da142a6
to
54681c4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution, I have some questions please help me out!
if (metadata.size() < 2) { | ||
throw ParquetException("Invalid Variant metadata: too short: " + | ||
std::to_string(metadata.size())); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are some conditions that should be checked first when creating metadata?
- byte order
- whether the header version is 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About (1) I assume it's lsb format here. (2) is a good point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've check it's 1 now
if ((dict_size + 1) * offset_size > metadata_.size()) { | ||
throw ParquetException("Invalid Variant metadata: offset out of range"); | ||
} | ||
// TODO(mwish): This can be optimized by using binary search if the metadata is sorted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be optimized by sorting, I'll optimize this later
|
||
uint32_t field_offset = std::numeric_limits<uint32_t>::max(); | ||
// Get the field offset | ||
// TODO(mwish): Using binary search to optimize it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be optimized by binary search
51f02d9
to
bbce69c
Compare
99a59ee
to
c0526da
Compare
c0526da
to
6454bb7
Compare
Is it possible to manually add unit tests covering scenarios such as null values, UUID, etc.? |
Sure, I just leave it for review, I think the interfaces might be modified a lot after review, and parquet-java(which can write the correspond data) haven't release yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just took a glimpse of the API. Let me know what you think.
|
||
enum class VariantBasicType { | ||
/// One of the primitive types | ||
Primitive = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Primitive = 0, | |
kPrimitive = 0, |
We need to add k
-prefix to be consistent in this codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this codebase doesn't have this style? Like:
enum class ParquetDataPageVersion { V1, V2 };
/// Controls the level of size statistics that are written to the file.
enum class SizeStatisticsLevel : uint8_t {
// No size statistics are written.
None = 0,
// Only column chunk size statistics are written.
ColumnChunk,
// Both size statistics in the column chunk and page index are written.
PageAndColumnChunk
};
|
||
enum class VariantPrimitiveType : int8_t { | ||
/// Equivalent Parquet Type: UNKNOWN | ||
NullType = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
/// @{ | ||
|
||
// Note: Null doesn't need visitor. | ||
bool getBool() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the expectation if the basic type or primitive type does not match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it acceptable to use std::variant
to replace these getters? The caveat is that we need to define view types for object, array as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think variant is a good idea, VariantValue
is a "variant" type, casting it to std::variant and get from it is (1) not lazy, needs to deserialize whole structure (2) it decode twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. std::variant
pays for a full deserialization.
int64_t getTimestamp() const; | ||
int64_t getTimestampNTZ() const; | ||
// 16 bytes UUID | ||
std::array<uint8_t, 16> getUuid() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not std::string_view
? Is it due to big endianness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes...
std::string toDebugString() const; | ||
}; | ||
ObjectInfo getObjectInfo() const; | ||
std::optional<VariantValue> getObjectValueByKey(std::string_view key) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need functions to iterate each key/value?
cpp/src/parquet/variant.h
Outdated
}; | ||
ArrayInfo getArrayInfo() const; | ||
// Would throw ParquetException if index is out of range. | ||
VariantValue getArrayValueByIndex(uint32_t index) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question for iterator over the elements
cpp/src/parquet/variant.h
Outdated
|
||
std::string toDebugString() const; | ||
}; | ||
ObjectInfo getObjectInfo() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it for reuse? Shouldn't we cache it internally?
@wgtmac I'm think of keep some elements inside the
Another problem is the error style for decoding tools. If we just leave VariantValue/VariantMetadata a tool in parquet, and uses it in arrow compute or extension type, we also needs cast the handling value to status... |
Also cc @pitrou for interfaces because you're interface expert here |
7bfb57d
to
1b31d42
Compare
std::string_view metadata_; | ||
uint32_t dictionary_size_{0}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's worth to change this to below case, which could make this 24B -> 16B
const uint8_t* metadata_ptr_;
const uint32_t metadata_size_;
uint32_t dictionary_size_
Rationale for this change
This patch supports tool to decode the parquet variant.
What changes are included in this PR?
This patch supports tool to decode the parquet variant.
Are these changes tested?
Yes. I uses parquet-testings. Some problems is listed here: apache/parquet-testing#79
I can also add some hand-written tests after interface is agreed.
Are there any user-facing changes?
Yes, this adds interfaces for decode variant.