-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support reading plain encoded INT96 timestamp from Parquet file #10850
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for meta-velox canceled.
|
Hi @mskapilks, since Gluten customer has been asking for this feature, I created this PR to add support for reading plain encoded INT96 timestamp with your commit kept. If you would like to continue this work in your PR, please let us know. Thanks! |
Thanks for the PR |
a93770b
to
07bfce3
Compare
velox/dwio/common/IntDecoder.h
Outdated
unsigned char ch; | ||
|
||
// Read 8 unsigned bytes. | ||
uint64_t nanos = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just do this? This class should not have any knowledge about timestamp
int128_t result = 0;
for (int i = 0; i < 12; ++i) {
auto ch = readByte();
result |= static_cast<uint128_t>(ch & BASE_256_MASK) << (i * 8);
}
return result;
Then inside TimestampColumnReader
we convert these int128_t
into Timestamp
, possibly in-place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated following this suggestion. Now we convert int128_t
to Timestamp
in getValues inside TimestampColumnReader
. Please note that we also need to convert the int128_t
value as Timestamp
in TimestampRange
with this change. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IntDecoder
and TimestampColumnReader
looks good now, the change in TimestampRange
is not ideal though, that is parquet specific logic we need to find a way to move this into parquet as well.
709e15e
to
b8ef8f2
Compare
@@ -70,7 +76,7 @@ class TimestampColumnReader : public IntegerColumnReader { | |||
const RowSet& rows, | |||
const uint64_t* /*incomingNulls*/) override { | |||
auto& data = formatData_->as<ParquetData>(); | |||
// Use int128_t as a workaroud. Timestamp in Velox is of 16-byte length. | |||
// Use int128_t as a workaround. Timestamp in Velox is of 16-byte length. | |||
prepareRead<int128_t>(offset, rows, nullptr); | |||
readCommon<IntegerColumnReader, true>(rows); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the clean way to do it is to create our own ColumnVisitor
instance with a subclass of TimestampRange
that is dedicated for Parquet Int96, instead of reusing the readCommon
for integers. Almost all of the filter type and valueSize_
dispatches on integers do not apply to timestamp type anyway.
Follow-up for: #4680.