[C++][Parquet] GH-47628: Implement basic parquet file rewriter #47775
e4de469 to c216849
@pitrou @adamreeve @mapleFU Do you have any suggestions about this draft? Is there an efficient way to merge the schemas of two Parquet files?
Hmm, I think just reusing the current code is an OK approach, since implementing this logic against the current interface would be a bit hacky...
}

/// Build the RewriterProperties with the builder parameters.
std::shared_ptr<RewriterProperties> build() {
const or move?
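One possible reading of this question is whether `build()` should be const-qualified (copying the builder's state) or rvalue-qualified (moving the state out of the builder). The sketch below shows both options side by side. `Props`, `Builder`, and `created_by` here are hypothetical stand-ins, not the types from this PR.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical properties type, for illustration only.
struct Props {
  std::string created_by;
};

// Hypothetical builder, for illustration only.
class Builder {
 public:
  Builder& created_by(std::string value) {
    created_by_ = std::move(value);
    return *this;
  }

  // Option 1: const-qualified. Copies the state, so the builder stays reusable.
  std::shared_ptr<Props> build() const& {
    return std::make_shared<Props>(Props{created_by_});
  }

  // Option 2: rvalue-qualified. Moves the state out of an expiring builder.
  std::shared_ptr<Props> build() && {
    return std::make_shared<Props>(Props{std::move(created_by_)});
  }

 private:
  std::string created_by_;
};
```

Calling `builder.build()` on a named builder selects the copying overload, while `std::move(builder).build()` selects the moving one.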
template <typename Builder>
void SerializeIndex(
const std::vector<std::vector<std::unique_ptr<Builder>>>& page_index_builders,
const std::vector<std::vector<std::unique_ptr<Index<Builder>>>>& page_indices,
Can this be separated into different methods? This reuse seems a bit hacky to me.
I haven't reviewed all the changes yet and will progressively post my comments.
writer_properties_(default_writer_properties()),
reader_properties_(default_reader_properties()) {}

explicit Builder(const RewriterProperties& properties)
Let's start simple and remove this?
I think we may need it in the future. I used a builder because `WriterProperties` and `ArrowWriterProperties` are both built by builders.
class Builder {
public:
Builder()
: pool_(::arrow::default_memory_pool()),
What about initializing them to nullptr and only assigning the default values in `build()` when they have not been provided?
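A minimal sketch of this suggestion, assuming simplified stand-in types: the builder members start out null and the defaults are resolved only inside `build()`. `MemoryPool`, `WriterProperties`, `RewriterProperties`, and the two `default_*` functions below are hypothetical placeholders defined locally, not the real Arrow or Parquet declarations.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins for the real Arrow/Parquet types.
struct MemoryPool {};
struct WriterProperties {};

inline MemoryPool* default_memory_pool() {
  static MemoryPool pool;
  return &pool;
}

inline std::shared_ptr<WriterProperties> default_writer_properties() {
  static std::shared_ptr<WriterProperties> props =
      std::make_shared<WriterProperties>();
  return props;
}

struct RewriterProperties {
  MemoryPool* pool;
  std::shared_ptr<WriterProperties> writer_properties;
};

class Builder {
 public:
  Builder* memory_pool(MemoryPool* pool) {
    pool_ = pool;
    return this;
  }

  Builder* writer_properties(std::shared_ptr<WriterProperties> properties) {
    writer_properties_ = std::move(properties);
    return this;
  }

  // Defaults are looked up here, and only for values the caller never set.
  std::shared_ptr<RewriterProperties> build() const {
    return std::make_shared<RewriterProperties>(RewriterProperties{
        pool_ != nullptr ? pool_ : default_memory_pool(),
        writer_properties_ != nullptr ? writer_properties_
                                      : default_writer_properties()});
  }

 private:
  MemoryPool* pool_ = nullptr;                            // unset until provided
  std::shared_ptr<WriterProperties> writer_properties_;  // null until provided
};
```

With this shape, constructing a `Builder` does not touch the default objects at all, and `build()` can be const because it never mutates the builder.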
I followed the implementation of the other `XXXProperties` classes, which assign `::arrow::default_memory_pool()` to the pool. Either is fine with me.
private:
std::vector<PageLocation> page_locations_;
std::vector<int64_t> unencoded_byte_array_data_bytes_;
format::OffsetIndex offset_index_;
This approach may double the memory consumption. If we really want to go with this approach, the getters of `OffsetIndexImpl` should try to return values directly from `offset_index_`, and `page_locations_` should be removed.
/// \brief List of repetition level histograms for each page concatenated together.
virtual const std::vector<int64_t>& repetition_level_histograms() const = 0;

virtual const void* to_thrift() const = 0;
I'm not sure if this is on the critical path. Would it be simpler and cleaner to convert `ColumnIndex` and `OffsetIndex` to their thrift equivalents?
private:
MemoryPool* pool_;
std::shared_ptr<WriterProperties> writer_properties_;
ReaderProperties reader_properties_;
It looks strange that one is a shared_ptr but the other isn't.
It seems that `WriterProperties` is always used through a `shared_ptr`, while `ReaderProperties` is always used directly. I don't know the reason either. I followed the implementation of `WriterProperties`.
PARQUET_EXPORT const std::shared_ptr<WriterProperties>& default_writer_properties();
PARQUET_EXPORT
std::shared_ptr<ArrowWriterProperties> default_arrow_writer_properties();
ReaderProperties PARQUET_EXPORT default_reader_properties();
PARQUET_EXPORT
ArrowReaderProperties default_arrow_reader_properties();
::arrow::MemoryPool* pool = ::arrow::default_memory_pool(), int64_t size = 0);

PARQUET_EXPORT
void CopyStream(std::shared_ptr<ArrowInputStream> from,
This function looks too specialized. Should we just define it in the file_rewriter.cc?
This is a draft PR for now. I followed Java's implementation, but I don't think it is a good enough design for C++, because we must copy a lot of code from file_writer.cc or file_reader.cc, which will be troublesome to maintain in the future. I would prefer to implement some classes inheriting from `XXXWriter` or `XXXReader`. I'll think about how to refactor the code. If anyone has good suggestions, please comment.

I have written two kinds of tests, which test horizontal splicing and vertical splicing of Parquet files separately. Only horizontal splicing is implemented so far, because I haven't found an efficient way to merge the schemas of two Parquet files.
Rationale for this change
Allow rewriting Parquet files by copying their binary data directly, instead of reading and decoding all values and then writing them again.
What changes are included in this PR?
- `ParquetFileRewriter` and `RewriterProperties`.
- `to_thrift` and `SetXXX` methods to help copy the metadata.
- `CopyStream` methods to call `memcpy` between `ArrowInputStream` and `ArrowOutputStream`.
- `RowGroupMetaDataBuilder::NextColumnChunk(std::unique_ptr<ColumnChunkMetaData> cc_metadata, int64_t shift)`, which allows adding column metadata without creating a `ColumnChunkMetaDataBuilder`.
Are these changes tested?
Yes
Are there any user-facing changes?
`ReaderProperties::GetStream` is changed to a const method. Only the signature has been changed. Its original implementation allows it to be declared as a const method.
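To illustrate why adding `const` to a method that only reads its members is source-compatible for callers, here is a small sketch with a hypothetical stand-in class. `Props`, its `GetStream` signature, and its return value are invented for the example and are not the real `ReaderProperties` API.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for a properties class; not the real ReaderProperties.
class Props {
 public:
  explicit Props(int64_t buffer_size) : buffer_size_(buffer_size) {}

  // The body only reads members, so the method can be declared const.
  std::string GetStream(int64_t start, int64_t num_bytes) const {
    return "stream[" + std::to_string(start) + "," +
           std::to_string(start + num_bytes) +
           ") buffer=" + std::to_string(buffer_size_);
  }

 private:
  int64_t buffer_size_;
};

// Compiles only because GetStream is const-qualified.
std::string OpenFromConst(const Props& props) { return props.GetStream(0, 16); }
```

Existing callers that hold a non-const object keep compiling unchanged, and callers that only hold a const reference can now call the method as well.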