[C++][Parquet] GH-47628: Implement basic parquet file rewriter #47775

HuaHuaY · 2025-10-10T06:58:56Z

This is currently a draft PR. I followed the Java implementation, but I don't think it is a good enough design for C++: we would have to copy a lot of code from file_writer.cc or file_reader.cc, which will be troublesome to maintain in the future. I would prefer to implement some classes that inherit from XXXWriter or XXXReader, and I'll think about how to refactor the code. If anyone has good suggestions, please comment.

I have written two kinds of tests, covering horizontal splicing and vertical splicing of parquet files separately. Only horizontal splicing is implemented so far, because I haven't found an efficient way to merge two parquet files' schemas.

Rationale for this change

Allow rewriting parquet files by copying their binary data directly, instead of reading and decoding all values and then writing them back.

What changes are included in this PR?

Add the classes ParquetFileRewriter and RewriterProperties.
Add some to_thrift and SetXXX methods to help copy the metadata.
Add CopyStream methods to call memcpy between ArrowInputStream and ArrowOutputStream.
Add RowGroupMetaDataBuilder::NextColumnChunk(std::unique_ptr<ColumnChunkMetaData> cc_metadata, int64_t shift), which allows adding column metadata without creating a ColumnChunkMetaDataBuilder.
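
To illustrate the idea behind copying a column chunk without decoding it, here is a minimal sketch. It assumes that `shift` is the byte distance between the chunk's position in the source file and its position in the output file; the PR text does not spell this out. `ChunkOffsets` and `CopyChunkAndShift` are hypothetical names, not part of the PR or of the Parquet C++ API, and in-memory strings stand in for the input and output files.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-in for the offsets kept in column chunk
// metadata; the real ColumnChunkMetaData carries many more fields.
struct ChunkOffsets {
  int64_t data_page_offset;
  int64_t dictionary_page_offset;  // -1 when the chunk has no dictionary page
  int64_t total_compressed_size;
};

// Append the raw bytes of one column chunk from `source` to `sink` without
// decoding them, and return the metadata with its offsets moved by the
// distance between the chunk's old and new start positions.
inline ChunkOffsets CopyChunkAndShift(const std::string& source,
                                      const ChunkOffsets& meta,
                                      std::string* sink) {
  // Assumption: a chunk starts at its dictionary page when it has one.
  const int64_t old_start = meta.dictionary_page_offset >= 0
                                ? meta.dictionary_page_offset
                                : meta.data_page_offset;
  const int64_t new_start = static_cast<int64_t>(sink->size());
  const int64_t shift = new_start - old_start;

  // Byte-for-byte copy of the compressed chunk.
  sink->append(source, static_cast<std::size_t>(old_start),
               static_cast<std::size_t>(meta.total_compressed_size));

  ChunkOffsets shifted = meta;
  shifted.data_page_offset += shift;
  if (shifted.dictionary_page_offset >= 0) {
    shifted.dictionary_page_offset += shift;
  }
  return shifted;
}
```

Only the offsets change; sizes, encodings and statistics can be carried over as they are, which is why no ColumnChunkMetaDataBuilder is needed in this scenario.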

Are these changes tested?

Yes

Are there any user-facing changes?

The new classes and methods mentioned above are added.
ReaderProperties::GetStream is now a const method. Only the signature has changed; the existing implementation already allowed it to be declared const.

GitHub Issue: [C++][Parquet] Provide a rewriter to rewrite parquet files without decoding all the row groups/pages #47628

github-actions · 2025-10-10T06:59:20Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

HuaHuaY · 2025-10-13T04:07:14Z

@pitrou @adamreeve @mapleFU Do you have any suggestions about this draft? Is there an efficient way to merge two parquet files' schemas?

mapleFU

Hmm, I think just reusing the current code is an OK way to go, since this logic in the current implementation would be a bit hacky with the current interface...

mapleFU · 2025-10-13T04:16:18Z

cpp/src/parquet/properties.h

+    }
+
+    /// Build the RewriterProperties with the builder parameters.
+    std::shared_ptr<RewriterProperties> build() {


const or move?
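
For context, the two options in the question can be sketched as ref-qualified overloads. This is a minimal illustration only: `DemoProperties` and `DemoBuilder` are hypothetical stand-ins, not the RewriterProperties code from the PR.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical, trimmed-down properties type used only for illustration.
struct DemoProperties {
  std::string created_by;
};

class DemoBuilder {
 public:
  DemoBuilder& created_by(std::string value) {
    created_by_ = std::move(value);
    return *this;
  }

  // Option 1 (const): the builder is left untouched and can be reused;
  // its members are copied into the result.
  std::shared_ptr<DemoProperties> build() const& {
    return std::make_shared<DemoProperties>(DemoProperties{created_by_});
  }

  // Option 2 (move): selected only for an rvalue builder, e.g.
  // std::move(builder).build(); its members are moved into the result.
  std::shared_ptr<DemoProperties> build() && {
    return std::make_shared<DemoProperties>(
        DemoProperties{std::move(created_by_)});
  }

 private:
  std::string created_by_;
};
```

A plain `build()` with no qualifier, as in the diff, promises neither: callers cannot invoke it on a const builder, and it gives no signal about whether the builder is still usable afterwards.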

mapleFU · 2025-10-13T04:19:02Z

cpp/src/parquet/page_index.cc

  template <typename Builder>
  void SerializeIndex(
      const std::vector<std::vector<std::unique_ptr<Builder>>>& page_index_builders,
+      const std::vector<std::vector<std::unique_ptr<Index<Builder>>>>& page_indices,


Can this be separated into a different method? This reuse looks a bit hacky to me.

wgtmac

I haven't reviewed all the changes yet and will progressively post my comments.

wgtmac · 2025-10-13T03:16:42Z

cpp/src/parquet/properties.h

+          writer_properties_(default_writer_properties()),
+          reader_properties_(default_reader_properties()) {}
+
+    explicit Builder(const RewriterProperties& properties)


Let's start simple and remove this?

HuaHuaY

I think we may need it in the future. I use a builder because WriterProperties and ArrowWriterProperties are both built by builders.

wgtmac · 2025-10-13T03:17:56Z

cpp/src/parquet/properties.h

+  class Builder {
+   public:
+    Builder()
+        : pool_(::arrow::default_memory_pool()),


What about initializing them to nullptr and only assign to default values when they are not provided in build()?

HuaHuaY

I followed the implementation of the other XXXProperties, which assign ::arrow::default_memory_pool() to the pool. Either is fine with me.
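
The alternative suggested in this thread, leaving the members unset and resolving defaults only in build(), might look roughly like the sketch below. All `Demo*` names are hypothetical stand-ins (for ::arrow::MemoryPool, WriterProperties and the RewriterProperties builder); this is not the PR's code.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins for ::arrow::MemoryPool and parquet::WriterProperties.
struct DemoPool {};
struct DemoWriterProperties {};

inline DemoPool* default_demo_pool() {
  static DemoPool pool;
  return &pool;
}

inline std::shared_ptr<DemoWriterProperties> default_demo_writer_properties() {
  static auto props = std::make_shared<DemoWriterProperties>();
  return props;
}

struct DemoRewriterProperties {
  DemoPool* pool;
  std::shared_ptr<DemoWriterProperties> writer_properties;
};

class DemoRewriterBuilder {
 public:
  DemoRewriterBuilder& memory_pool(DemoPool* pool) {
    pool_ = pool;
    return *this;
  }

  DemoRewriterBuilder& writer_properties(
      std::shared_ptr<DemoWriterProperties> properties) {
    writer_properties_ = std::move(properties);
    return *this;
  }

  // Defaults are looked up here, and only for values the caller never set.
  DemoRewriterProperties build() const {
    return DemoRewriterProperties{
        pool_ != nullptr ? pool_ : default_demo_pool(),
        writer_properties_ != nullptr ? writer_properties_
                                      : default_demo_writer_properties()};
  }

 private:
  DemoPool* pool_ = nullptr;
  std::shared_ptr<DemoWriterProperties> writer_properties_;
};
```

The trade-off is small: eager initialization matches the existing properties builders, while lazy resolution avoids touching the defaults when the caller overrides them.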

wgtmac · 2025-10-13T03:35:43Z

cpp/src/parquet/page_index.cc

 private:
  std::vector<PageLocation> page_locations_;
  std::vector<int64_t> unencoded_byte_array_data_bytes_;
+  format::OffsetIndex offset_index_;


This approach may double the memory consumption. If we really want to go with this approach, getters of OffsetIndexImpl should try to return values directly from offset_index_ and remove page_locations_.
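
One way to read this suggestion is sketched below: keep only the Thrift-style struct as storage and let the getters read from it. The `Demo*` types are hypothetical stand-ins for format::OffsetIndex, format::PageLocation and OffsetIndexImpl; the per-page accessor is an assumption made for the sketch, since the existing interface returns a whole vector of page locations.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the Thrift-generated types.
struct DemoThriftPageLocation {
  int64_t offset;
  int32_t compressed_page_size;
  int64_t first_row_index;
};

struct DemoThriftOffsetIndex {
  std::vector<DemoThriftPageLocation> page_locations;
};

// Hypothetical stand-in for the public page location type.
struct DemoPageLocation {
  int64_t offset;
  int32_t compressed_page_size;
  int64_t first_row_index;
};

class DemoOffsetIndex {
 public:
  explicit DemoOffsetIndex(DemoThriftOffsetIndex thrift)
      : thrift_(std::move(thrift)) {}

  std::size_t num_pages() const { return thrift_.page_locations.size(); }

  // Converted on demand, so no second vector of page locations is stored.
  DemoPageLocation page_location(std::size_t i) const {
    const DemoThriftPageLocation& loc = thrift_.page_locations.at(i);
    return DemoPageLocation{loc.offset, loc.compressed_page_size,
                            loc.first_row_index};
  }

  // The rewriter can serialize this struct directly when copying the index.
  const DemoThriftOffsetIndex& thrift() const { return thrift_; }

 private:
  DemoThriftOffsetIndex thrift_;  // single copy of the data
};
```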

wgtmac · 2025-10-13T03:43:30Z

cpp/src/parquet/page_index.h

  /// \brief List of repetition level histograms for each page concatenated together.
  virtual const std::vector<int64_t>& repetition_level_histograms() const = 0;
+
+  virtual const void* to_thrift() const = 0;


I'm not sure if this is on the critical path. Is it simpler and cleaner to convert ColumnIndex and OffsetIndex to their thrift equivalents?

wgtmac · 2025-10-13T03:46:22Z

cpp/src/parquet/properties.h

+   private:
+    MemoryPool* pool_;
+    std::shared_ptr<WriterProperties> writer_properties_;
+    ReaderProperties reader_properties_;


It looks strange that one is a shared_ptr but the other isn't.

HuaHuaY

It seems that WriterProperties is always used through a shared_ptr and ReaderProperties is always used by value. I don't know the reason either. I followed the existing usage of WriterProperties.

PARQUET_EXPORT const std::shared_ptr<WriterProperties>& default_writer_properties();
PARQUET_EXPORT std::shared_ptr<ArrowWriterProperties> default_arrow_writer_properties();
ReaderProperties PARQUET_EXPORT default_reader_properties();
PARQUET_EXPORT ArrowReaderProperties default_arrow_reader_properties();

wgtmac · 2025-10-13T04:00:21Z

cpp/src/parquet/platform.h

    ::arrow::MemoryPool* pool = ::arrow::default_memory_pool(), int64_t size = 0);

+PARQUET_EXPORT
+void CopyStream(std::shared_ptr<ArrowInputStream> from,


This function looks too specialized. Should we just define it in the file_rewriter.cc?

github-actions bot added the Component: Parquet, Component: C++ and awaiting review labels Oct 10, 2025

HuaHuaY changed the title ~~[C++][Parquet] Implement basic parquet file rewriter~~ [C++][Parquet] GH-47628: Implement basic parquet file rewriter Oct 10, 2025

Implement basic parquet file rewriter

c216849

HuaHuaY force-pushed the fix_issue_47664 branch from e4de469 to c216849 October 10, 2025 07:36

HuaHuaY added 2 commits October 10, 2025 15:43

fix cpplint

a4f5c31

fix compile errors

e037be7

mapleFU reviewed Oct 13, 2025

github-actions bot added the awaiting committer review label and removed the awaiting review label Oct 13, 2025

wgtmac reviewed Oct 13, 2025
