feat: implement true batch writing for parquet writer #530
Description
This PR implements true batch writing optimization for the Parquet writer, addressing performance issues identified in the current implementation. Instead of processing records individually in a loop, the implementation now groups records by partition and writes entire batches in single operations.
Problem Solved:
The existing implementation called the parquet writer once per record, which does not leverage the batch capabilities of the underlying parquet-go library, leading to:
Solution Implemented:
time.Now() call for entire batch
Performance Results:
Fixes #477
Type of change
How Has This Been Tested?
Comprehensive test suite created and all tests pass:
go build ./destination/parquet/... passes
go test ./destination/parquet/... -v passes with all scenarios
Related PR's (If Any):
This PR supersedes the previous attempt #529 which had GitHub conflict resolution issues. This is a clean implementation from scratch with:
Proper batch writing logic (not just API syntax)
Comprehensive test coverage
Proven performance improvement
Note
Replaces per-record writes with partition-wise batch writes using a single batch timestamp and grouped GenericWriter calls.
destination/parquet/parquet.go:
OlakeTimestamp per batch and group by partition via getPartitionedFilePath.
GenericWriter call (any vs types.RawRecord).
Written by Cursor Bugbot for commit 6e7109d.