Skip to content

Conversation

@Sanjay-Kirti
Copy link
Contributor

@Sanjay-Kirti Sanjay-Kirti commented Sep 22, 2025

Description

This PR implements true batch writing optimization for the Parquet writer, addressing performance issues identified in the current implementation. Instead of processing records individually in a loop, the implementation now groups records by partition and writes entire batches in single operations.

Problem Solved:
The existing implementation was calling the parquet writer once per record, which doesn't leverage the underlying parquet-go library's batch capabilities, leading to:

  • Inefficient I/O operations due to single-record writes
  • Increased overhead and reduced throughput for large datasets
  • Suboptimal compression ratios
  • Poor utilization of parquet format's performance features

Solution Implemented:

  • Partition-wise batch grouping: Records are collected and grouped by partition path
  • Single timestamp per batch: Efficient timestamping with one time.Now() call for entire batch
  • True batch writes: Each partition gets ONE write operation instead of N individual writes
  • Backward compatibility: Same function signature and error handling patterns

Performance Results:

  • 1000 records processed in 2.6562ms (verified by performance tests)
  • ✅ Dramatically reduced I/O operations: From N writes to P(partitions) writes
  • ✅ All records in batch share same timestamp, proving true batching behavior

Fixes #477

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Comprehensive test suite created and all tests pass:

  • Basic batch writing test: Multiple records batched correctly with shared timestamps
  • Empty batch handling: Edge case properly handled without errors
  • Single record processing: Works correctly with single-record batches
  • Performance test: 1000 records processed in 2.6562ms, demonstrating significant improvement
  • Partition file creation: Verifies correct partition file management
  • Compilation test: go build ./destination/parquet/... passes
  • Full test suite: go test ./destination/parquet/... -v passes with all scenarios

Related PR's (If Any):
This PR supersedes the previous attempt #529 which had GitHub conflict resolution issues. This is a clean implementation from scratch with:

Proper batch writing logic (not just API syntax)
Comprehensive test coverage
Proven performance improvement


Note

Replaces per-record writes with partition-wise batch writes using a single batch timestamp and grouped GenericWriter calls.

  • Parquet writer (destination/parquet/parquet.go):
    • Write:
      • Batch-process records: assign one OlakeTimestamp per batch and group by partition via getPartitionedFilePath.
      • Create partition files on-demand; write each partition's batch in a single GenericWriter call (any vs types.RawRecord).
      • Early return on empty input; updated error messages for batch operations.

Written by Cursor Bugbot for commit 6e7109d. This will update automatically on new commits. Configure here.

@Sanjay-Kirti
Copy link
Contributor Author

I humbly apologise for creating a chaos.
@shubham19may Let me know your reviews.
@hash-data

@Sanjay-Kirti
Copy link
Contributor Author

@hash-data Is this MongoDB/Iceberg test failure also happening on the main/staging branch? The error suggests it's a test cleanup issue unrelated to the parquet batch writing changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants