Labels: type:bug
Description
Bug Description
What happened:
Out of Memory (OOM) errors occur when building the Secondary Index (SI) on large tables. The error manifests during metadata table write operations (see the stack trace below).
What you expected:
Can we avoid populating in-memory hash maps and lists before returning an iterator? Consuming the records through an iterator directly would avoid building up memory pressure (see the sketch after the link below):
https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/SecondaryIndexRecordGenerationUtils.java#L200
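For illustration only (not the actual Hudi code), a minimal Java sketch of the difference between eagerly materializing mapped records in a list and wrapping the source in a lazy iterator; the eager/lazy helper names and the toy record types are hypothetical.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

public class StreamingIteratorSketch {

  // Eager pattern: materializes every mapped record in memory before returning,
  // which is what drives heap pressure on large inputs.
  static <I, O> Iterator<O> eager(Iterator<I> source, Function<I, O> mapper) {
    List<O> out = new ArrayList<>();
    while (source.hasNext()) {
      out.add(mapper.apply(source.next()));
    }
    return out.iterator();
  }

  // Lazy pattern: wraps the source iterator and maps records one at a time,
  // so only a single record is held in memory at any point.
  static <I, O> Iterator<O> lazy(Iterator<I> source, Function<I, O> mapper) {
    return new Iterator<O>() {
      @Override
      public boolean hasNext() {
        return source.hasNext();
      }

      @Override
      public O next() {
        return mapper.apply(source.next());
      }
    };
  }

  public static void main(String[] args) {
    // Hypothetical usage: the mapper stands in for per-record secondary-index
    // record generation.
    Iterator<Integer> source = List.of(1, 2, 3).iterator();
    Iterator<String> records = lazy(source, i -> "record-" + i);
    records.forEachRemaining(System.out::println);
  }
}

The eager variant holds the whole mapped collection on the heap before the caller sees the first record; the lazy variant keeps memory usage roughly constant regardless of input size, which is the behavior being asked for here.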
Steps to reproduce:
Build the Secondary Index for a 100 GB table with Parquet files of 100 MB or larger.
Environment
- Hudi Version: 1.0.0+ (any version with Secondary Index support)
- Spark Version: 3.5.x
- Table Type: MOR or COW
- Table Size: 10M+ records, 100+ files
- Heap Size: Standard executor memory (insufficient for the non-streaming approach)
Logs and Stack Trace
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:100)
at org.apache.avro.io.BufferedBinaryEncoder.flushBuffer(BufferedBinaryEncoder.java:96)
at org.apache.hudi.avro.HoodieAvroUtils.indexedRecordToBytesStream(HoodieAvroUtils.java:152)
at org.apache.hudi.common.util.HFileUtils.serializeRecordsToLogBlock(HFileUtils.java:221)
at org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:501)
at org.apache.hudi.io.HoodieAppendHandle.flushToDiskIfRequired(HoodieAppendHandle.java:681)