
OOM with Secondary Index Enabled #14077

@vinishjail97

Description

Bug Description

What happened:
Out of Memory (OOM) errors occur when building the secondary index (SI) on large tables. The error manifests during metadata table write operations (see the stack trace below).

What you expected:
Can we avoid populating in-memory hash maps and lists and instead return an iterator directly, so the index build does not accumulate memory pressure? (A sketch of the idea follows the link below.)
https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/SecondaryIndexRecordGenerationUtils.java#L200
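A minimal sketch of the requested change, assuming a generic record-to-index-record transformation; the names and signatures below are illustrative and are not Hudi's actual API. The eager variant materializes every index record for a file slice in a list, while the lazy variant wraps the source iterator so only one record is resident at a time and the writer can flush as it goes:

  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;
  import java.util.function.Function;

  public class StreamingIndexSketch {

    // Eager approach: every index record is materialized up front,
    // so heap usage grows with the number of records per file.
    static <R, I> List<I> buildEagerly(Iterator<R> dataRecords, Function<R, I> toIndexRecord) {
      List<I> indexRecords = new ArrayList<>();
      while (dataRecords.hasNext()) {
        indexRecords.add(toIndexRecord.apply(dataRecords.next()));
      }
      return indexRecords;
    }

    // Streaming approach: convert records lazily and hand the iterator
    // straight to the writer, avoiding the intermediate collection.
    static <R, I> Iterator<I> buildLazily(Iterator<R> dataRecords, Function<R, I> toIndexRecord) {
      return new Iterator<I>() {
        @Override
        public boolean hasNext() {
          return dataRecords.hasNext();
        }

        @Override
        public I next() {
          return toIndexRecord.apply(dataRecords.next());
        }
      };
    }
  }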

Steps to reproduce:
Build the secondary index for a ~100 GB table with Parquet files of 100 MB+ each (a reproduction sketch follows).
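
A hedged reproduction sketch: it assumes an existing Hudi table of roughly this size, the table and column names are placeholders, and the CREATE INDEX ... USING secondary_index DDL follows the Hudi 1.0 Spark SQL syntax (verify against your Hudi version):

  import org.apache.spark.sql.SparkSession;

  public class ReproduceSecondaryIndexOom {
    public static void main(String[] args) {
      // Placeholder session config; run with standard executor memory to hit the OOM.
      SparkSession spark = SparkSession.builder()
          .appName("hudi-si-oom-repro")
          .getOrCreate();

      // Building the secondary index triggers the metadata table writes
      // where the OOM is observed. Table and column names are placeholders.
      spark.sql("CREATE INDEX idx_secondary ON hudi_table_100gb USING secondary_index(some_column)");
    }
  }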

Environment

  • Hudi Version: 1.0.0+ (any version with Secondary Index support)
  • Spark Version: 3.5.x
  • Table Type: MOR or COW
  • Table Size: 10M+ records, 100+ files
  • Heap Size: Standard executor memory (insufficient for the non-streaming approach)

Logs and Stack Trace

  java.lang.OutOfMemoryError: Java heap space
      at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
      at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:100)
      at org.apache.avro.io.BufferedBinaryEncoder.flushBuffer(BufferedBinaryEncoder.java:96)
      at org.apache.hudi.avro.HoodieAvroUtils.indexedRecordToBytesStream(HoodieAvroUtils.java:152)
      at org.apache.hudi.common.util.HFileUtils.serializeRecordsToLogBlock(HFileUtils.java:221)
      at org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:501)
      at org.apache.hudi.io.HoodieAppendHandle.flushToDiskIfRequired(HoodieAppendHandle.java:681)
