
[BUG] Integer overflow / truncation risks across the codebase #14471

@gerashegalov

Description


This report was generated by claude-4.6-opus-high

Context

PR #14466 fixed integer overflow in KudoGpuTableOperator.concat where:

  1. .map(...).sum (Int) was replaced with .foldLeft(0L) (Long) for byte-size computation
  2. 8 * (n + 1) (Int multiply) was changed to 8L * (n + 1)
  3. var currentOffset = 0 (Int) was changed to 0L (Long)

A codebase-wide audit found additional locations susceptible to the same class of bug. All involve Int arithmetic on values that represent byte sizes, byte offsets, or row counts that can exceed Int.MaxValue (~2 GB / ~2 billion rows).


HIGH Severity

1. GpuShuffleCoalesceExec.scala — .map(getNumRows).sum as Int (3 locations)

Lines 351, 400, 428 — the zero-column branch of all three concat methods:

// Line 351 — JCudfTableOperator.concat
val totalRowsNum = tables.map(getNumRows).sum
cudf_utils.HostConcatResultUtil.rowsOnlyHostConcatResult(totalRowsNum)

// Line 400 — KudoTableOperator.concat
val totalRowsNum = columns.map(getNumRows).sum
RowCountOnlyMergeResult(totalRowsNum)

// Line 428 — KudoGpuTableOperator.concat (same method fixed by #14466, but the numCols==0 branch)
val totalRowsNum = columns.map(getNumRows).sum
new ColumnarBatch(Array.empty, totalRowsNum)

getNumRows returns Int, so .sum accumulates in Int and silently wraps to a wrong row count when many small batches are concatenated. The canAddToBatch guard (line 615) limits numRowsInBatch, but the sum in concat does not benefit from that guard directly.
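The wrap and the fix from PR #14466 can be seen in a standalone sketch (illustration only, not the spark-rapids code; the row counts are assumed values):

```scala
// Three batches of 1 billion rows each: 3e9 total, which exceeds Int.MaxValue (~2.1e9).
val rowCounts = Seq(1000000000, 1000000000, 1000000000)

val intSum  = rowCounts.sum                               // Int accumulation: wraps negative
val longSum = rowCounts.foldLeft(0L)((acc, n) => acc + n) // Long accumulation: correct

println(s"Int sum:  $intSum")   // -1294967296
println(s"Long sum: $longSum")  // 3000000000
```

The same foldLeft(0L) pattern applies to all three locations above.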

2. GpuBroadcastExchangeExec.scala — broadcast row-count sum

// ~Line 582
numRows = withResource(buffers) { _ =>
  ...
  buffers.map(_.header.getNumRows).sum
}

Same .map(...).sum on Int row counts in the broadcast path.

3. MultithreadedShuffleBufferCatalog.scala — size().toInt truncation

// Line 255-263
override def size(): Long = segments.map(_.length).sum

override def nioByteBuffer(): ByteBuffer = {
  ...
  val totalSize = size().toInt   // <-- truncation
  val buffer = ByteBuffer.allocate(totalSize)

size() returns Long, but nioByteBuffer() calls .toInt — a shuffle block larger than 2 GB gets a truncated or negative allocation size.
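A minimal sketch of this failure mode (the 3 GB size is an assumed value): the truncated Int is negative, so ByteBuffer.allocate would throw rather than allocate the intended buffer, and other sizes would silently allocate too little.

```scala
val totalSize: Long = 3L * 1024 * 1024 * 1024 // 3221225472 bytes, > Int.MaxValue
val truncated = totalSize.toInt               // wraps to a negative value

println(s"truncated = $truncated")            // -1073741824
// java.lang.Math.toIntExact(totalSize) would instead throw ArithmeticException,
// failing fast at the conversion rather than at (or after) the allocation.
```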

4. GpuTextBasedPartitionReader.scala — Int multiply for offsets + .toInt truncation

// ~Line 175-213
private var offsetsBuffer = HostMemoryBuffer.allocate(
  (rowsAllocated + 1) * DType.INT32.getSizeInBytes)  // <-- Int * Int overflow
...
offsetsBuffer.setInt(
  (numRows + 1) * DType.INT32.getSizeInBytes.toLong,
  dataLocation.toInt)  // <-- truncation for data > 2GB

(rowsAllocated + 1) * 4 is pure Int multiplication — overflows for large row counts. dataLocation.toInt truncates when data exceeds 2 GB.
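A minimal sketch with a hypothetical row count: the pure Int multiply wraps, while promoting one operand to Long before multiplying yields the intended byte size.

```scala
val rowsAllocated = 600000000                  // 600M rows (assumed for illustration)
val intBytes  = (rowsAllocated + 1) * 4        // Int multiply: wraps negative
val longBytes = (rowsAllocated + 1).toLong * 4 // Long multiply: 2400000004 bytes

println(s"Int: $intBytes, Long: $longBytes")
```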

5. GpuParquetScan.scala — footer size .toInt before buffer allocation

// ~Line 597-598
val hmbLength = (fileLen - footerIndex).toInt  // <-- truncation
closeOnExcept(HostMemoryBuffer.allocate(hmbLength + MAGIC.length, false)) { outBuffer =>

If the footer span exceeds 2 GB, .toInt truncates, allocating a wrong-sized buffer.

6. GpuPartitioning.scala — getLength.toInt / getLong(...).toInt for large buffers

// ~Line 236-252
val idx = offsetsHost.getLong((i) * elemSize).toInt  // <-- truncation
...
new SlicedSerializedColumnVector(dataHost, start, dataHost.getLength.toInt)  // <-- truncation

Truncation of Long buffer positions/lengths to Int when serialized data exceeds 2 GB.


MEDIUM Severity

7. GpuParquetScan.scala — calculateExtraMemoryForParquetFooter pure Int arithmetic

// ~Line 1615-1617
def calculateExtraMemoryForParquetFooter(numCols: Int, numBlocks: Int): Int = {
  val numColumnChunks = numCols * numBlocks   // <-- Int overflow
  numColumnChunks * 2 * 8                     // <-- Int overflow

numCols * numBlocks * 16 — all Int — overflows with very wide tables and many row groups.
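A hedged sketch of a Long-returning variant (the name and shape are assumed from the snippet above, not the actual fix): promoting before the first multiply keeps the whole computation in Long.

```scala
// Hypothetical Long-returning replacement for calculateExtraMemoryForParquetFooter.
def calculateExtraMemoryForParquetFooterLong(numCols: Int, numBlocks: Int): Long = {
  val numColumnChunks = numCols.toLong * numBlocks // Long multiply, cannot wrap
  numColumnChunks * 2 * 8                          // stays in Long
}
```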

8. RapidsHostColumnBuilder.java — Int bit-shift for offset indexing

// ~Line 615
data.setLong(currentIndex++ << bitShiftBySize, value);

currentIndex << bitShiftBySize is 32-bit int shift; wraps if the logical byte offset exceeds Integer.MAX_VALUE.
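A sketch of the wrap with assumed values: for 8-byte elements (shift of 3), the 32-bit shift wraps once the logical byte offset passes Integer.MAX_VALUE, while widening the index first does not.

```scala
val currentIndex = 300000000                           // 300M long-sized entries (assumed)
val bitShiftBySize = 3                                 // log2(8 bytes)
val intOffset  = currentIndex << bitShiftBySize        // 32-bit shift: wraps negative
val longOffset = currentIndex.toLong << bitShiftBySize // 64-bit shift: 2400000000

println(s"Int: $intOffset, Long: $longOffset")
```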

9. ParquetCachedBatchSerializer.scala — var pos = 0 (Int) exposed as getPos: Long

// ~Line 141-161
new DelegatingPositionOutputStream(stream) {
  var pos = 0                          // <-- Int
  override def getPos: Long = pos      // <-- returned as Long
  override def write(b: Int): Unit = {
    super.write(b)
    pos += Integer.BYTES               // <-- wraps past 2GB
  }

pos tracks byte position as Int but is returned as Long via getPos. Wraps for output exceeding 2 GB.

10. GpuOrcScan.scala — ORC footer cache hmb.getLength.toInt

// ~Line 1796-1800
val bb = ByteBuffer.allocate(hmb.getLength.toInt)        // <-- truncation
hmb.getBytes(bb.array(), 0, 0, hmb.getLength.toInt)      // <-- truncation

Footer cache HMB length truncated to Int.

11. UCXConnection.scala (shuffle-plugin) — rkeys.map(_.capacity).sum as Int

// ~Line 411-414
val size = java.lang.Long.BYTES + java.lang.Integer.BYTES +
    (java.lang.Integer.BYTES * rkeys.size) +  // <-- Int * Int
    rkeys.map(_.capacity).sum                 // <-- Int .sum

Pure Int arithmetic for computing handshake buffer size.


LOW Severity (bounded by design or unlikely to trigger)

Location — Notes

GpuParquetScan.scala:3030,3042,3598 — .map(_.getRowCount).sum.toInt: getRowCount is Long and .sum is Long, but .toInt truncates; row count per block is typically bounded
GpuMultiFileReader.scala:759-831 — .map(_.bytes).sum: bytes is Long, .sum is Long (safe)
GpuShuffleCoalesceExec.scala:504 — numRowsInBatch: Int, guarded by the canAddToBatch check at line 615
GpuAggregateExec.scala:263,1080 — .map(_.sizeInBytes).sum: sizeInBytes is Long (safe)
GpuColumnarBatchSerializer.scala:126 — HostMemoryBuffer.allocate(header.getDataLen): getDataLen is Long (safe)
RapidsDeletionVectorStore.scala:130 — size: Int passed from the Delta API (API limitation, not an arithmetic bug)
UCX size.toInt calls — UCXShuffleTransport.scala:103, UCXConnection.scala:169, UCX.scala:870,1030; metadata payloads are typically small

Suggested fix patterns

  • .map(f).sum → .foldLeft(0L)((acc, x) => acc + f(x)) or .map(x => f(x).toLong).sum to force Long accumulation
  • Int * Int for sizes → use Long literal: 8L * n or n.toLong * m
  • var offset = 0 for byte tracking → var offset = 0L
  • .toInt on buffer sizes → add bounds check / throw on overflow, or redesign API to use Long
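The last pattern can be sketched as a bounds-checked conversion that fails loudly instead of truncating (toIntChecked is a hypothetical helper name, not an existing API):

```scala
// Throws instead of silently truncating values outside the Int range.
def toIntChecked(v: Long): Int = {
  require(v >= Int.MinValue && v <= Int.MaxValue, s"$v does not fit in an Int")
  v.toInt
}
// java.lang.Math.toIntExact does the same and throws ArithmeticException on overflow.
```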
