Skip to content

ClpDataSource drops matching rows and reports inflated row counts for IR streams #49

@20001020ycx

Description

@20001020ycx

Bug description

Background

Velox executes SQL queries as pipelines of operators. For this bug, we are concerned with TableScan which drives the reading of the data.

  • TableScan fetches data in batches by repeatedly calling into a connector-provided DataSource, and feeds the resulting rows into the rest of the query pipeline.

  • DataSource is the connector-side interface that TableScan calls to actually read data. Each connector (Hive, CLP, etc.) implements this interface. The lifecycle is:

  1. TableScan calls addSplit(split) to assign a "split" to the DataSource.
  2. TableScan calls next(size, future) repeatedly to fetch rows in batches of 1024 rows for a given DataSource.
  3. next returns a RowVector for each batch that contains matching rows.
  4. next returns nullptr to signal that the split is fully consumed, at which point TableScan moves on to the next split.

Note, A split must be fully drained by next before another split can be added.

Problem Statement

Bug 1: Premature split termination

ClpDataSource::next returns nullptr whenever the current batch has zero matching rows:

auto rowsScanned = cursor_->fetchNext(size);
auto rowsFiltered = cursor_->getNumFilteredRows();
if (rowsFiltered == 0) {
    return nullptr;
}

Per the DataSource contract, nullptr signals that the split is fully consumed. TableScan then stops calling next and moves on to the next split. However, rowsFiltered == 0 only means the current batch had no matches — later batches in the same split may still contain matching rows.

Impact: This compromises the correctness of the returned result! Matching rows are silently dropped whenever a non-matching batch precedes them within the same split.

Example: An IR split has 2048 log events, batch size 1024. Events 1–1024 match nothing; events 1025–1100 has 1 match. The first next call scans batch 1, gets 0 filtered rows, returns nullptr. The matching row in batch 2 is never returned.

Bug 2: Inflated completedRows_

ClpIrCursor::fetchNext returns irDeserializer_->get_num_log_events_deserialized(), which is a cumulative count across all invocations, not a per-batch count. However, ClpDataSource::next adds this directly to completedRows_:

  completedRows_ += rowsScanned; // rowsScanned is cumulative

This causes completedRows_ to grow quadratically.

Example: An IR split contains 3000 log events with a batch size of 1024. Every batch has matching rows.

Call fetchNext returns completedRows_ += completedRows_ Correct
1 1024 1024 1024 1024
2 2048 2048 3072 2048
3 3000 3000 6072 3000

Impact: TableScan reads completedRows_ via getCompletedRows() and records it as
rawInputPositions in operator statistics. The inflated count causes rawInputPositions
to over-report the number of rows scanned (~2x in this example), making query performance
stats unreliable.

System information

Partial result is returned, incorrect functionality

Relevant logs

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions