Bug description
Background
Velox executes SQL queries as pipelines of operators. For this bug, we are concerned with TableScan which drives the reading of the data.
-
TableScan fetches data in batches by repeatedly calling into a connector-provided DataSource, and feeds the resulting rows into the rest of the query pipeline.
-
DataSource is the connector-side interface that TableScan calls to actually read data. Each connector (Hive, CLP, etc.) implements this interface. The lifecycle is:
- TableScan calls addSplit(split) to assign a "split" to the
DataSource.
- TableScan calls next(size, future) repeatedly to fetch rows in batches of 1024 rows for a given
DataSource.
- next returns a RowVector for each batch that contains matching rows.
- next returns nullptr to signal that the split is fully consumed, at which point TableScan moves on to the next split.
Note, A split must be fully drained by next before another split can be added.
Problem Statement
Bug 1: Premature split termination
ClpDataSource::next returns nullptr whenever the current batch has zero matching rows:
auto rowsScanned = cursor_->fetchNext(size);
auto rowsFiltered = cursor_->getNumFilteredRows();
if (rowsFiltered == 0) {
return nullptr;
}
Per the DataSource contract, nullptr signals that the split is fully consumed. TableScan then stops calling next and moves on to the next split. However, rowsFiltered == 0 only means the current batch had no matches — later batches in the same split may still contain matching rows.
Impact: This compromises the correctness of the returned result! Matching rows are silently dropped whenever a non-matching batch precedes them within the same split.
Example: An IR split has 2048 log events, batch size 1024. Events 1–1024 match nothing; events 1025–1100 has 1 match. The first next call scans batch 1, gets 0 filtered rows, returns nullptr. The matching row in batch 2 is never returned.
Bug 2: Inflated completedRows_
ClpIrCursor::fetchNext returns irDeserializer_->get_num_log_events_deserialized(), which is a cumulative count across all invocations, not a per-batch count. However, ClpDataSource::next adds this directly to completedRows_:
completedRows_ += rowsScanned; // rowsScanned is cumulative
This causes completedRows_ to grow quadratically.
Example: An IR split contains 3000 log events with a batch size of 1024. Every batch has matching rows.
| Call |
fetchNext returns |
completedRows_ += |
completedRows_ |
Correct |
| 1 |
1024 |
1024 |
1024 |
1024 |
| 2 |
2048 |
2048 |
3072 |
2048 |
| 3 |
3000 |
3000 |
6072 |
3000 |
Impact: TableScan reads completedRows_ via getCompletedRows() and records it as
rawInputPositions in operator statistics. The inflated count causes rawInputPositions
to over-report the number of rows scanned (~2x in this example), making query performance
stats unreliable.
System information
Partial result is returned, incorrect functionality
Relevant logs
Bug description
Background
Velox executes SQL queries as pipelines of operators. For this bug, we are concerned with
TableScanwhich drives the reading of the data.TableScanfetches data in batches by repeatedly calling into a connector-providedDataSource, and feeds the resulting rows into the rest of the query pipeline.DataSourceis the connector-side interface thatTableScancalls to actually read data. Each connector (Hive, CLP, etc.) implements this interface. The lifecycle is:DataSource.DataSource.Note, A split must be fully drained by next before another split can be added.
Problem Statement
Bug 1: Premature split termination
ClpDataSource::next returns nullptr whenever the current batch has zero matching rows:
Per the DataSource contract, nullptr signals that the split is fully consumed. TableScan then stops calling next and moves on to the next split. However, rowsFiltered == 0 only means the current batch had no matches — later batches in the same split may still contain matching rows.
Impact: This compromises the correctness of the returned result! Matching rows are silently dropped whenever a non-matching batch precedes them within the same split.
Example: An IR split has 2048 log events, batch size 1024. Events 1–1024 match nothing; events 1025–1100 has 1 match. The first next call scans batch 1, gets 0 filtered rows, returns nullptr. The matching row in batch 2 is never returned.
Bug 2: Inflated completedRows_
ClpIrCursor::fetchNext returns irDeserializer_->get_num_log_events_deserialized(), which is a cumulative count across all invocations, not a per-batch count. However, ClpDataSource::next adds this directly to completedRows_:
This causes completedRows_ to grow quadratically.
Example: An IR split contains 3000 log events with a batch size of 1024. Every batch has matching rows.
fetchNextreturnscompletedRows_ +=completedRows_Impact: TableScan reads completedRows_ via getCompletedRows() and records it as
rawInputPositions in operator statistics. The inflated count causes rawInputPositions
to over-report the number of rows scanned (~2x in this example), making query performance
stats unreliable.
System information
Partial result is returned, incorrect functionality
Relevant logs