Commit 7534c2e
bdicemeta-codesync[bot] authored and committed
fix(cudf): Refactor CudfToVelox output batching to avoid O(n) D->H syncs (facebookincubator#16620)
Summary: This PR implements an optimized `CudfToVelox` batching strategy that results in fewer device-to-host copies and corresponding stream synchronizations, regardless of the input/output batch sizes. The previous implementation split large GPU inputs by doing `cudf::split` + two `cudf::table` deep-copies per output batch, resulting in `O(n_batches)` GPU kernel launches and D->H synchronizations. With maxOutBatchRows=1 and 1000 rows this took ~31s and would timeout under compute-sanitizer. New strategy (see block comment above `getOutput`): **(A)** Large input (`>= targetBatchSize` rows): convert the GPU table to Velox once via a single `to_arrow_host` + synchronize, store it in `veloxBuffer_`, then emit CPU-side slices via `BaseVector::slice()` on subsequent `getOutput()` calls. Zero additional D->H work per slice. **(B)** Small inputs (`< targetBatchSize` rows, e.g. `CudfFilterProject` with high filter selectivity): GPU-concatenate `inputs_` until we reach `targetBatchSize`, then convert the merged table in one D->H transfer. This preserves the GPU-side batching that avoids emitting many undersized Velox batches downstream. Both paths issue exactly one `toVeloxColumn` + `stream.synchronize()` per output batch. The `outputBatchRows` test runtime decreased from around 30 seconds to around 3 seconds. Pull Request resolved: facebookincubator#16620 Reviewed By: srsuryadev Differential Revision: D99330510 Pulled By: peterenescu fbshipit-source-id: a97e283c72f78d9d8733362ec7a3f0587fd3fa76
1 parent f736ec1 commit 7534c2e
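The two-path strategy in the commit message can be sketched with a CPU-only stand-in. This is a minimal illustration, not the Velox implementation: `Batcher`, its members, and `getOutput(target)` are hypothetical names, a `std::vector<int>` plays the role of a GPU batch, and `conversions` counts the simulated `to_arrow_host` + synchronize calls.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for CudfToVelox batching: each std::vector<int>
// models one GPU input batch; "conversion" (one D->H transfer + sync) is
// counted explicitly so the O(1)-per-output-batch property is visible.
struct Batcher {
  std::deque<std::vector<int>> inputs;  // pending "GPU" batches
  std::vector<int> veloxBuffer;         // converted CPU-side buffer
  std::size_t veloxOffset = 0;          // next slice offset into the buffer
  int conversions = 0;                  // simulated D->H sync count

  // Returns the next output batch of at most `target` rows; empty when done.
  std::vector<int> getOutput(std::size_t target) {
    if (veloxOffset >= veloxBuffer.size()) {
      // Buffer drained: refill it from pending inputs.
      veloxBuffer.clear();
      veloxOffset = 0;
      if (inputs.empty()) {
        return {};
      }
      if (inputs.front().size() >= target) {
        // Case A: large front input. One conversion, then CPU-side slicing.
        veloxBuffer = std::move(inputs.front());
        inputs.pop_front();
      } else {
        // Case B: concatenate small inputs up to target, then convert once.
        while (!inputs.empty() && veloxBuffer.size() < target) {
          auto& in = inputs.front();
          veloxBuffer.insert(veloxBuffer.end(), in.begin(), in.end());
          inputs.pop_front();
        }
      }
      ++conversions;  // models the single to_arrow_host + stream.synchronize()
    }
    // Emit the next CPU-side slice; no further "D->H" work per slice.
    const std::size_t take = std::min(target, veloxBuffer.size() - veloxOffset);
    std::vector<int> slice(
        veloxBuffer.begin() + veloxOffset,
        veloxBuffer.begin() + veloxOffset + take);
    veloxOffset += take;
    return slice;
  }
};
```

Draining one 10-row input with a 3-row target yields four output batches from a single simulated conversion (case A); six 1-row inputs with a 4-row target are concatenated and converted once (case B).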

File tree

3 files changed: +114 -89 lines changed

velox/experimental/cudf/exec/CudfConversion.cpp

Lines changed: 107 additions & 88 deletions
@@ -26,8 +26,7 @@
 #include "velox/exec/Operator.h"
 #include "velox/vector/ComplexVector.h"
 
-#include <cudf/copying.hpp>
-#include <cudf/table/table.hpp>
+#include <cudf/types.hpp>
 #include <cudf/utilities/default_stream.hpp>
 
 namespace facebook::velox::cudf_velox {
@@ -204,116 +203,136 @@ std::optional<uint64_t> CudfToVelox::averageRowSize() {
   return averageRowSize_;
 }
 
+// Pop inputs_.front(), convert its GPU table to a Velox RowVector via a
+// single to_arrow_host + synchronize, and return it. The caller is
+// responsible for any further slicing.
+RowVectorPtr CudfToVelox::convertFrontToVelox() {
+  auto cudfVector = std::move(inputs_.front());
+  inputs_.pop_front();
+  auto stream = cudfVector->stream();
+  auto tableView = cudfVector->getTableView();
+  auto output = with_arrow::toVeloxColumn(
+      tableView, pool(), outputType_, "", stream, get_temp_mr());
+  stream.synchronize();
+  output->setType(outputType_);
+  return output;
+}
+
+// Output batching strategy
+// ========================
+// The key constraint is minimising D->H (device-to-host) transfers.
+// Each call to toVeloxColumn / to_arrow_host triggers one D->H copy per
+// column, so calling it once per output batch (rather than once per row
+// or once per input batch) is critical for performance.
+//
+// Two cases arise depending on the size of the front GPU input relative
+// to targetBatchSize:
+//
+// (A) Front input >= targetBatchSize (e.g. CudfOrderBy: one large sorted
+//     table). We convert the whole input to Velox in one shot and then
+//     slice it purely on the CPU using BaseVector::slice(). Subsequent
+//     getOutput() calls return successive CPU slices with no additional
+//     D->H work until veloxBuffer_ is exhausted.
+//
+// (B) Front input < targetBatchSize (e.g. CudfFilterProject with high
+//     selectivity: many small GPU batches). We concatenate inputs on device
+//     until we accumulate targetBatchSize rows, then convert the concat
+//     result to Velox in one shot. This preserves the GPU-side merge
+//     that avoids emitting many undersized Velox batches downstream.
+//
+// In both cases exactly one toVeloxColumn + stream.synchronize() is issued
+// per output batch, regardless of how many GPU inputs were consumed.
 RowVectorPtr CudfToVelox::getOutput() {
   VELOX_NVTX_OPERATOR_FUNC_RANGE();
-  if (finished_ || inputs_.empty()) {
-    finished_ = noMoreInput_ && inputs_.empty();
+  if (finished_) {
     return nullptr;
   }
 
-  // Get the target batch size
-  const auto targetBatchSize = outputBatchRows(averageRowSize());
-
-  // Process single input directly in these cases:
-  // 1. In passthrough mode
-  // 2. If we only have one input and it's smaller than or equal to the target
-  //    batch size
-  if (isPassthroughMode() ||
-      (inputs_.size() == 1 && inputs_.front()->size() <= targetBatchSize)) {
-    // Move the CudfVector out to keep it alive while we use the view.
-    // This avoids expensive materialization when constructed from packed_table.
-    auto cudfVector = std::move(inputs_.front());
-    inputs_.pop_front();
-
-    auto tableView = cudfVector->getTableView();
-    auto stream = cudfVector->stream();
-    if (tableView.num_rows() == 0) {
-      finished_ = noMoreInput_ && inputs_.empty();
+  // Drain veloxBuffer_ (populated on a previous call) before consuming
+  // more GPU inputs.
+  if (!veloxBuffer_) {
+    if (inputs_.empty()) {
+      finished_ = noMoreInput_;
       return nullptr;
     }
-    RowVectorPtr output = with_arrow::toVeloxColumn(
-        tableView, pool(), outputType_, "", stream, get_temp_mr());
-    stream.synchronize();
-    finished_ = noMoreInput_ && inputs_.empty();
-    output->setType(outputType_);
-    // cudfVector goes out of scope here, freeing the GPU memory
-    return output;
-  }
 
-  // Calculate how many tables we need to concatenate to reach the target batch
-  // size and collect them in a vector
-  std::vector<CudfVectorPtr> selectedInputs;
-  vector_size_t totalSize = 0;
+    // Passthrough mode: emit each GPU input as a single Velox batch with no
+    // re-batching. Used when the caller knows the batch size is already
+    // correct (e.g. default pipeline without explicit batch-size overrides).
+    if (isPassthroughMode()) {
+      auto output = convertFrontToVelox();
+      finished_ = noMoreInput_ && inputs_.empty();
+      if (output->size() == 0) {
+        return nullptr;
+      }
+      return output;
+    }
 
-  while (!inputs_.empty() && totalSize < targetBatchSize) {
-    auto& input = inputs_.front();
-    if (totalSize + input->size() <= targetBatchSize) {
-      totalSize += input->size();
-      selectedInputs.push_back(std::move(input));
-      inputs_.pop_front();
+    const auto targetBatchSize = outputBatchRows(averageRowSize());
+
+    if (static_cast<vector_size_t>(inputs_.front()->size()) >=
+        targetBatchSize) {
+      // Case A: large input. Convert once; subsequent calls slice CPU-side.
+      veloxBuffer_ = convertFrontToVelox();
+      veloxOffset_ = 0;
+      averageRowSize_ = std::nullopt; // recompute from next input
    } else {
-      // If the next input would exceed targetBatchSize,
-      // we need to split it and only take what we need
-      auto cudfTableView = input->getTableView();
-      auto stream = input->stream();
-      auto partitions = std::vector<cudf::size_type>{
-          static_cast<cudf::size_type>(targetBatchSize - totalSize)};
-      auto tableSplits = cudf::split(cudfTableView, partitions, stream);
-
-      // Create new CudfVector from the first part
-      auto firstPart =
-          std::make_unique<cudf::table>(tableSplits[0], stream, get_temp_mr());
-      auto firstPartSize = firstPart->num_rows();
-      auto firstPartVector = std::make_shared<CudfVector>(
-          pool(), input->type(), firstPartSize, std::move(firstPart), stream);
-
-      // Create new CudfVector from the second part
-      auto secondPart =
-          std::make_unique<cudf::table>(tableSplits[1], stream, get_temp_mr());
-      auto secondPartSize = secondPart->num_rows();
-      auto secondPartVector = std::make_shared<CudfVector>(
-          pool(), input->type(), secondPartSize, std::move(secondPart), stream);
-
-      // Replace the original input with the second part
-      input = std::move(secondPartVector);
-
-      // Add the first part to selectedInputs
-      selectedInputs.push_back(std::move(firstPartVector));
-      totalSize += firstPartSize;
-      break;
+      // Case B: small inputs. GPU-concat until we reach targetBatchSize,
+      // then convert the merged table in one D->H transfer.
+      auto stream = inputs_.front()->stream();
+      std::vector<CudfVectorPtr> toConcat;
+      vector_size_t accumulated = 0;
+      while (!inputs_.empty() && accumulated < targetBatchSize) {
+        accumulated += static_cast<vector_size_t>(inputs_.front()->size());
+        toConcat.push_back(std::move(inputs_.front()));
+        inputs_.pop_front();
+      }
+      VELOX_CHECK_LE(
+          accumulated,
+          std::numeric_limits<cudf::size_type>::max(),
+          "Accumulated row count exceeds cudf int32 limit");
+      auto concatTable = getConcatenatedTable(
+          std::move(toConcat), outputType_, stream, get_temp_mr());
+      auto tableView = concatTable->view();
+      veloxBuffer_ = with_arrow::toVeloxColumn(
+          tableView, pool(), outputType_, "", stream, get_temp_mr());
+      stream.synchronize();
+      veloxBuffer_->setType(outputType_);
+      veloxOffset_ = 0;
+      averageRowSize_ = std::nullopt;
    }
  }
 
-  finished_ = noMoreInput_ && inputs_.empty();
-
-  // If we have no inputs to process, return nullptr
-  if (selectedInputs.empty()) {
+  // Slice veloxBuffer_ on the CPU to produce the next output batch.
+  const auto totalRows = static_cast<vector_size_t>(veloxBuffer_->size());
+  if (veloxOffset_ >= totalRows) {
+    veloxBuffer_.reset();
+    finished_ = noMoreInput_ && inputs_.empty();
    return nullptr;
  }
 
-  // Concatenate the selected tables on the GPU
-  auto stream = cudfGlobalStreamPool().get_stream();
-  auto resultTable = getConcatenatedTable(
-      std::move(selectedInputs), outputType_, stream, get_temp_mr());
+  const auto targetBatchSize = outputBatchRows(
+      veloxBuffer_->estimateFlatSize() /
+      static_cast<uint64_t>(std::max<vector_size_t>(totalRows, 1)));
+  const auto take = std::min(targetBatchSize, totalRows - veloxOffset_);
 
-  // Convert the concatenated table to a RowVector
-  const auto size = resultTable->num_rows();
-  VELOX_CHECK_NOT_NULL(resultTable);
-  if (size == 0) {
-    return nullptr;
+  auto slice = std::dynamic_pointer_cast<RowVector>(
+      veloxBuffer_->slice(veloxOffset_, take));
+  VELOX_CHECK_NOT_NULL(slice);
+  veloxOffset_ += take;
+
+  if (veloxOffset_ >= totalRows) {
+    veloxBuffer_.reset();
+    finished_ = noMoreInput_ && inputs_.empty();
  }
 
-  RowVectorPtr output = with_arrow::toVeloxColumn(
-      resultTable->view(), pool(), outputType_, "", stream, get_temp_mr());
-  stream.synchronize();
-  finished_ = noMoreInput_ && inputs_.empty();
-  output->setType(outputType_);
-  return output;
+  return slice;
 }
 
 void CudfToVelox::close() {
   exec::Operator::close();
   inputs_.clear();
+  veloxBuffer_.reset();
 }
 
 } // namespace facebook::velox::cudf_velox
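The CPU-side drain at the end of the new `getOutput` reduces to simple offset arithmetic: each call takes `min(targetBatchSize, totalRows - veloxOffset_)` rows and advances the offset until the buffer is exhausted. A standalone sketch of just that arithmetic, under the assumption of a fixed target per call (`planSlices` is a hypothetical helper, not part of the patch):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Plan the (offset, length) of each output slice taken from a converted
// buffer of totalRows rows, with at most targetBatchSize rows per slice.
// Mirrors the veloxOffset_ bookkeeping in CudfToVelox::getOutput.
std::vector<std::pair<int, int>> planSlices(int totalRows, int targetBatchSize) {
  std::vector<std::pair<int, int>> slices;
  int offset = 0;
  while (offset < totalRows) {
    // Last slice may be shorter than the target.
    const int take = std::min(targetBatchSize, totalRows - offset);
    slices.emplace_back(offset, take);
    offset += take;
  }
  return slices;
}
```

For example, a 10-row buffer with a 3-row target yields slices at offsets 0, 3, 6, and 9, the last one 1 row long; the slices always cover the buffer exactly once.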

velox/experimental/cudf/exec/CudfConversion.h

Lines changed: 6 additions & 0 deletions
@@ -96,8 +96,14 @@ class CudfToVelox : public exec::Operator, public NvtxHelper {
  private:
   bool isPassthroughMode() const;
   std::optional<uint64_t> averageRowSize();
+  // Convert inputs_.front() to Velox once; slice it CPU-side per batch.
+  RowVectorPtr convertFrontToVelox();
   std::optional<uint64_t> averageRowSize_;
   std::deque<CudfVectorPtr> inputs_;
+  // Converted CPU-side buffer being drained by successive getOutput() calls.
+  RowVectorPtr veloxBuffer_;
+  // Current offset into veloxBuffer_ for the next slice.
+  vector_size_t veloxOffset_{0};
   bool finished_ = false;
 };

velox/experimental/cudf/exec/CudfOrderBy.cpp

Lines changed: 1 addition & 1 deletion
@@ -98,7 +98,7 @@ RowVectorPtr CudfOrderBy::getOutput() {
   if (finished_ || !noMoreInput_) {
     return nullptr;
   }
-  finished_ = noMoreInput_;
+  finished_ = true;
   return outputTable_;
 }
