50% of the time in the tpch1 benchmark is spent syncing in cudf::detail::gather. The majority of that time is in table_device_view::create/column_device_view::create.
cudf calls table_device_view::create/column_device_view::create to copy the host column_view and table_view to the device. Notably, there is a fast path and a slow path in column_device_view::create:
https://github.com/rapidsai/cudf/blob/363920c83694ee88f2af12568241250d81983144/cpp/src/column/column_device_view.cu#L110-L120
std::unique_ptr<column_device_view, std::function<void(column_device_view*)>>
column_device_view::create(column_view source, rmm::cuda_stream_view stream)
{
  size_type num_children = source.num_children();
  if (num_children == 0) {
    // Can't use make_unique since the ctor is protected
    return std::unique_ptr<column_device_view>(new column_device_view(source));
  }
  return create_device_view_from_view<column_view, column_device_view>(source, stream);
}
The slow path is taken for types that have children, such as string, struct, list, and dictionary. It would be nice to avoid the slow path, which incurs expensive stream syncs.
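To make the branch concrete, here is a minimal host-only sketch of the same decision. `mock_column_view`, `takes_slow_path`, and `count_slow_path_columns` are hypothetical names for illustration, not cudf APIs. The sketch models only the `num_children() == 0` check from the snippet above, not the device copy or the sync itself.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cudf::column_view (NOT the real class).
// It models only the child hierarchy, which is all the branch inspects.
struct mock_column_view {
  std::vector<mock_column_view> children;
  std::size_t num_children() const { return children.size(); }
};

// Mirrors the check in column_device_view::create: a column with no
// children takes the fast path; anything with children takes the slow one.
inline bool takes_slow_path(mock_column_view const& col)
{
  return col.num_children() != 0;
}

// Counts the top-level columns of a table that would take the slow path.
inline std::size_t count_slow_path_columns(std::vector<mock_column_view> const& table)
{
  std::size_t n = 0;
  for (auto const& col : table) {
    if (takes_slow_path(col)) { ++n; }
  }
  return n;
}
```

Under these assumptions, a table with two fixed-width columns and one string-like column (which has children) would report one slow-path column.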
Collected these on 07d9aff
tpch_q1_cudf_parquet_10iters_gpu_samply.json.syms.json
tpch_q1_cudf_parquet_10iters_gpu_samply.json.gz