reduce stream syncs due to cudf::detail::gather #75

@jayshrivastava

Description

50% of the time in the tpch1 benchmark is spent syncing in cudf::detail::gather. The majority of that time is in table_device_view::create/column_device_view::create.

cudf calls table_device_view::create/column_device_view::create to copy the host column_view and table_view to the device. Notably, column_device_view::create has a fast path and a slow path:
https://github.com/rapidsai/cudf/blob/363920c83694ee88f2af12568241250d81983144/cpp/src/column/column_device_view.cu#L110-L120

std::unique_ptr<column_device_view, std::function<void(column_device_view*)>>
column_device_view::create(column_view source, rmm::cuda_stream_view stream)
{
  size_type num_children = source.num_children();
  if (num_children == 0) {
    // Can't use make_unique since the ctor is protected
    return std::unique_ptr<column_device_view>(new column_device_view(source));
  }

  return create_device_view_from_view<column_view, column_device_view>(source, stream);
}

The slow path is taken for types that have children, such as string, struct, list, and dictionary. It would be nice to avoid the slow path, which incurs expensive stream syncs.
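To gauge how many columns in a given table would hit the slow path, the branch quoted above can be modeled without cudf. The following is only a minimal sketch: `mock_column`, `create_path`, `classify`, and `count_slow_path_columns` are hypothetical names invented for illustration and are not part of the cudf API. The only thing taken from the source is the `num_children == 0` check.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for cudf::column_view. It models only the
// property the branch depends on: whether the column has children.
struct mock_column {
  std::string name;
  std::vector<mock_column> children;
};

// Hypothetical label for the two branches in column_device_view::create.
enum class create_path { fast, slow };

// Mirrors the quoted check: zero children -> fast path, otherwise slow path.
inline create_path classify(mock_column const& col)
{
  return col.children.empty() ? create_path::fast : create_path::slow;
}

// Counts the columns of a table that would take the slow path.
// Assumption: each slow-path column costs at least one stream sync,
// as described in this issue; the exact count per column is not modeled.
inline std::size_t count_slow_path_columns(std::vector<mock_column> const& table)
{
  std::size_t slow = 0;
  for (auto const& col : table) {
    if (classify(col) == create_path::slow) { ++slow; }
  }
  assert(slow <= table.size());  // sanity check on the tally
  return slow;
}
```

With a table of one childless (fixed-width) column and one column that has a single child (string-like), `count_slow_path_columns` reports one slow-path column. Any fix would aim to drive that count, or the sync cost per such column, toward zero.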

Collected these profiles on 07d9aff:
tpch_q1_cudf_parquet_10iters_gpu_samply.json.syms.json
tpch_q1_cudf_parquet_10iters_gpu_samply.json.gz
