Description
Describe the bug
I was surprised to see extra calls to cuLibraryLoadData just before the first decompress_page_data range when LIBCUDF_HOST_DECOMPRESSION=AUTO is set. This happens even when CUDA_MODULE_LOADING=EAGER.
This library-load region does not appear when LIBCUDF_HOST_DECOMPRESSION is unset or when LIBCUDF_HOST_DECOMPRESSION=ON. In the PDS benchmark it adds perhaps 25 ms per query.
The library loading seems to be triggered by sorting the blocks, I guess by SortPairsDescending.
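To compare the three settings mentioned above side by side, a small helper along these lines could prepare the environments. This is only a sketch: the function name `comparison_envs` and the labels are made up for illustration, and the benchmark or profiler command to run is deliberately left to the caller.

```python
import os


def comparison_envs(base_env=None):
    """Build one environment per LIBCUDF_HOST_DECOMPRESSION setting.

    Hypothetical helper (not part of libcudf). Every environment keeps
    CUDA_MODULE_LOADING=EAGER so that only the decompression setting varies.
    """
    base = dict(os.environ if base_env is None else base_env)
    base["CUDA_MODULE_LOADING"] = "EAGER"

    envs = {}
    # None stands for "variable unset".
    for label, value in (("unset", None), ("ON", "ON"), ("AUTO", "AUTO")):
        env = dict(base)
        if value is None:
            env.pop("LIBCUDF_HOST_DECOMPRESSION", None)
        else:
            env["LIBCUDF_HOST_DECOMPRESSION"] = value
        envs[label] = env
    return envs
```

Each returned dictionary can be passed as `env=` to `subprocess.run` together with whatever command reproduces the query under a profiler. The extra cuLibraryLoadData calls should then show up only in the AUTO run.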
Expected behavior
All the cuLibraryLoadData calls should happen at the beginning when CUDA_MODULE_LOADING=EAGER.