@asubah asubah commented Oct 6, 2022

This PR introduces an autotuning interface and a python script to autotune the CUDA kernel launch configurations.

Refer to the README.md for instructions on how to use the script.

The interface is simple: it just reads each kernel's block size from a file in src/cudaautotune/autotuning/kernel_configs.
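As a rough illustration of what such a file-based interface looks like (a minimal sketch with assumed names, not the PR's actual code):

```cpp
#include <fstream>
#include <string>

// Hypothetical sketch: look up a kernel's tuned block size from a
// per-kernel file, falling back to a compile-time default when no
// tuned value exists or the file is malformed.
inline int readBlockSize(const std::string& configDir,
                         const std::string& kernelName,
                         int fallback) {
  std::ifstream file(configDir + "/" + kernelName);
  int blockSize = fallback;
  if (!(file >> blockSize)) {
    blockSize = fallback;  // missing or unreadable file: keep the default
  }
  return blockSize;
}
```

A kernel launch site could then call something like `readBlockSize("src/cudaautotune/autotuning/kernel_configs", "fillHisto", 256)` and pass the result as the block dimension.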

Tested on a node with Intel Xeon Silver 4214R and a single T4.
Baseline configuration: src/cudaautotune/autotuning/tunables-baseline.csv.
Average Throughput: 1,195.66 ± 1.96 events/s.

Best config found by autotuning: src/cudaautotune/autotuning/tunables.csv.
Average Throughput: 1,269.46 ± 3.52 events/s.

The search space is too big, so I selected the subset of kernels with the highest runtimes and autotuned them with random search. I then fixed those kernels at the best configurations found and tuned another set of kernels, repeating until all kernels had been visited.

I appreciate your comments and suggestions regarding the interface or the script.

@asubah asubah changed the title [cuda] Implementing an autotuning interface and script [cudaautotune] Implementing an autotuning interface and script Oct 11, 2022
Makefile Outdated
export CUDA_CXXFLAGS := -I$(CUDA_BASE)/include
export CUDA_TEST_CXXFLAGS := -DGPU_DEBUG
export CUDA_LDFLAGS := -L$(CUDA_LIBDIR) -lcudart -lcudadevrt
export CUDA_LDFLAGS := -L$(CUDA_LIBDIR) -lcudart -lcudadevrt -lcudaautotunert
Collaborator

Could this library be added only in src/cudaautotune/Makefile?

namespace cms {
namespace cuda {

class ExecutionConfiguration {
Collaborator

Currently this class appears to be used only as a "namespace", because the objects have no state.

On the other hand, if I understood correctly, the launch parameters are read from the file for each kernel for each event. Would it be feasible to read each file only once in some way?
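One way to read each file only once, sketched here under assumed names (not the PR's code): cache the parsed block sizes in a function-local static map, so per-event launches reduce to a map lookup.

```cpp
#include <fstream>
#include <string>
#include <unordered_map>

// Hypothetical sketch: parse each kernel's config file at most once
// and cache the result, keyed by filename.
inline int cachedBlockSize(const std::string& filename, int fallback) {
  static std::unordered_map<std::string, int> cache;
  auto it = cache.find(filename);
  if (it != cache.end())
    return it->second;  // already read: no file I/O on this path
  int blockSize = fallback;
  std::ifstream file(filename);
  if (!(file >> blockSize))
    blockSize = fallback;  // missing or malformed file: keep the default
  cache.emplace(filename, blockSize);
  return blockSize;
}
```

Note that a real implementation would need a mutex (or a pre-populated, read-only map) if this can be called from several threads concurrently, since the cache is mutated on first use.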

file >> blockSize;
file.close();
} else {
std::cout << "Error in opening file " + filename + "\n";
Collaborator

Does it make sense to continue the program if a file cannot be opened? (Currently the function returns an indeterminate value.)
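One way to avoid returning an indeterminate value (a sketch under assumed names, not the PR's code) is to fail loudly by throwing:

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical sketch: throw instead of continuing with an
// uninitialized block size when the config file is unusable.
inline int readBlockSizeOrThrow(const std::string& filename) {
  std::ifstream file(filename);
  if (!file)
    throw std::runtime_error("Error opening file " + filename);
  int blockSize = 0;
  if (!(file >> blockSize))
    throw std::runtime_error("Error parsing block size from " + filename);
  return blockSize;
}
```

Alternatively, the function could log the error and return a documented default, but either way the caller should never see an uninitialized value.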

#ifdef __CUDACC__
uint32_t *poff = (uint32_t *)((char *)(h) + offsetof(Histo, off));
int32_t *ppsws = (int32_t *)((char *)(h) + offsetof(Histo, psws));
cms::cuda::ExecutionConfiguration exec;
Collaborator

This doesn't seem to be used.

TARGET_NAME := $(notdir $(TARGET_DIR))
TARGET := $(BASE_DIR)/$(TARGET_NAME)
include Makefile.deps
EXTERNAL_DEPENDS := $(cuda_EXTERNAL_DEPENDS)
Collaborator

This should be

Suggested change
EXTERNAL_DEPENDS := $(cuda_EXTERNAL_DEPENDS)
EXTERNAL_DEPENDS := $(cudaautotune_EXTERNAL_DEPENDS)

@@ -0,0 +1,12 @@
cuda_EXTERNAL_DEPENDS := TBB CUDA EIGEN BOOST BACKTRACE
Collaborator

This should be

Suggested change
cuda_EXTERNAL_DEPENDS := TBB CUDA EIGEN BOOST BACKTRACE
cudaautotune_EXTERNAL_DEPENDS := TBB CUDA EIGEN BOOST BACKTRACE

@@ -0,0 +1,149 @@
import argparse
Collaborator

How about adding

Suggested change
import argparse
#!/usr/bin/env python3
import argparse

so that the script could be run "directly" (./src/cudaautotune/autotuning/tuner.py)?

parser.add_argument('-p', '--process', type=pathlib.Path, nargs=1, required=True,
help='path to the program to be autotuned')
parser.add_argument('-c', '--configurations', type=pathlib.Path, nargs=1,
default=[pathlib.Path('src/cudaautotune/autotuning/kernel_configs')], help='path to save the configurations for the tunable process to read them. Default = autotuning/kernel_configs/')
Collaborator

Why is the default value wrapped in a list? (The same applies to the following two arguments.)

status = ""

cpu_threads = config[tunables["cpu_threads"]]
gpu_streams = cpu_threads + config[tunables["gpu_streams"]]
Collaborator

It is not clear to me why the number of concurrent events (--numberOfStreams below) is set to the number of CPU threads plus something named "GPU streams". Could you clarify the intended behavior here with respect to CPU threads and concurrent events?
