[cudaautotune] Implementing an autotuning interface and script #380
base: master
Conversation
Makefile (Outdated)
```diff
 export CUDA_CXXFLAGS := -I$(CUDA_BASE)/include
 export CUDA_TEST_CXXFLAGS := -DGPU_DEBUG
-export CUDA_LDFLAGS := -L$(CUDA_LIBDIR) -lcudart -lcudadevrt
+export CUDA_LDFLAGS := -L$(CUDA_LIBDIR) -lcudart -lcudadevrt -lcudaautotunert
```
Could this library be added only in src/cudaautotune/Makefile?
```cpp
namespace cms {
  namespace cuda {

    class ExecutionConfiguration {
```
Currently this class appears to be used only as a "namespace", because the objects have no state.
On the other hand, if I understood correctly, the launch parameters are read from the file for each kernel for each event. Would it be feasible to read each file only once in some way?
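One possible approach (just a sketch; the class name `ExecutionConfiguration` comes from this PR, but the caching helper, its member names, and the default-size parameter are hypothetical) would be to cache the configuration the first time a file is read, e.g. in a static map keyed by filename:

```cpp
#include <fstream>
#include <string>
#include <unordered_map>

namespace cms {
  namespace cuda {

    class ExecutionConfiguration {
    public:
      // Read the block size for a kernel, touching the file system only once
      // per configuration file; later calls return the cached value.
      static int blockSize(std::string const& filename, int defaultSize) {
        static std::unordered_map<std::string, int> cache;
        auto it = cache.find(filename);
        if (it != cache.end())
          return it->second;
        int value = defaultSize;
        std::ifstream file(filename);
        if (file) {
          file >> value;
        }
        cache[filename] = value;
        return value;
      }
    };

  }  // namespace cuda
}  // namespace cms
```

If kernels are launched from multiple host threads, the cache would also need to be protected, e.g. with a mutex or by filling it once at start-up.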
```cpp
    file >> blockSize;
    file.close();
  } else {
    std::cout << "Error in opening file " + filename + "\n";
```
Does it make sense to continue the program if a file cannot be opened? (currently the function returns an indeterminate value)
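One option (a sketch only; the free-standing function signature below is assumed, not taken from the PR) is to fail fast instead of returning an uninitialized value:

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: read a block size or abort with a clear error instead
// of silently returning an indeterminate value.
int readBlockSize(std::string const& filename) {
  std::ifstream file(filename);
  if (!file) {
    throw std::runtime_error("Could not open kernel configuration file " + filename);
  }
  int blockSize = 0;
  if (!(file >> blockSize)) {
    throw std::runtime_error("Could not parse block size from " + filename);
  }
  return blockSize;
}
```

Alternatively, the function could fall back to a well-defined default block size and log a warning; either way the returned value should be deterministic.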
```cpp
#ifdef __CUDACC__
  uint32_t *poff = (uint32_t *)((char *)(h) + offsetof(Histo, off));
  int32_t *ppsws = (int32_t *)((char *)(h) + offsetof(Histo, psws));
  cms::cuda::ExecutionConfiguration exec;
```
This doesn't seem to be used.
src/cudaautotune/Makefile (Outdated)
```makefile
TARGET_NAME := $(notdir $(TARGET_DIR))
TARGET := $(BASE_DIR)/$(TARGET_NAME)
include Makefile.deps
EXTERNAL_DEPENDS := $(cuda_EXTERNAL_DEPENDS)
```
This should be
```diff
-EXTERNAL_DEPENDS := $(cuda_EXTERNAL_DEPENDS)
+EXTERNAL_DEPENDS := $(cudaautotune_EXTERNAL_DEPENDS)
```
src/cudaautotune/Makefile.deps (Outdated)
```diff
@@ -0,0 +1,12 @@
+cuda_EXTERNAL_DEPENDS := TBB CUDA EIGEN BOOST BACKTRACE
```
This should be
```diff
-cuda_EXTERNAL_DEPENDS := TBB CUDA EIGEN BOOST BACKTRACE
+cudaautotune_EXTERNAL_DEPENDS := TBB CUDA EIGEN BOOST BACKTRACE
```
src/cudaautotune/autotuning/tuner.py (Outdated)
```diff
@@ -0,0 +1,149 @@
+import argparse
```
How about adding
```diff
+#!/usr/bin/env python3
 import argparse
```
so that the script could be run "directly" (./src/cudaautotune/autotuning/tuner.py)?
src/cudaautotune/autotuning/tuner.py (Outdated)
```python
parser.add_argument('-p', '--process', type=pathlib.Path, nargs=1, required=True,
                    help='path to the program to be autotuned')
parser.add_argument('-c', '--configurations', type=pathlib.Path, nargs=1,
                    default=[pathlib.Path('src/cudaautotune/autotuning/kernel_configs')],
                    help='path to save the configurations for the tunable process to read them. Default = autotuning/kernel_configs/')
```
Why is the default value in a list? (same for the following two arguments)
src/cudaautotune/autotuning/tuner.py (Outdated)
```python
status = ""

cpu_threads = config[tunables["cpu_threads"]]
gpu_streams = cpu_threads + config[tunables["gpu_streams"]]
```
It is not clear to me why the number of concurrent events (--numberOfStreams below) is set to the number of CPU threads plus something named "GPU streams". Could you clarify the intended behavior here with respect to CPU threads and concurrent events?
This PR introduces an autotuning interface and a Python script to autotune the CUDA kernel launch configurations. Refer to the `README.md` for instructions on how to use the script. The interface is simple: it just reads the block size from a file in `src/cudaautotune/autotuning/kernel_configs`.

Tested on a node with an Intel Xeon Silver 4214R and a single T4:

- Baseline configuration (`src/cudaautotune/autotuning/tunables-baseline.csv`): average throughput 1,195.66 ± 1.96 events/s.
- Best configuration found by autotuning (`src/cudaautotune/autotuning/tunables.csv`): average throughput 1,269.46 ± 3.52 events/s.

The search space is too big to cover exhaustively, so I selected the subset of kernels with the highest runtime and autotuned them with random search, fixed them to the best configuration found, and then tuned another set of kernels, repeating until all kernels had been visited.
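To illustrate how a launch site is expected to consume the tuned block size, here is a rough, self-contained sketch (the kernel, the configuration file path, and the method call mentioned in the comment are placeholders, not the exact API in this PR):

```cpp
#include <cuda_runtime.h>

// Illustrative kernel standing in for any of the autotuned kernels.
__global__ void scaleArray(float* data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] *= factor;
}

int main() {
  int const n = 1 << 20;
  float* d_data = nullptr;
  cudaMalloc((void**)&d_data, n * sizeof(float));
  cudaMemset(d_data, 0, n * sizeof(float));

  // Hypothetical call into the autotuning interface: look up the block size
  // the tuner wrote for this kernel, falling back to 256 if no file exists.
  // The exact method name in the PR may differ.
  int blockSize = 256;  // e.g. cms::cuda::ExecutionConfiguration{}.blockSize("kernel_configs/scaleArray", 256);
  int gridSize = (n + blockSize - 1) / blockSize;

  scaleArray<<<gridSize, blockSize>>>(d_data, n, 2.0f);
  cudaDeviceSynchronize();

  cudaFree(d_data);
  return 0;
}
```

The grid size is derived from the tuned block size, so the tuner only has to write a single integer per kernel.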
I appreciate your comments and suggestions regarding the interface or the script.