This is an LLM-based multi-agent framework that automatically generates functional and efficient GPU kernels.
The framework is extendable and flexible: you can easily build your own coding agent and test it on our TritonBench-revised Benchmark and ROCm Benchmark.
We also provide a baseline agent, GEAK-Agent, that you can run directly.
It contains a Generator, a Reflector, an Evaluator, and an Optimizer.

- The Generator generates code according to the query and context information.
- The Reflector reflects on the generated code and the error trace if the code fails to run.
- The Evaluator has a cascade structure: it first tests the generated code for functionality. If the code does not pass the functionality test, the error trace is fed back to the Reflector; otherwise, the Evaluator measures performance, including latency and efficiency.
- The Optimizer takes generated code that passes the Evaluator's tests and produces a strategy for optimizing it in terms of latency and efficiency.
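To make the control flow concrete, the sketch below shows roughly how these components interact in one optimization loop. This is an illustrative sketch only: the function and method names (`generate`, `reflect`, `check_functionality`, `measure_performance`, `propose_strategy`) are hypothetical and do not reflect the actual implementation.

```python
# Illustrative sketch of the Generator -> Evaluator -> Reflector/Optimizer loop.
# All component APIs here are hypothetical placeholders, not the framework's real ones.
def run_agent_loop(generator, reflector, evaluator, optimizer, instruction, max_iters=10):
    context = instruction
    code = None
    for _ in range(max_iters):
        code = generator.generate(context)  # Generator: produce a candidate kernel
        passed, error_trace = evaluator.check_functionality(code)  # cascade stage 1
        if not passed:
            # Functionality test failed: feed the error trace back to the Reflector
            context = reflector.reflect(code, error_trace)
            continue
        perf = evaluator.measure_performance(code)  # cascade stage 2: latency/efficiency
        context = optimizer.propose_strategy(code, perf)  # Optimizer: plan a faster version
    return code
```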
- Prepare the Agent environment

  ```bash
  git clone https://github.com/AMD-AGI/gpu-kernel-agent.git
  cd gpu-kernel-agent
  python3 -m pip install -r requirements.txt
  ```
- Prepare the TritonBench environment

  ```bash
  cd ..
  git clone https://github.com/AMD-AGI/TB-eval.git
  cd TB-eval
  python3 -m pip install -r requirements.txt
  python3 -m pip install -e .
  ```
- Edit the config file. You need to provide your API key, dataloader path, and agent parameters in your config file.

  ```bash
  cd ../gpu-kernel-agent/src
  cp configs/tritonbench_gaagent_config.yaml configs/tritonbench_gaagent_config_new.yaml
  ```

  You can modify the dataloader paths to point to the TritonBench repository downloaded above.
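  As a rough illustration, the relevant fields might look like the excerpt below. Apart from `output_path`, `result_path`, `mem_file`, and `start_iter` (used elsewhere in this README), the key names are assumptions; treat the shipped `tritonbench_gaagent_config.yaml` as the authoritative schema.

  ```yaml
  # Hypothetical excerpt of configs/tritonbench_gaagent_config_new.yaml.
  # Key names other than output_path/result_path/mem_file/start_iter are
  # illustrative assumptions; check the shipped template for the real schema.
  api_key: "YOUR_API_KEY"
  dataloader_path: "../../TB-eval"   # path to the TritonBench checkout above
  output_path: "../outputs"
  max_iter: 10                       # example agent parameter
  ```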
- Put the path of your config file in `main_gaagent.py` and run the script

  ```bash
  python main_gaagent.py
  ```
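  In practice this means pointing the script at your new YAML file, along the lines of the hypothetical one-liner below (the actual variable name inside `main_gaagent.py` may differ):

  ```python
  # In main_gaagent.py -- hypothetical variable name; adapt to the actual script.
  config_path = "configs/tritonbench_gaagent_config_new.yaml"
  ```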
Results and memories will be stored in the `output_path` specified in the config file for each iteration. You can resume from any iteration by specifying `result_path`, `mem_file`, and `start_iter` in the config file. For example:

```yaml
result_path: "../outputs/optimagent_10.jsonl"
mem_file: "../outputs/optimagent_mem_10.json"
start_iter: 11
```
If you need hardware-level profiling and performance analysis capabilities, please visit the Profiler Analyzer branch:

- Branch: `profiler-analyzer`
- Includes ROCm profiling tools and an analyzer for detailed hardware metrics
- Provides insights into memory bandwidth, compute unit occupancy, and kernel optimization strategies
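To switch to that branch after cloning, use the standard git command:

```bash
git checkout profiler-analyzer
```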
- Create a new file for your own dataloader in `dataloaders`

  ```bash
  touch dataloaders/YourData.py
  ```
- In your own dataloader, define a new data class

  ```python
  class YourData:
  ```
- In the YourData class, you need to load `problem_states`, which is a list of ProblemState instances. The agent will run loops over all `problem_states`. You can define your own ProblemState class in `dataloaders/ProblemState.py`. To meet the minimum requirement, each ProblemState instance should include the fields `instruction` and `filename`. Providing a `label` field (golden code) may be helpful for the Agent.
- In order to use our Agent, the YourData class must implement the following methods (a combined skeleton is sketched after this list):

  ```python
  __len__() -> int
  load_ps(path) -> problem_states
  test_opt_correctness(code, filename, tmp_dir, exe_dir) -> pass_call, pass_exe, call_stdout, call_stderr, exe_stdout, exe_stderr
  ```
- `__len__()`

  Returns the number of problem_states in the dataset.

- `load_ps(path)`

  Loads the list of problem_states from the given path.
- `test_opt_correctness(code, filename, tmp_dir, exe_dir)`

  Tests whether the generated code is functionally correct.

  Parameters:
  - `code`: The generated code to be tested.
  - `filename`: Name of the test script file.
  - `tmp_dir`: Directory to save the script (generated code + unit test).
  - `exe_dir`: Directory to store scripts that pass execution tests.

  Returns:
  - `pass_call`: True if the script runs without errors.
  - `pass_exe`: True if the script produces the correct output.
  - `speedup`: float, defined as the latency of the golden code relative to that of the generated code.
  - `stdout`, `stderr`: Stdout and stderr from the test script execution.
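Putting the pieces together, a skeleton dataloader might look like the sketch below. This is a minimal illustration, assuming a ProblemState with the fields described above and a line-delimited JSON dataset; the body of `test_opt_correctness`, including the "PASS" convention, is a hypothetical placeholder for your benchmark's actual test harness.

```python
# dataloaders/YourData.py -- minimal sketch, not the shipped implementation.
import json
import os
import subprocess


class ProblemState:
    """Minimum fields described above; `label` (golden code) is optional."""
    def __init__(self, instruction, filename, label=None):
        self.instruction = instruction  # query describing the kernel to generate
        self.filename = filename        # name used for the generated test script
        self.label = label              # optional golden code


class YourData:
    def __init__(self, path):
        self.problem_states = self.load_ps(path)

    def __len__(self):
        # Number of problem_states in the dataset.
        return len(self.problem_states)

    def load_ps(self, path):
        # Assumes one JSON object per line with "instruction"/"filename"/"label"
        # keys; adapt to however your dataset is actually stored.
        problem_states = []
        with open(path) as f:
            for line in f:
                d = json.loads(line)
                problem_states.append(
                    ProblemState(d["instruction"], d["filename"], d.get("label"))
                )
        return problem_states

    def test_opt_correctness(self, code, filename, tmp_dir, exe_dir):
        # Write the generated code (plus its unit test) to tmp_dir and run it.
        script = os.path.join(tmp_dir, filename)
        with open(script, "w") as f:
            f.write(code)
        proc = subprocess.run(
            ["python3", script], capture_output=True, text=True, timeout=600
        )
        pass_call = proc.returncode == 0
        # Hypothetical convention: the embedded unit test prints "PASS" on success.
        pass_exe = pass_call and "PASS" in proc.stdout
        if pass_exe:
            # Keep scripts that pass execution tests for the performance stage.
            with open(os.path.join(exe_dir, filename), "w") as f:
                f.write(code)
        # call_* and exe_* streams coincide here because this sketch runs one script.
        return pass_call, pass_exe, proc.stdout, proc.stderr, proc.stdout, proc.stderr
```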