An automated optimization framework that uses LLMs to suggest compiler pass parameters for Triton GPU kernels. The system implements a closed-loop refinement process that learns from previous attempts to optimize kernel performance.
- PyTorch baseline implementations for matmul and softmax
- Automated benchmarking and performance measurement
- Correctness validation against reference implementations
- Matmul kernel with configurable:
  - Block sizes (M, N, K dimensions)
  - Group size for program ID mapping
  - Pipeline stages
  - Number of warps
- Softmax kernel with configurable:
  - Block size
- Automatic input generation
- Correctness testing with numerical validation
- Performance benchmarking with statistical analysis
- Stability testing across different input sizes
- Stores all kernel versions and optimization attempts
- Tracks parameters, speedup, correctness, and metadata
- Provides history and statistics for analysis
- Structured prompts (a minimal sketch follows this feature list) including:
  - Kernel code
  - Hardware/device information
  - Performance goals and constraints
  - Optimization history
- Heuristic fallback when LLM unavailable
- Iterative refinement with feedback
- Parameter suggestion → testing → scoring → refinement
- Early stopping on good speedup
- Maximum iteration budget
- Comprehensive optimization reports
- Parameter impact analysis
- Stability analysis across input sizes
- Top-performing configurations
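
A minimal sketch of how the structured prompt described above might be assembled. The helper and field names here are illustrative assumptions, not the actual API of llm_optimizer.py:

```python
import json

def build_prompt(kernel_code: str, device_info: dict, history: list) -> str:
    # History entries are assumed to carry "params", "speedup", and "correct" fields.
    recent = history[-5:]  # keep the prompt short: only the latest attempts
    return "\n".join([
        "You are tuning launch/compilation parameters for a Triton GPU kernel.",
        "Goal: maximize speedup over the PyTorch baseline while remaining numerically correct.",
        f"Hardware: {json.dumps(device_info)}",
        "Kernel code:",
        kernel_code,
        "Previous attempts (params, speedup, correct):",
        json.dumps(recent, indent=2),
        "Reply with a JSON object of new parameter values only.",
    ])
```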
```bash
# Install dependencies
pip install -r requirements.txt
```

```bash
# Optimize matmul kernel (default)
python optimizer.py --kernel matmul

# Optimize softmax kernel
python optimizer.py --kernel softmax

# Specify device and iterations
python optimizer.py --kernel matmul --device cuda --max-iterations 20

# With OpenAI API key
python optimizer.py --kernel matmul --api-key YOUR_API_KEY
# Or set environment variable: export OPENAI_API_KEY=your_key
```

```python
from optimizer import KernelOptimizer
# Create optimizer
optimizer = KernelOptimizer(
    kernel_name="matmul",
    device="cuda",
    max_iterations=20,
    llm_api_key="your-api-key",  # Optional
)

# Run optimization
results = optimizer.optimize()
print(f"Best speedup: {results['best_speedup']:.3f}x")
print(f"Best parameters: {results['best_params']}")
```

```python
from test_framework import TestFramework
from triton_kernels import triton_matmul
# Create test framework
framework = TestFramework(device="cuda")
# Test specific parameters
params = {
    "BLOCK_SIZE_M": 128,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 32,
    "GROUP_SIZE_M": 8,
    "num_stages": 4,
    "num_warps": 8,
}
result = framework.full_test_matmul(params, m=1024, n=1024, k=1024)
print(f"Speedup: {result['speedup']:.3f}x")
print(f"Correct: {result['correct']}")from knowledge_archive import KnowledgeArchive
archive = KnowledgeArchive()
# Get best kernel
best = archive.get_best_kernel("matmul")
print(f"Best speedup: {best['speedup']:.3f}x")
# Get optimization history
history = archive.get_kernel_history("matmul", limit=10)
# Get statistics
stats = archive.get_statistics("matmul")
print(f"Total attempts: {stats['total_attempts']}")compiler-pass-generation/
├── baseline.py # PyTorch baseline implementations
├── triton_kernels.py # Triton kernels with tunable parameters
├── test_framework.py # Testing and benchmarking framework
├── knowledge_archive.py # Storage for optimization results
├── llm_optimizer.py # LLM integration for parameter suggestions
├── optimizer.py # Main optimization loop
├── reporter.py # Report generation system
├── requirements.txt # Python dependencies
└── README.md # This file
- Initialization: Sets up baseline PyTorch functions and Triton kernels with default parameters
- Baseline Benchmarking: Measures baseline performance for comparison
- Optimization Loop (repeats up to max_iterations; a condensed sketch follows this list):
  - LLM suggests new parameter values based on:
    - Current kernel code
    - Hardware characteristics
    - Previous optimization attempts
    - Performance goals
  - Kernel is compiled and tested with new parameters
  - Results are scored (correctness + speedup)
  - Best results are stored in archive
  - Feedback is provided to LLM for next iteration
- Analysis:
  - Parameter impact analysis
  - Stability testing across input sizes
  - Comprehensive reporting
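
In pseudocode, the loop roughly follows the shape below. The helper names (suggest_params, full_test, store) are illustrative assumptions, not the actual APIs of the modules:

```python
def optimize(kernel_name, framework, llm, archive, max_iterations=20, target_speedup=1.5):
    best = {"speedup": 0.0, "params": None}
    feedback = None
    for _ in range(max_iterations):
        params = llm.suggest_params(kernel_name, feedback)        # LLM, or heuristic fallback
        result = framework.full_test(kernel_name, params)         # compile, validate, benchmark
        score = result["speedup"] if result["correct"] else 0.0   # incorrect kernels score zero
        archive.store(kernel_name, params, result)                # persist every attempt
        if score > best["speedup"]:
            best = {"speedup": score, "params": params}
        if score >= target_speedup:                               # early stopping on a good speedup
            break
        feedback = {"params": params, **result}                   # feed results back to the LLM
    return best
```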
Matmul kernel:
- BLOCK_SIZE_M: Block size for M dimension (16, 32, 64, 128)
- BLOCK_SIZE_N: Block size for N dimension (16, 32, 64, 128)
- BLOCK_SIZE_K: Block size for K dimension (16, 32, 64)
- GROUP_SIZE_M: Group size for program ID mapping (1, 2, 4, 8)
- num_stages: Number of pipeline stages (1-5)
- num_warps: Number of warps per block (1, 2, 4, 8, 16)
Softmax kernel:
- BLOCK_SIZE: Block size for processing (256, 512, 1024, 2048, 4096)
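
For illustration, a minimal tutorial-style softmax kernel sketch (not necessarily identical to the one in triton_kernels.py) showing how BLOCK_SIZE enters as a compile-time constant:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, in_row_stride, out_row_stride, n_cols,
                   BLOCK_SIZE: tl.constexpr):
    # One program instance handles one row; BLOCK_SIZE must be a power of two >= n_cols.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(in_ptr + row * in_row_stride + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)          # numerically stable softmax
    num = tl.exp(x)
    out = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * out_row_stride + cols, out, mask=mask)

def softmax(x: torch.Tensor, BLOCK_SIZE: int = 1024) -> torch.Tensor:
    n_rows, n_cols = x.shape
    y = torch.empty_like(x)
    softmax_kernel[(n_rows,)](y, x, x.stride(0), y.stride(0), n_cols,
                              BLOCK_SIZE=BLOCK_SIZE)
    return y
```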
The optimizer generates:
- Console output with optimization progress
- Archive files in the archive/ directory:
  - kernels.json: All kernel versions and metadata
  - metadata.json: Optimization statistics
- Reports in the reports/ directory:
  - {kernel}_optimization_report.txt: Comprehensive optimization report
- Python 3.8+
- PyTorch 2.0+
- Triton 2.0+
- CUDA-capable GPU (for optimal performance)
- OpenAI API key (optional, for LLM suggestions)
- Without an OpenAI API key, the system falls back to heuristic-based parameter suggestions (a sketch follows these notes)
- The framework is designed to work with CUDA, but CPU fallback is available
- Optimization results are stored persistently for later analysis
- The system learns from previous attempts to improve suggestions over time
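
A hedged sketch of what such a heuristic fallback could look like: random sampling of unseen configurations from the matmul candidate values listed above. The function name is hypothetical; the real fallback lives in llm_optimizer.py:

```python
import random

MATMUL_SEARCH_SPACE = {
    "BLOCK_SIZE_M": [16, 32, 64, 128],
    "BLOCK_SIZE_N": [16, 32, 64, 128],
    "BLOCK_SIZE_K": [16, 32, 64],
    "GROUP_SIZE_M": [1, 2, 4, 8],
    "num_stages": [1, 2, 3, 4, 5],
    "num_warps": [1, 2, 4, 8, 16],
}

def heuristic_suggest(history: list) -> dict:
    # Avoid re-testing configurations that already appear in the optimization history.
    tried = {tuple(sorted(h["params"].items())) for h in history}
    while True:
        candidate = {name: random.choice(values) for name, values in MATMUL_SEARCH_SPACE.items()}
        if tuple(sorted(candidate.items())) not in tried:
            return candidate
```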