GinS is an automated system for generating and validating CUDA kernels from PyTorch operations. It uses AI-powered iterative refinement to convert high-level PyTorch operations into optimized CUDA kernels with comprehensive validation.
- Automated CUDA Kernel Generation: Converts PyTorch operations to equivalent CUDA kernels
- Iterative Validation & Refinement: AI-powered error fixing with up to 5 attempts per operation
- Multi-LLM Support: Works with Google Gemini, OpenAI, and Ollama models
- Complete PyTorch Integration: Generates kernels as PyTorch C++ extensions
- Robust Error Handling: Comprehensive validation covering compilation and correctness
- Benchmark Processing: Handles sequences of operations with shared execution context
- Detailed Logging: Saves all attempts, feedback, and results for analysis
The system follows a 3-stage pipeline for each operation:
- Monitor (`src/monitor.py`): Profiles PyTorch operations to capture ATen calls and CUDA kernel information
- Generate (`src/generator.py`): Uses LLMs to generate CUDA kernel code based on profiling data
- Verify (`src/verifier.py`): Compiles and validates generated kernels against ground truth
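The per-operation loop that ties these stages together looks roughly like the sketch below. Here `monitor`, `generate`, and `verify` are placeholder stubs, not the actual APIs of the modules above, which may differ:

```python
# Minimal sketch of the per-operation pipeline; monitor/generate/verify are
# placeholder stubs standing in for src/monitor.py, src/generator.py, and
# src/verifier.py.
MAX_ATTEMPTS = 5

def monitor(op, ctx):
    # Profile the PyTorch op to capture ATen calls and CUDA kernel info.
    return {"op": op, "aten_calls": [], "cuda_kernels": []}

def generate(profile, feedback):
    # Ask the LLM for kernel source, feeding back the previous error if any.
    return "// generated CUDA kernel source"

def verify(kernel_src, op, ctx):
    # Compile the kernel and compare its output against the ground truth.
    return True, None  # (success, error_feedback)

def process_operation(op, ctx):
    profile = monitor(op, ctx)
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        kernel_src = generate(profile, feedback)
        ok, feedback = verify(kernel_src, op, ctx)
        if ok:
            return kernel_src
    return None  # all attempts failed
```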
- Python ≥ 3.12
- PyTorch (CUDA-enabled build)
- NVIDIA GPU with CUDA support
- LLM provider of your choice:
  - Google Gemini API key - required for Gemini-based models
  - Ollama - required for running local models
  - OpenAI API key - required for OpenAI-based models
| Component | Minimum Version | Reason / Notes |
|---|---|---|
| NVCC | ≥ 12.1 | Required for full C++17 support with PyTorch 2.x |
| GCC | ≥ 11.x | Compatible with NVCC 12.1 toolchain |
| PyTorch | ≥ 2.0 | Required for modern extension APIs and C++17 |
| CUDA Driver | ≥ 12.0 | Must support CUDA 12.x toolkit |
Note: if generated code fails to compile, the cause is most likely a compatibility issue with one of these components; update them first.
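To see what your environment actually provides, a quick check (assumes a CUDA-enabled PyTorch build):

```python
import torch

# Report the CUDA toolchain PyTorch was built against and sees at runtime.
print("PyTorch:", torch.__version__)
print("CUDA toolkit (build):", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```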
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd GinS
   ```

2. Install dependencies:

   ```bash
   python -m venv env
   source env/bin/activate
   pip install -r requirements.txt
   ```

3. Set up API keys (for Gemini):

   ```bash
   export GOOGLE_API_KEY="your-api-key-here"
   ```

4. Install Ollama (optional, for local models):

   ```bash
   # Follow Ollama installation instructions for your platform
   ollama pull llama3.2:latest
   ```
Run the system on a benchmark file:

```bash
python -m src.main benchmarks/initial_testing.json
```

Create a JSON file with the following structure:

```json
[
{
"name": "program1",
"operations": [
{
"assignment": "c",
"operation": "torch.matmul",
"inputs": ["a", "b"]
},
{
"assignment": "d",
"operation": "torch.sin",
"inputs": ["c"]
}
],
"definitions": [
{"variable": "a", "value": "torch.randn(2048, 2048, device=\"cuda\")"},
{"variable": "b", "value": "torch.randn(2048, 2048, device=\"cuda\")"}
]
}
]
```
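For reference, this benchmark is equivalent to the following eager PyTorch program; the definitions initialize the context, then the operations run in order, sharing it:

```python
import torch

# Eager-mode equivalent of the "program1" benchmark above.
a = torch.randn(2048, 2048, device="cuda")  # definition of "a"
b = torch.randn(2048, 2048, device="cuda")  # definition of "b"
c = torch.matmul(a, b)  # operation 1, assigned to "c"
d = torch.sin(c)        # operation 2, reuses "c" from the shared context
```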
Key configuration options in `src/main.py`:

- `MAX_ATTEMPTS = 5`: Maximum retry attempts per operation
- `OUTPUT_DIR = "generated_kernels"`: Directory for output files
```
GinS/
├── src/
│   ├── main.py                  # Main pipeline orchestrator
│   ├── monitor.py               # PyTorch operation profiling
│   ├── generator.py             # LLM-based code generation
│   ├── verifier.py              # Kernel validation and testing
│   └── prompts/
│       └── prompts.py           # LLM system prompts
├── benchmarks/
│   └── initial_testing.json     # Example benchmark file
├── generated_kernels/           # Output directory for results
├── requirements.txt             # Python dependencies
└── run.sh                       # Example run script
```
- Load Benchmark: Parse the JSON file with operations and variable definitions
- Initialize Context: Set up the execution environment with the defined variables
- Process Operations: For each operation:
  - Profile: Monitor PyTorch execution to capture ATen/kernel information
  - Generate: Use an LLM to create CUDA kernel code
  - Validate: Compile and test against the ground truth
  - Refine: If validation fails, use AI to fix errors and retry (default: up to 5 attempts)
- Save Results: Store final kernels and logs
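A hypothetical sketch of that benchmark-level loop, using the JSON schema shown earlier (the real orchestration lives in `src/main.py` and also profiles, generates, and verifies a kernel at each step):

```python
import json
import torch

# Sketch only: loads a benchmark, builds the shared context, and runs the
# ground-truth operations in order. Function and variable names are
# illustrative, not the actual src/main.py API.
def run_benchmark(path):
    with open(path) as f:
        programs = json.load(f)
    for program in programs:
        ctx = {"torch": torch}
        # Initialize the shared execution context from "definitions".
        for d in program["definitions"]:
            ctx[d["variable"]] = eval(d["value"], ctx)
        # Run each operation in order; results land back in the context,
        # so later operations (e.g. torch.sin(c)) can reuse them.
        for op in program["operations"]:
            call = f'{op["operation"]}({", ".join(op["inputs"])})'
            ctx[op["assignment"]] = eval(call, ctx)
```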
The system generates several types of output files:
- `*_inputs.pt`: Input tensors for validation
- `*_gold.pt`: Ground truth output tensors
- `*_iter{N}.log`: Logs for each validation attempt
- `*_kernel_final.cu`: Final generated kernel (if validation succeeded)
- `*_kernel_final_FAILED.cu`: Final kernel (if all attempts failed)
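Assuming the `.pt` files are standard `torch.save` artifacts, they can be inspected directly; the file names below are hypothetical examples, so substitute a real prefix from your `generated_kernels/` directory:

```python
import torch

# Inspect saved validation artifacts (hypothetical file names).
inputs = torch.load("generated_kernels/program1_op0_inputs.pt")
gold = torch.load("generated_kernels/program1_op0_gold.pt")
print(type(inputs))
print(gold.shape, gold.dtype, gold.device)
```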
The included `benchmarks/initial_testing.json` demonstrates:

- Matrix multiplication (`torch.matmul`)
- Element-wise operations (`torch.sin`)
- Sequential operations with shared context
- CUDA not available: Ensure PyTorch is installed with CUDA support
- Compilation errors: Check that generated kernels follow PyTorch C++ extension requirements (NVCC, GCC, etc.)
- API key issues: Verify that the API key is set correctly for hosted models, or that Ollama is running in a separate terminal for local models
- Memory issues: Reduce tensor sizes in benchmark definitions
Enable detailed logging by modifying the logging level in `src/verifier.py`:

```python
logging.basicConfig(level=logging.DEBUG)
```