
Conversation

@Infatoshi

Summary

  • Adds kernrl - an RL environment for training LLM agents to write optimized GPU kernels
  • Agents receive PyTorch reference implementations and must write faster CUDA/Triton kernels
  • 89 problems across 10 difficulty levels (matmul, attention, MoE, cryptography, etc.)

Environment Interface

  • Action: CUDA/Triton kernel code
  • Observation: Compilation status, correctness check, speedup measurement, profiling data
  • Reward: +0.1 compile, +0.3 correct, +0.3 beats baseline, +0.3 scaled by log2(speedup)
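For illustration, the reward schedule as described above can be sketched roughly as follows. The exact scaling and cap of the log2 speedup bonus are assumptions, and the review discussion later in this thread notes that the implemented reward diverged from this and was eventually unified differently:

import math

def documented_reward(compiled: bool, correct: bool, speedup: float | None) -> float:
    # Sketch of the reward schedule described in this PR summary (not the final implementation).
    reward = 0.0
    if compiled:
        reward += 0.1                                 # +0.1 for successful compilation
    if correct:
        reward += 0.3                                 # +0.3 for matching the reference output
    if speedup is not None and speedup > 1.0:
        reward += 0.3                                 # +0.3 for beating the baseline
        reward += min(0.3, 0.3 * math.log2(speedup))  # speedup bonus, log2-scaled (cap assumed)
    return reward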

Features

  • Local GPU evaluation with NSight Systems/Compute profiling
  • Docker support for isolated execution
  • Compatible with TRL/GRPO training via rollout_func pattern

Test plan

  • Verify server starts: uvicorn kernrl.server.app:app
  • Test reset/step cycle with sample kernel
  • Verify correctness checking against reference
  • Benchmark timing measurement works
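A rough client-side sketch of the reset/step cycle in the test plan above. Class, method, and problem-id names are assumptions based on the sequence diagram later in this thread and are not verified against the actual package:

# Hypothetical client usage; KernRLEnv, KernelAction, and the problem id are illustrative only.
from kernrl import KernRLEnv, KernelAction

env = KernRLEnv(base_url="http://localhost:8000")   # server started via: uvicorn kernrl.server.app:app
obs = env.reset(problem_id="level1_matmul")
print(obs.reference_code)

sample_kernel = open("my_kernel.py").read()          # a sample CUDA/Triton kernel
result = env.step(KernelAction(code=sample_kernel))
print(result.reward, result.done)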

🤖 Generated with Claude Code

kernrl is an RL environment for training LLM agents to write optimized
CUDA/Triton GPU kernels. Agents receive a PyTorch reference implementation
and must write a kernel that produces the same output faster.

Features:
- 89 problems across 10 difficulty levels
- Comprehensive profiling (NSight Systems/Compute, torch.profiler)
- Correctness verification with configurable tolerances
- Benchmark timing for speedup measurement
- Rich feedback for iterative optimization
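Conceptually, the tolerance-based correctness check is a comparison along these lines (tensor shapes and tolerance values here are illustrative assumptions, not the environment's defaults):

import torch

# Conceptual correctness check with configurable tolerances (values assumed).
reference_output = torch.randn(16, 1024)
solution_output = reference_output + 1e-5 * torch.randn(16, 1024)
correct = torch.allclose(solution_output, reference_output, rtol=1e-3, atol=1e-3)
print(correct)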

Problem levels:
- L1: Simple operators (matmul, softmax, conv, norms)
- L2: Fused operations (matmul+activation chains)
- L3: Single blocks (attention, transformer block)
- L4: Novel layers (MLA, MoE, GQA, FP8, INT4)
- L5-L10: Scientific computing, graphics, signal processing,
  video processing, parallel primitives, cryptography

Requires: NVIDIA GPU with CUDA toolkit, PyTorch, Triton

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@meta-cla

meta-cla bot commented Jan 20, 2026

Hi @Infatoshi!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@burtenshaw
Collaborator

burtenshaw commented Jan 20, 2026

Hi @Infatoshi

Thanks for the PR. From a high level, here are some points we should work on first:

  • You will need to resolve the Meta CLA registration.
  • The Space is in an error state right now: https://huggingface.co/spaces/Infatoshi/kernrl
  • Could you add an inference example to examples/ that uses the OpenAI client to interact with the environment?
  • Could you update docs/environments.md to include this environment?

Once again, cool project!

@meta-cla meta-cla bot added the CLA Signed label on Jan 20, 2026
@Infatoshi
Author

Thanks @burtenshaw! All addressed.

Ready for review!

@burtenshaw
Collaborator

Thanks @Infatoshi. This looks great. One thing concerns me, though: the environment does not conform to the openenv structure created by the CLI, and within this repo we want all the envs to conform to this structure because they should serve as 'examples'. Two options:

Could you restructure the env following this guide, using `openenv init`, so that it is compatible with envs like echo_env etc.?

This will add package management, build and push support, and make it easier to build on top of.

- Add try/except import pattern for in-repo vs standalone compatibility
- Rename kernel_env.py to kernrl_environment.py (follow naming convention)
- Update openenv.yaml to match spec_version 1 format
- Update pyproject.toml with openenv-core git dependency
- Update Dockerfile to use openenv-base multi-stage build pattern
- Update __init__.py exports for both package and server

This makes kernrl compatible with `openenv init` structure and other
environments like echo_env, enabling CLI support for build/push/serve.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@Infatoshi
Author

Restructured kernrl to match the openenv CLI structure! Changes in commit 5f24f37:

Structure changes:

  • Added try/except import pattern for in-repo vs standalone compatibility (matching echo_env pattern)
  • Renamed kernel_env.py to kernrl_environment.py (following naming convention)
  • Updated openenv.yaml to use spec_version 1 format with proper type/runtime/app/port fields
  • Updated pyproject.toml with openenv-core @ git+... dependency pattern
  • Updated Dockerfile to use the openenv-base multi-stage build pattern
  • Updated __init__.py exports for both package and server modules
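
For reference, the in-repo vs. standalone import pattern mentioned in the first bullet typically looks something like this (module names are assumed by analogy with the echo_env convention, not copied from the PR):

try:
    # In-repo layout (running from the OpenEnv repository root)
    from envs.kernrl.models import KernelAction, KernelObservation
except ImportError:
    # Standalone package layout (installed as the kernrl package)
    from kernrl.models import KernelAction, KernelObservation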

The environment now follows the same structure as echo_env and should be compatible with openenv init, openenv build, openenv serve, etc.

Let me know if anything else needs adjustment!

@zkwentz
Contributor

zkwentz commented Jan 21, 2026

@greptile

@greptile-apps

greptile-apps bot commented Jan 21, 2026

Greptile Summary

Adds kernrl, a GPU kernel optimization environment that trains LLM agents to write fast CUDA/Triton kernels through real hardware feedback. Agents receive PyTorch reference implementations and iteratively optimize kernels based on compilation status, correctness checks, and performance benchmarks.

Key Features

  • 89 problems across 10 difficulty levels (basic ops, attention, MoE, cryptography, signal processing)
  • Local GPU evaluation with comprehensive profiling (NSight Systems/Compute, Compute Sanitizer)
  • Compatible with TRL/GRPO training patterns via rollout_func
  • Docker support for isolated execution

Issues Found

Critical - Reward Inconsistency: Three different reward calculations exist:

  1. README documents one structure (compile +0.1, correct +0.3, beats baseline +0.3, speedup bonus +0.3)
  2. evaluator._compute_reward() implements a different structure
  3. kernrl_environment._calculate_reward() implements a third version that's never used (dead code)

The actual reward comes from evaluator._compute_reward() (called at evaluator.py:313), making the environment's method dead code.

Moderate - Configuration Mismatch: enable_ncu defaults to False in KernelOptEnvironment.__init__ but True in LocalGPUEvaluator.__init__, creating inconsistent profiling behavior.

Security Concerns: User-submitted kernel code executes with full Python privileges via dynamic imports. While Docker provides isolation, there's no sandboxing within the container for:

  • Filesystem access
  • Network requests
  • Resource consumption
  • Module imports

This is acceptable for trusted research environments but should be documented as a security consideration.

Architecture

The environment follows OpenEnv patterns correctly:

  • Clean separation between client and server
  • WebSocket-based persistent sessions
  • Proper Pydantic models
  • Docker containerization
  • Comprehensive profiling feedback for agent learning

Confidence Score: 3/5

  • Safe for research use but has critical reward logic bugs and configuration inconsistencies that need fixing
  • Score reflects critical reward calculation bug (documented vs implemented), dead code in reward function, and configuration mismatches between components. Security concerns are acceptable for research but should be documented. Core architecture is solid.
  • Pay close attention to envs/kernrl/README.md, envs/kernrl/server/evaluator.py, and envs/kernrl/server/kernrl_environment.py for reward logic unification

Important Files Changed

| Filename | Overview |
|----------|----------|
| envs/kernrl/README.md | Documentation issue: reward structure in README doesn't match the implementation in evaluator.py |
| envs/kernrl/server/evaluator.py | Core evaluator with comprehensive profiling; implements a different reward than documented |
| envs/kernrl/server/kernrl_environment.py | Main environment implementation; contains duplicate reward calculations that don't match |

Sequence Diagram

sequenceDiagram
    participant Agent as LLM Agent
    participant Client as kernrl_env (Client)
    participant Server as FastAPI Server
    participant Env as KernelOptEnvironment
    participant Eval as LocalGPUEvaluator
    participant GPU as GPU Hardware
    
    Agent->>Client: reset(problem_id)
    Client->>Server: WebSocket: reset
    Server->>Env: reset(problem_id)
    Env->>Env: Load problem from problems/
    Env-->>Server: KernelObservation (description, reference_code)
    Server-->>Client: JSON response
    Client-->>Agent: KernelObservation
    
    loop Until solved or max_turns
        Agent->>Agent: Generate optimized kernel code
        Agent->>Client: step(KernelAction(code))
        Client->>Server: WebSocket: step
        Server->>Env: step(action)
        Env->>Eval: evaluate(solution_code, reference_code)
        
        Eval->>Eval: Write code to tmpdir
        Eval->>GPU: Compile solution (subprocess)
        GPU-->>Eval: Compilation result
        
        alt Compilation succeeds
            Eval->>GPU: Run Compute Sanitizer
            GPU-->>Eval: Memory/race errors
            
            Eval->>GPU: Execute correctness check
            GPU-->>Eval: Output diff statistics
            
            alt Correctness passes
                Eval->>GPU: Run benchmark (warmup + timing)
                GPU-->>Eval: Speedup measurement
                
                Eval->>GPU: NSight Systems profiling
                GPU-->>Eval: System-level metrics
                
                Eval->>GPU: NSight Compute profiling
                GPU-->>Eval: Kernel-level metrics
            end
        end
        
        Eval->>Eval: _compute_reward(result)
        Eval-->>Env: EvalResult (compilation, correctness, benchmark, profiling)
        Env->>Env: Update state (best_speedup, solved)
        Env-->>Server: KernelObservation (feedback, reward, done)
        Server-->>Client: JSON response
        Client-->>Agent: StepResult[KernelObservation]
        Agent->>Agent: Analyze feedback, adjust kernel
    end


@greptile-apps greptile-apps bot left a comment


103 files reviewed, 5 comments


Comment on lines 104 to 111
## Reward Structure

| Component | Reward | Description |
|-----------|--------|-------------|
| Compilation | +0.1 | Code compiles successfully |
| Correctness | +0.3 | Output matches reference |
| Beats baseline | +0.3 | Speedup > 1.0x |
| Speedup bonus | +0.3 | Scales with log2(speedup) |

logic: Reward structure doesn't match implementation.

README says:

  • Compilation: +0.1
  • Correctness: +0.3
  • Beats baseline: +0.3
  • Speedup bonus: +0.3 (log2 scaled)

But evaluator.py:690-715 implements:

  • Compilation: +0.1
  • Correctness: +0.3
  • Speedup > 1.0: +0.3 + bonus (log2 scaled, capped at 0.3)

And kernrl_environment.py:324-340 implements a THIRD version:

  • Compilation fail: -0.5
  • Incorrect: -0.25
  • Speedup > 1.0: min(speedup - 1.0, 2.0)
  • Speedup < 1.0: (speedup - 1.0) * 0.5

Need to unify the reward logic and update docs accordingly.


Comment on lines +324 to +340
def _calculate_reward(self, eval_result) -> float:
    """Calculate reward based on evaluation results."""
    if not eval_result.compilation.success:
        return -0.5  # Penalty for compilation failure

    if eval_result.correctness and not eval_result.correctness.correct:
        return -0.25  # Penalty for incorrect output

    if eval_result.benchmark and eval_result.benchmark.speedup:
        # Reward proportional to speedup
        speedup = eval_result.benchmark.speedup
        if speedup > 1.0:
            return min(speedup - 1.0, 2.0)  # Cap reward at 2.0
        else:
            return (speedup - 1.0) * 0.5  # Smaller penalty for being slower

    return 0.0

logic: Duplicate reward calculation. The _calculate_reward method here is overridden by evaluator._compute_reward() at line 313 of evaluator.py, making this implementation dead code. Remove this method or clarify which reward function should be used.


Comment on lines +79 to +80
# Step info
step: int = 0

logic: The evaluator's enable flags default to True, but the KernelOptEnvironment constructor at kernrl_environment.py:79-80 defaults to enable_nsys=True, enable_ncu=False. The evaluator then enables NCU by default at line 212, creating an inconsistency.


Comment on lines +325 to +365
runner_path.write_text(f'''
import torch
import importlib.util

def load_module(path, name):
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod

ref_mod = load_module("{reference_path}", "reference")
sol_mod = load_module("{solution_path}", "solution")

device = "{self.device}"

if hasattr(ref_mod, "get_init_inputs"):
    init_inputs = ref_mod.get_init_inputs()
else:
    init_inputs = []

model = sol_mod.Model(*init_inputs).to(device).eval()

if hasattr(ref_mod, "get_inputs"):
    inputs = [x.to(device) if isinstance(x, torch.Tensor) else x for x in ref_mod.get_inputs()]
else:
    inputs = [torch.randn(16, 1024, device=device)]

# Warmup
with torch.no_grad():
    for _ in range(5):
        model(*inputs)

torch.cuda.synchronize()

# Profile this
with torch.no_grad():
    for _ in range(10):
        model(*inputs)

torch.cuda.synchronize()
''')

logic: Arbitrary code execution without sandboxing. The runner script dynamically imports and executes user-submitted kernel code via spec.loader.exec_module(mod). While Docker provides some isolation, the code runs with full Python privileges and can:

  • Access filesystem
  • Make network requests
  • Execute system commands via subprocess
  • Import any available module

Consider:

  • Adding resource limits (CPU, memory, time)
  • Restricting filesystem access to temp directories only
  • Network isolation
  • Disabling dangerous imports (subprocess, os.system, etc.)
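
One lightweight way to add the resource limits suggested above, sketched with the standard-library resource module (POSIX-only; the runner path is hypothetical and this is not part of the PR):

import resource
import subprocess
import sys

def _limit_child_resources():
    # Applied in the child process before exec: cap CPU time and address space.
    resource.setrlimit(resource.RLIMIT_CPU, (60, 60))           # 60 s of CPU time
    resource.setrlimit(resource.RLIMIT_AS, (8 * 1024**3,) * 2)  # 8 GiB of memory

proc = subprocess.run(
    [sys.executable, "runner.py"],        # hypothetical runner script path
    capture_output=True,
    text=True,
    timeout=120,                          # wall-clock limit
    preexec_fn=_limit_child_resources,    # not available on Windows
)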

Comment on lines +401 to +425
'''
try:
    proc = subprocess.run(
        [sys.executable, "-c", check_script],
        capture_output=True,
        text=True,
        timeout=30,
    )

    output = proc.stdout + proc.stderr

    if "OK" in proc.stdout:
        warnings = [
            line.replace("WARNING: ", "")
            for line in proc.stdout.split("\n")
            if line.startswith("WARNING:")
        ]
        return CompilationResult(success=True, warnings=warnings)
    else:
        return CompilationResult(success=False, error=output[:2000])

except subprocess.TimeoutExpired:
    return CompilationResult(success=False, error="Compilation timeout (30s)")
except Exception as e:
    return CompilationResult(success=False, error=str(e))

logic: Code injection vulnerability. User code is executed with subprocess.run([sys.executable, "-c", check_script]) where check_script contains f-string interpolation of solution_path. If a malicious user could control the path, they could inject code. While solution_path is a Path object created internally, this pattern is risky. Is there any way a user could influence the solution_path through the API to inject malicious code into the f-string?
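
One minimal way to avoid interpolating the path into the -c script is to pass it as an argument instead (a fragment-level sketch against the snippet above; it assumes check_script can be refactored to read its inputs from sys.argv):

# Pass the path as argv rather than formatting it into the script text.
proc = subprocess.run(
    [sys.executable, "-c", check_script, str(solution_path)],
    capture_output=True,
    text=True,
    timeout=30,
)
# ...and inside check_script:
#     import sys
#     solution_path = sys.argv[1]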


- Unify reward calculation: evaluator now uses same logic as environment
  - Compilation failure: -0.5 penalty
  - Correctness failure: -0.25 penalty
  - Slower than baseline: (speedup - 1.0) * 0.5 (negative)
  - Faster than baseline: min(speedup - 1.0, 2.0) (positive, capped)
  - Key: positive reward ONLY when speedup > 1.0x baseline
- Fix enable_ncu default mismatch: align to False (NCU is slow, opt-in)
- Update README with correct reward documentation and examples
- Add security considerations section to README

Addresses feedback from greptile automated code review.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@Infatoshi
Author

Thanks @greptile-apps for the thorough review! Fixed in commit e41aee4:

Critical - Reward Inconsistency: Fixed. Unified reward logic - evaluator now uses the same calculation as the environment:

  • Compilation failure: -0.5 penalty
  • Correctness failure: -0.25 penalty
  • Slower than baseline: (speedup - 1.0) * 0.5 (negative)
  • Faster than baseline: min(speedup - 1.0, 2.0) (positive, capped)

Key principle: Positive reward only when speedup > 1.0x baseline. Compilation + correctness alone does NOT give positive reward.

Moderate - Config Mismatch: Fixed. Aligned enable_ncu default to False in both LocalGPUEvaluator and KernelOptEnvironment. NCU is slow, so it's opt-in.

Security Concerns: Added Security Considerations section to README documenting that user-submitted code executes with full privileges within the container.

Updated README with correct reward documentation including concrete examples.

@greptile-apps

greptile-apps bot commented Jan 26, 2026

Excellent work! 🎯 I've verified that commit e41aee4 successfully addresses all the critical issues I identified:

✅ Critical - Reward Inconsistency: FIXED

The reward logic is now perfectly unified between evaluator and environment:

def _compute_reward(self, result: EvalResult) -> float:
    # Compilation failure: penalty
    if not result.compilation.success:
        return -0.5

    # Correctness failure: penalty  
    if result.correctness and not result.correctness.correct:
        return -0.25

    # Benchmark available: reward based on speedup
    if result.benchmark and result.benchmark.speedup:
        speedup = result.benchmark.speedup
        if speedup > 1.0:
            # Positive reward only when beating baseline, capped at 2.0
            return min(speedup - 1.0, 2.0)
        else:
            # Small penalty for being slower than baseline
            return (speedup - 1.0) * 0.5

    # Correct but no benchmark (edge case): no reward
    return 0.0

The key principle is now crystal clear: positive reward ONLY when speedup > 1.0x baseline. Perfect! 👌

✅ Moderate - Config Mismatch: FIXED

Both LocalGPUEvaluator and the client now default enable_ncu=False, properly reflecting that NCU profiling is slow and should be opt-in.

✅ Security Concerns: ADDRESSED

The README now includes a comprehensive Security Considerations section that clearly documents the execution model and potential risks.

✅ Documentation: ENHANCED

The README reward structure table with concrete examples is excellent - it makes the reward logic completely transparent to users.

This is a thorough and well-executed fix that maintains the environment's integrity while addressing all identified concerns. The unified reward calculation ensures consistent agent training across both local and server-based evaluation modes.

@burtenshaw
Collaborator

Hey @Infatoshi, we're ready to go on this, right?

@burtenshaw burtenshaw self-requested a review January 30, 2026 18:50
@Infatoshi
Author

Yup!
