
Conversation


@lenroe lenroe commented Jan 8, 2026

Reduce CPU kernel launch overhead

CPU runtime for small to medium-sized workloads is currently dominated by wp.launch() overhead.
This PR reduces the overhead of wp.launch() on CPU by ~65-85%.

Changes

1. Cache invoke() struct types on Kernel object

File: warp/_src/context.py

The invoke() function dynamically creates ctypes struct classes using type() on every call (~6 µs). This PR caches the struct types in Kernel._invoke_cache, keyed by (param_types, adjoint).
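The caching scheme described above can be sketched as follows. This is a minimal illustration, not Warp's actual implementation: the module-level `_invoke_cache` dict and the `get_args_struct()` helper are hypothetical stand-ins for `Kernel._invoke_cache` and the logic inside `invoke()`.

```python
import ctypes

_invoke_cache = {}  # stand-in for Kernel._invoke_cache

def get_args_struct(param_types, adjoint):
    # key on the ctypes parameter signature plus the adjoint flag
    key = (tuple(param_types), adjoint)
    cached = _invoke_cache.get(key)
    if cached is None:
        # slow path: dynamically create the ctypes.Structure subclass
        # (this type() call is the ~6 µs cost the cache eliminates)
        fields = [(f"arg_{i}", t) for i, t in enumerate(param_types)]
        cached = type("ArgsStruct", (ctypes.Structure,), {"_fields_": fields})
        _invoke_cache[key] = cached
    return cached

s1 = get_args_struct([ctypes.c_float, ctypes.c_int], adjoint=False)
s2 = get_args_struct([ctypes.c_float, ctypes.c_int], adjoint=False)
assert s1 is s2  # second call hits the cache and reuses the struct class
```

Each cache hit skips the dynamic class creation entirely; only a new instance of the cached struct class is populated per launch.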

2. Fast path for types_equal()

File: warp/_src/types.py

Added an identity check (if a is b: return True) at the top of types_equal(). This avoids expensive structural comparisons when both arguments are the same type object (the common case for dtype checks).
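The shape of the fast path can be sketched as below; `types_equal_generic()` here is a trivial stand-in for Warp's full structural comparison, included only so the example runs.

```python
def types_equal_generic(a, b):
    # stand-in for Warp's full structural type comparison (slow path)
    return type(a) is type(b)

def types_equal(a, b):
    if a is b:  # identity fast path: the same object is trivially equal to itself
        return True
    return types_equal_generic(a, b)

assert types_equal(int, int)    # fast path: identical type objects
assert types_equal(1.0, 2.0)    # distinct objects fall through to the structural check
```

Since `is` compiles to a single pointer comparison, the fast path costs essentially nothing when it misses and saves the full comparison when it hits.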

Benchmark

@wp.kernel
def add_kernel(a: wp.array(dtype=float), b: wp.array(dtype=float), c: wp.array(dtype=float)):
    i = wp.tid()
    c[i] = a[i] + b[i]

wp.launch(add_kernel, dim=1, inputs=[a, b, c], device="cpu")

Results (Apple Silicon, CPU — median of 5 runs, 20k iterations each):

dim      | Before     | After      | Before (+record_cmd) | After (+record_cmd) | Speedup
1        | 11.00 µs   | 3.81 µs    | 6.73 µs              | 1.02 µs             | 85%
10       | 11.07 µs   | 3.86 µs    | 6.80 µs              | 1.06 µs             | 84%
100      | 11.31 µs   | 4.01 µs    | 6.99 µs              | 1.16 µs             | 83%
1,000    | 12.72 µs   | 5.37 µs    | 8.27 µs              | 2.44 µs             | 70%
10,000   | 26.35 µs   | 18.52 µs   | 21.73 µs             | 15.29 µs            | 30%
100,000  | 156.43 µs  | 147.70 µs  | 150.73 µs            | 142.34 µs           | 6%
  • Before/After: Standard wp.launch() overhead
  • Before/After (+record_cmd): Using wp.launch(..., record_cmd=True) + cmd.launch() for replay
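A generic timing harness following the stated methodology (median of 5 runs, 20k iterations each, per-call time in µs) might look like the sketch below. The helper name and structure are illustrative, not the benchmark script actually used for the numbers above.

```python
import time

def median_overhead_us(fn, iters=20_000, runs=5):
    # time `fn` in a tight loop, repeat `runs` times, report the median
    # per-call cost in microseconds
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        samples.append((time.perf_counter() - t0) / iters * 1e6)
    return sorted(samples)[len(samples) // 2]

# e.g. median_overhead_us(lambda: wp.launch(add_kernel, dim=1,
#                         inputs=[a, b, c], device="cpu"))
overhead = median_overhead_us(lambda: None)
assert overhead >= 0.0
```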

The invoke caching optimization benefits both paths:

  • wp.launch(): 11.00 → 3.81 µs (~65% faster)
  • record_cmd: 6.73 → 1.02 µs (~85% faster)

Before your PR is "Ready for review"

  • All commits are signed-off to indicate that your contribution adheres to the Developer Certificate of Origin requirements
  • Necessary tests have been added
  • Documentation is up-to-date
  • Auto-generated files modified by compiling Warp and building the documentation have been updated (e.g. __init__.pyi, docs/api_reference/, docs/language_reference/)
  • Code passes formatting and linting checks with pre-commit run -a

Summary by CodeRabbit

  • Performance
    • Kernel launches now execute faster through optimized caching of argument structures, reducing overhead on repeated invocations.
    • Type comparison operations are accelerated with an optimized fast-path for identical references.


lenroe added 3 commits January 8, 2026 12:30
Signed-off-by: Lennart Roestel <[email protected]>
Signed-off-by: Lennart Roestel <[email protected]>
Signed-off-by: Lennart Roestel <[email protected]>

copy-pr-bot bot commented Jan 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Jan 8, 2026

📝 Walkthrough


This pull request introduces a per-kernel invocation cache that stores generated ctypes.Structure types for kernel argument packing, eliminating repeated dynamic type generation during repeated kernel invocations. Additionally, a fast-path optimization is added to the type equality comparison function to immediately return true for reference-identical objects.

Changes

Changes by cohort:

  • Kernel invocation struct caching (warp/_src/context.py): Introduces a Kernel._invoke_cache dictionary to cache generated ArgsStruct and AdjArgsStruct ctypes structures. Implements cache key generation from parameter types, conditional caching logic for forward/adjoint paths, and structure reuse on subsequent invocations to avoid repeated dynamic type generation.
  • Type equality optimization (warp/_src/types.py): Adds a reference-identity fast-path check in the types_equal function to immediately return True when comparing identical objects, bypassing heavier comparison logic.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the pull request title 'Reduce CPU launch overhead' directly and clearly summarizes the main objective of the changeset, which is to optimize CPU launch performance through caching and fast-path optimizations.



📜 Recent review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb7e02c and 1a4d133.

📒 Files selected for processing (2)
  • warp/_src/context.py
  • warp/_src/types.py
🔇 Additional comments (3)
warp/_src/types.py (1)

2342-2343: LGTM! Clean fast-path optimization.

The identity check is a correct and efficient optimization that short-circuits expensive type comparisons when the objects are the same. This aligns well with the PR's goal of reducing CPU launch overhead.

warp/_src/context.py (2)

776-778: Per-kernel _invoke_cache is a good fit; shared across overloads is acceptable

Initializing a simple per-Kernel cache for invoke-time struct types is appropriate here. Note that because Kernel.add_overload() uses shallowcopy(self), generic overload instances will share the same _invoke_cache dict; given that the cache key includes the full param ctypes signature, this shared cache is still safe and avoids duplicate struct definitions across overloads. Nothing to change.
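The sharing behavior noted here follows directly from how shallow copy works: the copy gets its own object, but attribute values — including mutable containers — are not duplicated. A minimal sketch (the `Kernel` class below is a toy stand-in, not Warp's):

```python
from copy import copy as shallowcopy

class Kernel:
    def __init__(self):
        self._invoke_cache = {}

parent = Kernel()
overload = shallowcopy(parent)   # mirrors Kernel.add_overload()'s shallowcopy(self)

# writing through the overload is visible from the parent: same dict object
overload._invoke_cache[("float",)] = "ArgsStruct"
assert parent._invoke_cache is overload._invoke_cache
assert ("float",) in parent._invoke_cache
```

Because the cache key includes the full ctypes parameter signature, entries from different overloads cannot collide in the shared dict.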


6375-6436: CPU invoke() caching logic looks correct and preserves semantics

The new fast path in invoke() correctly:

  • Keys the cache on the ctypes parameter types (excluding launch bounds) plus adjoint, so reused structs are only applied when both forward and adjoint argument ctypes layouts match.
  • Reconstructs ArgsStruct and AdjArgsStruct instances the same way as the original code (names from kernel.adj.args, types from type(params[...])), so the ABI to the compiled CPU kernels is unchanged.
  • Handles forward and adjoint calls consistently, with the adjoint path also caching and reusing AdjArgsStruct and its field metadata.

Given CPython’s GIL, concurrent access to kernel._invoke_cache is benign (at worst redundant struct creation), and the cache size is naturally bounded by the small set of distinct ctypes combinations per kernel. No functional issues spotted.




@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR reduces CPU kernel launch overhead by 65-85% through two targeted optimizations:

Major Changes:

  • Caching invoke() struct types: The invoke() function now caches dynamically-created ctypes struct classes in Kernel._invoke_cache, keyed by parameter types and adjoint flag. This eliminates ~6µs of type() overhead per launch.
  • Fast-path types_equal(): Added identity check (if a is b) before expensive type comparison, optimizing the common case where identical type objects are compared.

Implementation Notes:

  • Cache key correctly includes runtime parameter types, preventing collisions between different call signatures
  • Shallow copy in add_overload() causes parent kernel and overloads to share the cache dict, which is functionally correct but may accumulate entries from multiple overloads
  • No thread-safety issues: while the check-then-set pattern has a benign race condition, both threads would compute equivalent values, so overwrites are harmless
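The benign race described in the last bullet is the classic check-then-set pattern; a generic sketch (not Warp's code) shows why it is harmless for a pure cache:

```python
# Under CPython's GIL each dict operation is atomic. Two threads racing past
# the `get` at worst both build equivalent values, and the last write wins —
# harmless when the values are interchangeable, as cached struct classes are.
cache = {}

def get_or_build(key, build):
    cached = cache.get(key)
    if cached is None:
        cached = build()     # may run twice under a race: redundant, not wrong
        cache[key] = cached  # last writer wins; all candidates are equivalent
    return cached

first = get_or_build("k", lambda: object())
second = get_or_build("k", lambda: object())
assert first is second  # single-threaded: the second call is a cache hit
```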

Performance Impact:
Benchmarks show dramatic improvements for small workloads (85% faster for dim=1 with record_cmd), with diminishing returns as kernel execution time dominates (6% for dim=100k).

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - the optimizations are well-targeted and preserve existing behavior
  • Score reflects clean implementation of focused performance optimizations. The caching mechanism is correct (cache key properly includes param types), the identity check is a standard optimization pattern, and no breaking changes were introduced. Deducted one point for minor concern about cache sharing between overloads (functionally correct but could accumulate entries), and lack of explicit thread-safety mechanisms (though Python dict operations are GIL-protected and race conditions are benign).
  • No files require special attention - both changes are straightforward optimizations with clear intent

Important Files Changed

File Analysis

  • warp/_src/context.py — 4/5: Adds caching of ctypes struct types in invoke() to reduce overhead; the cache is shared between overloads due to the shallow copy, which is safe since the cache key includes the param types.
  • warp/_src/types.py — 5/5: Adds an identity-check optimization to types_equal() for a fast path when comparing identical type objects; safe and correct.

Sequence Diagram

sequenceDiagram
    participant User
    participant Launch as wp.launch
    participant Invoke as invoke()
    participant Cache as invoke_cache
    participant CTypes as ctypes
    participant Hooks as KernelHooks

    User->>Launch: launch(kernel, dim, inputs)
    Launch->>Invoke: invoke(kernel, hooks, params, adjoint)
    
    Note over Invoke: Build cache key from types
    Invoke->>Invoke: Compute key from param types
    
    Invoke->>Cache: Check cache for key
    
    alt Cache hit - Fast Path
        Cache-->>Invoke: Return cached structs
        Note over Invoke: Reuse cached ArgsStruct
        Invoke->>Invoke: Populate struct fields
        Invoke->>Hooks: Execute kernel
    else Cache miss - Slow Path
        Note over Invoke: Build structs dynamically
        Invoke->>Invoke: Extract fields from kernel
        Invoke->>CTypes: Create new ArgsStruct
        CTypes-->>Invoke: New class object
        Invoke->>Invoke: Populate struct fields
        Invoke->>Cache: Store for future use
        Invoke->>Hooks: Execute kernel
    end
    
    Hooks-->>Invoke: Execution complete
    Invoke-->>Launch: Return
    Launch-->>User: Complete


greptile-apps bot commented Jan 8, 2026

Additional Comments (1)

warp/_src/context.py
Shallow copy means _invoke_cache dict is shared between parent generic kernel and all its overloads. While functionally correct (cache key includes param types), this could lead to cache growing with entries from multiple overloads. Consider explicitly resetting the cache for overloads if memory usage becomes a concern:

        ovl = shallowcopy(self)
        ovl._invoke_cache = {}

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@shi-eric shi-eric requested a review from nvlukasz January 8, 2026 15:57

greptile-apps bot commented Jan 12, 2026

Greptile Overview

Greptile Summary

This PR introduces two performance optimizations to reduce CPU kernel launch overhead by 65-85%:

Key Changes

1. Caching of ctypes struct types in invoke() (context.py)

  • Adds _invoke_cache dict to Kernel class to cache dynamically-created ctypes struct classes
  • Cache key: (tuple of param types, adjoint bool) based on runtime parameter types from pack_arg()
  • Avoids repeated type() calls (~6µs overhead) for creating ArgsStruct/AdjArgsStruct on each invocation
  • Fast path: retrieve cached struct types and populate with current parameter values
  • Slow path: build struct types, cache them, then use

2. Identity check fast-path in types_equal() (types.py)

  • Adds if a is b: return True before calling types_equal_generic()
  • Optimizes common case where same type object is compared to itself (e.g., dtype checks)

Architecture Integration

The caching mechanism integrates cleanly with the existing launch pipeline:

  • wp.launch() → pack_args() → invoke() (CPU only)
  • Works for both standard launches and Launch.launch() replay
  • Cache survives across multiple kernel invocations with same parameter types
  • Parameter validation in pack_arg() ensures cache key integrity

Benchmark Results

Significant speedups for small-to-medium workloads where launch overhead dominates:

  • dim=1: 11.00µs → 3.81µs (65% faster)
  • dim=1 with record_cmd: 6.73µs → 1.02µs (85% faster)
  • Speedup decreases as kernel execution time increases (6% for dim=100k)

Confidence Score: 4/5

  • This PR is safe to merge with minor concerns about cache memory management in specific edge cases.
  • The implementation is functionally correct and delivers significant performance improvements. The caching logic correctly handles parameter type variations, and validation in pack_arg() prevents cache key collisions. However, the shallow copy behavior causes generic kernels and their overloads to share the same cache dict, which could lead to unbounded cache growth in applications with many overloads. This is a memory efficiency concern rather than a correctness issue, and the previous review thread has already noted this. The types_equal() optimization is straightforward and safe.
  • No files require special attention - the changes are well-contained and the implementation is solid. The shallow copy cache-sharing issue mentioned in previous threads is a known trade-off.

Important Files Changed

File Analysis

  • warp/_src/context.py — 4/5: Adds _invoke_cache to cache ctypes struct types, reducing overhead. The cache is shared between generic kernels and overloads via shallow copy, which is functionally correct but may cause cache pollution.
  • warp/_src/types.py — 5/5: Adds an identity-check fast path to types_equal(). A simple, safe, and correct optimization.

Sequence Diagram

sequenceDiagram
    participant User
    participant launch
    participant invoke
    participant Kernel
    participant Cache as _invoke_cache
    participant ctypes
    
    User->>launch: wp.launch(kernel, dim, inputs, device="cpu")
    launch->>launch: pack_args() → params list
    launch->>invoke: invoke(kernel, hooks, params, adjoint)
    
    alt First call (cache miss)
        invoke->>Cache: get(cache_key)
        Cache-->>invoke: None
        invoke->>invoke: Build fields from kernel.adj.args
        invoke->>ctypes: type("ArgsStruct", ...) [SLOW ~6µs]
        ctypes-->>invoke: ArgsStruct class
        invoke->>Cache: store(cache_key, (ArgsStruct, fields))
        invoke->>invoke: Create args instance & set attributes
        invoke->>ctypes: hooks.forward(bounds, args)
    else Subsequent calls (cache hit)
        invoke->>Cache: get(cache_key)
        Cache-->>invoke: (ArgsStruct, fields)
        invoke->>invoke: Create args instance & set attributes [FAST]
        invoke->>ctypes: hooks.forward(bounds, args)
    end


@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


