
Conversation


@lenroe lenroe commented Jan 8, 2026

Reduce CPU kernel launch overhead

CPU runtime for small to medium-sized workloads is currently dominated by wp.launch() overhead.
This PR reduces the overhead of wp.launch() on CPU by ~65-85%.

Changes

1. Cache invoke() struct types on Kernel object

File: warp/_src/context.py

The invoke() function dynamically creates ctypes struct classes using type() on every call (~6 µs). This PR caches the struct types in Kernel._invoke_cache, keyed by (param_types, adjoint).
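The caching scheme described above can be sketched as follows. This is a minimal illustration, not Warp's actual implementation: the module-level `_invoke_cache` dict and the `get_args_struct()` helper are hypothetical stand-ins for `Kernel._invoke_cache` and the logic inside `invoke()`.

```python
import ctypes

_invoke_cache = {}  # stand-in for Kernel._invoke_cache

def get_args_struct(param_types, adjoint):
    # key on the ctypes parameter signature plus the adjoint flag
    key = (tuple(param_types), adjoint)
    cached = _invoke_cache.get(key)
    if cached is None:
        # slow path: dynamically create the ctypes.Structure subclass
        # (this type() call is the ~6 µs cost the cache eliminates)
        fields = [(f"arg_{i}", t) for i, t in enumerate(param_types)]
        cached = type("ArgsStruct", (ctypes.Structure,), {"_fields_": fields})
        _invoke_cache[key] = cached
    return cached

s1 = get_args_struct([ctypes.c_float, ctypes.c_int], adjoint=False)
s2 = get_args_struct([ctypes.c_float, ctypes.c_int], adjoint=False)
assert s1 is s2  # second call hits the cache and reuses the struct class
```

Each cache hit skips the dynamic class creation entirely; only a new instance of the cached struct class is populated per launch.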

2. Fast path for types_equal()

File: warp/_src/types.py

Added an identity check (if a is b: return True) at the top of types_equal(). This avoids expensive structural comparisons when both arguments are the same type object (the common case for dtype checks).
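The shape of the fast path can be sketched as below; `types_equal_generic()` here is a trivial stand-in for Warp's full structural comparison, included only so the example runs.

```python
def types_equal_generic(a, b):
    # stand-in for Warp's full structural type comparison (slow path)
    return type(a) is type(b)

def types_equal(a, b):
    if a is b:  # identity fast path: the same object is trivially equal to itself
        return True
    return types_equal_generic(a, b)

assert types_equal(int, int)    # fast path: identical type objects
assert types_equal(1.0, 2.0)    # distinct objects fall through to the structural check
```

Since `is` compiles to a single pointer comparison, the fast path costs essentially nothing when it misses and saves the full comparison when it hits.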

Benchmark

@wp.kernel
def add_kernel(a: wp.array(dtype=float), b: wp.array(dtype=float), c: wp.array(dtype=float)):
    i = wp.tid()
    c[i] = a[i] + b[i]

wp.launch(add_kernel, dim=1, inputs=[a, b, c], device="cpu")

Results (Apple Silicon, CPU — median of 5 runs, 20k iterations each):

dim      | Before     | After      | Before (+record_cmd) | After (+record_cmd) | Speedup
1        | 11.00 µs   | 3.81 µs    | 6.73 µs              | 1.02 µs             | 85%
10       | 11.07 µs   | 3.86 µs    | 6.80 µs              | 1.06 µs             | 84%
100      | 11.31 µs   | 4.01 µs    | 6.99 µs              | 1.16 µs             | 83%
1,000    | 12.72 µs   | 5.37 µs    | 8.27 µs              | 2.44 µs             | 70%
10,000   | 26.35 µs   | 18.52 µs   | 21.73 µs             | 15.29 µs            | 30%
100,000  | 156.43 µs  | 147.70 µs  | 150.73 µs            | 142.34 µs           | 6%
  • Before/After: Standard wp.launch() overhead
  • Before/After (+record_cmd): Using wp.launch(..., record_cmd=True) + cmd.launch() for replay
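A generic timing harness following the stated methodology (median of 5 runs, 20k iterations each, per-call time in µs) might look like the sketch below. The helper name and structure are illustrative, not the benchmark script actually used for the numbers above.

```python
import time

def median_overhead_us(fn, iters=20_000, runs=5):
    # time `fn` in a tight loop, repeat `runs` times, report the median
    # per-call cost in microseconds
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        samples.append((time.perf_counter() - t0) / iters * 1e6)
    return sorted(samples)[len(samples) // 2]

# e.g. median_overhead_us(lambda: wp.launch(add_kernel, dim=1,
#                         inputs=[a, b, c], device="cpu"))
overhead = median_overhead_us(lambda: None)
assert overhead >= 0.0
```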

The invoke caching optimization benefits both paths:

  • wp.launch(): 11.00 → 3.81 µs (~65% faster)
  • record_cmd: 6.73 → 1.02 µs (~85% faster)

Before your PR is "Ready for review"

  • All commits are signed-off to indicate that your contribution adheres to the Developer Certificate of Origin requirements
  • Necessary tests have been added
  • Documentation is up-to-date
  • Auto-generated files modified by compiling Warp and building the documentation have been updated (e.g. __init__.pyi, docs/api_reference/, docs/language_reference/)
  • Code passes formatting and linting checks with pre-commit run -a

Summary by CodeRabbit

  • Performance
    • Kernel launches now execute faster through optimized caching of argument structures, reducing overhead on repeated invocations.
    • Type comparison operations are accelerated with an optimized fast-path for identical references.


lenroe added 3 commits January 8, 2026 12:30
Signed-off-by: Lennart Roestel <[email protected]>
Signed-off-by: Lennart Roestel <[email protected]>
Signed-off-by: Lennart Roestel <[email protected]>

copy-pr-bot bot commented Jan 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Jan 8, 2026

📝 Walkthrough


This pull request introduces a per-kernel invocation cache that stores generated ctypes.Structure types for kernel argument packing, eliminating repeated dynamic type generation during repeated kernel invocations. Additionally, a fast-path optimization is added to the type equality comparison function to immediately return true for reference-identical objects.

Changes

Changes by cohort:

  • Kernel invocation struct caching (warp/_src/context.py): Introduces a Kernel._invoke_cache dictionary to cache generated ArgsStruct and AdjArgsStruct ctypes structures. Implements cache key generation from parameter types, conditional caching logic for forward/adjoint paths, and structure reuse on subsequent invocations to avoid repeated dynamic type generation.
  • Type equality optimization (warp/_src/types.py): Adds a reference-identity fast-path check in the types_equal function to immediately return True when comparing identical objects, bypassing heavier comparison logic.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the pull request title 'Reduce CPU launch overhead' directly and clearly summarizes the main objective of the changeset, which is to optimize CPU launch performance through caching and fast-path optimizations.



📜 Recent review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb7e02c and 1a4d133.

📒 Files selected for processing (2)
  • warp/_src/context.py
  • warp/_src/types.py
🔇 Additional comments (3)
warp/_src/types.py (1)

2342-2343: LGTM! Clean fast-path optimization.

The identity check is a correct and efficient optimization that short-circuits expensive type comparisons when the objects are the same. This aligns well with the PR's goal of reducing CPU launch overhead.

warp/_src/context.py (2)

776-778: Per-kernel _invoke_cache is a good fit; shared across overloads is acceptable

Initializing a simple per-Kernel cache for invoke-time struct types is appropriate here. Note that because Kernel.add_overload() uses shallowcopy(self), generic overload instances will share the same _invoke_cache dict; given that the cache key includes the full param ctypes signature, this shared cache is still safe and avoids duplicate struct definitions across overloads. Nothing to change.
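The sharing behavior noted here follows directly from how shallow copy works: the copy gets its own object, but attribute values — including mutable containers — are not duplicated. A minimal sketch (the `Kernel` class below is a toy stand-in, not Warp's):

```python
from copy import copy as shallowcopy

class Kernel:
    def __init__(self):
        self._invoke_cache = {}

parent = Kernel()
overload = shallowcopy(parent)   # mirrors Kernel.add_overload()'s shallowcopy(self)

# writing through the overload is visible from the parent: same dict object
overload._invoke_cache[("float",)] = "ArgsStruct"
assert parent._invoke_cache is overload._invoke_cache
assert ("float",) in parent._invoke_cache
```

Because the cache key includes the full ctypes parameter signature, entries from different overloads cannot collide in the shared dict.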


6375-6436: CPU invoke() caching logic looks correct and preserves semantics

The new fast path in invoke() correctly:

  • Keys the cache on the ctypes parameter types (excluding launch bounds) plus adjoint, so reused structs are only applied when both forward and adjoint argument ctypes layouts match.
  • Reconstructs ArgsStruct and AdjArgsStruct instances the same way as the original code (names from kernel.adj.args, types from type(params[...])), so the ABI to the compiled CPU kernels is unchanged.
  • Handles forward and adjoint calls consistently, with the adjoint path also caching and reusing AdjArgsStruct and its field metadata.

Given CPython’s GIL, concurrent access to kernel._invoke_cache is benign (at worst redundant struct creation), and the cache size is naturally bounded by the small set of distinct ctypes combinations per kernel. No functional issues spotted.




@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR reduces CPU kernel launch overhead by 65-85% through two targeted optimizations:

Major Changes:

  • Caching invoke() struct types: The invoke() function now caches dynamically-created ctypes struct classes in Kernel._invoke_cache, keyed by parameter types and adjoint flag. This eliminates ~6µs of type() overhead per launch.
  • Fast-path types_equal(): Added identity check (if a is b) before expensive type comparison, optimizing the common case where identical type objects are compared.

Implementation Notes:

  • Cache key correctly includes runtime parameter types, preventing collisions between different call signatures
  • Shallow copy in add_overload() causes parent kernel and overloads to share the cache dict, which is functionally correct but may accumulate entries from multiple overloads
  • No thread-safety issues: while the check-then-set pattern has a benign race condition, both threads would compute equivalent values, so overwrites are harmless
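The benign race described in the last bullet is the classic check-then-set pattern; a generic sketch (not Warp's code) shows why it is harmless for a pure cache:

```python
# Under CPython's GIL each dict operation is atomic. Two threads racing past
# the `get` at worst both build equivalent values, and the last write wins —
# harmless when the values are interchangeable, as cached struct classes are.
cache = {}

def get_or_build(key, build):
    cached = cache.get(key)
    if cached is None:
        cached = build()     # may run twice under a race: redundant, not wrong
        cache[key] = cached  # last writer wins; all candidates are equivalent
    return cached

first = get_or_build("k", lambda: object())
second = get_or_build("k", lambda: object())
assert first is second  # single-threaded: the second call is a cache hit
```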

Performance Impact:
Benchmarks show dramatic improvements for small workloads (85% faster for dim=1 with record_cmd), with diminishing returns as kernel execution time dominates (6% for dim=100k).

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - the optimizations are well-targeted and preserve existing behavior
  • Score reflects clean implementation of focused performance optimizations. The caching mechanism is correct (cache key properly includes param types), the identity check is a standard optimization pattern, and no breaking changes were introduced. Deducted one point for minor concern about cache sharing between overloads (functionally correct but could accumulate entries), and lack of explicit thread-safety mechanisms (though Python dict operations are GIL-protected and race conditions are benign).
  • No files require special attention - both changes are straightforward optimizations with clear intent

Important Files Changed

File Analysis

  • warp/_src/context.py — 4/5: Adds caching of ctypes struct types in invoke() to reduce overhead; the cache is shared between overloads due to the shallow copy, which is safe since the cache key includes the param types.
  • warp/_src/types.py — 5/5: Adds an identity-check optimization to types_equal() for a fast path when comparing identical type objects; safe and correct.

Sequence Diagram

sequenceDiagram
    participant User
    participant Launch as wp.launch
    participant Invoke as invoke()
    participant Cache as invoke_cache
    participant CTypes as ctypes
    participant Hooks as KernelHooks

    User->>Launch: launch(kernel, dim, inputs)
    Launch->>Invoke: invoke(kernel, hooks, params, adjoint)
    
    Note over Invoke: Build cache key from types
    Invoke->>Invoke: Compute key from param types
    
    Invoke->>Cache: Check cache for key
    
    alt Cache hit - Fast Path
        Cache-->>Invoke: Return cached structs
        Note over Invoke: Reuse cached ArgsStruct
        Invoke->>Invoke: Populate struct fields
        Invoke->>Hooks: Execute kernel
    else Cache miss - Slow Path
        Note over Invoke: Build structs dynamically
        Invoke->>Invoke: Extract fields from kernel
        Invoke->>CTypes: Create new ArgsStruct
        CTypes-->>Invoke: New class object
        Invoke->>Invoke: Populate struct fields
        Invoke->>Cache: Store for future use
        Invoke->>Hooks: Execute kernel
    end
    
    Hooks-->>Invoke: Execution complete
    Invoke-->>Launch: Return
    Launch-->>User: Complete


greptile-apps bot commented Jan 8, 2026

Additional Comments (1)

warp/_src/context.py
Shallow copy means _invoke_cache dict is shared between parent generic kernel and all its overloads. While functionally correct (cache key includes param types), this could lead to cache growing with entries from multiple overloads. Consider explicitly resetting the cache for overloads if memory usage becomes a concern:

        ovl = shallowcopy(self)
        ovl._invoke_cache = {}

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@shi-eric shi-eric requested a review from nvlukasz January 8, 2026 15:57

greptile-apps bot commented Jan 12, 2026

Greptile Overview

Greptile Summary

This PR introduces two performance optimizations to reduce CPU kernel launch overhead by 65-85%:

Key Changes

1. Caching of ctypes struct types in invoke() (context.py)

  • Adds _invoke_cache dict to Kernel class to cache dynamically-created ctypes struct classes
  • Cache key: (tuple of param types, adjoint bool) based on runtime parameter types from pack_arg()
  • Avoids repeated type() calls (~6µs overhead) for creating ArgsStruct/AdjArgsStruct on each invocation
  • Fast path: retrieve cached struct types and populate with current parameter values
  • Slow path: build struct types, cache them, then use

2. Identity check fast-path in types_equal() (types.py)

  • Adds if a is b: return True before calling types_equal_generic()
  • Optimizes common case where same type object is compared to itself (e.g., dtype checks)

Architecture Integration

The caching mechanism integrates cleanly with the existing launch pipeline:

  • wp.launch() → pack_args() → invoke() (CPU only)
  • Works for both standard launches and Launch.launch() replay
  • Cache survives across multiple kernel invocations with same parameter types
  • Parameter validation in pack_arg() ensures cache key integrity

Benchmark Results

Significant speedups for small-to-medium workloads where launch overhead dominates:

  • dim=1: 11.00µs → 3.81µs (65% faster)
  • dim=1 with record_cmd: 6.73µs → 1.02µs (85% faster)
  • Speedup decreases as kernel execution time increases (6% for dim=100k)

Confidence Score: 4/5

  • This PR is safe to merge with minor concerns about cache memory management in specific edge cases.
  • The implementation is functionally correct and delivers significant performance improvements. The caching logic correctly handles parameter type variations, and validation in pack_arg() prevents cache key collisions. However, the shallow copy behavior causes generic kernels and their overloads to share the same cache dict, which could lead to unbounded cache growth in applications with many overloads. This is a memory efficiency concern rather than a correctness issue, and the previous review thread has already noted this. The types_equal() optimization is straightforward and safe.
  • No files require special attention - the changes are well-contained and the implementation is solid. The shallow copy cache-sharing issue mentioned in previous threads is a known trade-off.

Important Files Changed

File Analysis

  • warp/_src/context.py — 4/5: Adds _invoke_cache to cache ctypes struct types, reducing overhead. The cache is shared between generic kernels and overloads via shallow copy, which is functionally correct but may cause cache pollution.
  • warp/_src/types.py — 5/5: Adds an identity-check fast path to types_equal(). A simple, safe, and correct optimization.

Sequence Diagram

sequenceDiagram
    participant User
    participant launch
    participant invoke
    participant Kernel
    participant Cache as _invoke_cache
    participant ctypes
    
    User->>launch: wp.launch(kernel, dim, inputs, device="cpu")
    launch->>launch: pack_args() → params list
    launch->>invoke: invoke(kernel, hooks, params, adjoint)
    
    alt First call (cache miss)
        invoke->>Cache: get(cache_key)
        Cache-->>invoke: None
        invoke->>invoke: Build fields from kernel.adj.args
        invoke->>ctypes: type("ArgsStruct", ...) [SLOW ~6µs]
        ctypes-->>invoke: ArgsStruct class
        invoke->>Cache: store(cache_key, (ArgsStruct, fields))
        invoke->>invoke: Create args instance & set attributes
        invoke->>ctypes: hooks.forward(bounds, args)
    else Subsequent calls (cache hit)
        invoke->>Cache: get(cache_key)
        Cache-->>invoke: (ArgsStruct, fields)
        invoke->>invoke: Create args instance & set attributes [FAST]
        invoke->>ctypes: hooks.forward(bounds, args)
    end


@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


