# Reduce CPU launch overhead #1160
## Conversation
Signed-off-by: Lennart Roestel <[email protected]>
### 📝 Walkthrough

This pull request introduces a per-kernel invocation cache that stores generated ctypes.Structure types for kernel argument packing, eliminating repeated dynamic type generation across kernel invocations. It also adds a fast-path optimization to the type equality comparison function that immediately returns true for reference-identical objects.
## Greptile Summary
This PR reduces CPU kernel launch overhead by 65-85% through two targeted optimizations:
Major Changes:
- **Caching `invoke()` struct types**: The `invoke()` function now caches dynamically created ctypes struct classes in `Kernel._invoke_cache`, keyed by parameter types and the adjoint flag. This eliminates ~6 µs of `type()` overhead per launch.
- **Fast-path `types_equal()`**: Added an identity check (`if a is b`) before the expensive type comparison, optimizing the common case where identical type objects are compared.
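The identity fast path is a one-line guard; the sketch below is a simplified stand-in (the real `types_equal()` in `warp/_src/types.py` performs a much richer structural comparison as its slow path):

```python
def types_equal(a, b):
    # Fast path: the same object is trivially equal to itself, so the
    # expensive structural comparison can be skipped entirely.
    if a is b:
        return True
    # Slow path (simplified placeholder for Warp's full comparison logic):
    return type(a) is type(b) and getattr(a, "__dict__", None) == getattr(b, "__dict__", None)


# The common case for dtype checks: both sides reference the same type object.
class Float32Type:
    pass

f32 = Float32Type()
print(types_equal(f32, f32))  # True, via the identity fast path
```

Because dtype objects in a typical program are singletons shared by reference, most comparisons never reach the slow path at all.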
Implementation Notes:
- Cache key correctly includes runtime parameter types, preventing collisions between different call signatures
- The shallow copy in `add_overload()` causes the parent kernel and its overloads to share the cache dict, which is functionally correct but may accumulate entries from multiple overloads.
- No thread-safety issues: while the check-then-set pattern has a benign race condition, both threads would compute equivalent values, so overwrites are harmless.
Performance Impact:
Benchmarks show dramatic improvements for small workloads (85% faster for dim=1 with record_cmd), with diminishing returns as kernel execution time dominates (6% for dim=100k).
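The shape of the win can be reproduced with a standalone micro-benchmark, independent of Warp: it compares creating a fresh ctypes struct class on every call against reusing a cached one. Absolute numbers will vary by machine; only the relative gap matters.

```python
import ctypes
import timeit

fields = [("x", ctypes.c_int), ("y", ctypes.c_float)]


def create_every_call():
    # Uncached path: pay the type() cost on each invocation.
    cls = type("ArgsStruct", (ctypes.Structure,), {"_fields_": list(fields)})
    return cls(1, 2.0)


_cached_cls = type("ArgsStruct", (ctypes.Structure,), {"_fields_": list(fields)})


def reuse_cached():
    # Cached path: only instantiate; the class was built once up front.
    return _cached_cls(1, 2.0)


n = 20_000
t_create = timeit.timeit(create_every_call, number=n)
t_cached = timeit.timeit(reuse_cached, number=n)
print(f"per-call: create={t_create / n * 1e6:.2f} µs, cached={t_cached / n * 1e6:.2f} µs")
```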
Confidence Score: 4/5
- This PR is safe to merge with minimal risk - the optimizations are well-targeted and preserve existing behavior
- Score reflects clean implementation of focused performance optimizations. The caching mechanism is correct (cache key properly includes param types), the identity check is a standard optimization pattern, and no breaking changes were introduced. Deducted one point for minor concern about cache sharing between overloads (functionally correct but could accumulate entries), and lack of explicit thread-safety mechanisms (though Python dict operations are GIL-protected and race conditions are benign).
- No files require special attention - both changes are straightforward optimizations with clear intent
### File Analysis
| Filename | Score | Overview |
|---|---|---|
| warp/_src/context.py | 4/5 | Adds caching of ctypes struct types in invoke() to reduce overhead; cache is shared between overloads due to shallow copy, which is safe since cache key includes param types |
| warp/_src/types.py | 5/5 | Adds identity check optimization to types_equal() for fast path when comparing identical type objects; safe and correct |
### Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Launch as wp.launch
    participant Invoke as invoke()
    participant Cache as invoke_cache
    participant CTypes as ctypes
    participant Hooks as KernelHooks
    User->>Launch: launch(kernel, dim, inputs)
    Launch->>Invoke: invoke(kernel, hooks, params, adjoint)
    Note over Invoke: Build cache key from types
    Invoke->>Invoke: Compute key from param types
    Invoke->>Cache: Check cache for key
    alt Cache hit - Fast Path
        Cache-->>Invoke: Return cached structs
        Note over Invoke: Reuse cached ArgsStruct
        Invoke->>Invoke: Populate struct fields
        Invoke->>Hooks: Execute kernel
    else Cache miss - Slow Path
        Note over Invoke: Build structs dynamically
        Invoke->>Invoke: Extract fields from kernel
        Invoke->>CTypes: Create new ArgsStruct
        CTypes-->>Invoke: New class object
        Invoke->>Invoke: Populate struct fields
        Invoke->>Cache: Store for future use
        Invoke->>Hooks: Execute kernel
    end
    Hooks-->>Invoke: Execution complete
    Invoke-->>Launch: Return
    Launch-->>User: Complete
```
## Reduce CPU kernel launch overhead
CPU runtime for small to medium-sized workloads is currently dominated by `wp.launch()` overhead. This PR reduces the overhead of `wp.launch()` on CPU by ~65-85%.

### Changes
#### 1. Cache `invoke()` struct types on the Kernel object

File: `warp/_src/context.py`

The `invoke()` function dynamically creates ctypes struct classes using `type()` on every call (~6 µs). This PR caches the struct types in `Kernel._invoke_cache`, keyed by `(param_types, adjoint)`.

#### 2. Fast path for `types_equal()`

File: `warp/_src/types.py`

Added an identity check `if a is b: return True` at the top of `types_equal()`. This avoids expensive comparisons when identical type objects are compared (the common case for dtype checks).

### Benchmark
Results (Apple Silicon CPU; median of 5 runs, 20k iterations each). Two paths were measured: plain `wp.launch()` overhead, and `wp.launch(..., record_cmd=True)` + `cmd.launch()` for replay. The invoke caching optimization benefits both paths:

- `wp.launch()`: 11.00 → 3.81 µs (~65% faster)
- `record_cmd`: 6.73 → 1.02 µs (~85% faster)