Extend threaded macro to use shared memory #113
Force-pushed from 0122cbd to 0776306
Actionable comments posted: 1
🧹 Nitpick comments (3)
Project.toml (1)

26-26: Consider using a version range for StaticArrays.

The exact version constraint `1.9.13` might be too restrictive. Consider using a range like `1.9` to allow patch updates.

```diff
-StaticArrays = "1.9.13"
+StaticArrays = "1.9"
```

test/runtests.jl (2)
320-320: Consider tracking the allocation issue.

The `@test_broken` indicates known allocations in threaded execution. Consider adding a comment or issue reference explaining why allocations occur.

```diff
-    @test_broken threaded_allocations == 0
+    # TODO: Fix allocations in threaded execution (issue #XXX)
+    @test_broken threaded_allocations == 0
```

383-383: Consider tracking the allocation issue.

Same as above: consider documenting why allocations occur.

```diff
-    @test_broken threaded_allocations == 0
+    # TODO: Fix allocations in threaded execution (issue #XXX)
+    @test_broken threaded_allocations == 0
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- Project.toml (2 hunks)
- docs/Manifest.toml (4 hunks)
- docs/src/apis.md (1 hunk)
- ext/ClimaCommsCUDAExt.jl (5 hunks)
- src/devices.jl (5 hunks)
- test/runtests.jl (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`*`: # CodeRabbit Style Guide (CliMA Inspired)

Leverage CodeRabbit for code reviews aligning with CliMA's practices.
I. Key Areas for CodeRabbit:
- Style: Naming (TitleCase, lowercase_with_underscores), line length (<78), indentation (4 spaces), import order.
- DRY: Flag duplicated code; encourage modularization.
- Docstrings: Flag missing docstrings for modules, structs, functions.
- Tests: Detect missing unit tests (if configured).
- Complexity: Report on cyclomatic complexity.
II. Conventions (CodeRabbit Can Help):
- Naming: Follow CliMA/CMIP conventions. Avoid `l`, `O`, `I` as single-char vars.
- Unicode: Human review needed. Acceptable chars.
III. Documentation (CodeRabbit Flags Deficiencies):
- Docstrings: Describe purpose, inputs, outputs, examples.
IV. Testing (CodeRabbit: Missing Tests):
- Unit tests required for new/modified code.
V. CodeRabbit Config:
- `.coderabbit.yaml`: Customize language, line length, tests.
VI. Workflow:
- Review CodeRabbit's suggestions; investigate, provide context, address issues, improve config.
VII. Abandoned Code: Delete or clearly mark as deprecated.
VIII. Reviewer Responsibilities: Ensure adherence to best practices.
Project.toml
🔇 Additional comments (9)
docs/src/apis.md (1)
37-42: Documentation updates look good!

The new API entries are well-organized within the Devices section.
docs/Manifest.toml (1)
1-436: Machine-generated manifest file.

test/runtests.jl (1)

268-384: Comprehensive test coverage!

The new tests thoroughly exercise the interdependent threading functionality with good edge case handling.
ext/ClimaCommsCUDAExt.jl (3)
56-57: LGTM!

Clean implementation of GPU shared memory allocation using CUDA's static shared arrays.

143-188: Well-structured multi-dimensional kernel launch!

The implementation correctly maps blocks to independent iterator items and threads within blocks to interdependent iterator items, with automatic coarsening when limits are exceeded.

223-227: Verify dimension function usage after fix.

This grid-stride loop implementation looks correct but depends on the fixed dimension functions from lines 59-66.
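For readers unfamiliar with the pattern that comment refers to, a grid-stride loop in CUDA.jl typically looks like the following (an illustrative sketch, not the extension's actual code; `f` and `items` are placeholders):

```julia
using CUDA

function grid_stride_kernel!(f, items)
    n = length(items)
    # Global thread index and the total number of threads in the grid
    index = (CUDA.blockIdx().x - 1) * CUDA.blockDim().x + CUDA.threadIdx().x
    stride = CUDA.gridDim().x * CUDA.blockDim().x
    # Each thread processes items index, index + stride, index + 2stride, ...
    # so any number of items can be covered by a fixed-size launch.
    i = index
    while i <= n
        f(items[i])
        i += stride
    end
    return nothing
end
```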
src/devices.jl (3)
483-559: Robust macro implementation!

The extended `@threaded` macro properly handles:
- Single and nested loops
- Independent and interdependent iterators
- Device-specific code generation
- Clear error messages for invalid usage
577-580: Clean annotation design!

The error-only implementation ensures `@interdependent` is used exclusively within `@threaded` loops.
714-715: Smart CPU shared memory emulation!

Using `MArray` provides stack-allocated, statically sized arrays that mimic GPU shared memory performance characteristics.

Verify the StaticArrays dependency is properly declared:

🏁 Script executed:

```shell
#!/bin/bash
# Description: Verify StaticArrays is listed as a dependency
# Test: Check if StaticArrays is in Project.toml
rg -A 5 "StaticArrays" Project.toml
```

Length of output: 234

✅ Verification successful
StaticArrays dependency confirmed
StaticArrays is properly declared in Project.toml. No further action required.
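To illustrate the emulation strategy the review praises, a minimal sketch of a CPU-side allocator built on `MArray` might look like the following (hypothetical names; the real ClimaComms method dispatches on a device type, which is omitted here):

```julia
using StaticArrays

# Hypothetical CPU fallback mirroring a GPU shared-memory allocator:
# MArray is a mutable, fixed-size array that the compiler can keep on
# the stack, so no heap allocation occurs in hot loops.
cpu_shared_memory_array(::Type{T}, dims::Int...) where {T} =
    MArray{Tuple{dims...}, T}(undef)

buf = cpu_shared_memory_array(Float64, 8)
buf .= 0.0  # MArray is mutable in place, unlike SArray
```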
Force-pushed from 0776306 to 00a171c
Actionable comments posted: 1
🧹 Nitpick comments (2)
ext/ClimaCommsCUDAExt.jl (2)
54-57: Fix formatting to resolve pipeline failure.

The pipeline indicates formatting issues in this function. Apply proper Julia formatting.

```diff
-ClimaComms.static_shared_memory_array(::CUDADevice, ::Type{T}, dims...) where {T} =
-    CUDA.CuStaticSharedArray(T, dims)
+ClimaComms.static_shared_memory_array(
+    ::CUDADevice,
+    ::Type{T},
+    dims...,
+) where {T} = CUDA.CuStaticSharedArray(T, dims)
```
215-216: Enhance error message clarity.

Consider making the error message more specific about which coarsening values are invalid.

```diff
-    (min_items_in_thread[1] > 0 && min_items_in_thread[2] > 0) ||
-        throw(ArgumentError("integer `coarsen` values must be positive"))
+    (min_items_in_thread[1] > 0 && min_items_in_thread[2] > 0) ||
+        throw(ArgumentError("both integer `coarsen` values must be positive, got $(min_items_in_thread)"))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- Project.toml (2 hunks)
- docs/Manifest.toml (4 hunks)
- docs/src/apis.md (1 hunk)
- ext/ClimaCommsCUDAExt.jl (5 hunks)
- src/devices.jl (5 hunks)
- test/runtests.jl (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- Project.toml
- docs/src/apis.md
- docs/Manifest.toml
🧰 Additional context used
🪛 GitHub Actions: JuliaFormatter
ext/ClimaCommsCUDAExt.jl
[error] 53-60: Code formatting or style change detected in static_shared_memory_array function. The function signature was reformatted, which caused the CI to fail.
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: docbuild
- GitHub Check: test-os (macos-latest)
🔇 Additional comments (9)
test/runtests.jl (2)
275-282: Derivative computation looks mathematically sound.

The finite difference implementation correctly handles boundary conditions with forward/backward differences at endpoints and central differences elsewhere.

325-331: Good handling of known allocation limitations.

Appropriately marking the allocation test as broken for single CPU thread while tracking the issue with a TODO.
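The boundary handling described in that derivative comment can be sketched as follows (an assumed form of the scheme, not the test file's exact code):

```julia
# Forward difference at the left endpoint, backward at the right endpoint,
# and central differences in the interior.
function finite_difference(y::AbstractVector, dx)
    n = length(y)
    dy = similar(y, float(eltype(y)))
    dy[1] = (y[2] - y[1]) / dx                 # forward difference
    dy[n] = (y[n] - y[n - 1]) / dx             # backward difference
    for i in 2:(n - 1)
        dy[i] = (y[i + 1] - y[i - 1]) / (2dx)  # central difference
    end
    return dy
end
```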
ext/ClimaCommsCUDAExt.jl (2)
60-66: CUDA dimension functions are now correctly implemented.

Good fix from the previous review: `blocks_in_kernel()` correctly uses `CUDA.gridDim().x` and `threads_in_block()` uses `CUDA.blockDim().x`.

143-188: Dual iterator kernel launch logic is well-structured.

The implementation correctly maps independent iterators to blocks and interdependent iterators to threads, with proper fallback to coarsening when limits are exceeded.
src/devices.jl (5)
451-560: Complex macro logic handles multiple iterator patterns correctly.

The parsing logic properly distinguishes independent and interdependent iterators, with appropriate error handling for invalid combinations. The fallback to CPU single-threaded execution is well-implemented.

504-506: Robust macro detection for interdependent iterators.

Good defensive programming: checking for both unqualified and fully qualified macro names prevents issues with macro hygiene.

610-613: Helpful error message for misused `@interdependent`.

Clear error message prevents confusion when the macro is used outside its intended context.

679-691: Well-designed type hierarchy for iterator data.

The abstract base type with concrete implementations for different scenarios (one item, multiple items, all items) provides clean dispatch and extensibility.

747-748: Device-agnostic shared memory abstraction is elegant.

Using `MArray` on CPUs to mimic GPU shared memory arrays provides a clean abstraction that enables portable high-performance code.
Force-pushed from 3706f1f to fcb8681
Force-pushed from 6fce308 to 33f3d48
Force-pushed from 31da3d1 to e711ba8
Purpose
This PR extends `ClimaComms.@threaded` so that we can use shared memory on GPUs in a device-agnostic way. Specifically, it introduces an `@interdependent` annotation for `@threaded` iterators, along with a `static_shared_memory_array` function for allocating shared memory and a `@sync_interdependent` macro for synchronizing interdependent threads. Unit tests and documentation with illustrative examples have also been provided.

This should be the minimal set of changes we need to replace the entirety of ClimaCore's CUDA extension with device-agnostic code. I will investigate the potential performance impacts of this in a future PR. For now, we can use this to test out performant kernels in ClimaAtmos without needing to add a new CUDA extension or `dev` other packages.
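Pieced together from the description above, usage of the new API might look roughly like this (a hypothetical sketch: the exact iterator syntax of `@threaded`/`@interdependent`, the argument order of `static_shared_memory_array`, and the helpers `load_value`/`store_result` are all assumptions, not the real API):

```julia
using ClimaComms

device = ClimaComms.device()

# Independent iterator `i` maps to GPU blocks (or CPU outer loop);
# the interdependent iterator `j` maps to threads within a block.
ClimaComms.@threaded device for i in 1:n_blocks,
                               ClimaComms.@interdependent(j in 1:n_threads)
    # One shared buffer per group of interdependent threads
    shmem = ClimaComms.static_shared_memory_array(device, Float64, n_threads)
    shmem[j] = load_value(i, j)          # hypothetical helper
    ClimaComms.@sync_interdependent      # wait until all threads have written
    store_result(i, j, shmem[mod1(j + 1, n_threads)])  # hypothetical helper
end
```

On a `CUDADevice` the buffer would be backed by `CUDA.CuStaticSharedArray`, while on CPUs it falls back to an `MArray`, which is what makes the kernel device-agnostic.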