
Extend threaded macro to use shared memory#113

Open
dennisYatunin wants to merge 3 commits into main from dy/gpu_threaded

Conversation


@dennisYatunin dennisYatunin commented May 24, 2025

Purpose

This PR extends ClimaComms.@threaded so that we can use shared memory on GPUs in a device-agnostic way. Specifically, it introduces an @interdependent annotation for @threaded iterators, along with a static_shared_memory_array function for allocating shared memory and a @sync_interdependent macro for synchronizing interdependent threads. Unit tests and documentation with illustrative examples have also been provided.

This should be the minimal set of changes we need to replace the entirety of ClimaCore's CUDA extension with device-agnostic code. I will investigate the potential performance impacts of this in a future PR. For now, we can use this to test out performant kernels in ClimaAtmos without needing to add a new CUDA extension or dev other packages.


  • I have read and checked the items on the review checklist.

@dennisYatunin dennisYatunin force-pushed the dy/gpu_threaded branch 23 times, most recently from 0122cbd to 0776306 Compare May 28, 2025 09:44
@dennisYatunin dennisYatunin marked this pull request as ready for review May 28, 2025 09:51

coderabbitai bot commented May 28, 2025

Walkthrough

This change extends the @threaded macro and the threaded function to support loops with one or two iterators, enabling interdependent parallelism on both CPU and CUDA devices. New macros (@interdependent, @sync_interdependent) and types for handling interdependent iterator data are introduced. CUDA kernel launch utilities and multi-dimensional threading with coarsening are added. Documentation and tests are updated to reflect the new API surface and verify correctness. The StaticArrays package is added as a dependency to support statically sized arrays, especially for CPU shared memory emulation.
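The pattern behind the new `@interdependent`/`@sync_interdependent` API can be illustrated with a conceptual Python analogue (this is not the package's Julia API; `neighbor_sum` and the use of `threading.Barrier` are hypothetical stand-ins): interdependent workers each write to a shared buffer, synchronize at a barrier so all writes are visible, then safely read their neighbors' slots.

```python
# Conceptual analogue of interdependent threads sharing memory.
# The barrier plays the role the @sync_interdependent macro plays on GPUs.
import threading

def neighbor_sum(values):
    n = len(values)
    shared = [0] * n                  # stand-in for a shared-memory array
    out = [0] * n
    barrier = threading.Barrier(n)    # stand-in for @sync_interdependent

    def worker(i):
        shared[i] = values[i]         # each worker fills its own slot
        barrier.wait()                # all writes complete before any reads
        left = shared[i - 1] if i > 0 else 0
        right = shared[i + 1] if i < n - 1 else 0
        out[i] = left + right         # read neighbors' slots safely

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

print(neighbor_sum([1, 2, 3, 4]))  # [2, 4, 6, 3]
```

Without the barrier, a worker could read a neighbor's slot before that neighbor has written it; this is exactly the race that explicit synchronization of interdependent threads prevents.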

Possibly related PRs

  • Extend threaded macro to CUDADevice #111: Adds initial support for single-iterator threaded kernel launches on CUDA devices; this PR directly builds on that foundation by introducing multi-dimensional and interdependent iterator support.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4fd26c and 479d480.

📒 Files selected for processing (8)
  • Project.toml (2 hunks)
  • docs/Manifest.toml (4 hunks)
  • docs/make.jl (1 hunks)
  • docs/src/apis.md (1 hunks)
  • docs/src/threaded.md (1 hunks)
  • ext/ClimaCommsCUDAExt.jl (5 hunks)
  • src/devices.jl (5 hunks)
  • test/runtests.jl (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/make.jl
🚧 Files skipped from review as they are similar to previous changes (3)
  • Project.toml
  • docs/src/apis.md
  • docs/Manifest.toml
🧰 Additional context used
🪛 LanguageTool
docs/src/threaded.md

[style] ~52-~52: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...However, this implementation results in a very large number of global memory reads: matrix...

(EN_WEAK_ADJECTIVE)


[uncategorized] ~79-~79: The hyphen in statically-sized is redundant.
Context: ...mory array used to store each column is statically-sized, so the number of rows in each column m...

(ADVERB_LY_HYPHEN_FIX)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: docbuild
🔇 Additional comments (10)
docs/src/threaded.md (1)

1-122: Excellent tutorial structure and code examples.

The tutorial effectively demonstrates the progression from naive parallelization to optimized shared memory implementations. The finite difference examples are mathematically correct and showcase the new API well.

🧰 Tools
🪛 LanguageTool

[style] ~52-~52: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...However, this implementation results in a very large number of global memory reads: matrix...

(EN_WEAK_ADJECTIVE)


[uncategorized] ~79-~79: The hyphen in statically-sized is redundant.
Context: ...mory array used to store each column is statically-sized, so the number of rows in each column m...

(ADVERB_LY_HYPHEN_FIX)

test/runtests.jl (2)

276-335: Comprehensive test coverage for interdependent threading.

The test design comparing threaded vs unthreaded implementations is solid. The broken allocation tests appropriately acknowledge current limitations.


337-403: Good coverage of dual-iterator scenarios.

The 2D test case effectively exercises the independent + interdependent iterator combinations.

ext/ClimaCommsCUDAExt.jl (3)

54-75: Clean CUDA utility functions.

The synchronization and kernel parameter query functions provide good abstractions over CUDA primitives.


146-191: Solid two-iterator kernel launch logic.

The independent/interdependent iterator mapping to blocks/threads is correct. The fallback to coarsening when limits are exceeded is well-handled.


225-277: Complex but necessary coarsening implementation.

The grid-stride loop with adaptive coarsening correctly handles CUDA's hardware constraints. The parameter validation is thorough.
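A grid-stride loop with coarsening, as described above, can be sketched in a few lines of Python (a conceptual sketch, not the extension's code; `grid_stride_indices` is a hypothetical name): when there are more items than threads, each thread processes every `num_threads`-th item, so all items are covered without exceeding the hardware thread limit.

```python
# Grid-stride iteration: thread t handles items t, t + T, t + 2T, ...
# for T total threads, covering each item exactly once.
def grid_stride_indices(thread_id, num_threads, num_items):
    """Return the item indices one thread handles in a grid-stride loop."""
    return list(range(thread_id, num_items, num_threads))

# With 4 threads and 10 items, every item is covered exactly once:
covered = sorted(i for t in range(4) for i in grid_stride_indices(t, 4, 10))
assert covered == list(range(10))
```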

src/devices.jl (4)

451-578: Complex but well-structured macro expansion.

The @threaded macro correctly handles the challenging syntax parsing for one/two iterators with interdependent annotations. The automatic @sync_interdependent filling is clever.


690-709: Clean abstraction for interdependent data.

The type hierarchy effectively represents different interdependent iterator scenarios while maintaining type safety.


725-752: Elegant synchronization abstraction.

The @sync_interdependent macro and its implementations correctly handle device differences while maintaining consistent syntax.


774-775: Good CPU fallback for shared memory.

Using MArray for CPU static shared memory provides consistent performance characteristics across devices.




@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
Project.toml (1)

26-26: Consider using a version range for StaticArrays.

The exact version constraint 1.9.13 might be too restrictive. Consider using a range like 1.9 to allow patch updates.

-StaticArrays = "1.9.13"
+StaticArrays = "1.9"
test/runtests.jl (2)

320-320: Consider tracking the allocation issue.

The @test_broken indicates known allocations in threaded execution. Consider adding a comment or issue reference explaining why allocations occur.

-    @test_broken threaded_allocations == 0
+    # TODO: Fix allocations in threaded execution (issue #XXX)
+    @test_broken threaded_allocations == 0

383-383: Consider tracking the allocation issue.

Same as above - consider documenting why allocations occur.

-    @test_broken threaded_allocations == 0
+    # TODO: Fix allocations in threaded execution (issue #XXX)
+    @test_broken threaded_allocations == 0
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a2bf168 and 0776306.

📒 Files selected for processing (6)
  • Project.toml (2 hunks)
  • docs/Manifest.toml (4 hunks)
  • docs/src/apis.md (1 hunks)
  • ext/ClimaCommsCUDAExt.jl (5 hunks)
  • src/devices.jl (5 hunks)
  • test/runtests.jl (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
*: # CodeRabbit Style Guide (CliMA Inspired)

Leverage CodeRabbit for code reviews aligning with CliMA's practices.

I. Key Areas for CodeRabbit:

  • Style: Naming (TitleCase, lowercase_with_underscores), line length (<78), indentation (4 spaces), import order.
  • DRY: Flag duplicated code; encourage modularization.
  • Docstrings: Flag missing docstrings for modules, structs, functions.
  • Tests: Detect missing unit tests (if configured).
  • Complexity: Report on cyclomatic complexity.

II. Conventions (CodeRabbit Can Help):

  • Naming: Follow CliMA/CMIP conventions. Avoid l, O, I as single-char vars.
  • Unicode: Human review needed. Acceptable chars.

III. Documentation (CodeRabbit Flags Deficiencies):

  • Docstrings: Describe purpose, inputs, outputs, examples.

IV. Testing (CodeRabbit: Missing Tests):

  • Unit tests required for new/modified code.

V. CodeRabbit Config:

  • .coderabbit.yaml: Customize language, line length, tests.

VI. Workflow:

  • Review CodeRabbit's suggestions; investigate, provide context, address issues, improve config.

VII. Abandoned Code: Delete or clearly mark as deprecated.

VIII. Reviewer Responsibilities: Ensure adherence to best practices.

🔇 Additional comments (9)
docs/src/apis.md (1)

37-42: Documentation updates look good!

The new API entries are well-organized within the Devices section.

docs/Manifest.toml (1)

1-436: Machine-generated manifest file.

test/runtests.jl (1)

268-384: Comprehensive test coverage!

The new tests thoroughly exercise the interdependent threading functionality with good edge case handling.

ext/ClimaCommsCUDAExt.jl (3)

56-57: LGTM!

Clean implementation of GPU shared memory allocation using CUDA's static shared arrays.


143-188: Well-structured multi-dimensional kernel launch!

The implementation correctly maps blocks to independent iterator items and threads within blocks to interdependent iterator items, with automatic coarsening when limits are exceeded.


223-227: Verify dimension function usage after fix.

This grid-stride loop implementation looks correct but depends on the fixed dimension functions from lines 59-66.

src/devices.jl (3)

483-559: Robust macro implementation!

The extended @threaded macro properly handles:

  • Single and nested loops
  • Independent and interdependent iterators
  • Device-specific code generation
  • Clear error messages for invalid usage

577-580: Clean annotation design!

The error-only implementation ensures @interdependent is used exclusively within @threaded loops.


714-715: ✅ Verification successful

Smart CPU shared memory emulation!

Using MArray provides stack-allocated, statically-sized arrays that mimic GPU shared memory performance characteristics.

Verify the StaticArrays dependency is properly declared:


🏁 Script executed:

#!/bin/bash
# Description: Verify StaticArrays is listed as a dependency
# Test: Check if StaticArrays is in Project.toml
rg -A 5 "StaticArrays" Project.toml

Length of output: 234


StaticArrays dependency confirmed
StaticArrays is properly declared in Project.toml. No further action required.

Comment thread ext/ClimaCommsCUDAExt.jl Outdated

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
ext/ClimaCommsCUDAExt.jl (2)

54-57: Fix formatting to resolve pipeline failure.

The pipeline indicates formatting issues in this function. Apply proper Julia formatting.

-ClimaComms.static_shared_memory_array(::CUDADevice, ::Type{T}, dims...) where {T} =
-    CUDA.CuStaticSharedArray(T, dims)
+ClimaComms.static_shared_memory_array(
+    ::CUDADevice,
+    ::Type{T},
+    dims...,
+) where {T} = CUDA.CuStaticSharedArray(T, dims)

215-216: Enhance error message clarity.

Consider making the error message more specific about which coarsening values are invalid.

-    (min_items_in_thread[1] > 0 && min_items_in_thread[2] > 0) ||
-        throw(ArgumentError("integer `coarsen` values must be positive"))
+    (min_items_in_thread[1] > 0 && min_items_in_thread[2] > 0) ||
+        throw(ArgumentError("both integer `coarsen` values must be positive, got $(min_items_in_thread)"))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0776306 and 00a171c.

📒 Files selected for processing (6)
  • Project.toml (2 hunks)
  • docs/Manifest.toml (4 hunks)
  • docs/src/apis.md (1 hunks)
  • ext/ClimaCommsCUDAExt.jl (5 hunks)
  • src/devices.jl (5 hunks)
  • test/runtests.jl (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • Project.toml
  • docs/src/apis.md
  • docs/Manifest.toml
🧰 Additional context used
🪛 GitHub Actions: JuliaFormatter
ext/ClimaCommsCUDAExt.jl

[error] 53-60: Code formatting or style change detected in static_shared_memory_array function. The function signature was reformatted, which caused the CI to fail.

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: docbuild
  • GitHub Check: test-os (macos-latest)
🔇 Additional comments (9)
test/runtests.jl (2)

275-282: Derivative computation looks mathematically sound.

The finite difference implementation correctly handles boundary conditions with forward/backward differences at endpoints and central differences elsewhere.
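The stencil described here — forward difference at the first point, backward at the last, central differences in the interior — can be written out as a short sketch (a Python illustration of the scheme on a uniform grid of spacing `h`, not the test suite's actual Julia code; `derivative` is a hypothetical name):

```python
# First derivative on a uniform grid with one-sided stencils at the
# boundaries and second-order central differences in the interior.
def derivative(f, h):
    n = len(f)
    d = [0.0] * n
    d[0] = (f[1] - f[0]) / h                    # forward difference
    d[n - 1] = (f[n - 1] - f[n - 2]) / h        # backward difference
    for i in range(1, n - 1):
        d[i] = (f[i + 1] - f[i - 1]) / (2 * h)  # central difference
    return d

# A linear function has a constant derivative everywhere:
assert derivative([0.0, 2.0, 4.0, 6.0], 1.0) == [2.0, 2.0, 2.0, 2.0]
```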


325-331: Good handling of known allocation limitations.

Appropriately marking the allocation test as broken for single CPU thread while tracking the issue with a TODO.

ext/ClimaCommsCUDAExt.jl (2)

60-66: CUDA dimension functions are now correctly implemented.

Good fix from the previous review - blocks_in_kernel() correctly uses CUDA.gridDim().x and threads_in_block() uses CUDA.blockDim().x.


143-188: Dual iterator kernel launch logic is well-structured.

The implementation correctly maps independent iterators to blocks and interdependent iterators to threads, with proper fallback to coarsening when limits are exceeded.
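The launch-planning logic the review describes can be sketched abstractly (a hypothetical Python model, not the extension's code; `plan_launch` and `max_threads` are illustrative names): one block per independent item, one thread per interdependent item, falling back to coarsening — several items per thread — once the interdependent count exceeds the thread limit.

```python
# Map independent items to blocks and interdependent items to threads,
# coarsening (items per thread > 1) when the thread limit is exceeded.
def plan_launch(n_independent, n_interdependent, max_threads):
    blocks = n_independent                       # one block per independent item
    if n_interdependent <= max_threads:
        threads, items_per_thread = n_interdependent, 1
    else:
        threads = max_threads
        # Ceiling division: enough items per thread to cover everything.
        items_per_thread = -(-n_interdependent // max_threads)
    return blocks, threads, items_per_thread

assert plan_launch(8, 100, 1024) == (8, 100, 1)    # fits: no coarsening
assert plan_launch(8, 3000, 1024) == (8, 1024, 3)  # ceil(3000 / 1024) = 3
```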

src/devices.jl (5)

451-560: Complex macro logic handles multiple iterator patterns correctly.

The parsing logic properly distinguishes independent and interdependent iterators, with appropriate error handling for invalid combinations. The fallback to CPU single-threaded execution is well-implemented.


504-506: Robust macro detection for interdependent iterators.

Good defensive programming - checking for both unqualified and fully qualified macro names prevents issues with macro hygiene.


610-613: Helpful error message for misused @interdependent.

Clear error message prevents confusion when the macro is used outside its intended context.


679-691: Well-designed type hierarchy for iterator data.

The abstract base type with concrete implementations for different scenarios (one item, multiple items, all items) provides clean dispatch and extensibility.


747-748: Device-agnostic shared memory abstraction is elegant.

Using MArray on CPUs to mimic GPU shared memory arrays provides a clean abstraction that enables portable high-performance code.

Comment thread test/runtests.jl Outdated
@dennisYatunin dennisYatunin force-pushed the dy/gpu_threaded branch 2 times, most recently from 3706f1f to fcb8681 Compare May 28, 2025 11:04
@dennisYatunin dennisYatunin force-pushed the dy/gpu_threaded branch 3 times, most recently from 6fce308 to 33f3d48 Compare August 15, 2025 09:23
@dennisYatunin dennisYatunin removed the request for review from sriharshakandala August 15, 2025 09:25
@dennisYatunin dennisYatunin force-pushed the dy/gpu_threaded branch 23 times, most recently from 31da3d1 to e711ba8 Compare August 22, 2025 01:54

Labels

None yet

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

2 participants