Conversation

lightsighter
Contributor

This branch adds support for instance redistricting on instances allocated in the dynamic GPU framebuffer memory. It does this by keeping the underlying CUDA allocation alive for as long as any instance still refers to it. Once all of the instances redistricted on top of the base allocation have been destroyed, the underlying CUDA allocation is freed back to the CUDA driver. Unfortunately, this creates a potential for fragmentation: a large initial allocation could leave just a few small instances surviving after redistricting, and the whole allocation stays pinned until they are destroyed. It is at least functionally correct.
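
The lifetime rule described above can be sketched with shared ownership. This is only an illustration, not the code in this branch: `BaseAllocation`, `Instance`, `carve`, and the release callback are hypothetical names, and the callback stands in for wherever the real implementation would return memory to the CUDA driver.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for the base CUDA allocation (not Realm's actual type).
// The release callback runs exactly once, when the last owner goes away.
class BaseAllocation {
public:
  BaseAllocation(std::size_t bytes, std::function<void()> release)
    : bytes_(bytes), release_(std::move(release)) {}
  ~BaseAllocation() { if (release_) release_(); }
  BaseAllocation(const BaseAllocation &) = delete;
  BaseAllocation &operator=(const BaseAllocation &) = delete;
  std::size_t bytes() const { return bytes_; }
private:
  std::size_t bytes_;
  std::function<void()> release_;
};

// Hypothetical instance: a sub-range of the base that keeps the base alive.
struct Instance {
  std::shared_ptr<BaseAllocation> base;
  std::size_t offset;
  std::size_t size;
};

// Carve a new instance out of `parent`; both share ownership of the base.
inline Instance carve(const Instance &parent, std::size_t offset, std::size_t size) {
  assert(offset + size <= parent.size);
  return Instance{parent.base, parent.offset + offset, size};
}

// Returns true if the base is released only after the last instance is destroyed.
inline bool demo_base_freed_only_after_last_instance() {
  bool freed = false;
  auto base = std::make_shared<BaseAllocation>(1024, [&freed] { freed = true; });
  Instance whole{std::move(base), 0, 1024};
  Instance a = carve(whole, 0, 100);
  Instance b = carve(whole, 100, 200);
  whole = Instance{};                      // original instance is redistricted away
  a = Instance{};                          // one survivor still pins the base
  const bool alive_with_one_left = !freed;
  b = Instance{};                          // last reference dropped
  return alive_with_one_left && freed;
}
```

In this sketch a single small surviving instance pins the entire base allocation, which is the fragmentation risk mentioned above.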

artempriakhin and others added 30 commits June 25, 2025 14:50
ci: Disable debug info in Rust binaries

See merge request StanfordLegion/legion!1811
ci: Disable debug info in Rust binaries

See merge request StanfordLegion/legion!1811

(cherry picked from commit 82c2ede)

227303c ci: Disable all debug info in Rust binaries to save space.

Co-authored-by: Elliott Slaughter <[email protected]>
legion-ci: remove the specific tag definition for msvc jobs

See merge request StanfordLegion/legion!1802
realm: fix examples and tests

See merge request StanfordLegion/legion!1813
disable cuhook build by default, add option

See merge request StanfordLegion/legion!1810
Realm: disable barrier broadcast

See merge request StanfordLegion/legion!1809
disable cuhook build by default, add option

See merge request StanfordLegion/legion!1810

(cherry picked from commit eb1ec26)

f6d559c disable cuhook build by default, add option
08572ac Merge branch 'master' into sm/fix_cuhook_make

Co-authored-by: Seema Mirchandaney <[email protected]>
Realm: disable barrier broadcast

See merge request StanfordLegion/legion!1809

(cherry picked from commit 3eba361)

8bb9bfb realm: add broadcast kill-switch
dd232c7 realm: add broadcast kill-switch
548e198 realm: remove assert
3609e63 realm: fix clang-format
c33960f realm: undo cmake changes
aa277d0 realm: remove empty line
487652d realm: disable fragmentation
3478e0e realm: disable broadcast_previous
539601f realm: disable tests
925078f realm: completely disable broadcast paths
fe8daf5 realm: undo test changes
bae24d7 realm: undo changes to runtime_impl.h
ea18bec realm: fix radix
6542a35 realm: fix clang-format
166f0c2 realm: fix typo
dace381 realm: undo lines
359e7c9 realm: undo namespace changes

Co-authored-by: apryakhin <[email protected]>
[P0] build: Turn RDTSC off by default until we can put the PPC check back

See merge request StanfordLegion/legion!1816
[P0] build: Turn RDTSC off by default until we can put the PPC check back

See merge request StanfordLegion/legion!1816

(cherry picked from commit 9a4604e)

83ea9d6 build: Turn RDTSC off by default until we can put the PPC check back.

Co-authored-by: Elliott Slaughter <[email protected]>
muraj and others added 21 commits August 25, 2025 10:11
* cuGetProcAddress broke its own rules for a couple of APIs and kept
back old versions of APIs, but left cuGetProcAddress to return the newer
ones, breaking source compatibility guarantees for 13.0
* For those apis that were held back, an explicit context or location
was added, so modify the calls to pass in the required arguments, being
compatible with older toolkits as well.
#243)

…ring the build. This is required for clangd, et al, to function.
Also adds the gh copilot thinking file.
Change the behavior to check the target was defined and the feature was
enabled before setting it to be used/built. This is useful when the
project is built as part of other projects that may have found a package
but do not necessarily want that feature to be built by this project.
* Add sanitizer tests for ASAN, TSAN, and UBSAN
  * Unfortunately we have a leak in the python module, so disable that.
  * Fix a couple of found leaks
* Add ucx and gasnet builds to the ci
* Fix up cuda compile args to compile SASS for everything up to the last
architecture, adding ptx for the latest (fixing an issue with r580 when
both SASS and PTX are available for a removed architecture)
* Add a REALM_DEFAULT_ARGS environment and put the default test
arguments here, prepended to what is set in the test environment at
runtime. This allows us to set the default, but allow the test
environment to override it.
* This allows us to, by default, enable GPU tests with one gpu, but
override it in the test environment for more than one GPU
This changes request_cancellation to return a bool stating success/fail
and also refactors the flow to make it readable.
fixup! Add the ASSERT_REALM macro to machine_config_test to make it more readable (#259)
This eliminates the need for timed_wait as a separate function. Then update the tests to the new API definition
@lightsighter lightsighter self-assigned this Aug 27, 2025

codecov bot commented Aug 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 26.48%. Comparing base (77790b2) to head (10c1e79).

Additional details and impacted files
@@           Coverage Diff            @@
##             main     #272    +/-   ##
========================================
  Coverage   26.48%   26.48%            
========================================
  Files         185      185            
  Lines       38774    38774            
  Branches    14316    14192   -124     
========================================
+ Hits        10270    10271     +1     
+ Misses      28127    27217   -910     
- Partials      377     1286   +909     


@manopapad
Contributor

@lightsighter Say you have an initial 1GB CUDA allocation, and you redistrict it such that you're only using the first 100MB going forward. Can Realm still serve deferred allocations using the remaining 900MB? Or is it just dead memory at this point?

@muraj muraj added the enhancement New feature or request label Aug 27, 2025
@lightsighter
Contributor Author

> @lightsighter Say you have an initial 1GB CUDA allocation, and you redistrict it such that you're only using the first 100MB going forward. Can Realm still serve deferred allocations using the remaining 900MB? Or is it just dead memory at this point?

At the moment, the memory is just unusable. It will require a more sophisticated implementation of the GPU dynamic memory to support that.
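
As a rough illustration of the cost in the example above, the stranded capacity is the base size minus the bytes still covered by live instances. The helper below is a hypothetical sketch, not part of Realm, and it assumes decimal units (1 GB = 1000 MB) to match the numbers in the question.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Assumption: decimal megabytes, so that 1 GB - 100 MB = 900 MB as in the question.
constexpr std::size_t kMB = 1000 * 1000;

// Hypothetical helper: bytes of a base allocation that no live instance covers
// and that therefore cannot be reused until the whole base is freed.
inline std::size_t stranded_bytes(std::size_t base_bytes,
                                  const std::vector<std::size_t> &live_instance_bytes) {
  const std::size_t used = std::accumulate(live_instance_bytes.begin(),
                                           live_instance_bytes.end(),
                                           std::size_t{0});
  return used <= base_bytes ? base_bytes - used : 0;
}
```

For the case discussed here, `stranded_bytes(1000 * kMB, {100 * kMB})` evaluates to `900 * kMB`.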

@apryakhin
Contributor

apryakhin commented Aug 27, 2025

@lightsighter Does this need to be reviewed now, or are we still in the POC/testing stage?

@lightsighter
Contributor Author

> @lightsighter Does this need to be reviewed now, or are we still in the POC/testing stage?

That depends on how quickly @manopapad needs this to be supported. If he is willing to wait for @muraj's implementation of the new GPU dynamic framebuffer memory, then we can close this request. If he needs a stop-gap solution so that instance redistricting is functional (but not "performant", in the memory capacity sense), then we will need to review and merge this.
