Instance Redistricting for Dynamic Framebuffer Memory #272

lightsighter · 2025-08-27T07:03:29Z

This branch adds support for instance redistricting on instances allocated in the dynamic GPU framebuffer memory. It does this by keeping the underlying CUDA allocation alive as long as any live instance still refers to it. Once all the instances redistricted on top of the base allocation have been destroyed, the underlying CUDA allocation is freed back to the CUDA driver. Unfortunately, this creates a potential for fragmentation: a large initial allocation can stay pinned by just a few small instances that survive redistricting. It is at least functionally correct.
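The keep-alive behavior described above can be illustrated with shared ownership. This is a minimal sketch, not the code in this branch: `BaseAllocation`, `Instance`, and `redistrict` are hypothetical names, and the release callback stands in for freeing the pointer back to the CUDA driver so that the sketch runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for one driver allocation. In real code the release
// callback would return the memory to the CUDA driver; here it is a plain
// callback so the sketch has no GPU dependency.
struct BaseAllocation {
  std::size_t bytes;
  std::function<void()> release;

  BaseAllocation(std::size_t b, std::function<void()> r)
      : bytes(b), release(std::move(r)) {}
  ~BaseAllocation() {
    if (release) release();  // runs when the last owner goes away
  }
  BaseAllocation(const BaseAllocation&) = delete;
  BaseAllocation& operator=(const BaseAllocation&) = delete;
};

// Hypothetical instance: a byte range inside a shared base allocation.
struct Instance {
  std::shared_ptr<BaseAllocation> base;
  std::size_t offset;
  std::size_t size;
};

// Carve new instances out of an existing one. Every new instance shares
// ownership of the same base allocation, so the base outlives the old
// instance for as long as at least one new instance is alive.
std::vector<Instance> redistrict(Instance old_inst,
                                 const std::vector<std::size_t>& sizes) {
  std::vector<Instance> out;
  std::size_t cursor = old_inst.offset;
  for (std::size_t s : sizes) {
    assert(cursor + s <= old_inst.offset + old_inst.size);
    out.push_back(Instance{old_inst.base, cursor, s});
    cursor += s;
  }
  return out;
}
```

In this sketch, moving a 1024-byte instance into `redistrict` with sizes `{100, 200}` yields two instances that keep the base alive; the release callback fires only after both have been destroyed. Any bytes not covered by a surviving instance stay reserved until then, which is the fragmentation risk noted above.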

ci: Disable debug info in Rust binaries
See merge request StanfordLegion/legion!1811
(cherry picked from commit 82c2ede)
227303c ci: Disable all debug info in Rust binaries to save space.
Co-authored-by: Elliott Slaughter

disable cuhook build by default, add option
See merge request StanfordLegion/legion!1810
(cherry picked from commit eb1ec26)
f6d559c disable cuhook build by default, add option
08572ac Merge branch 'master' into sm/fix_cuhook_make
Co-authored-by: Seema Mirchandaney

Realm: disable barrier broadcast
See merge request StanfordLegion/legion!1809
(cherry picked from commit 3eba361)
8bb9bfb realm: add broadcast kill-switch
dd232c7 realm: add broadcast kill-switch
548e198 realm: remove assert
3609e63 realm: fix clang-format
c33960f realm: undo cmake changes
aa277d0 realm: remove empty line
487652d realm: disable fragmentation
3478e0e realm: disable broadcast_previous
539601f realm: disable tests
925078f realm: completely disable broadcast paths
fe8daf5 realm: undo test changes
bae24d7 realm: undo changes to runtime_impl.h
ea18bec realm: fix radix
6542a35 realm: fix clang-format
166f0c2 realm: fix typo
dace381 realm: undo lines
359e7c9 realm: undo namespace changes
Co-authored-by: apryakhin

[P0] build: Turn RDTSC off by default until we can put the PPC check back
See merge request StanfordLegion/legion!1816
(cherry picked from commit 9a4604e)
83ea9d6 build: Turn RDTSC off by default until we can put the PPC check back.
Co-authored-by: Elliott Slaughter

* cuGetProcAddress broke its own rules for a couple of APIs and kept back old versions of APIs, but left cuGetProcAddress to return the newer ones, breaking source compatibility guarantees for 13.0
* For those APIs that were held back, an explicit context or location was added, so modify the calls to pass in the required arguments, while staying compatible with older toolkits as well.

Change the behavior to check the target was defined and the feature was enabled before setting it to be used/built. This is useful when the project is built as part of other projects that may have found a package but do not necessarily want that feature to be built by this project.

* Fix up CUDA compile args to compile SASS for everything up to the last architecture, adding PTX for the latest (fixing an issue with r580 when both SASS and PTX are available for a removed architecture)
* Add a REALM_DEFAULT_ARGS environment and put the default test arguments here, prepended to what is set in the test environment at runtime. This allows us to set the default, but allow the test environment to override it.
* This allows us to, by default, enable GPU tests with one GPU, but override it in the test environment for more than one GPU

codecov · 2025-08-27T07:15:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 26.48%. Comparing base (77790b2) to head (10c1e79).

Additional details and impacted files

@@           Coverage Diff            @@
##             main     #272    +/-   ##
========================================
  Coverage   26.48%   26.48%            
========================================
  Files         185      185            
  Lines       38774    38774            
  Branches    14316    14192   -124     
========================================
+ Hits        10270    10271     +1     
+ Misses      28127    27217   -910     
- Partials      377     1286   +909

manopapad · 2025-08-27T16:41:02Z

@lightsighter Say you have an initial 1GB CUDA allocation, and you redistrict it such that you're only using the first 100MB going forward. Can Realm still serve deferred allocations using the remaining 900MB? Or is it just dead memory at this point?

lightsighter · 2025-08-27T20:04:07Z

> @lightsighter Say you have an initial 1GB CUDA allocation, and you redistrict it such that you're only using the first 100MB going forward. Can Realm still serve deferred allocations using the remaining 900MB? Or is it just dead memory at this point?

At the moment, the memory is just unusable. It will require a more sophisticated implementation of the GPU dynamic memory to support that.
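To make the numbers in the question concrete, here is a minimal sketch of the accounting. The helper `stranded_bytes` is hypothetical, not a Realm API; it only computes how much of a base allocation is covered by no live instance and is therefore reserved but unusable under the current scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Bytes of a base allocation that no live instance covers. While any
// instance keeps the base allocation alive, these bytes stay reserved
// but cannot serve other allocations.
std::size_t stranded_bytes(std::size_t base_bytes,
                           const std::vector<std::size_t>& live_sizes) {
  const std::size_t used =
      std::accumulate(live_sizes.begin(), live_sizes.end(), std::size_t{0});
  return used >= base_bytes ? 0 : base_bytes - used;
}
```

With decimal units, a 1000 MB base allocation with one surviving 100 MB instance gives 900 MB stranded, which matches the 1GB/100MB/900MB example in the question.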

apryakhin · 2025-08-27T22:28:59Z

@lightsighter Does this need to be reviewed now, or are we still in the POC/testing stage?

lightsighter · 2025-08-27T22:37:22Z

> @lightsighter Does this need to be reviewed now or we are in POC/testing stage?

Depends on how quickly @manopapad needs it to be supported. If he is willing to wait for @muraj's implementation of the new GPU dynamic framebuffer memory, then we can close this request. If he needs a stop-gap solution so that instance redistricting is functional (but not "performant" in the memory-capacity sense), then we will need to review and merge this.

artempriakhin and others added 30 commits June 25, 2025 14:50

realm: remove assert

548e198

disable cuhook build by default, add option

f6d559c

realm: fix clang-format

3609e63

realm: undo cmake changes

c33960f

ci: Disable all debug info in Rust binaries to save space.

227303c

realm: remove empty line

aa277d0

Merge branch 'ci-rust-release' into 'master'

82c2ede

ci: Disable debug info in Rust binaries See merge request StanfordLegion/legion!1811

Merge branch 'cperry/msvc-runner-fix' into 'master'

be8b64d

legion-ci: remove the specific tag definition for msvc jobs See merge request StanfordLegion/legion!1802

realm: fix examples and tests

f9b552e

Merge branch 'cperry/ci-fix' into 'master'

d9e987e

realm: fix examples and tests See merge request StanfordLegion/legion!1813

Merge branch 'master' into sm/fix_cuhook_make

08572ac

Merge branch 'sm/fix_cuhook_make' into 'master'

eb1ec26

disable cuhook build by default, add option See merge request StanfordLegion/legion!1810

realm: disable fragmentation

487652d

realm: disable broadcast_previous

3478e0e

realm: disable tests

539601f

realm: completely disable broadcast paths

925078f

realm: undo test changes

fe8daf5

realm: undo changes to runtime_impl.h

bae24d7

realm: fix radix

ea18bec

realm: fix clang-format

6542a35

realm: fix typo

166f0c2

realm: undo lines

dace381

realm: undo namespace changes

359e7c9

Merge branch 'apriakhin/barrier-kill-switch' into 'master'

3eba361

Realm: disable barrier broadcast See merge request StanfordLegion/legion!1809

build: Turn RDTSC off by default until we can put the PPC check back.

83ea9d6

Merge branch 'rdtsc-off-default' into 'master'

9a4604e

[P0] build: Turn RDTSC off by default until we can put the PPC check back See merge request StanfordLegion/legion!1816

muraj and others added 21 commits August 25, 2025 10:11

Add git attributes and VERSION file for source archive versioning

af319e6

Fix build log update and add codecov badge

62f4a1d

Add the workflow file as a trigger for updating code coverage

387ed2b

Add remote trigger of gitlab pipelines

3cf4be3

Turn on the cmake setting to generate a compile_commands.json file du… (

b9dcbb4

#243) …ring the build. This is required for clangd, et al, to function.

Add cursor's settings folder to the gitignore file (#244)

da7b565

Also adds the gh copilot thinking file.

Various CI fixes (#248)

0de0f8a

* Add sanitizer tests for ASAN, TSAN, and UBSAN
* Unfortunately we have a leak in the python module, so disable that.
* Fix a couple of found leaks
* Add ucx and gasnet builds to the ci

Rewrite logic to remove NDEBUG and apply it to all builds (#250)

9439158

Refactor the CAPI integration tests to use the CHECK_REALM macro (#258)

2213377

Add the ASSERT_REALM macro to machine_config_test to make it more rea…

7ef8789

…dable (#259)

Remove accidently added autotracing test

5641c2f

Add missing copyright information in test

a9d38eb

Refactor OperationTable::request_cancellation (#255)

69e9d27

This changes request_cancellation to return a bool stating success/fail and also refactors the flow to make it readable.

Update github actions to build cuda 13.0 (#268)

147fc23

Fix cmake prepend to include a space in REALM_DEFAULT_ARGS

9c056f5

Fix machine_config_test to check return value

9507ac1

fixup! Add the ASSERT_REALM macro to machine_config_test to make it more readable (#259)

Refactor realm_event_wait to take a timer parameter (#256)

77790b2

This eliminates the need for timed_wait as a separate function. Then update the tests to the new API definition

Add support for instance redistricting for instances allocated in the…

7327197

… gpu dynamic framebuffer memory

lightsighter self-assigned this Aug 27, 2025

fix formatting for dynamic framebuffer redistricting

10c1e79

muraj added the enhancement New feature or request label Aug 27, 2025

muraj force-pushed the main branch from a39ff4c to 9181321 Compare September 19, 2025 20:15
