Byfield gpu #271
Conversation
realm: fix and add capi tests (see merge request StanfordLegion/legion!1812)
[P0] test: Match the allocation function used in test (see merge request StanfordLegion/legion!1817; cherry picked from commit a101776) f753d5d Co-authored-by: Elliott Slaughter <[email protected]>
update realm release notes for 25.06 (see merge request StanfordLegion/legion!1818) Co-authored-by: Elliott Slaughter <[email protected]>
realm: update gitlab-ci to build realm externally and fix bootstrap_mpi in new build system (see merge request StanfordLegion/legion!1820)
Refactor DynamicTable such that we can pass a RuntimeImpl to the elements of DynamicTableNode (see merge request StanfordLegion/legion!1814)
test: Re-enable attach tests on macOS (see merge request StanfordLegion/legion!1826)
legion: Modify legion cmake to pick up the new realm build system (see merge request StanfordLegion/legion!1644)
regent: Build Legion with CMake by default (see merge request StanfordLegion/legion!1825)
ci: Run MSVC jobs on AVX only (see merge request StanfordLegion/legion!1828)
@rohany @lightsighter @artempriakhin Here's a writeup for the two proposals on how to handle dynamic allocations: https://docs.google.com/presentation/d/1JQrFJlsi_BNijg2eTmTd3DxqYCBQYjiZsqNl0FP8k10/edit?usp=sharing Proposal 1: If an allocation fails or is deferred, recycle all the sparsity map outputs and set their IDs to 0. The user is then responsible for checking the sparsity IDs of their output index spaces once the partitioning op event resolves and for handling failure. |
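For concreteness, here is a minimal sketch of what the client-side check under Proposal 1 could look like. It assumes Realm's existing `create_subspaces_by_field` signature; the "recycled outputs have sparsity ID 0" convention is only what the proposal describes, not current behavior.

```cpp
// Sketch only: the "sparsity.id == 0 on failure" convention is Proposal 1's
// hypothetical contract, not something Realm guarantees today.
#include "realm.h"
#include <vector>

using namespace Realm;

void by_field_with_failure_check(
    IndexSpace<1> parent,
    const std::vector<FieldDataDescriptor<IndexSpace<1>, int> > &field_data,
    const std::vector<int> &colors)
{
  std::vector<IndexSpace<1> > subspaces;
  Event e = parent.create_subspaces_by_field(field_data, colors, subspaces,
                                             ProfilingRequestSet());
  e.wait(); // partitioning op event has resolved

  for(size_t i = 0; i < subspaces.size(); i++) {
    if(subspaces[i].dense())
      continue; // dense outputs carry no sparsity map at all
    if(subspaces[i].sparsity.id == 0) {
      // Allocation failed or was deferred: free memory and retry, fall back
      // to the CPU path, or abort, per whatever policy the client chooses.
    }
  }
}
```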
Mike and Artem, please take a look at this document and let Rohan know what route he should be going down. |
I have some questions:
I will also offer two other proposals for completeness: Proposal 3: Same as proposal 1, but the operation must report in a profiling response exactly how much memory needs to be reserved, and there must be a way for the caller to pass in a buffer of the appropriate size so that it doesn't fail again. A somewhat unrelated question: do we want to scope how much of the GPU is available for use with dependent partitioning? We might not want deppart operations consuming the entire GPU if there are high-priority tasks also trying to run on the GPU. We could create a green context for deppart operations to scope them to a subset of SMs. We could also run the kernel launches through the normal GPU task path and allow for priorities on deppart operations (like we already do with copies/fills). |
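For concreteness, a rough sketch of the pieces Proposal 3 would add; none of these types exist in Realm today, and the names are invented for illustration only.

```cpp
// Hypothetical sketch of Proposal 3's contract (nothing here exists in Realm):
// a failed/deferred deppart op reports exactly how much memory it needed via
// a profiling response, and the caller retries with a buffer of that size.
#include <cstddef>

namespace hypothetical {

// Measurement delivered in the profiling response of the failed operation.
struct DeppartMemoryNeeded {
  size_t bytes_needed;   // exact reservation the retry must provide
};

// Caller-owned buffer handed back on retry so the allocation cannot fail again.
struct DeppartScratchBuffer {
  void  *base;           // allocation in the memory the op will run against
  size_t bytes;          // must be >= bytes_needed from the response
};

} // namespace hypothetical
```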
As it's written, proposal 1 would require the client either to have enough knowledge of their application's memory usage to guarantee sufficient space for the deppart allocations, or to take an experimental approach (i.e., iteratively free other memory until the deppart succeeds or they can't free any more, in which case they'd probably have to crash the application) - incorporating proposal 3 would make this process less of a black box.
I think the simplest way would be as an API/configuration argument. Without client specification, it would probably default to the instance the field data is in for by-field, image, and preimage, and to an output of the cost-model calculation for the set operations.
Either way would work - it seems like the first way would be more Legion-esque in that the user can write their application without thinking about where it'll run and then some sort of mapping/cost-model calculation makes that decision at runtime. For this to work, a lot of the auxiliary calls would just be no-ops in the CPU path.
I would probably try to mirror how the current deppart calls take a dependent event as an argument, so they're non-blocking but can't start doing actual work until their provided dependency resolves.
Any version of proposal 1 should definitely provide information to the client on failure, but in some cases I don't think it's possible to tell them the size they need to get to the end of the operation. As proposal 2 shows, usually we can just tell them enough to get to the next "stage."
In this case would they be assuming the responsibility of preventing deadlock?
I think this makes a lot of sense. One consideration is that the lifetime of GPU deppart operations will generally be a small fraction of the application's runtime, but they need a meaningful chunk of memory for their small burst - maybe this could include a way for the client to reclaim the memory when they know they aren't doing any partitioning?
Either way works for me - I think copies/fills are the only other device code in Realm right now so I'd be happy to mirror that. Legion is by far the biggest consumer of Realm, so I'll gladly defer to whatever combination of these proposals you think would be the most natural/intuitive for incorporating GPU-deppart into Legion while maintaining Realm's invariants. |
Rather than respond to each of the comments above for each different proposal, I'm going to focus on this:
If it is possible for the code to work out of a smaller buffer as well as to run faster with a larger one, then I think the interface I would like would be a two-phase model. In the first phase we do some kind of API call that estimates bounds on memory usage and returns two sizes: a minimal size needed to run the computation and an "optimal" size that ensures that deppart can get "good" performance (I feel like this value should be more a function of the GPU being targeted than a property of the data being used for the deppart operation, but tell me if I am wrong about that assumption). If Realm is going to run on CPUs it will just return zero for both values (or maybe a small value; I'm not sure whether we use buffers in the CPU implementation). The client can then do an allocation at either value (or somewhere in between if they're willing to search) and pass that in to the implementation. Tell me if that is possible, and if not then we'll go back to exploring the other proposals.
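A sketch of what that two-phase interface might look like; the names (`MemoryBounds`, `query_deppart_memory_bounds`) are invented for illustration and are not part of Realm.

```cpp
// Hypothetical two-phase interface: phase 1 asks for bounds, phase 2 runs the
// op against a buffer the client allocated somewhere between the two bounds.
#include <algorithm>
#include <cstddef>

namespace hypothetical {

struct MemoryBounds {
  size_t minimal_bytes; // smallest buffer that still allows forward progress
  size_t optimal_bytes; // size beyond which no further speedup is expected
                        // (a CPU-only build could return 0, or a small value,
                        // for both)
};

// Phase 1: estimate bounds for a specific deppart op on a specific GPU.
MemoryBounds query_deppart_memory_bounds(/* op description, target GPU */);

// Phase 2 helper: spend up to a budget, but never less than the minimum and
// never more than is useful.
inline size_t choose_buffer_size(const MemoryBounds &b, size_t budget_bytes)
{
  return std::min(b.optimal_bytes, std::max(b.minimal_bytes, budget_bytes));
}

} // namespace hypothetical
```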
Even though I care a lot about the Legion implementation, we've always designed Realm so that it is Legion-agnostic. Realm should be usable by lots of different clients (I'm actually working on a different project built on Realm right now, it's just not public), so we try very hard to maintain that independence when designing APIs for Realm. |
"https://docs.google.com/presentation/d/1JQrFJlsi_BNijg2eTmTd3DxqYCBQYjiZsqNl0FP8k10/edit?usp=sharing"
Any accommodation I can think of to these challenges ends up serializing things to the point of defeating the purpose of GPU partitioning, although I'm definitely open to suggestions if there's a better way to do things in less than the "optimal" size. Here's an approach Rohan and I talked through that adapts Proposal 2 above and addresses the earlier questions. Here's the state:
Here’s what an image (but really any op) would look like:
Some notes: |
Can you say what your definition of the "optimal size" is? To be fair, let me give you my definition of "optimal size". In general, all of these dependent partitioning operations are going to be memory bandwidth limited. You can saturate the memory bandwidth of a GPU with many fewer SMs than most GPUs have (only really tiny GPUs are balanced and you'll mostly find them in laptops). Therefore, my definition of "optimal size" is however much memory I would need to allocate in order to have a big enough working set for the number of SMs required to saturate memory bandwidth. After that, everything else should just be executed in serial because even if I "parallelize" it with more blocks/threads, it's also going to be implicitly serialized anyway by the memory controller for the framebuffer memory.
If you serialize things only after having saturated memory bandwidth, then you're not going to lose any performance. In fact, I'll add that you might even be able to get a speed-up by hitting more frequently in the L2 cache ;). |
To further elaborate, you can do all of this with a single kernel launch using a small number of threadblocks and a cooperative kernel; the GPU version of the algorithm then looks a bit like the CPU algorithm, with the inner loop(s) parallelized across some number of threads/threadblocks and the necessary synchronization inserted using cooperative groups. You can then scale it up or down as necessary depending on the available memory bandwidth of the particular GPU you're running on. Bonus points if you can make it use temporal locality to leverage the L2 cache. |
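A minimal CUDA sketch of the structure being suggested: one cooperative launch with a modest number of blocks, iterating over tiles of the input with grid-wide synchronization between phases. The phase bodies are placeholders, not the actual deppart computation.

```cuda
// One cooperative launch, grid.sync() between the per-tile "coalesce" phase
// and the "integrate into the active set" phase. Phase bodies are placeholders;
// this only shows the launch/sync shape.
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void tiled_deppart(const int *input, size_t num_elems,
                              int *working_buf, size_t tile_elems)
{
  cg::grid_group grid = cg::this_grid();
  for(size_t base = 0; base < num_elems; base += tile_elems) {
    size_t tile = (num_elems - base < tile_elems) ? (num_elems - base) : tile_elems;
    // Phase 1: all blocks cooperatively process this tile into the fixed buffer.
    for(size_t i = grid.thread_rank(); i < tile; i += grid.size())
      working_buf[i] = input[base + i]; // placeholder for the coalescing work
    grid.sync(); // the whole grid sees the processed tile before integration
    // Phase 2: integrate the tile into the active set (placeholder), then loop.
    grid.sync();
  }
}

int main()
{
  // Launch only as many blocks as the occupancy calculator allows so the
  // cooperative launch is valid; in practice you would cap this at however
  // many SMs are needed to saturate memory bandwidth.
  int dev = 0, sms = 0, blocks_per_sm = 0;
  cudaGetDevice(&dev);
  cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, tiled_deppart, 256, 0);

  size_t n = 1 << 20, tile = 1 << 16;
  int *input = 0, *working = 0;
  cudaMalloc(&input, n * sizeof(int));
  cudaMalloc(&working, tile * sizeof(int));

  void *args[] = { &input, &n, &working, &tile };
  cudaLaunchCooperativeKernel((void *)tiled_deppart, dim3(sms * blocks_per_sm),
                              dim3(256), args, 0, 0);
  cudaDeviceSynchronize();
  cudaFree(input);
  cudaFree(working);
  return 0;
}
```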
What I mean by “optimal” size is enough to keep all the data the partitioning operation needs on the device for the entire operation.
What I mean here is that in order to use any amount of memory less than the "optimal" size, you have to tile the operation in some way.
In the approach you're describing, the tiling would be such that the total amount of work done stays the same, but the active working set at any given time is smaller. Thus, there'd only be performance degradation if you're not saturating memory bandwidth, but no matter what, the total amount of computation you're doing would be the same (just N pieces of size 1/N rather than 1 whole piece). I don't think it's possible to tile the operations in this way because of the last step, where all the points/rectangles are coalesced into a sparsity map, because the points/rectangles aren't independent (they need to know if they duplicate/overlap any others). If you split the operation into N pieces, you'd have to do the following: coalesce the first 1/N points, keep that active somewhere, coalesce the next 1/N points, merge that into the working set (more expensive than the coalescing), coalesce the next 1/N points, merge that, etc. So rather than doing the same amount of work with a smaller working set, this adds a meaningful amount of computation that wouldn't have originally been done with 1 big piece (all the merge operations); see the sketch below. Each merge operation would have to do the slide 19 algorithm on the combination of the two spaces: https://docs.google.com/presentation/d/1Iwo0IwXBk14-E8i5i5kLN2B0KP7Cl-RwC0yDIN0_Fh8/edit?usp=sharing
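A host-side sketch of the loop structure just described, only to make the extra merge work explicit; all helpers are hypothetical placeholders.

```cpp
// Each chunk is coalesced independently (same total coalescing work as the
// single-pass version, 1/N at a time), but every chunk after the first also
// pays for a merge into the running active set, which the single-pass version
// never does. All functions here are hypothetical placeholders.
#include <cstddef>
#include <vector>

struct Rect1 { long lo, hi; };                                   // stand-in for a 1-D rectangle

std::vector<Rect1> coalesce_chunk_on_gpu(size_t chunk_index);    // hypothetical
std::vector<Rect1> merge_rect_sets(const std::vector<Rect1> &a,  // hypothetical: dedup/overlap
                                   const std::vector<Rect1> &b); // pass over both inputs

std::vector<Rect1> tiled_coalesce(size_t num_chunks)
{
  std::vector<Rect1> active; // running, disjoint working set
  for(size_t c = 0; c < num_chunks; c++) {
    std::vector<Rect1> piece = coalesce_chunk_on_gpu(c);
    active = merge_rect_sets(active, piece); // the extra work: N-1 merge passes
  }
  return active;
}
```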
Here I don’t mean serialized in terms of less block/thread parallelism, but rather in terms of 1 bulk operation serialized into N tile operations, each of which requires a new, expensive integration step in between. And the nature of the computation goes from something that’s well-suited for the GPU (dumping a bulk operation onto the device and treating its elements as independent until the last possible moment) to something that’s generally better suited for the CPU (maintaining an active, disjoint state and iteratively merging pieces into it, which is also exactly what the CPU deppart path currently does).
I also still think this is a key challenge for any tiling approach with a fixed-size buffer. Given a fixed size you have to stay within, I don't think it's possible to efficiently choose an input chunk size that's guaranteed to stay within that size for the entire operation, i.e. to efficiently invert the pipeline. So when you choose a tile, there's a chance that tile will get to, say, the third stage and grow bigger than your buffer, at which point the operation would have to fail. If it's helpful, we could also hop on a call to discuss this. |
Do you believe that is strictly necessary to achieve speed-of-light performance, or is it purely a convenience because it makes the code easier to write?
Using your definition of "optimal" size then yes, tiling is required.
Well, I know that it can be done because the CPU code does that (with tile size=1). ;) It just so happens that the CPU code does it in serial and therefore doesn't need to perform any additional conflict checks or synchronization. If you tile it the way I'm describing, then some additional synchronization will be required because effectively you're going to take a subset of rectangles and try to reduce them in parallel, but you'll also need to check for conflicts with other rectangles in the subset being reduced. This requires additional synchronization and if we were doing this across the entire GPU, then I would agree that would be a bad idea. However, presuming we're only operating on a subset of the GPU needed to saturate memory bandwidth then the overhead of those extra checks and synchronization should not be that bad.
Computation is free as long as we're saturating the memory bandwidth. You can probably afford 5-10X more computation than memory accesses without any reduction in performance (some of which is likely consumed by thread divergence or shared memory bank conflicts, but that will be measured in percentage reductions). Presuming you're keeping the rectangle data in shared memory, you can even have all the threads looking for conflicts by going through the distributed shared memory network across blocks so you won't incur any extra global memory traffic.
That's what I mean too. Each tiled operation is still parallelized over threads/blocks, just many fewer than blasting them all out unnecessarily and having the memory controller implicitly serialize them. If you keep the integration step on-chip you'll be winning even if it appears like it does more compute than would otherwise be necessary because that will be faster than being memory bandwidth limited.
Here's where I think we fundamentally disagree. I'm arguing for a hybrid approach that leverages a certain degree of parallelism on the GPU without oversubscribing resources, keeps data resident on-chip for longer, and is likely to be faster and more resource-efficient than constantly streaming data to and from global memory. The end result should be something that looks like a cross between the CPU-only algorithm and the streaming GPU algorithm that we have today.
When this happens, why isn't the answer just to reduce the tile size and try again without changing the memory requirements? In the limit, you'll get to a tile size of 1 input rectangle at a time which is effectively the CPU-only algorithm and will be very slow, but is highly likely to fit in the buffer. It's still possible that one input rectangle explodes into a gigantic number of output rectangles at which point you can start tiling that one input rectangle into points. Then you know things are going to fit into memory because points are atoms and won't "shatter" into more points. |
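A host-side sketch of the fallback policy described above; `run_tiled_rects` and `run_tiled_points` are hypothetical placeholders assumed to report whether the operation stayed within the fixed buffer.

```cpp
// Shrink the tile size on overflow without changing the buffer size; in the
// limit a tile is one input rectangle, and if even that overflows, tile it
// into points, which cannot expand further. Helpers are hypothetical.
#include <cstddef>

bool run_tiled_rects(size_t tile_rects, size_t buffer_bytes);  // false if the buffer overflowed
bool run_tiled_points(size_t buffer_bytes);                    // points never "shatter"

void run_with_fixed_buffer(size_t initial_tile_rects, size_t buffer_bytes)
{
  for(size_t tile = initial_tile_rects; tile >= 1; tile /= 2) {
    if(run_tiled_rects(tile, buffer_bytes))
      return; // fit at this tile size
    // otherwise retry with a smaller tile and the same buffer
  }
  // Even a single input rectangle overflowed: fall back to point tiling.
  run_tiled_points(buffer_bytes);
}
```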
I don't think that it's appropriate to try and use DSMEM here, as only 2 "adjacent" CTAs can access each other's shared memory. Taking a step back, I want to understand the root of your concerns here. A hybrid algorithm like the one you describe may be possible (@rohanchanani will be the arbiter of that), but what's the main reason that you are pushing for it? Is it mostly for avoiding the complexity of adding the multi-stage deppart API in Realm, or because you think that we should just be doing better with the algorithmic components themselves? I am a bit worried about the complexity and CPU-side overheads (launching lots of kernels and copies) of doing a hybrid algorithm with a small tile size -- if it's just performance, we can use green contexts or something to limit the number of SMs and allow other application work to proceed. If it is to avoid the multi-stage API in Realm, perhaps @rohanchanani can try to develop an algorithm for one of the deppart operations that does work in a tiled manner and see what the relative performance and code complexity tradeoffs are. In particular, if we need tiles of 10%+ of the capacity of the GPU to actually get reasonable performance out of this pathway, then that already feels like too much to carve out into a fixed pool, and we'd want to not force the application to make this choice a priori. It also seems like if there is even a single deppart operation that we wouldn't be able to apply this to (like maybe the dynamic allocations for the BVH in preimage can't be made to fit in a fixed buffer) and we have to add a staged API, then we might as well add it for all deppart operations. |
The only way for Legion to implement an operation with an unbounded memory requirement is to effectively map an unbounded pool in that memory, which blocks all downstream operations from allocating/deallocating in that memory. I don't want that to be the default for GPU-accelerated deppart operations in Legion. Any amount of complexity is worth it to avoid that, as the effects ripple far up the software stack (beyond Legion). I couldn't care less about the implementation complexity in Legion, but as soon as the consequences become visible to Legion users then I care A LOT and am willing to go to any lengths to avoid it if possible.
It's neither. It's the fact that the memory requirements ripple up the software stack and are visible to mappers and impact unrelated operations which is something that is not currently the case for deppart operations.
You should be able to do this with way less than 10% of memory. I bet having a buffer on the order of 10-100MB is enough.
Also strong disagree here. If we have to let consequences ripple up the software stack, then we should tightly bound the kinds of operations that it applies to. |
If that's the case, I believe @rohanchanani told me that point-based operations (like image) are definitely easier to process in a tiled manner than rect-based operations (like image range). We could start there at least. Do the preimages for accelerated gather/scatter copies need preimage range or just preimage? |
I can definitely get started on a tiled approach for the point-based pipelines - also, preimage range dumps into the points pipeline, so of the non-set-ops only image range requires rects. But I also think that the back-end of tiling will end up looking pretty close to identical for both points and rects, because we'll want to store the active set as rectangles. |
Just point-wise preimage for now. Nobody I know is using range-based gather/scatter copies currently. The place I would like to get to would be one where the client could query Realm for a recommended buffer size based on a target memory to use for the buffer, and Realm could recommend a minimum size needed to achieve "optimal" performance. Clients could then provide whatever buffer they want (probably with a lower bound on how tiny it could be to even allow forward progress), and the implementation would tile appropriately to fit in the buffer. This would give clients a sliding scale that they can use to trade off memory pressure against performance of dependent partitioning operations. |
Implements the by-field operator for GPUs. To build, configure CMake with -DREALM_USE_CUDART_HIJACK=OFF (necessary) and -DCMAKE_CUDA_ARCHITECTURES={your arch} (to speed up the build). To test, run ./tests/deppart basic -ll:gpu 1
@rohany