Conversation

rohanchanani

Implements the by-field operator for GPUs. To build, configure CMake with -DREALM_USE_CUDART_HIJACK=OFF (required) and -DCMAKE_CUDA_ARCHITECTURES={your arch} (to speed up the build). To test, run ./tests/deppart basic -ll:gpu 1.
@rohany

elliottslaughter and others added 30 commits July 2, 2025 13:39
realm: fix and add capi tests

See merge request StanfordLegion/legion!1812
[P0] test: Match the allocation function used in test

See merge request StanfordLegion/legion!1817
update realm release notes for 25.06

See merge request StanfordLegion/legion!1818
Co-authored-by: Elliott Slaughter <[email protected]>
realm: update gitlab-ci to build realm externally and fix bootstrap_mpi in new build system

See merge request StanfordLegion/legion!1820
Refactor DynamicTable such that we can pass a RuntimeImpl to the elements of DynamicTableNode

See merge request StanfordLegion/legion!1814
test: Re-enable attach tests on macOS

See merge request StanfordLegion/legion!1826
legion: Modify legion cmake to pick up the new realm build system

See merge request StanfordLegion/legion!1644
regent: Build Legion with CMake by default

See merge request StanfordLegion/legion!1825
ci: Run MSVC jobs on AVX only

See merge request StanfordLegion/legion!1828
@rohanchanani
Author

@rohany @lightsighter @artempriakhin Here's a writeup for the two proposals on how to handle dynamic allocations:

https://docs.google.com/presentation/d/1JQrFJlsi_BNijg2eTmTd3DxqYCBQYjiZsqNl0FP8k10/edit?usp=sharing

Proposal 1: If an allocation fails or is deferred, recycle all the sparsity map outputs and set their ids to 0. The user is then responsible for checking the sparsity IDs of their output index spaces once the partitioning op's event resolves and for handling any failure (see the sketch below).
Proposal 2: Break each operation into multiple calls, with allocations between the calls based on calculated sizes. The total number of required functions for a single partitioning op ranges from 3 to 7, although many of the functions can be reused across multiple ops.
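
For Proposal 1, a minimal sketch of what the client-side failure check could look like, assuming the proposal's convention that a zeroed sparsity id marks a recycled (failed) output; the helper name is hypothetical, while the IndexSpace/SparsityMap fields follow Realm's existing types:

#include <vector>
#include "realm.h"

// Hypothetical helper: returns false if any output subspace was recycled because
// its allocation failed or was deferred (Proposal 1 signals this with sparsity id 0).
template <int N, typename T>
bool by_field_outputs_valid(Realm::Event done,
                            const std::vector<Realm::IndexSpace<N, T> >& subspaces)
{
  done.wait();  // or chain follow-up work off 'done' instead of blocking
  for(size_t i = 0; i < subspaces.size(); i++)
    if(subspaces[i].sparsity.id == 0)  // recycled sparsity map => allocation failed
      return false;                    // client frees memory and retries, or gives up
  return true;
}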

@rohany
Contributor

rohany commented Sep 29, 2025

Mike and Artem, please take a look at this document and let Rohan know what route he should be going down.

@artempriakhin
Contributor

artempriakhin commented Sep 29, 2025 via email

@lightsighter
Contributor

I have some questions:

  • For proposal 1, how exactly is the user supposed to guarantee forward progress?
  • For proposal 1, how can the client control which memories Realm will attempt to allocate in?
  • For proposal 2, are you proposing that all dependent partitioning operations should work this way now, including the ones on the CPU? If not, how should the user discern which version of the API to use? Presume that data could be mapped into any Realm memory kind and not just CPU/GPU memory kinds.
  • For proposal 2, what would the synchronization model be? In general I don't think Realm's APIs should ever be blocking. None of them so far have been and I think we should maintain that precedent.

I will also offer a few other proposals for completeness:

Proposal 3: Same as proposal 1, but must return in a profiling response exactly how much memory needs to be reserved and provide a way for the caller to pass in a buffer of the appropriate size so that it doesn't fail again.
Proposal 4: Same as proposal 1, but allow for a way for clients to specify to Realm that it is permitted to wait for allocations to succeed (e.g. do deferred allocations and chain those dependences when launching tasks/kernels).
Proposal 5: (from @streichler) reserve a small amount of GPU memory for all such dependent partitioning operations, optimistically try to allocate buffers (e.g. in the style of Proposal 1) and then fall back to a more serialized execution using the pre-allocated small buffer as necessary. Adding my own idea to this one: the buffer could even be in zero-copy or UVM memory in order to avoid unnecessarily consuming device-side memory.

A somewhat unrelated question: do we want to scope how much of the GPU is available for use with dependent partitioning? We might not want deppart operations consuming the entire GPU if there are high priority tasks also trying to run on the GPU. We could create a green context for deppart operations to scope them to a subset of SMs. We could also run the kernel launches through the normal GPU task path and allow for priorities on deppart operations (like we already do with copies/fills).

@rohanchanani
Author

  • For proposal 1, how exactly is the user supposed to guarantee forward progress?

As it's written, proposal 1 would require the client either to have enough knowledge of their application's memory usage to guarantee sufficient space for the deppart allocations, or to take an experimental approach (i.e., iteratively free other memory until the deppart succeeds or nothing more can be freed, at which point they'd probably have to crash the application); incorporating proposal 3 would make this process less of a black box.

  • For proposal 1, how can the client control which memories Realm will attempt to allocate in?

I think the simplest way would be an API/configuration argument. Without client specification, it would probably default to the memory holding the field-data instance for by-field, image, and preimage, and to the output of the cost-model calculation for the set operations.

  • For proposal 2, are you proposing that all dependent partitioning operations should work this way now, including the ones on the CPU? If not, how should the user discern which version of the API to use? Presume that data could be mapped into any Realm memory kind and not just CPU/GPU memory kinds.

Either way would work; it seems like the first way would be more Legion-esque in that the user can write their application without thinking about where it'll run, and then some sort of mapping/cost-model calculation makes that decision at runtime. For this to work, a lot of the auxiliary calls would just be no-ops in the CPU path.
The second way would treat GPU deppart as essentially a separate library from regular deppart at the Realm level, so a Realm user would have to know and explicitly assert that they want the partitioning to happen on the GPU rather than having the runtime make the decision for them.

  • For proposal 2, what would the synchronization model be? In general I don't think Realm's APIs should ever be blocking. None of them so far have been and I think we should maintain that precedent.

I would probably try to mirror how the current deppart calls take a dependent event as an argument, so they're non-blocking but can't start doing actual work until their provided dependency resolves.

Proposal 3: Same as proposal 1, but must return in a profiling response exactly how much memory needs to be reserved and provide a way for the caller to pass in a buffer of the appropriate size so that it doesn't fail again.

Any version of proposal 1 should definitely provide information to the client on failure, but in some cases I don't think it's possible to tell them the size they need to get to the end of the operation. As proposal 2 shows, usually we can just tell them enough to get to the next "stage."

Proposal 4: Same as proposal 1, but allow for a way for clients to specify to Realm that it is permitted to wait for allocations to succeed (e.g. do deferred allocations and chain those dependences when launching tasks/kernels).

In this case would they be assuming the responsibility of preventing deadlock?

Proposal 5: (from @streichler) reserve a small amount of GPU memory for all such dependent partitioning operations, optimistically try to allocate buffers (e.g. in the style of Proposal 1) and then fall back to a more serialized execution using the pre-allocated small buffer as necessary. Adding my own idea to this one: the buffer could even be in zero-copy or UVM memory in order to avoid unnecessarily consuming device-side memory.

I think this makes a lot of sense. One consideration is that the lifetime of GPU deppart operations will generally be a small fraction of the application's runtime, but they need a meaningful chunk of memory for their small burst - maybe this could include a way for the client to reclaim the memory when they know they aren't doing any partitioning?

A somewhat unrelated question: do we want to scope how much of the GPU is available for use with dependent partitioning? We might not want deppart operations consuming the entire GPU if there are high priority tasks also trying to run on the GPU. We could create a green context for deppart operations to scope them to a subset of SMs. We could also run the kernel launches through the normal GPU task path and allow for priorities on deppart operations (like we already do with copies/fills).

Either way works for me - I think copies/fills are the only other device code in Realm right now so I'd be happy to mirror that.

Legion is by far the biggest consumer of Realm, so I'll gladly defer to whatever combination of these proposals you think would be the most natural/intuitive for incorporating GPU-deppart into Legion while maintaining Realm's invariants.

@lightsighter
Contributor

Rather than respond to each of the comments above for each different proposal, I'm going to focus on this:

Proposal 5: (from @streichler) reserve a small amount of GPU memory for all such dependent partitioning operations, optimistically try to allocate buffers (e.g. in the style of Proposal 1) and then fall back to a more serialized execution using the pre-allocated small buffer as necessary. Adding my own idea to this one: the buffer could even be in zero-copy or UVM memory in order to avoid unnecessarily consuming device-side memory.

I think this makes a lot of sense. One consideration is that the lifetime of GPU deppart operations will generally be a small fraction of the application's runtime, but they need a meaningful chunk of memory for their small burst - maybe this could include a way for the client to reclaim the memory when they know they aren't doing any partitioning?

If it is possible for the code to potentially work out of a smaller buffer as well as running faster with a larger buffer, then I think the interface I would like would be a two-phase model. In the first phase we do some kind of API call that estimates bounds on memory usage and returns two sizes: a minimal size needed to run the computation and an "optimal" size that ensures that deppart can get "good" performance (I feel like this value should be more a function of the GPU being targeted and not a property of the data being used for the deppart operation, but tell me if I am wrong about that assumption). If Realm is going to run on CPUs it will just return zero for both values (or maybe a small value, not sure if we use buffers on the CPU implementation). The client can then do an allocation at either value (or somewhere in-between if they're willing to search) and pass that in to the implementation.
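
To pin down the shape of that two-phase interface, here is a rough sketch; every name in it (DeppartMemoryEstimate, estimate_by_field_memory, create_subspaces_by_field_gpu, ByFieldArgs) is hypothetical and only illustrates the two phases being proposed:

// Phase 1: a non-blocking size query. On the CPU path this could simply return
// zero (or small) values for both fields.
struct DeppartMemoryEstimate {
  size_t minimum_bytes;  // smallest buffer that still allows forward progress
  size_t optimal_bytes;  // enough to saturate the target GPU's memory bandwidth
};
DeppartMemoryEstimate estimate_by_field_memory(const ByFieldArgs& args,
                                               Realm::Memory target_mem);

// Phase 2: the client allocates anywhere in [minimum_bytes, optimal_bytes] and
// hands the buffer to the operation, which tiles its work to fit inside it.
Realm::Event create_subspaces_by_field_gpu(const ByFieldArgs& args,
                                           Realm::RegionInstance scratch_buffer,
                                           Realm::Event wait_on);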

Tell me if that is possible and if not then we'll go back to exploring the other proposals.

Legion is by far the biggest consumer of Realm, so I'll gladly defer to whatever combination of these proposals you think would be the most natural/intuitive for incorporating GPU-deppart into Legion while maintaining Realm's invariants.

Even though I care a lot about the Legion implementation, we've always designed Realm in a way so that it is Legion-agnostic. Realm should be usable by lots of different clients (I'm actually working on a different project built on Realm right now, it's just not public) so we try very hard to maintain that independence when designing APIs for Realm.

@rohanchanani
Author

If it is possible for the code to potentially work out of a smaller buffer as well as running faster with a larger buffer, then I think the interface I would like would be a two-phase model. In the first phase we do some kind of API call that estimates bounds on memory usage and returns two sizes: a minimal size needed to run the computation and an "optimal" size that ensures that deppart can get "good" performance (I feel like this value should be more a function of the GPU being targeted and not a property of the data being used for the deppart operation, but tell me if I am wrong about that assumption). If Realm is going to run on CPUs it will just return zero for both values (or maybe a small value, not sure if we use buffers on the CPU implementation). The client can then do an allocation at either value (or somewhere in-between if they're willing to search) and pass that in to the implementation.

"https://docs.google.com/presentation/d/1JQrFJlsi_BNijg2eTmTd3DxqYCBQYjiZsqNl0FP8k10/edit?usp=sharing"
Each of the "pieces" on slide 4 here with nonzero values (i.e. all but complete points/1D rects) has an output whose size depends on the input data and has an upper bound that is much larger than what you'd typically expect the real-world results to look like. For example, construct input rectlist intersects two index spaces on the device and stores the result on the device for further computation. The upper bound is the product of the number of rectangles in each, but realistically the result will probably end up being closer to the maximum of the two counts than to their product. The only way I can think of doing these in less than the "optimal" size is by chopping up the operations somehow and then sending each chunk through the whole pipeline one by one. This yields two challenges:

  1. Say you want to do the computation with memory size K such that K < "optimal" size. Then for each chunk C, you need the maximum memory used for computing C at any stage in the pipeline to be less than K. But to know the memory used at stage N for chunk C, you need to have already computed each stage n < N. Most of the deppart ops have pipeline lengths > 2, so choosing chunk sizes from a memory budget would be very awkward; I can't think of how to do it without some sort of dynamic trial and error.
  2. The final complete points/rects stages have a necessary global component because a key invariant is that the output rectangles are disjoint, so every rectangle/every point needs to be aware of every other rectangle/point to detect overlaps. This means that in a chunking approach, we would either have to dump each chunk’s output into the host sort/merge path (killing performance), or mirror the host approach on the device by maintaining a running output list, and iteratively merging the output of each chunk into this output list while keeping it disjoint.

Any accommodation I can think of to these challenges ends up serializing things to the point of defeating the purpose of gpu partitioning, although I'm definitely open to a better way of doing things in less than the "optimal" size if one exists.

Here's an approach Rohan and I talked through that adapts Proposal 2 above and addresses the earlier questions:
The new API for each operation would have all the arguments of the original partitioning operation, plus a deppart "state" object that allows the client and operation to send information back and forth between chained calls.

Here’s the state:

struct state {
	int stage_num = 0;        // which stage of the op's pipeline to run next
	RegionInstance instance1; // output of the previous stage (input to this call)
	RegionInstance instance2; // freshly allocated buffer this call writes into
	size_t size;              // bytes to allocate for the next stage (set by the call)
	Memory my_mem;            // memory the next allocation should come from (set by the call)
	Event dep;                // event the next allocation/destruction must wait on
};

Here’s what an image (but really any op) would look like:

//Each deppart op has a "pipeline depth" as shown in slide 4 in the Allocations presentation, which corresponds to the number of count + emit stages that have to be chained together.
//This might just be 1 for CPU image, 3 for GPU image, and 3 + 2 * dim for GPU image_range
int depth = get_image_count(image_args);

//Each call in the chain will depend on its preceding event
std::vector<Event> events(depth);

//Rather than having multiple functions for each op, each op has 1 function that does different things depending on the stage_num of the state it’s given and mutates the state 
for (int i = 0; i < depth; i++) {
	events[i] = image(image_args, state);
	state.instance1.destroy(state.dep);
	state.instance1 = state.instance2;

	//The deppart call sets my_mem
	state.instance2 = alloc(state.size, state.my_mem, state.dep);
	state.dep = events[i];
}

//Nothing above blocks

//When you need the partitioning result
events[depth-1].wait();

Some notes:
  • All the objects in the deppart state would have to be in a fixed memory location for the asynchrony to be well-defined.
  • This approach generalizes to all the deppart ops, with instances in any memory location (GPU, CPU, etc.).
  • The initial call with the original deppart args would determine how long the pipeline is, which depends on whether it’s CPU v GPU v other path and which op it is. Then, you’d do a for loop using that length to kick off the chained calls, and use the list of events to keep all the calls non-blocking (including allocations/destructions).
  • Everything written/read in the state is assumed to be an address, so that once the dependency for each call has resolved, the value in each address is right for that call.

@lightsighter
Contributor

lightsighter commented Oct 4, 2025

Say you want to do the computation with memory size K such that K < "optimal" size.

Can you say what your definition of the "optimal size" is? To be fair, let me give you my definition of "optimal size". In general, all of these dependent partitioning operations are going to be memory bandwidth limited. You can saturate the memory bandwidth of a GPU with many fewer SMs than most GPUs have (only really tiny GPUs are balanced and you'll mostly find them in laptops). Therefore, my definition of "optimal size" is however much memory I would need to allocate in order to have a big enough working set for the number of SMs required to saturate memory bandwidth. After that, everything else should just be executed in serial because even if I "parallelize" it with more blocks/threads, it's also going to be implicitly serialized anyway by the memory controller for the framebuffer memory.

Any accommodation I can think of to these challenges ends up serializing things to the point of defeating the purpose of gpu partitioning

If you serialize things only after having saturated memory bandwidth, then you're not going to lose any performance.

In fact, I'll add that you might even be able to get a speed-up by hitting more frequently in the L2 cache ;).

@lightsighter
Contributor

To further elaborate, you can do all of this with a single kernel launch using a small number of threadblocks and a cooperative kernel, and then structure the GPU version of the algorithm a bit like the CPU algorithm, with the inner loop(s) parallelized across some number of threads/threadblocks and the necessary synchronization inserted using cooperative groups. You can then scale it up or down as necessary depending on the available memory bandwidth of the particular GPU you're running on. Bonus points if you can make it use temporal locality to leverage the L2 cache.

@rohanchanani
Author

What I mean by “optimal” size is enough to keep all the data the partitioning operation needs on the device for the entire operation.

Say you want to do the computation with memory size K such that K < "optimal" size.

What I mean here is that in order to use anything less than the “optimal” size, you have to tile the operation in some way.

If you serialize things only after having saturated memory bandwidth, then you're not going to lose any performance.

In the approach you’re describing, the tiling would be such that the total amount of work done stays the same, but the active working set at any given time is smaller. Thus, there’d only be performance degradation if you’re not saturating memory bandwidth, but no matter what, the total amount of computation you’re doing would be the same (just N pieces of size 1/N rather than 1 whole piece).

I don’t think it’s possible to tile the operations in this way because of the last step, where all the points/rectangles are coalesced into a sparsity map, because the points/rectangles aren’t independent (they need to know if they duplicate/overlap any others). If you split the operation into N pieces, you’d have to do the following: coalesce the first 1/N points, keep that active somewhere, coalesce the next 1/N points, merge that into the working set (more expensive than the coalescing), coalesce the next 1/N points, merge that, etc. So rather than doing the same amount of work with a smaller working set, this adds a meaningful amount of computation that wouldn’t have originally been done with 1 big piece (all the merge operations). Each merge operation would have to do the slide 19 algorithm on the combination of the two spaces: https://docs.google.com/presentation/d/1Iwo0IwXBk14-E8i5i5kLN2B0KP7Cl-RwC0yDIN0_Fh8/edit?usp=sharing

Any accommodation I can think of to these challenges ends up serializing things to the point of defeating the purpose of gpu partitioning, although I'm definitely open to a better way of doing things in less than the "optimal" size if one exists.

Here I don’t mean serialized in terms of less block/thread parallelism, but rather in terms of 1 bulk operation serialized into N tile operations, each of which requires a new, expensive integration step in between. And the nature of the computation goes from something that’s well-suited for the GPU (dumping a bulk operation onto the device and treating its elements as independent until the last possible moment) to something that’s generally better suited for the CPU (maintaining an active, disjoint state and iteratively merging pieces into it, which is also exactly what the CPU deppart path currently does).

But to know the memory used at stage N for chunk C, you need to have already computed each stage n < N.

I also still think this is a key challenge for any tiling approach with a fixed-size buffer. Given a fixed size you have to stay within, I don’t think it’s possible to efficiently choose an input chunk size that's guaranteed to stay within that size for the entire operation, i.e. to efficiently invert the pipeline. So when you choose a tile, there's a chance that the tile will get to, say, the third stage and grow larger than your buffer, at which point the operation would have to fail.

If it's helpful, we could also hop on a call to discuss this.

@lightsighter
Contributor

What I mean by “optimal” size is enough to keep all the data the partitioning operation needs on the device for the entire operation.

Do you believe that is strictly necessary to achieve speed-of-light performance, or is it purely a convenience because it makes the code easier to write?

What I mean here is that in order to use anything less than the “optimal” size, you have to tile the operation in some way.

Using your definition of "optimal" size then yes, tiling is required.

I don’t think it’s possible to tile the operations in this way because of the last step, where all the points/rectangles are coalesced into a sparsity map, because the points/rectangles aren’t independent (they need to know if they duplicate/overlap any others).

Well, I know that it can be done because the CPU code does that (with tile size=1). ;) It just so happens that the CPU code does it in serial and therefore doesn't need to perform any additional conflict checks or synchronization. If you tile it the way I'm describing, then some additional synchronization will be required because effectively you're going to take a subset of rectangles and try to reduce them in parallel, but you'll also need to check for conflicts with other rectangles in the subset being reduced. This requires additional synchronization and if we were doing this across the entire GPU, then I would agree that would be a bad idea. However, presuming we're only operating on a subset of the GPU needed to saturate memory bandwidth then the overhead of those extra checks and synchronization should not be that bad.

So rather than doing the same amount of work with a smaller working set, this adds a meaningful amount of computation that wouldn’t have originally been done with 1 big piece (all the merge operations).

Computation is free as long as we're saturating the memory bandwidth. You can probably afford 5-10X more computation than memory accesses without any reduction in performance (some of which is likely consumed by thread divergence or shared memory bank conflicts, but that will be measured in percentage reductions). Presuming you're keeping the rectangle data in shared memory, you can even have all the threads looking for conflicts by going through the distributed shared memory network across blocks so you won't incur any extra global memory traffic.

Here I don’t mean serialized in terms of less block/thread parallelism, but rather in terms of 1 bulk operation serialized into N tile operations, each of which requires a new, expensive integration step in between.

That's what I mean too. Each tiled operation is still parallelized over threads/blocks, just many fewer than blasting them all out unnecessarily and having the memory controller implicitly serialize them. If you keep the integration step on-chip you'll be winning even if it appears like it does more compute than would otherwise be necessary because that will be faster than being memory bandwidth limited.

And the nature of the computation goes from something that’s well-suited for the GPU (dumping a bulk operation onto the device and treating its elements as independent until the last possible moment) to something that’s generally better suited for the CPU (maintaining an active, disjoint state and iteratively merging pieces into it, which is also exactly what the CPU deppart path currently does).

Here's where I think we fundamentally disagree. I'm arguing for a hybrid approach that leverages a certain degree of parallelism on the GPU without oversubscribing resources, keeps data resident on-chip for longer, and is likely to be faster and more resource-efficient than constantly streaming data to and from global memory. The end result should be something that looks like a cross between the CPU-only algorithm and the streaming GPU algorithm that we have today.

I also still think this is a key challenge for any tiling approach with a fixed-size buffer. Given a fixed size you have to stay within, I don’t think it’s possible to efficiently choose an input chunk size that's guaranteed to stay within that size for the entire operation, i.e. to efficiently invert the pipeline. So when you choose a tile, there's a chance that the tile will get to, say, the third stage and grow larger than your buffer, at which point the operation would have to fail.

When this happens, why isn't the answer just to reduce the tile size and try again without changing the memory requirements? In the limit, you'll get to a tile size of 1 input rectangle at a time which is effectively the CPU-only algorithm and will be very slow, but is highly likely to fit in the buffer. It's still possible that one input rectangle explodes into a gigantic number of output rectangles at which point you can start tiling that one input rectangle into points. Then you know things are going to fit into memory because points are atoms and won't "shatter" into more points.
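
To sketch that shrink-and-retry fallback concretely (all of these helper names are hypothetical placeholders, not existing Realm calls), the buffer size stays fixed and only the tile granularity changes:

// Hypothetical fallback loop: the scratch buffer is fixed; a stage overflow only
// shrinks the tile, never grows the memory requirement.
size_t tile = initial_tile_size;  // number of input rectangles processed per tile
while(work_remaining()) {
  if(run_pipeline_on_next_tile(tile, scratch_buffer)) {
    continue;  // this tile fit in the buffer at every stage; keep going
  }
  if(tile > 1) {
    tile /= 2;  // some stage overflowed the buffer: retry with a smaller tile
  } else {
    // Even a single input rectangle overflowed: shatter it into points, which
    // cannot expand further, so processing them in batches is guaranteed to fit.
    run_current_rect_as_points(scratch_buffer);
  }
}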

@rohany
Contributor

rohany commented Oct 7, 2025

Presuming you're keeping the rectangle data in shared memory, you can even have all the threads looking for conflicts by going through the distributed shared memory network across blocks so you won't incur any extra global memory traffic.

I don't think that it's appropriate to try and use DSMEM here, as only 2 "adjacent" CTAs can access each other's shared memory.

Taking a step back, I want to understand the root of your concerns here. A hybrid algorithm like the one you describe may be possible (@rohanchanani will be the arbiter of that), but what's the main reason that you are pushing for it? Is it mostly for avoiding the complexity of adding the multi-stage deppart API in Realm, or because you think that we should just be doing better with the algorithmic components themselves? I am a bit worried about the complexity and CPU-side overheads (launching lots of kernels and copies) of doing a hybrid algorithm with a small tile size -- if it's just performance we can use green contexts or something to limit the number of SM's and allow other application work to proceed. If it is to avoid the multi-stage API in Realm, perhaps @rohanchanani can try to develop an algorithm for one of the deppart operations that does work in a tiled manner and see what the relative performance and code complexity tradeoffs are. In particular, if we need tiles in the 10%+ capacity of the GPU to actually get reasonable performance out of this pathway then that already feels like too much to carve out into a fixed pool and we'd want to not have the application be forced to make this choice a-priori.

It also seems like if there is even a single deppart operation that we wouldn't be able to apply this to (like maybe the dynamic allocations for the BVH in preimage can't be made to fit in a fixed buffer) and we have to add a staged API then we might as well add it for all deppart operations.

@lightsighter
Contributor

what's the main reason that you are pushing for it?

The only way for Legion to implement an operation with an unbounded memory requirement is to effectively map an unbounded pool in that memory, which blocks all downstream operations from allocating/deallocating in that memory. I don't want that to be the default for GPU-accelerated deppart operations in Legion. Any amount of complexity is worth it to avoid that, as the effects ripple far up the software stack (beyond Legion). I couldn't care less about the implementation complexity in Legion, but as soon as the consequences become visible to Legion users then I care A LOT and am willing to go to any ends to avoid it if possible.

if it's just performance we can use green contexts or something to limit the number of SM's and allow other application work to proceed. If it is to avoid the multi-stage API in Realm, perhaps @rohanchanani can try to develop an algorithm for one of the deppart operations that does work in a tiled manner and see what the relative performance and code complexity tradeoffs are

It's neither. It's the fact that the memory requirements ripple up the software stack, are visible to mappers, and impact unrelated operations, which is something that is not currently the case for deppart operations.

In particular, if we need tiles in the 10%+ capacity of the GPU to actually get reasonable performance out of this pathway then that already feels like too much to carve out into a fixed pool and we'd want to not have the application be forced to make this choice a-priori.

You should be able to do this with way less than 10% of memory. I bet having a buffer on the order of 10-100MB is enough.

It also seems like if there is even a single deppart operation that we wouldn't be able to apply this to (like maybe the dynamic allocations for the BVH in preimage can't be made to fit in a fixed buffer) and we have to add a staged API then we might as well add it for all deppart operations.

Also strong disagree here. If we have to let consequences ripple up the software stack, then we should tightly bound the kinds of operations that it applies to.

@rohany
Contributor

rohany commented Oct 7, 2025

Also strong disagree here. If we have to let consequences ripple up the software stack, then we should tightly bound the kinds of operations that it applies to.

If that's the case, I believe that @rohanchanani told me that point-based operations (like image) are definitely easier to process in a tiled manner than rect-based operations (like image range). We could start there at least. Do the preimages for accelerated gather scatter copies need preimage range or just preimage?

@rohanchanani
Author

I can definitely get started on a tiled approach for the point-based pipelines. Also, preimage range dumps into the points pipeline, so of the non-set-ops only image range requires rects. But I also think that the back end of tiling will end up looking pretty close to identical for both points and rects, because we'll want to store the active set as rectangles.

@lightsighter
Contributor

Do the preimages for accelerated gather scatter copies need preimage range or just preimage?

Just point-wise preimage for now. Nobody I know is using range-based gather/scatter copies currently.

The place I would like to get to would be one where the client could query Realm for a recommended buffer size based on a target memory to use for the buffer, and Realm could recommend a minimum size to achieve "optimal" performance. Clients could then provide whatever buffer they want (probably with a lower bound on how tiny it could be to even allow forward progress) and then the implementation would tile appropriately to fit in the buffer. This would give clients a sliding scale that they can use to trade off memory pressure against performance of dependent partitioning operations.
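
Tying this back to the hypothetical two-phase sketch earlier in the thread, client-side usage of that sliding scale might look something like this (allocate_scratch, bytes_i_can_spare, and gpu_fb_mem are placeholders, not real Realm calls):

#include <algorithm>

// Hypothetical client-side flow: pick a budget between the minimum and the
// recommended size, then let the implementation tile to fit inside it.
DeppartMemoryEstimate est = estimate_by_field_memory(args, gpu_fb_mem);
size_t budget = std::clamp(bytes_i_can_spare, est.minimum_bytes, est.optimal_bytes);
Realm::RegionInstance scratch = allocate_scratch(gpu_fb_mem, budget);  // placeholder
Realm::Event done = create_subspaces_by_field_gpu(args, scratch, Realm::Event::NO_EVENT);
// A smaller budget trades deppart performance for lower memory pressure,
// never correctness.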
