The generalization of the caching allocator in #216 makes it easier to make various improvements to it. #211 (comment) shows a measurement indicating that the mutex in the caching allocator is a bottleneck (my studies ~2 years ago pointed more to the mutex in CUDA, but things seem to have evolved). This PR is to discuss improvement ideas, with an (ordered) plan shown below:
- Generalize the caching allocator, done in [cudadev] Generalize caching allocator #216
- Improve the interaction of `ScopedContext` and the caching allocator by having the `ScopedContext` pass a `SharedEventPtr` to the caching allocator (evolution of the ideas in [RFC] Reduce calls to cudaEventRecord() via the caching allocators cmssw#412 and [RFC] Add make_device_unique() functions to ScopedContextBase cmssw#487); see the first sketch after this list
  - This will reduce the number of CUDA events in flight, remove the need to create and destroy them, and simplify the logic, especially on the host allocator side
- Replace the `multiset` with nested vectors (device, bin) for (much) faster lookup (from [cudadev] Generalize caching allocator #216 (comment)); see the second sketch after this list
- Make locking finer grained (e.g. per bin), also illustrated in the second sketch below
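
For the `SharedEventPtr` idea, a minimal sketch of what the interaction could look like is below. The `CachingAllocator::allocate()` signature, the `ScopedContext` interface, and `makeSharedEvent()` are assumptions made for illustration, not the actual API; the point is only that the context owns one shared event, passes it to every allocation, and records it once instead of the allocator recording one event per allocation.

```cpp
// Sketch only: SharedEventPtr, CachingAllocator::allocate() and the ScopedContext
// interface below are illustrative assumptions, not the real API.
#include <cuda_runtime.h>

#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// A shared, reference-counted CUDA event (cudaEvent_t is a pointer to CUevent_st).
using SharedEventPtr = std::shared_ptr<CUevent_st>;

inline SharedEventPtr makeSharedEvent() {
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
  return SharedEventPtr(ev, [](cudaEvent_t e) { cudaEventDestroy(e); });
}

class CachingAllocator {
public:
  // The caller provides the event that marks when the block becomes reusable,
  // so the allocator no longer needs to create/record an event per allocation.
  void* allocate(std::size_t bytes, cudaStream_t stream, SharedEventPtr event) {
    void* ptr = nullptr;
    cudaMalloc(&ptr, bytes);  // stand-in for the real cached-block lookup
    pending_.push_back({ptr, std::move(event)});
    return ptr;
  }

private:
  std::vector<std::pair<void*, SharedEventPtr>> pending_;
};

class ScopedContext {
public:
  explicit ScopedContext(cudaStream_t stream)
      : stream_(stream), event_(makeSharedEvent()) {}

  // Every allocation made through this context shares the same event.
  void* makeDeviceBuffer(CachingAllocator& allocator, std::size_t bytes) {
    return allocator.allocate(bytes, stream_, event_);
  }

  // Record the shared event once, after the context's work has been queued.
  ~ScopedContext() { cudaEventRecord(event_.get(), stream_); }

private:
  cudaStream_t stream_;
  SharedEventPtr event_;
};
```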
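
For the nested-vector layout and the per-bin locking, a rough sketch is below. It assumes power-of-two bins and a flat vector indexed by (device, bin); the names (`BinnedCache`, `CachedBlock`, `tryReuse`) and the bin geometry are hypothetical, but it shows how the ordered `multiset` search becomes a direct index computation and how each bin can carry its own mutex instead of sharing one global lock.

```cpp
// Sketch only: class/function names and the bin geometry are assumptions.
#include <cstddef>
#include <mutex>
#include <vector>

struct CachedBlock {
  void* ptr = nullptr;
  std::size_t bytes = 0;
};

class BinnedCache {
public:
  BinnedCache(int numDevices, int minBin, int maxBin)
      : minBin_(minBin),
        binsPerDevice_(maxBin - minBin + 1),
        bins_(static_cast<std::size_t>(numDevices) * binsPerDevice_) {}

  // (device, bin) indexes directly into a flat vector, replacing the ordered
  // multiset search with an O(1) lookup.
  bool tryReuse(int device, std::size_t bytes, CachedBlock& out) {
    Bin& bin = binFor(device, bytes);
    std::lock_guard<std::mutex> lock(bin.mutex);  // per-bin lock, not a global one
    if (bin.freeBlocks.empty()) {
      return false;
    }
    out = bin.freeBlocks.back();
    bin.freeBlocks.pop_back();
    return true;
  }

  void release(int device, CachedBlock block) {
    Bin& bin = binFor(device, block.bytes);
    std::lock_guard<std::mutex> lock(bin.mutex);
    bin.freeBlocks.push_back(block);
  }

private:
  struct Bin {
    std::mutex mutex;  // finer-grained locking: one mutex per (device, bin)
    std::vector<CachedBlock> freeBlocks;
  };

  // Round the request up to the enclosing power-of-two bin.
  Bin& binFor(int device, std::size_t bytes) {
    int bin = minBin_;
    while (bin < minBin_ + binsPerDevice_ - 1 && (std::size_t{1} << bin) < bytes) {
      ++bin;
    }
    return bins_[static_cast<std::size_t>(device) * binsPerDevice_ + (bin - minBin_)];
  }

  int minBin_;
  int binsPerDevice_;
  std::vector<Bin> bins_;  // sized once at construction, so the mutexes never move
};
```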