Terminology for Concurrency

Peter Doak edited this page Sep 16, 2020 · 5 revisions

I'd like this to be abstract from any particular tasking, threading, or parallelization scheme.

In cases where we are considering concurrent execution, there is a calling scope with:

  • Resources
  • Work

and

  • Concurrency resources: threads, streams, or tasks, i.e. something that begins a "desynchronized" scope

The biggest difference between a concurrent scope and a regular call scope is this "desynchronization".

Any scope where we have resources and work that can be divided and operated on independently can potentially be made concurrent.

But there are many costs to starting and ending desynchronized scopes, especially once understandable code, maintenance, race conditions, and other bugs are factored in. In fact, to use concurrency to its fullest we don't want to be limited to such simply nested examples, and any extended concurrent scope should be formalized enough that it's clear what is and isn't in bounds.

The limits to parallelism are determined by a large number of details, but there are usually some obvious counts of important resources or work/data. We've had some issues coming up with unambiguous ways to name and talk about these counts.

Example of Cost function evaluation in Optimizer Driver.

At the Driver Scope (1 of these Scopes per MPI rank):

  • Resources: 100 walkers, 5 CPU threads
  • Data/work: 800 samples

The driver opens 5 Crowd Scopes; they run concurrently because each is opened with a thread.

At the Crowd Scope, which consumes a thread resource and a formal thread-safe scope:

  • Resources: 20 walkers
  • Data/work: 160 samples

Since the crowd scope doesn't have threads of its own, it doesn't open any concurrent scopes.

At the flex_function Scope, which requires balanced resources and work to call:

  • Resources: 20 walkers, 1 GPU stream
  • Work: 20 samples

Other than the concurrency resources we use, there is usually one limited resource that determines the degree of concurrency. It is tempting to call that the "batch_size", but that's a bit problematic. In the example above, the batch size of the crowd could be 20 walkers or it could be 160 samples. The important one is the number of walkers, since that is potentially memory limited.

The flex_function scope is less of a problem: it clearly works in sets of 20 computations (in this case).

Generally, having a clear handle on all of these counts is important; perhaps no single "size" is the one to nail down.

Definitions:

Up and Down, as in the lldb debugger: your calling frame is up, the call you make is down. I considered trying to use enclosing, parent, etc., but I don't think it's worth it.

Scope, the arena of the considered point of execution: what can be called or accessed. Ideally it includes only what is needed to accomplish the operations in the scope.

Level, how many scopes are above you. Often we only consider levels with an important concurrency opportunity. Considered this way:

  • Level 0 is the application.
  • Level 1 is the driver.
  • Level 2 is the Crowd level. (or whatever you call your hopefully longer lived desynchronized Concurrency Scope class)
  • Level 3 is the Trial Wave Function flex_ level
  • Level 4 is the mw_ level

Concurrency Context, for a long-lived concurrent section the best way to avoid issues is to not modify any memory not exclusively set aside for your thread. To enforce this, and to allow it to be easily reasoned about, we make any concurrent section one or more pure functions where all arguments are const unless they are exclusive to the thread, or only accessible to this concurrent scope for the lifetime of the concurrent section. In an application like QMCPACK this can be a pretty large pack of stuff, so a dedicated data object is appropriate to pass this context. The first of these contexts in QMCPACK was the Crowd, which contained an exclusive set of "fat" walker elements and an exclusive set of accumulators (contained in EstimatorManagerCrowd). Since the limiting resource in these walking concurrency sections was the walker...

Resource, something used to accomplish an operation. Ideally stateless, but the important characteristics are that resources are instrumental and you can only have so many.

Work, what the resources are going to operate on, unique per some turn of a lower scope's crank. This might be a bad category: if you want to get strict, work is stored in a memory resource and likely needs a resource to hold its output as well.

Final note

This doesn't need to be at all universal. We have very specific forms of concurrency that are useful for QMCPACK, but with shared terminology we could hopefully confuse each other less.

I'm primarily interested here in the long-lived concurrent sections which run above numerous calls to block and grid concurrency on accelerators, and above short-lived parallelism or tasking on the CPU. These thoughts should also be appropriate for the short-lived parallelism on the CPU, although the amount of structure needed should be less.

We could use some better terminology for GPU "concurrency", but much of the GPU terminology around "parallelism" is really about vectorization and is pretty specific to the architecture.