% Due Date: 6/15/14
\chapter{The Legion Runtime Architecture}
\label{chapter:arch}
Having completed our discussion of the Legion
programming model in Chapter~\ref{chapter:model},
we now describe our implementation of Legion
in earnest. Legion is currently implemented entirely
as a runtime system. We decided on a runtime implementation
for two reasons. First, a runtime system is fully dynamic
and requires no static analysis of programs, thereby allowing
it to react to the dynamic nature of both modern
supercomputing applications and hardware. Second,
implementing Legion as a runtime system in an existing
language (in this case C++) makes it easier for application
developers and DSL compiler writers to port their code as
they are already familiar with the target language and simply
need to adapt to the new programming model.

Before delving into the details of our implementation,
we begin with a brief overview of the various
components of the current Legion ecosystem in
Section~\ref{sec:components}. We then introduce
a useful analogy in Section~\ref{sec:framework}
for describing how the Legion runtime system operates
that will be carried through many of the following
chapters.
\section{Runtime Organization}
\label{sec:components}
Due to the complex nature of both applications and
the underlying hardware, the Legion runtime is composed
of several different components at different levels of
abstraction. Figure~\ref{fig:softwarestack} illustrates
the different levels of the Legion software stack,
including several levels (such as the Legion language
and compiler) that are still under development.
The many different layers are designed to encapsulate
the complexity associated with the various aspects of
our Legion implementation. We now discuss each of the
different levels of the software stack in further detail.
\begin{figure}[ht]
\begin{center}
\includegraphics[scale=0.6]{figs/LegionSoftwareStack.pdf}
\end{center}
\caption{Legion Software Stack\label{fig:softwarestack}}
\end{figure}
\subsection{Legion Applications}
\label{subsec:legionapps}
At the top of Figure~\ref{fig:softwarestack} are Legion
applications. These can be either programs
written directly using Legion or Domain Specific Languages
and Libraries (DSLs) that use Legion to provide additional
portability and performance. Legion applications can be
targeted either at the Legion runtime API or at the
higher-level Legion language developed in \cite{LegionTypes13}.
The design and implementation of the Legion language,
type system, and compiler are beyond the scope of this
thesis and are not discussed further. From this point
forward we will assume all Legion applications make
use of the Legion runtime API.
Included in Figure~\ref{fig:softwarestack} is a dividing
line that delineates which components of the Legion software
stack apply to writing machine-independent specifications
and which apply to determining the best ways of
mapping applications onto a target architecture.
The features discussed in Chapter~\ref{chapter:model}
are used entirely for writing machine-independent specifications
of applications. In contrast, the features of the
Legion mapping interface, which are covered in detail in
Chapter~\ref{chapter:mapping}, are entirely contained
on the right side of this divide. The line dividing the
machine-independent specification of an application from
its mapping does not extend beyond the application/mapper
level because it is the responsibility of the Legion
runtime to mediate the interaction between
machine-independent application specifications and
the mapping decisions made by mapper objects.
\subsection{High-Level Runtime}
\label{subsec:highlevel}
The Legion high-level runtime is an implementation
of the Legion programming model described in
Chapter~\ref{chapter:model}. Our current implementation
of Legion supports a C++ interface for writing Legion
programs. While applications are written in C++, they
are required to abide by the Legion programming model.
Therefore, tasks are permitted to access only data
in regions that they request, all sub-tasks must be
launched through the runtime interface, and global
variables along with dynamic memory allocation are
not permitted (logical regions are used instead).
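
As a brief illustration, the sketch below shows the shape of a
task written against the Legion C++ runtime interface: the task
receives only the physical regions it requested, and launches a
sub-task through the runtime rather than invoking it directly.
The task IDs and region indices are hypothetical, and the exact
class names may differ across versions of the interface.
\begin{verbatim}
// Sketch of a Legion task in C++; CHILD_TASK_ID is hypothetical.
// (Assumes the Legion runtime namespace is in scope.)
void example_task(const Task *task,
                  const std::vector<PhysicalRegion> &regions,
                  Context ctx, HighLevelRuntime *runtime)
{
  // 'regions' holds only the physical instances for the region
  // requirements this task requested -- no other data is visible.
  TaskLauncher launcher(CHILD_TASK_ID, TaskArgument(NULL, 0));
  // The child may request at most a subset of this task's
  // privileges; here it reads the first region read-only.
  launcher.add_region_requirement(
      RegionRequirement(task->regions[0].region, READ_ONLY,
                        EXCLUSIVE, task->regions[0].region));
  // All sub-task launches go through the runtime interface.
  runtime->execute_task(ctx, launcher);
}
\end{verbatim}
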
It is the responsibility of the Legion high-level
runtime to take applications written in the Legion
programming model and convert them into the primitive
operations supplied by the low-level runtime interface
discussed in Section~\ref{subsec:lowlevel}. The
mechanisms by which the high-level runtime performs this
transformation are the primary technical contributions
of this work. We begin our technical description of the
high-level runtime in Section~\ref{sec:framework}.
The other responsibility of the high-level runtime is
to mediate calls to mapper objects through the mapper
interface. As an application is executing, many decisions
need to be made regarding where tasks should be assigned
and where physical instances of logical regions should be
placed. The high-level runtime directs these queries to the
appropriate mapper object through the Legion mapper
interface\footnote{In this way the mapper interface is actually
a call-back interface where methods of the mapper objects
are invoked by the runtime.}. Using the results of the
mapping calls, the high-level runtime then carries out
the specified mapping of an application onto the
target architecture.
\subsection{Low-Level Runtime}
\label{subsec:lowlevel}
The design and implementation of the Legion low-level
runtime is covered in detail in \cite{Realm14} and is
mostly beyond the scope of this thesis. However, we
briefly cover the important details of the low-level
runtime interface necessary for understanding the
interaction with the high-level runtime.
The low-level runtime interface provides a useful set
of primitives for orchestrating computation, data
movement, and synchronization in a distributed memory
system. Specifically, it provides the following
primitive operations:
\begin{itemize}
\item Tasks -- launch tasks on specific processors on either
local or remote nodes
\item Instances -- allocate and free physical instances
in memories
\item Copies -- issue copies or reductions between physical instances
\item Synchronization -- provide {\em reservations} as an
atomicity primitive
\end{itemize}
The most important aspect of the low-level runtime is
that all of these primitives are non-blocking.
All operations are asynchronous and return
immediately with an {\em event} that names a time in the
future when the operation will be complete. Furthermore,
all operations in the low-level runtime can take
events as preconditions that must trigger before
the operation can execute. By using events to compose
these primitives, the high-level runtime can construct
large directed acyclic graphs (DAGs) of operations.
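
To make this composition concrete, the sketch below chains
several non-blocking operations together with events. The
function names are schematic stand-ins for the low-level
runtime interface rather than its exact API.
\begin{verbatim}
// Schematic event composition; issue_copy/spawn_task are
// illustrative names, not the exact low-level runtime API.
Event fill_done = issue_copy(src_inst, dst_inst, Event::NO_EVENT);
// Two tasks that both wait on the copy and can run in parallel.
Event a_done = spawn_task(proc0, TASK_A, args_a, fill_done);
Event b_done = spawn_task(proc1, TASK_B, args_b, fill_done);
// Fan-in: a copy that must wait for both tasks to finish.
Event join = Event::merge_events(a_done, b_done);
Event out_done = issue_copy(dst_inst, result_inst, join);
// None of these calls block; the low-level runtime executes
// the resulting DAG as each precondition event triggers.
\end{verbatim}
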
\begin{figure}[ht!]
\centering
\includegraphics[scale=0.2]{figs/event_graph_example.pdf}
\caption{Example Low-Level Runtime Event Graph from the Pennant Application\label{fig:event_graph_example}}
\end{figure}
Figure~\ref{fig:event_graph_example} shows an example
low-level runtime DAG generated by the high-level runtime
from a real application. In this image, boxes correspond
to operations, while edges correspond to event dependences.
Yellow boxes are copies, red boxes are reductions,
blue boxes are individual tasks, purple boxes are point
tasks from an index space task launch, and green boxes
are inline mappings. This graph shows all of the event
dependences computed by the high-level runtime.
In practice, many of these event dependences are pruned
by the high-level runtime through optimizations that
are discussed in Chapter~\ref{chapter:physical}, but
they are shown here for the purposes of debugging and
visualization.
While the high-level runtime is responsible for the
construction of the DAG of operations, the low-level
runtime is responsible for scheduling the operations
onto the target hardware while abiding by the
constraints specified by event dependences. It is
important to note that many of the DAGs computed
for real applications contain significant amounts of
parallelism. For example, in
Figure~\ref{fig:event_graph_example}, the graph is
very tall; since event dependences primarily point from
left to right, the height of the graph indicates that many
operations can execute in parallel. By exposing all
of this parallelism in the DAG, the low-level runtime
can discover significant amounts of work that can
be used to overlap long latency events such as
data movement and synchronization. Most importantly,
all of this parallelism is extracted automatically
by the Legion high-level runtime without requiring
any assistance from the programmer. This is only
possible because Legion knows which logical regions
different tasks will be accessing and can therefore safely
infer when tasks and other operations can be
executed in parallel. In contrast, the burden of
hiding long latency operations in MPI by overlapping
computation and communication is placed directly
on the programmer (see Chapter~\ref{chapter:intro}).
The low-level runtime also serves another
useful purpose in the Legion software stack:
portability. The low-level runtime interface acts
as an abstraction layer for various hardware interfaces.
The primitives provided by the low-level runtime
interface are sufficiently simple that they can be
implemented on all current hardware architectures.
Furthermore, their simplicity suggests that they
should be relatively straightforward to implement on future
architectures as well. Consequently, Legion can
easily be ported to both current and future
architectures without needing to re-implement the
entire software stack, but simply by providing
a new low-level runtime implementation that
incorporates new hardware. Since the majority of
Legion code resides in the high-level runtime,
this layering is crucial for making Legion a
portable runtime system.
\subsection{Mapper Interface and Machine Model}
\label{subsec:mapperinst}
The Legion mapping interface is the mechanism by which
Legion applications can directly control how they are
mapped onto target hardware. As was mentioned in
Section~\ref{subsec:highlevel}, the mapping interface
is actually a reverse interface operated with a
call-back model. Whenever mapping decisions need to
be made, the Legion runtime performs a call to
the corresponding mapper object to make the decision.
While mapper functions must always respond to queries,
they are methods on mapper objects, permitting a mapper
to memoize state that can be used for answering
future queries. We discuss details of the various
mapping queries that can be made and potential
implementations of them in Chapter~\ref{chapter:mapping}.
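
As a sketch of the call-back model, a mapper object might look
like the following; the method names are illustrative rather
than the exact Legion mapper interface.
\begin{verbatim}
// Illustrative mapper sketch; method names are not the exact API.
class RoundRobinMapper : public Mapper {
public:
  // Invoked by the runtime whenever a task needs a processor.
  virtual Processor select_target_processor(const Task *task) {
    // Memoized state persists across queries on the mapper object.
    next = (next + 1) % procs.size();
    return procs[next];
  }
private:
  std::vector<Processor> procs; // discovered from the machine object
  size_t next = 0;
};
\end{verbatim}
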
In order to aid in making mapping decisions that
depend on the target architecture, the mapper objects
have direct access to a singleton object called
the {\em machine} object. The machine object is an
interface for introspecting the underlying hardware
on which the application is executing. As shown in
Figure~\ref{fig:softwarestack}, the machine object
is provided by the low-level runtime interface directly
to the mapper. Using the machine object provided by
the low-level runtime, mapper objects can glean
information about all of the processors and memories
in the system as well as many of their characteristics
such as the latencies and bandwidths for accessing
or copying data between different memories. Currently,
the machine object is static and does not change
over the lifetime of an application, but in the
future it may be dynamically modified to reflect the
changing state of the hardware (e.g. if a node crashes).
The need to react to such events further underscores
the need for mappers to be dynamic objects capable
of reacting to events at runtime.
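
For example, a mapper might use the machine object to discover
the processors it can target, along the following lines (a
sketch; the accessor names are close to, but not guaranteed to
match, the current interface).
\begin{verbatim}
// Sketch of machine introspection inside a mapper.
Machine machine = Machine::get_machine();
std::set<Processor> all_procs;
machine.get_all_processors(all_procs);
std::vector<Processor> cpu_procs, gpu_procs;
for (std::set<Processor>::const_iterator it = all_procs.begin();
     it != all_procs.end(); it++) {
  // Partition processors by kind so the mapper can make
  // kind-specific decisions (e.g. CPUs vs. GPUs).
  if (it->kind() == Processor::LOC_PROC)      // latency CPU core
    cpu_procs.push_back(*it);
  else if (it->kind() == Processor::TOC_PROC) // throughput (GPU) proc
    gpu_procs.push_back(*it);
}
\end{verbatim}
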
\subsection{Runtime and Mapper Instances}
\label{subsec:rtinstances}
Since Legion is designed to operate in a distributed
memory environment, it is useful to understand how
instances of the Legion runtime and mapper objects
are deployed throughout the nodes of a machine.
Figure~\ref{fig:rtinstances} shows where both Legion
runtime instances and mapper instances are instantiated
in a distributed machine with respect to both the
nodes and processors. An instance of the Legion
high-level runtime is created on every node in the
machine. This instance of the high-level runtime is
responsible for managing all of the tasks currently
executing on any of the processors on the node. It
is important to note that this approach mandates that
the Legion high-level runtime interface be thread-safe
since tasks running on different processors can invoke
runtime calls simultaneously.
\begin{figure}[t]
\centering
\includegraphics[scale=0.7]{figs/RuntimeInstances}
\caption{Instantiation of Runtime and Mapper Instances\label{fig:rtinstances}}
\end{figure}
In a slightly different manner, mapper objects are
instantiated such that there exists a one-to-one
mapping between mappers and processors on a node.
Thus, there exists a different mapper object for
each processor made available by the low-level
runtime. The reason for this is two-fold. First,
having a unique mapper for each processor
simplifies the process of implementing a Legion
mapper, as mapper objects can be immediately
customized for the kind of processor that they
are managing. Second, by having separate mappers
for each processor, we immediately avoid any potential
sequential bottlenecks because tasks mapping on
different processors will never conflict with each
other over a common mapper object.
\section{Program Execution Framework}
\label{sec:framework}
Having discussed the details of the runtime architecture
we now begin our discussion of the Legion high-level
runtime and how it executes Legion programs. In
Chapter~\ref{chapter:model} we described how the
Legion programming model requires all programs to be
described as a tree of tasks that operate on logical
regions. One useful way to conceptualize the execution
of this tree of tasks is as a hierarchical collection of
streams of tasks with the execution of each task generating
a new stream of sub-tasks. To execute each stream of tasks
we rely on a task pipeline that we introduce in
Section~\ref{subsec:pipeline}. Section~\ref{subsec:stages}
provides an overview of the different stages in
the Legion task pipeline. In
Section~\ref{subsec:hierarchical} we describe
why the hierarchical stream-of-tasks abstraction is safe,
allowing streams of tasks with a common parent task
to be considered independently without needing to reason
about interactions with other streams.
\subsection{Out-of-Order Pipeline Analogy}
\label{subsec:pipeline}
The dynamic nature of the Legion runtime requires
that the analysis of a stream of tasks be efficient,
because this analysis is overhead that
detracts from the performance of an application.
In order to hide runtime overhead and extract
parallelism from a stream of tasks, Legion
borrows the idea of an out-of-order
pipeline from hardware architectures and adapts
it for use within a distributed software environment.
For several decades, hardware designers have relied on
out-of-order instruction pipelines to both extract
instruction level parallelism from programs and hide the
overheads associated with dynamic instruction
analysis. Legion applies this abstraction
at the coarser granularity of tasks. The
stream of sub-tasks with region requirements is
analogous to the stream of instructions with register
names that hardware processors execute.
Our implementation of the Legion runtime exploits this
analogy extensively to both implicitly extract
parallelism from a stream of tasks and to hide the overheads
associated with dynamic runtime analysis. Many of the
stages of the Legion task execution pipeline described
in Section~\ref{subsec:stages} have obvious analogs
among the pipeline stages of traditional out-of-order hardware
processors. The crucial insight that enables this analogy
is that logical regions provide the necessary information
for describing the data accessed by tasks. As we discuss
in Chapter~\ref{chapter:related}, logical regions
differentiate the Legion runtime from many of the other
parallel task pipeline systems.
\subsection{Legion Pipeline Stages}
\label{subsec:stages}
Figure~\ref{fig:pipearch} shows the different stages
of the Legion runtime task pipeline\footnote{Some pipeline
stages are interchangeable depending on mapper decisions.
We cover the details of this relationship shortly.}. Each
task progresses through the stages of the pipeline in order.
However, the Legion runtime is free to re-order tasks with
respect to each other if tasks are determined to be
non-interfering. We briefly introduce each of the
different pipeline stages and where possible describe
analogies to similar stages in hardware out-of-order
processors. The details of many of these stages will
be the subject of later chapters of this thesis.
\begin{figure}[t]
\centering
\includegraphics[scale=0.7]{figs/LegionPipeline}
\caption{Legion Runtime Task Pipeline Stages\label{fig:pipearch}}
\end{figure}
The first stage in the Legion task pipeline is dependence
analysis. This stage determines which tasks are safe to
run in parallel and which tasks have dependences based
on logical region usage. Dependence analysis is computed
with respect to the sequential order in which tasks are
issued to the runtime. We refer to this initial order of tasks as
{\em program order}. Dependences always proceed from tasks coming
earlier in program order to tasks coming later in program
order. The natural partial order on dependences ensures that
deadlock cannot result from the dependence analysis stage.
The dependence analysis stage is similar to the wake-up
select stage of out-of-order processors that decides which
instructions have had all of their dependences satisfied
and are therefore safe to execute. The implementation of
the dependence analysis pipeline stage is covered in
detail in Chapter~\ref{chapter:logical}.
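
The essence of the non-interference test on a pair of region
requirements can be sketched as follows; the helper functions
are hypothetical, and the real analysis operates over the
region tree as described in Chapter~\ref{chapter:logical}.
\begin{verbatim}
// Simplified non-interference test; disjoint() and
// fields_overlap() are hypothetical helpers.
bool non_interfering(const RegionRequirement &a,
                     const RegionRequirement &b)
{
  if (disjoint(a.region, b.region)) return true; // different data
  if (!fields_overlap(a, b))        return true; // different fields
  if ((a.privilege == READ_ONLY) &&
      (b.privilege == READ_ONLY))   return true; // both readers
  if ((a.privilege == REDUCE) && (b.privilege == REDUCE) &&
      (a.redop == b.redop))         return true; // same reduction
  return false; // otherwise record a dependence in program order
}
\end{verbatim}
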
Once a task has all of its dependences satisfied (because
all of the tasks it depends on have mapped), it progresses
to the mapping phase of the pipeline. The mapping phase
is composed of two tightly coupled stages that are
interchangeable depending on decisions made by the mapper
for a task. The ordering of the mapping and distribution
stages is determined by whether the mapper wants to map the
task on the node where the task will run, which we call
{\em remote mapping}, or map on the node where the task
originated, which we call {\em local mapping}. When a task
is remotely mapped, it is first sent to the target node
during the distribution stage along with its meta-data
and then it is mapped during the mapping stage onto
the target processor. In the case where a task is locally
mapped, it is mapped on the origin node and then sent to
the target node to be executed. There is an obvious
trade-off to be made by mappers here: mapping remotely
permits more tasks to be mapped in parallel on different
nodes, but requires movement of meta-data, while locally
mapping a task does not require meta-data movement, but
serializes some task mappings. We discuss this trade-off
in more detail in Chapter~\ref{chapter:mapping}.
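
This choice can be expressed as a per-task mapper decision;
the sketch below assumes an option field named
{\tt map\_locally}, which is an illustrative name.
\begin{verbatim}
// Sketch: choosing between local and remote mapping per task;
// the map_locally field name is an assumption for illustration.
void ExampleMapper::select_task_options(Task *task)
{
  // Small leaf tasks carry little meta-data and benefit little
  // from remote mapping, so map them locally; map everything
  // else remotely so many nodes can perform mappings in parallel.
  task->map_locally = (task->task_id == SMALL_LEAF_TASK_ID);
}
\end{verbatim}
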
While the distribution stage has no obvious analog in
hardware out-of-order processors, the mapping stage
is closely related to the register renaming stage in
hardware processors. In order to permit hardware
processors to extract more instruction level parallelism,
logical register names are mapped onto a separate set of
physical registers. Register renaming
provides an extra level of indirection that allows the
hardware to remove conflicts such as write-after-read (WAR)
anti-dependences. In Legion, the mapping stage yields
an even more powerful level of indirection. The mapping
stage provides the level of indirection necessary for
decoupling the specification of applications from how
they are mapped onto the target hardware. During the
mapping stage, mapper objects are queried about how
tasks should be assigned to processors and where physical
instances of logical regions should be placed in the
memory hierarchy. Mapping logical regions onto physical
instances not only allows Legion to remove conflicts
such as WAR anti-dependences on logical regions, but also
allows the mapper to control where instances are placed
in the memory hierarchy for performance. The memory
hierarchy therefore acts as a many-tiered register file
for storing different physical instances of a logical
region. We discuss the implementation of the mapping
stage in Chapter~\ref{chapter:physical} and the
implementation of the distribution stage in
Chapter~\ref{chapter:distributed}.
The execution stage of the Legion pipeline is nearly
identical to the execution stage for instructions
in hardware processors. In hardware, an instruction
is assigned to an execution unit where it is performed.
In Legion, a task is assigned to a processor, which
then executes the task. The primary difference is
that while instructions in hardware are not capable
of generating new instructions, in Legion, the execution
of a task can generate a new stream of sub-tasks
that then must also be executed by the runtime.
We discuss the implications of the hierarchical
nature of streams of tasks in
Section~\ref{subsec:hierarchical}.
The fifth stage of the pipeline is used for checking
when speculation has been resolved. For tasks that
are not predicated, this stage does not require any
work. However, for tasks that were predicated and for
which the mapper of the task chose to speculate on
the predicate value (see Section~\ref{subsec:specexec}),
this is the stage where the results of the speculation
are checked. If a task was correctly speculated, then
this stage performs no work. However, if the speculation
was incorrect, then the runtime must either
re-execute the task, if the value of the predicate
was incorrectly guessed to be false,
or undo the results of the task, if the value
of the predicate was incorrectly guessed to be true.
Furthermore, any tasks with dependences upon the
mis-speculated task must also be rolled back. We
discuss this process in greater detail
in Chapter~\ref{chapter:resilience}.
The sixth stage of the pipeline is the completion
stage that determines when a task has finished
executing. Due to Legion's deferred execution model,
it is possible for a task to finish its execution
stage of the pipeline and continue to progress through
the pipeline while the sub-tasks that it launched
have yet to execute. The completion stage is the
point where a task waits for all of its child tasks
to reach the completion stage. It is important to
note that this definition of completion is inductive:
in order for a task to complete, all of its descendant
tasks must already have completed. Once a task is
complete, it has correctly executed and may propagate
information such as the value of its future.
Tasks that depend on the completion of this task
are also free to begin executing. However, for
resiliency purposes this is not the final stage
of the pipeline.
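
The inductive completion condition is naturally expressed with
low-level events; the following is a sketch of the runtime's
book-keeping, with illustrative rather than exact names.
\begin{verbatim}
// Sketch: a task's completion event is the merge of its own
// execution-finished event with its children's completion events.
Event completion_event(const TaskNode &t)
{
  std::set<Event> preconditions;
  preconditions.insert(t.execution_done);       // this task's body
  for (size_t i = 0; i < t.children.size(); i++)
    preconditions.insert(completion_event(t.children[i]));
  return Event::merge_events(preconditions);    // inductive fan-in
}
\end{verbatim}
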
The final stage of the pipeline is the commit stage.
The commit stage of the pipeline determines
when it is safe to reclaim the meta-data associated
with a task. While a task may have successfully
executed, it is possible that the data generated
by the task is corrupted after execution (e.g. by
a bit-flip in memory). Under these circumstances, a task may
need to be rolled back and re-executed. In order to
perform a roll-back it is necessary to retain the
meta-data (task ID, region requirements, etc.)
for re-executing the task. A task can only be
committed once we are certain that no roll-back
of the task will ever occur. We discuss the
conditions for determining when it is safe for a
task to be committed in Chapter~\ref{chapter:resilience}
and describe how mappers can choose to
preemptively commit tasks for performance reasons
in Section~\ref{subsec:commit}.
\subsection{Hierarchical Task Execution}
\label{subsec:hierarchical}
As we mentioned in Section~\ref{subsec:stages}, the
Legion pipeline analogy does not directly align with
hardware out-of-order pipelines due to the hierarchical
nature of the Legion programming model. In a Legion
program, the execution of a task within the pipeline can
result in an entirely new stream of sub-tasks with
region requirements that need to be executed. The
obvious question then is whether these tasks from
different streams need to be considered when performing
operations such as dependence analysis and mapping.
Importantly, due to a crucial design decision introduced
in the Legion programming model, tasks need only consider
other tasks within the same stream (i.e. originating
from the same parent task) when progressing through
the stages of the task pipeline.
Recall from Section~\ref{subsec:subtasks} that a
task is only permitted to request privileges on
a subset of the regions and fields for which its
parent task held privileges. This important property
is sufficient to prove a useful theorem: if two tasks
$t_1$ and $t_2$ are non-interfering, then all
sub-tasks of $t_1$ will be non-interfering with all
sub-tasks of $t_2$. The proof of this theorem is
based on a formal operational semantics of the Legion
programming model and is beyond the scope of this
thesis. A full proof of the theorem can be found in
\cite{LegionTypes13}.
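Writing $s \,\#\, t$ for ``$s$ and $t$ are non-interfering'' and
$\mathit{sub}(t)$ for the set of sub-tasks launched by $t$, the
theorem can be stated as
\[
  t_1 \,\#\, t_2 \;\Rightarrow\; \forall s_1 \in \mathit{sub}(t_1),\
  \forall s_2 \in \mathit{sub}(t_2),\ s_1 \,\#\, s_2
\]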
The consequences of this theorem are profoundly
important to the implementation of the Legion runtime.
This theorem permits the Legion runtime to
implement a hierarchical scheduling algorithm.
Tasks only need to consider other tasks in the
same stream (which all come from the same parent
task) when traversing the task pipeline.
Since all tasks in the same stream originate on
the same node, no communication is
necessary for performing dependence analysis.
The hierarchical nature of the Legion task pipeline
also allows many tasks to be traversing the task
pipeline in parallel on different nodes. This
property is essential to enabling Legion to scale
well to thousands of nodes.