Skip to content

Latest commit

 

History

History
3071 lines (2653 loc) · 144 KB

File metadata and controls

3071 lines (2653 loc) · 144 KB

Analyzing Dynamic Task-Based Applications on Hybrid Platforms: An Agile Scripting Approach

1 Latex options

2 The Paper

2.1 Latex configurations

2.2 Frontpage

2.3 Abstract

In this paper, we present visual analysis techniques to evaluate the performance of HPC task-based applications on hybrid architectures. Our approach is based on composing modern data analysis tools (pjdump, R, ggplot2, plotly), enabling an agile and flexible scripting framework with minor development cost. We validate our proposal by analyzing traces from the full-fledged implementation of the Cholesky decomposition available in the MORSE library running on a hybrid (CPU/GPU) platform. The analysis compares two different workloads and three different task schedulers from the StarPU runtime system. Our analysis based on composite views allows to identify allocation mistakes, priority problems in scheduling decisions, GPU tasks anomalies causing bad performance, and critical path issues.

2.4 Introduction

To fulfill the ever-growing need for computation power of High-Performance Computing (HPC) applications, it has become common to rely on hybrid nodes, composed of multi-core processors (CPUs) with multiple accelerators (GPUs). However, due to the heterogeneity and complexity of such machines, achieving portable and scalable performance has become extremely challenging. A possible solution, increasingly used by the community, is to program the application at a high level, independently of the hardware architecture, as a Directed Acyclic Graph (DAG) of tasks. It is then the responsibility of another software layer, called the runtime, to dynamically schedule the resulting tasks on the different computing resources taking into account the possible speed heterogeneity and variability as well as to automatically take care of data movements between resources. This allows removing artificial synchronizations from the application code, implementing complex scheduling and data movement algorithms (such as HEFT\cite{heft2992topcuoglu}) that would be hard to manually incorporate in the application. Thanks to dynamic decisions, the irregular behavior of applications and resources is absorbed and exploited by the scheduler during execution, effectively balancing the load among computing resources.

Task-based executions on hybrid platforms are inherently stochastic. Task mapping, for instance, can drastically change from one execution to another. From the performance analysis perspective, the nature of such opportunistic execution schemes makes classical performance analysis, which expects regularity and highlights heterogeneity, invariably ineffective. At the same time, application and runtime developers seek to understand the attained performance to improve the application design and scheduling decisions. This investigation is challenging because it is hard to compare many traces when parameters affecting task creation and scheduling change.

In this article, we explain how we designed a framework that enables an easy and faithful identification of subtle scheduling problems that would otherwise go unnoticed and misunderstood with classical trace visualization approaches. We built the framework@@latex:\setlength{\skip\footins}{0.3cm}\footnote{Code available at \url{http://perf-ev-runtime.gforge.inria.fr/vpa2016/}}@@ on top of modern data analytics tools, combining the R programming language (and in particular the ggplot2 library) and org-mode\cite{orgmode}. The tool combination comes at a very low development cost when compared to a traditional and monolithic performance visualization tool. The designed views depict task execution along time for each resource, automatically detecting several interleaving critical paths in traces. We demonstrate the effectiveness of our visualization approach by analyzing traces from the dense linear algebra Cholesky factorization of the Chameleon/MORSE package\cite{agullo2012morse}, implemented using the StarPU task-based runtime\cite{augonnet2011starpu}. Two representative factorization workloads are carried out on a hybrid multi-core/multi-GPU architecture. The large workload brings interesting insights on pinpointing resource usage mistakes and comparing three StarPU schedulers (DMDA, DMDAS and Work Stealing). The smaller workload shows that the dynamic critical path analysis provides hints for optimizations.

Section\ref{sec.context} provides some background on task-based runtimes for hybrid platforms and on the Cholesky algorithm. Section\ref{sec.relatedwork} presents some related work on trace visualization, motivating our own study. Section\ref{sec.proposal} presents our trace visualization proposal for the performance analysis of task-based runtimes. In Section\ref{sec.usecases} we detail two case studies demonstrating the effectiveness of our visualization strategy. Section\ref{sec.conclusion} gives a summary of results and future work.

2.5 Background and Experimental Context

\label{sec.context}

2.5.1 Background on Task-based Runtimes

\label{sec.background}

Traditional bulk-synchronous parallel (BSP) applications, made of supersteps (computation, communication, barrier), is a very common design when computing resources are homogeneous. The trend towards heterogeneous resources with accelerators, such as GPUs, makes the task-based programming paradigm much more suitable. Depending on the sophistication of the runtime scheduling algorithms, applications can more or less efficiently exploit heterogeneous configurations.

Several runtimes targeting hybrid platforms have been developed in the recent years. MAGMA\cite{agullo2009magma}, for instance, combines multi-core with GPUs for linear algebra applications. OmpSs\cite{duran2011ompss} provides an extension to OpenMP tasks through new directives that allow supporting multi-core systems combined with GPUs. PaRSEC\cite{bosilca2012parsec} is a generic framework for architecture aware scheduling of tasks on many-core heterogeneous clusters. StarPU\cite{augonnet2011starpu} is a task parallelism runtime initially designed to exploit hybrid architectures and additionally providing an MPI-based extension\cite{augonnet2012starpumpi} to exploit several nodes at once. Dense/sparse linear algebra have been among the first applications to exploit such runtimes but other applications have been also implemented, such as FEM applications\cite{ohshima2013FEMstarpu}, seismic wave modeling\cite{martinez2015towards}, and others\cite{agullo2016conjugate,lacoste2014sparse}.

2.5.2 Experimental Context and Workload Details

\label{sec.setup}

We propose to build on a relatively representative use case to address the lack of adequate visualization tools for task-based applications. We use traces of the dense linear algebra Cholesky factorization, more specifically of the full-fledged implementation from the MORSE library\cite{agullo2012morse}, compiled with the CUBLAS kernels. A simplified version of this application is shown in Figure\ref{fig:cholesky}; and the corresponding DAG for a $5×5$ matrix size is shown on its right (Figure\ref{fig:dag5x5}). For each step $k$ of the outer loop, one \DPOTRF task releases $\scriptstyle N-k$ \DTRSM and \DSYRK tasks, followed by $\scriptstyle ≈ (N-k)^2/2$ \DGEMM tasks. The dependencies indicate that simultaneous execution of several iterations is possible and that the iteration size decreases at the same time as $k$ increases. So, the potential parallelism gets reduced as the algorithm advances (see the DAG of the figure). Finally, the task’s execution time highly depends on the task (\DPOTRF, \DTRSM, \DSYRK, and \DGEMM) and resource types (CPU or GPU). We have used the CUBLAS without the \DPOTRF code. Hence, \DPOTRF tasks can only be run on CPUs.

Since hybrid heterogeneous nodes motivate the development of task-based runtimes, we execute this Cholesky implementation on IdCin2, a machine with two 14-core Intel(R) Xeon(R) CPU E5-2697v3@2.6GHz and three NVIDIA Titan X. From this set of resources, only 25 CPU cores participate in the computation because it is generally more efficient to let StarPU dedicate one core to manage each GPU. StarPU provides several scheduling algorithms that exploit both the DAG structure (through critical-path based heuristics) and performance models. Here, we focus on three of them.

The DMDA (Deque Model Data Aware) scheduler is a list scheduler, i.e., every time a resource is idle, if a task is ready, it will be scheduled on this particular resource. Such a scheduler therefore never leaves a resource idle on purpose, which ensures the well-known $(2-1/p)$ competitive ratio for homogeneous machines\cite{Graham66}. Deciding which ready task to select has a major influence in practice and the classical heuristic consists of prioritizing tasks based on the critical path. However, the critical path notion is dynamic and obtaining a proper estimation can be quite challenging. With heterogeneous computing resources, such prioritization is generally done with variants of the HEFT (Heterogeneous Earliest Finish Time) strategy\cite{heft2992topcuoglu}. The DMDA algorithm is a very greedy heuristic that schedules tasks in the order they become available, without taking critical path priorities into account. However, it considers data transfer time between CPUs and GPUs as well as the relative performance of resources on each computation kernel when taking its decision. The DMDAS (Deque Model Data Aware Sorted) scheduler is similar to the first strategy, except that it sorts tasks by priority, which can be expensive when the number of tasks is large. It is therefore rather close to the original HEFT algorithm by respecting priorities and taking past scheduling decisions into account. Finally, the WS (Work Stealing) scheduler uses a queue per worker; new tasks are kept local by default. When a worker is idle, it steals tasks from the most loaded worker.

The next section presents related work on performance analysis for BSP and DAG-based HPC programming models. We detail the issues of current solutions, motivating our work.

2.6 Related Work and Motivation

\label{sec.relatedwork}

Despite the plethora of runtimes to execute task-based applications on heterogeneous resources (see Section\ref{sec.background}), there are very few established tools to conduct a proper task-aware analysis. Developers usually rely on BSP-based trace visualization tools, whose objectives are different, seeking unexpected heterogeneity where regular, homogeneous behavior is normal. Such tools are therefore unsuitable for visualizing task execution behavior since heterogeneity is the expected scenario for task-based applications. We briefly detail trace visualization strategies for BSP and DAG-based applications to better understand their differences. We also describe the design challenges of novel trace visualization techniques for DAG applications, as well as the typical questions usually raised during the analysis process.

2.6.1 Trace visualization analysis for BSP-based applications

Many tools exist to visualize traces from BSP-based applications. Most are focused on MPI applications. The common technique is based on Gantt charts, depicting each thread behavior along time. Behavior is drawn using colors to represent different thread states \eg MPI operation. Message-passing is depicted with arrows from source to destination, sometimes annotated with the amount of transferred data. Vite\cite{Vite2009} is an OpenGL-based open-source tool that has such a view, capable to visualize large traces with thread states and communication. Since this tool relies on the semantic-free Paje language\cite{pajelang}, it can depict virtually any kind of traces. Paraver\cite{pillet1995paraver} is another open-source Gantt tool that allows filtering, zooming, and graphical trace aggregation. Its trace format supports many HPC programming models. Vampir\cite{knupfer2008vampir} is a proprietary tool with multiple views for OTF2 traces. It is more scalable than similar tools due to its distributed organization.

2.6.2 Visualizing task execution traces from DAG applications

There are a few tools to perform analysis and visualization of task-based executions. Typically, they are built with resources not designed for data analysis and rely on either non-scalable or non-scriptable strategies \ie with mouse pointer interaction. DAGViz\cite{huynh2015dagviz} offers a visual representation of task-based executions focusing on the DAG structure, which is retrieved using macros (translated to Cilk, Intel TBB or OpenMP) and presented in a hierarchical way. The resulting DAG can be folded/unfolded on-demand to show details and the node color indicates where they are executed. There is no way to retrieve the time dimension and task duration, which can make performance analysis difficult.

Temanejo\cite{keller2012temanejo} provides a similar timeless DAG interactive views for many task-based runtimes. Its main features are the debug capabilities, e.g. the user can put a breakpoint in a task and fix its dependencies. These capabilities are very useful during algorithm design on small scale, being unsuited for performance analysis

Haugen \etal proposes an interactive Gantt chart enhanced with dependencies, drawn as edges between tasks \cite{kurzak2015taskdep}. We believe this approach suffers from three issues. First, in term of scalability, since (\eg in Cholesky) tasks typically have many ($\scriptstyle ≈ N$) dependencies, drawing everything and finding interesting tasks and dependencies only through mouse interaction can be very tedious. In practice, only tasks belonging to the critical path are important. Second, only one-level dependencies are depicted, while several levels are required to understand the history leading to the scheduling problem. Third, this tool does not really exploit the heterogeneity of resources.

2.6.3 Challenges of DAG execution traces analysis

The performance analysis of task-based applications raises many challenges. The stochastic behavior of scheduling decisions induced by actual resource availability and by the task execution variability hinders the performance reproducibility. Moreover, executions are apparently unstructured, with no clear phases as in the BSP model. At the same time, task dependencies are part of the application and should be exploited by the analyst to understand bottlenecks.

Traces are generally much larger than the available screen space and naively displaying everything generally leads to biased views\cite{schnorr2013fits}. One, therefore, has to use selection schemes to show only data that is fully relevant from the analysis point of view. In such complex execution traces, many hypotheses (and thus filters) can be proposed regarding the expected behavior. Developing a monolithic tool, such as the ones tailored for BSP analysis, that anticipates all possible performance problems is thus impractical. Moreover, such tool would be quite difficult to maintain and customize. A more flexible solution with scripting capability is thus needed.

During the analysis of DAG traces, the typical questions that arise are different from those of BSP applications. Instead of inspecting whether and why a certain algorithm iteration or phase was slowed down, one needs to analyze if the tasks were properly scheduled and if sufficient parallelism is available. One also needs to examine whether complex data movement techniques, task submission, and many internal runtime mechanisms can be further improved.

A common approach to better understand the application behavior is to compare several execution traces, possibly each one with a different configuration (\eg scheduling parameter). However, due to the dynamic scheduling, this is generally difficult for task-based traces. To draw relevant conclusions, one needs to synchronize multiple visualizations and filter the unwanted states. Although some support exists in some tools\cite{pillet1995paraver,pagano2014framesoc}, they do not offer enough customization flexibility for such studies.

The next section details our proposal to visualize task-based application traces. It tackles some of the preceding challenges with a framework that combines data analytics tools to create a flexible environment enabling customized analysis.

2.7 Visualization Methodology

Visualizing data allows to graphically check many assumptions at once. It helps with assumptions that are difficult to formally state or for which defining a proper statistical test would require building on even more hypothesis that would also have to be verified. This is why it is important to start by listing various hypothesis (or expectations) made on the system under study. From such list, a set of visualizations can be generated. We propose visualizations that are designed for the application and scheduler developers, assisting them to rapidly identify performance issues as well as potential solutions.

The set of hypothesis to check is fairly rich in heterogeneous platforms targeted by task-based runtime systems. It is thus important to build a visualization framework that allows to easily and rapidly combine various views and propose new alternative views in an agile way. Moreover, since dynamic scheduling and machine heterogeneity bring a lot of variabilities, the ideal visualization should exploit any potential regularity coming from the application algorithm. For example, as we have seen in Figure\ref{fig:dag5}, each task can be identified by the loop indices $i,j,k$. Such kind of information is much more useful than the internal runtime task identification and should thus be provided by the application to the runtime so that it can be traced and further exploited during the visualization.

To meet these different design goals, we decided to use the workflow shown in Figure\ref{fig:combo}. With few modifications on MORSE, tracing is extended to tag tasks with loop indexes at the task creation. StarPU relies on FXT\cite{danjean2005fxt} traces to produce timestamped events in the Paje language\cite{pajelang}. A complete DAG is also created with task identifiers coherent with the Paje trace. Instead of building a complex monolithic tool, we follow the UNIX philosophy and script many small tools. Using pjdump, the paje trace is converted into a Comma-Separated Values (CSV) file that can be loaded into R. Thanks to the expressiveness and to the rich set of statistical libraries of the R language, many cleanups, filtering, statistic computations, and adaptation to another application than Cholesky can be done with few lines of code. The ggplot library provides a grammar of graphics and a very high-level way of building plots, enabling us to easily produce custom visualizations. This environment has enough expressiveness to guarantee different but coherent views (colors, scales, etc.).

This approach allows building static views in a fully automatic and very efficient way. Although such visualizations could probably be sped up even further by programming everything in C/C++, the used libraries are already well optimized and benefit from the know-how of data analysts. Furthermore, a combination of small scripts is both easier to maintain and adapt to a new necessity or to a particular situation than a rigid monolithic visualization environment.

The static views (typically basic X11 window or a PDF) of our approach have disadvantages when compared to tools described in Section\ref{sec.relatedwork}. Interaction is often crucial for the analyst to find what he is looking for. This is why we also build on plotly, an online analytics tool, that enables the quick conversion of ggplots into interactive, online graphs usable with a classical web browser (see \url{https://plot.ly/ggplot2/} for more details). Two illustrations in this article are also available in an interactive version. We strongly believe that putting interaction at the very end together with the scripting capabilities in the core of the analysis process is the key to carrying out the analysis of complex execution traces.

Finally, loading and merging several traces enables faithful comparisons and even produces perfectly aligned and coherent views. Since execution traces are stochastic in nature, we believe that side by side representations are essential to decide whether a phenomenon is important/recurrent or it is an unlucky situation and can be considered as the execution noise.

2.8 Experimental Validation and Results

2.8.1 Introduction

When dynamically scheduling task-based applications, the kind of question to answer is totally different whenever the resulting DAG is large or small. Large DAGs are expected to be embarrassingly parallel, almost reaching peak performance. Since such DAGs have many tasks, one need to use macroscopic views and indicators to understand how performance can be improved. Small DAGs, on the other hand, have little parallelism. Idle time will inevitably be incurred by task dependencies. For such executions, microscopic views with fine-grained data on task dependencies should rather be used.

We fully analyze two very different Cholesky workloads in the rest of this paper: large ($60×60$ tiles of size $960×960$) and small matrices ($12×12$ tiles, same size). For each, we detail the expected behavior, then propose composite views allowing to check these expectations. The views are then exploited to compare the three aforementioned StarPU schedulers, enabling us to propose potential improvements.

2.8.2 Large Workload (Cholesky of $60×60$ tiles of size $960×960$)

2.8.2.1 Expectations

Uniformity. Task duration is expected to depend solely on their type (\DGEMM, \DSYRK, \DTRSM or \DPOTRF) and on the type of resource (CPU or GPU) on which it is executed. Such assumption should be visually verified, highlighting all tasks whose duration is abnormally large compared to the others of the same type/resource. We treat these tasks as independent outliers,

unrelated to other tasks behavior. If not so, it may mean that the whole platform has been perturbed at particular moments or that some resource differs from the others. A task is anomalous if its duration exceeds the sampled third quartile plus 1.5 times the sampled interquartile range. Although this outlier notion is highly debatable and context-specific, other definitions could be easily incorporated.

Dependency problems. Large input matrices generate many tasks, especially when the application starts. We, therefore, want to monitor the number of ready and submitted tasks. For this Cholesky implementation, all tasks are expected to be submitted when the application starts. On scale, the number of task dependencies is extremely large. Automatically selecting which ones to display is haphazard. If a detailed view becomes necessary, we should switch to views described in Section\ref{sec.small_matrices}.

Progress. The task graph resulting from dense linear algebra always share a common structure (for instance, see Figure\ref{fig:dag5}). In a classical semi-sequential execution, the DAG would be executed much similarly to a breadth-first search. However, it is also possible to carry out a depth-first traversal, favoring task execution on the critical path. Following the pipelining of the sets of tasks submitted by each outer loop iteration can be sufficient to get an overview of how the scheduler is handling the DAG and if it corresponds to the analyst’s intuition or not.

Potential improvements. Dependencies are expected to be easily handled with large workloads. The major issue is the load balancing among CPUs and GPUs. Since one knows the average time $w_{i,k}$ needed to perform a task of type $k$ on a resource of type $i$ as well as the total number $n_k$ of tasks per type, one can consider that a fraction $α_{i,k}$ of tasks of type $k$ will be done on resource $i$ and that the $α_{i,k}$ should thus verify:

Since such constraints are linear it is possible to compute the optimal makespan $T$ and allocation $α_{i,k}$. The $T$ value is called the Area Bound Estimation (ABE) and is a lower bound for the execution time. Another classical lower bound is the Critical Path Bound Estimation (CPE). It is obtained by assigning each task to its faster processing resource and by summing all durations along the DAG. These execution time bounds, in particular, the ABE when the workload is large, are quite useful to estimate how much further improvement can be expected. More accurate lower bounds\cite{suraj15} could be used as well, in particular for intermediate size workloads. Moreover, an ideal task allocation is also computed when ABE is defined. Comparing the ideal with the actual allocation may help understanding how scheduling could be improved.

2.8.2.2 Composite View for Pinpointing Scheduling Mistakes

Building on the expectations, we propose a five-area composite view shown in Figure\ref{fig:large_2_pinpointing} (each area is detailed in the caption).

We see in Figure\ref{fig:large_2_pinpointing}(a) that the makespan is 62725ms while the ABE is 59464ms. So, one can hope for a 5% improvement. The scheduling seems indeed inefficient since there are periods (white areas in CPUs) when no useful computation is done. These periods correspond to filtered states (for clarity) where threads try to actively fetch data. The total idle state (yellow areas: a summary in the right of the Gantt) for CPUs is about 1%, while for GPUs it ranges from 2 to 6%. This GPU inactivity is likely the main source of potential improvement. From (c), it is clear that this idle time does not come from a sudden lack of ready tasks. Figure\ref{fig:large_2_pinpointing}(d) clearly indicates that all tasks have been submitted in the beginning and that task execution started immediately after, without waiting for fully unrolling the DAG. As suggested in (b), DAG traversal is rather depth-first. Many outer loop iterations are parallel (the maximum is 30 around 40s), explaining why there are always a sufficient number of ready tasks.

Such GPUs starvation is more likely explained either by data prefetching problem (some tasks are ready but their input data is not yet transferred to GPUs) or possibly by some priority issue (the priorities, used by the scheduler to choose which task to run first when several of them are ready, might be inadequate). The first reason is likely to be the right one here. Indeed, most large idle periods on GPUs and large periods of times where CPUs are not doing useful computations (in white) also coincide with abnormal \DGEMM tasks (in dark green) on GPUs. Note that the scripting feature allow us to show only tasks of a given type, resource, within a specific time interval, and whose duration is larger than a threshold and therefore clear any doubt the analyst may have. An investigation (such as the one of Section\ref{sec.small_matrices}) reveals that, for an unknown reason, the GPUs seem to freeze during a task execution inside the proprietary CUBLAS \DGEMM kernel, ultimately blocking tasks eagerly waiting for GPU data. Understanding why GPUs sometimes get stuck would certainly solve the issue but this clearly suggests a weakness of the chosen scheduler which assumes that tasks duration have small variability. Using other schedulers may, therefore, alleviate this.

The four plots depicted in Figure\ref{fig:large_2_pinpointing}(e) show the ideal allocation when calculating the ABE. They show how the GPUs have been overused with \DGEMM tasks and under-exploited for \DSYRK and \DTRSM tasks. It, therefore, suggests constraining the \DSYRK and \DTRSM tasks to run exclusively on GPUs.

2.8.2.3 Comparing Scheduling Strategies and Task Constraints

The previous analysis leads us to vary the scheduler (DMDA, DMDAS, WS) and to force or not the \DSYRK/\DTRSM allocation on GPUs. Figure\ref{fig:large_2_comparison} provides the six-scenario comparison.

First of all, it is interesting to see how the three schedulers differ in their DAG traversal of the DAG. While the DMDA algorithm has a breadth-first traversal (very few iterations of the outer loop are active at the same time), the DMDAS has a much more depth-first traversal as it takes the priority of the critical path into account. The traversal of the Work Stealing (WS) is even more depth-first as almost all outer loop iterations are still in progress at the end of the execution. Such way of progressing through the DAG is typical of WS and somehow favors local data accesses even though the algorithm is more dependency myopic than the two other ones.

Second, when constraining the \DSYRK and \DTRSM to run solely on the GPUs (the plots on the bottom row of Figure\ref{fig:large_2_comparison}), task allocation corresponds to the ideal one. However, while such constraint allows both DMDAS and Work Stealing to obtain near optimal executions (within less than 2% of the lower bound), it helped only moderately the DMDA algorithm. Many synchronized idle phases can be observed and imputed to both dependency issues (not enough parallelism is obtained from such a strict breadth-first traversal) and particularly slow tasks (probably slowed down by simultaneous data transfers). Interestingly, very few outlier tasks appear in the DMDAS and WS executions although the latter still seems a bit sensitive to this, as inactivity periods on CPUs (white areas) still correlate with the occurrence of \DGEMM outliers (darker green) on GPUs.

Finally, we stress that such observations are no coincidence. We randomly ran similar scenarios ten times and although the numbers always slightly differ, the general behavior and conclusions are the same. We also highlight that the area bound estimations (ABE) can vary significantly between two scenarios (e.g., 60s for constrained DMDA vs. 57s for constrained WS), which can be initially surprising since these estimates only depend on the number of tasks and their per-type average execution time on the different resources. The observations can be explained by the use of sample execution time mean, which may vary a bit. From our investigation, this variation is not explained by outliers occurrence but rather biased toward one or another scheduler. We think this is the consequence of a better locality (cache usage) but more complex measurements would be needed to fully evaluate this hypothesis.

2.8.3 Small Workload (Cholesky of $12×12$ tiles of size $960×960$)

2.8.3.1 Expectations

Potential improvements. The area bound (ABE) is optimistic for small workloads since it ignores task dependencies. The critical path bound (CPE) is much more relevant, especially on very small workloads such as this one. Still, knowing how tight they are is quite difficult\cite{suraj15}. Thus, comparing to the ideal CPU/GPU allocation is meaningless and we should focus mainly on filtering task dependencies.

Idle time everywhere. Having a lot of idle time is expected because of dependencies. It is thus imperative to identify crucial tasks, highlighting the “dynamic critical path”, \ie the last tasks upon which they depended on. Such important tasks may appear either because of the DAG structure or because they have been particularly delayed.

2.8.3.2 Backtrack Dependencies to Pinpoint Scheduling Mistakes

Since StarPU also exports the DAG, the dependency information can be merged into the trace. For Cholesky, the \DPOTRF are critical tasks that release many other tasks. It seems, therefore, relevant to track their dependencies. For a given task $T_{i}$, it is possible to compute what was the task $T_{i-1}$ on which it depends on and that finished the latest, similarly, for $T_{i-1}$ its latest predecessor $T_{i-2}$, etc. Such backward tracking of dependencies allows rebuilding the observed critical path. Given the DAG, in an ideal execution one would expect a \DPOTRF to be immediately preceded by a \DSYRK, immediately preceded by a \DTRSM that would, in turn, be immediately preceded by the \DPOTRF of the previous iteration. Therefore, we compute such filtered backward dependencies for each \DPOTRF and merge them together.

This is illustrated in Figure\ref{fig:small_2_pinpointing}. The makespan is 730ms while the ABE is 434ms and the CPE is 368ms. The bounds may be loose but it seems that there is room for improvement. If we start from the end of the schedule and go backward in time, we can see a dependency path (in blue) that, until timestamp 400ms, fully respects the alternation \DPOTRF–\DSYRK–\DTRSM. At the very end, all tasks execute right one after the other, which is optimal. The first “mistake” appears in time 600ms where the \DSYRK could have been executed a little earlier. Slightly before, some \DTRSM are not executed right after their \DPOTRF maybe because of data transfer or more likely because of a wrong priority. This critical path does not merge with the one obtained for the \DPOTRF of the first iterations. Now, when looking at the other (red) dependency path, we can see many times that the tasks are scheduled as soon as possible as if there was some priority problem, which could possibly be solved with another scheduler. The scripting feature allows plotting only dependencies whose duration is larger than a given threshold, avoiding graphical clutter.

We identify another problem with the blue dependency path. At the end, tasks are executed on the appropriate resource (\DPOTRF on CPUs, and \DSYRK and \DTRSM on GPUs). However, slightly before time 600ms, critical \DSYRK tasks start running on the CPUs, slowing the progression. Likewise, slightly before time 400ms, critical \DTRSM tasks are run on CPUs whereas they are known to be very slow on such resources. It seems that this scheduler makes a bad decision and that constraining \DTRSM and \DSYRK to be executed on GPUs may reduce the total makespan.

Although the potential solutions (fix priority, constraining some tasks to GPUs) suggested by this analysis are the same as in the previous use case, the underlying reasons are fundamentally different.

2.8.3.3 Comparing Scheduling Strategies and Task Constraints

Based on the previous analysis, we vary again the three schedulers and forcing \DSYRK/\DTRSM tasks on GPUs. Figure\ref{fig:small_2_comparison} compares the six resulting combinations for this workload.

By taking a closer look at the unconstrained top row of Figure\ref{fig:small_2_comparison}, we can observe that the behavior demonstrated by the DMDA and DMDAS schedulers are not so different. They both have similar runtime, two unmerged critical paths on which priority and critical task allocation problems can be identified. WS also demonstrates a very bad allocation, which is not surprising because it does not take into account the heterogeneity of the platform. There are three dynamic critical paths in the WS scheduler, with many \DTRSM and \DSYRK running on CPUs. When constraining these two task types to execute only on GPUs (the bottom row of Figure\ref{fig:small_2_comparison}), we observe that such restriction does not really help for the DMDA and DMDAS schedulers. Tasks on the critical path are no longer an issue, but both schedulers still have priority problems. The behavior demonstrated by DMDA seems easier to understand: we see some typical list scheduling behavior with critical \DPOTRF being delayed because CPUs are used for not so critical \DGEMMs. If one could run these tasks earlier, it appears that the whole makespan would be greatly improved.

Surprisingly, Work Stealing strongly benefits from the imposed restriction and now favorably compares against DMDA and DMDAS. It is also interesting to note that WS manages to keep all CPUs busy from the very beginning, unlike the other two schedulers. However, GPUs are not fully exploited, in particular at the end where they should be used to accelerate the \DGEMMs like the DMDA and DMDAS strategies do. If there was a way to prevent \DGEMM task execution on CPU after time 350ms, we would probably get the best of the two scheduling strategies and be much closer to the optimal execution time.

2.9 Conclusion

\label{sec.conclusion}

This article presents how an agile scripting framework allows creating faithful and enlightening trace views for the performance analysis of task-based HPC applications running on heterogeneous platforms. The framework follows the UNIX philosophy and builds upon many small existing tools (pjdump, ggplot2, plotly) glued together with the R programming language. Putting interaction at the very end together with scripting capabilities in the core of the analysis process is the key to a flexible design. We have shown how applying this framework in the analysis of large and small scale scenarios based on the MORSE/StarPU implementation of the Cholesky factorization allows to identify and address several non-trivial scheduling problems. The approach has scalability limitations mostly in the graphical representations (static or interactive) and not the scripting methodology for creating them. Plotly, for instance, is incapable of handling too much data, such as when there are many task types or resources.

As future work, we intend to improve the interactive graphical views by using temporal aggregation to reduce trace size and improve the integration between performance visualization and runtime debugging. We also intend to extend this approach to analyze the behavior of irregular task-based workloads\cite{agullo2013multifrontal}.

2.10 Acknowledgments

This work was supported by the projects: CAPES/Cofecub 764-13, FAPERGS/Inria ExaSE, CNPq 447311/2014-0, HPC4E 689772 from EU/H2020 and MCTI/RNP-Brazil, and the CNRS/LICIA Intl. Lab. We also thank Emmanuel Agullo, Lionel Eyraud-Dubois, and Suraj Kumar for their help in better understanding the traces and the visualization needs of the runtime scheduler developers.

2.11 References