In this paper, we present visual analysis techniques to evaluate the performance of HPC task-based applications on hybrid architectures. Our approach is based on composing modern data analysis tools (pjdump, R, ggplot2, plotly), enabling an agile and flexible scripting framework with minor development cost. We validate our proposal by analyzing traces from the full-fledged implementation of the Cholesky decomposition available in the MORSE library running on a hybrid (CPU/GPU) platform. The analysis compares two different workloads and three different task schedulers from the StarPU runtime system. Our analysis based on composite views allows to identify allocation mistakes, priority problems in scheduling decisions, GPU tasks anomalies causing bad performance, and critical path issues.
To fulfill the ever-growing need for computation power of High-Performance Computing (HPC) applications, it has become common to rely on hybrid nodes, composed of multi-core processors (CPUs) with multiple accelerators (GPUs). However, due to the heterogeneity and complexity of such machines, achieving portable and scalable performance has become extremely challenging. A possible solution, increasingly used by the community, is to program the application at a high level, independently of the hardware architecture, as a Directed Acyclic Graph (DAG) of tasks. It is then the responsibility of another software layer, called the runtime, to dynamically schedule the resulting tasks on the different computing resources taking into account the possible speed heterogeneity and variability as well as to automatically take care of data movements between resources. This allows removing artificial synchronizations from the application code, implementing complex scheduling and data movement algorithms (such as HEFT\cite{heft2992topcuoglu}) that would be hard to manually incorporate in the application. Thanks to dynamic decisions, the irregular behavior of applications and resources is absorbed and exploited by the scheduler during execution, effectively balancing the load among computing resources.
Task-based executions on hybrid platforms are inherently stochastic. Task mapping, for instance, can drastically change from one execution to another. From the performance analysis perspective, the nature of such opportunistic execution schemes makes classical performance analysis, which expects regularity and highlights heterogeneity, invariably ineffective. At the same time, application and runtime developers seek to understand the attained performance to improve the application design and scheduling decisions. This investigation is challenging because it is hard to compare many traces when parameters affecting task creation and scheduling change.
In this article, we explain how we designed a framework that enables an easy and faithful identification of subtle scheduling problems that would otherwise go unnoticed and misunderstood with classical trace visualization approaches. We built the framework@@latex:\setlength{\skip\footins}{0.3cm}\footnote{Code available at \url{http://perf-ev-runtime.gforge.inria.fr/vpa2016/}}@@ on top of modern data analytics tools, combining the R programming language (and in particular the ggplot2 library) and org-mode\cite{orgmode}. The tool combination comes at a very low development cost when compared to a traditional and monolithic performance visualization tool. The designed views depict task execution along time for each resource, automatically detecting several interleaving critical paths in traces. We demonstrate the effectiveness of our visualization approach by analyzing traces from the dense linear algebra Cholesky factorization of the Chameleon/MORSE package\cite{agullo2012morse}, implemented using the StarPU task-based runtime\cite{augonnet2011starpu}. Two representative factorization workloads are carried out on a hybrid multi-core/multi-GPU architecture. The large workload brings interesting insights on pinpointing resource usage mistakes and comparing three StarPU schedulers (DMDA, DMDAS and Work Stealing). The smaller workload shows that the dynamic critical path analysis provides hints for optimizations.
Section\ref{sec.context} provides some background on task-based runtimes for hybrid platforms and on the Cholesky algorithm. Section\ref{sec.relatedwork} presents some related work on trace visualization, motivating our own study. Section\ref{sec.proposal} presents our trace visualization proposal for the performance analysis of task-based runtimes. In Section\ref{sec.usecases} we detail two case studies demonstrating the effectiveness of our visualization strategy. Section\ref{sec.conclusion} gives a summary of results and future work.
\label{sec.context}
\label{sec.background}
Traditional bulk-synchronous parallel (BSP) applications, made of supersteps (computation, communication, barrier), is a very common design when computing resources are homogeneous. The trend towards heterogeneous resources with accelerators, such as GPUs, makes the task-based programming paradigm much more suitable. Depending on the sophistication of the runtime scheduling algorithms, applications can more or less efficiently exploit heterogeneous configurations.
Several runtimes targeting hybrid platforms have been developed in the recent years. MAGMA\cite{agullo2009magma}, for instance, combines multi-core with GPUs for linear algebra applications. OmpSs\cite{duran2011ompss} provides an extension to OpenMP tasks through new directives that allow supporting multi-core systems combined with GPUs. PaRSEC\cite{bosilca2012parsec} is a generic framework for architecture aware scheduling of tasks on many-core heterogeneous clusters. StarPU\cite{augonnet2011starpu} is a task parallelism runtime initially designed to exploit hybrid architectures and additionally providing an MPI-based extension\cite{augonnet2012starpumpi} to exploit several nodes at once. Dense/sparse linear algebra have been among the first applications to exploit such runtimes but other applications have been also implemented, such as FEM applications\cite{ohshima2013FEMstarpu}, seismic wave modeling\cite{martinez2015towards}, and others\cite{agullo2016conjugate,lacoste2014sparse}.
\label{sec.setup}
We propose to build on a relatively representative use case to address
the lack of adequate visualization tools for task-based
applications. We use traces of the dense linear algebra Cholesky
factorization, more specifically of the full-fledged implementation from
the MORSE library\cite{agullo2012morse}, compiled with the CUBLAS
kernels. A simplified version of this
application is shown in Figure\ref{fig:cholesky}; and the
corresponding DAG for a
Since hybrid heterogeneous nodes motivate the development of task-based runtimes, we execute this Cholesky implementation on IdCin2, a machine with two 14-core Intel(R) Xeon(R) CPU E5-2697v3@2.6GHz and three NVIDIA Titan X. From this set of resources, only 25 CPU cores participate in the computation because it is generally more efficient to let StarPU dedicate one core to manage each GPU. StarPU provides several scheduling algorithms that exploit both the DAG structure (through critical-path based heuristics) and performance models. Here, we focus on three of them.
The DMDA (Deque Model Data Aware) scheduler is a list scheduler, i.e.,
every time a resource is idle, if a task is ready, it will be
scheduled on this particular resource. Such a scheduler therefore
never leaves a resource idle on purpose, which ensures the
well-known
The next section presents related work on performance analysis for BSP and DAG-based HPC programming models. We detail the issues of current solutions, motivating our work.
\label{sec.relatedwork}
Despite the plethora of runtimes to execute task-based applications on heterogeneous resources (see Section\ref{sec.background}), there are very few established tools to conduct a proper task-aware analysis. Developers usually rely on BSP-based trace visualization tools, whose objectives are different, seeking unexpected heterogeneity where regular, homogeneous behavior is normal. Such tools are therefore unsuitable for visualizing task execution behavior since heterogeneity is the expected scenario for task-based applications. We briefly detail trace visualization strategies for BSP and DAG-based applications to better understand their differences. We also describe the design challenges of novel trace visualization techniques for DAG applications, as well as the typical questions usually raised during the analysis process.
Many tools exist to visualize traces from BSP-based applications. Most are focused on MPI applications. The common technique is based on Gantt charts, depicting each thread behavior along time. Behavior is drawn using colors to represent different thread states \eg MPI operation. Message-passing is depicted with arrows from source to destination, sometimes annotated with the amount of transferred data. Vite\cite{Vite2009} is an OpenGL-based open-source tool that has such a view, capable to visualize large traces with thread states and communication. Since this tool relies on the semantic-free Paje language\cite{pajelang}, it can depict virtually any kind of traces. Paraver\cite{pillet1995paraver} is another open-source Gantt tool that allows filtering, zooming, and graphical trace aggregation. Its trace format supports many HPC programming models. Vampir\cite{knupfer2008vampir} is a proprietary tool with multiple views for OTF2 traces. It is more scalable than similar tools due to its distributed organization.
There are a few tools to perform analysis and visualization of task-based executions. Typically, they are built with resources not designed for data analysis and rely on either non-scalable or non-scriptable strategies \ie with mouse pointer interaction. DAGViz\cite{huynh2015dagviz} offers a visual representation of task-based executions focusing on the DAG structure, which is retrieved using macros (translated to Cilk, Intel TBB or OpenMP) and presented in a hierarchical way. The resulting DAG can be folded/unfolded on-demand to show details and the node color indicates where they are executed. There is no way to retrieve the time dimension and task duration, which can make performance analysis difficult.
Temanejo\cite{keller2012temanejo} provides a similar timeless DAG interactive views for many task-based runtimes. Its main features are the debug capabilities, e.g. the user can put a breakpoint in a task and fix its dependencies. These capabilities are very useful during algorithm design on small scale, being unsuited for performance analysis
Haugen \etal proposes an interactive Gantt chart
enhanced with dependencies, drawn as edges between tasks \cite{kurzak2015taskdep}. We believe
this approach suffers from three issues. First, in term of
scalability, since (\eg in
Cholesky) tasks typically have many (
The performance analysis of task-based applications raises many challenges. The stochastic behavior of scheduling decisions induced by actual resource availability and by the task execution variability hinders the performance reproducibility. Moreover, executions are apparently unstructured, with no clear phases as in the BSP model. At the same time, task dependencies are part of the application and should be exploited by the analyst to understand bottlenecks.
Traces are generally much larger than the available screen space and naively displaying everything generally leads to biased views\cite{schnorr2013fits}. One, therefore, has to use selection schemes to show only data that is fully relevant from the analysis point of view. In such complex execution traces, many hypotheses (and thus filters) can be proposed regarding the expected behavior. Developing a monolithic tool, such as the ones tailored for BSP analysis, that anticipates all possible performance problems is thus impractical. Moreover, such tool would be quite difficult to maintain and customize. A more flexible solution with scripting capability is thus needed.
During the analysis of DAG traces, the typical questions that arise are different from those of BSP applications. Instead of inspecting whether and why a certain algorithm iteration or phase was slowed down, one needs to analyze if the tasks were properly scheduled and if sufficient parallelism is available. One also needs to examine whether complex data movement techniques, task submission, and many internal runtime mechanisms can be further improved.
A common approach to better understand the application behavior is to compare several execution traces, possibly each one with a different configuration (\eg scheduling parameter). However, due to the dynamic scheduling, this is generally difficult for task-based traces. To draw relevant conclusions, one needs to synchronize multiple visualizations and filter the unwanted states. Although some support exists in some tools\cite{pillet1995paraver,pagano2014framesoc}, they do not offer enough customization flexibility for such studies.
The next section details our proposal to visualize task-based application traces. It tackles some of the preceding challenges with a framework that combines data analytics tools to create a flexible environment enabling customized analysis.
Visualizing data allows to graphically check many assumptions at once. It helps with assumptions that are difficult to formally state or for which defining a proper statistical test would require building on even more hypothesis that would also have to be verified. This is why it is important to start by listing various hypothesis (or expectations) made on the system under study. From such list, a set of visualizations can be generated. We propose visualizations that are designed for the application and scheduler developers, assisting them to rapidly identify performance issues as well as potential solutions.
The set of hypothesis to check is fairly rich in heterogeneous
platforms targeted by task-based runtime systems. It is thus important
to build a visualization framework that allows to
easily and rapidly combine various views and propose new alternative views
in an agile way. Moreover, since dynamic scheduling and machine
heterogeneity bring a lot of variabilities, the ideal visualization
should exploit any potential regularity coming
from the application algorithm. For example, as we have seen in
Figure\ref{fig:dag5}, each task can be identified by the loop indices
To meet these different design goals, we decided to use the
workflow shown in Figure\ref{fig:combo}.
With few modifications on MORSE,
tracing is extended to tag tasks with loop indexes at the task
creation.
StarPU relies on FXT\cite{danjean2005fxt} traces to produce timestamped events in
the Paje language\cite{pajelang}. A complete DAG is also created with
task identifiers coherent with the Paje trace. Instead of building a
complex monolithic tool, we follow the UNIX philosophy and
script many small tools. Using pjdump, the paje trace is converted
into a Comma-Separated Values (CSV) file that can be
loaded into R. Thanks to the expressiveness and to the rich set of
statistical libraries of the R language, many cleanups, filtering,
statistic computations, and adaptation to another application than Cholesky can be done with few lines of code. The
ggplot library provides a grammar of graphics and a very high-level
way of building plots, enabling us to easily produce custom visualizations.
This environment has enough expressiveness to guarantee
different but coherent views (colors,
scales, etc.).
This approach allows building static views in a fully automatic and very efficient way. Although such visualizations could probably be sped up even further by programming everything in C/C++, the used libraries are already well optimized and benefit from the know-how of data analysts. Furthermore, a combination of small scripts is both easier to maintain and adapt to a new necessity or to a particular situation than a rigid monolithic visualization environment.
The static views (typically basic X11 window or a PDF) of our approach
have disadvantages when compared to tools described in
Section\ref{sec.relatedwork}. Interaction is often crucial
for the analyst to find what he is looking for. This is why we also
build on plotly, an online analytics tool, that
enables the quick conversion of ggplots into interactive, online
graphs usable with a classical web browser (see \url{https://plot.ly/ggplot2/} for more details).
Two illustrations in this article are also available in an interactive
version. We
strongly believe that putting interaction at the very end together
with the scripting capabilities in the core of the analysis process is
the key to carrying out the analysis of complex execution traces.
Finally, loading and merging several traces enables faithful comparisons and even produces perfectly aligned and coherent views. Since execution traces are stochastic in nature, we believe that side by side representations are essential to decide whether a phenomenon is important/recurrent or it is an unlucky situation and can be considered as the execution noise.
When dynamically scheduling task-based applications, the kind of question to answer is totally different whenever the resulting DAG is large or small. Large DAGs are expected to be embarrassingly parallel, almost reaching peak performance. Since such DAGs have many tasks, one need to use macroscopic views and indicators to understand how performance can be improved. Small DAGs, on the other hand, have little parallelism. Idle time will inevitably be incurred by task dependencies. For such executions, microscopic views with fine-grained data on task dependencies should rather be used.
We fully analyze two very different Cholesky workloads in the rest of this
paper: large (
Uniformity. Task duration is expected to depend solely on their type (\DGEMM, \DSYRK, \DTRSM or \DPOTRF) and on the type of resource (CPU or GPU) on which it is executed. Such assumption should be visually verified, highlighting all tasks whose duration is abnormally large compared to the others of the same type/resource. We treat these tasks as independent outliers,
unrelated to other tasks behavior. If not so, it may mean that the whole platform has been perturbed at particular moments or that some resource differs from the others. A task is anomalous if its duration exceeds the sampled third quartile plus 1.5 times the sampled interquartile range. Although this outlier notion is highly debatable and context-specific, other definitions could be easily incorporated.
Dependency problems. Large input matrices generate many tasks, especially when the application starts. We, therefore, want to monitor the number of ready and submitted tasks. For this Cholesky implementation, all tasks are expected to be submitted when the application starts. On scale, the number of task dependencies is extremely large. Automatically selecting which ones to display is haphazard. If a detailed view becomes necessary, we should switch to views described in Section\ref{sec.small_matrices}.
Progress. The task graph resulting from dense linear algebra always share a common structure (for instance, see Figure\ref{fig:dag5}). In a classical semi-sequential execution, the DAG would be executed much similarly to a breadth-first search. However, it is also possible to carry out a depth-first traversal, favoring task execution on the critical path. Following the pipelining of the sets of tasks submitted by each outer loop iteration can be sufficient to get an overview of how the scheduler is handling the DAG and if it corresponds to the analyst’s intuition or not.
Potential improvements. Dependencies are expected to be easily handled
with large workloads. The major issue is the load balancing among
CPUs and GPUs. Since one knows the average time
Since such constraints are linear it is possible to compute the
optimal makespan
Building on the expectations, we propose a five-area composite view shown in Figure\ref{fig:large_2_pinpointing} (each area is detailed in the caption).
We see in Figure\ref{fig:large_2_pinpointing}(a) that the makespan is 62725ms while the ABE is 59464ms. So, one can hope for a 5% improvement. The scheduling seems indeed inefficient since there are periods (white areas in CPUs) when no useful computation is done. These periods correspond to filtered states (for clarity) where threads try to actively fetch data. The total idle state (yellow areas: a summary in the right of the Gantt) for CPUs is about 1%, while for GPUs it ranges from 2 to 6%. This GPU inactivity is likely the main source of potential improvement. From (c), it is clear that this idle time does not come from a sudden lack of ready tasks. Figure\ref{fig:large_2_pinpointing}(d) clearly indicates that all tasks have been submitted in the beginning and that task execution started immediately after, without waiting for fully unrolling the DAG. As suggested in (b), DAG traversal is rather depth-first. Many outer loop iterations are parallel (the maximum is 30 around 40s), explaining why there are always a sufficient number of ready tasks.
Such GPUs starvation is more likely explained either by data prefetching problem (some tasks are ready but their input data is not yet transferred to GPUs) or possibly by some priority issue (the priorities, used by the scheduler to choose which task to run first when several of them are ready, might be inadequate). The first reason is likely to be the right one here. Indeed, most large idle periods on GPUs and large periods of times where CPUs are not doing useful computations (in white) also coincide with abnormal \DGEMM tasks (in dark green) on GPUs. Note that the scripting feature allow us to show only tasks of a given type, resource, within a specific time interval, and whose duration is larger than a threshold and therefore clear any doubt the analyst may have. An investigation (such as the one of Section\ref{sec.small_matrices}) reveals that, for an unknown reason, the GPUs seem to freeze during a task execution inside the proprietary CUBLAS \DGEMM kernel, ultimately blocking tasks eagerly waiting for GPU data. Understanding why GPUs sometimes get stuck would certainly solve the issue but this clearly suggests a weakness of the chosen scheduler which assumes that tasks duration have small variability. Using other schedulers may, therefore, alleviate this.
The four plots depicted in Figure\ref{fig:large_2_pinpointing}(e) show the ideal allocation when calculating the ABE. They show how the GPUs have been overused with \DGEMM tasks and under-exploited for \DSYRK and \DTRSM tasks. It, therefore, suggests constraining the \DSYRK and \DTRSM tasks to run exclusively on GPUs.
The previous analysis leads us to vary the scheduler (DMDA, DMDAS, WS) and to force or not the \DSYRK/\DTRSM allocation on GPUs. Figure\ref{fig:large_2_comparison} provides the six-scenario comparison.
First of all, it is interesting to see how the three schedulers differ in their DAG traversal of the DAG. While the DMDA algorithm has a breadth-first traversal (very few iterations of the outer loop are active at the same time), the DMDAS has a much more depth-first traversal as it takes the priority of the critical path into account. The traversal of the Work Stealing (WS) is even more depth-first as almost all outer loop iterations are still in progress at the end of the execution. Such way of progressing through the DAG is typical of WS and somehow favors local data accesses even though the algorithm is more dependency myopic than the two other ones.
Second, when constraining the \DSYRK and \DTRSM to run solely on the GPUs (the plots on the bottom row of Figure\ref{fig:large_2_comparison}), task allocation corresponds to the ideal one. However, while such constraint allows both DMDAS and Work Stealing to obtain near optimal executions (within less than 2% of the lower bound), it helped only moderately the DMDA algorithm. Many synchronized idle phases can be observed and imputed to both dependency issues (not enough parallelism is obtained from such a strict breadth-first traversal) and particularly slow tasks (probably slowed down by simultaneous data transfers). Interestingly, very few outlier tasks appear in the DMDAS and WS executions although the latter still seems a bit sensitive to this, as inactivity periods on CPUs (white areas) still correlate with the occurrence of \DGEMM outliers (darker green) on GPUs.
Finally, we stress that such observations are no coincidence. We randomly ran similar scenarios ten times and although the numbers always slightly differ, the general behavior and conclusions are the same. We also highlight that the area bound estimations (ABE) can vary significantly between two scenarios (e.g., 60s for constrained DMDA vs. 57s for constrained WS), which can be initially surprising since these estimates only depend on the number of tasks and their per-type average execution time on the different resources. The observations can be explained by the use of sample execution time mean, which may vary a bit. From our investigation, this variation is not explained by outliers occurrence but rather biased toward one or another scheduler. We think this is the consequence of a better locality (cache usage) but more complex measurements would be needed to fully evaluate this hypothesis.
Potential improvements. The area bound (ABE) is optimistic for small workloads since it ignores task dependencies. The critical path bound (CPE) is much more relevant, especially on very small workloads such as this one. Still, knowing how tight they are is quite difficult\cite{suraj15}. Thus, comparing to the ideal CPU/GPU allocation is meaningless and we should focus mainly on filtering task dependencies.
Idle time everywhere. Having a lot of idle time is expected because of dependencies. It is thus imperative to identify crucial tasks, highlighting the “dynamic critical path”, \ie the last tasks upon which they depended on. Such important tasks may appear either because of the DAG structure or because they have been particularly delayed.
Since StarPU also exports the DAG, the dependency information can be
merged into the trace. For Cholesky,
the \DPOTRF are critical tasks that release many other tasks. It
seems, therefore, relevant to track their dependencies. For a given task
This is illustrated in Figure\ref{fig:small_2_pinpointing}. The makespan is 730ms while the ABE is 434ms and the CPE is 368ms. The bounds may be loose but it seems that there is room for improvement. If we start from the end of the schedule and go backward in time, we can see a dependency path (in blue) that, until timestamp 400ms, fully respects the alternation \DPOTRF–\DSYRK–\DTRSM. At the very end, all tasks execute right one after the other, which is optimal. The first “mistake” appears in time 600ms where the \DSYRK could have been executed a little earlier. Slightly before, some \DTRSM are not executed right after their \DPOTRF maybe because of data transfer or more likely because of a wrong priority. This critical path does not merge with the one obtained for the \DPOTRF of the first iterations. Now, when looking at the other (red) dependency path, we can see many times that the tasks are scheduled as soon as possible as if there was some priority problem, which could possibly be solved with another scheduler. The scripting feature allows plotting only dependencies whose duration is larger than a given threshold, avoiding graphical clutter.
We identify another problem with the blue dependency path. At the end, tasks are executed on the appropriate resource (\DPOTRF on CPUs, and \DSYRK and \DTRSM on GPUs). However, slightly before time 600ms, critical \DSYRK tasks start running on the CPUs, slowing the progression. Likewise, slightly before time 400ms, critical \DTRSM tasks are run on CPUs whereas they are known to be very slow on such resources. It seems that this scheduler makes a bad decision and that constraining \DTRSM and \DSYRK to be executed on GPUs may reduce the total makespan.
Although the potential solutions (fix priority, constraining some tasks to GPUs) suggested by this analysis are the same as in the previous use case, the underlying reasons are fundamentally different.
Based on the previous analysis, we vary again the three schedulers and forcing \DSYRK/\DTRSM tasks on GPUs. Figure\ref{fig:small_2_comparison} compares the six resulting combinations for this workload.
By taking a closer look at the unconstrained top row of Figure\ref{fig:small_2_comparison}, we can observe that the behavior demonstrated by the DMDA and DMDAS schedulers are not so different. They both have similar runtime, two unmerged critical paths on which priority and critical task allocation problems can be identified. WS also demonstrates a very bad allocation, which is not surprising because it does not take into account the heterogeneity of the platform. There are three dynamic critical paths in the WS scheduler, with many \DTRSM and \DSYRK running on CPUs. When constraining these two task types to execute only on GPUs (the bottom row of Figure\ref{fig:small_2_comparison}), we observe that such restriction does not really help for the DMDA and DMDAS schedulers. Tasks on the critical path are no longer an issue, but both schedulers still have priority problems. The behavior demonstrated by DMDA seems easier to understand: we see some typical list scheduling behavior with critical \DPOTRF being delayed because CPUs are used for not so critical \DGEMMs. If one could run these tasks earlier, it appears that the whole makespan would be greatly improved.
Surprisingly, Work Stealing strongly benefits from the imposed restriction and now favorably compares against DMDA and DMDAS. It is also interesting to note that WS manages to keep all CPUs busy from the very beginning, unlike the other two schedulers. However, GPUs are not fully exploited, in particular at the end where they should be used to accelerate the \DGEMMs like the DMDA and DMDAS strategies do. If there was a way to prevent \DGEMM task execution on CPU after time 350ms, we would probably get the best of the two scheduling strategies and be much closer to the optimal execution time.
\label{sec.conclusion}
This article presents how an agile scripting framework allows creating faithful and enlightening trace views for the performance analysis of task-based HPC applications running on heterogeneous platforms. The framework follows the UNIX philosophy and builds upon many small existing tools (pjdump, ggplot2, plotly) glued together with the R programming language. Putting interaction at the very end together with scripting capabilities in the core of the analysis process is the key to a flexible design. We have shown how applying this framework in the analysis of large and small scale scenarios based on the MORSE/StarPU implementation of the Cholesky factorization allows to identify and address several non-trivial scheduling problems. The approach has scalability limitations mostly in the graphical representations (static or interactive) and not the scripting methodology for creating them. Plotly, for instance, is incapable of handling too much data, such as when there are many task types or resources.
As future work, we intend to improve the interactive graphical views by using temporal aggregation to reduce trace size and improve the integration between performance visualization and runtime debugging. We also intend to extend this approach to analyze the behavior of irregular task-based workloads\cite{agullo2013multifrontal}.
This work was supported by the projects: CAPES/Cofecub 764-13, FAPERGS/Inria ExaSE, CNPq 447311/2014-0, HPC4E 689772 from EU/H2020 and MCTI/RNP-Brazil, and the CNRS/LICIA Intl. Lab. We also thank Emmanuel Agullo, Lionel Eyraud-Dubois, and Suraj Kumar for their help in better understanding the traces and the visualization needs of the runtime scheduler developers.