\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
% The preceding line is only needed to identify funding in the first footnote. If that is unneeded, please comment it out.
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
%\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{algorithmicx}
%\usepackage[colorlinks=true, allcolors=blue]{hyperref}
\usepackage{algpseudocode}
\usepackage{algorithm}
\usepackage{xspace}
\usepackage{hyperref}
\usepackage{numprint}
\usepackage{todonotes}
\usepackage{booktabs}
\usepackage{balance}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\include{macros}
\newcommand{\algo}[1]{\textsc{#1}}
\newcommand{\bottomlevel}[1]{\underline{l}_{#1}} % underline short italic
\newcommand{\criticalpath}{\mathcal{P}}
\newcommand{\parents}[1]{\,\Pi_{#1}}
\newcommand{\children}[1]{\,child_{#1}}
\newcommand{\cluster}{\,\mathcal{S}}
\newcommand{\heft}{\algo{HEFT}\xspace}
\newcommand{\heftmm}{\algo{HEFTM-MM}\xspace}
\newcommand{\heftbl}{\algo{HEFTM-BL}\xspace}
\newcommand{\heftblc}{\algo{HEFTM-BLC}\xspace}
\newcommand{\MM}{M}
\newcommand{\MC}{MC}
\newcommand{\rt}{rt}
\newcommand{\curM}{curM}
\newcommand{\curC}{curC}
\newcommand{\PD}{PD}
%\newcommand{\new}[1]{{\color{blue}#1}}
\newcommand{\new}[1]{{#1}}
\newcommand{\skug}[1]{{\color{blue}[SK: #1]}}
\newcommand{\hmey}[1]{{\color{red}[HM: #1]}}
\newcommand{\AB}[1]{\todo[inline]{{\color{purple}[AB: #1]}}}
\renewcommand{\iec}{i.e., }
\pagestyle{plain}
\begin{document}
\title{Memory-aware Adaptive Scheduling of Scientific Workflows on Heterogeneous Architectures\\
%{\footnotesize \textsuperscript{*}Note: Sub-titles are not captured in Xplore and
%should not be used}
% \thanks{Identify applicable funding agency here. If none, delete this.}
}
\author{\IEEEauthorblockN{Svetlana Kulagina}
\IEEEauthorblockA{\textit{Humboldt-Universität zu Berlin, Germany} \\
ORCID: 0000-0002-2108-9425}
\and
\IEEEauthorblockN{Anne Benoit}
\IEEEauthorblockA{\textit{ENS Lyon \& IUF, France} \\
ORCID: 0000-0003-2910-3540
}
%\textit{name of organization (of Aff.)}\\
%City, Country \\
%email address or ORCID
\and
\IEEEauthorblockN{Henning Meyerhenke}
\IEEEauthorblockA{\textit{Humboldt-Universität zu Berlin, Germany} \\
ORCID: 0000-0002-7769-726X}
}
\maketitle
\IEEEpubid{\begin{minipage}{\textwidth}\ \\[12pt]
© 2025 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
The published conference version of this paper appears in the Proceedings of 25th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), IEEE Computer Society.
\end{minipage}}
\begin{abstract}
The analysis of massive scientific data often happens in the form of workflows with
interdependent tasks. When such a scientific workflow needs to be scheduled on a parallel or distributed
system, one usually represents the workflow as a directed acyclic graph (DAG).
The vertices of the DAG represent the tasks, while
its edges model the dependencies between the tasks (usually data to be communicated to
successor tasks). When executed, each task requires a certain amount of memory, and if that
exceeds the available memory, the execution fails.
The typical goal is to execute the workflow without failures (i.e., satisfying the memory
constraints) and with the shortest possible execution time (i.e., to minimize its makespan).
To address this problem, we investigate the memory-aware scheduling of DAG-shaped workflows on
heterogeneous platforms, where each processor can have a different speed and a different memory size.
We propose a variant of HEFT (Heterogeneous Earliest Finish Time) that, in contrast to the original, accounts for memory and
includes eviction strategies for cases when it might be beneficial to remove some data from memory
in order to have enough memory to execute other tasks.
%
Furthermore, while HEFT assumes perfect knowledge of the execution time and memory usage
of each task, the actual values might differ upon execution. Thus, we propose an adaptive
scheduling strategy, where a schedule is recomputed when there has been a significant variation in terms
of execution time or memory.
%
The scheduler has been closely integrated with a runtime system, allowing us to perform a thorough
experimental evaluation on real-world workflows. The runtime system warns the scheduler when
the task parameters have changed, and a schedule can be recomputed on the fly. The memory-aware
strategy allows us to schedule task graphs that would run out of memory with a state-of-the-art
scheduler, and the adaptive setting allows us to significantly reduce the makespan.
% when tentatively assigning tasks, in order to
% Its first step is to compute the weights of the tasks.
% We suggest three variants: bottom levels as weights, bottom levels with impact of incoming edge weight,
% and weights along the optimal memory traversal.
% In the second step, we try assigning each task to each processors and execute the assignment that
% is feasible with regard to memory size and gives the earliest finishing time to the task.
% Sometimes, data corresponding to edge weights that is stored in the memory needs to be evicted in order to
% assign a task to a processor.
% We suggest two eviction strategies - largest files first and smallest files first.
% Our experimental evaluation on real-world workflows and simulated (\skug{generated? They are generated from real-world wfs})
% ones with real task and edge weights with up to 30,000 tasks shows that
% respecting memory constraints only costs $11\%$ of runtime in comparison to a non memory-aware baseline.
% Calculating task weights with the impact of memory gives a $x\%$ better makespans on a normal and $y\%$ better makespans
% on small one.
% Calculating task weights along the optimal memory traversal gives on average $z\%$ worse makespans, but improves
% average memory utlization by $t\%$.
\end{abstract}
\begin{IEEEkeywords}
DAG, Heterogeneous platform, Adaptive \new{workflow} scheduling, Memory constraint.
\end{IEEEkeywords}
\section{Introduction} %: \skug{Full: 0, Polished: 0}}
\IEEEpubidadjcol
%%% CONTEXT %%%
The analysis of massive datasets, originating from fields such as genomics,
remote sensing, or biomedical imaging -- to name just a few -- has become ubiquitous in science;
this often takes the form of workflows, \iec separate software components chained together
in some kind of complex pipeline~\cite{DBLP:journals/dbsk/LeserHDEGHKKKKK21}.
These workflows are usually represented as directed acyclic graphs (DAGs).
The DAG vertices represent the software components (or, more generally, the workflow \emph{tasks}),
while the edges model I/O dependencies between the tasks~\cite{adhikari2019survey,liu2018survey}.
Large workflows with resource-intensive tasks can easily exceed the capabilities of a
single computer and are therefore executed on a parallel or distributed platform.
An efficient execution of the workflows on such platforms requires the mapping of tasks
to specific processors.
\new{Moreover, a task schedule -- \iec a valid execution order that respects the dependencies --
and possibly also starting times for the tasks are needed to increase utilization by reusing
processors that have become idle after finishing a task.}
\IEEEpubidadjcol
%%% MOTIVATION %%%
Modern parallel and in particular distributed computing platforms are often heterogeneous,
meaning they feature varying CPU speeds and memory sizes.
In general, having a different memory size per CPU makes it more challenging for an algorithm
to compute a schedule that respects all memory constraints -- meaning that no task is executed on a
processor with less memory than the task needs. Respecting the memory constraints is, however,
very important to avoid possibly expensive runtime failures and to provide a satisfactory user experience.
Hence, building on previous related %\hmey{You may want to add other refs not from us}
work~\cite{gou2020partitioning,He21,DBLP:conf/icpp/KulaginaMB24}, we consider a scheduling problem
formulation that takes memory sizes as explicit constraints into account.
Its objective is the very common \emph{makespan}~\cite{liu2018survey},
which acts as proxy for the total execution time of a workflow.
However, to the best of our knowledge, the only existing heuristics that account for
memory constraints partition the DAG and do not reuse processors once they have processed their part
of the graph. This approach leads to high makespans compared to a more fine-grained solution
with processor reuse.
While previous work with memory constraints has focused on partitioning the DAG and not on
reusing processors during execution, a seminal list scheduling heuristic for workflows on
heterogeneous platforms, without accounting for the memory constraint, is HEFT
(heterogeneous earliest finish time)~\cite{topcuoglu2002performance}.
It has two phases: (i) each task is assigned a priority and (ii) the tasks in a priority-ordered list are assigned
to processors, where the ``ready'' task with the highest priority is scheduled next on the processor
where it would complete its execution first.
HEFT has been extended (e.g., by Shi and Dongarra~\cite{SHI2006665}) and adjusted
for a variety of different scheduling problem formulations.
Yet, none of them adhere to memory constraints as addressed in this paper --
see the discussion of related work in Section~\ref{sec:related-work}.
% (\skug{check in related work if true!}).
% \skug{Note: 2 papers that deal with memory sizes, but model is very different!}
%
Another limitation of HEFT (and many other scheduling strategies) in practice is their
assumption that the task running times provided to them are accurate. In practice, this is
not the case and deviations from user estimates or historical measurements are
very common~\cite{hirales2012multiple}. As a consequence, it is advisable to adapt the schedule when \emph{major}
deviations occur. However, the original list-based schedulers, such as HEFT, are designed
for a static setting with accurate task parameters.
%
% List-based schedulers such as HEFT are, however, not designed for
% such an adaptation~\cite{TODO}.\hmey{Svetlana, Anne: is this a fair statement? Please add ref (if any)}
% \AB{Actually, we adapt HEFT as well, I'll reformulate...}
%% and would compute a completely new schedule from scratch.
%%% CONTRIBUTION %%%
%\paragraph*{Contribution}
The main contributions of this paper are both algorithmic and experimental:
\begin{itemize}
\item We formalize the problem with memory constraints, where communication buffers
are used to evict data from memory if it will be later used by another processor.
\item We design three HEFT-based heuristics
that adhere to memory size constraints: \heftbl, \heftblc, and \heftmm.
The M appended to HEFT stands for \underline{m}emory, BL for \underline{b}ottom \underline{l}evel,
BLC for \underline{b}ottom \underline{l}evel with \underline{c}ommunication,
and MM for \underline{m}inimum \underline{m}emory traversal.
The difference between the new heuristics is the way they prioritize tasks for processor assignment.
\item We implement a runtime system able to provide some feedback to the scheduler
when task requirements (in terms of execution time and/or memory) differ from the initial predictions,
and we recompute a schedule, based on the reported deviations.
\item We perform extensive simulations, first in the static case by comparing the schedules produced
by these heuristics with the classical HEFT as baseline (the latter does not take memory sizes into account);
while HEFT returns invalid schedules that exceed the processor memories and cannot execute correctly,
the new heuristics are able to successfully schedule large workflows and do so with reasonable makespans.
\item In the dynamic setting, we use a runtime system that allows us to simulate workflow executions.
The runtime system introduces deviations in running times and task memory requirements and communicates
them to the scheduler; the scheduler can then recompute a schedule. Without these recomputations,
most schedules become invalid after deviations, since the memory constraint is exceeded
for most workflows, which demonstrates the necessity of a dynamic adjustment of the schedule.
% \hmey{Need to clarify: why is this SotA?}
% \begin{itemize}
% \item Static: we find that our heuristics are able to schedule all workflows correctly, and produce makespans similar to the baseline.
% \item Adaptive: runtime system built, simulates workflow executions and deviations in running times and mem requirements of tasks
% \item Answering requests of the runtime system for adaptation, the scheduler computes an improved schedule based on the reported deviations.
\end{itemize}
We first review related work in Section~\ref{sec:related-work}. Then, we formalize the model in Section~\ref{sec:model}
and present our algorithms in Section~\ref{sec:heuristics}. The adaptation of the heuristics in a dynamic setting is discussed in Section~\ref{sec:dyn}, and the results of our experiments are presented in Section~\ref{sec:expe}. Finally, we conclude
and provide future working directions in Section~\ref{sec:conc}.
\section{Related work} %: \skug{Full: 5, Polished: 4}}
\label{sec:related-work}
%First, we focus on HEFT-like scheduling heuristics from the literature that do not necessarily
%consider memory constraints. Then, we discuss memory-aware scheduling algorithms.
%Finally, we move to related work on dynamic or adaptive algorithms.
% We discuss relevant scheduling approaches that reuse processors or respect the memory requirements of the processors.
%
% \subsection{Early list schedulers with unlimited processors}
% An entire cluster of works on list schedulers has been carried out as early as the 90s.
% They all assume a DAG-shaped workflow with makespan weights on tasks, and an unlimited amount of homogeneous processors
% with the speed of 1~\cite{benoit2013survey}.\hmey{needs refs or at least a survey pointer}
% \skug{actually, Anne cited them, so citing her :-)}
%
% The \textit{task duplication}-based approaches exploit that sometimes running a task twice on different machines can
% help reduce the makespan by saving communication costs.
% The two categories are scheduling with partial duplication~(SPD), and with full duplication~(SFD).
% For a join task (a task whose incoming degree is larger than its outgoing degree), SPD finds a critical immediate
% parent (the one that gives the largest start time to the join task) and duplicates only it.
% SFD duplicates all parents of a join node.
% The algorithm by~\cite{dfrn1997} duplicates first (creates copies of all parent tasks) and then reduces (removes) the ones that can be removed without harming the makespan.
% The critical path fast duplication algorithm CPFD~\cite{5727760} classifies tasks into three categories: critical
% path task, in-branch task, or out-branch task.
% It schedules critical path tasks first, then in-branch tasks.
% \hmey{I'm missing the connection to our paper. Or vice versa, how is our contribution
% connected to these works? (Einordnung in den Forschungskontext) For example, do we do similar things? Are there interesting limitations we overcome?}
%
% Linear clustering~\cite{KWOK1999381} acts on critical paths in the workflow.
% It assigns the current critical path to one processor, removes all these tasks from the workflow, recomputes the critical
% path and repeats the procedure.
% Heaviest node first~\cite{SHIRAZI1990222} assigns the tasks level by level;
% in each level, it schedules the one with largest computation time first.
%\noindent{\bf
\subsection{HEFT-based algorithms}
%\subsection{Static list schedulers, especially HEFT-based algorithms}
%\label{sub:static-list-schedulers}
%
Introduced in 2002, HEFT~\cite{topcuoglu2002performance} is a list-based heuristic, consisting
%HEFT and its successors consist
of two phases: task prio\-ri\-tization/ordering and task assignment.
In the first phase, such algorithms compute a priority for each task (typically based on its bottom level)
to create the list; in the second phase, they schedule the tasks in the order of these priorities.
The modifications of HEFT revolve around the way the priorities of the tasks are computed and the logic of the processor assignment.
All such algorithms assume a heterogeneous execution environment.
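For reference, the two phases can be sketched as in Algorithm~\ref{alg:heft-sketch}
(a standard formulation in our notation, not a verbatim restatement of~\cite{topcuoglu2002performance}).
\begin{algorithm}[t]
\caption{Sketch of the two phases of \heft}
\label{alg:heft-sketch}
\begin{algorithmic}[1]
\State compute the upward rank (bottom level) $rank_u(t)$ of every task $t$ \Comment{phase (i)}
\State sort the tasks by non-increasing $rank_u$ into a list $L$
\For{each task $t$ in $L$} \Comment{phase (ii)}
    \For{each processor $p$}
        \State compute $EFT(t,p)$, the earliest finish time of $t$ on $p$
    \EndFor
    \State assign $t$ to a processor $p$ minimizing $EFT(t,p)$
\EndFor
\end{algorithmic}
\end{algorithm}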
%During the task prioritization phase in~\cite{sulaiman2021hybrid}, the standard deviation of the computation cost % (between processors)
%is computed and added to the mean value to account for differences between processor speeds.
%% TODO: Why ``between processors''? And what exactly is ``computation cost'' in the first place?
%In the processor choice phase, the entry task and the longest parent tasks are duplicated
%during idle times on the processor.
%% TODO: Context of previous sentence unclear. Why is this done? What is the difference to others / our work in this respect?
% Ref.~\cite{alebrahim2017task} computes the bottom level based on the difference of execution times on
% the fastest and the slowest processors, divided by the speed ratio of these two processors.
% When doing processor selection, the authors differentiate between the lowest execution time and earliest finishing time.
% They choose the processor with the lowest execution time and cross over to other processors sometimes.
% They build upon~\cite{shetti2013optimization}.\hmey{Last sentence ``hangs in the air''. Either drop or connect it properly.}
Some variants have been designed with various ways of ordering tasks, for instance based
on an optimistic cost table in
PEFT (Predict earliest finish time)~\cite{arabnejad2014list}, or by combining the standard deviation
with the communication cost weight on the tasks in HSIP (Heterogeneous Scheduling with Improved task Priorities)~\cite{wang2016hsip}.
% is a HEFT variant that computes an Optimistic
%Cost Table (OCT).
%The OCT is computed per task-processor pair and stores the longest shortest path from this task to the target task if this
%processor is chosen for this task.
%% TODO: is ``target task'' defined/clear?
%Ranking is based on OCT values.
%The processor choice stage minimizes the optimistic EFT, which is EFT plus the longest path to the exit node for each task. % TODO: is ``exit node'' == ``target task''?
%
%The HSIP (Heterogeneous Scheduling with Improved task Priorities)~\cite{wang2016hsip} has an improved first step in
%comparison to HEFT.
%It combines the standard deviation with the communication cost weight on the tasks.
%In the second stage, the algorithm duplicates the entry task if there is a need for it.
%% TODO: Sounds like Ref. [32] above. What is the difference? (Why) Is it an improvement? Do we need to describe both?
The TSHCS (Task Scheduling for Heterogeneous Computing Systems) algorithm~\cite{alebrahim2017task}
improves on HEFT by adding randomized decisions to the second phase.
%The decision is whether the task be assigned to the processor with the lowest execution time or to the processor that
%produces the lowest finish time.
The SDC algorithm~\cite{SHI2006665} considers the percentage of feasible processors, in addition to a task's
average execution cost, when computing its weight.
%The selected task is then assigned to a processor which minimizes its Adjusted Earliest Finish Time (AEFT),
%which additionally notes how large the communication between the current node and its children will be on
%average when scheduled on the current processor.
HEFT has also been adapted to cloud-oriented environments~\cite{samadi2018eheft} and even combined with reinforcement learning techniques~\cite{yano2022cqga}, but to the best of our knowledge,
none of these variants consider memory constraints.
%\medskip
\subsection{
%\noindent{\bf
Memory-aware scheduling algorithms}
%\label{sub:mem-aware-algs}
%
%Only memory-aware scheduling algorithms are designed to respect memory constraints.
%%Respecting processor memories adds a constraint to a scheduling problem.
%%Therefore, only specifically memory-targeted algorithms address this issue.
%Moreover,
The way processor memories are represented in the model has a decisive impact on the way the constraint
is formulated and addressed in the algorithm.
%
Different models of memory available on processors and memory requirements of tasks have been presented.
Marchal~et~al.~\cite{marchal2018parallel} assume a memory model where each processor has an individual memory available.
\new{Workflow tasks themselves have no memory requirements for their computations,
but they have input and output files that need to be stored in the memory.}
A polynomial-time algorithm for computing the peak memory needed for a parallel execution of such a workflow DAG is provided,
as well as an integer linear programming (ILP) solution to the scheduling problem.
\new{In this model, when a task starts, all its input files are deleted from memory
and all its output files are added.}
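With this semantics, if $w_e$ denotes the size of the file on edge $e$, $R$ the set of files resident
in memory just before task $t$ starts, and $M$ the memory size of the processor, the memory constraint
during the execution of $t$ can be written (in our notation, not that of~\cite{marchal2018parallel}) as
\[
\sum_{e \in (R \setminus in(t)) \cup out(t)} w_e \le M,
\]
where $in(t)$ and $out(t)$ are the sets of incoming and outgoing edges of $t$.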
Other models consider for instance a
%In an assumed
dual-memory system~\cite{herrmann2014memory} where a processor can have access
to two different kinds of memory, and each task can be executed on only one sort of memory.
%Communication happens only between these two kinds of processors (communication within
%each group is ignored).
The authors then present an ILP-based solution for this problem formulation.
%
%The algorithm presented by Yao et al.
Yao et al.~\cite{yao2022memory} consider that each processor has its own internal memory, and all
processors share a common external one. The internal (local) memory is used to store the task files.
The external memory is used to store evicted files to make room for the execution of a task on a processor.
%All processors, including the original one, can access these files.
%Each edge %in~\cite{yao2022memory}
%has two weights -- the size of the files transferred along it,
%and the time of communication along this edge.
%The tasks themselves have no memory requirements, but need to hold all their incoming and outgoing files.
%
Ding \etal~\cite{ding2024ils}, in turn, consider connected processors with individual limited memories
forming a global memory, with different access times to memory and no weights on edges. An ILP
is proposed in this setting.
%The collective set of memories forms the global memory to which each processor has access;
%however, the access time to global memory is different.
%Each memory access in the graph is modeled as a memory access token on the task, while the edges have no weights.
%The solved problem is how to allocate the initial input data in processor memories so that the overall
%execution is minimized and the memories are not exceeded.
%To this end, the authors propose an ILP model.
%%that minimizes the length of the critical path, including a greedy initial solution.
%
%In~\cite{rodriguez2019exploration}, the authors assume memory requirements on tasks represented as tiles.
%Each processor has individual memories to process the task, but only the shared memories store the tiles containing
%memory tiles occupied by memory tiles.
%
There are also some cloud-oriented models that include costs associated with memory usage~\cite{liang2020memory}.
Overall, there are a variety of memory models, but, to the best of our knowledge, the only study on a multiprocessor
platform that is fully heterogeneous, with individual memories, \new{is our own previous work~\cite{DBLP:conf/icpp/KulaginaMB24}.}
\new{Yet, in that paper, we only propose} a partitioning-based mapping of the workflow, without processor reuse.
\new{Since a processor is idle after a workflow partition has finished its execution
in~\cite{DBLP:conf/icpp/KulaginaMB24},
there is no need for communication buffers to store data
that should be \new{exchanged} between processors when tasks are ready to execute.}
\subsection{%
%\bigskip
%\noindent{\bf
Dynamic/adaptive algorithms}
We finally review related work in a dynamic setting. DVR HEFT~\cite{SANDOKJI2019482} considers
no variation in task parameters, but rather new tasks that arrive in the system.
The authors use an almost unchanged HEFT algorithm in the static step, running three slightly
different task-weighting variants and choosing the one that gives the best overall makespan.
% In the dynamic phase, they receive new tasks and schedule them on either idle processors or
% those processors that give them
% the earliest finish time.
% %Task failures are not covered.
%
% Rahman~\etal
The dynamic critical path (DCP) algorithm for grids maps tasks to machines
by calculating the critical path in the graph dynamically at every step~\cite{rahman2013}.
%For all tasks they compute the earliest start time and absolute latest start time that are upper and lower bounds
%on the start time of a task (differing by the slack this task has).
%All tasks on this critical path have the same earliest and latest start times, because they cannot be delayed.
The authors schedule the first task on the critical path to the best suitable processor and recompute the critical path.
%The algorithm takes the first unscheduled task on the critical path each time and maps it on a processor identified for it.
%If processors are heterogeneous, then the start times are computed with respect for the processor, and the minimum
%execution time for the task is chosen.
The heuristic also schedules parent and child tasks on the same processor, so as to avoid data transfers between processors.
% The approach is evaluated on random workflows of the size up to 300 tasks.
Garg~\etal~\cite{GARG2015256} propose a dynamic scheduling algorithm for heterogeneous grids based on rescheduling.
The procedure involves building a first (static) schedule with HEFT, periodic resource monitoring,
and rescheduling the remaining tasks.
% The resource model contains resource groups (small tightly-connected sub-clusters), connected between each other.
% For each resource group, there is an own scheduler, and an overall global scheduler responsible for distributing
% tasks to groups.
% The static heuristic is HEFT with earliest start time as priority.
Upon rescheduling, a new mapping is computed from scratch,
and this mapping is accepted if the resulting makespan
is smaller than the previous one.
\new{Only small workflows with fewer than 100 tasks are considered, hence these repeated
computations from scratch remain possible in a reasonable time.}
% The repeated recomputation from scratch is feasible, because the largest workflow size tested is 100 task big.}
% \todo[inline]{Maybe mention: from scratch computation because of tiny workflows??}
%% The experiments were conducted on a single workflow with 10 tasks.
%
% The authors define the execution time, estimated start time, data ready time,a dn estimated finish time per task.
%The runtimes of tasks depend on processor speeds, are calculated in advance and stored in tables.
%The algorithm first computes bottom levels for all tasks (execution time is average of all possible execution times).
%THe bottom level represents the priority of the task, and tasks are sorted according to these priorities.
%They then go through tasks and map than to such processors that minimize the earliest start times of this task's
%successors.
%To do this, the authors calculate the earliest finishing time of the task across all ressources, along with the
%average communication and computation costs fir the dependent tasks.
%
%The rescheduling is triggered when either a load on a resource increases over a threshold, or if a new resource
%is added.
%
%
% Most dynamic or adaptive algorithms are formulated for clouds, where the execution environment is not fixed,
% but constrained by cost.
%
% In a cloud framework, Wang et al.~\cite{wang2019dynamic} propose a dynamic particle swarm optimization algorithm.
% % to schedule workflows in a cloud.
%% Particles are possible solution in the solution space.
% However, the dynamic is only in the choice of generation sizes, not in the changes in the execution environment.
% Similarly, Singh et al.~\cite{singh2018novel} addresses dynamic provisioning of resources with a constraint deadline.
De Oliveira~\etal~\cite{de2012provenance} propose a tri-criteria (makespan, reliability, cost) \new{cost model and an adaptive
scheduling algorithm for clouds (scheduling virtual machines).
The greedy algorithm based on this cost model works in four steps: choosing the best resource types to execute the virtual
machine on, producing new cloud activities, setting up the granularity factor, and finally adapting the amount of resources
to fit the budget.}
The authors test four scenarios -- one preferring each criterion in the cost model, and a balanced one.
They use workflows with fewer than ten tasks, but repeat them so that the execution comprises up to 200 tasks.
%They do not report the runtime of the scheduling algorithm, only the speedup and cost saving it produces.
% The authors use provenance data to make scheduling decisions.
Daniels \etal~\cite{daniels1995robust} formalize the concept of robust scheduling with variable processing times
on a single machine.
The changes in task running times are not due to changing machine properties, but are rather task-related
and thus unrelated to each other.
The authors search for an optimal schedule
% formulate a decision space of all permutations of $n$ jobs, and the optimal schedule
in relation to a performance measure. % $\phi$.
Then, they proceed to formulate the Absolute Deviation Robust Scheduling Problem \new{with} a set of linear constraints.
While several related works consider building a new schedule once some variation has been observed,
we are not aware of work implementing a real runtime system that interacts with the scheduler
and has been tested on workflows with thousands of tasks, as we propose in this paper. Furthermore,
we are not aware of any previous work discussing dynamic algorithms combined with memory constraints.
% \subsection{Other notable works}
%
% \cite{palis1996task} present a clustering-based scheduling algorithm for a parallel execution and prove its quality.
% They utilize task duplication when creating the clusters (grains).
% Their scheduler then maps clusters to processors.
% They assume unlimited processors with the speed of 1.
% For each task, they compute the earliest starting time and find a cluster, where this tasks's start time is as close
% to it as possible.
% %The cluster growing algorithm adds one task to the cluster at a time, by adding tasks in nondecreasing order of
% %release times.
%
% GRASP (generally randomized adaptive search procedure)~\cite{feo1989probabilistic} conducts a number of iterations
% to search for an optimal solution for mapping tasks on machines.
% A solution is generated at each step, and the best solution is kept at the end.
% The search terminates when a certain termination criterion is reached.
% It generates better results than other algorithms, because it explores the whole solution space.
%
% Avanes~\etal\cite{avanes2008adaptive} present a heuristic for networks in disaster scenarios.
% These networks are a set of DAG-shaped scenarios, out of which one needs to be executed.
% The scenario contains AND- and OR-branches, where AND-branches indicate activities that need to be executed in parallel.
% The heuristic first determines similar activities and groups them together.
% Then they allocate these groups to disaster responders and tasks within this group according to a constraint system.
% The dynamic part deals with changes and distinguishes between retriable and compensation activities.
% The heuristic calculates a new execution path with these tasks.
%
%
% \cite{lutke2024hetsim} is a scheduling simulator that models heterogeneous software with memory and accelerator
% (processor) speed heterogeneity.
% Each accelerator has its own memory that can be zero.
% Each accelerator's characteristics depend on the task it runs and are not fixed.
%
%
%
%
% \cite{meng2018traffic} investigate scheduling on multi-core chips.
% Their model is far from ours.
%
%
% An online scheduling algorithm~\cite{Witt2018POS} assumes a DAG-structured workflow and learns task characteristics.
% They prioritize tasks that have failed before or are well-predictable.
%
%
\section{Model} %: \skug{Full: 4, polished: 3}}
\label{sec:model}
%
The applications we target, large scientific workflows for which we do not have exact a priori knowledge,
are described in Section~\ref{sec.mod.work}. Then, the type of heterogeneous system on which
the applications are to be executed is presented in Section~\ref{sec.mod.plat}. The optimization problem
is defined in Section~\ref{sec.mod.pb}, and the key notation is summarized in Table~\ref{tabnotation}.
\subsection{Applications: Large scientific workflows}
\label{sec.mod.work}
%
Following common practice, we represent a workflow by a DAG (Directed Acyclic Graph)
% The target applications, corresponding to large scientific workflows, are represented with a DAG (Directed Acyclic Graph)
$G=(V,E)$. The set of vertices~$V$ corresponds to tasks, while edges express the
precedence constraints. Hence, a directed edge $e=(u,v)\in E$ means that task $u\in V$ must be executed
before task~$v\in V$. A cost~$c_{u,v}$ is associated with each edge, representing the size of the
output of task~$u$, to be used by task~$v$.
Furthermore, $w_u$ is the number of operations performed by task~$u\in V$,
and $m_u$ is the amount of memory required by task~$u$ to be executed.
% A workflow is modeled as a directed acyclic graph $G=(V, E)$, where $V$ is the set of vertices (tasks), and
% $E$ is a set of directed edges of the form $e=(u,v)$, with $u,v\in V$, expressing precedence constraints between tasks.
% Each task~$u \in V$ is performing $w_u$ operations, and it also
% requires some amount of memory to be executed, denoted as~$m_u$.
% For an edge $(u,v) \in E$, the weight~$c_{u,v}$ corresponds to the size of the output file
% written by task~$u$ and used as input by task~$v$.
We denote by $\parents{u}$ the tasks preceding task~$u\in V$, also called {\em parents},
which must be completed before $u$ can be started:
% The parents of a task~$u\in V$ are the directly preceding tasks that must be completed before $u$ can be started, i.e., the set of parents is
$ \parents{u} = \{v \in V: (v,u) \in E\}$. A {\em source} task is a task without parents. % is called a {\it source task}.
Similarly, the children of task $u\in V$ are % The children tasks of~$u$ are the tasks following~$u$ directly according to the precedence constraints, i.e.,
$ \children{u} = \{v \in V: (u,v) \in E\}$, and a {\em target} task has no children.
%A task without children is called a {\it target task}.
Each task may have multiple parents and children.
Note that $m_u$ is the total memory usage
of a task during its execution, including input and output files currently being read and written,
and hence the total memory requirement for executing task~$u$, denoted by~$r_u$, is the maximum of the following:
(i) the total size of the files to be received from the parents, (ii) the total size of the files
to be sent to the children, and (iii) the total memory size~$m_u$ (which often reaches the maximum):
\[
r_u = \max\left\{m_u , \sum_{v:(v,u)\in E}c_{v,u}, \sum_{v:(u,v)\in E} c_{u,v}\right\}.
\]
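To make the definition of $r_u$ concrete, the following Python sketch computes it directly from the DAG; the dictionary-based containers and names are illustrative assumptions, not part of the paper's implementation.

```python
def mem_requirement(u, m, c, parents, children):
    """Memory requirement r_u of task u: the maximum of the task's own
    memory footprint m_u, the total size of its input files, and the
    total size of its output files."""
    incoming = sum(c[(v, u)] for v in parents[u])   # files received from parents
    outgoing = sum(c[(u, w)] for w in children[u])  # files sent to children
    return max(m[u], incoming, outgoing)
```

For instance, a task with $m_u = 5$ and two input files of sizes 3 and 4 (and no output) has $r_u = 7$.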
Furthermore, we operate in a context where we do not have perfect knowledge
of the task parameters ($w_u$ and $m_u$) before the tasks start their execution,
but only estimates~\cite{rahman2013,GARG2015256}.
%. \AB{Add motivation: related work with variable task durations for instance...}
Hence, scheduling decisions are made on these estimated parameters, and
may be reconsidered at runtime when a task starts its execution and we know its exact parameters.
\subsection{Heterogeneous system}
\label{sec.mod.plat}
%
The target platform is a heterogeneous system $\cluster$ with $k$ processors
% The goal is to execute the workflow on a heterogeneous system, denoted as $\cluster$, which
% consists of $k$ processors
$p_1, \dots, p_k$.
For $1 \leq j \leq k$, each processor $p_j$ has an individual memory of size $M_j$, a communication
buffer of size $MC_j$ and a speed~$s_j$.
We can decide to evict some data from the main memory when we are sending the data
to another processor; it then stays in the communication buffer until it has been sent.
The execution time of a single task~$u\in V$ on a processor~$p_j$ is~$\frac{w_u}{s_j}$,
and all
%We assume that all
processors are connected with an identical bandwidth~$\beta$.
% \hmey{Maybe mention that variable bandwidths are part of future work...?}
We keep track of the current ready time of each processor and each communication
channel, $\rt_j$ and $\rt_{j,j'}$, for each processor $j$ and all pairs~$(j,j')$.
Initially, all the ready times are set to~$0$.
We also keep track of the currently available memory, $availM_j$ and $availC_j$,
on the processor memory and communication buffer, respectively.
Furthermore, $\PD_j$ is a priority queue with the {\em pending data}
that are in the memory of size $\MM_j$ but may be evicted to be communicated if
more memory is needed on~$p_j$. They are ordered by non-decreasing size and
correspond to some $c_{u,v}$.
In order to compute the memory requirement of a DAG, we use \algo{memDag}~\cite{KAYAASLAN20181},
an algorithm that
% We use the \algo{memDag} algorithm developed by Kayaaslan \etal~\cite{KAYAASLAN20181} to compute
% the memory requirement; it
transforms the workflow into a series-parallel graph
and then finds the traversal that leads to the minimum memory consumption.
\begin{table}
\begin{center}
\begin{tabular}{rl}
\hline
\textbf{Symbol} & \textbf{Meaning} \\
\hline
$G = (V, E)$ & Application DAG (tasks and edges) \\
$\parents{u}$, $\children{u}$ & Parents of task $u$, children of task $u$ \\
$w_u$ & Number of operations of task $u$ \\
$c_{u,v}$ & Size of output file for edge $(u,v)\in E$ \\
$m_u$ & Amount of memory required by task $u$ \\
$r_u$ & Total memory requirement for task~$u$ \\
% $F$, $\mathcal{F}$ & A partitioning function and the partition it creates \\
% $V_i$ & Block number $i$ \\ %\wrt~some $F$ \\
$\cluster$, $k$ & Computing platform, total number of processors \\
% $p_j$, proc($V_i$) & Processor number $j$, processor of block $V_i$ \\
$M_j$, $MC_j$, $s_j$ & Memory size, comm. buffer size, and speed of proc.\ $p_j$ \\
$\beta$ & Bandwidth in the compute system \\
$bl(u)$ & Bottom level of task $u$ \\
$ST(u,p_j)$ & Start time of task~$u$ on~$p_j$ \\
$FT(u,p_j)$ & Finish time of task~$u$ on~$p_j$ \\
% $\mu_G$, $\mu_i$ & Makespan of the entire workflow $G$ and of a block $V_i$ \\
% $\Gamma = (\mathcal{V}, \mathcal{E})$ & Quotient graph, its vertices and its edges \\
% $r_u$, $r_{V_i}$ & Memory requirement of task $u$ and of block $V_i$ \\
\hline
\end{tabular}
\end{center}
\caption{Key Notation} \label{tabnotation}
\end{table}
\subsection{Optimization problem}
\label{sec.mod.pb}
The goal is to find a schedule of the DAG~$G$ for the $k$ processors,
so that the makespan (total execution time) is minimized while
respecting memory constraints. If a processor runs out of memory to execute
a task mapped on it, the schedule is said to be {\em invalid}.
Since tasks are subject to variability, we aim at minimizing the actual makespan
achieved at the end of the execution, while decisions may be taken \wrt
the estimated task parameters.
Note that the problem is already NP-hard even in the homogeneous case and
without memory constraints, because of the DAG structure of the application.
Hence, we focus on the design of efficient scheduling heuristics.
%\hmey{Mention NP-hardness due to being more general than NP-hard problem?}
% \paragraph{Workflow-related changes}
%
% \begin{itemize}
% \item A task $v$ takes longer or shorter to execute than planned: its time weight $w_u$ changes to $w'_u$.
% \item A task $v$ takes more or less memory to execute than planned: its memory requirement $m_v$ changes to $m'_v$.
%
% \end{itemize}
%
% The following changes are not a part of this article's scope:
%
% \begin{itemize}
% \item The workflow structure changes: edges or tasks come in or leave.
% \end{itemize}
%
% \paragraph{Execution environment-related changes }
%
%
% \begin{itemize}
% \item A processor exists the execution environment: $k$ decreases and $\cluster$ changes.
% \item A processor enters the execution environment: $k$ increases, $\cluster$ gets a new processor with possibly new memory requirement and processor speed.
%
% \end{itemize}
%
% The following changes are not a part of this article's scope:
%
% \begin{itemize}
% \item Processor characteristics change: the memory requirement or speed become bigger or smaller
% \end{itemize}
%
% \subsection{Time of changes }
%
% We consider discrete time in seconds.
% The time point(s) at which the changes happen is unambiguously defined.
%
% For any task $v$, its runtime equals its time weight divided by the speed of the processor $p_j$ it has been assigned to: $w_v/s_j$.
% The start time of any task $v$ is its top level($\bar{l}_v$), or the difference between the maximum bottom level in the workflow (the makespan of the workflow) and the task's own bottom level: $\bar{l}_v = \mu_\Gamma - \bottomlevel{v}$.
% The start time of the source task in the workflow is zero.
% The end time of a task $v$ is its start time and its runtime: $\bar{l}_v + w_v/s_j$
%
% \subsection{Changes and knowledge horizon - important questions TBA}
%
% Given a valid mapping of tasks to processors, we can say what we predicted would happen at any given time point $T$: what tasks have been executed, what have not finished or have not even started.
%
% At the point of change, we know that some tasks that finished took longer than expected ($w_v$ are bigger) or shorter.
% However, how do we model the following:
% \begin{itemize}
% \item Do we know the new weights of currently running tasks and tasks that have not yet started? This means, do we foresee into the future or do we assume that all weights on unfinished tasks remain the same?
% \item A change in memory requirements can mean that the assignment had been invalid. Do we assume that these tasks failed and we need to rerun them?
% \item How many times of change do we model - one per workflow run, or multiple?
% \item At what time does the change and reevaluation happen - is it a fixed (random?) point of time or is it workflow-dependent (say, after 10\% of the workflow is ready)?
% \end{itemize}
\section{Scheduling heuristics}
\label{sec:heuristics}
%
We design variants of HEFT that account for memory usage and aim at minimizing the makespan.
First, we present in Section~\ref{sec.heft} the baseline HEFT heuristic; as it does not account for the memory,
it may return invalid schedules that will not be able to run successfully on the platform (due to
running out of memory). Then, Section~\ref{sec.heftm} focuses on the presentation of the novel
heuristics, including eviction strategies to move some data in communication buffers
in case there is not enough memory available on some processors.
\subsection{Baseline: original HEFT without memories}
\label{sec.heft}
%
Original HEFT does not account for memory sizes.
Its schedules can be invalid if tasks are assigned to processors without enough memory.
These solutions can be viewed, however, as a ``lower bound'' for a valid solution that
respects memory constraints.
HEFT works in two stages.
In the first stage, it computes the rank of each task as its bottom level, defined as
\[
bl(u) = w_u + \max_{(u,v)\in E} \{c_{u,v} + bl(v)\}
\]
(with the max yielding $0$ if there is no outgoing edge).
The tasks are then sorted by non-increasing rank.
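The bottom-level recursion can be evaluated in a single memoized backward pass over the DAG, as in the following sketch (container names are illustrative assumptions):

```python
def bottom_levels(tasks, w, c, children):
    """bl(u) = w_u + max over edges (u,v) of (c_{u,v} + bl(v)),
    with the max equal to 0 for tasks without successors."""
    bl = {}
    def visit(u):
        if u not in bl:  # memoize: each task is computed once
            bl[u] = w[u] + max(
                (c[(u, v)] + visit(v) for v in children.get(u, [])),
                default=0)
        return bl[u]
    for u in tasks:
        visit(u)
    return bl

def rank_order(tasks, bl):
    """Tasks sorted by non-increasing rank (bottom level)."""
    return sorted(tasks, key=lambda u: bl[u], reverse=True)
```

On a two-task chain $a \to b$ with $w_a = 2$, $w_b = 3$ and $c_{a,b} = 1$, this yields $bl(b) = 3$ and $bl(a) = 6$, so $a$ is ranked first.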
In the second stage, the algorithm iterates over the tasks in this order and assigns each task to the processor on which it
achieves the earliest finish time.
We tentatively assign each task~$v$ to each processor~$p_j$.
The task's starting time $ST(v,p_j)$ on processor~$p_j$ is dictated by the maximum between the ready time of the processor~$rt_j$
and all communications that
must be orchestrated from predecessor tasks $u\notin T(p_j)$.
The starting time is then:
{\footnotesize{ \[ST(v, p_j) = \max \Big\{ rt_j,\ \max_{\substack{u \in \Pi(v)\\ u \notin T(p_j)}} \Big( \max\big\{ FT(u),\ rt_{proc(u), p_j} \big\} + \frac{c_{u,v}}{\beta} \Big) \Big\} . \]}}
Finally, its finish time on $p_j$ is
$FT(v,p_j) = ST(v,p_j) + \frac{w_v}{s_j}$.
Once we have computed all finish times for task~$v$,
we keep the minimum $FT(v,p_j)$ and assign task~$v$
to processor~$p_j$.
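The tentative-assignment loop above can be sketched as follows; the function tries every processor and keeps the one minimizing the finish time (all containers and parameter names are illustrative assumptions, not the paper's implementation):

```python
def best_processor(v, procs, rt, rt_comm, FT, proc_of, parents, w, c, speed, beta):
    """Baseline HEFT (no memory constraints): tentatively place task v
    on every processor and keep the earliest finish time."""
    best = None
    for j in procs:
        st = rt[j]  # processor ready time
        for u in parents[v]:
            if proc_of[u] == j:
                st = max(st, FT[u])  # same processor: no communication
            else:
                # data must arrive over the channel from u's processor
                arrival = max(FT[u], rt_comm[(proc_of[u], j)]) + c[(u, v)] / beta
                st = max(st, arrival)
        ft = st + w[v] / speed[j]  # FT(v, p_j) = ST(v, p_j) + w_v / s_j
        if best is None or ft < best[0]:
            best = (ft, st, j)
    return best  # (finish time, start time, chosen processor)
```

A source task of weight 4 on two processors of speeds 1 and 2, for example, is placed on the faster processor with finish time 2.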
\textit{Assignment to processor. }
When assigning the task, we set the ready time $rt_j$ of processor~$j$ to be the finish time of the task.
For every predecessor~$u$ of task~$v$ that has been assigned to another processor~$j'$, we adjust the ready time of the
communication channel $rt_{j', j}$:
we increase it by the
communication time $c_{u,v} / \beta$.
\subsection{Memory-aware heuristics}
\label{sec.heftm}
%
Like the original HEFT, the memory-aware versions of HEFT consist of two stages:
first, they compute the task ranks,
and second, they assign tasks to processors in the order defined in the first stage.
We consider three variants of HEFT accounting for memory usage (HEFTM), which only
differ in the order they consider tasks to be scheduled in the first stage.
\medskip
\noindent{\bf Compute task ranks. }
Our three variants of memory-aware HEFT work as follows:
\begin{itemize}
\item
HEFTM-BL orders tasks by non-increasing bottom levels, where the bottom
level is defined as
$$bl(u) = w_u + \max_{(u,v)\in E} \{c_{u,v} + bl(v)\}$$
(max yields $0$ if there is no outgoing edge).
\item
HEFTM-BLC %: from the study of the fork (see below), it seems important
% to also account for the size of the data as input of a task,
gives more priority to tasks with potential large incoming communications,
hence aiming at clearing the memory used by files as soon as possible,
to have more free memory for remaining tasks to be executed on the processor.
Therefore, for each task, we compute a modified bottom level accounting for communications:
$$blc(u) = w_u + \max_{(u,w)\in E} \{c_{u,w} + blc(w)\} + \max_{(v,u)\in E} c_{v,u} . $$
% \skug{avoid having mixed ranks, when the memory size of the lower task is not taken into account}
\item
Finally, HEFTM-MM orders tasks in the order returned by %as dictated by MinMem.
the \algo{memDag} algorithm~\cite{KAYAASLAN20181}, which corresponds to a traversal
of the graph that minimizes peak memory usage.
\end{itemize}
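For illustration, the modified rank of HEFTM-BLC can be computed with the same backward traversal as the plain bottom level, adding the largest single incoming file size; the sketch below uses illustrative container names.

```python
def blc_ranks(tasks, w, c, parents, children):
    """blc(u) = w_u + max over (u,v) of (c_{u,v} + blc(v))
                    + max over (p,u) of c_{p,u}   (each max is 0 if empty)."""
    blc = {}
    def visit(u):
        if u not in blc:
            down = max((c[(u, v)] + visit(v) for v in children.get(u, [])),
                       default=0)
            up = max((c[(p, u)] for p in parents.get(u, [])), default=0)
            blc[u] = w[u] + down + up
        return blc[u]
    for u in tasks:
        visit(u)
    return blc
```

On the chain $a \to b$ with $w_a = 2$, $w_b = 3$, $c_{a,b} = 1$, task $b$ gets $blc(b) = 3 + 1 = 4$ (its incoming file raises its priority), and $blc(a) = 2 + (1 + 4) = 7$.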
\bigskip
\noindent {\bf Task assignment. }
Then, the idea is to pick the next free task in the given order,
and greedily assign it to a processor, by trying all possible options
and keeping the most promising one. We first detail how a task
is tentatively assigned to a processor, by carefully accounting for the memory usage.
Next, we explain the steps to be taken to effectively assign a task to a given processor.
\medskip
\noindent{\em Tentative assignment of task~$v$ on $p_j$.}\\
{\bf Step 1.} First, we need to check that for all predecessors~$u$ of~$v$ that are mapped
on~$p_j$, the data $c_{u,v}$ is still in the memory of~$p_j$,
i.e., $c_{u,v}\in \PD_j$. Otherwise, the finish time is set to~$+\infty$ (invalid choice).
\smallskip
\noindent{\bf Step 2.} Next, we check the memory constraint on~$p_j$, by computing
\begin{align*}
Res =\ & availM_j - \max \bigg\{ m_v,\
\sum_{\substack{u \in \Pi(v)\\ u \notin T(p_j)}} c_{u,v},\\
& \hspace{1.5cm}
\sum_{\substack{w \in Succ(v)}} c_{v,w} - \sum_{\substack{u \in T(p_j)}} c_{u,v} \bigg\}
\end{align*}
Here, $T(p_j)$ denotes the set of tasks already scheduled on $p_j$; by Step 1, their files are
already in the memory of~$p_j$. However, the files from the
other predecessor tasks must be loaded into memory before executing task~$v$.
Depending on which part of $r_v$ is the largest, we need to have enough memory to store
the task data of size $m_v$, or the data generated for all successor tasks.
$Res$ then checks whether there is enough memory; if it is negative,
this tentative assignment exceeds the memory of~$p_j$.
%
In this case ($Res <0$), we try to evict some data from memory so that there is enough memory to execute task~$v$.
Clearly, we need to evict at least $|Res|$ units of data.
To this end, we propose a greedy approach, which evicts the largest files of $\PD_j$ until at least $|Res|$ units of data have been evicted.
A variant where the smallest files are evicted first has been tested; it led to comparable results.
%
% in order to avoid costly communications.
% \AB{FYI We initially discussed evicting the largest files, but this leads to
% large communications and does not seem efficient after all... Maybe we can think of another
% approach that would take into account both data size and bottom level...}
While tentatively evicting files, we remove them from the list of pending memories and move them into a list
of memories pending in the communication buffer.
We keep track of the available buffer size, too -- every time a file is moved into the pending buffer,
the available buffer size is reduced by the file size.
If, after tentatively evicting all files from $\PD_j$, we still do not have enough memory, or if we exceed the size of the available buffer during this process, we set the finish time to $+\infty$ (indicating an invalid choice).
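Steps 1--2 can be sketched as follows: compute the residual memory $Res$ and, if it is negative, greedily evict the largest pending files into the communication buffer. The sketch uses assumed container names; returning `None` stands for the $+\infty$ finish time, i.e., an invalid choice.

```python
def try_fit(v, j, avail_mem, avail_buf, pending, m, c, parents, children, proc_of):
    """Step 2 sketch: memory check on processor j with greedy eviction.
    Returns the list of evicted files, or None if the tentative
    assignment is invalid."""
    local = [u for u in parents[v] if proc_of.get(u) == j]
    need = max(
        m[v],                                                    # task footprint
        sum(c[(u, v)] for u in parents[v] if u not in local),    # inputs to load
        sum(c[(v, s)] for s in children[v])                      # outputs minus
        - sum(c[(u, v)] for u in local),                         # local inputs
    )
    res = avail_mem - need
    evicted = []
    for size, edge in sorted(pending, reverse=True):  # largest files first
        if res >= 0:
            break
        if size > avail_buf:
            return None  # communication buffer would overflow
        avail_buf -= size  # file moves into the communication buffer
        res += size
        evicted.append(edge)
    return evicted if res >= 0 else None  # None: not enough memory at all
```

For example, a task needing 10 units on a processor with 4 units free and pending files of sizes 5 and 3 evicts both files before the assignment becomes feasible.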
\smallskip
\noindent{\bf Step 3.} We tentatively assign task~$v$ on $p_j$.
Its starting time $ST(v, p_j)$ on $p_j$ is dictated by the maximum between $rt_j$ and all communications that
must be orchestrated from predecessor tasks $u\notin T(p_j)$.
The starting time is therefore:\\[-.7cm]
{\small{ \begin{align*}
ST(v, p_j) = \max \Big\{ & \, rt_j, \\
& \max_{\substack{u \in \Pi(v)\\ u \notin T(p_j)}} \Big( \max\big\{ FT(u) , rt_{proc(u), p_j}\big\} + \frac{c_{u,v}}{\beta} \Big) \Big\} .
\end{align*}
}}
\noindent Finally, its finish time on $p_j$ is
$FT(v,p_j) = ST(v, p_j) + \frac{w_v}{s_j}$.
\medskip
\noindent{\em Assignment of task~$v$.}\\
Once we have computed all finish times for task~$v$,
we keep the minimum $FT(v,p_j)$ and assign~$v$
to processor~$p_j$.
In detail:
\begin{itemize}
\item
We evict the files corresponding to edge weights that need to be evicted to free the memory.
We remove these files from pending memories
$PD_j$, add them to pending data in the communication buffer, and reduce the available buffer size accordingly.
\item
We calculate the new $availM_j$ on the processor.
Next, we subtract the weights of all incoming files from predecessors assigned to the same processor
and add the weights of outgoing files generated by the currently assigned task.
\item
For every predecessor~$u$ of~$v$ that has been assigned to another processor~$j'$, we adjust the ready time of the
communication channel $rt_{j', j}$: we increase it by the
communication time $c_{u,v} / \beta$.
We also remove the incoming files from either the pending memories or pending data in buffers of these other
processors, and increase the available memory or buffer sizes on these processors.
\item
We compute the correct amount of available memory for~$p_j$ (for when the task is done).
Then, for each predecessor that is mapped to the same processor,
we remove the pending memory corresponding to the weight of
the incoming edge, also freeing the same amount of available memory (increasing $availM_j$).
For each successor, we rather add the edge weights to pending memories and reduce $availM_j$
by the corresponding amount.
\end{itemize}
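The bookkeeping steps in the list above can be sketched as follows; all containers are illustrative assumptions, and pending data are kept as plain lists of file sizes rather than the priority queue $\PD_j$.

```python
def commit_assignment(v, j, ft, rt, rt_comm, proc_of, parents, children,
                      c, beta, pending, avail_mem):
    """Bookkeeping sketch once task v is committed to processor j."""
    proc_of[v] = j
    rt[j] = ft  # the processor is busy until v finishes
    for u in parents[v]:
        jp = proc_of[u]
        size = c[(u, v)]
        if jp != j:
            # schedule the transfer on the (jp, j) communication channel
            rt_comm[(jp, j)] = rt_comm.get((jp, j), 0) + size / beta
            # the file leaves the sender's memory or buffer
            if size in pending[jp]:
                pending[jp].remove(size)
            avail_mem[jp] += size
        else:
            # local input file is consumed: free it from pending data
            pending[j].remove(size)
            avail_mem[j] += size
    for s in children[v]:
        # output files become pending data on j and occupy memory
        pending[j].append(c[(v, s)])
        avail_mem[j] -= c[(v, s)]
```

With a bandwidth of 2, committing a task whose single input file of size 4 resides on another processor pushes that channel's ready time forward by 2 and frees 4 units on the sender.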
% \subsection{The fork}
% We look at the behavior of these heuristics on a fork graph,
% where there is a root task~$T_0$, producing $n$ files $f_1, \ldots, f_n$
% to be used by tasks $T_1, \ldots, T_n$ ($f_i = c_{0,i}$).
%
% Without memory, this problem is NP-complete; this is equivalent
% to 2-partition if the tasks have $w_i=a_i$, and all files are of size~$f_i=0$,
% and with two processors. Half of the tasks must be sent to the processor
% on which $T_0$ is not executed, and the optimal makespan is
% $w_0+\frac{1}{2}\sum_{1\leq i \leq n} w_i$.
%
% However, with an infinite number of identical processors, it can be
% solved in polynomial time: sort tasks by non-decreasing $f_i+w_i$;
% the $k$ tasks with smallest $f_i+w_i$ are then sent to another processor,
% while the remaining $n-k$ tasks are executed locally (try all values of $k$).
%
% With heterogeneous processors, it is probably NP-complete again
% because we could ensure that there are only two processors fast enough
% and get back to the 2-partition...
%
% We also had an example where evicting large files first in step 2
% can lead to arbitrarily bad makespan. Consider a fork with $n=2$,
% $f_1=1$, $w_1=2$, $f_2=100$, $w_2=1$, and memory constraint
% imposes that we free one unit of memory before executing one
% of the tasks\ldots Actually the new version with BLC would start
% considering $T_2$ and be fine in this case\ldots
%
%
% \AB{Can we prove that we have (maybe) a 2-approximation,
% at least for the fork? What worst-case can we think of? }
\vspace{-0.09cm}
\section{Dynamic scenario}
\label{sec:dyn}
%
In a workflow execution environment, the scheduling method interacts with the runtime environment, which provides information such as resource estimates.
This information may include memory usage, running time, graph structures, or the status of the underlying infrastructure.
In order to ensure that the information is up to date, a monitoring system observes the workflow execution and collects metrics for tasks and the underlying infrastructure.
By feeding dynamic monitoring values, e.g., the resources a task actually consumed, back into the prediction model, the runtime environment can provide more accurate resource predictions.
The underlying infrastructure can also change during the workflow execution,
for example through processor failures, node recoveries, or the acquisition of new nodes.
Even if the hardware infrastructure does not change, the set of nodes provided as a scheduling target might change due to release or occupation in shared cluster infrastructures.
As infrastructure information and resource predictions are dynamically updated and provided to the scheduler during workflow runtime, the previous schedule may become invalid, so that a new one must be calculated.
For state-of-the-art memory prediction methods, a cold-start median prediction error for heterogeneous infrastructures
of approximately 15\% has been observed~\cite{malik2013execution}.
Online prediction methods were able to significantly reduce the error during runtime, with the reduction reaching up to one third of the cold-start error~\cite{baderDiedrichDynamic2023,witt2019learning}.
%For instance, Nadeen~et~al.\cite{} report an error of 10\%, 11\%, and 15\% while the task prediction errors shows a normal and exponential distribution.
%Bader~et~al.~ report a prediction error between 13\% and 17\% for their method, showing an exponential task error distribution.
% @Svetlana, willst du sowas für deine Experimente? Also die Daten, welche du dann konfigurieren kannst?
Such a dynamic execution environment requires a dynamic scheduling method where the schedule can be recomputed during workflow execution.
\subsection*{Retracing the effects of change on an existing schedule}
After the monitoring system has reported changes, we need to assess their impact on the existing schedule.
These changes can invalidate the schedule (\eg if there is not enough memory for some tasks to execute anymore),
they can lead to a later finishing time (\eg if some tasks are longer and they delay other tasks), or they can have no effect (\eg if new processors
joined the cluster, but the old schedule did not account for them).
To assess the impact, we need to retrace the schedule.
First, we check whether at least one processor with assigned tasks has terminated operation -- this instantly invalidates the
entire schedule.
%
We then iterate over all tasks of the workflow in a topological order --
any of the orderings given by rankings BL, BLC or MM is a topological ordering.
We then repeat steps similar to the tentative assignment in the heuristics, except that we no longer choose a processor,
but rather check whether the processor currently assigned to the task still fits.
For each task $v$, we first assess its current memory constraint $Res$ using Step 2 from the heuristic.
The factors that affect $Res$ are possible changes in $m_v$, in $c_{u,v}$ from predecessors $u$