Description
An External
Task is a work unit that runs in an external environment than Dask.
But still, it is known by the scheduler even before it becomes available and can be used in a task graph as any (internal) Dask task.
An External Task is created in an external
state.
It has a possible transition to the memory state when it becomes available in the distributed memory of Dask.
And it unblocks all the dependent tasks by triggering the task transition Algorithm.
External task makes it possible to integrate Dask distributed analytics in larger workflows implying other environments, typically this feature has been used in HPC workflows coupling MPI simulation to Dask analytics in a producer-consumer functionality.
The external tasks can can be used in other workflows such as digital twins workflows and IoT integration.
A way to do that is by using scatter
operation from the client Class to send a block of data from the simulation process (with a client) to a Dask worker. Then sending semantic metadata to be able to use those data in the analytics.
The metadata includes the future returned by the scatter
with semantic metadata from the simulation.
The issue with this solution is being able to submit tasks to Dask only when the data is available. Data comes incrementally from the simulation side (iteration by iteration). Thus one has to write incremental algorithms for analytics needing data from several timesteps.
Moreover, a lot of metadata has to go through the scheduler and usually more than 512 clients are needed which overloads the scheduler.
In addition to bypassing the disk accesses to get the data, the external tasks allow the submission of task graphs even before data becomes available in the worker's memory.
This feature is already implemented in a forked version of Dask distributed
A detailed demonstration of the feature is published here. And a demo is here
In this example, two clients have been used Client01
submits the tasks, and Client02
only sends the data with the same key
and activates the external mode which is referred to as deisa
mode for (Dask Enabled In Situ Analytics).
In addition to using the same key, the scatter operation needs to act like other finished tasks, that is triggering the transition algorithm, once done. This is mandatory as the external tasks are used in analytics even before they are available.