Support external tasks #8199
Conversation
Can one of the admins verify this patch? Admins can comment
Here's a working external task.

Motivation: This helps to submit tasks to Dask that will be run externally; the scheduler still knows about them, so we can submit task graphs that use them.

Use cases: Coupling Dask to any running producer of data.
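The coupling described above can be modelled in plain Python. This is a toy sketch of the idea only, not the PR's implementation: the names `ExternalStore`, `publish`, and `resolve` are hypothetical, and stand in for the scheduler's key/value store, the external producer's scatter, and the consumer's blocking wait.

```python
# Toy model of the external-task idea (NOT the distributed implementation):
# a consumer builds a graph that references a key it cannot compute itself,
# and an external producer later publishes data under that same key.

class ExternalStore:
    """Minimal in-memory stand-in for a scheduler's key/value store."""

    def __init__(self):
        self._data = {}

    def publish(self, key, value):
        # The external producer "scatters" its result under an agreed key.
        self._data[key] = value

    def resolve(self, key):
        # The consumer conceptually blocks until the key is available.
        if key not in self._data:
            raise KeyError(f"{key!r} not yet produced externally")
        return self._data[key]

store = ExternalStore()

# Consumer side: a task that sums data it did not create itself.
graph = {"total": lambda: sum(store.resolve("external_key"))}

# Producer side: an external application publishes the data.
store.publish("external_key", [1.0, 2.0, 3.0])

# Now the consumer's task can run.
print(graph["total"]())  # 6.0
```

The key agreement ("external_key" on both sides) is the whole contract: neither side needs to know how the other computes, only which key carries the data.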
Unit Test Results: see the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests. 27 files ±0, 27 suites ±0, 10h 51m 54s ⏱️ -53m 8s. For more details on these failures, see this check. Results for commit 9b7ce18; comparison against base commit 2f04dcb. This pull request removes 4 and adds 22 tests (renamed tests count towards both).
♻️ This comment has been updated with latest results.
Example: We can test this with 2 clients, or with 1 Client and 1 Bridge (a class that allows connecting an external application via a compatible bridge).
from distributed import Client, Future
from dask import delayed
import dask.array as da

client1 = Client("tcp://...")

# Create a future tied to data that will be generated by an external source.
external_future = Future("external_key", external=True, inform=True)

# Describe the data that will be associated with this future, here a dask.array.
external_dask_array = da.from_delayed(delayed(external_future), shape=(10, 10), dtype=float)

# We submit a task graph that uses this external data.
# Internally this creates an *external* task rather than a released one, which
# prevents the scheduler from sending this empty task to the workers.
# These tasks are switched to the in-memory state when the external application
# scatters the data to a worker; the worker then informs the scheduler of the
# new state.
external_sum = external_dask_array.sum()

# Here the client blocks until the scatter is performed and the result is available.
res = client1.compute(external_sum).result()
print(res)

We can use the Client class to send data to the workers, here is the code:
from distributed import Client
import numpy as np

# Connect a second client to the same scheduler as the previous one.
external_client = Client("tcp://...")

# Now we send an array to a worker connected to that scheduler:
# - we need the same key,
# - direct has to be activated, because we don't want to go through the scheduler,
# - we have to provide a list of workers,
# - and we make sure to activate the external boolean.
external_future = external_client.scatter(
    np.ones((10, 10)),
    keys=["external_key"],
    direct=True,
    workers=["tcp://127.0.0.1:33251"],
    external=True,
)

# Now the first client is unblocked and should get the result.

Another possibility is to use the dedicated Bridge class, here is the code:
from distributed import Bridge
import numpy as np

# Connect a bridge directly to the associated worker.
external_bridge = Bridge(workers=["tcp://..."])

# Now we send an array to that worker:
# - we need the same key,
# - all the other parameters of the default scatter keep their default values.
external_future = external_bridge.scatter(np.ones((10, 10)), keys=["external_key"])

# Now the first client is unblocked and should get the result.
Cleaner bridge
cleaning the code
Here is our paper showing a typical use case of external tasks in HPC/ML workflows: Dask-Extended External Tasks for HPC/ML In Transit Workflows
Adding such a feature would be of extreme interest to us at CEA. We would really love to be able to use Dask to process data produced by numerical simulations, and as demonstrated in @GueroudjiAmal's paper this feature is of critical importance to get this use case working efficiently. Please let us know what we could do to help get this into Dask. We can help with testing, working together to improve the API, and whatnot.
stimulus external task
Closes #8070
pre-commit run --all-files