Vt 514 dask usage via karabo is pit of success (#516)
* Made karabo "pit of success", added parameter if inside container for dask client creation.
* Fixed missing type for mypy.
* Fixed line too long and improved description.
* Imported at the beginning.
* Fixed failing tests
* Added test for dask.
* Improved documentation.
`doc/src/examples/example_structure.md`: 61 additions & 0 deletions
## Parallel processing with Karabo
Karabo streamlines the process of setting up an environment for parallelization. Through its utility function `parallelize_with_dask`, Karabo nudges the user toward a seamless parallelization experience: by adhering to its format, users find themselves in a "pit of success" with parallel processing. This ensures efficient task distribution across multiple cores or even entire cluster nodes, especially when handling large datasets or computationally demanding tasks.
### Points to Consider When Using `parallelize_with_dask` and Dask in General
When leveraging the `parallelize_with_dask` function for parallel processing in Karabo, keep the following best practices in mind:

1. **Avoid Infinite Tasks**: Ensure that the tasks you parallelize have a defined end. Infinite or extremely long-running tasks can clog the parallelization pipeline.
2. **Beware of Massive Tasks**: Large tasks can monopolize resources and unbalance the workload distribution. It is often more efficient to break massive tasks into smaller, more manageable chunks.
3. **No Open h5 Connections**: Objects with open h5 connections are not picklable, which means they cannot be serialized and sent to other processes. If you need to pass such an object to a function, close the connection first, e.g. by calling `h5file.close()`, or realize the data with `.compute()` inside Karabo (see the sketch after this list).
4. **Use `.compute()` on Dask Arrays**: Before passing Dask arrays to the function, call `.compute()` on them to realize their values. This avoids serialization issues and ensures efficient processing.
5. **Refer to Dask's Best Practices**: For a more comprehensive understanding and to avoid common pitfalls, consult [Dask's official best practices guide](https://docs.dask.org/en/stable/best-practices.html).

Following these guidelines will help you get the most out of Karabo's parallel processing capabilities. The sketch below illustrates points 3 and 4.
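
As a rough illustration of points 3 and 4, the following sketch closes an open h5 handle and materializes a Dask array before handing plain in-memory data to `parallelize_with_dask`. The import path, the file and dataset names, and the `process_row` helper are assumptions chosen for illustration, not part of Karabo's documented API.

```python
import dask.array as da
import h5py

# Assumed import path; check karabo.util for the actual location.
from karabo.util.dask import parallelize_with_dask

# Point 3: read what you need, then close the h5 handle so that no
# open connection is captured by the function being parallelized.
h5file = h5py.File("observations.h5", "r")  # hypothetical file
rows = h5file["visibilities"][:]  # copy the dataset into memory
h5file.close()  # close before parallelizing

# Point 4: realize Dask arrays with .compute() before passing them on.
weights = da.ones(len(rows), chunks=1000).compute()

def process_row(row, weights):  # hypothetical per-element function
    # The current element of the iterable arrives as the first argument.
    return row.sum() * weights[0]

results = parallelize_with_dask(process_row, rows, weights)
```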
### Parameters
- `iterate_function` (callable): The function to be applied to each element of the iterable. This function should take the current element of the iterable as its first argument, followed by any specified positional and keyword arguments.
- `iterable` (iterable): The collection of elements over which `iterate_function` is applied.
- `args` (tuple): Positional arguments passed to `iterate_function` after the current element of the iterable.
- `kwargs` (dict): Keyword arguments passed to `iterate_function`.
### Returns
- `tuple`: A tuple containing the results of `iterate_function` for each element in the iterable. Results are gathered using Dask's `compute` function.
### Additional Notes
When working on a Slurm cluster, it is important to call `DaskHandler.setup()` at the beginning.
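
A minimal sketch of that setup call follows; only `DaskHandler.setup()` itself comes from this note, while the import path and the surrounding script are assumptions.

```python
# Assumed import path; check karabo.util for the actual location.
from karabo.util.dask import DaskHandler

# On a Slurm cluster, call setup() once at the very beginning of the
# script, before any parallel work is dispatched.
DaskHandler.setup()

# ... the rest of the pipeline runs below ...
```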
If `verbose` is specified in `kwargs` and set to `True`, progress messages are printed during processing.
The function internally uses Dask's distributed scheduler.
Leverage the `parallelize_with_dask` utility in Karabo to harness the power of parallel processing and speed up your data-intensive operations.
### Function Signature
```python
from typing import Any, Callable, Iterable, List, Tuple, Union


def parallelize_with_dask(
    iterate_function: Callable[..., Any],
    iterable: Iterable[Any],
    *args: Any,
    **kwargs: Any,
) -> Union[Any, Tuple[Any, ...], List[Any]]:
    ...


# Example
def my_function(element, *args, **kwargs):
    # Do something with element
    return result


# The current element of the iterable is passed as the first
# argument to my_function.
parallelize_with_dask(my_function, my_iterable, *args, **kwargs)
# >>> (result1, result2, result3, ...)
```
## Use Karabo on a SLURM cluster
Karabo manages all available nodes through Dask, making their computational power conveniently accessible to the user. The `DaskHandler` class streamlines the creation of a Dask client and offers a user-friendly interface for interaction. It contains static variables which, when altered, modify the behavior of the Dask client (see the sketch below).
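
As a hedged sketch of how that interaction might look: the attribute names and the `get_dask_client()` helper below are illustrative assumptions about the static-variable interface, not confirmed API.

```python
# Assumed import path; check karabo.util for the actual location.
from karabo.util.dask import DaskHandler

# Hypothetical static variables: changing them before a client exists
# alters how the Dask client is created.
DaskHandler.n_workers_scheduler_node = 1  # assumed attribute name
DaskHandler.memory_limit = "16GB"  # assumed attribute name

client = DaskHandler.get_dask_client()  # assumed helper
print(client)
```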