Description
Hi! I noticed the High throughput page in the CSC docs (https://docs.csc.fi/computing/running/throughput/), and I have a few comments on how it describes the properties of HyperQueue (as a disclaimer, I'm one of the HQ developers).
The page presents HyperQueue primarily as an alternative to GNU parallel or Slurm array jobs, which is fine; however, that is just one of the ways of using HQ. In fact, HQ supports (and was explicitly designed to support) many of the things mentioned on the page, even though the page currently claims that it does not support them :)
Here are a few things I'd like to clarify regarding the "decision tree" and the comparison table:
- Multi-node tasks (tree) - HQ does in fact support multi-node tasks, and even allows you to combine single-node and multi-node tasks in the same task graph.
- Dependencies between subtasks (tree) - HQ does support dependencies between tasks; they can be expressed either using workflow files or using a Python API (see the sketch after this list).
- Packs jobs/job steps (table) - The primary motivation behind HQ is to let users submit large task graphs (e.g. a million tasks) and have the tasks mapped fully automatically onto a small number of Slurm/PBS allocations. In fact, "job packing" was the main reason why HQ was created in the first place :)
- Dependency support (table) - As mentioned above, HQ allows expressing dependencies between tasks.
- Error recovery - Fault-tolerant task execution is built into HQ. When a task fails, it is recomputed fully automatically, so it seems odd to me that this is marked as "not supported" :)
- Slurm integration - HQ has an automatic allocator that submits allocations fully automatically, based on the computational needs of the submitted tasks; it integrates and communicates with Slurm on the users' behalf.
- Multi-partition support - HQ has very advanced resource management. You can specify arbitrary resource requirements per task, e.g. "needs 16 CPUs", "needs 2 NVIDIA GPUs", or even "needs 0.25 of an NVIDIA GPU" or "needs either 4 CPUs AND 1 GPU, OR 16 CPUs". You can add HQ workers from any number of different Slurm partitions and HQ will schedule tasks onto them fully transparently; you can even configure the automatic allocator to provide you with allocations from a CPU partition and a GPU partition at the same time.
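To make the dependency support a bit more concrete, here is a minimal sketch using the HQ Python API (the script names are made up for illustration, and the exact API details are best checked against the HQ documentation rather than this sketch):

```python
from hyperqueue import Client, Job

# Build a small task graph: "postprocess" runs only after both
# "simulate" tasks have finished (the scripts are placeholders).
job = Job()
sim1 = job.program(["./simulate.sh", "input-1"])
sim2 = job.program(["./simulate.sh", "input-2"])
job.program(["./postprocess.sh"], deps=[sim1, sim2])

# Connect to a running HQ server and submit the whole graph as one job.
client = Client()
submitted = client.submit(job)
client.wait_for_jobs([submitted])
```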
While HQ can be used "just" as a task executor within a single Slurm allocation, it becomes much more powerful when used as a meta-scheduler, i.e. users simply run the HQ server on a login node and let HQ manage Slurm allocations for them fully automatically.
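As a rough sketch, the meta-scheduler workflow can look like this on the command line (the partition name, time limit, and task script are placeholders; please take the exact flags from the HQ documentation rather than from this example):

```bash
# On a login node: start the HQ server (no Slurm allocation needed yet).
hq server start &

# Let the automatic allocator submit Slurm allocations on my behalf;
# arguments after "--" are passed through to Slurm.
hq alloc add slurm --time-limit 1h -- --partition=small

# Submit tasks; HQ requests allocations as needed and packs the tasks
# onto the workers that start inside them.
hq submit --cpus=4 ./my_task.sh
```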
I hope this makes the description of the features that HQ offers a bit more accurate :) I'd be happy to send a PR that clarifies these points in your docs if you want.