Skip to content

Commit 2ca3c37

Browse files
authored
Merge pull request #171 from pfackeldey/pfackeldey/add_queue_based_task_distribution_project
Add queue based task distribution project
2 parents eb73db1 + a6a98ea commit 2ca3c37

File tree

2 files changed

+46
-0
lines changed

2 files changed

+46
-0
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
node_modules/
Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
---
2+
name: Developing a queue-based task distribution system
3+
postdate: 2026-01-13
4+
categories:
5+
- Networking
6+
- Computing
7+
- Open science
8+
durations:
9+
- 3 months
10+
experiments:
11+
- ATLAS
12+
- CMS
13+
- HLLHC
14+
skillset:
15+
- Python
16+
- Docker
17+
- Kubernetes
18+
- Networking
19+
status:
20+
- Available
21+
project:
22+
- IRIS-HEP
23+
location:
24+
- Any
25+
commitment:
26+
- Any
27+
program:
28+
- Any
29+
shortdescription: Developing a task distribution system based on queues (potential Dask replacement)
30+
description: >
31+
Dask is nowadays commonly used to distribute tasks and process them on HEP infrastructures.
32+
However, for large scale analysis that involve more than tens/hundreds of thousands of tasks it comes to its limits.
33+
The scaling limitation is primarily due to Dask's Python-based single-threaded central scheduler, which can only handle up to a few thousand tasks per second gracefully.
34+
We propose a project to replace this central scheduler with a message (task) queue (e.g. using RabbitMQ) that workers can concurrently fetch work from.
35+
This concept essentially changes the system from a 'push-based' (central scheduler submits to workers) to a 'pull-based' (workers fetch from a queue) system.
36+
This project involves building a distributed system including: a client to put tasks into a queue, deploying such a queue on coffea-casa, workers (managed by HTCondor/Kubernetes) that connect and fetch from this queue.
37+
Certain challenges/opportunities need to be addressed: avoiding large messages (e.g. pickled coffea Processors) for the queue, queue multiplexing, dynamic worker scaling (adding/removing workers over time), task failure handling, different queue types/kinds, workers being able to submit dynamically tasks back into the queue, adding a monitoring system for the queue and workers, etc.
38+
contacts:
39+
- name: Peter Fackeldey
40+
email: peter.fackeldey@cern.ch
41+
- name: Oksana Shadura
42+
email: oksana.shadura@cern.ch
43+
44+
mentees: # keep an empty list until the project has started or a student is identified
45+
# when that happens add a list with name: and link: attributes for each students

0 commit comments

Comments
 (0)