
Dynamic workers#144

Merged
haraldsvik merged 15 commits into main from dynamic-workers
Jan 30, 2025

Conversation

@haraldsvik
Contributor

@haraldsvik haraldsvik commented Jan 9, 2025

Uses a class to store state about workers.

The class stores:

  • default_max_workers - Max workers we allow
  • max_gb_all_workers - Max total size (disk size of the tar files) of the jobs in the pipeline
  • workers - List of workers (updated Worker to have job_size as property)

How it works:

  1. When we want to start a new job, we first check the size of the .tar file
  2. Then we check if we are allowed to spawn a new job:
    2.1 If alive workers < default_max_workers and current_total_size + new_job_size <= max_gb_all_workers
    2.2 we allow the new job!
  3. If we are allowed to spawn the new worker, we then register the job in the state.
  4. When the job finishes, we unregister the job.
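
The steps above can be sketched as a small state class. This is a minimal illustration, not the PR's actual implementation: default_max_workers, max_gb_all_workers and workers come from the description, while the method bodies and the Worker dataclass here are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    job_id: str
    job_size: int  # disk size of the job's .tar file (GB here, for simplicity)


@dataclass
class ManagerState:
    default_max_workers: int
    max_gb_all_workers: int
    workers: list = field(default_factory=list)

    def current_total_size(self) -> int:
        # total tar size of all jobs currently in the pipeline
        return sum(w.job_size for w in self.workers)

    def can_spawn_new_worker(self, new_job_size: int) -> bool:
        # allow the job only if both budgets (worker count and total size)
        # still have room for it
        return (
            len(self.workers) < self.default_max_workers
            and self.current_total_size() + new_job_size <= self.max_gb_all_workers
        )

    def register_job(self, job_id: str, job_size: int) -> None:
        self.workers.append(Worker(job_id, job_size))

    def unregister_job(self, job_id: str) -> None:
        self.workers = [w for w in self.workers if w.job_id != job_id]
```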

Note:

This can cause large jobs to be prioritized later than small jobs:
e.g.
default_max_workers=4
max_gb_all_workers=100GB
JOB_1=20GB --> len(workers) is 0 && current_size is 0GB--> Can spawn!
JOB_2=90GB --> len(workers) is 1 && current_size is 20GB --> Not allowed!
JOB_3=20GB --> len(workers) is 1 && current_size is 20GB --> Can spawn!
JOB_4=20GB --> len(workers) is 2 && current_size is 40GB --> Can spawn!

If we start these 4 jobs, JOB_2 won't run until the others are done (except if it gets picked up first, in which case all the other jobs will wait).
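
The admission check in the example above can be reproduced with a small stand-alone helper (a sketch; in the PR this logic lives inside the state class and the function name and defaults here are assumptions):

```python
def can_spawn(alive_workers: int, current_gb: int, new_job_gb: int,
              default_max_workers: int = 4,
              max_gb_all_workers: int = 100) -> bool:
    """Admission check: room in both the worker count and the size budget."""
    return (alive_workers < default_max_workers
            and current_gb + new_job_gb <= max_gb_all_workers)


# The four jobs from the example, checked in order:
print(can_spawn(0, 0, 20))   # JOB_1: True  -> spawns
print(can_spawn(1, 20, 90))  # JOB_2: False -> 20 + 90 > 100, must wait
print(can_spawn(1, 20, 20))  # JOB_3: True  -> spawns
print(can_spawn(2, 40, 20))  # JOB_4: True  -> spawns
```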

@haraldsvik haraldsvik requested a review from a team as a code owner January 9, 2025 13:18

@DanielElisenberg DanielElisenberg left a comment


I really like the state manager class 👍🏻 Easy to read solution to our issue. Good stuff 💯

if job and job.status not in ["queued", "built"]:
    logger.info(f"Worker died and did not finish job {job.job_id}")
    fix_interrupted_job(job)
manager_state.unregister_job(job.job_id)

job would be None if it is completed here. I think we should refactor this solution to iterate through each dead worker and use the dead worker first and foremost. Something akin to:

def clean_up_after_dead_workers(
    dead_workers: List[Worker], manager_state
) -> None:
    if len(dead_workers) == 0:
        return
    for dead_worker in dead_workers:
        # only query for the job in question;
        # the number of requests here is at most 4, so of no consequence
        job = job_service.get_job(dead_worker.job_id)
        if job and job.status not in ["queued", "built"]:
            logger.info(f"Worker died and did not finish job {job.job_id}")
            fix_interrupted_job(job)
        manager_state.unregister_job(dead_worker.job_id)

logger.info(
    f"{job.job_id} Failed to get the size of the dataset."
)
raise LocalStorageError(

If we do it like this, the local storage module itself might as well raise the error, so we don't need to do the check here.

Another option would be to fail the job:

        job_service.update_job_status(
            job_id, "failed", log="No such dataset available for import"
        )

This is preferable to crashing the whole executor if one dataset should be tampered with in some unexpected way.

rename get size to get_input_tar_size_in_bytes
fix unregister for dead_workers
Fails job instead of crashing if size isn't found
Took doc-strings for a run and a diet
@haraldsvik haraldsvik requested a review from a team January 23, 2025 08:25
else:
    self.current_max_workers = self.default_max_workers
can_spawn = True
self.update_worker_limit(new_job_size)

Nice to refactor this, but I think can_spawn_new_worker should be a purely "get" method without extra "setting" side effects, so refactoring this out will improve readability.
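
One way to apply that suggestion (a sketch with hypothetical method bodies; only the names can_spawn_new_worker, register_job and update_worker_limit appear in the PR) is to keep the check side-effect free and let register_job own the limit update:

```python
class ManagerState:
    def __init__(self, default_max_workers: int, max_gb_all_workers: int):
        self.default_max_workers = default_max_workers
        self.current_max_workers = default_max_workers
        self.max_gb_all_workers = max_gb_all_workers
        self.jobs: dict[str, int] = {}  # job_id -> tar size in GB

    def can_spawn_new_worker(self, new_job_size: int) -> bool:
        # pure "get": no state is modified here
        return (
            len(self.jobs) < self.current_max_workers
            and sum(self.jobs.values()) + new_job_size <= self.max_gb_all_workers
        )

    def register_job(self, job_id: str, job_size: int) -> None:
        # all mutation happens on registration, not during the check
        self.jobs[job_id] = job_size
        self._update_worker_limit(job_size)

    def _update_worker_limit(self, new_job_size: int) -> None:
        # hypothetical policy: halve the worker limit once a job larger
        # than half the size budget has been admitted
        if new_job_size > self.max_gb_all_workers // 2:
            self.current_max_workers = max(1, self.default_max_workers // 2)
```

Keeping the query pure means callers can probe the state repeatedly (e.g. in logging or tests) without accidentally changing the worker limit.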

Register job handles updating the current worker limit
@DanielElisenberg
Collaborator

Just food for thought: Could it be beneficial to also keep the active workers list inside the ManagerState? That is also part of the statefulness of the manager process so it might be more readable for as much of it as possible to exist in this new state class?

Co-authored-by: pawbu <pawbu@users.noreply.github.com>
haraldsvik and others added 3 commits January 29, 2025 09:12
Co-authored-by: Daniel Elisenberg <33904479+DanielElisenberg@users.noreply.github.com>
Co-authored-by: Daniel Elisenberg <33904479+DanielElisenberg@users.noreply.github.com>

@DanielElisenberg DanielElisenberg left a comment


Let's send it to integration tests and QA 💯

@haraldsvik haraldsvik merged commit 8230455 into main Jan 30, 2025
7 checks passed
@haraldsvik haraldsvik deleted the dynamic-workers branch January 30, 2025 12:44