
Dynamic workers#144

Merged
haraldsvik merged 15 commits into main from dynamic-workers
Jan 30, 2025

Conversation

@haraldsvik
Contributor

@haraldsvik haraldsvik commented Jan 9, 2025

Uses a class to store state about workers.

The class stores:

  • default_max_workers - Max workers we allow
  • max_gb_all_workers - Max total size (disk size of the tar files) of the jobs in the pipeline
  • workers - List of workers (updated Worker to have job_size as property)

How it works:

  1. When we want to start a new job, we first check the size of the .tar file
  2. Then we check if we are allowed to spawn a new job:
    2.1 If alive workers < default_max_workers and current_total_size + new_job_size <= max_gb_all_workers
    2.2 we allow the new job!
  3. If we are allowed to spawn the new worker, we then register the job in the state.
  4. When the job finishes, we unregister the job.
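
The steps above can be sketched as a small state class. This is a minimal illustration, not the PR's actual implementation: default_max_workers, max_gb_all_workers and workers come from the description, while the method bodies and the Worker dataclass here are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    job_id: str
    job_size: int  # disk size of the job's .tar file (GB here, for simplicity)


@dataclass
class ManagerState:
    default_max_workers: int
    max_gb_all_workers: int
    workers: list = field(default_factory=list)

    def current_total_size(self) -> int:
        # total tar size of all jobs currently in the pipeline
        return sum(w.job_size for w in self.workers)

    def can_spawn_new_worker(self, new_job_size: int) -> bool:
        # allow the job only if both budgets (worker count and total size)
        # still have room for it
        return (
            len(self.workers) < self.default_max_workers
            and self.current_total_size() + new_job_size <= self.max_gb_all_workers
        )

    def register_job(self, job_id: str, job_size: int) -> None:
        self.workers.append(Worker(job_id, job_size))

    def unregister_job(self, job_id: str) -> None:
        self.workers = [w for w in self.workers if w.job_id != job_id]
```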

Note:

This can cause large jobs to be prioritized later than small jobs:
e.g.
default_max_workers=4
max_gb_all_workers=100GB
JOB_1=20GB --> len(workers) is 0 && current_size is 0GB--> Can spawn!
JOB_2=90GB --> len(workers) is 1 && current_size is 20GB --> Not allowed!
JOB_3=20GB --> len(workers) is 1 && current_size is 20GB --> Can spawn!
JOB_4=20GB --> len(workers) is 2 && current_size is 40GB --> Can spawn!

If we start these 4 jobs, JOB_2 won't run until the others are done (except if it gets picked up first, in which case all the other jobs will wait).
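
The admission check in the example above can be reproduced with a small stand-alone helper (a sketch; in the PR this logic lives inside the state class and the function name and defaults here are assumptions):

```python
def can_spawn(alive_workers: int, current_gb: int, new_job_gb: int,
              default_max_workers: int = 4,
              max_gb_all_workers: int = 100) -> bool:
    """Admission check: room in both the worker count and the size budget."""
    return (alive_workers < default_max_workers
            and current_gb + new_job_gb <= max_gb_all_workers)


# The four jobs from the example, checked in order:
print(can_spawn(0, 0, 20))   # JOB_1: True  -> spawns
print(can_spawn(1, 20, 90))  # JOB_2: False -> 20 + 90 > 100, must wait
print(can_spawn(1, 20, 20))  # JOB_3: True  -> spawns
print(can_spawn(2, 40, 20))  # JOB_4: True  -> spawns
```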

@haraldsvik haraldsvik requested a review from a team as a code owner January 9, 2025 13:18

@DanielElisenberg DanielElisenberg left a comment


I really like the state manager class 👍🏻 Easy to read solution to our issue. Good stuff 💯

if job and job.status not in ["queued", "built"]:
    logger.info(f"Worker died and did not finish job {job.job_id}")
    fix_interrupted_job(job)
manager_state.unregister_job(job.job_id)

job would be None if it is completed here. I think we should refactor this solution to iterate through each dead worker and use the dead worker first and foremost. Something akin to:

def clean_up_after_dead_workers(
    dead_workers: List[Worker], manager_state
) -> None:
    if len(dead_workers) == 0:
        return
    for dead_worker in dead_workers:
        # only query for the job in question;
        # the number of requests here is at most 4, so of no consequence
        job = job_service.get_job(dead_worker.job_id)
        if job and job.status not in ["queued", "built"]:
            logger.info(f"Worker died and did not finish job {job.job_id}")
            fix_interrupted_job(job)
        manager_state.unregister_job(dead_worker.job_id)

logger.info(
    f"{job.job_id} Failed to get the size of the dataset."
)
raise LocalStorageError(

If we do it like this, the local storage module itself might as well raise the error, so we don't need to do the check here.

Another option would be to fail the job:

        job_service.update_job_status(
            job_id, "failed", log="No such dataset available for import"
        )

This is preferable to crashing the whole executor if one dataset should be tampered with in some unexpected way.

rename get size to get_input_tar_size_in_bytes
fix unregister for dead_workers
Fails job instead of crashing if size isn't found
Took doc-strings for a run and a diet
@haraldsvik haraldsvik requested a review from a team January 23, 2025 08:25
else:
    self.current_max_workers = self.default_max_workers
can_spawn = True
self.update_worker_limit(new_job_size)

Nice to refactor this, but I think can_spawn_new_worker should be a purely "get" method without extra "setting" side effects, so refactoring this out will improve readability.
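
One way to apply that suggestion (a sketch with hypothetical method bodies; only the names can_spawn_new_worker, register_job and update_worker_limit appear in the PR) is to keep the check side-effect free and let register_job own the limit update:

```python
class ManagerState:
    def __init__(self, default_max_workers: int, max_gb_all_workers: int):
        self.default_max_workers = default_max_workers
        self.current_max_workers = default_max_workers
        self.max_gb_all_workers = max_gb_all_workers
        self.jobs: dict[str, int] = {}  # job_id -> tar size in GB

    def can_spawn_new_worker(self, new_job_size: int) -> bool:
        # pure "get": no state is modified here
        return (
            len(self.jobs) < self.current_max_workers
            and sum(self.jobs.values()) + new_job_size <= self.max_gb_all_workers
        )

    def register_job(self, job_id: str, job_size: int) -> None:
        # all mutation happens on registration, not during the check
        self.jobs[job_id] = job_size
        self._update_worker_limit(job_size)

    def _update_worker_limit(self, new_job_size: int) -> None:
        # hypothetical policy: halve the worker limit once a job larger
        # than half the size budget has been admitted
        if new_job_size > self.max_gb_all_workers // 2:
            self.current_max_workers = max(1, self.default_max_workers // 2)
```

Keeping the query pure means callers can probe the state repeatedly (e.g. in logging or tests) without accidentally changing the worker limit.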

Register job handles updating the current worker limit
@DanielElisenberg
Collaborator

Just food for thought: Could it be beneficial to also keep the active workers list inside the ManagerState? That is also part of the statefulness of the manager process so it might be more readable for as much of it as possible to exist in this new state class?

Co-authored-by: pawbu <pawbu@users.noreply.github.com>
haraldsvik and others added 3 commits January 29, 2025 09:12
Co-authored-by: Daniel Elisenberg <33904479+DanielElisenberg@users.noreply.github.com>
Co-authored-by: Daniel Elisenberg <33904479+DanielElisenberg@users.noreply.github.com>

@DanielElisenberg DanielElisenberg left a comment


Let's send it to integration tests and QA 💯

@haraldsvik haraldsvik merged commit 8230455 into main Jan 30, 2025
7 checks passed
@haraldsvik haraldsvik deleted the dynamic-workers branch January 30, 2025 12:44