fix: testing get or creating actors #32

Merged: bgroupe merged 2 commits into main from omni/boot-tweaks on May 7, 2025
@bgroupe commented Apr 29, 2025

Collaborator

@DGaffney Here I have some tweaks to the omni boot to get/create named actors. I'm running the main job like this:

ray job submit --runtime-env omni_env.yaml --submission-id omni01 -- python omni_boot.py

The ray CLI can stop the jobs, but it can't change the state of detached actors. So there are two options:

  1. writing a kill script to run at some point in the update process
  2. making the booter optionally kill the existing actors when booting new ones, or return the live ones that don't need to be augmented.

Getting actors by name on a fresh Ray cluster throws an error, so there's a try/except in there to just create them in that case. I have this working for the cache actors, but I am leaving the CPU workers alone. It's technically possible to kill some workers, but not others.
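A minimal sketch of the get-or-create pattern described above, written against injected callables so it runs without a cluster. The helper name `get_or_create_actor` and its parameters are hypothetical and not taken from `omni_boot.py`. With Ray, `lookup` would typically wrap `ray.get_actor` (which raises `ValueError` when no actor has that name), `create` would wrap something like `ActorClass.options(name=name, lifetime="detached").remote(...)`, and `kill` would wrap `ray.kill`.

```python
from typing import Any, Callable


def get_or_create_actor(
    name: str,
    lookup: Callable[[str], Any],
    create: Callable[[str], Any],
    kill: Callable[[Any], None],
    replace: bool = False,
) -> Any:
    """Return the live actor registered as `name`, creating it if needed.

    All three callables are assumptions standing in for the Ray calls
    named in the lead-in; nothing here is the actual booter code.
    """
    try:
        existing = lookup(name)
    except ValueError:
        # Fresh cluster: nothing is registered under this name yet.
        return create(name)

    if replace:
        # Option 2 from the description: kill the old actor, boot a new one.
        kill(existing)
        return create(name)

    # Otherwise hand back the live actor untouched.
    return existing
```

If the booter never needs to replace an actor, Ray's `get_if_exists=True` actor option may cover the plain get-or-create case without the try/except; that is a possible simplification, not something this PR does.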

Note: Unrelated, but you might notice I added some type hints for my LSP in random places.

@DGaffney

Collaborator

Ok, so a review of what the workers are, and what their job is:

Graze Semaphore - rate limiter for requests to our own internal API.
Bluesky Semaphore - rate limiter for requests to Bluesky's API (specifically, their image CDN).
Cache - big memory blob for storing data that's frequently looked up (predictions on common texts, lists of user relationships, images and their scores, etc).
Network Worker - bank of dedicated lanes for sending HTTP (technically httpx, I think) requests to the internet for external resources (from Bluesky and Graze, via their respective semaphores).
GPU Worker - bank of workers for running loaded ML models like sentence transformers and text classifiers.
CPU Worker - bank of workers for processing the batches of tasks, and in turn calling on GPU Workers, Network Workers, and Cache.
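To make the semaphore role concrete, here is a rough sketch of a limiter that caps the number of requests in flight at once. It assumes the semaphores bound concurrency rather than requests per second, which the list above does not specify. The class name `RequestSemaphore` and its `run` method are hypothetical, and the real workers are Ray actors rather than plain asyncio objects.

```python
import asyncio
from typing import Any, Awaitable, Callable


class RequestSemaphore:
    """Hypothetical limiter: at most `max_in_flight` calls run concurrently."""

    def __init__(self, max_in_flight: int) -> None:
        self._sem = asyncio.Semaphore(max_in_flight)
        self.in_flight = 0  # requests currently holding a slot
        self.peak = 0       # highest concurrency observed so far

    async def run(self, request: Callable[..., Awaitable[Any]], *args: Any) -> Any:
        # Wait for a free slot, run the request, then release the slot.
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await request(*args)
            finally:
                self.in_flight -= 1


async def demo() -> RequestSemaphore:
    async def fake_request(i: int) -> int:
        await asyncio.sleep(0.01)  # stand-in for an HTTP call
        return i

    limiter = RequestSemaphore(max_in_flight=2)
    await asyncio.gather(*(limiter.run(fake_request, i) for i in range(6)))
    return limiter
```

Calling `asyncio.run(demo())` pushes six fake requests through two slots, so a Network Worker routed through such a limiter would never have more than two calls outstanding against the upstream API.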

Right now, every new task happens within the context of a new CPU Worker. Booting new CPU Workers and Network Workers on every run is not a big deal, since startup time is minimal. Starting up new GPU Workers is a very big deal, as there are significant startup costs when loading all the various ML models we use.

It looks like we're re-instantiating on every run right now, is that right? Are GPU and CPU workers running on the same deployed server, or are they on separate servers and thus decoupled (i.e. we can independently scale each type of resource up and down)? The omni yaml suggests that we reboot CPU every time but not GPU, but I'm also not seeing the same discovery approach as used with the Semaphore/Cache workers. Jumping into a grazer-algo-cluster-kuberay-workergroup-worker-9cqvh box, it looks like there's no GPU attached at all right now? Apologies if these questions are obvious and due to my own oversights; I'm just genuinely trying to figure out the topology here.

It seems like the GPU lanes needing to be instantiated could become a headache, as well as the CPU lanes, while the other workers are less of a headache. I have several ideas here that we could try if necessary:

  1. Refactor to have CPU/GPU instances be more decoupled from direct work. If we had a new QueueWorker which actually coordinates the work, and sends the work to CPU/GPU workers instead, would that help at all?
  2. Refactor to have GPU work as a totally external service. If it's a pain in the ass to reattach and reuse them, we could always just spin up a cluster of workers and communicate with them via job queues or HTTP to coordinate.
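A very rough sketch of what idea 1 could look like, purely to make the shape concrete. Everything here is hypothetical: the `QueueWorker` API, the `"cpu"`/`"gpu"` pool names, and the round-robin routing are all assumptions. In a Ray setup the pool entries would presumably be long-lived actor handles invoked with `.remote()` rather than plain callables.

```python
from collections import deque
from itertools import cycle
from typing import Any, Callable, Deque, Dict, Iterator, List, Tuple

Worker = Callable[[Any], Any]


class QueueWorker:
    """Hypothetical coordinator: owns the queue, routes work to existing pools."""

    def __init__(self, pools: Dict[str, List[Worker]]) -> None:
        for kind, workers in pools.items():
            if not workers:
                raise ValueError(f"pool {kind!r} has no workers")
        # One round-robin iterator per pool, so workers are reused, not rebooted.
        self._pools: Dict[str, Iterator[Worker]] = {
            kind: cycle(workers) for kind, workers in pools.items()
        }
        self._queue: Deque[Tuple[str, Any]] = deque()

    def submit(self, kind: str, payload: Any) -> None:
        if kind not in self._pools:
            raise KeyError(f"unknown pool {kind!r}")
        self._queue.append((kind, payload))

    def drain(self) -> List[Any]:
        """Dispatch queued tasks in FIFO order and collect their results."""
        results = []
        while self._queue:
            kind, payload = self._queue.popleft()
            worker = next(self._pools[kind])
            results.append(worker(payload))
        return results
```

The point of the sketch is only that new jobs would talk to the coordinator, while the CPU and GPU workers (and their loaded models) outlive any single job.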

Another thing we can do to relieve pain is to not autoscale certain things for now. Peak activity during the day is probably ≈2.5x of trough, so running at 3x overprovision from trough is not the end of the world. Just decoupling the one-CPU-one-GPU worker pairing that we have on Runpod likely reduces the burn drastically, so this would still be a huge benefit and would buy us time to figure out the longer-term strategy.
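For a sense of scale, the arithmetic behind the 3x figure can be worked through as below. It treats trough load as 1 unit and takes the ≈2.5x peak estimate at face value; the function name `capacity_summary` and its parameters are made up for illustration.

```python
def capacity_summary(peak_ratio: float, provision_ratio: float) -> dict:
    """Utilization and headroom for a fixed fleet, with trough load = 1 unit."""
    trough = 1.0
    peak = peak_ratio * trough
    provisioned = provision_ratio * trough
    return {
        # Fraction of the fleet busy at the quietest point of the day.
        "trough_utilization": trough / provisioned,
        # Fraction of the fleet busy at the daily peak.
        "peak_utilization": peak / provisioned,
        # Spare capacity left at peak, relative to peak load.
        "headroom_at_peak": provisioned / peak - 1.0,
    }


summary = capacity_summary(peak_ratio=2.5, provision_ratio=3.0)
```

Under those assumptions, a fixed 3x fleet sits at about 33% utilization at trough and about 83% at peak, leaving roughly 20% headroom above the estimated peak.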

@bgroupe merged commit 3489038 into main May 7, 2025
@bgroupe deleted the omni/boot-tweaks branch May 7, 2025 02:22