fix: testing get or creating actors #32
Ok, so a review of what the workers are, and what their job is:

Graze Semaphore: rate limiter for requests to our own internal API.

Right now, every new task happens within the context of a new CPU Worker. Booting new CPU Workers and Network Workers on every run is not a big deal, since startup time is minimal. Starting up new GPU Workers is a very big deal, as there are significant startup costs when loading all the various ML models we use. It looks like we're re-instantiating on every run right now, is that right?

Are GPU and CPU workers running on the same deployed server, or are they on separate servers and thus decoupled (i.e. we can independently scale each type of resource up or down)? The omni yaml suggests that we reboot CPU every time but not GPU, but I'm also not seeing the same discovery approach as used with the Semaphore/Cache workers.

Jumping into a

It seems like the GPU lanes needing to be instantiated could become a headache, as well as the CPU lanes, while the other workers are less of a headache. I have several ideas here that we could try if necessary:

Other things we can do to relieve pain would be to not autoscale certain things for now. Peak activity during the day is probably ≈2.5x of trough, so running at 3x overprovision from trough is not the end of the world. Just decoupling the one-CPU-one-GPU worker setup that we have on Runpod likely reduces the burn drastically and still makes this a huge benefit, buying us time to figure out the longer-term strategy.
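To put rough numbers on the overprovisioning point, here is a back-of-the-envelope sketch. It assumes load is normalized so that trough equals 1.0 and that capacity scales linearly with worker count; the 2.5x and 3x figures are the estimates from this comment, not measurements.

```python
def utilization(load: float, provisioned: float) -> float:
    """Fraction of provisioned capacity in use at a given load."""
    return load / provisioned


# Assumption: normalize trough load to 1.0 unit of capacity.
trough = 1.0
peak = 2.5 * trough          # estimated daily peak (≈2.5x of trough)
provisioned = 3.0 * trough   # fixed fleet size, no autoscaling

peak_util = utilization(peak, provisioned)
trough_util = utilization(trough, provisioned)

print(f"peak utilization:   {peak_util:.0%}")    # 83%
print(f"trough utilization: {trough_util:.0%}")  # 33%
print(f"headroom at peak:   {provisioned / peak:.2f}x")  # 1.20x
```

So a fixed 3x fleet would sit at roughly a third utilized overnight and still keep about 20% headroom at peak, if the 2.5x estimate holds.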
@DGaffney Here I have some tweaks to the omni boot to get/create named actors. I'm running the main job like
The `ray` CLI can stop the jobs, but it can't change the state of detached actors. So there are two options:

Getting actors by name on a fresh Ray cluster throws an error, so there's a try/catch in there to just create them in that case. I have this working for the cache, actors, but I am leaving the CPU workers alone. It's technically possible to kill some workers, but not others.
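For anyone reading along, the get-or-create pattern described above looks roughly like the sketch below. This is not the exact code in the PR: the helper names `get_or_create` and `get_or_create_ray_actor` are made up for illustration, and it relies on `ray.get_actor` raising `ValueError` when no actor is registered under the given name.

```python
from typing import Any, Callable


def get_or_create(
    name: str,
    lookup: Callable[[str], Any],
    create: Callable[[str], Any],
) -> Any:
    """Return the handle registered under `name`, creating it if lookup fails."""
    try:
        return lookup(name)
    except ValueError:
        # ray.get_actor raises ValueError when no actor with that name exists,
        # which is always the case on a fresh cluster.
        return create(name)


def get_or_create_ray_actor(actor_cls: Any, name: str, *args: Any, **kwargs: Any) -> Any:
    """Ray-specific wrapper (hypothetical helper).

    `actor_cls` is assumed to be a class decorated with @ray.remote.
    """
    import ray  # imported lazily so the generic helper works without Ray installed

    return get_or_create(
        name,
        lookup=ray.get_actor,
        # lifetime="detached" keeps the actor alive after the creating job exits.
        create=lambda n: actor_cls.options(name=n, lifetime="detached").remote(
            *args, **kwargs
        ),
    )
```

One caveat with this approach: if two drivers boot at the same time, both can miss on the lookup and the second create will fail because the name is already taken. Recent Ray versions also accept `get_if_exists=True` in `.options(...)`, which may be worth checking against the Ray version pinned here.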
Note: unrelated, but you might notice I added some type hints for my LSP in random places.