Description
When a session is restarted, Backend.AI does the following:
- In the Agent:
  - Destroys the container.
  - Recreates the container reusing the prior allocation map and kernel resource spec.
- In the Manager:
  - Just updates the kernel/session status (RUNNING → RESTARTING → RUNNING).
However, the restarted session is broken:
- Connecting to in-container apps does not work, via both AppProxy and the manager's own proxy like the `./backend.ai app` command.
  - Probably because the port mapping is changed but not propagated to Manager (and App Proxy).
- GPU config env-vars (Add intrinsic GPU config env-vars for AI runtimes #3253) have empty values.
- The `CID` field in `/home/config/resource.txt` gets duplicated after a restart. (Refactor container preparation step #3266)
- maybe more?
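
The suspected cause of the first symptom can be illustrated with a toy model. This is only a sketch of the hypothesis, not Backend.AI code: `ManagerKernelRecord`, `restart_status_only`, and `restart_with_port_sync` are hypothetical names, and the port numbers are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ManagerKernelRecord:
    # Hypothetical stand-in for what the Manager remembers about a kernel.
    status: str
    service_ports: dict = field(default_factory=dict)  # app name -> host port


def restart_status_only(record, new_ports):
    """Mimics the behavior described above: only the status is updated."""
    record.status = "RESTARTING"
    # The Agent recreates the container here and may get different host ports
    # (new_ports), but nothing is written back to the record.
    record.status = "RUNNING"
    return record


def restart_with_port_sync(record, new_ports):
    """Assumed fix: propagate the recreated container's ports to the record."""
    record.status = "RESTARTING"
    record.service_ports = dict(new_ports)
    record.status = "RUNNING"
    return record


record = ManagerKernelRecord("RUNNING", {"jupyter": 30001})
restart_status_only(record, {"jupyter": 30517})
stale_port = record.service_ports["jupyter"]   # still the old port

restart_with_port_sync(record, {"jupyter": 30517})
fresh_port = record.service_ports["jupyter"]   # now matches the new container
```

If the hypothesis holds, any proxy that routes by the stale record would keep targeting a host port that no longer belongs to the container.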
Due to these issues, we need to restart the model service process in feat(BA-441): restart model service process #3282, instead of cleanly restarting the container.
For future work, we may need to consider:
- Adding an option to rebuild the device allocation map, to avoid re-allocating "faulty" devices if present.
- Adding explicit handling of restart failures for the case where no devices are available during such a rebuild.
- Reschedule a new session based on a terminated session's creation params #3284
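
The first two items could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the Agent's actual allocator: `rebuild_allocation` and `RestartAllocationError` are hypothetical names, and devices are modeled as plain string IDs.

```python
class RestartAllocationError(RuntimeError):
    """Hypothetical error: a restart cannot be given enough healthy devices."""


def rebuild_allocation(prior, available, faulty):
    """Rebuild a device list for a restart, skipping faulty devices.

    prior:     device IDs the session held before the restart (list)
    available: device IDs currently free on the node (set)
    faulty:    device IDs marked as unhealthy (set)
    """
    # Keep every previously allocated device that is still healthy.
    kept = [dev for dev in prior if dev not in faulty]
    needed = len(prior) - len(kept)
    # Replace the faulty ones with healthy free devices, in a stable order.
    candidates = sorted(
        dev for dev in available if dev not in faulty and dev not in kept
    )
    if len(candidates) < needed:
        # Fail explicitly instead of restarting with a broken allocation.
        raise RestartAllocationError(
            f"need {needed} replacement device(s), found {len(candidates)}"
        )
    return kept + candidates[:needed]
```

For example, `rebuild_allocation(["gpu0", "gpu1"], {"gpu2", "gpu3"}, {"gpu1"})` keeps `gpu0` and swaps the faulty `gpu1` for `gpu2`, while the same call with an empty `available` set raises `RestartAllocationError`.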