Description
Hi,
We have a JupyterHub instance (v5.0.0) which spawns its server jobs on an HTCondor batch cluster.
Occasionally we run into deadlocked user accounts: the account still has a server route, but the server job is gone and the hub does not notice. The only solution currently is to delete the user account completely to get rid of the route.
Today we had another case. Going through the logs, I understand that (in this case) it is caused by the hub somehow detecting a server stop and removing the route, but re-adding it shortly thereafter because it is "missing", meaning the server entry is probably still in the db even after the route is removed.
This feels like a bug to me, maybe a timing issue? If routes and servers are removed together, it seems like the deadlock would be avoided. Hoping for more insights...
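To make the suspicion concrete, here is a toy model of the sequence I think is happening. It is only a sketch and not JupyterHub's actual code: `ToyHub`, `on_server_stopped`, `check_routes` and the `cleanup_completes` flag are hypothetical names, and the assumption that cleanup gets interrupted between the two steps is my guess from the logs.

```python
# Toy model of the suspected race; NOT JupyterHub's real implementation.
# All class, method and parameter names here are hypothetical.

class ToyHub:
    def __init__(self):
        self.db_servers = {}    # user -> server URL (stands in for the hub db)
        self.proxy_routes = {}  # route prefix -> target (stands in for the proxy)

    def spawn(self, user, url):
        self.db_servers[user] = url
        self.proxy_routes[f"/user/{user}/"] = url

    def on_server_stopped(self, user, cleanup_completes):
        # Step 1: the route is removed from the proxy.
        self.proxy_routes.pop(f"/user/{user}/", None)
        # Step 2: the server entry is removed from the db, unless the
        # cleanup is interrupted (assumed: by an exception) before this point.
        if cleanup_completes:
            self.db_servers.pop(user, None)

    def check_routes(self):
        # Periodic reconciliation: every server in the db that has no
        # route gets its route added back.
        added = []
        for user, url in self.db_servers.items():
            prefix = f"/user/{user}/"
            if prefix not in self.proxy_routes:
                self.proxy_routes[prefix] = url
                added.append(prefix)
        return added


def simulate(cleanup_completes):
    hub = ToyHub()
    hub.spawn("lodott", "http://batchj004:40002/user/lodott/")
    hub.on_server_stopped("lodott", cleanup_completes)
    return hub.check_routes()
```

In this model, `simulate(True)` re-adds nothing, while `simulate(False)` puts `/user/lodott/` back even though the server is gone, which matches what the logs show.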
Best
Kruno
Here are the logs (jupyterhub + htcondor), boiled down to the relevant parts, with some comments added by me:
# last poll for the job
[D 2024-11-15 15:26:50.332 JupyterHub batchspawner:314] Spawner querying job: sudo -E -u lodott /var/lib/condor/util/condor_q.sh 1612
# server stop detected by jhub - don't know how/why
# there should have been two more polls here (one every 30 secs)
[W 2024-11-15 15:28:17.258 JupyterHub base:1290] User lodott server stopped, with exit code: 1
# jhub requests removal of the route from the proxy
[I 2024-11-15 15:28:17.258 JupyterHub proxy:356] Removing user lodott from proxy (/user/lodott/)
# user requests are ongoing but eventually run into a Timeout exception
# I think the exception might be responsible for killing the poll here
[E 2024-11-15 15:28:55.991 JupyterHub spawner:1459] Unhandled error in poll callback for <CondorOptionsSpawner object at 0x7fbe07e0bf50>
...
TimeoutError: Repeated api_request to proxy path "/user/lodott" failed.
### lots of client requests with 302 followed by 500, because there is no route
# periodic route check puts the route back because it's "missing" - the server entry in the db was apparently not deleted
[D 2024-11-15 15:31:16.190 JupyterHub proxy:392] Checking routes
...
[W 2024-11-15 15:31:16.208 JupyterHub proxy:416] Adding missing route for /user/lodott/ (Server(url=http://batchj004:40002/user/lodott/, bind_url=http://batchj004:40002/user/lodott/))
### client requests with 302 which are now routed again but server state is unclear
# htcondor actually removes the notebook job, because its runtime exceeded the reservation
11/15/24 17:39:11 (1612.0) (1799193): Job 1612.0 is being removed: Job removed by SYSTEM_PERIODIC_REMOVE due to Job runtime longer than reserved.
# at this stage the htcondor job is gone but since there is no polling, the jhub never notices - deadlock
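In case it helps others check whether they are hit by the same thing, here is a small sketch that scans hub log lines for the pattern above: a "Removing user ... from proxy" message followed later by an "Adding missing route" warning for the same user. The function name `find_readded_routes` is made up, and the two regexes are based only on the message formats seen in my logs, so they may need adjusting for other versions. A user who simply restarted their server in between could show up as a false positive.

```python
import re

# Patterns taken from the two log messages quoted above (assumed stable).
REMOVED = re.compile(r"Removing user (\S+) from proxy")
READDED = re.compile(r"Adding missing route for /user/([^/\s]+)/")


def find_readded_routes(lines):
    """Return users whose route was removed and later re-added as 'missing'."""
    removed = set()
    suspects = []
    for line in lines:
        match = REMOVED.search(line)
        if match:
            removed.add(match.group(1))
            continue
        match = READDED.search(line)
        if match and match.group(1) in removed and match.group(1) not in suspects:
            suspects.append(match.group(1))
    return suspects
```

Usage would be something like `find_readded_routes(open("jupyterhub.log"))`, where the log file name is just an example.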