
deadlocked accounts - jhub removes route but does not delete server? #314

Open
@lodott72


Hi,

we have a JupyterHub instance (v5.0.0) which spawns its server jobs on an HTCondor batch cluster.

Occasionally we run into deadlocked user accounts: the account still has a server route, but the server job is gone and the jhub does not notice. The only solution currently is to delete the user account completely to get rid of the route.

Today we had another case. Going through the logs, I understand that (in this case) it is caused by the jhub somehow detecting a server stop and removing the route, but re-adding it shortly thereafter because it is "missing". That would mean the server entry is probably still in the db even after the route is removed.

This feels like a bug to me, maybe a timing issue? If routes and servers are removed together, it seems like the deadlock would be avoided. Hoping for more insights...
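To make the suspected sequence concrete, here is a toy model of it. It is only a sketch under assumptions: `ToyHub`, `on_server_stopped`, and `check_routes` are hypothetical names, not JupyterHub code, and the real hub may behave differently. It shows that if only the route is removed while the db entry stays, a periodic route check would put the route back, whereas removing both together would not.

```python
# Toy model only: hypothetical classes, not JupyterHub code.

class ToyHub:
    def __init__(self):
        self.db_servers = {}    # user -> server URL (stands in for the hub db)
        self.proxy_routes = {}  # routespec -> target URL (stands in for the proxy)

    def start(self, user, url):
        self.db_servers[user] = url
        self.proxy_routes[f"/user/{user}/"] = url

    def on_server_stopped(self, user, delete_db_entry):
        # The route is always removed; the db entry only when requested.
        self.proxy_routes.pop(f"/user/{user}/", None)
        if delete_db_entry:
            self.db_servers.pop(user, None)

    def check_routes(self):
        # Periodic reconciliation: add a route for every server the db still lists.
        added = []
        for user, url in self.db_servers.items():
            spec = f"/user/{user}/"
            if spec not in self.proxy_routes:
                self.proxy_routes[spec] = url
                added.append(spec)
        return added


def simulate(delete_db_entry):
    hub = ToyHub()
    hub.start("lodott", "http://batchj004:40002/user/lodott/")
    hub.on_server_stopped("lodott", delete_db_entry=delete_db_entry)
    return hub.check_routes()


stale = simulate(delete_db_entry=False)  # route removed, db entry kept
clean = simulate(delete_db_entry=True)   # route and db entry removed together
print("re-added when db entry kept:", stale)
print("re-added when removed together:", clean)
```

In this model the first case re-adds `/user/lodott/`, matching the "Adding missing route" warning in the logs, while the second case leaves nothing to re-add.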

Best
Kruno

Here are the logs (jupyterhub + htcondor), boiled down to the relevant parts, with some comments added by me:

# last poll for the job
[D 2024-11-15 15:26:50.332 JupyterHub batchspawner:314] Spawner querying job: sudo -E -u lodott /var/lib/condor/util/condor_q.sh 1612
# server stop detected by jhub - don't know how/why
# there should have been two more polls here (every 30secs)
[W 2024-11-15 15:28:17.258 JupyterHub base:1290] User lodott server stopped, with exit code: 1
# jhub requests route removal from the proxy
[I 2024-11-15 15:28:17.258 JupyterHub proxy:356] Removing user lodott from proxy (/user/lodott/)
# user requests are ongoing but eventually run into a Timeout exception
# I think the exception might be responsible for killing the poll here
[E 2024-11-15 15:28:55.991 JupyterHub spawner:1459] Unhandled error in poll callback for <CondorOptionsSpawner object at 0x7fbe07e0bf50>
...
TimeoutError: Repeated api_request to proxy path "/user/lodott" failed.
### lots of client requests with 302 followed by 500, because there is no route
# periodic route check puts the route back because it's "missing" - the server entry in the db was apparently not deleted
[D 2024-11-15 15:31:16.190 JupyterHub proxy:392] Checking routes
...
[W 2024-11-15 15:31:16.208 JupyterHub proxy:416] Adding missing route for /user/lodott/ (Server(url=http://batchj004:40002/user/lodott/, bind_url=http://batchj004:40002/user/lodott/))
### client requests with 302 which are now routed again but server state is unclear
# htcondor actually removes the notebook job because it exceeded its reserved runtime
11/15/24 17:39:11 (1612.0) (1799193): Job 1612.0 is being removed: Job removed by SYSTEM_PERIODIC_REMOVE due to Job runtime longer than reserved.
# at this stage the htcondor job is gone but since there is no polling, the jhub never notices - deadlock
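The last part of the log annotations can be sketched the same way. This is a toy model under stated assumptions, not JupyterHub or batchspawner code: `ToyPoller`, `tick`, and `failing_cleanup` are hypothetical, and it is an assumption that polling ends once a stop has been reported. It illustrates how a cleanup that fails midway could leave the db entry behind while nothing observes the later job removal.

```python
# Toy model only: hypothetical names, not JupyterHub or batchspawner code.

class ToyPoller:
    def __init__(self, on_stop):
        self.polling = True
        self.on_stop = on_stop
        self.errors = []
        self.polls = 0

    def tick(self, job_status):
        """Run one poll interval; job_status is None while the job is running."""
        if not self.polling:
            return              # polling has ended, job state is no longer observed
        self.polls += 1
        if job_status is None:
            return              # still running, nothing to do
        self.polling = False    # assumption: polling ends once a stop is reported
        try:
            self.on_stop()
        except Exception as exc:
            # The error is recorded, but the cleanup stays half-done.
            self.errors.append(repr(exc))


state = {"route": True, "db_entry": True}


def failing_cleanup():
    state["route"] = False                      # step 1: remove the proxy route
    raise TimeoutError("proxy request failed")  # step 2 (delete db entry) never runs


poller = ToyPoller(failing_cleanup)
poller.tick(None)        # job running
poller.tick(1)           # stop reported (exit code 1), cleanup fails midway
poller.tick("removed")   # batch system removes the job later: not observed
print(state, poller.polling, poller.polls, poller.errors)
```

In this sketch the third `tick` is a no-op, so the db entry survives and the later removal of the job goes unnoticed, which is the deadlock described above.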
