Zenith server registrar pod needs restarting every 90 days on AWS EKS #1125

@jcwomack

Bristol Centre for Supercomputing (BriCS) have Zenith server (currently latest release 0.15.0) deployed in AWS EKS and use this to establish tunnels to proxy users from the public internet to HTTP-based services running on BriCS facilities, using persistent URL path prefixes to register services (e.g. https://apps.isambard.ac.uk/servicename).

Having had Zenith server deployed for a number of months now, we have encountered an infrequent but predictable issue: established Zenith tunnels fail 90 days after the Zenith server registrar pod was started.

When this occurs the services.zenith.stackhpc.com and endpoints.zenith.stackhpc.com objects in the zenith-services namespace corresponding to previously registered services remain present, but there are no pods in the namespace. This suggests the presence of a reserved subdomain/subpath, but no established tunnel.

When the issue occurs the Zenith server registrar pod logs contain errors indicating an issue communicating with the K8s API, e.g.

INFO:     127.0.0.6:47361 - "POST /admin/verify HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/venv/lib/python3.10/site-packages/easykube/kubernetes/client/client.py", line 34, in raise_for_status
    yield super().raise_for_status(response)
  File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 57, in execute_flow
    to_send = await yielded_obj
  File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 49, in execute_flow
    yielded_obj = action(to_send)
  File "/venv/lib/python3.10/site-packages/easykube/rest/client.py", line 54, in raise_for_status
    raise exc
  File "/venv/lib/python3.10/site-packages/easykube/rest/client.py", line 47, in raise_for_status
    response.raise_for_status()
  File "/venv/lib/python3.10/site-packages/httpx/_models.py", line 761, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '401 Unauthorized' for url 'https://10.100.0.1/apis/zenith.stackhpc.com/v1alpha1/namespaces/zenith-services/services?labelSelector=zenith.stackhpc.com%2Ffingerprint%3Dfpe6-uzs-sa_o_pg_bZSe5mh7Hjv6oqt9YVxK58ulQWD4'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/401

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/venv/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 396, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/venv/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/venv/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/venv/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/venv/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/venv/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/venv/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
    response = await f(request)
  File "/venv/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/venv/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/venv/lib/python3.10/site-packages/zenith/registrar/app.py", line 198, in verify_subdomain
    subdomain = await backend.subdomain_for_public_key(fingerprint_bytes(req.public_key))
  File "/venv/lib/python3.10/site-packages/zenith/registrar/backends/crd.py", line 107, in subdomain_for_public_key
    services = [service async for service in ekresource.list(labels = labels)]
  File "/venv/lib/python3.10/site-packages/zenith/registrar/backends/crd.py", line 107, in <listcomp>
    services = [service async for service in ekresource.list(labels = labels)]
  File "/venv/lib/python3.10/site-packages/easykube/rest/iterators.py", line 88, in __anext__
    return await self._next_item()
  File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 49, in execute_flow
    yielded_obj = action(to_send)
  File "/venv/lib/python3.10/site-packages/easykube/rest/iterators.py", line 61, in _next_item
    response = yield self._client.get(self._next_url, params = self._next_params)
  File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 57, in execute_flow
    to_send = await yielded_obj
  File "/venv/lib/python3.10/site-packages/httpx/_client.py", line 1801, in get
    return await self.request(
  File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 49, in execute_flow
    yielded_obj = action(to_send)
  File "/venv/lib/python3.10/site-packages/easykube/rest/client.py", line 30, in request
    return (yield super().request(method, url, **kwargs))
  File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 57, in execute_flow
    to_send = await yielded_obj
  File "/venv/lib/python3.10/site-packages/httpx/_client.py", line 1574, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 49, in execute_flow
    yielded_obj = action(to_send)
  File "/venv/lib/python3.10/site-packages/easykube/rest/client.py", line 38, in send
    yield self.raise_for_status(response)
  File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 57, in execute_flow
    to_send = await yielded_obj
  File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 49, in execute_flow
    yielded_obj = action(to_send)
  File "/venv/lib/python3.10/site-packages/easykube/kubernetes/client/client.py", line 36, in raise_for_status
    raise ApiError(source)
easykube.kubernetes.client.errors.ApiError: Unauthorized

The registrar container seems to lose the ability to communicate with the K8s API at 10.100.0.1, i.e. the cluster IP of the kubernetes service in the default namespace:

$ k get -n default service/kubernetes
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   109d
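
One possible line of investigation (an assumption, not something confirmed in this report) is that the 401 relates to the age of the service account token the registrar presents to the API server. The sketch below decodes the unverified claims of a JWT and reports how long ago it was issued. The function names are hypothetical, and it assumes the token is a standard three-segment JWT carrying an `iat` claim, such as the one Kubernetes normally mounts at /var/run/secrets/kubernetes.io/serviceaccount/token.

```python
import base64
import json
from datetime import datetime, timezone


def decode_jwt_claims(token: str) -> dict:
    """Return the claims from a JWT payload WITHOUT verifying the signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore the padding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def token_age_days(claims: dict, now: datetime) -> float:
    """Days elapsed between the token's `iat` (issued-at) claim and `now`."""
    issued = datetime.fromtimestamp(claims["iat"], tz=timezone.utc)
    return (now - issued).total_seconds() / 86400
```

Run inside the registrar container against the contents of the mounted token file, this would show whether the token on disk is fresh. If it is, while the process still gets 401 responses, the process may be holding on to an older copy that it read at startup.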

We found we can resolve this issue by restarting the registrar pod, e.g.

$ k scale --replicas=0 -n zenith deployment/zenith-server-registrar
$ k scale --replicas=1 -n zenith deployment/zenith-server-registrar

After the restart, the registrar pod can communicate with the K8s API again, and Zenith tunnels for the existing registered subdomains/subpaths can be re-established.
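
As a stopgap, the restart could be scheduled ahead of the 90-day mark instead of waiting for tunnels to fail. This is a minimal sketch, assuming the failure always occurs 90 days after the registrar pod starts, as observed above. The function names and the 7-day safety margin are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Interval after pod start at which tunnels were observed to fail.
FAILURE_AFTER = timedelta(days=90)


def restart_deadline(pod_started: datetime,
                     margin: timedelta = timedelta(days=7)) -> datetime:
    """Latest time by which the registrar pod should be restarted."""
    return pod_started + FAILURE_AFTER - margin


def restart_due(pod_started: datetime, now: datetime,
                margin: timedelta = timedelta(days=7)) -> bool:
    """True once `now` has reached the restart deadline."""
    return now >= restart_deadline(pod_started, margin)
```

For a pod started at 2024-01-01T00:00:00Z, `restart_deadline` gives 2024-03-24T00:00:00Z with the default margin. The pod start time can be read from the pod's `status.startTime` field.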
