Bristol Centre for Supercomputing (BriCS) runs Zenith server (currently the latest release, 0.15.0) in AWS EKS. We use it to establish tunnels that proxy users from the public internet to HTTP-based services running on BriCS facilities, with persistent URL path prefixes used to register services (e.g. https://apps.isambard.ac.uk/servicename).
Having had Zenith server deployed for a number of months now, we have encountered an infrequent but predictable issue: established Zenith tunnels fail 90 days after the Zenith server registrar pod was started.
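Because the failure is tied to the registrar pod's start time, the date it is due can be estimated in advance. The following is a minimal sketch of that arithmetic; the 90-day window is taken from the observation above, and the function names and the example start time are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the window is exactly the 90 days observed in this report.
FAILURE_WINDOW = timedelta(days=90)


def expected_failure_time(pod_started_at: datetime) -> datetime:
    """Return the time at which tunnels are expected to start failing."""
    return pod_started_at + FAILURE_WINDOW


def days_remaining(pod_started_at: datetime, now: datetime) -> float:
    """Return the (possibly negative) number of days left before that time."""
    remaining = expected_failure_time(pod_started_at) - now
    return remaining.total_seconds() / 86400


if __name__ == "__main__":
    # Hypothetical pod start time, for illustration only.
    started = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(expected_failure_time(started).isoformat())
    print(days_remaining(started, datetime.now(timezone.utc)))
```

The pod start time itself can be read from the pod's `status.startTime` field.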
When this occurs the services.zenith.stackhpc.com and endpoints.zenith.stackhpc.com objects in the zenith-services namespace corresponding to previously registered services remain present, but there are no pods in the namespace. This suggests the presence of a reserved subdomain/subpath, but no established tunnel.
When the issue occurs the Zenith server registrar pod logs contain errors indicating an issue communicating with the K8s API, e.g.
INFO: 127.0.0.6:47361 - "POST /admin/verify HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/venv/lib/python3.10/site-packages/easykube/kubernetes/client/client.py", line 34, in raise_for_status
yield super().raise_for_status(response)
File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 57, in execute_flow
to_send = await yielded_obj
File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 49, in execute_flow
yielded_obj = action(to_send)
File "/venv/lib/python3.10/site-packages/easykube/rest/client.py", line 54, in raise_for_status
raise exc
File "/venv/lib/python3.10/site-packages/easykube/rest/client.py", line 47, in raise_for_status
response.raise_for_status()
File "/venv/lib/python3.10/site-packages/httpx/_models.py", line 761, in raise_for_status
raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '401 Unauthorized' for url 'https://10.100.0.1/apis/zenith.stackhpc.com/v1alpha1/namespaces/zenith-services/services?labelSelector=zenith.stackhpc.com%2Ffingerprint%3Dfpe6-uzs-sa_o_pg_bZSe5mh7Hjv6oqt9YVxK58ulQWD4'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/401
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/venv/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 396, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/venv/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/venv/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/venv/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/venv/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/venv/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/venv/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
response = await f(request)
File "/venv/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/venv/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/venv/lib/python3.10/site-packages/zenith/registrar/app.py", line 198, in verify_subdomain
subdomain = await backend.subdomain_for_public_key(fingerprint_bytes(req.public_key))
File "/venv/lib/python3.10/site-packages/zenith/registrar/backends/crd.py", line 107, in subdomain_for_public_key
services = [service async for service in ekresource.list(labels = labels)]
File "/venv/lib/python3.10/site-packages/zenith/registrar/backends/crd.py", line 107, in <listcomp>
services = [service async for service in ekresource.list(labels = labels)]
File "/venv/lib/python3.10/site-packages/easykube/rest/iterators.py", line 88, in __anext__
return await self._next_item()
File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 49, in execute_flow
yielded_obj = action(to_send)
File "/venv/lib/python3.10/site-packages/easykube/rest/iterators.py", line 61, in _next_item
response = yield self._client.get(self._next_url, params = self._next_params)
File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 57, in execute_flow
to_send = await yielded_obj
File "/venv/lib/python3.10/site-packages/httpx/_client.py", line 1801, in get
return await self.request(
File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 49, in execute_flow
yielded_obj = action(to_send)
File "/venv/lib/python3.10/site-packages/easykube/rest/client.py", line 30, in request
return (yield super().request(method, url, **kwargs))
File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 57, in execute_flow
to_send = await yielded_obj
File "/venv/lib/python3.10/site-packages/httpx/_client.py", line 1574, in request
return await self.send(request, auth=auth, follow_redirects=follow_redirects)
File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 49, in execute_flow
yielded_obj = action(to_send)
File "/venv/lib/python3.10/site-packages/easykube/rest/client.py", line 38, in send
yield self.raise_for_status(response)
File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 57, in execute_flow
to_send = await yielded_obj
File "/venv/lib/python3.10/site-packages/easykube/flow.py", line 49, in execute_flow
yielded_obj = action(to_send)
File "/venv/lib/python3.10/site-packages/easykube/kubernetes/client/client.py", line 36, in raise_for_status
raise ApiError(source)
easykube.kubernetes.client.errors.ApiError: Unauthorized
The registrar container seems to lose the ability to communicate with the K8s API at 10.100.0.1, which is the cluster IP of the kubernetes service, i.e.
$ k get -n default service/kubernetes
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   109d
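One possible explanation, which the logs above do not confirm, is that the client reads its service account token once at startup and keeps using it. The API server is reachable (it answers with 401 rather than timing out), and to our understanding EKS rejects service account tokens older than 90 days, while the kubelet periodically rewrites the projected token file inside the pod. If that is the cause, re-reading the token file periodically would avoid the failure. The sketch below illustrates the idea using only the standard library; `FileTokenSource` and its parameters are hypothetical and are not part of Zenith or easykube.

```python
import time
from pathlib import Path

# Standard in-cluster location of the projected service account token.
DEFAULT_TOKEN_PATH = Path("/var/run/secrets/kubernetes.io/serviceaccount/token")


class FileTokenSource:
    """Hypothetical helper that re-reads a token file once it is older than max_age_seconds."""

    def __init__(self, path=DEFAULT_TOKEN_PATH, max_age_seconds=60.0, clock=time.monotonic):
        self._path = Path(path)
        self._max_age = max_age_seconds
        self._clock = clock  # injectable so the refresh logic can be tested
        self._token = None
        self._read_at = 0.0

    def token(self) -> str:
        """Return the cached token, refreshing it from disk when it is stale."""
        now = self._clock()
        if self._token is None or now - self._read_at >= self._max_age:
            self._token = self._path.read_text().strip()
            self._read_at = now
        return self._token

    def auth_header(self) -> dict:
        """Build the Authorization header to attach to each API request."""
        return {"Authorization": f"Bearer {self.token()}"}
```

A client would call `auth_header()` for every request instead of storing the header once, so a rotated token is picked up within `max_age_seconds` without restarting the pod.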
We found we can resolve this issue by restarting the registrar pod, e.g.
$ k scale --replicas=0 -n zenith deployment/zenith-server-registrar
$ k scale --replicas=1 -n zenith deployment/zenith-server-registrar
After the restart, the registrar pod can communicate with the K8s API again, and Zenith tunnels for the existing registered subdomains/subpaths can be re-established.