Description
What's the issue?
During a deployment of a new code location version, all webserver replicas (2 read/write + 1 readonly) display this error in the UI, and they keep displaying it even after the code location has been up for 10+ minutes.
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 180 seconds to complete.
File "/usr/local/lib/python3.12/site-packages/dagster/_core/workspace/context.py", line 820, in _load_location
else origin.create_location(self.instance)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/dagster/_core/remote_representation/origin.py", line 350, in create_location
return GrpcServerCodeLocation(self, instance=instance)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/dagster/_core/remote_representation/code_location.py", line 731, in __init__
self._repository_snaps = sync_get_streaming_external_repositories_data_grpc(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/dagster/_api/snapshot_repository.py", line 23, in sync_get_streaming_external_repositories_data_grpc
external_repository_chunks = list(
^^^^^
File "/usr/local/lib/python3.12/site-packages/dagster/_grpc/client.py", line 456, in streaming_external_repository
for res in self._streaming_query(
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/dagster/_grpc/client.py", line 260, in _streaming_query
self._raise_grpc_exception(
File "/usr/local/lib/python3.12/site-packages/dagster/_grpc/client.py", line 183, in _raise_grpc_exception
raise DagsterUserCodeUnreachableError(
The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2026-01-15T21:52:59.810148227+00:00", grpc_status:4, grpc_message:"Deadline Exceeded"}"
>
File "/usr/local/lib/python3.12/site-packages/dagster/_grpc/client.py", line 256, in _streaming_query
yield from self._get_streaming_response(
File "/usr/local/lib/python3.12/site-packages/dagster/_grpc/client.py", line 232, in _get_streaming_response
yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
File "/usr/local/lib/python3.12/site-packages/grpc/_channel.py", line 543, in __next__
return self._next()
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/grpc/_channel.py", line 969, in _next
raise self
The above exception occurred during handling of the following exception:
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
File "/usr/local/lib/python3.12/site-packages/dagster/_grpc/server_watcher.py", line 120, in watch_grpc_server_thread
watch_for_changes()
File "/usr/local/lib/python3.12/site-packages/dagster/_grpc/server_watcher.py", line 83, in watch_for_changes
new_server_id = client.get_server_id(timeout=REQUEST_TIMEOUT)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/dagster/_grpc/client.py", line 311, in get_server_id
res = self._query("GetServerId", dagster_api_pb2.Empty, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/dagster/_grpc/client.py", line 205, in _query
self._raise_grpc_exception(
File "/usr/local/lib/python3.12/site-packages/dagster/_grpc/client.py", line 188, in _raise_grpc_exception
raise DagsterUserCodeUnreachableError(
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:172.20.155.159:3030: Failed to connect to remote host: connect: Connection refused (111)"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:172.20.155.159:3030: Failed to connect to remote host: connect: Connection refused (111)", grpc_status:14, created_time:"2026-01-15T21:48:57.490484304+00:00"}"
>
File "/usr/local/lib/python3.12/site-packages/dagster/_grpc/client.py", line 203, in _query
return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/dagster/_grpc/client.py", line 163, in _get_response
return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/grpc/_channel.py", line 1181, in __call__
return _end_unary_response_blocking(state, call, False, None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
None of the webservers recover, even after waiting for 3+ hours. The daemon, however, functions as expected during this time: all scheduled jobs run on time. The daemon logs also confirm that it sees the new code location shortly after the deployment finishes.
The code location logs indicate it took ~40 seconds to start, so I'm not sure where the 180 second timeout from the webserver to the code location is being hit.
2026-01-16 17:01:11 +0000 - dagster.code_server - INFO - Starting Dagster code server for package dagster_project on port 3030 in process 1
2026-01-16 17:01:13 +0000 - dagster - WARNING - /usr/local/lib/python3.10/site-packages/dagster/_core/definitions/antlr_asset_selection/antlr_asset_selection.py:80: BetaWarning: Parameter `include_sources` of function `AssetSelection.all` is currently in beta, and may have breaking changes in minor version releases, with behavior changes in patch releases.
return AssetSelection.all(include_sources=self.include_sources) - selection
2026-01-16 17:01:20 +0000 - dagster - WARNING - /usr/local/lib/python3.10/site-packages/dagster/_core/definitions/antlr_asset_selection/antlr_asset_selection.py:124: BetaWarning: Parameter `include_sources` of function `AssetSelection.tag` is currently in beta, and may have breaking changes in minor version releases, with behavior changes in patch releases.
return AssetSelection.tag(key, value, include_sources=self.include_sources)
WARNING:dagster:/usr/local/lib/python3.10/site-packages/dagster/_core/definitions/antlr_asset_selection/antlr_asset_selection.py:124: BetaWarning: Parameter `include_sources` of function `AssetSelection.tag` is currently in beta, and may have breaking changes in minor version releases, with behavior changes in patch releases.
return AssetSelection.tag(key, value, include_sources=self.include_sources)
2026-01-16 17:01:21 +0000 - dagster - WARNING - /usr/local/lib/python3.10/site-packages/dagster/_core/definitions/resolved_asset_deps.py:24: BetaWarning: Asset [<redacted>]'s dependency '[<redacted>]' was resolved to upstream asset [<redacted>], because the name matches and they're in the same group. This is a beta functionality that may change in a future release is currently in beta, and may have breaking changes in minor version releases, with behavior changes in patch releases.
self._deps_by_assets_def_id = resolve_assets_def_deps(assets_defs, source_assets)
WARNING:dagster:/usr/local/lib/python3.10/site-packages/dagster/_core/definitions/resolved_asset_deps.py:24: BetaWarning: Asset [<redacted>]'s dependency '[<redacted>' was resolved to upstream asset [<redacted>], because the name matches and they're in the same group. This is a beta functionality that may change in a future release is currently in beta, and may have breaking changes in minor version releases, with behavior changes in patch releases.
self._deps_by_assets_def_id = resolve_assets_def_deps(assets_defs, source_assets)
17:01:51 The selection criterion '' does not match any enabled nodes
17:01:51 The selection criterion '' does not match any enabled nodes
17:01:52 The selection criterion '' does not match any enabled nodes
17:01:52 The selection criterion '' does not match any enabled nodes
2026-01-16 17:01:52 +0000 - dagster.code_server - INFO - Started Dagster code server for package dagster_project on port 3030 in process 1
INFO:dagster.code_server:Started Dagster code server for package dagster_project on port 3030 in process 1
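As a sanity check on the "~40 seconds" figure, the startup time can be computed from the first and last `dagster.code_server` log lines above. This is a minimal sketch: the two timestamps are copied from those log lines, while the function name and format constant are illustrative only.

```python
from datetime import datetime

# Timestamps copied from the "Starting" and "Started" code server log lines.
STARTING = "2026-01-16 17:01:11 +0000"
STARTED = "2026-01-16 17:01:52 +0000"

# Format of the timestamp prefix in the code server logs shown above.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def startup_seconds(starting: str, started: str) -> float:
    """Return the seconds elapsed between two log timestamps."""
    begin = datetime.strptime(starting, LOG_TIME_FORMAT)
    end = datetime.strptime(started, LOG_TIME_FORMAT)
    return (end - begin).total_seconds()


elapsed = startup_seconds(STARTING, STARTED)
print(f"code server startup took {elapsed:.0f} seconds")
```

The result is well under the 180 second timeout reported by the webserver, which is why the timeout is surprising.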
This started appearing consistently over the past month, just for this particular code location. We have a few other locations that are not impacted by this problem.
Any pointers on where the issue might be happening?
- Where is the 180 second timeout being reached, if the code location only takes ~40 seconds to start and serve requests?
- Why do the webservers not retry with a "fresh" connection to the code location endpoint? Once the initial timeout is hit, the webserver pods emit no further error logs, which suggests that no more attempts are made.
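One thing that may be worth ruling out is plain network reachability from a webserver pod to the code location after the rollout. The sketch below uses only the Python standard library and does not speak gRPC or use any Dagster API; it resolves a host name and attempts a TCP connection to each resolved address. The function name `probe_tcp` and the service name in the usage comment are hypothetical; port 3030 is the port from the logs above.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> list[tuple[str, bool, str]]:
    """Resolve host and try a TCP connect to each resolved address.

    Returns (address, reachable, detail) tuples. This only checks that
    something accepts connections on the port; it does not speak gRPC.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [(host, False, f"DNS resolution failed: {exc}")]

    results = []
    seen = set()
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        address = sockaddr[0]
        if address in seen:
            continue  # skip duplicate addresses returned by the resolver
        seen.add(address)
        try:
            with socket.create_connection((address, port), timeout=timeout):
                results.append((address, True, "connected"))
        except OSError as exc:
            results.append((address, False, str(exc)))
    return results


# Example usage from inside a webserver pod (hypothetical service name):
# for address, ok, detail in probe_tcp("my-code-location.dagster.svc.cluster.local", 3030):
#     print(address, "reachable" if ok else "unreachable", detail)
```

If the probe connects while the UI still shows the error, that would suggest the problem is in how the webserver reuses or refreshes its existing connection rather than in the network path. That is only a hypothesis, not something the logs confirm.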
What did you expect to happen?
I expected the webservers to recover and see the new code location pod, as the daemon does. Based on the logs, it appears that after hitting the 180 second timeout (and I'm not sure how that timeout is being hit), the webserver does not retry the connection.
How to reproduce?
Not sure if this can be deterministically reproduced. Steps to follow:
- Use an existing Dagster setup in Kubernetes, with 3 webserver replicas and a single code location.
- Deploy a new code location image in Kubernetes. Wait for the deployment to cycle the pod.
- See connection errors on the webserver UI. This step is expected, since during the deployment, the connection is briefly lost.
- See that the code location has started serving requests and that the daemon has resumed normal ticks based on logs. The webservers, however, still do not properly communicate with the code location, even after waiting for 10+ minutes.
Dagster version
1.10.11
Deployment type
Dagster Helm chart
Deployment details
- Kubernetes 1.31 (EKS)
- Dagster Helm chart
- Custom webserver image using Python 3.12 and installing the dagster-webserver package
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.