
pod healthy but /metrics failing #365

@huguesgr

Description

I'm noticing the following behavior in some cases after a Redis failover has happened in the cluster.

The /metrics endpoint fails:

[2025-08-18 04:20:38,475] ERROR in app: Exception on /metrics [GET]
Traceback (most recent call last):
  File "/app/.venv/lib/python3.13/site-packages/redis/connection.py", line 644, in read_response
    response = self._parser.read_response(disable_decoding=disable_decoding)
  File "/app/.venv/lib/python3.13/site-packages/redis/_parsers/resp2.py", line 15, in read_response
    result = self._read_response(disable_decoding=disable_decoding)
  File "/app/.venv/lib/python3.13/site-packages/redis/_parsers/resp2.py", line 25, in _read_response
    raw = self._buffer.readline()
  File "/app/.venv/lib/python3.13/site-packages/redis/_parsers/socket.py", line 115, in readline
    self._read_from_socket()
    ~~~~~~~~~~~~~~~~~~~~~~^^
  File "/app/.venv/lib/python3.13/site-packages/redis/_parsers/socket.py", line 65, in _read_from_socket
    data = self._sock.recv(socket_read_size)
OSError: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/.venv/lib/python3.13/site-packages/flask/app.py", line 1511, in wsgi_app
    response = self.full_dispatch_request()
  File "/app/.venv/lib/python3.13/site-packages/flask/app.py", line 919, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/app/.venv/lib/python3.13/site-packages/flask/app.py", line 917, in full_dispatch_request
    rv = self.dispatch_request()
  File "/app/.venv/lib/python3.13/site-packages/flask/app.py", line 902, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/app/src/http_server.py", line 32, in metrics
    current_app.config["metrics_puller"]()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/app/src/exporter.py", line 156, in scrape
    self.track_queue_metrics()
    ~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/app/src/exporter.py", line 238, in track_queue_metrics
    for worker, stats in (self.app.control.inspect().stats() or {}).items()
                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/app/.venv/lib/python3.13/site-packages/celery/app/control.py", line 243, in stats
    return self._request('stats')
           ~~~~~~~~~~~~~^^^^^^^^^
  File "/app/.venv/lib/python3.13/site-packages/celery/app/control.py", line 106, in _request
    return self._prepare(self.app.control.broadcast(
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~^
        command,
        ^^^^^^^^
    ...<6 lines>...
        pattern=self.pattern, matcher=self.matcher,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ))
    ^
  File "/app/.venv/lib/python3.13/site-packages/celery/app/control.py", line 777, in broadcast
    return self.mailbox(conn)._broadcast(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        command, arguments, destination, reply, timeout,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        limit, callback, channel=channel,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/app/.venv/lib/python3.13/site-packages/kombu/pidbox.py", line 337, in _broadcast
    self._publish(command, arguments, destination=destination,
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                  reply_ticket=reply_ticket,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<3 lines>...
                  pattern=pattern,
                  ^^^^^^^^^^^^^^^^
                  matcher=matcher)
                  ^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.13/site-packages/kombu/pidbox.py", line 299, in _publish
    maybe_declare(self.reply_queue(chan))
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.13/site-packages/kombu/common.py", line 113, in maybe_declare
    return _maybe_declare(entity, channel)
  File "/app/.venv/lib/python3.13/site-packages/kombu/common.py", line 155, in _maybe_declare
    entity.declare(channel=channel)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.13/site-packages/kombu/entity.py", line 617, in declare
    self._create_queue(nowait=nowait, channel=channel)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.13/site-packages/kombu/entity.py", line 626, in _create_queue
    self.queue_declare(nowait=nowait, passive=False, channel=channel)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.13/site-packages/kombu/entity.py", line 655, in queue_declare
    ret = channel.queue_declare(
        queue=self.name,
    ...<5 lines>...
        nowait=nowait,
    )
  File "/app/.venv/lib/python3.13/site-packages/kombu/transport/virtual/base.py", line 538, in queue_declare
    return queue_declare_ok_t(queue, self._size(queue), 0)
                                     ~~~~~~~~~~^^^^^^^
  File "/app/.venv/lib/python3.13/site-packages/kombu/transport/redis.py", line 1012, in _size
    sizes = pipe.execute()
  File "/app/.venv/lib/python3.13/site-packages/redis/client.py", line 1613, in execute
    return conn.retry.call_with_retry(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~^
        lambda: execute(conn, stack, raise_on_error),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        lambda error: self._disconnect_raise_on_watching(conn, error),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/app/.venv/lib/python3.13/site-packages/redis/retry.py", line 92, in call_with_retry
    raise error
  File "/app/.venv/lib/python3.13/site-packages/redis/retry.py", line 87, in call_with_retry
    return do()
  File "/app/.venv/lib/python3.13/site-packages/redis/client.py", line 1614, in <lambda>
    lambda: execute(conn, stack, raise_on_error),
            ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.13/site-packages/redis/client.py", line 1455, in _execute_transaction
    connection.send_packed_command(all_cmds)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/app/.venv/lib/python3.13/site-packages/redis/connection.py", line 581, in send_packed_command
    self.check_health()
    ~~~~~~~~~~~~~~~~~^^
  File "/app/.venv/lib/python3.13/site-packages/redis/connection.py", line 573, in check_health
    self.retry.call_with_retry(self._send_ping, self._ping_failed)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.13/site-packages/redis/retry.py", line 92, in call_with_retry
    raise error
  File "/app/.venv/lib/python3.13/site-packages/redis/retry.py", line 87, in call_with_retry
    return do()
  File "/app/.venv/lib/python3.13/site-packages/redis/connection.py", line 563, in _send_ping
    if str_if_bytes(self.read_response()) != "PONG":
                    ~~~~~~~~~~~~~~~~~~^^
  File "/app/.venv/lib/python3.13/site-packages/redis/connection.py", line 652, in read_response
    raise ConnectionError(f"Error while reading from {host_error} : {e.args}")

redis.exceptions.ConnectionError: Error while reading from <my-redis-service>:6379 : (113, 'No route to host')

Meanwhile, the liveness probe on /health still returns a 200:

$ curl -s 127.0.0.1:9808/health
Connected to the broker redis://<my-redis-service>:6379//

Restarting the pod manually fixes the issue.
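I'm guessing /health only reports the configured broker URL instead of actually touching Redis, which is why the liveness probe never triggers that restart on its own. Here's a rough sketch of a stricter check (the `celery_app` config key and the blueprint wiring are just placeholders I made up, not the exporter's real code):

```python
# Sketch only: make /health open a real broker connection instead of just
# echoing the broker URL, so the kubelet restarts the pod after a failover.
from flask import Blueprint, current_app

health_bp = Blueprint("health", __name__)


@health_bp.route("/health")
def health():
    celery_app = current_app.config["celery_app"]  # assumed to be stored at startup
    try:
        # Open a fresh connection and fail fast if Redis is unreachable.
        with celery_app.connection_for_read() as conn:
            conn.ensure_connection(max_retries=1)
    except Exception as exc:
        return f"Broker check failed: {exc}", 503
    return "Connected to the broker", 200
```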

Could we change this behavior? Maybe the process should exit instead of only logging an ERROR?
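For example, the metrics puller could be wrapped so that a broken broker connection terminates the process instead of just being logged. A rough sketch of that idea (the `scrape_or_die` name and the exact exception list are only my guesses, not the exporter's actual code):

```python
# Sketch only: exit the process when a scrape hits a dead broker connection,
# so Kubernetes restarts the pod instead of serving 500s on /metrics forever.
import logging
import os

from kombu.exceptions import OperationalError
from redis.exceptions import ConnectionError as RedisConnectionError

logger = logging.getLogger(__name__)


def scrape_or_die(scrape):
    """Run one scrape; exit if the broker connection is broken."""
    try:
        scrape()
    except (RedisConnectionError, OperationalError) as exc:
        # After the failover the pidbox connection stays broken, so bail out
        # and let the kubelet restart the pod instead of only logging an ERROR.
        logger.error("Lost connection to the broker, exiting: %s", exc)
        os._exit(1)
```

The wrapped callable could then be stored as the `metrics_puller` entry that http_server.py already reads from the Flask config, so the first failed scrape after a failover restarts the pod.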
