
[BUG] Docker Compose --wait not always honoring healthy healthcheck for container that crashed, then became healthy #12424

Open
@hubertdeng123

Description

Occasionally, when a container is unhealthy on start, restarts, and then becomes healthy, docker compose up -d --wait fails with an unhealthy error message. This happens when docker compose up -d --wait is run in parallel for several projects, and with the restart policy restart: unless-stopped. Note that it only happens occasionally, not every time.

I would expect that even if a container is unhealthy and crashes on start, --wait would account for this, since the container eventually becomes healthy after restarting itself, provided that happens within the wait timeout.
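
The behavior I would expect is roughly what a manual polling loop does: keep waiting through restarts as long as the container eventually reports healthy. A minimal sketch of that idea (the 120-second budget is an arbitrary example; relay-relay-1 is the container from the logs below):

# Sketch only: poll the health status directly, tolerating intermediate
# crashes/restarts until an overall deadline is reached.
deadline=$(( $(date +%s) + 120 ))
until [ "$(docker inspect --format '{{.State.Health.Status}}' relay-relay-1 2>/dev/null)" = "healthy" ]; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "timed out waiting for relay-relay-1 to become healthy" >&2
    exit 1
  fi
  sleep 2
done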

Steps To Reproduce

I have 3 config files like so:

docker-compose-redis.yml:

services:
  redis:
    image: ghcr.io/getsentry/image-mirror-library-redis:5.0-alpine
    healthcheck:
      test: redis-cli ping | grep PONG
      interval: 5s
      timeout: 5s
      retries: 3
    command:
      [
        'redis-server',
        '--appendonly',
        'yes',
        '--save',
        '60',
        '20',
        '--auto-aof-rewrite-percentage',
        '100',
        '--auto-aof-rewrite-min-size',
        '64mb',
      ]
    ports:
      - 127.0.0.1:6379:6379
    volumes:
      - redis-data:/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  redis-data:

docker-compose-kafka.yml:

services:
  kafka:
    image: ghcr.io/getsentry/image-mirror-confluentinc-cp-kafka:7.5.0
    healthcheck:
      test: kafka-topics --bootstrap-server 127.0.0.1:9092 --list
      interval: 5s
      timeout: 5s
      retries: 3
    environment:
      # https://docs.confluent.io/platform/current/installation/docker/config-reference.html#cp-kakfa-example
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1001@127.0.0.1:29093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_NODE_ID: 1001
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,INTERNAL://0.0.0.0:9093,EXTERNAL://0.0.0.0:9092,CONTROLLER://0.0.0.0:29093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://127.0.0.1:29092,INTERNAL://kafka:9093,EXTERNAL://127.0.0.1:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT,CONTROLLER:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS: 1
      KAFKA_LOG_RETENTION_HOURS: 24
      KAFKA_MESSAGE_MAX_BYTES: 50000000 # 50MB or bust
      KAFKA_MAX_REQUEST_SIZE: 50000000 # 50MB on requests apparently too
      CONFLUENT_SUPPORT_METRICS_ENABLE: false
      KAFKA_LOG4J_LOGGERS: kafka.cluster=WARN,kafka.controller=WARN,kafka.coordinator=WARN,kafka.log=WARN,kafka.server=WARN,state.change.logger=WARN
      KAFKA_LOG4J_ROOT_LOGLEVEL: WARN
      KAFKA_TOOLS_LOG4J_LOGLEVEL: WARN
    ulimits:
      nofile:
        soft: 4096
        hard: 4096
    ports:
      - 127.0.0.1:9092:9092
      - 127.0.0.1:9093:9093
    volumes:
      - kafka-data:/var/lib/kafka/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:

docker-compose-relay.yml:

services:
  relay:
    image: us-central1-docker.pkg.dev/sentryio/relay/relay:nightly
    ports:
      - 127.0.0.1:7899:7899
    command: [run, --config, /etc/relay]
    healthcheck:
      test: curl -f http://127.0.0.1:7899/api/relay/healthcheck/live/
      interval: 5s
      timeout: 5s
      retries: 3
    volumes:
      - ./config/relay.yml:/etc/relay/config.yml
      - ./config/devservices-credentials.json:/etc/relay/credentials.json
    extra_hosts:
      - host.docker.internal:host-gateway
    networks:
      - devservices
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:
  redis-data:

When I run

# Start up commands in parallel
docker compose -p redis -f docker-compose-redis.yml up redis -d --wait > redis_up.log 2>&1 &
redis_pid=$!
docker compose -p kafka -f docker-compose-kafka.yml up kafka -d --wait > kafka_up.log 2>&1 &
kafka_pid=$!
docker compose -p relay -f docker-compose-relay.yml up relay -d --wait > relay_up.log 2>&1 &
relay_pid=$!

# Wait for all up commands to complete
wait $kafka_pid $redis_pid $relay_pid
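
As a side note, to see which project's --wait actually failed, the single combined wait can be swapped for per-PID waits (a minor variant, not part of the original repro):

# Reap each up command separately so the failing project is obvious from its exit status.
wait "$redis_pid" || echo "redis: docker compose up --wait failed (exit $?)" >&2
wait "$kafka_pid" || echo "kafka: docker compose up --wait failed (exit $?)" >&2
wait "$relay_pid" || echo "relay: docker compose up --wait failed (exit $?)" >&2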

Relay sometimes fails to come up when the --wait flag is used, even though the container's Docker status is technically healthy (see the inspect command after the logs below).

Logs:

Container relay-relay-1  Creating
 Container relay-relay-1  Created
 Container relay-relay-1  Starting
 Container relay-relay-1  Started
 Container relay-relay-1  Waiting
container relay-relay-1 is unhealthy
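
Inspecting the container right after the failed --wait confirms this; a format string like the one below (just one way to pull both fields) shows the health status alongside the restart count, and a non-zero RestartCount is what indicates the container crashed and was restarted before becoming healthy:

# Shows the current health status and how many times the container was restarted.
docker inspect --format 'health={{.State.Health.Status}} restarts={{.RestartCount}}' relay-relay-1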

Compose Version

2.29.7

Docker Environment

Client:
 Version:    27.2.0
 Context:    colima

Anything else?

Let me know if there is anything else I can add to help reproduce the issue. The contents of the relay configs can be found here:
https://github.com/getsentry/relay/tree/fe3f09fd3accd2361887dd678dbe034f25139fce/devservices/config
