[BUG] Docker Compose --wait not always honoring healthy healthcheck for container that crashed, then became healthy #12424
Description
Occasionally, when a container is unhealthy on start, restarts, and then becomes healthy, `docker compose up -d --wait` fails with an "unhealthy" error message. This happens when several `docker compose up -d --wait` commands run in parallel and the services use the `restart: unless-stopped` policy. Note that it happens only occasionally, not every time.
I would hope that even if the container is unhealthy and crashes on start, `--wait` would account for this, since the container eventually becomes healthy after restarting itself, provided that happens within the timeout period.
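For illustration, a minimal hypothetical service with the same shape (not one of my actual configs, and I have not reduced my reproduction down to this) would be one that crashes once on first start, gets restarted by the restart policy, and only then passes its healthcheck:

```yaml
# Hypothetical minimal example, not from my setup: the first run is
# unhealthy and exits; the restart policy brings the container back,
# and the second run passes the healthcheck.
services:
  flaky:
    image: alpine:3.19
    # First run: no /tmp/ok yet, so after 20s create it and exit 1.
    # Second run (after the restart): /tmp/ok persists in the container's
    # writable layer, so the service stays up and is healthy.
    command: sh -c "sleep 20; test -f /tmp/ok || { touch /tmp/ok; exit 1; }; sleep 3600"
    healthcheck:
      test: test -f /tmp/ok
      interval: 5s
      timeout: 5s
      retries: 3
    restart: unless-stopped
```

If `docker compose up -d --wait` observes this container during the unhealthy/crashing phase, I would still expect it to keep waiting until the post-restart run reports healthy.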
Steps To Reproduce
I have 3 config files like so:
docker-compose-redis:

```yaml
services:
  redis:
    image: ghcr.io/getsentry/image-mirror-library-redis:5.0-alpine
    healthcheck:
      test: redis-cli ping | grep PONG
      interval: 5s
      timeout: 5s
      retries: 3
    command:
      [
        'redis-server',
        '--appendonly',
        'yes',
        '--save',
        '60',
        '20',
        '--auto-aof-rewrite-percentage',
        '100',
        '--auto-aof-rewrite-min-size',
        '64mb',
      ]
    ports:
      - 127.0.0.1:6379:6379
    volumes:
      - redis-data:/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  redis-data:
```
docker-compose-kafka:

```yaml
services:
  kafka:
    image: ghcr.io/getsentry/image-mirror-confluentinc-cp-kafka:7.5.0
    healthcheck:
      test: kafka-topics --bootstrap-server 127.0.0.1:9092 --list
      interval: 5s
      timeout: 5s
      retries: 3
    environment:
      # https://docs.confluent.io/platform/current/installation/docker/config-reference.html#cp-kakfa-example
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1001@127.0.0.1:29093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_NODE_ID: 1001
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,INTERNAL://0.0.0.0:9093,EXTERNAL://0.0.0.0:9092,CONTROLLER://0.0.0.0:29093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://127.0.0.1:29092,INTERNAL://kafka:9093,EXTERNAL://127.0.0.1:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT,CONTROLLER:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS: 1
      KAFKA_LOG_RETENTION_HOURS: 24
      KAFKA_MESSAGE_MAX_BYTES: 50000000 # 50MB or bust
      KAFKA_MAX_REQUEST_SIZE: 50000000 # 50MB on requests apparently too
      CONFLUENT_SUPPORT_METRICS_ENABLE: false
      KAFKA_LOG4J_LOGGERS: kafka.cluster=WARN,kafka.controller=WARN,kafka.coordinator=WARN,kafka.log=WARN,kafka.server=WARN,state.change.logger=WARN
      KAFKA_LOG4J_ROOT_LOGLEVEL: WARN
      KAFKA_TOOLS_LOG4J_LOGLEVEL: WARN
    ulimits:
      nofile:
        soft: 4096
        hard: 4096
    ports:
      - 127.0.0.1:9092:9092
      - 127.0.0.1:9093:9093
    volumes:
      - kafka-data:/var/lib/kafka/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:
```
docker-compose-relay:

```yaml
services:
  relay:
    image: us-central1-docker.pkg.dev/sentryio/relay/relay:nightly
    ports:
      - 127.0.0.1:7899:7899
    command: [run, --config, /etc/relay]
    healthcheck:
      test: curl -f http://127.0.0.1:7899/api/relay/healthcheck/live/
      interval: 5s
      timeout: 5s
      retries: 3
    volumes:
      - ./config/relay.yml:/etc/relay/config.yml
      - ./config/devservices-credentials.json:/etc/relay/credentials.json
    extra_hosts:
      - host.docker.internal:host-gateway
    networks:
      - devservices
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:
  redis-data:
```
When I run:

```bash
# Start up commands in parallel
docker compose -p redis -f docker-compose-redis.yml up redis -d --wait > redis_up.log 2>&1 &
redis_pid=$!
docker compose -p kafka -f docker-compose-kafka.yml up kafka -d --wait > kafka_up.log 2>&1 &
kafka_pid=$!
docker compose -p relay -f docker-compose-relay.yml up relay -d --wait > relay_up.log 2>&1 &
relay_pid=$!

# Wait for all up commands to complete
wait $redis_pid $kafka_pid $relay_pid
```
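One note on the harness itself: in bash, `wait` with several PIDs only returns the exit status of the last PID listed, so when I want to see which `up --wait` failed, I wait on each one separately, along these lines:

```bash
# Collect each compose command's exit status individually;
# `wait PID1 PID2 PID3` only reports the status of the last PID.
wait "$redis_pid"; redis_rc=$?
wait "$kafka_pid"; kafka_rc=$?
wait "$relay_pid"; relay_rc=$?
echo "redis=$redis_rc kafka=$kafka_rc relay=$relay_rc"
```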
Relay sometimes fails to come up with the `--wait` flag, even though the container's Docker health status ends up healthy.
Logs:

```
Container relay-relay-1 Creating
Container relay-relay-1 Created
Container relay-relay-1 Starting
Container relay-relay-1 Started
Container relay-relay-1 Waiting
container relay-relay-1 is unhealthy
```
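After `--wait` fails like this, inspecting the container's health state directly shows it as healthy, which is what makes me think `--wait` is giving up on the pre-restart unhealthy status:

```bash
# Query the container's current health state after --wait has failed.
docker inspect --format '{{.State.Health.Status}}' relay-relay-1
# prints: healthy
```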
Compose Version
2.29.7
Docker Environment
```
Client:
 Version: 27.2.0
 Context: colima
```
Anything else?
Let me know if there is anything else I can provide to help reproduce the issue. The contents of the relay configs can be found here:
https://github.com/getsentry/relay/tree/fe3f09fd3accd2361887dd678dbe034f25139fce/devservices/config
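As a possible workaround on my side, I could poll the health status manually instead of relying on `--wait`; a rough sketch (the container name and the 120s deadline are arbitrary choices):

```bash
#!/bin/sh
# Workaround sketch: poll the container's health status ourselves,
# tolerating an initial crash-and-restart cycle before it turns healthy.
container=relay-relay-1            # arbitrary example name
deadline=$(( $(date +%s) + 120 ))  # arbitrary 120s budget
until [ "$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)" = healthy ]; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "$container did not become healthy in time" >&2
    exit 1
  fi
  sleep 2
done
echo "$container is healthy"
```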