Description of the bug:
If the Bazel server becomes a zombie process, no new server can be started: each new Bazel invocation sees the zombie server and tries to shut it down, which never succeeds.
This is a common occurrence in docker containers, because by default docker containers have no init program that is capable of reaping zombies. We saw this issue on the Android CI system (buildbot).
The message that you get when this happens is:
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
INFO: Waited 60 seconds for server process (pid=24) to terminate.
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
INFO: Waited 10 seconds for server process (pid=24) to terminate.
FATAL: Attempted to kill stale server process (pid=24) using SIGKILL, but it did not die in a timely fashion.
The message doesn't give much information, which makes it hard to debug. Some improvements that could be made:
- Presumably the first 60-second wait is for SIGTERM or a gRPC call, and the second is for SIGKILL, but the messages don't say which is which, so it looks as if Bazel tried the same thing twice.
- The killpg return value is ignored; printing it might give a useful hint.
- Bazel could print a message explaining that it's probably a zombie and needs to be reaped, and could print the parent of the zombie bazel process (by reading /proc files) to let the user know if the parent process is a real init process or not.
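To illustrate the last suggestion, here is a minimal Python sketch of the /proc check. It is not Bazel's code: the function names `parse_proc_stat` and `describe_stale_server` and the message wording are hypothetical. It relies only on the documented layout of /proc/<pid>/stat on Linux, where the state field follows the parenthesised command name and the parent PID comes next.

```python
import os


def parse_proc_stat(stat_text):
    """Return (comm, state, ppid) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first '(' and the last ')'.
    lparen = stat_text.index('(')
    rparen = stat_text.rindex(')')
    comm = stat_text[lparen + 1:rparen]
    fields = stat_text[rparen + 2:].split()
    return comm, fields[0], int(fields[1])


def describe_stale_server(pid, proc_root='/proc'):
    """Build a hint about why a server process will not go away (hypothetical helper)."""
    try:
        with open(os.path.join(proc_root, str(pid), 'stat')) as f:
            comm, state, ppid = parse_proc_stat(f.read())
    except OSError:
        return f'pid {pid}: no longer exists'
    if state != 'Z':
        return f'pid {pid} ({comm}) is in state {state}, not a zombie'
    try:
        with open(os.path.join(proc_root, str(ppid), 'comm')) as f:
            parent = f.read().strip()
    except OSError:
        parent = 'unknown'
    return (f'pid {pid} ({comm}) is a zombie; its parent, pid {ppid} '
            f'({parent}), has not reaped it')
```

For the log above, calling `describe_stale_server(24)` inside the container would report whether pid 24 is in state Z and name its parent, which lets the user see at a glance whether PID 1 is a real init.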
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Dockerfile:
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y wget python3
RUN \
    wget https://github.com/bazelbuild/bazelisk/releases/download/v1.15.0/bazelisk-linux-amd64 && \
    mv bazelisk-linux-amd64 bazelisk && \
    chmod +x bazelisk
COPY ./test.py /test.py
# It's important to use the exec form and not the shell form here,
# or else a shell will be PID 1 and will reap zombies
CMD ["python3", "/test.py"]
test.py:
#!/usr/bin/env python3
import os
import subprocess
import sys
# It's important that this is a python script and not
# a bash script because bash will reap zombies.
os.mkdir('/workspace')
open('/workspace/WORKSPACE', 'w').close()
open('/workspace/BUILD', 'w').close()
os.chdir('/workspace')
env = {
    **os.environ,
    # Also seen on newer versions, like the one checked into Android,
    # but pinned to the latest release for reproducibility.
    "USE_BAZEL_VERSION": "5.3.2",
}
try:
    subprocess.run(['/bazelisk', 'query', '//...'], check=True, env=env)
    subprocess.run(['/bazelisk', 'shutdown'], check=True, env=env)
    subprocess.run(['/bazelisk', 'query', '//...'], check=True, env=env)
except subprocess.CalledProcessError:
    sys.exit(1)
Run it with: sudo docker build -t bazel_zombie_test . && sudo docker run bazel_zombie_test
Adding --init to the docker run command fixes the issue.
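For cases where --init is not an option, the process running as PID 1 could reap its exited children itself. The sketch below shows one way to do that in Python with `os.waitpid`; the helper name `reap_zombies` is hypothetical. That this avoids the hang in the reproduction above is an assumption and has not been verified here; only --init is confirmed to work.

```python
import os


def reap_zombies():
    """Reap every child that has already exited; return the reaped PIDs."""
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG makes the call return
            # immediately instead of blocking on children still running.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped.append(pid)
    return reaped
```

In test.py this would be called between the `shutdown` and the second `query`. The server may still be exiting when `bazelisk shutdown` returns, so a single call might find nothing to reap and it may need to be repeated after a short delay.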