Over the years, we at MongoDB have developed tooling within our correctness testing infrastructure
to make it easier to debug crashes (by collecting core dumps), hangs (by collecting thread stacks
and lock requests), and data corruption (by collecting data files). However, we have yet to evolve a
better strategy around debugging race conditions and still depend on an engineer to run the failed
test many times with additional logging, or to think really hard about where in the code
to add a sleep. Technologies such as rr may help us form a better story for investigating
race-related issues without requiring effort from an engineer to manually reproduce the failure.
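
The manual approach described above, rerunning a failed test until it fails again, can be sketched as a small shell helper. This is only an illustration: `run_until_failure` is a hypothetical name, not part of MongoDB's testing infrastructure, and the test binary path in the commented example is a placeholder. `rr record --chaos` is rr's chaos mode, which randomizes scheduling decisions; whether it helps for a given failure is an assumption.

```shell
# Hypothetical helper: rerun a command until it fails or the attempt budget runs out.
# Usage: run_until_failure MAX_ATTEMPTS COMMAND [ARGS...]
# Returns 0 if a failure was observed, 1 if every attempt passed.
run_until_failure() {
    max_attempts=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # "$@" is the command under test; a non-zero exit status counts as a failure.
        if ! "$@"; then
            echo "failed on attempt $attempt"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "no failure in $max_attempts attempts"
    return 1
}

# Example (commented out because it needs rr and a real test binary):
# run_until_failure 100 rr record --chaos ./path/to/test_binary
```

Wrapping the command in `rr record` means the failing attempt leaves behind a trace that could then be inspected with `rr replay`, rather than only a log.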

    git clone https://github.com/visemet/mongodb-rr-experiment.git
    cd mongodb-rr-experiment

The following instructions were adapted from https://github.com/mozilla/rr/wiki/Building-And-Installing.


    sudo apt update
    sudo apt install \
        capnproto \
        ccache \
        clang \
        cmake \
        coreutils \
        g++-multilib \
        gdb \
        git \
        libcapnp-dev \
        make \
        manpages-dev \
        ninja-build \
        pkg-config \
        python-pexpect \
        python3-pexpect

    git clone https://github.com/mozilla/rr.git
    cd rr
    git checkout 5.2.0
    CC=clang CXX=clang++ cmake -B build/ -G Ninja -Ddisable32bit=ON .
    cmake --build build/
    sudo cmake --build build/ --target install
    sudo sysctl kernel.perf_event_paranoid=1

The following instructions were adapted from https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source.


    sudo apt install libcurl4-openssl-dev python-pip

    git clone https://github.com/mongodb/mongo.git
    cd mongo
    git remote add visemet https://github.com/visemet/mongo.git
    git fetch visemet mongodb-rr-experiment
    git checkout visemet/mongodb-rr-experiment
    python2 -m pip install -r etc/pip/dev-requirements.txt
    python2 -m pip install --user psutil==5.4.8

You may notice when comparing the columns in the tables below that (1) there weren't any cases where
a failure could only be reproduced using rr, and (2) there were multiple cases where a failure
could only be reproduced manually. This shouldn't be interpreted as saying rr is ineffective. It
is still very likely that rr would save an engineer both time and effort when investigating a
build failure. The results simply demonstrate that it isn't possible to solely rely on rr as the
answer to investigating all race-related issues.

| Build failure | Able to reproduce using rr? | Able to reproduce manually? |
|---|---|---|
| BF-9810 | | |
| BF-9958 | ✓ | ✓ |
| BF-10742 | ✓ | ✓ |
| BF-10932 | ✓ | ✓ |


| Build failure | Able to reproduce using rr? | Able to reproduce manually? |
|---|---|---|
| BF-6346 | | ✓ |
| BF-8424 | ✓ | ✓ |
| BF-9030 | | |


| Build failure | Able to reproduce using rr? | Able to reproduce manually? |
|---|---|---|
| BF-7114 | | ✓ |
| BF-7588 | ✓ | ✓ |
| BF-7888 | | ✓ |
| BF-8258 | | |
| BF-8642 | ✓ | ✓ |
| BF-9248 | | ✓ |
| BF-9426 | | |
| BF-9552 | ✓ | ✓ |
| BF-9864 | | |
| BF-10729 | ✓ | ✓ |
| BF-11054 | ✓ | ✓ |
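
To make the comparison across the three tables above easier to see at a glance, the outcomes can be tallied with a short script. This is a sketch, not part of the original experiment: the function names `outcomes` and `tally_outcomes` are hypothetical, and the data assumes that any row with a single check mark was reproduced manually only, which is what the summary paragraph before the tables implies.

```shell
# Outcomes transcribed from the tables above as "<ticket> <rr> <manual>",
# where 1 means the failure was reproduced and 0 means it was not.
# Assumption: a row with a single check mark was reproduced manually only.
outcomes() {
    cat <<'EOF'
BF-9810 0 0
BF-9958 1 1
BF-10742 1 1
BF-10932 1 1
BF-6346 0 1
BF-8424 1 1
BF-9030 0 0
BF-7114 0 1
BF-7588 1 1
BF-7888 0 1
BF-8258 0 0
BF-8642 1 1
BF-9248 0 1
BF-9426 0 0
BF-9552 1 1
BF-9864 0 0
BF-10729 1 1
BF-11054 1 1
EOF
}

# Count how many failures fall into each combination of outcomes.
tally_outcomes() {
    awk '
        $2 == 1 && $3 == 1 { both++ }
        $2 == 0 && $3 == 1 { manual_only++ }
        $2 == 1 && $3 == 0 { rr_only++ }
        $2 == 0 && $3 == 0 { neither++ }
        END {
            printf "both=%d manual_only=%d rr_only=%d neither=%d\n",
                   both, manual_only, rr_only, neither
        }
    '
}

outcomes | tally_outcomes
```

Under that assumption the script prints `both=9 manual_only=4 rr_only=0 neither=5` for the 18 build failures listed.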