Description
Impact: TBD (AFAIK clients should use another test helper if one of them is down, but that may not be the case).
Detection: repeated email alert
Timeline UTC:
00:00 four periodic jobs start: certbot@cron, certbot@systemd, munin plugins apt update, /data/b.web-connectivity.th.ooni.io/update-bouncer.py
00:00 2017-07-27 00:00:03,678:DEBUG:certbot.storage:Should renew, less than 30 days before certificate expiry 2017-08-25 23:01:00 UTC.
00:00 2017-07-27 00:00:03,678:INFO:certbot.hooks:Running pre-hook command: docker stop ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:03+0000 [-] Received SIGTERM, shutting down.
00:00 Jul 27 00:00:13 b dockerd[366]: time="2017-07-27T00:00:13.717964674Z" level=info msg="Container 6a4379167c880b295f7383d6eab8fc7b9e422ac1b0e6df0ab5cfefa2524fd512 failed to exit within 10 seconds of signal 15 - using the force"
00:00 2017-07-27 00:00:32,230:INFO:certbot.hooks:Running post-hook command: cp ... && docker start ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:34.710428510Z Another twistd server is running, PID 1
00:01 [FIRING] Instance https://b.web-connectivity.th.ooni.io/status down
00:34 2017-07-27 00:34:22,763:INFO:certbot.renewal:Cert not yet due for renewal
07:00 darkk@ wakes up
07:31 darkk@ logs into b.web-connectivity.th.ooni.io
07:33 2017-07-27T07:33:31.007042572Z Pidfile /oonib.pid contains non-numeric value
after an attempt to truncate --size 0 /var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid
07:34 2017-07-27T07:34:00.767235934Z Removing stale pidfile /oonib.pid
after echo 42 >/var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid
07:36 [RESOLVED] Instance https://b.web-connectivity.th.ooni.io/status down
09:50 incident published
What went well:
- repeated alerting made it clear that the problem was ongoing
- although I had no darkk user at the host, I had my key in root's authorized_keys
- docker aufs makes it easy to amend a container's filesystem while it is stopped; a device-mapper based storage backend is worse in that regard
What went wrong:
- that was a relapse of the pid=1-in-pidfile problem
- the "View in AlertManager" link in email and Slack notifications is not clickable
- two certbot launchers, /etc/systemd/system/timers.target.wants/certbot.timer and /etc/cron.d/letsencrypt_renew_certs-..., are confusing; I did not notice one of them at first and spent some time looking for whatever triggered the SIGTERM
- the ooni-backend-b.web-connectivity.th.ooni.io container stores a significant amount of logs inside: they take 7G of the overall 15G container size limit. Historical logs are certainly useful, but: 1. renaming 7400 files on rotation may be slow, and twisted is single-threaded; 2. they should not overflow the container disk quota
- twisted logs do not contain milliseconds; those help distinguish two nearby events from a single event producing several log lines. docker logs stores nanosecond timestamps, but oonib does not log to stdout (i.e. to docker)
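The log-rotation cost noted above comes from the classic numbered-suffix scheme, where every historical file is renamed on each rotation. A timestamp suffix renames each file exactly once and makes quota enforcement a simple prune. A minimal sketch, assuming plain-file logs; rotate_once, its default limit, and the paths are illustrative, not oonib's actual rotation code:

```python
import os
import time

def rotate_once(path, max_files=7400):
    """Rotate by renaming the active log to a timestamped name.

    Unlike the .1 -> .2 -> ... -> .N shuffle, which renames every
    historical file on each rotation, this touches exactly one file.
    Pruning the oldest files keeps the total under the disk quota.
    """
    if os.path.exists(path):
        os.rename(path, "%s.%d" % (path, int(time.time())))
    directory = os.path.dirname(path) or "."
    prefix = os.path.basename(path) + "."
    # 10-digit epoch suffixes are equal-width, so a lexicographic
    # sort orders rotated files oldest-first
    rotated = sorted(f for f in os.listdir(directory) if f.startswith(prefix))
    for stale in rotated[:max(0, len(rotated) - max_files)]:
        os.remove(os.path.join(directory, stale))
```

For example, rotate_once("/var/log/ooni/oonibackend.log") would produce oonibackend.log.1501113600-style names and delete the oldest files once more than max_files accumulate.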
What is still unclear and should be discussed:
- impact: do applications really fall back to another test helper?
- do we need a more insistent notification system that can wake people up?
What could be done to prevent relapse and decrease impact:
- move letsencrypt renewals to the team's "business hours"; people are asleep at UTC midnight
- avoid stale pid files somehow: clean up the twisted PID file on container start || grab flock on the pid file || randomize the daemon pid so it's not pid=1
- fix AlertManager link
- fix two certbot launchers
- limit number of /var/log/ooni/oonibackend.log.* files in the container
- rename each /var/log/ooni/oonibackend.log.* file only once (e.g. to oonibackend.log.${unix_timestamp})
- milli- or microseconds in twisted logs || logging to stdout (to docker)
- ensure that same storage driver is used across all containers #96
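Of the pid-file options above, flock sidesteps stale-content checks entirely: the kernel drops the lock the moment its holder exits, however it died, so whatever pid happens to be written in the file (including 1) can never block the next start. A minimal sketch; acquire_pidfile is a hypothetical helper, not part of oonib or twistd:

```python
import fcntl
import os
import sys

def acquire_pidfile(path):
    """Take an exclusive non-blocking flock on the pid file.

    Only a *live* holder of the lock makes us bail out; a leftover
    file containing pid=1 or garbage is simply overwritten, which
    avoids the "Another twistd server is running, PID 1" trap.
    """
    fd = open(path, "a+")
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError):
        sys.exit("another instance already holds %s" % path)
    fd.seek(0)
    fd.truncate()
    fd.write(str(os.getpid()))
    fd.flush()
    return fd  # keep this handle open for the daemon's lifetime
```

The returned handle must stay open for as long as the daemon runs; closing it (or the process dying) releases the lock automatically.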