Description
Impact: TBD (AFAIK clients should use another test helper if one of them is down, but that may not be the case).
Detection: repeated email alert
Timeline UTC:
00:00 four periodic jobs start: certbot@cron, certbot@systemd, munin plugins apt update, /data/b.web-connectivity.th.ooni.io/update-bouncer.py
00:00 2017-07-27 00:00:03,678:DEBUG:certbot.storage:Should renew, less than 30 days before certificate expiry 2017-08-25 23:01:00 UTC.
00:00 2017-07-27 00:00:03,678:INFO:certbot.hooks:Running pre-hook command: docker stop ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:03+0000 [-] Received SIGTERM, shutting down.
00:00 Jul 27 00:00:13 b dockerd[366]: time="2017-07-27T00:00:13.717964674Z" level=info msg="Container 6a4379167c880b295f7383d6eab8fc7b9e422ac1b0e6df0ab5cfefa2524fd512 failed to exit within 10 seconds of signal 15 - using the force"
00:00 2017-07-27 00:00:32,230:INFO:certbot.hooks:Running post-hook command: cp ... && docker start ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:34.710428510Z Another twistd server is running, PID 1
00:01 [FIRING] Instance https://b.web-connectivity.th.ooni.io/status down
00:34 2017-07-27 00:34:22,763:INFO:certbot.renewal:Cert not yet due for renewal
07:00 darkk@ wakes up
07:31 darkk@ logs into b.web-connectivity.th.ooni.io
07:33 2017-07-27T07:33:31.007042572Z Pidfile /oonib.pid contains non-numeric value
after an attempt to truncate --size 0 /var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid
07:34 2017-07-27T07:34:00.767235934Z Removing stale pidfile /oonib.pid
after echo 42 >/var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid
07:36 [RESOLVED] Instance https://b.web-connectivity.th.ooni.io/status down
09:50 incident published
What went well:
- repeated alerting made it clear that the problem was ongoing
- although I had no darkk user at the host, I had my key in root's authorized_keys
- docker aufs makes it easy to amend a container's filesystem while it is stopped; a device-mapper based storage backend is worse in that regard
What went wrong:
- that was a relapse of the pid=1-in-pidfile problem
- the "View in AlertManager" link in email and Slack notifications is not clickable
- two certbot launchers, /etc/systemd/system/timers.target.wants/certbot.timer and /etc/cron.d/letsencrypt_renew_certs-..., are confusing; I did not notice one of them at first and spent some time looking for whatever triggered the SIGTERM
- the ooni-backend-b.web-connectivity.th.ooni.io container stores a significant amount of logs inside: they take 7G of the overall 15G container size limit. Historical logs are certainly useful, but: 1. renaming 7400 files on rotation may be slow, and twisted is single-threaded; 2. they should not overflow the container disk quota
- twisted logs do not contain milliseconds; those help distinguish two nearby events from a single event producing several log lines. docker logs stores nanosecond timestamps, but oonib does not log to stdout (i.e. to docker)
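The log-rotation cost noted above comes from the classic numbered-suffix scheme, where every historical file is renamed on each rotation. A timestamp suffix renames each file exactly once and makes quota enforcement a simple prune. A minimal sketch, assuming plain-file logs; rotate_once, its default limit, and the paths are illustrative, not oonib's actual rotation code:

```python
import os
import time

def rotate_once(path, max_files=7400):
    """Rotate by renaming the active log to a timestamped name.

    Unlike the .1 -> .2 -> ... -> .N shuffle, which renames every
    historical file on each rotation, this touches exactly one file.
    Pruning the oldest files keeps the total under the disk quota.
    """
    if os.path.exists(path):
        os.rename(path, "%s.%d" % (path, int(time.time())))
    directory = os.path.dirname(path) or "."
    prefix = os.path.basename(path) + "."
    # 10-digit epoch suffixes are equal-width, so a lexicographic
    # sort orders rotated files oldest-first
    rotated = sorted(f for f in os.listdir(directory) if f.startswith(prefix))
    for stale in rotated[:max(0, len(rotated) - max_files)]:
        os.remove(os.path.join(directory, stale))
```

For example, rotate_once("/var/log/ooni/oonibackend.log") would produce oonibackend.log.1501113600-style names and delete the oldest files once more than max_files accumulate.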
What is still unclear and should be discussed:
- impact: do applications really fall back to another test helper?
- do we need a more insistent notification system that can wake people up?
What could be done to prevent relapse and decrease impact:
- move letsencrypt renewals to the team's "business hours"; people are asleep at UTC midnight
- avoid stale pid files somehow: clean up the twisted PID file on container start || grab flock on the pid file || randomize the daemon pid so it's not pid=1
- fix AlertManager link
- fix two certbot launchers
- limit number of /var/log/ooni/oonibackend.log.* files in the container
- rename each /var/log/ooni/oonibackend.log.* file only once (e.g. to oonibackend.log.${unix_timestamp})
- milli- or microseconds in twisted logs || logging to stdout (to docker)
- ensure that same storage driver is used across all containers #96
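Of the pid-file options above, flock sidesteps stale-content checks entirely: the kernel drops the lock the moment its holder exits, however it died, so whatever pid happens to be written in the file (including 1) can never block the next start. A minimal sketch; acquire_pidfile is a hypothetical helper, not part of oonib or twistd:

```python
import fcntl
import os
import sys

def acquire_pidfile(path):
    """Take an exclusive non-blocking flock on the pid file.

    Only a *live* holder of the lock makes us bail out; a leftover
    file containing pid=1 or garbage is simply overwritten, which
    avoids the "Another twistd server is running, PID 1" trap.
    """
    fd = open(path, "a+")
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError):
        sys.exit("another instance already holds %s" % path)
    fd.seek(0)
    fd.truncate()
    fd.write(str(os.getpid()))
    fd.flush()
    return fd  # keep this handle open for the daemon's lifetime
```

The returned handle must stay open for as long as the daemon runs; closing it (or the process dying) releases the lock automatically.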