b.echo.th.ooni.io possibly down for 8 hours

Impact: 8h 50m downtime of b.echo.th.ooni.io test helper (?)

Detection: CPUHigh alert with expected 8h delay

Timeline UTC:
17 Nov 07:30 CPU spikes to 100%, that's an _`accept()` vs. EMFILE_ busy loop
17 Nov 15:34 `CPUHigh` alert firing
17 Nov 15:48 @darkk logs into the VM, confirms 100% CPU
17 Nov 16:14 @darkk logs into the VM, looks at `oonib`, reboots the VM
17 Nov 16:20 everything recovers to normal

What went well:
- resource utilisation alerts are actually useful!

What went wrong:
- oonib was [slowly](https://mon.ooni.nu/prometheus/graph?g0.range_input=13d&g0.end_input=2018-11-17%2017%3A00&g0.expr=node_netstat_Tcp_CurrEstab%7Binstance%3D~%22b.echo.th.ooni.io.*%22%7D&g0.tab=0) leaking sockets at port tcp/57002 (TCPEchoHelper)
- 1004 connections were enough to kill the daemon. They came from 395 distinct IPs: only 99 IPs had more than one connection, only 17 had more than 10, and the top 5 IPs had {55,52,33,32,32} connections
- the init script's `status` was reporting no `oonib` processes, `reboot` was an "easy" way to restart the service
```
ooni-backend Status
Listing all oonib procs
No running oonib procs
```
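The failure mode above can be reproduced in miniature. The sketch below is not `oonib` code: `exhaust_fds` is a hypothetical helper that lowers the soft file-descriptor limit of the current process and then holds sockets open without closing them until the kernel answers with `EMFILE`. An `accept()` call made at that point fails the same way, while the pending connection stays in the listen queue and keeps the listening socket readable, which is what lets an event loop spin at 100% CPU. That 1004 connections sat close to a default soft limit of 1024 is an assumption, not something measured during the incident.

```python
import errno
import resource
import socket


def exhaust_fds(limit=64):
    """Lower the soft FD limit, then leak sockets until the kernel refuses.

    Returns (errno of the failure, number of sockets opened before it).
    Hypothetical helper for illustration only.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Never ask for more than the hard limit allows.
    new_soft = limit if hard == resource.RLIM_INFINITY else min(limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    held = []
    try:
        while True:
            # Each socket consumes one descriptor and is never closed,
            # which mimics the slow leak.
            held.append(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    except OSError as exc:
        return exc.errno, len(held)
    finally:
        # Release everything and restore the original soft limit.
        for sock in held:
            sock.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


if __name__ == "__main__":
    err, opened = exhaust_fds(64)
    print(errno.errorcode.get(err), "after", opened, "sockets")
```

The number of sockets opened before the failure depends on how many descriptors the process already holds (stdin, stdout, stderr, log files and so on).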

What is still unclear:
- was the service actually down? It seems it should have been, but no alerts other than `CPUHigh` were triggered
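A question like this could be settled by an end-to-end probe that checks the echo behaviour rather than just the open port. The following is a minimal sketch under assumptions: `echo_probe` is a hypothetical function, the payload is arbitrary, and it presumes the helper echoes back exactly the bytes it receives. Port 57002 is the TCPEchoHelper port mentioned above.

```python
import socket


def echo_probe(host, port, payload=b"ooni-echo-probe\n", timeout=5.0):
    """Return True if the endpoint echoes `payload` back within `timeout`.

    Hypothetical helper; any connection error, timeout or short reply
    counts as a failed probe.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(payload)
            received = b""
            # Read until the full payload is back or the peer closes.
            while len(received) < len(payload):
                chunk = sock.recv(len(payload) - len(received))
                if not chunk:
                    break
                received += chunk
            return received == payload
    except OSError:
        # Covers refused connections, resets and socket timeouts.
        return False


if __name__ == "__main__":
    print(echo_probe("b.echo.th.ooni.io", 57002))
```

A plain TCP port check can keep passing while the daemon is stuck, because the kernel completes the handshake for connections waiting in the listen queue. A probe that requires a reply fails in that situation.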

What could be done to prevent relapse and decrease impact:
- [ ] increase FD limit
- [ ] preventive restart (?)
- [ ] enable TCP keepalive (`SO_KEEPALIVE`) with low timeout values for the endpoints (?)
- [ ] monitoring for the service itself besides TCP port check (?)
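Two of the items above, the FD limit and TCP keepalive, can be sketched with the Python standard library. This is an illustration, not a patch for `oonib`: `raise_fd_limit` and `enable_keepalive` are hypothetical helpers and the numeric values are placeholders. The per-socket `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are Linux-specific, so the code only sets them when the platform exposes them.

```python
import resource
import socket


def raise_fd_limit(target=4096):
    """Raise the soft RLIMIT_NOFILE towards `target`, capped by the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Only ever raise the limit, never lower it.
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


def enable_keepalive(sock, idle=60, interval=10, probes=3):
    """Turn on keepalive so dead peers are detected and their sockets freed."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        # Seconds between probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        # Unanswered probes before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    print("soft FD limit:", raise_fd_limit())
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```

Raising the limit only delays exhaustion if sockets keep leaking. Keepalive helps with peers that vanished without closing the connection, but not with peers that are alive and simply idle.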

