
b.echo.th.ooni.io possibly down for 8 hours #244

Open
@darkk

Impact: 8h 50m downtime of b.echo.th.ooni.io test helper (?)

Detection: CPUHigh alert with expected 8h delay

Timeline UTC:
17 Nov 07:30 CPU spikes to 100%; that's an accept() vs. EMFILE busy loop
17 Nov 15:34 CPUHigh alert firing
17 Nov 15:48 @darkk logs into the VM, confirms 100% CPU
17 Nov 16:14 @darkk logs into the VM, looks at oonib, reboots the VM
17 Nov 16:20 everything recovers to normal
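
The accept() vs. EMFILE busy loop from the timeline can be reproduced in isolation. The following is a minimal sketch, not oonib code: it assumes a Unix host, and the function name and the demo limit of 256 descriptors are made up for illustration. It fills the process's file descriptor table, then shows that the listening socket keeps being reported as readable while every accept() fails with EMFILE, which is what makes an event loop spin at 100% CPU.

```python
import errno
import os
import resource
import select
import socket


def accept_spins_on_emfile(rounds=3):
    """Count consecutive rounds where the listener is readable but accept() hits EMFILE."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Demo-only limit (hypothetical value) so the descriptor table fills quickly.
    if soft == resource.RLIM_INFINITY or soft > 256:
        demo_soft = 256
    else:
        demo_soft = soft

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(8)
    listener.setblocking(False)
    # One pending connection that the server will never manage to accept.
    client = socket.create_connection(listener.getsockname())

    hogs = []
    spins = 0
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (demo_soft, hard))
        # Exhaust the descriptor table, standing in for the leaked sockets.
        try:
            while True:
                hogs.append(os.dup(listener.fileno()))
        except OSError as exc:
            if exc.errno != errno.EMFILE:
                raise
        for _ in range(rounds):
            readable, _, _ = select.select([listener], [], [], 1.0)
            if not readable:
                break
            try:
                conn, _ = listener.accept()
            except OSError as exc:
                if exc.errno != errno.EMFILE:
                    raise
                # The connection stays queued, so the next select() returns at once.
                spins += 1
            else:
                conn.close()
                break
    finally:
        # Release the hog descriptors and restore the original limit.
        for fd in hogs:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
        client.close()
        listener.close()
    return spins
```

If every round ends in EMFILE, the function returns `rounds`: the pending connection is never dequeued, so a loop without back-off would never sleep.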

What went well:

  • resource utilisation alerts are actually useful!

What went wrong:

  • oonib was slowly leaking sockets on port tcp/57002 (TCPEchoHelper)
  • 1004 connections were enough to kill the daemon. They came from 395 distinct IPs; only 99 IPs had more than one connection, only 17 had more than 10, and the top 5 IPs had {55,52,33,32,32} connections
  • status of the init script was reporting nothing, so a reboot was an "easy" way to restart the service:
ooni-backend Status
Listing all oonib procs
No running oonib procs
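
The per-IP breakdown quoted above (distinct IPs, IPs with more than one or more than ten connections, top 5) could be computed with a tally along these lines. This is a sketch only: the function name is hypothetical, and it assumes the peer addresses of the open connections have already been extracted into a list, for example from socket statistics on the VM.

```python
from collections import Counter


def summarize_peers(peer_ips):
    """Summarize how open connections are distributed across peer IPs."""
    per_ip = Counter(peer_ips)
    counts = per_ip.values()
    return {
        "connections": sum(counts),
        "distinct_ips": len(per_ip),
        "ips_with_more_than_1": sum(1 for n in counts if n > 1),
        "ips_with_more_than_10": sum(1 for n in counts if n > 10),
        # Connection counts of the five busiest peers, highest first.
        "top5": [n for _, n in per_ip.most_common(5)],
    }
```

A small made-up input (documentation addresses, not incident data) shows the shape of the result: `summarize_peers(["192.0.2.1"] * 12 + ["192.0.2.2"] * 3 + ["192.0.2.3"])` reports 16 connections from 3 distinct IPs, 2 of them with more than one connection and 1 with more than ten.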

What is still unclear:

  • was the service actually down? It seems it should have been, but no alerts other than CPUHigh were triggered

What could be done to prevent relapse and decrease impact:

  • increase FD limit
  • preventive restart (?)
  • add TCP keepalive (SO_KEEPALIVE) with low timeout values for the endpoints (?)
  • monitoring for the service itself besides TCP port check (?)
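
Three of the ideas above can be sketched with the Python standard library. None of this is taken from oonib: the helper names, the default values (4096 descriptors, 60s/10s/3 keepalive settings, the probe payload) are assumptions for illustration, and the check assumes the helper echoes back whatever it receives. The `TCP_KEEP*` options are set only where the platform exposes them.

```python
import resource
import socket


def raise_fd_limit(target=4096):
    """Raise the soft RLIMIT_NOFILE towards target, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already at least as high as requested
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


def enable_keepalive(sock, idle=60, interval=10, count=3):
    """Enable TCP keepalive so dead peers get detected and their sockets closed."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning is platform-specific; skip options that are missing.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


def echo_check(host, port, payload=b"echo-probe\n", timeout=5.0):
    """Return True only if the service echoes the payload back unchanged."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(payload)
            received = b""
            while len(received) < len(payload):
                chunk = conn.recv(len(payload) - len(received))
                if not chunk:
                    break  # peer closed before echoing everything
                received += chunk
        return received == payload
    except OSError:
        # Covers refused connections, resets, and timeouts.
        return False
```

A probe such as `echo_check("b.echo.th.ooni.io", 57002)` goes one step beyond a TCP port check: the kernel can still complete handshakes on a listening port while the daemon is unable to accept(), so only a round trip through the application shows that the service is really answering.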

Labels: incident, tracking (this is a recurring incident or bug)
