Skip to content

Commit 879956f

Browse files
committed
Add blog post about flight recorders
1 parent ebad852 commit 879956f

File tree

1 file changed

+77
-0
lines changed

1 file changed

+77
-0
lines changed

_posts/2018-08-03-flight-recorders.md

Lines changed: 77 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,77 @@
1+
---
2+
layout: blog
3+
title: Houston, We’ve Had a Problem Here
4+
author: Alexander Kyte
5+
tags: [logging]
6+
---
7+
8+
## Houston, We’ve Had a Problem Here ##
9+
10+
### Bringing Flight Recorders to Mono ###
11+
12+
There are few things as rewarding as printf-debugging a difficult problem.
13+
Sometimes though, they’re not an appropriate solution. A bug that only
14+
reproduces after hours of use can lead to tens of gigabytes of logging files in
15+
the worst case. Or, even worse, you might have a Heisenbug. When you try to
16+
print the variables involved in your bug, the IO involved prevents the race
17+
from happening entirely. Left with no good way to debug the main problem,
18+
you’re forced to read the code really carefully and put checks outside of
19+
critical regions to check when the bug has happened.
20+
21+
### Prototyping a Solution ###
22+
23+
This type of problem very often complained about with respect to the mono
24+
debugger. It can be hard to isolate whether a mono bug, a C# application bug,
25+
or debugger misuse were the cause of a failure. People find themselves trying
26+
to understand what mono’s view of the world is, based on the output they see
27+
in their debuggers. Submitting a bug is hard work, as is arguing that the
28+
error is not between the keyboard and the chair.
29+
30+
It was this that motivated the creation of the debugger flight recorder.
31+
A flight recorder, in software engineering, is a system that takes messages
32+
without overhead or waits and can print out the last hundred retained messages
33+
or so. The main use of it is to avoid filling up your disk, slowing down your
34+
application, or paying other costs associated with logging incredibly
35+
verbosely around a problem.
36+
37+
The debugger’s existing logging was extracted into a number of functions that
38+
created a global state machine for the debugger. These transitions were logged
39+
into the ring buffer of the flight recorder. While being rather mundane, it
40+
proved useful for catching bugs that otherwise involved fear, uncertainty,
41+
doubt, and code that is pushed to master to assert with good error messages
42+
when a known bug is hit.
43+
44+
### Generalized Logging with the Flight Recorder ###
45+
46+
Shortly later, we decided to bring the flight recorder mechanism into the big
47+
leagues. We were going to replace all of our logging with it, when it was
48+
enabled. This was not trivial. While taking a global lock every time you log is
49+
very slow, lockless ring buffers were somehow slower. Something had to change.
50+
I was inspired by research that suggested that a dedicated thread with a message
51+
inbox was more scalable than lockless compare-and-set data structures
52+
(ffwd by S. Roghanchi - 2017).
53+
54+
Leveraging existing lockless message queues and threading primitives, and
55+
patterns from our profiler, I created a very simple message-passing system.
56+
This message passing system was given some callbacks that would make a thread
57+
to serve as a global flight recorder. All other threads dump their logs in a
58+
wait-free manner into the queue, and the ring buffer drains the queue.
59+
In short, a thread now owns the responsibility of managing all the logs.
60+
It scales incredibly well, and does not appear to interfere with race
61+
conditions reproducing.
62+
63+
### Impact on You ###
64+
65+
To use the flight recorder in your application today, export
66+
`MONO_LOG_DEST=flight-recorder` and you will see the log suffix on crash or on
67+
shutdown. It’s really rather exciting, to have an application-level view of all
68+
of the steps that one would usually have to mentally trace while debugging. It’s
69+
freeing to no longer choose between logging and performance, between verbosity
70+
and bug reproduction.
71+
72+
73+
74+
75+
76+
77+

0 commit comments

Comments
 (0)