|
| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: Houston, We’ve Had a Problem Here |
| 4 | +author: Alexander Kyte |
| 5 | +tags: [logging] |
| 6 | +--- |
| 7 | + |
| 8 | +## Houston, We’ve Had a Problem Here ## |
| 9 | + |
| 10 | +### Bringing Flight Recorders to Mono ### |
| 11 | + |
| 12 | +There are few things as rewarding as printf-debugging a difficult problem. |
| 13 | +Sometimes though, they’re not an appropriate solution. A bug that only |
| 14 | +reproduces after hours of use can lead to tens of gigabytes of logging files in |
| 15 | +the worst case. Or, even worse, you might have a Heisenbug. When you try to |
| 16 | +print the variables involved in your bug, the IO involved prevents the race |
| 17 | +from happening entirely. Left with no good way to debug the main problem, |
| 18 | +you’re forced to read the code really carefully and put checks outside of |
| 19 | +critical regions to check when the bug has happened. |
| 20 | + |
| 21 | +### Prototyping a Solution ### |
| 22 | + |
| 23 | +This type of problem very often complained about with respect to the mono |
| 24 | +debugger. It can be hard to isolate whether a mono bug, a C# application bug, |
| 25 | +or debugger misuse were the cause of a failure. People find themselves trying |
| 26 | +to understand what mono’s view of the world is, based on the output they see |
| 27 | +in their debuggers. Submitting a bug is hard work, as is arguing that the |
| 28 | +error is not between the keyboard and the chair. |
| 29 | + |
| 30 | +It was this that motivated the creation of the debugger flight recorder. |
| 31 | +A flight recorder, in software engineering, is a system that takes messages |
| 32 | +without overhead or waits and can print out the last hundred retained messages |
| 33 | +or so. The main use of it is to avoid filling up your disk, slowing down your |
| 34 | +application, or paying other costs associated with logging incredibly |
| 35 | +verbosely around a problem. |
| 36 | + |
| 37 | +The debugger’s existing logging was extracted into a number of functions that |
| 38 | +created a global state machine for the debugger. These transitions were logged |
| 39 | +into the ring buffer of the flight recorder. While being rather mundane, it |
| 40 | +proved useful for catching bugs that otherwise involved fear, uncertainty, |
| 41 | +doubt, and code that is pushed to master to assert with good error messages |
| 42 | +when a known bug is hit. |
| 43 | + |
| 44 | +### Generalized Logging with the Flight Recorder ### |
| 45 | + |
| 46 | +Shortly later, we decided to bring the flight recorder mechanism into the big |
| 47 | +leagues. We were going to replace all of our logging with it, when it was |
| 48 | +enabled. This was not trivial. While taking a global lock every time you log is |
| 49 | +very slow, lockless ring buffers were somehow slower. Something had to change. |
| 50 | +I was inspired by research that suggested that a dedicated thread with a message |
| 51 | +inbox was more scalable than lockless compare-and-set data structures |
| 52 | +(ffwd by S. Roghanchi - 2017). |
| 53 | + |
| 54 | +Leveraging existing lockless message queues and threading primitives, and |
| 55 | +patterns from our profiler, I created a very simple message-passing system. |
| 56 | +This message passing system was given some callbacks that would make a thread |
| 57 | +to serve as a global flight recorder. All other threads dump their logs in a |
| 58 | +wait-free manner into the queue, and the ring buffer drains the queue. |
| 59 | +In short, a thread now owns the responsibility of managing all the logs. |
| 60 | +It scales incredibly well, and does not appear to interfere with race |
| 61 | +conditions reproducing. |
| 62 | + |
| 63 | +### Impact on You ### |
| 64 | + |
| 65 | +To use the flight recorder in your application today, export |
| 66 | +`MONO_LOG_DEST=flight-recorder` and you will see the log suffix on crash or on |
| 67 | +shutdown. It’s really rather exciting, to have an application-level view of all |
| 68 | +of the steps that one would usually have to mentally trace while debugging. It’s |
| 69 | +freeing to no longer choose between logging and performance, between verbosity |
| 70 | +and bug reproduction. |
| 71 | + |
| 72 | + |
| 73 | + |
| 74 | + |
| 75 | + |
| 76 | + |
| 77 | + |
0 commit comments