Commit 04f79af

Add telemetry blog post
1 parent ebad852 commit 04f79af

1 file changed

+109
-0
lines changed

@@ -0,0 +1,109 @@
---
layout: blog
title: Automatic Bug Reporting
author: Alexander Kyte
tags: [logging]
---

## Automatic Bug Reporting ##

Software engineers often like to think of exceptional paths as being rarely taken.
While this is hopefully the case on a customer’s machine, the engineer will see a
program fail far more often than they will watch it succeed. Every engineer who has
had to suffer bad tooling of one form or another becomes aware of this fact, much as a
person with a broken foot becomes aware of how far away things are.

The modern software lifecycle does not end when you ship a piece of code. Defects in
shipped software are reported, studied, reproduced, and then debugged. A significant
portion of the time it takes to fix a bug can be spent discovering that it exists.
Often this discovery includes finding the ways that the customer’s environment differs
from the developer’s testing environment. A back-and-forth conversation can give a
developer a lot of information, but not every bug filer is motivated enough to keep
returning and responding.

### Trust and Privacy ###

A desire to automate the process of finding out what made a bit of code crash has led
to a lot of innovations in software engineering tooling. Unfortunately, many attempts to
address this problem have completely lost the trust of the customer base. As soon as a
brand is associated with “tracking,” people stop affording the company the benefit of the
doubt. Telemetry mistakes that seemed sinister (and sometimes correctly so) have led to the
passing of the GDPR in Europe.

With all of this in mind, a team attempting to solve the telemetry problem is faced with
sweaty palms. Adding integrated telemetry support to the Mono runtime meant balancing a
number of concerns.

I believe that we did pretty well. Any constructive criticism through official channels is
appreciated. I will now describe the balance that we struck.

### Domain Details ###

Any telemetry system can be separated into a number of components:

1. That which collects information about runtime state during the crash
2. That which moves it from the customer machine to the developer machine
3. That which manipulates and aggregates the crashes

The concerns in the various parts are rather contradictory. Part 3 should be private,
as information about bugs may pose security risks. Keeping Part 3 private often means that
Part 2 needs to be proprietary and closed-source as well. I don’t think that many people disagree
with these two. The information being sent (and how it was collected in Part 1) is the part that is
subject to the most scrutiny.

All of the source code that creates the files written during a crash, which are what drive Part 2,
is open source and completely auditable. Folks can play along at home
(source is [here](https://github.com/mono/mono/blob/5672eba58212345b8e9722587533c325a0c5825d/mono/utils/mono-state.c))
as they continue to read, if they would like to confirm what I am saying.

### Implementation ###

Being privacy-preserving is more than just having our policies in open-source code; we must have
good policies. To abide by the GDPR we cannot collect any Personally Identifying Information (PII).
What is PII for a bug report? Most of the choices were very straightforward, but a few required
careful work to preserve our desired behavior. Beyond avoiding sending file paths from a user’s
machine or data from their code, we also cannot capture their code in our stack trace. If a
Visual Studio for Mac extension contains code that is not ours, we should not collect information
about its crashes. It is simple enough to only report the UUID, token, and managed IL offset
(CIL metadata about the class), which all work together like a unique hash. Without the C# assembly
file in question, you don’t know which hash goes to which file. It’s a primary key, but the only
people who have the mapping already have the private data.
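
The idea of reporting only opaque identifiers can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `mono-state.c`: `LocalFrame`, `to_report`, `resolve`, and the shape of `symbol_maps` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LocalFrame:
    """Hypothetical in-process view of a managed frame (not a Mono type)."""
    guid: str         # identifier of the assembly the method lives in
    token: int        # metadata token of the method
    il_offset: int    # offset into the method's IL
    method_name: str  # readable name: stays on the user's machine
    source_path: str  # file path: stays on the user's machine


def to_report(frame: LocalFrame) -> Dict[str, str]:
    """Keep only the opaque identifiers; names and paths are never copied."""
    return {
        "guid": frame.guid,
        "token": hex(frame.token),
        "il_offset": hex(frame.il_offset),
    }


def resolve(report: Dict[str, str],
            symbol_maps: Dict[str, Dict[str, str]]) -> Optional[str]:
    """Map a reported frame back to a method name.

    symbol_maps is assumed to be {guid: {token: method_name}}, built from
    assemblies the reader already possesses. Frames from any other assembly
    cannot be resolved and come back as None.
    """
    return symbol_maps.get(report["guid"], {}).get(report["token"])
```

A report for a third-party extension still arrives, but without that extension's assembly there is no entry in `symbol_maps`, so `resolve` returns `None` and the frame stays an opaque key.
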

We needed to balance that desire against the desire for a crash from two different versions of Mono
to look very similar or identical to the backend (Part 3). Mono already has unique hashing functions
for metadata objects; we use them inside of the AOT compiler and runtime. We can then generate a
hash that is identical for two identical stack traces, while uploading a version-dependent unique
reference with the main stack trace. What this means is that if two crashes have the same hash but
otherwise different information, they are the same crash. This preserves privacy while letting you
count how often each bug is hit.
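
As a rough sketch of how a backend might use such a hash, the Python below groups reports by a digest of the trace and ignores everything version-dependent. The assumptions are loud ones: SHA-256 stands in for Mono's own metadata hashing functions, the per-frame strings are a placeholder for whatever stable metadata the runtime actually hashes, and `trace_hash`, `count_crashes`, and the report fields are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, List


def trace_hash(stable_frame_ids: Iterable[str]) -> str:
    """Digest a stack trace from per-frame identifiers that are assumed
    not to change between runtime builds. Computed on the user's machine,
    so only the digest is uploaded."""
    digest = hashlib.sha256()
    for frame_id in stable_frame_ids:
        digest.update(frame_id.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()


def count_crashes(reports: List[Dict[str, str]]) -> Counter:
    """Count reports per trace hash, ignoring version-dependent fields."""
    return Counter(report["hash"] for report in reports)
```

Two reports of the same trace from different runtime builds carry different version-dependent references but the same `hash`, so `count_crashes` puts them in one bucket.
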

To make this concrete, this is what we send for each managed frame:

    {
      "is_managed" : "true",
      "guid" : "0845998F-6B70-4AA8-9214-6731378926A0",
      "token" : "0x6003817",
      "native_offset" : "0x1fd",
      "il_offset" : "0x00071"
    }
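
For readers who want to consume a record like this, here is a minimal Python sketch that parses the sample frame. It assumes only what the sample shows: every value is a JSON string, with the boolean spelled `"true"` and the numbers written as hexadecimal. `parse_frame` is a hypothetical helper, not part of Mono.

```python
import json

SAMPLE_FRAME = """
{
  "is_managed" : "true",
  "guid" : "0845998F-6B70-4AA8-9214-6731378926A0",
  "token" : "0x6003817",
  "native_offset" : "0x1fd",
  "il_offset" : "0x00071"
}
"""


def parse_frame(text):
    """Convert the string-typed fields of a frame into native values."""
    raw = json.loads(text)
    return {
        "is_managed": raw["is_managed"] == "true",
        "guid": raw["guid"],
        # int(..., 16) accepts the "0x" prefix and leading zeros
        "token": int(raw["token"], 16),
        "native_offset": int(raw["native_offset"], 16),
        "il_offset": int(raw["il_offset"], 16),
    }


frame = parse_frame(SAMPLE_FRAME)
```

In ECMA-335 metadata the top byte of a token selects the table it indexes, so `frame["token"] >> 24` gives `0x06` here, the MethodDef table. None of the fields carries a name or a path.
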

### Bigger Impacts ###

This crash dump creation also represented a change in our philosophy with respect to error reporting.
Rather than expecting the bug report submission to be the beginning of an ongoing conversation,
we now expect it to be an anonymous message. This motivates embedding into the dump the information
that we would typically ask someone to collect for us by exporting an environment variable or
attaching a debugger. The dump needs to contain enough system clues without describing the
embedding application too closely and being creepy.

This is a much harder problem for a runtime than for a web server because we expose a lot of the
details of the underlying platform to a developer who chooses to poke and prod. The CPU model isn’t
likely to cause problems with a web server, but it leads to
[impossible bugs](https://www.mono-project.com/news/2016/09/12/arm64-icache/) with Mono.
We expose abstract state machines in the API that only roughly correlate to the bits in the machine registers.
This train of thought with respect to state machines and logging led to the flight recorders, which will be covered in a subsequent blog post.

The best part of all of this is that it is all open source. Because Mono is an embedded runtime, when you embed our telemetry engine you gain the ability to collect telemetry on your own code. Someone today can build Mono in a way that allows them to get a beautiful runtime state dump on each crash. If they don’t change too much, it’ll even be GDPR-compliant. It can be hard to get really excited about logging, but it’s easy to get excited about spending less time teaching customers to debug.
