|
| 1 | +--- |
| 2 | +date: 2025-01-02 |
| 3 | +title: A Software Observability Roundup |
| 4 | +excerpt: |
| 5 | + I spent some time recently catching up on my #to-read saves in Obsidian. More than a few |
| 6 | + of these were blog posts from 2024 about software observability. Talk of "redefining observability", |
| 7 | + "observability 2.0", and "try Honeycomb" had caught my eye in a few spaces, |
| 8 | + and so I had been hoarding links on the topic. After spending a few days immersing myself in those |
| 9 | + articles and branching out to others, I decided to write this bullet-form roundup. |
| 10 | +--- |
| 11 | + |
| 12 | +I spent some time recently catching up on my `#to-read` saves in Obsidian. More than a few of these |
| 13 | +were blog posts from 2024 about _software observability_. Talk of "redefining observability", |
| 14 | +"observability 2.0", and "try [Honeycomb](https://honeycomb.io)" had caught my eye in a few spaces, |
| 15 | +and so I had been hoarding links on the topic. |
| 16 | + |
| 17 | +After spending a few days immersing myself in those articles and branching out to others, I decided |
| 18 | +to write this bullet-form roundup: |
| 19 | + |
| 20 | +1. for myself, as a way of solidifying my current understanding |
| 21 | +2. in public, as a way to invite corrections and improvements (drop a [comment](#userComments) below |
| 22 | + or [@parente.dev on Bluesky](https://bsky.app/profile/parente.dev)!) |
| 23 | +3. with my colleagues in mind, as a new way to approach and discuss an evergreen question: |
| 24 | + |
| 25 | +**As our [issue space](https://www.thorn.org/research/state-of-the-issue/) changes and grows, and |
| 26 | +[our solutions](https://www.thorn.org/solutions/) adapt and scale in response, what (else) should we |
| 27 | +do today so that we can readily address unknown-unknowns tomorrow?** |
| 28 | + |
| 29 | +--- |
| 30 | + |
| 31 | +# Overview |
| 32 | + |
| 33 | +The seventeen [references](#references) I surveyed offer perspectives on observability as it |
| 34 | +pertains both to software systems and organizations around them. They cover what observability is, |
| 35 | +what problems it solves, and how it is and should be implemented. There's alignment among the |
| 36 | +authors on the state of affairs, learned best practices, and a direction in which the industry |
| 37 | +should head. Shared terminology and goals are works in progress. |
| 38 | + |
| 39 | +# Origins |
| 40 | + |
| 41 | +- According to control theory, _observability_ is a measure of how well internal states of a |
| 42 | + system can be inferred from knowledge of its external outputs.[^wikipedia2022] |
| 43 | +- The discipline of software engineering (distributed computing, site reliability engineering, et |
| 44 | + al.) has not settled on a single definition. One that stays close to the control theory original |
| 45 | + is that _software observability_ measures how well a system's state can be understood from the |
| 46 | + obtained telemetry.[^wikipedia2022] |
| 47 | +- Metrics, logs, and traces caught on as the three kinds of telemetry required to observe a |
| 48 | + software system—the so-called "three pillars of observability." |
| 49 | + - ... perhaps because they helped build a shared vocabulary at the 2017 Distributed Tracing |
| 50 | + Summit.[^bourgon2017] |
| 51 | + - ... perhaps because they _do_ provide a comprehensive way for engineers to _monitor_ systems |
| 52 | + for _known_ problems and hint at where the issue lies.[^parker2024] |
| 53 | + - ... perhaps because solutions for monitoring systems using metrics, logs, and traces are |
| 54 | + what vendors had to sell.[^majors2024aug] |
| 55 | + |
| 56 | +# Problems and Limitations |
| 57 | + |
| 58 | +- The task of analyzing disjoint metrics, logs, and trace data falls on humans when using |
| 59 | + three-pillar systems designed primarily for monitoring.[^sigelman2021a] |
| 60 | + - Moving beyond investigation of known-knowns is difficult without data and tooling designed |
| 61 | + to support correlations and experimentation.[^weakly2024oct] |
| 62 | + - Use of monitoring tools leads to organizational reliance on the intuition of a few system experts, |
| 63 | + resulting in cognitive costs and bus-factor risks. Low visibility slows development and |
| 64 | + reduces team confidence.[^majors2024jan] |
| 65 | + - Using CloudWatch logs, CloudWatch metrics, and X-Ray traces together, for example, requires |
| 66 | + users to infer answers to questions from their mental model of the system, incomplete data, |
| 67 | + disparate views, and reading of code.[^tane2024dec] |
| 68 | +- The three-pillar data model constrains the types of questions that can be asked and answered, |
| 69 | + with an almost exclusive focus on engineering concerns. Even mature observability programs will |
| 70 | + struggle to answer questions of greater interest and value _to the business_[^parker2024], such |
| 71 | + as: |
| 72 | + - What's the relationship between system performance and conversions, by funnel stage, broken |
| 73 | + down by geo, device, and intent signals? |
| 74 | + - What's our cost of goods sold per request, per customer, with real-time pricing data of |
| 75 | + resources? |
| 76 | + - How much does each marginal API request to our enterprise data endpoint cost in terms of |
| 77 | + availability for lower-tiered customers? Enough to justify automation work? |
| 78 | +- There are many sources of truth when disparate formats (metrics, logs, traces) and/or tools are |
| 79 | + in play, with decisions made at write-time about how the data will be used in the future. |
| 80 | + [^majors2024nov] |
| 81 | +- The value of metrics, logs, and (un-sampled) traces does not scale with the costs required to |
| 82 | + collect, transfer, and store them.[^sigelman2021a] As the bill goes up, the value stays constant |
| 83 | + at best, and more likely _decreases_.[^majors2024jan] |
| 84 | + - Logs get noisier and slower to search as volume grows. |
| 85 | + - Custom metrics require more forethought and auditing as the set grows over time. |
| 86 | +- "At the end, the three pillars of observability do not exist. It's not something we should be |
| 87 | + relying on."[^tane2024dec] |
| 88 | + |
| 89 | + - The coexistence of metrics, logging, and tracing is not _observability_. They are |
| 90 | + _telemetry_ useful in _monitoring_ systems.[^sigelman2021b] |
| 91 | + |
| 92 | +<a name="better-practices"></a> |
| 93 | + |
| 94 | +# Better Practices |
| 95 | + |
| 96 | +- Instrument applications to emit "wide events" (or "canonical logs" or "structured logs") as your |
| 97 | + telemetry data. |
| 98 | + |
| 99 | + - Wide events have high-dimensionality (many attributes) and attributes with high-cardinality |
| 100 | + (many possible unique values) making them context-rich (everything about the event is |
| 101 | + attached to it).[^tane2024sept] |
| 102 | + - "High-dimensionality" roughly equates with **hundreds** of attributes at present. Metadata |
| 103 | + about hosts, pods, builds, requests, responses, users, customers, timing, errors, teams, |
| 104 | + services, versions, third-party vendors, etc. are all fair game.[^morrell2024] |
| 105 | + |
| 106 | +- Have a single source of truth which stores the wide events as they are emitted. |
| 107 | + |
| 108 | + - Do no aggregation at write-time. Make decisions at read-time about how to query and use the |
| 109 | + data.[^majors2024nov] [^tane2024sept] |
| 110 | + - Wide events from a service continuously handling 1000 requests per second—about 1 million |
| 111 | + events per day—can compress to about 80 MB in columnar formats like Parquet and cost |
| 112 | + pennies to retain for a few months in typical object stores.[^morrell2024] |
| 113 | + - Custom metrics are effectively infinite as costs no longer increase linearly (thanks to |
| 114 | + columnar data storage) and the ability to cross-correlate increases as more event attributes |
| 115 | + are added. Intelligent sampling can control volume costs associated with these structured |
| 116 | + events when scale demands it.[^majors2024jan] |
| 117 | + - Storing event data in one place lends itself to the application of AI tools, which are good |
| 118 | + at correlating and summarizing[^burmistrov2024], perhaps continually in the |
| 119 | + background.[^tane2024dec] |
| 120 | + |
| 121 | +- Adopt exploratory tooling that lets you quickly and cheaply query that data about |
| 122 | + emergent behaviors, new questions, and unknown unknowns. |
| 123 | + |
| 124 | + - Proper tooling allows engineers to investigate any system, regardless of their experience |
| 125 | + with it or its complexity, in a methodical and objective manner.[^majors2022] |
| 126 | + - The waterfall view of traces, root spans, nested spans, and the like is _not_ sufficient. |
| 127 | + Users need the ability to "dig" into data however they deem necessary.[^burmistrov2024] |
| 128 | + - You will never ask the same question twice. Something is different since you last asked |
| 129 | + it.[^weakly2024mar] |
| 130 | + - There is a natural tension between a system’s scalability and its feature set. You can |
| 131 | + afford much more powerful observability features at scales orders of magnitude smaller than |
| 132 | + Google.[^sigelman2021a] |
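
To make the first two practices concrete, here is a minimal Python sketch, not taken from any of the referenced posts. It emits one wide event per request into a single store and postpones every aggregation decision until read time. The field names, the in-memory `STORE` list standing in for a real event store, and the `emit_wide_event` and `summarize` helpers are all hypothetical.

```python
import json
import time
from collections import defaultdict
from statistics import median

# Hypothetical stand-in for the single source of truth (a real system might
# write to object storage or a columnar database instead).
STORE: list[str] = []


def emit_wide_event(path, user_id, tier, status_code, duration_ms, **extra):
    """Record one context-rich event per request as a single JSON line."""
    event = {
        "timestamp": time.time(),
        "http.path": path,
        "http.status_code": status_code,
        "user.id": user_id,        # high cardinality: one value per user
        "customer.tier": tier,
        "duration_ms": duration_ms,
        **extra,                   # build, region, feature flags: all fair game
    }
    STORE.append(json.dumps(event))  # stored as emitted, no write-time aggregation
    return event


def summarize(group_by):
    """Decide at read time how to slice the raw events."""
    groups = defaultdict(list)
    for line in STORE:
        event = json.loads(line)
        groups[event.get(group_by)].append(event)
    return {
        key: {
            "count": len(rows),
            "error_rate": sum(r["http.status_code"] >= 500 for r in rows) / len(rows),
            "median_duration_ms": median(r["duration_ms"] for r in rows),
        }
        for key, rows in groups.items()
    }


emit_wide_event("/checkout", "u-1", "free", 200, 120.0, build="v1")
emit_wide_event("/checkout", "u-2", "enterprise", 500, 900.0, build="v2")
emit_wide_event("/search", "u-1", "free", 200, 40.0, build="v1")
emit_wide_event("/checkout", "u-3", "enterprise", 200, 300.0, build="v2")
emit_wide_event("/search", "u-3", "enterprise", 200, 60.0, build="v1")

by_path = summarize("http.path")      # a question someone might plan for
by_tier = summarize("customer.tier")  # a question nobody planned for at write time
```

The same raw events answer both questions because nothing was pre-aggregated. A metrics-first design would have needed a separate counter or histogram defined ahead of time for each breakdown.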
| 133 | + |
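As a toy illustration of the "digging" that exploratory tooling is meant to support, the sketch below compares how often each attribute value appears among a set of interesting events (here, server errors) with how often it appears overall. It is an assumption-laden example rather than any vendor's algorithm: the `standout_values` helper, the sample events, and the scoring rule are all made up for illustration, and attribute values are assumed to be simple hashable scalars.

```python
from collections import Counter


def standout_values(events, is_interesting, ignore=(), top=3):
    """Rank (attribute, value) pairs by how over-represented they are
    among the interesting events compared with all events."""
    selected = [e for e in events if is_interesting(e)]
    if not selected:
        return []

    def count_pairs(rows):
        counts = Counter()
        for row in rows:
            for key, value in row.items():
                if key not in ignore:
                    counts[(key, value)] += 1
        return counts

    all_counts = count_pairs(events)
    selected_counts = count_pairs(selected)

    scored = []
    for pair, n in selected_counts.items():
        # Share among interesting events minus share among all events.
        lift = n / len(selected) - all_counts[pair] / len(events)
        scored.append((round(lift, 3), pair))
    # Highest lift first; ties broken by the pair's text for stable output.
    scored.sort(key=lambda item: (-item[0], str(item[1])))
    return scored[:top]


events = [
    {"build": "v1", "region": "us", "status": 200},
    {"build": "v1", "region": "eu", "status": 200},
    {"build": "v2", "region": "us", "status": 500},
    {"build": "v2", "region": "eu", "status": 500},
    {"build": "v1", "region": "us", "status": 200},
    {"build": "v2", "region": "us", "status": 200},
]

# Which attribute values stand out among the failing requests?
suspects = standout_values(events, lambda e: e["status"] >= 500, ignore={"status"})
```

With this sample data the top suspect is the `v2` build, which appears in every failing event but only half of all events. Because the comparison runs over whatever attributes the events carry, adding more context to each event widens the set of questions it can answer.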
| 134 | +# Looking Forward |
| 135 | + |
| 136 | +- Confusion abounds about what observability really is[^burmistrov2024] to the point that folks |
| 137 | + are actively redefining it[^weakly2024mar] [^parker2024] or versioning it[^majors2024aug] |
| 138 | + [^weakly2024dec] to improve clarity. |
| 139 | + |
| 140 | + - "Pretty much everything in business is about asking questions and forming hypotheses, then |
| 141 | + testing them." That's observability.[^parker2024] |
| 142 | + - The cognitive systems engineering definition of observability—feedback that provides |
| 143 | + insight into a process and refers to the work needed to extract meaning from available |
| 144 | + data—may be a better starting point for software engineering.[^weakly2024mar] |
| 145 | + - "Observability is the process through which one develops the ability to ask meaningful questions, |
| 146 | + get useful answers, and act effectively on what you learn." It is not a tooling problem but |
| 147 | + rather a strategic capability akin to business intelligence.[^weakly2024mar] |
| 148 | + - "Observability 2.0 has one source of truth, wide structured log events, from which you can |
| 149 | + _derive_ all the other data types." The benefit to the full software development lifecycle, |
| 150 | + the cost model, and the adoption by a critical mass of developers make observability 2.0 |
| 151 | + inevitable.[^majors2024nov] |
| 152 | + - "Observability 1.0 gave us lots of useful answers, observability 2.0 gives us the potential |
| 153 | + to ask meaningful questions, and observability 3.0 is going to give us the ability to act |
| 154 | + effectively on what we learn."[^weakly2024dec] |
| 155 | + |
| 156 | +- There is consensus on the direction in which software observability should head: toward the |
| 157 | + [better practices](#better-practices) mentioned earlier. Discussion continues to establish |
| 158 | + shared language and goals. |
| 159 | + |
| 160 | + - "Observability 3.0 will be measured by the value that non-engineering functions in the |
| 161 | + business are able to get from it."[^weakly2024dec] |
| 162 | + - "The success of Observability 2.0 will be measured by how well engineering teams can |
| 163 | + understand their decisions and describe what they do in the language of the |
| 164 | + business."[^majors2024dec] |
| 165 | + |
| 166 | +<a name="references"></a> |
| 167 | + |
| 168 | +# References |
| 169 | + |
| 170 | +[^wikipedia2022]: |
| 171 | + [Observability |
| 172 | + (software)](<https://en.wikipedia.org/w/index.php?title=Observability_(software)&oldid=1225628905>). |
| 173 | + (2024, May 24). In _Wikipedia_. |
| 174 | + |
| 175 | +[^bourgon2017]: |
| 176 | + Bourgon, P. (2017, February 21). [Metrics, tracing, and |
| 177 | + logging](https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html). _Peter |
| 178 | + Bourgon's Blog_. |
| 179 | + |
| 180 | +[^parker2024]: |
| 181 | + Parker, A. (2024, March 29). [Re-Redefining |
| 182 | + Observability](https://aparker.io/2024/03/re-redefining-observability/). _Austin Parker's Blog_. |
| 183 | + |
| 184 | +[^majors2024aug]: |
| 185 | + Majors, C. (2024, August 7). [Is It Time To Version Observability? (Signs Point To |
| 186 | + Yes)](https://charity.wtf/2024/08/07/is-it-time-to-version-observability-signs-point-to-yes/). |
| 187 | + _charity.wtf_. |
| 188 | + |
| 189 | +[^sigelman2021a]: |
| 190 | + Sigelman, B. (2021, February 4). [Debunking the 'Three Pillars of Observability' |
| 191 | + Myth](https://softwareengineeringdaily.com/2021/02/04/debunking-the-three-pillars-of-observability-myth/). |
| 192 | + _Software Engineering Daily_. |
| 193 | + |
| 194 | +[^weakly2024oct]: |
| 195 | + Weakly, H. (2024, October 3). [The 4 Evolutions of Your Observability |
| 196 | + Journey](https://thenewstack.io/the-4-evolutions-of-your-observability-journey/). _The New |
| 197 | + Stack_. |
| 198 | + |
| 199 | +[^sigelman2021b]: |
| 200 | + Sigelman, B. (2021, February 4). [Observability Won’t Replace Monitoring (Because |
| 201 | + It |
| 202 | + Shouldn’t)](https://thenewstack.io/observability-wont-replace-monitoring-because-it-shouldnt/). |
| 203 | + _The New Stack_. |
| 204 | + |
| 205 | +[^majors2024jan]: |
| 206 | + Majors, C. (2024, January 24). [The Cost Crisis in Observability |
| 207 | + Tooling](https://www.honeycomb.io/blog/cost-crisis-observability-tooling). _Honeycomb Blog_. |
| 208 | + |
| 209 | +[^tane2024dec]: |
| 210 | + Tane, B. & Galbraith, K. (2024, December 6). [Observing Serverless Applications |
| 211 | + (SVS212)](https://youtu.be/mPbI3Qxdocc) [Conference presentation]. AWS re:Invent 2024 Las Vegas, |
| 212 | + Nevada, United States. |
| 213 | + |
| 214 | +[^majors2024nov]: |
| 215 | + Majors, C. (2024, November 19). [There Is Only One Key Difference Between |
| 216 | + Observability 1.0 and |
| 217 | + 2.0](https://www.honeycomb.io/blog/one-key-difference-observability1dot0-2dot0). _Honeycomb |
| 218 | + Blog_. |
| 219 | + |
| 220 | +[^tane2024sept]: |
| 221 | + Tane, B. (2024, September 8). [Observability Wide Events |
| 222 | + 101](https://boristane.com/blog/observability-wide-events-101/). _Boris Tane's Blog_. |
| 223 | + |
| 224 | +[^morrell2024]: |
| 225 | + Morrell, J. (2024, October 22). [A Practitioner's Guide to Wide |
| 226 | + Events](https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/). _Jeremy Morrell's |
| 227 | + Blog_. |
| 228 | + |
| 229 | +[^majors2022]: |
| 230 | + Majors, C., Fong-Jones, L., & Miranda, G. (2022, May 6). [Observability Engineering: |
| 231 | + Achieving production |
| 232 | + excellence](https://learning.oreilly.com/library/view/observability-engineering/9781492076438/). |
| 233 | + O’Reilly Media, Inc. |
| 234 | + |
| 235 | +[^burmistrov2024]: |
| 236 | + Burmistrov, I. (2024, February 15). [All you need is Wide Events, not "Metrics, |
| 237 | + Logs and Traces"](https://isburmistrov.substack.com/p/all-you-need-is-wide-events-not-metrics). |
| 238 | + _A Song Of Bugs And Patches_. |
| 239 | + |
| 240 | +[^weakly2024mar]: |
| 241 | + Weakly, H. (2024, March 15). [Redefining |
| 242 | + Observability](https://hazelweakly.me/blog/redefining-observability/). _Hazel Weakly's Blog_. |
| 243 | + |
| 244 | +[^weakly2024dec]: |
| 245 | + Weakly, H. (2024, December 9). [The Future of Observability: Observability |
| 246 | + 3.0](https://hazelweakly.me/blog/the-future-of-observability-observability-3-0/). _Hazel |
| 247 | + Weakly's Blog_. |
| 248 | + |
| 249 | +[^majors2024dec]: |
| 250 | + Majors, C. (2024, December 20). [On Versioning Observabilities (1.0, 2.0, |
| 251 | + 3.0…10.0?!?)](https://charity.wtf/2024/12/20/on-versioning-observabilities-1-0-2-0-3-0-10-0/). |
| 252 | + _charity.wtf_. |