Commit 044f020

committed
A Software Observability Roundup post
- Footnote styling improvements
1 parent 258a61f commit 044f020

5 files changed

+277
-6
lines changed

.prettierrc

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
{
2+
"tabWidth": 4,
3+
"useTabs": false
4+
}

pages/20241129-obsidian-webclipper-config/index.md

Lines changed: 2 additions & 0 deletions
@@ -138,6 +138,8 @@ I was able to visit one of the [Amazon Bedrock documentation pages](https://docs
138138
- There are a lot of knobs to turn here, from model parameters to prompt variables to note formats to template configs. Tuning and tweaking is in order.
139139
- The OpenAI and Anthropic models would likely provide better results out of the gate. I'm sticking with Ollama/Llama in the spirit of the local-only Obsidian privacy model.
140140

141+
---
142+
141143
[^1]: I'm maintaining my latest Obsidian configs in [parente/obsidian-configs on GitHub](https://github.com/parente/obsidian-configs).
142144
[^2]: I did spend a few moments considering if I should set this env var less globally and narrow down the extension UUID(s) allowed. I did not think it worth the effort in my case. Follow your heart.
143145
[^3]: Restart the Ollama app from the macOS menu bar if you're already running it so that the `launchctl setenv` takes effect. Otherwise, the ollama server will respond with auth errors when the extension attempts to use it.
Lines changed: 252 additions & 0 deletions
@@ -0,0 +1,252 @@
1+
---
2+
date: 2025-01-02
3+
title: A Software Observability Roundup
4+
excerpt:
5+
I spent some time recently catching up on my #to-read saves in Obsidian. More than a few
6+
of these were blog posts from 2024 about software observability. Talk of "redefining observability",
7+
"observability 2.0", and "try Honeycomb" had caught my eye in a few spaces,
8+
and so I had been hoarding links on the topic. After spending a few days immersing myself in those
9+
articles and branching out to others, I decided to write this bullet-form roundup.
10+
---
11+
12+
I spent some time recently catching up on my `#to-read` saves in Obsidian. More than a few of these
13+
were blog posts from 2024 about _software observability_. Talk of "redefining observability",
14+
"observability 2.0", and "try [Honeycomb](https://honeycomb.io)" had caught my eye in a few spaces,
15+
and so I had been hoarding links on the topic.
16+
17+
After spending a few days immersing myself in those articles and branching out to others, I decided
18+
to write this bullet-form roundup:
19+
20+
1. for myself, as a way of solidifying my current understanding
21+
2. in public, as a way to invite corrections and improvements (drop a [comment](#userComments) below
22+
or [@parente.dev on Bluesky](https://bsky.app/profile/parente.dev)!)
23+
3. with my colleagues in mind, as a new way to approach and discuss an ever-green question:
24+
25+
**As our [issue space](https://www.thorn.org/research/state-of-the-issue/) changes and grows, and
26+
[our solutions](https://www.thorn.org/solutions/) adapt and scale in response, what (else) should we
27+
do today so that we can readily address unknown-unknowns tomorrow?**
28+
29+
---
30+
31+
# Overview
32+
33+
The seventeen [references](#references) I surveyed offer perspectives on observability as it
34+
pertains both to software systems and organizations around them. They cover what observability is,
35+
what problems it solves, and how it is and should be implemented. There's alignment from the
36+
authors on the state of affairs, learned best practices, and a direction in which the industry
37+
should head. Shared terminology and goals are works in progress.
38+
39+
# Origins
40+
41+
- According to control theory, _observability_ is a measure of how well internal states of a
42+
system can be inferred from knowledge of its external outputs.[^wikipedia2022]
43+
- The discipline of software engineering (distributed computing, site reliability engineering, et
44+
al.) has not settled on a single definition. One that stays close to the control theory original
45+
is that _software observability_ measures how well a system's state can be understood from the
46+
obtained telemetry.[^wikipedia2022]
47+
- Metrics, logs, and traces caught on as the three kinds of telemetry required to observe a
48+
software system—the so-called "three pillars of observability."
49+
- ... perhaps because they helped build a shared vocabulary at the 2017 Distributed Tracing
50+
Summit.[^bourgon2017]
51+
- ... perhaps because they _do_ provide a comprehensive way for engineers to _monitor_ systems
52+
for _known_ problems and hint at where the issue lies.[^parker2024]
53+
- ... perhaps because solutions for monitoring systems using metrics, logs, and traces are
54+
what vendors had to sell.[^majors2024aug]
55+
56+
# Problems and Limitations
57+
58+
- The task of analyzing disjoint metrics, logs, and trace data falls on humans when using
59+
three-pillar systems designed primarily for monitoring.[^sigelman2021a]
60+
- Moving beyond investigation of known-knowns is difficult without data and tooling designed
61+
to support correlations and experimentation.[^weakly2024oct]
62+
- Use of monitoring tools leads to org reliance on the intuition of a few system experts
63+
resulting in cognitive costs and bus-factor risks. Low visibility slows development and
64+
reduces team confidence.[^majors2024jan]
65+
- Using CloudWatch logs, CloudWatch metrics, and X-Ray traces together, for example, requires
66+
users to infer answers to questions from their mental model of the system, incomplete data,
67+
disparate views, and reading of code.[^tane2024dec]
68+
- The three-pillar data model constrains the types of questions that can be asked and answered,
69+
with an almost exclusive focus on engineering concerns. Even mature observability programs will
70+
struggle to answer questions of greater interest and value _to the business_[^parker2024], such
71+
as:
72+
- What's the relationship between system performance and conversions, by funnel stage, broken
73+
down by geo, device, and intent signals?
74+
- What's our cost of goods sold per request, per customer, with real-time pricing data of
75+
resources?
76+
- How much does each marginal API request to our enterprise data endpoint cost in terms of
77+
availability for lower-tiered customers? Enough to justify automation work?
78+
- There are many sources of truth when disparate formats (metrics, logs, traces) and/or tools are
79+
in play, with decisions made at write-time about how the data will be used in the future.
80+
[^majors2024nov]
81+
- The value of metrics, logs, and (un-sampled) traces does not scale with the costs required to
82+
collect, transfer, and store them.[^sigelman2021a] As the bill goes up, the value stays constant
83+
at best, and more likely _decreases_.[^majors2024jan]
84+
- Logs get noisier and slower to search as volume grows.
85+
- Custom metrics require more forethought and auditing as the set grows over time.
86+
- "At the end, the three pillars of observability do not exist. It's not something we should be
87+
relying on."[^tane2024dec]
88+
89+
- The coexistence of metrics, logging, and tracing is not _observability_. They are
90+
_telemetry_ useful in _monitoring_ systems.[^sigelman2021b]
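
To make the write-time point above concrete, here is a minimal Python sketch. The request records, field names, and the `can_answer` helper are hypothetical and are not taken from any of the cited sources. It only illustrates that a dimension dropped when a metric is aggregated cannot be queried later.

```python
from collections import Counter

# Hypothetical request records; the field names are illustrative only.
requests = [
    {"endpoint": "/search", "status": 200, "customer": "acme"},
    {"endpoint": "/search", "status": 500, "customer": "acme"},
    {"endpoint": "/search", "status": 200, "customer": "globex"},
]

# Write-time decision: only these dimensions are kept in the metric.
METRIC_DIMENSIONS = ("endpoint", "status")
request_count = Counter(
    tuple(r[d] for d in METRIC_DIMENSIONS) for r in requests
)


def can_answer(question_dimensions):
    """A question is answerable only if every dimension it needs was kept."""
    return set(question_dimensions) <= set(METRIC_DIMENSIONS)


# A question anticipated at write time works fine.
print(request_count[("/search", 500)])
print(can_answer(["endpoint", "status"]))

# A new question about customers cannot be answered from the counter alone.
print(can_answer(["customer", "status"]))
```

Answering the customer question would mean adding a dimension and redeploying, and the history collected before that change would still lack it.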
91+
92+
<a name="better-practices"></a>
93+
94+
# Better Practices
95+
96+
- Instrument applications to emit "wide events" (or "canonical logs" or "structured logs") as your
97+
telemetry data.
98+
99+
- Wide events have high-dimensionality (many attributes) and attributes with high-cardinality
100+
(many possible unique values) making them context-rich (everything about the event is
101+
attached to it).[^tane2024sept]
102+
- "High-dimensionality" roughly equates with **hundreds** of attributes at present. Metadata
103+
about hosts, pods, builds, requests, responses, users, customers, timing, errors, teams,
104+
services, versions, third-party vendors, etc. are all fair game.[^morrell2024]
105+
106+
- Have a single source of truth which stores the wide events as they are emitted.
107+
108+
- Do no aggregation at write-time. Make decisions at read-time about how to query and use the
109+
data.[^majors2024nov] [^tane2024sept]
110+
- Wide events from a service continuously handling 1000 requests per second&mdash;about 86 million
111+
events per day&mdash;can compress to about 80 MB in columnar formats like Parquet and cost
112+
pennies to retain for a few months in typical object stores.[^morrell2024]
113+
- Custom metrics are effectively infinite as costs no longer increase linearly (thanks to
114+
columnar data storage) and the ability to cross-correlate increases as more event attributes
115+
are added. Intelligent sampling can control volume costs associated with these structured
116+
events when scale demands it.[^majors2024jan]
117+
- Storing event data in one place lends itself to the application of AI-tools which are good
118+
at correlating and summarizing[^burmistrov2024], perhaps continually in the
119+
background.[^tane2024dec]
120+
121+
- Adopt exploratory tooling that lets you quickly and cheaply query that data about
122+
emergent behaviors, new questions, unknown unknowns.
123+
124+
- Proper tooling allows engineers to investigate any system, regardless of their experience
125+
with it or its complexity, in a methodical and objective manner.[^majors2022]
126+
- The waterfall view of traces, root spans, nested spans, and the like is _not_ sufficient.
127+
Users need the ability to "dig" into data however they deem necessary.[^burmistrov2024]
128+
- You will never ask the same question twice. Something is different since you last asked
129+
it.[^weakly2024mar]
130+
- There is a natural tension between a system’s scalability and its feature set. You can
131+
afford much more powerful observability features at scales orders of magnitude smaller than
132+
Google.[^sigelman2021a]
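
The practices above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not how any cited author or vendor implements wide events. The attribute names (`customer.id`, `build.sha`), the in-memory `sink`, and the `build_wide_event`, `emit`, and `mean_by` helpers are all hypothetical. It emits one context-rich record per request, stores it unaggregated, and decides how to slice the data only at read time.

```python
import json
from collections import defaultdict


def build_wide_event(request, response, duration_ms, context):
    """Flatten everything known about one request into a single record."""
    event = {
        "http.method": request["method"],
        "http.route": request["route"],
        "http.status": response["status"],
        "duration_ms": duration_ms,
    }
    # Context carries the high-cardinality attributes (customer, build, ...).
    event.update(context)
    return event


def emit(event, sink):
    """Append one JSON line per event; nothing is aggregated at write time."""
    sink.append(json.dumps(event, sort_keys=True))


def mean_by(lines, group_by, field):
    """Read-time query: mean of `field` for each value of `group_by`."""
    totals = defaultdict(lambda: [0.0, 0])
    for line in lines:
        event = json.loads(line)
        if group_by in event and field in event:
            totals[event[group_by]][0] += event[field]
            totals[event[group_by]][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


sink = []  # hypothetical stand-in for the single source of truth
search = {"method": "GET", "route": "/search"}
emit(build_wide_event(search, {"status": 200}, 120.0,
                      {"customer.id": "acme", "build.sha": "abc123"}), sink)
emit(build_wide_event(search, {"status": 500}, 900.0,
                      {"customer.id": "acme", "build.sha": "def456"}), sink)
emit(build_wide_event(search, {"status": 200}, 80.0,
                      {"customer.id": "globex", "build.sha": "abc123"}), sink)

# The same stored events answer two unrelated questions.
print(mean_by(sink, "customer.id", "duration_ms"))
print(mean_by(sink, "build.sha", "duration_ms"))
```

Because the grouping key is chosen when the question is asked, adding an attribute to future events widens what can be asked without changing the query helper.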
133+
134+
# Looking Forward
135+
136+
- Confusion abounds about what observability really is[^burmistrov2024] to the point that folks
137+
are actively redefining it[^weakly2024mar] [^parker2024] or versioning it[^majors2024aug]
138+
[^weakly2024dec] to improve clarity.
139+
140+
- "Pretty much everything in business is about asking questions and forming hypotheses, then
141+
testing them." That's observability.[^parker2024]
142+
- The cognitive systems engineering definition of observability&mdash;feedback that provides
143+
insight into a process and refers to the work needed to extract meaning from available
144+
data&mdash;may be a better starting point for software engineering.[^weakly2024mar]
145+
- "Observability is the process through which one develops the ability to ask meaningful questions,
146+
get useful answers, and act effectively on what you learn." It is not a tooling problem but
147+
rather a strategic capability akin to business intelligence.[^weakly2024mar]
148+
- "Observability 2.0 has one source of truth, wide structured log events, from which you can
149+
_derive_ all the other data types." The benefit to the full software development lifecycle,
150+
the cost model, and the adoption by a critical mass of developers make observability 2.0
151+
inevitable.[^majors2024nov]
152+
- "Observability 1.0 gave us lots of useful answers, observability 2.0 gives us the potential
153+
to ask meaningful questions, and observability 3.0 is going to give us the ability to act
154+
effectively on what we learn."[^weakly2024dec]
155+
156+
- There is consensus on the direction in which software observability should head: toward the
157+
[better practices](#better-practices) mentioned earlier. Discussion continues to establish
158+
shared language and goals.
159+
160+
- "Observability 3.0 will be measured by the value that non-engineering functions in the
161+
business are able to get from it."[^weakly2024dec]
162+
- "The success of Observability 2.0 will be measured by how well engineering teams can
163+
understand their decisions and describe what they do in the language of the
164+
business."[^majors2024dec]
165+
166+
<a name="references"></a>
167+
168+
# References
169+
170+
[^wikipedia2022]:
171+
[Observability
172+
(software)](<https://en.wikipedia.org/w/index.php?title=Observability_(software)&oldid=1225628905>).
173+
(2024, May 24). In _Wikipedia_.
174+
175+
[^bourgon2017]:
176+
Bourgon, P. (2017, February 21). [Metrics, tracing, and
177+
logging](https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html). _Peter
178+
Bourgon's Blog_.
179+
180+
[^parker2024]:
181+
Parker, A. (2024, March 29). [Re-Redefining
182+
Observability](https://aparker.io/2024/03/re-redefining-observability/). _Austin Parker's Blog_.
183+
184+
[^majors2024aug]:
185+
Majors, C. (2024, August 7). [Is It Time To Version Observability? (Signs Point To
186+
Yes)](https://charity.wtf/2024/08/07/is-it-time-to-version-observability-signs-point-to-yes/).
187+
_charity.wtf_.
188+
189+
[^sigelman2021a]:
190+
Sigelman, B. (2021, February 4). [Debunking the 'Three Pillars of Observability'
191+
Myth](https://softwareengineeringdaily.com/2021/02/04/debunking-the-three-pillars-of-observability-myth/).
192+
_Software Engineering Daily_.
193+
194+
[^weakly2024oct]:
195+
Weakly, H. (2024, October 3). [The 4 Evolutions of Your Observability
196+
Journey](https://thenewstack.io/the-4-evolutions-of-your-observability-journey/). _The New
197+
Stack_.
198+
199+
[^sigelman2021b]:
200+
Sigelman, B. (2021, February 4). [Observability Won’t Replace Monitoring (Because
201+
It
202+
Shouldn’t)](https://thenewstack.io/observability-wont-replace-monitoring-because-it-shouldnt/).
203+
_The New Stack_.
204+
205+
[^majors2024jan]:
206+
Majors, C. (2024, January 24). [The Cost Crisis in Observability
207+
Tooling](https://www.honeycomb.io/blog/cost-crisis-observability-tooling). _Honeycomb Blog_.
208+
209+
[^tane2024dec]:
210+
Tane, B. & Galbraith, K. (2024, December 6). [Observing Serverless Applications
211+
(SVS212)](https://youtu.be/mPbI3Qxdocc) [Conference presentation]. AWS re:Invent 2024 Las Vegas,
212+
Nevada, United States.
213+
214+
[^majors2024nov]:
215+
Majors, C. (2024, November 19). [There Is Only One Key Difference Between
216+
Observability 1.0 and
217+
2.0](https://www.honeycomb.io/blog/one-key-difference-observability1dot0-2dot0). _Honeycomb
218+
Blog_.
219+
220+
[^tane2024sept]:
221+
Tane, B. (2024, September 8). [Observability Wide Events
222+
101](https://boristane.com/blog/observability-wide-events-101/). _Boris Tane's Blog_.
223+
224+
[^morrell2024]:
225+
Morrell, J. (2024, October 22). [A Practitioner's Guide to Wide
226+
Events](https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/). _Jeremy Morrell's
227+
Blog_.
228+
229+
[^majors2022]:
230+
Majors, C., Fong-Jones, L., & Miranda, G. (2022, May 6). [Observability Engineering:
231+
Achieving production
232+
excellence](https://learning.oreilly.com/library/view/observability-engineering/9781492076438/).
233+
O’Reilly Media, Inc.
234+
235+
[^burmistrov2024]:
236+
Burmistrov, I. (2024, February 15). [All you need is Wide Events, not "Metrics,
237+
Logs and Traces"](https://isburmistrov.substack.com/p/all-you-need-is-wide-events-not-metrics).
238+
_A Song Of Bugs And Patches_.
239+
240+
[^weakly2024mar]:
241+
Weakly, H. (2024, March 15). [Redefining
242+
Observability](https://hazelweakly.me/blog/redefining-observability/). _Hazel Weakly's Blog_.
243+
244+
[^weakly2024dec]:
245+
Weakly, H. (2024, December 9). [The Future of Observability: Observability
246+
3.0](https://hazelweakly.me/blog/the-future-of-observability-observability-3-0/). _Hazel
247+
Weakly's Blog_.
248+
249+
[^majors2024dec]:
250+
Majors, C. (2024, December 20). [On Versioning Observabilities (1.0, 2.0,
251+
3.0…10.0?!?)](https://charity.wtf/2024/12/20/on-versioning-observabilities-1-0-2-0-3-0-10-0/).
252+
_charity.wtf_.

static/css/site.css

Lines changed: 18 additions & 5 deletions
@@ -1,7 +1,11 @@
11
:root {
22
--bs-code-color: #338866;
3-
}
4-
3+
}
4+
5+
a:target {
6+
background-color: #edfff8;
7+
}
8+
59
body {
610
font-family: "Gentium Basic", serif;
711
font-size: 1.25em;
@@ -109,12 +113,12 @@ iframe {
109113
font-size: 0.7em;
110114
}
111115

112-
#mainColumn p {
116+
.mainColumn p {
113117
margin-top: 1em;
114118
margin-bottom: 1em;
115119
}
116120

117-
#mainColumn ul, #mainColumn ol {
121+
.mainColumn ul, .mainColumn ol {
118122
margin: 1em 0px;
119123
}
120124

@@ -133,7 +137,16 @@ iframe {
133137
}
134138

135139
.footnote {
136-
font-size: 0.7em;
140+
font-size: 0.9em;
141+
}
142+
143+
.footnote hr {
144+
display: none;
145+
}
146+
147+
.footnote li > p {
148+
margin-top: 0;
149+
margin-bottom: 0;
137150
}
138151

139152
body .gist {

templates/shell.mako

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@
3131
</header>
3232

3333
<!-- Main Body -->
34-
<article id="mainColumn">
34+
<article id="mainColumn" class="mainColumn">
3535
<%block name="mainColumn" />
3636
</article>
3737
