feat: ops[tracing] #1527

dimaqq · 2025-01-15T08:49:10Z

`ops[tracing]`

Native Open Telemetry tracing support for your charm!

[more text tbd]

Decisions to be made

drop yaml instrumentation?
drop all pebble instrumentation?
- attaching attributes (e.g. services to start/stop) would be weird, as they would be potentially attached to user's span
drop the outer level of pebble instrumentation, but keep the inner?
use "magic" urllib instrumentation?
protbuf or JSON? --> JSON
pure urllib or requests?
raw data file of sqlite? --> sqlite

Files

Functionality

Testing

run unit tests with tracing
run unit tests without tracing
run unit tests without tracing with old api
maybe setup CI for the 3 above?

dimaqq · 2025-01-15T08:51:04Z

Notes:

adds opentelemetry-api (64kB wheel) to dependencies
adds a bunch of opentelemetry-this-and-that (total size tbd) to ops[tracing] dependency group

dimaqq · 2025-01-17T08:51:56Z

No idea why RTD build fails...

HACKING.md

dimaqq · 2025-01-27T00:59:33Z

squashed and rebased: #1539 got merged; #1538 a no-go.

benhoyt

Commented with some questions. Let's discuss voice further on Monday.

ops/_private/yaml.py

benhoyt · 2025-01-31T02:14:46Z

ops/_tracing/export.py

+
+def setup_tracing(charm_class_name: str) -> None:
+    global _exporter
+    # FIXME would it be better to pass Juju context explicitly?


Yes, please. :-)

Let's check with @tonyandrewmeyer

If I were to pass juju context to this function, I should probably also pass it to _Manager too...

And if so, maybe Scenario's Ops would also have to be changed?

I'm not sure about the trade-offs off the top of my head.

Sorry for being super slow here; I think I missed the notification for this.

The _Manager already has the _JujuContext, and now that Scenario uses _Manager as well, it also provides it there. We should definitely pass it to this function.

I think in general we should consider any use of os.environ to get Juju information, other than one singular _JujuContext creation, a code smell (both in ops/Scenario, and in charms).

ops/_tracing/buffer.py

ops/model.py

benhoyt · 2025-01-31T02:27:01Z

ops/pebble.py

@@ -2068,23 +2074,27 @@ def _request_raw(

        return response

+    @tracer.start_as_current_span('ops.pebble.Client.get_system_info')  # type: ignore


Similar here: could we not trace the lower-level Pebble request to do it only in one place, or is that too hard to make meaningful?

Let me think about that.
One issue is being able to provide useful information about the arguments.
For example, what set of services is being start or stopped.

dimaqq · 2025-01-31T03:06:26Z

I need to handle more export errors:

ERROR:opentelemetry.sdk.trace.export:Exception while exporting Span batch.
Traceback (most recent call last):
  File "/code/operator/.ahh-venv/lib/python3.13/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
        (self._dns_host, self.port),
    ...<2 lines>...
        socket_options=self.socket_options,
    )
  File "/code/operator/.ahh-venv/lib/python3.13/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/code/operator/.ahh-venv/lib/python3.13/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
    ~~~~~~~~~~~~^^^^
ConnectionRefusedError: [Errno 111] Connection refused

dimaqq · 2025-02-22T16:07:03Z

Some Data charm tests fail with:

Because opentelemetry-sdk (1.21.0) depends on opentelemetry-api (1.21.0)
 and no versions of opentelemetry-sdk match >1.21.0,<1.22.0, opentelemetry-sdk (>=1.21.0,<1.22.0) requires opentelemetry-api (1.21.0).
And because ops (2.19.0.dev0) @ file:///home/runner/work/operator/operator/myops depends on opentelemetry-api (>=1.30,<2.0)
 and opentelemetry-exporter-otlp-proto-http (1.21.0) depends on opentelemetry-sdk (>=1.21.0,<1.22.0), ops (2.19.0.dev0) @ file:///home/runner/work/operator/operator/myops is incompatible with opentelemetry-exporter-otlp-proto-http (1.21.0).
So, because non-package-mode depends on both ops (2.19.0.dev0) @ file:///home/runner/work/operator/operator/myops and opentelemetry-exporter-otlp-proto-http (1.21.0), version solving failed.

In short, they have pinned older OTEL libraries, and my PR includes newer.

I'll talk to the Data team, perhaps their OTEL pins could be lifted.

If that doesn't work, we could have:

ops depend on a very permissive range, like opentelemetry-api ~= 1.1
ops[tracing] depend on new versions, opentelemetry-api ~= 1.30 because that's opt-in

tonyandrewmeyer · 2025-02-25T09:06:35Z

ops/charm.py

+class SetupTracingEvent(LifecycleEvent):
+    """FIXME docstring."""
+
+    def set_destination(self, *, url: Optional[str], ca: Optional[str] = None) -> None:


I've somewhat lost track of where things are documented, so this may be written already somewhere, sorry (feel free to just point me at it).

Did you try having something like CharmBase.set_tracing_destination instead of the event? It still feels weird to me to be setting up the tracing destination on every single hook, rather than once when I get the destination and then each time that relation data changes.

That's perhaps a FIXME we need to discuss.

I'm open to a solution like that with a few caveats:

tracing destination needs to be stored between dispatches

URL is tiny

the CA list from 1KB (single CA)...

to 290KB (certifi)

holistic charms will be setting the destination every time

the CA list, if used (the ca= arg) and the URL (the url= arg) typically come from two distinct relations:

TLS relation for the CA

COS relation for the URL

Arguably the current approach already:

forces the user to read both relations on every dispatch, and persistent destination is not worse, just a little more convoluted?

if someone uses large CA lists, the cost of reading them from a relation vs local storage seems roughly equal, and persistent destination approach is not worse.

tonyandrewmeyer · 2025-02-25T09:21:02Z

ops/storage.py

+    with tracer.start_as_current_span('ops.storage::subprocess.run') as span:
+        span.set_attribute('argv', args)
+        return subprocess.run([cmd, *args[1:]], encoding='utf-8', **kw)


I feel like this is too ops-internal to be interesting to charmers -- and we're also telling them not to use Juju for storage, so until that changes, we shouldn't make it any better than it currently is.

So... what do you reckon, is it better hide this entirely, even though we're calling an "external" binary?

ops/pebble.py

tonyandrewmeyer · 2025-02-25T09:26:13Z

ops/model.py

+            with tracer.start_as_current_span('ops.model::subprocess.run') as span:
+                span.set_attribute('argv', args)
+                result = subprocess.run(args, **kwargs)  # type: ignore


I think this is fine, with two exceptions:

action-set: this definitely contains sensitive data sometimes (like an "add super user" action that returns credentials). Annoyingly, there's no --file option according to the documentation, so we can't do that like we do with relation and secret data. Maybe there needs to be a "sensitive args" flag or something to skip adding it in this case? Or just hard-code an exception for action-set?

juju-log: it feels a bit excessive to include all of the logging inside of the trace in this way. I think we would want to ignore the args in this case.

juju-log should be excluded via self.drop in the JujuLogHandler.

I'll keep this open to confirm that.

tonyandrewmeyer · 2025-02-25T09:26:56Z

ops/model.py

@@ -1814,6 +1821,7 @@ def _hook_is_running(self) -> bool:
        # unrestricted, allowing test code to read/write databags at will.
        return bool(self._backend._hook_is_running)

+    # FIXME instrument this or LazyMapping?


I don't think we need to do this, given that the subprocess call is logged.

tonyandrewmeyer · 2025-02-25T09:38:01Z

ops/_tracing/export.py

+        return context
+
+    def _ssl_context(self, ca: str | None) -> ssl.SSLContext:
+        # FIXME: What should our protocol range be?


Maybe 1.2+, ideally 1.3+ only.

https://matrix.to/#/!nHXpRkcSNJHlHdUGbQ:ubuntu.com/$ymSYuWFl_KWT83WayQiF4yFQ_yVlhrLbpa_7Ns6z3zI?via=ubuntu.com&via=matrix.org for COS

I'll need to figure out a way to require 1.3 in the client that works across py 3.8 ~ 3.12.

My understanding is that newer Pythons already do that, and that's why specific protocol selection is deprecated (since 3.10?). Perhaps only 3.8 needs a kludge.

Let's say I write some code that should do the right thing... how do I validate that it does?

ops/_tracing/__init__.py

tonyandrewmeyer · 2025-02-25T09:40:37Z

ops/_tracing/__init__.py

+    # FIXME: API design choice, decide on CA semantics:
+    # - a local path to a file with CA data
+    # - or the CA data itself?
+    #
+    # Sadly Requests `verify=` kwarg accepts only:


Not sure how fixed the current choice is, or whether the change from requests to urllib alters things.

I would think that given that the CA probably comes from a secret (via a relation) it would be much more convenient to be the actual data and not a path (and if you do have a path, it's pretty simple to read it first, whereas you incur a penalty if you have it in memory but not a file).

Indeed, ca from relation should be PEM content.

IMHO, the only use case for path is using the system CA list which may be stored as a bunch of files in the directory.

Confirming with COS: https://matrix.to/#/!nHXpRkcSNJHlHdUGbQ:ubuntu.com/$A7zFMyAKTDLiGfyilvHcBpL7hiuHAYEXTuSG6d_cUI8?via=ubuntu.com&via=matrix.org

tonyandrewmeyer · 2025-02-25T09:41:42Z

ops/_main.py

@@ -488,6 +496,7 @@ def _make_framework(self, dispatcher: _Dispatcher):

    def _emit(self):
        """Emit the event on the charm."""
+        ops.charm._setup_tracing(self.charm)


I find it confusing that there's ops.charm._setup_tracing and ops._tracing.setup_tracing, which do quite separate things. Could one be renamed? Maybe this one, to be more focused on "emit the event", which is all it's doing at the moment, or "request destination" since that's what it's wanting?

ops/_main.py

benhoyt

I've added a bunch of comments, some style things, others questions about where to trace etc.

From here, I think you should work at making this PR easier to review and review-ready, as follows:

Resolve any other FIXMEs.
Finalise (and document in PR description) exactly what we're tracing: for example subprocess.run command for hook tool calls, Pebble requests (at which level), and so on.
Resolve all other decisions to be made.
Remove code not relevant to the PR like the "dont-merge" files. Move anything not strictly tracing-related to separate PRs - sounds like the logging stuff might fall into that category.
Write a detailed PR description describing the design and API. It would be particularly helpful to see all the API we're exposing to charmers, in one place. (It's fine to keep the checklists, but move them to the bottom.)

I think it's important to get the design right. After that, once the initial review is done (there's a lot), you can follow up by adding tests and docs.

benhoyt · 2025-02-27T02:10:03Z

ops/_tracing/hacks.py

+# FIXME must this hack be run before OTEL packages are imported?
+# must test this
+@tracer.start_as_current_span('ops.remove_stale_otel_sdk_packages')
+def remove_stale_otel_sdk_packages() -> None:


Yeah, I reckon let's not do this. It's a bug in Juju, and bug-fixed in Juju 3.5.4 (and 3.6).

benhoyt · 2025-02-27T02:48:59Z

ops/_tracing/__init__.py

+
+
+try:
+    from . import export


Why does this local import have a try/except? Be good to at least comment what's going on here.

Ultimately the idea here is to detect if we're installed as ops or ops[tracing].

Maybe I can do that in some other way though.

benhoyt · 2025-02-27T03:04:52Z

ops/_tracing/__init__.py

+    export.setup_tracing(charm_class_name)
+
+
+def set_tracing_destination(


I'm unsure about the signature of this function. Wouldn't the url param always be required? And is ca always a single file, or could it be a directory or cert and key?

ca= is the CA list, against which the server certificate is to be validated. There's no key.

benhoyt · 2025-02-27T03:06:35Z

ops/_tracing/buffer.py

+import functools
+import logging
+import sqlite3
+from pathlib import Path


Per our STYLE.md, we prefer import pathlib style, with the exception of from typing import A, B, C.

benhoyt · 2025-02-27T03:07:57Z

ops/_tracing/buffer.py

+R = TypeVar('R')
+
+
+def retry(f: Callable[P, R]) -> Callable[P, R]:


Why do we need retries on the SQL queries, when the db is a simple local file?

Yes, because mark_observed() is invoked from the main thread, while .export()/.pump() is invoked from the helper thread.

benhoyt · 2025-02-27T03:12:25Z

ops/_tracing/export.py

+
+# FIXME: single-file Python package can't be marked as py.typed
+import otlp_json  # type: ignore
+from opentelemetry.instrumentation.urllib import URLLibInstrumentor


Ugh, I don't deeply nested packages like opentelemetry.instrumentation.urllib. In keeping with our style, what about a semi-consistent way to import these with a package name prefix, like:

import opentelemetry.instrumentation.urllib as ot_urllib from opentelemetry.sdk.resources as ot_resources # ot_urllib.URLLibInstrumentor() ...

benhoyt · 2025-02-27T03:14:11Z

ops/_tracing/export.py

+from opentelemetry.trace import get_tracer_provider, set_tracer_provider
+
+import ops
+import ops._tracing.buffer


Relative . import?

🤔 the split between from . and from ops. is 13 - 22.
I supposed I leaned towards the majority use.

In the end both work just fine, happy to change.

This would probably be a good addition to our style guide (covering at least ops and ops-scenario, but potentially also Jubilant and others). And following a decision, a PR that normalises everything to match.

Yes, that's a good idea. Let's discuss briefly in today's daily, and then one of us can update the style guide and open a PR.

benhoyt · 2025-02-27T03:17:51Z

ops/framework.py

@@ -871,6 +874,11 @@ def _emit(self, event: EventBase):
        saved = False
        event_path = event.handle.path
        event_kind = event.handle.kind
+        ops_event = event.__class__.__module__.startswith('ops.')
+        opentelemetry.trace.get_current_span().add_event(
+            f'{"ops." if ops_event else ""}{event.__class__.__name__}',


The inline if/else inside a f-string variable confused me a lot. Plus, this looks like it's simply concatenating type strings, so what about ('ops.' if ops_event else '') + event.__class__.__name__.

Also, why do we only add the module prefix if it's in ops?

If it's not under ops., then it's an event defined in a charm or a charm lib.

However, this does look a bit ugly. I was trying to refer to events as ops.StartEvent (like how we want users to refer to them) and not ops.charm.StartEvent (where the event is really defined).

Maybe this is too magic and I should report the class name only, without the package?

benhoyt · 2025-02-27T03:25:05Z

pyproject.toml

@@ -40,6 +43,18 @@ docs = [
 testing = [
    "ops-scenario>=7.0.5,<8",
 ]
+tracing = [
+    "otlp-json>=0.9.2",


Nice! Do you want any review on this repo?

Yes please!

It's a single file: https://github.com/dimaqq/otlp-json/blob/main/otlp_json/__init__.py

benhoyt · 2025-02-27T03:27:21Z

pyproject.toml

+    # FIXME: decision needed
+    #
+    # On one hand, this instrumenttion provides additional info about pebble
+    # (request URL, status code, post and response data sizes)


Can't we drop this and just do this ourselves manually in pebble.Client._request or _request_raw?

Related: I thought we decided on just tracing in the one place for these (subprocess.run for hook tools and pebble request for Pebble stuff), or am I remembering wrong?

dimaqq · 2025-02-27T12:39:46Z

Re: stale OTEL packages hotfix...

I wonder about that. It's a nasty bug, can we realistically require all production to upgrade their Juju models, even if it's a micro bump?

Without the fix, charm fails at import time and there's no way to recover.

Maybe I need to repro this issue to be sure?

benhoyt · 2025-02-28T00:21:50Z

I did a quick review of otlp-json code here: dimaqq/otlp-json#2 (review)

dimaqq · 2025-02-28T07:48:02Z

ops/framework.py

+                    # The unit agent called us to process e.g. relation-changed
+                    # There's also a deferred relation-changed event
+                    # The two events are collapsed... which is actually run?
+                    ops._tracing.mark_observed()


@tonyandrewmeyer I could use your help validating my logic.

Ultimately, if someone (charm or charm lib) observes the dispatch event, I want to mark all telemetry from this dispatch as "important".

I don't know what happens if we're dispatched for event X and the very same event X (same set of handlers) is present in the deferred queue.

Events (really notices, event+handler) in the queue run first. If it defers again, then it ends up staying in the queue. If it doesn't defer this time, then it's removed from the queue.

We then process the actual Juju event - if that's exactly the same (same notice, same snapshot) as an item in the queue, then it gets skipped.

So if the deferred notice is deferred again, then it'll only run once. If it doesn't get deferred, then it'll run twice, once for the deferred case, and once for the current one. I looked into removing that second case in the de-duping work, but it was going to be complicated (keeping track of which notices were removed from the queue this hook, or changing the sequencing) and didn't seem to have a lot of value, compared to avoiding really large queues of duplicates.

_reemit() gets called in two cases. The first is when reemit() is called: this is done by the _Manager just before emitting the Juju event -- this is what calls observers for any notices that are in the queue. The second is at the end of Framework._emit() -- this is what calls observers for the specific Juju event being processes (it avoids doing them all over again by passing in the specific event path to focus on).

So the flows are:

_Manager calls reemit(), which calls _remit(), which calls observer handles for any deferred events. If they are not deferred again, they are dropped from the queue.

_Manager calls BoundEvent.emit() on the Juju event, which calls Framework._emit(), which saves notices (one per observer) and the snapshot for the current Juju event (skipping if there's a match in the queue already). If at least one notice was saved, then it calls _reemit() with the event path, which calls observer handles for any matching events. If they aren't deferred, they are dropped from the queue.

_Manager then calls BoundEvent.emit() on the app status and unit status events, which work like (2) above.

_Manager then commits the framework, triggering pre_commit and commit calls, which are BoundEvent.emit() like (2) above.

For custom events, the library calls BoundEvent.emit() on the custom event at any time during the above (hopefully only during (1) or (2)), and it does the same as in (2), right away.

Ultimately, if someone (charm or charm lib) observes the dispatch event, I want to mark all telemetry from this dispatch as "important".

You're currently skipping marking as observed for all the ops (LifeCycle) events. I think it's worth reconsidering that - if a charm is observing one of the collect status events or precommit or commit, then I think the charmer would still want the traces. I think this should at least be higher priority than when no observer is called at all, even if it's lower priority than if a HookEvent observer is called.

You're also currently skipping custom events -- if it's not deferred, then that makes sense, because whatever Juju event caused the custom event to be emitted will already have marked as observed. However, if the custom event handler is deferred, then you could have a situation where the next hook event runs the handler for that deferred custom event but then there are no other handlers to run because nothing is observing that second hook event. So I think you want to mark as observed for custom events as well.

You're also skipping deferred events. I think it's worth reconsidering this, too. For example, say there's a deferred config-changed+handler and then there's an update-status event with no observer. The config-changed handler will be executed, and I'd think that the charmer would want that trace as much as if it hadn't been deferred. If you change this then that would solve the above, so you could indeed skip custom events.

Note that you will be calling mark_observed() more than once, even as it is now, because if there are two handlers, then there are two calls to _reemit(). If you want it to be called only once and only for the actual Juju event, then you should move the call to the if saved: block at the end of _emit() instead of this method.

However, I think "prioritise if any observer is run" is a better choice than "prioritise if the Juju event had an observer run", so I'd just move the call down to the top of the "if custom_handler:block, and rely on theif observed: return` early exit. I doubt the overhead of a dict lookup, function call, and if will be noticeable.

Note that not marking observed doesn't mean throwing away the traces.

I'll leave a better comment in the new PR.

dimaqq · 2025-03-10T08:13:59Z

closing in favour of #1612

dimaqq commented Jan 17, 2025

View reviewed changes

HACKING.md Outdated Show resolved Hide resolved

dimaqq force-pushed the feat-otel branch 3 times, most recently from a483e07 to 3c060af Compare January 27, 2025 00:58

dimaqq changed the title ~~feat: otel [very draftey draft]~~ feat: ops[tracing] [very draftey draft] Jan 28, 2025

dimaqq changed the title ~~feat: ops[tracing] [very draftey draft]~~ feat: ops[tracing] [draftey draft] Jan 28, 2025

dimaqq force-pushed the feat-otel branch 2 times, most recently from 2a83b26 to d7cfe5d Compare January 30, 2025 06:34

benhoyt reviewed Jan 31, 2025

View reviewed changes

dimaqq force-pushed the feat-otel branch 4 times, most recently from 000c2f9 to 2b6ac38 Compare February 10, 2025 08:23

dimaqq force-pushed the feat-otel branch from 24006d0 to 89ca86e Compare February 22, 2025 02:03

dimaqq changed the title ~~feat: ops[tracing] [draftey draft]~~ feat: ops[tracing] Feb 22, 2025

tonyandrewmeyer reviewed Feb 25, 2025

View reviewed changes

dimaqq force-pushed the feat-otel branch from 049a5fc to 6d6ff4e Compare February 26, 2025 08:49

benhoyt reviewed Feb 27, 2025

View reviewed changes

dimaqq force-pushed the feat-otel branch from b8fb825 to 15d91c7 Compare February 28, 2025 05:53

dimaqq commented Feb 28, 2025

View reviewed changes

dimaqq force-pushed the feat-otel branch from a462ab7 to 1e9a8f3 Compare March 5, 2025 05:40

feat: ops[tracing]

33884f9

dimaqq force-pushed the feat-otel branch from 1e9a8f3 to 33884f9 Compare March 5, 2025 05:44

dimaqq closed this Mar 10, 2025

dimaqq mentioned this pull request Mar 10, 2025

feat: ops[tracing] (with a first-party charm lib) #1612

Merged

		@@ -2068,23 +2074,27 @@ def _request_raw(

		return response

		@tracer.start_as_current_span('ops.pebble.Client.get_system_info') # type: ignore

		export.setup_tracing(charm_class_name)


		def set_tracing_destination(

		R = TypeVar('R')


		def retry(f: Callable[P, R]) -> Callable[P, R]:

feat: ops[tracing] #1527

feat: ops[tracing] #1527

Uh oh!

Conversation

dimaqq commented Jan 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

ops[tracing]

Decisions to be made

Files

Functionality

Testing

Uh oh!

dimaqq commented Jan 15, 2025

Uh oh!

dimaqq commented Jan 17, 2025

Uh oh!

Uh oh!

dimaqq commented Jan 27, 2025

Uh oh!

benhoyt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dimaqq commented Jan 31, 2025

Uh oh!

dimaqq commented Feb 22, 2025

Uh oh!

tonyandrewmeyer Feb 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dimaqq Feb 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

benhoyt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

dimaqq commented Jan 15, 2025 •

edited

Loading

`ops[tracing]`

tonyandrewmeyer Feb 25, 2025 •

edited

Loading

dimaqq Feb 28, 2025 •

edited

Loading

dimaqq Feb 27, 2025 •

edited

Loading