-
Notifications
You must be signed in to change notification settings - Fork 123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: ops[tracing] #1527
feat: ops[tracing] #1527
Conversation
Notes:
|
No idea why RTD build fails... |
a483e07
to
3c060af
Compare
2a83b26
to
d7cfe5d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented with some questions. Let's discuss voice further on Monday.
|
||
def setup_tracing(charm_class_name: str) -> None: | ||
global _exporter | ||
# FIXME would it be better to pass Juju context explicitly? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's check with @tonyandrewmeyer
If I were to pass juju context to this function, I should probably also pass it to _Manager too...
And if so, maybe Scenario's Ops would also have to be changed?
I'm not sure about the trade-offs off the top of my head.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being super slow here; I think I missed the notification for this.
The _Manager
already has the _JujuContext
, and now that Scenario uses _Manager
as well, it also provides it there. We should definitely pass it to this function.
I think in general we should consider any use of os.environ
to get Juju information, other than one singular _JujuContext
creation, a code smell (both in ops/Scenario, and in charms).
ops/pebble.py
Outdated
@@ -2068,23 +2074,27 @@ def _request_raw( | |||
|
|||
return response | |||
|
|||
@tracer.start_as_current_span('ops.pebble.Client.get_system_info') # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here: could we not trace the lower-level Pebble request to do it only in one place, or is that too hard to make meaningful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me think about that.
One issue is being able to provide useful information about the arguments.
For example, what set of services is being start or stopped.
I need to handle more export errors:
|
000c2f9
to
2b6ac38
Compare
Some Data charm tests fail with:
In short, they have pinned older OTEL libraries, and my PR includes newer. I'll talk to the Data team, perhaps their OTEL pins could be lifted. If that doesn't work, we could have:
|
class SetupTracingEvent(LifecycleEvent): | ||
"""FIXME docstring.""" | ||
|
||
def set_destination(self, *, url: Optional[str], ca: Optional[str] = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've somewhat lost track of where things are documented, so this may be written already somewhere, sorry (feel free to just point me at it).
Did you try having something like CharmBase.set_tracing_destination
instead of the event? It still feels weird to me to be setting up the tracing destination on every single hook, rather than once when I get the destination and then each time that relation data changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's perhaps a FIXME we need to discuss.
I'm open to a solution like that with a few caveats:
- tracing destination needs to be stored between dispatches
- URL is tiny
- the CA list from 1KB (single CA)...
- to 290KB (certifi)
- holistic charms will be setting the destination every time
- the CA list, if used (the
ca=
arg) and the URL (theurl=
arg) typically come from two distinct relations:- TLS relation for the CA
- COS relation for the URL
Arguably the current approach already:
- forces the user to read both relations on every dispatch, and persistent destination is not worse, just a little more convoluted?
- if someone uses large CA lists, the cost of reading them from a relation vs local storage seems roughly equal, and persistent destination approach is not worse.
ops/storage.py
Outdated
with tracer.start_as_current_span('ops.storage::subprocess.run') as span: | ||
span.set_attribute('argv', args) | ||
return subprocess.run([cmd, *args[1:]], encoding='utf-8', **kw) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this is too ops-internal to be interesting to charmers -- and we're also telling them not to use Juju for storage, so until that changes, we shouldn't make it any better than it currently is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... what do you reckon, is it better hide this entirely, even though we're calling an "external" binary?
ops/model.py
Outdated
with tracer.start_as_current_span('ops.model::subprocess.run') as span: | ||
span.set_attribute('argv', args) | ||
result = subprocess.run(args, **kwargs) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine, with two exceptions:
action-set
: this definitely contains sensitive data sometimes (like an "add super user" action that returns credentials). Annoyingly, there's no--file
option according to the documentation, so we can't do that like we do with relation and secret data. Maybe there needs to be a "sensitive args" flag or something to skip adding it in this case? Or just hard-code an exception foraction-set
?juju-log
: it feels a bit excessive to include all of the logging inside of the trace in this way. I think we would want to ignore the args in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
juju-log should be excluded via self.drop
in the JujuLogHandler
.
I'll keep this open to confirm that.
ops/model.py
Outdated
@@ -1814,6 +1821,7 @@ def _hook_is_running(self) -> bool: | |||
# unrestricted, allowing test code to read/write databags at will. | |||
return bool(self._backend._hook_is_running) | |||
|
|||
# FIXME instrument this or LazyMapping? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to do this, given that the subprocess call is logged.
ops/_tracing/export.py
Outdated
return context | ||
|
||
def _ssl_context(self, ca: str | None) -> ssl.SSLContext: | ||
# FIXME: What should our protocol range be? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe 1.2+, ideally 1.3+ only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll need to figure out a way to require 1.3 in the client that works across py 3.8 ~ 3.12.
My understanding is that newer Pythons already do that, and that's why specific protocol selection is deprecated (since 3.10?). Perhaps only 3.8 needs a kludge.
Let's say I write some code that should do the right thing... how do I validate that it does?
ops/_tracing/__init__.py
Outdated
# FIXME: API design choice, decide on CA semantics: | ||
# - a local path to a file with CA data | ||
# - or the CA data itself? | ||
# | ||
# Sadly Requests `verify=` kwarg accepts only: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how fixed the current choice is, or whether the change from requests to urllib alters things.
I would think that given that the CA probably comes from a secret (via a relation) it would be much more convenient to be the actual data and not a path (and if you do have a path, it's pretty simple to read it first, whereas you incur a penalty if you have it in memory but not a file).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, ca from relation should be PEM content.
IMHO, the only use case for path is using the system CA list which may be stored as a bunch of files in the directory.
Confirming with COS: https://matrix.to/#/!nHXpRkcSNJHlHdUGbQ:ubuntu.com/$A7zFMyAKTDLiGfyilvHcBpL7hiuHAYEXTuSG6d_cUI8?via=ubuntu.com&via=matrix.org
ops/_main.py
Outdated
@@ -488,6 +496,7 @@ def _make_framework(self, dispatcher: _Dispatcher): | |||
|
|||
def _emit(self): | |||
"""Emit the event on the charm.""" | |||
ops.charm._setup_tracing(self.charm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it confusing that there's ops.charm._setup_tracing
and ops._tracing.setup_tracing
, which do quite separate things. Could one be renamed? Maybe this one, to be more focused on "emit the event", which is all it's doing at the moment, or "request destination" since that's what it's wanting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a bunch of comments, some style things, others questions about where to trace etc.
From here, I think you should work at making this PR easier to review and review-ready, as follows:
- Resolve any other FIXMEs.
- Finalise (and document in PR description) exactly what we're tracing: for example subprocess.run command for hook tool calls, Pebble requests (at which level), and so on.
- Resolve all other decisions to be made.
- Remove code not relevant to the PR like the "dont-merge" files. Move anything not strictly tracing-related to separate PRs - sounds like the logging stuff might fall into that category.
- Write a detailed PR description describing the design and API. It would be particularly helpful to see all the API we're exposing to charmers, in one place. (It's fine to keep the checklists, but move them to the bottom.)
I think it's important to get the design right. After that, once the initial review is done (there's a lot), you can follow up by adding tests and docs.
ops/_tracing/hacks.py
Outdated
# FIXME must this hack be run before OTEL packages are imported? | ||
# must test this | ||
@tracer.start_as_current_span('ops.remove_stale_otel_sdk_packages') | ||
def remove_stale_otel_sdk_packages() -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I reckon let's not do this. It's a bug in Juju, and bug-fixed in Juju 3.5.4 (and 3.6).
ops/_tracing/__init__.py
Outdated
|
||
|
||
try: | ||
from . import export |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this local import have a try/except? Be good to at least comment what's going on here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately the idea here is to detect if we're installed as ops
or ops[tracing]
.
Maybe I can do that in some other way though.
ops/_tracing/__init__.py
Outdated
export.setup_tracing(charm_class_name) | ||
|
||
|
||
def set_tracing_destination( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure about the signature of this function. Wouldn't the url
param always be required? And is ca
always a single file, or could it be a directory or cert and key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ca=
is the CA list, against which the server certificate is to be validated. There's no key.
import functools | ||
import logging | ||
import sqlite3 | ||
from pathlib import Path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per our STYLE.md, we prefer import pathlib
style, with the exception of from typing import A, B, C
.
R = TypeVar('R') | ||
|
||
|
||
def retry(f: Callable[P, R]) -> Callable[P, R]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need retries on the SQL queries, when the db is a simple local file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because mark_observed()
is invoked from the main thread, while .export()/.pump()
is invoked from the helper thread.
|
||
# FIXME: single-file Python package can't be marked as py.typed | ||
import otlp_json # type: ignore | ||
from opentelemetry.instrumentation.urllib import URLLibInstrumentor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, I don't deeply nested packages like opentelemetry.instrumentation.urllib
. In keeping with our style, what about a semi-consistent way to import these with a package name prefix, like:
import opentelemetry.instrumentation.urllib as ot_urllib
from opentelemetry.sdk.resources as ot_resources
#
ot_urllib.URLLibInstrumentor()
...
from opentelemetry.trace import get_tracer_provider, set_tracer_provider | ||
|
||
import ops | ||
import ops._tracing.buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relative .
import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 the split between from .
and from ops.
is 13 - 22.
I supposed I leaned towards the majority use.
In the end both work just fine, happy to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would probably be a good addition to our style guide (covering at least ops and ops-scenario, but potentially also Jubilant and others). And following a decision, a PR that normalises everything to match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a good idea. Let's discuss briefly in today's daily, and then one of us can update the style guide and open a PR.
@@ -871,6 +874,11 @@ def _emit(self, event: EventBase): | |||
saved = False | |||
event_path = event.handle.path | |||
event_kind = event.handle.kind | |||
ops_event = event.__class__.__module__.startswith('ops.') | |||
opentelemetry.trace.get_current_span().add_event( | |||
f'{"ops." if ops_event else ""}{event.__class__.__name__}', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The inline if/else inside a f-string variable confused me a lot. Plus, this looks like it's simply concatenating type strings, so what about ('ops.' if ops_event else '') + event.__class__.__name__
.
Also, why do we only add the module prefix if it's in ops?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not under ops.
, then it's an event defined in a charm or a charm lib.
However, this does look a bit ugly. I was trying to refer to events as ops.StartEvent
(like how we want users to refer to them) and not ops.charm.StartEvent
(where the event is really defined).
Maybe this is too magic and I should report the class name only, without the package?
pyproject.toml
Outdated
@@ -40,6 +43,18 @@ docs = [ | |||
testing = [ | |||
"ops-scenario>=7.0.5,<8", | |||
] | |||
tracing = [ | |||
"otlp-json>=0.9.2", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Do you want any review on this repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please!
It's a single file: https://github.com/dimaqq/otlp-json/blob/main/otlp_json/__init__.py
# FIXME: decision needed | ||
# | ||
# On one hand, this instrumenttion provides additional info about pebble | ||
# (request URL, status code, post and response data sizes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we drop this and just do this ourselves manually in pebble.Client._request
or _request_raw
?
Related: I thought we decided on just tracing in the one place for these (subprocess.run for hook tools and pebble request for Pebble stuff), or am I remembering wrong?
Re: stale OTEL packages hotfix... I wonder about that. It's a nasty bug, can we realistically require all production to upgrade their Juju models, even if it's a micro bump? Without the fix, charm fails at import time and there's no way to recover. Maybe I need to repro this issue to be sure? |
I did a quick review of otlp-json code here: dimaqq/otlp-json#2 (review) |
ops/framework.py
Outdated
# The unit agent called us to process e.g. relation-changed | ||
# There's also a deferred relation-changed event | ||
# The two events are collapsed... which is actually run? | ||
ops._tracing.mark_observed() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tonyandrewmeyer I could use your help validating my logic.
Ultimately, if someone (charm or charm lib) observes the dispatch event, I want to mark all telemetry from this dispatch as "important".
I don't know what happens if we're dispatched for event X and the very same event X (same set of handlers) is present in the deferred queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Events (really notices, event+handler) in the queue run first. If it defers again, then it ends up staying in the queue. If it doesn't defer this time, then it's removed from the queue.
We then process the actual Juju event - if that's exactly the same (same notice, same snapshot) as an item in the queue, then it gets skipped.
So if the deferred notice is deferred again, then it'll only run once. If it doesn't get deferred, then it'll run twice, once for the deferred case, and once for the current one. I looked into removing that second case in the de-duping work, but it was going to be complicated (keeping track of which notices were removed from the queue this hook, or changing the sequencing) and didn't seem to have a lot of value, compared to avoiding really large queues of duplicates.
_reemit()
gets called in two cases. The first is when reemit()
is called: this is done by the _Manager
just before emitting the Juju event -- this is what calls observers for any notices that are in the queue. The second is at the end of Framework._emit()
-- this is what calls observers for the specific Juju event being processes (it avoids doing them all over again by passing in the specific event path to focus on).
So the flows are:
_Manager
callsreemit()
, which calls_remit()
, which calls observer handles for any deferred events. If they are not deferred again, they are dropped from the queue._Manager
callsBoundEvent.emit()
on the Juju event, which callsFramework._emit()
, which saves notices (one per observer) and the snapshot for the current Juju event (skipping if there's a match in the queue already). If at least one notice was saved, then it calls_reemit()
with the event path, which calls observer handles for any matching events. If they aren't deferred, they are dropped from the queue._Manager
then callsBoundEvent.emit()
on the app status and unit status events, which work like (2) above._Manager
then commits the framework, triggeringpre_commit
andcommit
calls, which areBoundEvent.emit()
like (2) above.
For custom events, the library calls BoundEvent.emit()
on the custom event at any time during the above (hopefully only during (1) or (2)), and it does the same as in (2), right away.
Ultimately, if someone (charm or charm lib) observes the dispatch event, I want to mark all telemetry from this dispatch as "important".
You're currently skipping marking as observed for all the ops (LifeCycle
) events. I think it's worth reconsidering that - if a charm is observing one of the collect status events or precommit or commit, then I think the charmer would still want the traces. I think this should at least be higher priority than when no observer is called at all, even if it's lower priority than if a HookEvent
observer is called.
You're also currently skipping custom events -- if it's not deferred, then that makes sense, because whatever Juju event caused the custom event to be emitted will already have marked as observed. However, if the custom event handler is deferred, then you could have a situation where the next hook event runs the handler for that deferred custom event but then there are no other handlers to run because nothing is observing that second hook event. So I think you want to mark as observed for custom events as well.
You're also skipping deferred events. I think it's worth reconsidering this, too. For example, say there's a deferred config-changed+handler and then there's an update-status event with no observer. The config-changed handler will be executed, and I'd think that the charmer would want that trace as much as if it hadn't been deferred. If you change this then that would solve the above, so you could indeed skip custom events.
Note that you will be calling mark_observed()
more than once, even as it is now, because if there are two handlers, then there are two calls to _reemit()
. If you want it to be called only once and only for the actual Juju event, then you should move the call to the if saved:
block at the end of _emit()
instead of this method.
However, I think "prioritise if any observer is run" is a better choice than "prioritise if the Juju event had an observer run", so I'd just move the call down to the top of the "if custom_handler:block, and rely on the
if observed: return` early exit. I doubt the overhead of a dict lookup, function call, and if will be noticeable.
closing in favour of #1612 |
ops[tracing]
Native Open Telemetry tracing support for your charm!
[more text tbd]
Decisions to be made
Files
ops/jujuversion.py
-- lightweight, no need to instrumentops/_tracing/buffer.py
-- newops/_tracing/__init__.py
-- new, shim onlyops/_tracing/hacks.py
-- may have to run before OTEL importops/_tracing/export.py
-- the guts, exemptops/version.py
-- nothing to doneops/_main.py
-- doneops/log.py
-- recursion prevention addedops/charm.py
-- startedops/pebble.py
-- TBD started, mostly relying on urllib instrumentation thoughops/framework.py
-- TBD startedops/__init__.py
-- only shims, no need to instrumentops/_private/harness.py
-- test onlyops/_private/__init__.py
-- emptyops/_private/yaml.py
-- doneops/_private/timeconv.py
-- nothing to doops/model.py
-- done, I think?ops/storage.py
-- not sure if neededops/lib/__init__.py
-- deprecated, will not touchops/jujucontext.py
-- nothing to doops/main.py
-- nothing to doops/testing.py
-- out of scopeFunctionality
ops
with[tracing]
ops
without[tracing]
ops[tracing]
ops[tracing]
ops[tracing]
ops[tracing]
ops[tracing]
ops[tracing]
type: ignore
Testing