Add debug logging for all Oxide API requests#1053
Conversation
0b5b0b3 to
3060f3c
Compare
Currently only the `auth` CLI subcommand will log any request details
when debug logging is enabled. Adding debug logging for all requests
will enable users to troubleshoot issues on any API endpoint.
To do this, we add add pre and post-request hooks via Progenitor.
However, this does not lend itself to tracking the duration of the
request, or creating a single log event with the details from the the
`Request` and `Response`.
We add a `LogCtx` inner type to the client to act as a store while
requests are in flight, but need something to associate the start and
end. We could inject a header into the request, but on failure this will
not be available to us.
Instead, we use the tokio `task::Id` as an identifier, which is unique
within a given runtime. This raises the possibility of a client shared
between runtimes associating the wrong start with a finished request.
We do not have a way to detect which runtime a task is associated with,
`runtime::Id` is an unstable feature, but we will log a warning if we
find a response without a corresponding request, along with the response
details. When a request is run outside of a task, we fall back to
logging separate events for start and end.
We also create a `Span` that includes the log event. This is ignored by
the CLI, but may be useful for other consumers of the SDK that want to
integrate it with their monitoring infrastructure.
An example event:
{
"timestamp": "2025-03-12T15:25:58.487790Z",
"level": "DEBUG",
"message": "request succeeded",
"url": "https://oxide.sys.r3.oxide-preview.com/v1/disks?project=will",
"path": "/v1/disks",
"remote_addr": "45.154.216.35:443",
"http.request.method": "GET",
"http.response.content_length": 998,
"http.response.status_code": 200,
"start_time": "2025-03-12T15:25:58.048984Z",
"duration_ms": 438,
"oxide.request_id": "5d738e7a-ad78-4938-b682-c43f7b2a84be",
"target": "oxide::tracing"
}
Closes #1014
3060f3c to
c612eea
Compare
ahl
left a comment
There was a problem hiding this comment.
I left some comments. I'll be interested to discuss the path forward for this idea with you. I'd be interested to understand the approaches you've considered and how we landed on this one.
| Err(e) => e.into(), | ||
| Ok(Err(e)) => e, | ||
| Ok(Ok(())) => return, | ||
| }; | ||
|
|
||
| if let Some(io_err) = e.downcast_ref::<io::Error>() { | ||
| if io_err.kind() == io::ErrorKind::BrokenPipe { | ||
| return; | ||
| } |
There was a problem hiding this comment.
you prefer this to what you had written previously?
There was a problem hiding this comment.
I've seen the unwrap we had on the JoinHandle panic every now and then when cancelling a command with Ctrl+C. This gives us a graceful exit.
| .stderr(str::contains("DEBUG")); | ||
| .success(); | ||
|
|
||
| let stderr_str = std::str::from_utf8(&cmd.get_output().stderr).unwrap(); |
There was a problem hiding this comment.
would expectorate make this simpler?
There was a problem hiding this comment.
About the same amount of code, since we'd need to scrub all the fields that don't have a stable response value. This method is agnostic on field ordering in the message, which I don't think is something we need to control for a machine-readable format like JSON.
|
|
||
| let cmd = Command::cargo_bin("oxide") | ||
| .unwrap() | ||
| .arg("--debug") |
There was a problem hiding this comment.
we're not using --debug becuase it gets other stuff? Do we want to make --debug equivalent to RUST_LOG=oxide=debug?
There was a problem hiding this comment.
Yeah good point, filtering to just oxide should be the default for --debug. In the rare event we need the hyper logs we can set RUST_LOG=debug to override.
| @@ -54136,7 +54137,7 @@ impl Client { | |||
| /// `baseurl` is the base URL provided to the internal | |||
| /// `reqwest::Client`, and should include a scheme and hostname, | |||
| /// as well as port and a path stem if applicable. | |||
| pub fn new(baseurl: &str) -> Self { | |||
| pub fn new(baseurl: &str, inner: crate::LogCtx) -> Self { | |||
There was a problem hiding this comment.
This seems like a pretty significant burr that we're imposing on all consumers.
There was a problem hiding this comment.
I don't see how this is resolvable with the current approach.
There was a problem hiding this comment.
Agreed, adding an inner type is necessary to retain context with this approach.
| // Try to store the request details for retrieval when the requests finishes. | ||
| // If we're not in a tokio Task this is not possible, instead log separate | ||
| // events for start and finish. | ||
| let Some(task_id) = tokio::task::try_id() else { | ||
| tracing::debug!( | ||
| url = %url, | ||
| path = url.path(), | ||
| http.request.method = %request.method(), | ||
| http.request.body = details.body, | ||
| ); | ||
| return; | ||
| }; |
There was a problem hiding this comment.
this seems limiting for SDK consumers
There was a problem hiding this comment.
It's not ideal, agreed. I think it will be confusing why some requests have more useful logs than others.
wfchandler
left a comment
There was a problem hiding this comment.
I'd be interested to understand the approaches you've considered and how we landed on this one.
Initially I attempted to use reqwest::middleware for logging oxidecomputer/progenitor#1079 as part of Progenitor, but this had the downside of making that a dependency of all consumers of that library and we decided against that approach.
Moving over to using the pre/post hooks, I have a hard requirement that we be able to track request latency and accurately track which request has failed. This means we need some way of relating the start and end of a request from separate hooks.
Initially I considered adding a header to the request, but this was untenable because we will not be able to access the headers if a request fails. This has the obvious downside of not being able to log a complete event on failure, but also we cannot reliably remove the request start details from LogCtx. On a long-lived process this could lead to significant memory leaks.
I looked into adding a task local identifier, but that wasn't feasible as we would need a larger scope to feed that into the hooks.
This led me to task ID, which this PR landed on, but this has the downside of not being in all circumstances. If the task ID is not present, we avoid leaking events when it isn't available by logging separate start and end events, but we lose duration and cannot associate a request body with the failure, just the URL. For requests like OxQL queries, we need to know the body to understand the failure.
One possibility that occurred to me after sending this PR is to add yet another hook to Progenitor, the exec_hook, which will allow users wrap the actual execution of the request. The hook would take the request, client, and an FnMut that executes the actual request. This can replicate all the functionality of the pre/post hooks, while providing more control to users.
Beyond logging, you could implement retries, perform a dry-run, or any arbitrary action.
async fn exec_hook<F>(
request: reqwest::Request,
client: &reqwest::Client,
mut f: F, // Executes the request
) -> Result<reqwest::Response, reqwest::Error>
where
F: FnMut(
reqwest::Request,
&reqwest::Client,
) -> Pin<
Box<dyn Future<Output = Result<reqwest::Response, reqwest::Error>> + Send + 'static>,
>,
{
// Do stuff before exec
f(request, client).await // Maybe add tracing::instrument to the future.
// Do stuff after exec, retry, etc.
}| Err(e) => e.into(), | ||
| Ok(Err(e)) => e, | ||
| Ok(Ok(())) => return, | ||
| }; | ||
|
|
||
| if let Some(io_err) = e.downcast_ref::<io::Error>() { | ||
| if io_err.kind() == io::ErrorKind::BrokenPipe { | ||
| return; | ||
| } |
There was a problem hiding this comment.
I've seen the unwrap we had on the JoinHandle panic every now and then when cancelling a command with Ctrl+C. This gives us a graceful exit.
|
|
||
| let cmd = Command::cargo_bin("oxide") | ||
| .unwrap() | ||
| .arg("--debug") |
There was a problem hiding this comment.
Yeah good point, filtering to just oxide should be the default for --debug. In the rare event we need the hyper logs we can set RUST_LOG=debug to override.
| // Try to store the request details for retrieval when the requests finishes. | ||
| // If we're not in a tokio Task this is not possible, instead log separate | ||
| // events for start and finish. | ||
| let Some(task_id) = tokio::task::try_id() else { | ||
| tracing::debug!( | ||
| url = %url, | ||
| path = url.path(), | ||
| http.request.method = %request.method(), | ||
| http.request.body = details.body, | ||
| ); | ||
| return; | ||
| }; |
There was a problem hiding this comment.
It's not ideal, agreed. I think it will be confusing why some requests have more useful logs than others.
| .stderr(str::contains("DEBUG")); | ||
| .success(); | ||
|
|
||
| let stderr_str = std::str::from_utf8(&cmd.get_output().stderr).unwrap(); |
There was a problem hiding this comment.
About the same amount of code, since we'd need to scrub all the fields that don't have a stable response value. This method is agnostic on field ordering in the message, which I don't think is something we need to control for a machine-readable format like JSON.
The common case for users using `--debug` will be to observe the logs generated by the SDK, not the events from hyper and reqwest. If all logs are needed, we can advise them to use `RUST_LOG=debug` to enable debug logs for all components. Set `--debug` to set the log env filter to `oxide=debug`.
Co-authored-by: Adam Leventhal <ahl@oxide.computer>
I don't disagree with the desire to have this information, but I would observe that these kind of assertions aren't helpful for understanding the rationale or broader utility.
I'm not sure I follow.
This seems pretty tricky for customer consumers of the SDK to understand. Re: |
I thought I might be able to use a task_local! to create an object that would carry over from the pre_hook to the post_hook. However, the value populated into tokio::task_local! {
static LOG: LogDetails;
}
fn pre_hook() {
let details = LogDetails::new()
LOG.scope(details, async move {
// inner value of LOG only accessible here.
});
}
fn post_hook() {
let details = LOG.get(); // panic, inner value of LOG was dropped in pre_hook.
}I don't think the |
|
Closing in favor of #1092 |
Currently only the
authCLI subcommand will log any request details when debug logging is enabled. Adding debug logging for all requests will enable users to troubleshoot issues on any API endpoint.To do this, we add add pre and post-request hooks via Progenitor. However, this does not lend itself to tracking the duration of the request, or creating a single log event with the details from the the
RequestandResponse.We add a
LogCtxinner type to the client to act as a store while requests are in flight, but need something to associate the start and end. We could inject a header into the request, but on failure this will not be available to us.Instead, we use the tokio
task::Idas an identifier, which is unique within a given runtime. This raises the possibility of a client shared between runtimes associating the wrong start with a finished request. We do not have a way to detect which runtime a task is associated with,runtime::Idis an unstable feature, but we will log a warning if we find a response without a corresponding request, along with the response details. Should a request be run outside of a task, we fall back to logging separate events for start and end.We also create a
Spanthat includes the log event. This is ignored by the CLI, but may be useful for other consumers of the SDK that want to integrate it with their monitoring infrastructure.An example event:
Closes #1014