Add debug logging for all Oxide API requests by wfchandler · Pull Request #1053 · oxidecomputer/oxide.rs

wfchandler · 2025-03-12T17:32:56Z

Currently only the auth CLI subcommand will log any request details when debug logging is enabled. Adding debug logging for all requests will enable users to troubleshoot issues on any API endpoint.

To do this, we add add pre and post-request hooks via Progenitor. However, this does not lend itself to tracking the duration of the request, or creating a single log event with the details from the the Request and Response.

We add a LogCtx inner type to the client to act as a store while requests are in flight, but need something to associate the start and end. We could inject a header into the request, but on failure this will not be available to us.

Instead, we use the tokio task::Id as an identifier, which is unique within a given runtime. This raises the possibility of a client shared between runtimes associating the wrong start with a finished request. We do not have a way to detect which runtime a task is associated with, runtime::Id is an unstable feature, but we will log a warning if we find a response without a corresponding request, along with the response details. Should a request be run outside of a task, we fall back to logging separate events for start and end.

We also create a Span that includes the log event. This is ignored by the CLI, but may be useful for other consumers of the SDK that want to integrate it with their monitoring infrastructure.

An example event:

  {
    "timestamp": "2025-03-12T15:25:58.487790Z",
    "level": "DEBUG",
    "message": "request succeeded",
    "url": "https://oxide.sys.r3.oxide-preview.com/v1/disks?project=will",
    "path": "/v1/disks",
    "remote_addr": "45.154.216.35:443",
    "http.request.method": "GET",
    "http.response.content_length": 998,
    "http.response.status_code": 200,
    "start_time": "2025-03-12T15:25:58.048984Z",
    "duration_ms": 438,
    "oxide.request_id": "5d738e7a-ad78-4938-b682-c43f7b2a84be",
    "target": "oxide::tracing"
  }

Closes #1014

Currently only the `auth` CLI subcommand will log any request details when debug logging is enabled. Adding debug logging for all requests will enable users to troubleshoot issues on any API endpoint. To do this, we add add pre and post-request hooks via Progenitor. However, this does not lend itself to tracking the duration of the request, or creating a single log event with the details from the the `Request` and `Response`. We add a `LogCtx` inner type to the client to act as a store while requests are in flight, but need something to associate the start and end. We could inject a header into the request, but on failure this will not be available to us. Instead, we use the tokio `task::Id` as an identifier, which is unique within a given runtime. This raises the possibility of a client shared between runtimes associating the wrong start with a finished request. We do not have a way to detect which runtime a task is associated with, `runtime::Id` is an unstable feature, but we will log a warning if we find a response without a corresponding request, along with the response details. When a request is run outside of a task, we fall back to logging separate events for start and end. We also create a `Span` that includes the log event. This is ignored by the CLI, but may be useful for other consumers of the SDK that want to integrate it with their monitoring infrastructure. An example event: { "timestamp": "2025-03-12T15:25:58.487790Z", "level": "DEBUG", "message": "request succeeded", "url": "https://oxide.sys.r3.oxide-preview.com/v1/disks?project=will", "path": "/v1/disks", "remote_addr": "45.154.216.35:443", "http.request.method": "GET", "http.response.content_length": 998, "http.response.status_code": 200, "start_time": "2025-03-12T15:25:58.048984Z", "duration_ms": 438, "oxide.request_id": "5d738e7a-ad78-4938-b682-c43f7b2a84be", "target": "oxide::tracing" } Closes #1014

ahl

I left some comments. I'll be interested to discuss the path forward for this idea with you. I'd be interested to understand the approaches you've considered and how we landed on this one.

ahl · 2025-03-13T15:39:14Z

+        Err(e) => e.into(),
+        Ok(Err(e)) => e,
+        Ok(Ok(())) => return,
+    };
+
+    if let Some(io_err) = e.downcast_ref::<io::Error>() {
+        if io_err.kind() == io::ErrorKind::BrokenPipe {
+            return;
        }


you prefer this to what you had written previously?

I've seen the unwrap we had on the JoinHandle panic every now and then when cancelling a command with Ctrl+C. This gives us a graceful exit.

ahl · 2025-03-13T15:40:43Z

-        .stderr(str::contains("DEBUG"));
+        .success();
+
+    let stderr_str = std::str::from_utf8(&cmd.get_output().stderr).unwrap();


would expectorate make this simpler?

About the same amount of code, since we'd need to scrub all the fields that don't have a stable response value. This method is agnostic on field ordering in the message, which I don't think is something we need to control for a machine-readable format like JSON.

ahl · 2025-03-13T15:42:11Z

+
+    let cmd = Command::cargo_bin("oxide")
        .unwrap()
-        .arg("--debug")


we're not using --debug becuase it gets other stuff? Do we want to make --debug equivalent to RUST_LOG=oxide=debug?

Yeah good point, filtering to just oxide should be the default for --debug. In the rare event we need the hyper logs we can set RUST_LOG=debug to override.

ahl · 2025-03-13T15:44:24Z

@@ -54136,7 +54137,7 @@ impl Client {
    /// `baseurl` is the base URL provided to the internal
    /// `reqwest::Client`, and should include a scheme and hostname,
    /// as well as port and a path stem if applicable.
-    pub fn new(baseurl: &str) -> Self {
+    pub fn new(baseurl: &str, inner: crate::LogCtx) -> Self {


This seems like a pretty significant burr that we're imposing on all consumers.

I don't see how this is resolvable with the current approach.

Agreed, adding an inner type is necessary to retain context with this approach.

ahl · 2025-03-13T15:52:45Z

+    // Try to store the request details for retrieval when the requests finishes.
+    // If we're not in a tokio Task this is not possible, instead log separate
+    // events for start and finish.
+    let Some(task_id) = tokio::task::try_id() else {
+        tracing::debug!(
+            url = %url,
+            path = url.path(),
+            http.request.method = %request.method(),
+            http.request.body = details.body,
+        );
+        return;
+    };


this seems limiting for SDK consumers

It's not ideal, agreed. I think it will be confusing why some requests have more useful logs than others.

wfchandler

I'd be interested to understand the approaches you've considered and how we landed on this one.

Initially I attempted to use reqwest::middleware for logging oxidecomputer/progenitor#1079 as part of Progenitor, but this had the downside of making that a dependency of all consumers of that library and we decided against that approach.

Moving over to using the pre/post hooks, I have a hard requirement that we be able to track request latency and accurately track which request has failed. This means we need some way of relating the start and end of a request from separate hooks.

Initially I considered adding a header to the request, but this was untenable because we will not be able to access the headers if a request fails. This has the obvious downside of not being able to log a complete event on failure, but also we cannot reliably remove the request start details from LogCtx. On a long-lived process this could lead to significant memory leaks.

I looked into adding a task local identifier, but that wasn't feasible as we would need a larger scope to feed that into the hooks.

This led me to task ID, which this PR landed on, but this has the downside of not being in all circumstances. If the task ID is not present, we avoid leaking events when it isn't available by logging separate start and end events, but we lose duration and cannot associate a request body with the failure, just the URL. For requests like OxQL queries, we need to know the body to understand the failure.

One possibility that occurred to me after sending this PR is to add yet another hook to Progenitor, the exec_hook, which will allow users wrap the actual execution of the request. The hook would take the request, client, and an FnMut that executes the actual request. This can replicate all the functionality of the pre/post hooks, while providing more control to users.

Beyond logging, you could implement retries, perform a dry-run, or any arbitrary action.

async fn exec_hook<F>(
    request: reqwest::Request,
    client: &reqwest::Client,
    mut f: F, // Executes the request
) -> Result<reqwest::Response, reqwest::Error>
where
    F: FnMut(
        reqwest::Request,
        &reqwest::Client,
    ) -> Pin<
        Box<dyn Future<Output = Result<reqwest::Response, reqwest::Error>> + Send + 'static>,
    >,
{
    // Do stuff before exec
    f(request, client).await // Maybe add tracing::instrument to the future.
    // Do stuff after exec, retry, etc.
}

wfchandler · 2025-03-17T21:07:33Z

+        Err(e) => e.into(),
+        Ok(Err(e)) => e,
+        Ok(Ok(())) => return,
+    };
+
+    if let Some(io_err) = e.downcast_ref::<io::Error>() {
+        if io_err.kind() == io::ErrorKind::BrokenPipe {
+            return;
        }


I've seen the unwrap we had on the JoinHandle panic every now and then when cancelling a command with Ctrl+C. This gives us a graceful exit.

wfchandler · 2025-03-17T21:08:17Z

+
+    let cmd = Command::cargo_bin("oxide")
        .unwrap()
-        .arg("--debug")


Yeah good point, filtering to just oxide should be the default for --debug. In the rare event we need the hyper logs we can set RUST_LOG=debug to override.

wfchandler · 2025-03-17T21:16:57Z

+    // Try to store the request details for retrieval when the requests finishes.
+    // If we're not in a tokio Task this is not possible, instead log separate
+    // events for start and finish.
+    let Some(task_id) = tokio::task::try_id() else {
+        tracing::debug!(
+            url = %url,
+            path = url.path(),
+            http.request.method = %request.method(),
+            http.request.body = details.body,
+        );
+        return;
+    };


It's not ideal, agreed. I think it will be confusing why some requests have more useful logs than others.

wfchandler · 2025-03-18T13:53:46Z

-        .stderr(str::contains("DEBUG"));
+        .success();
+
+    let stderr_str = std::str::from_utf8(&cmd.get_output().stderr).unwrap();


About the same amount of code, since we'd need to scrub all the fields that don't have a stable response value. This method is agnostic on field ordering in the message, which I don't think is something we need to control for a machine-readable format like JSON.

The common case for users using `--debug` will be to observe the logs generated by the SDK, not the events from hyper and reqwest. If all logs are needed, we can advise them to use `RUST_LOG=debug` to enable debug logs for all components. Set `--debug` to set the log env filter to `oxide=debug`.

Co-authored-by: Adam Leventhal <ahl@oxide.computer>

ahl · 2025-03-18T16:34:27Z

I have a hard requirement that we be able to track request latency and accurately track which request has failed

I don't disagree with the desire to have this information, but I would observe that these kind of assertions aren't helpful for understanding the rationale or broader utility.

I looked into adding a task local identifier, but that wasn't feasible as we would need a larger scope to feed that into the hooks.

I'm not sure I follow.

This led me to task ID, which this PR landed on, but this has the downside of not being in all circumstances.

This seems pretty tricky for customer consumers of the SDK to understand.

Re: exec_hook -- neat idea! Let me think about it?

wfchandler · 2025-03-19T20:52:11Z

I looked into adding a task local identifier, but that wasn't feasible as we would need a larger scope to feed that into the hooks.

I'm not sure I follow.

I thought I might be able to use a task_local! to create an object that would carry over from the pre_hook to the post_hook. However, the value populated into LocalKey needs to be accessed within that scope like the example below, so there's no way to pass it between the hooks without modifying Progenitor.

tokio::task_local! {
    static LOG: LogDetails;
}

fn pre_hook() {
    let details = LogDetails::new()
    LOG.scope(details, async move {
        // inner value of LOG only accessible here.
    });

}

fn post_hook() {
    let details = LOG.get(); // panic, inner value of LOG was dropped in pre_hook.
}

I don't think the task_local approach is worth pursuing further, to be clear.

wfchandler · 2025-05-07T20:04:47Z

Closing in favor of #1092

wfchandler mentioned this pull request Mar 12, 2025

Add detailed request logging #1035

Closed

wfchandler force-pushed the wc/event-logging branch 2 times, most recently from 0b5b0b3 to 3060f3c Compare March 12, 2025 20:13

wfchandler force-pushed the wc/event-logging branch from 3060f3c to c612eea Compare March 13, 2025 14:10

wfchandler marked this pull request as ready for review March 13, 2025 14:26

wfchandler requested a review from ahl March 13, 2025 14:26

ahl requested changes Mar 13, 2025

View reviewed changes

wfchandler commented Mar 18, 2025

View reviewed changes

wfchandler and others added 2 commits March 18, 2025 11:04

Add space between features

6510597

Co-authored-by: Adam Leventhal <ahl@oxide.computer>

wfchandler requested a review from ahl March 18, 2025 15:25

ahl mentioned this pull request Apr 19, 2025

DO NOT MERGE prototype with new proposed hooks #1081

Closed

wfchandler closed this May 7, 2025

Conversation

wfchandler commented Mar 12, 2025

Uh oh!

ahl left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wfchandler left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahl commented Mar 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wfchandler commented Mar 19, 2025

Uh oh!

wfchandler commented May 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ahl commented Mar 18, 2025 •

edited

Loading