Skip to content

consider redesigning resource states to use tracing more idiomatically #246

Open
@hawkw

Description

@hawkw

Currently, state attributes for resources and async ops are represented by tracing events with a particular structure, within the resource's span. This is somewhat complicated, both in the instrumentation and in console-subscriber. These events consist of a field with the attribute's name and a value, and two additional optional fields for numeric values: a field <attribute_name>.op, which can be "add", "sub", or "override", and a field <attribute_name>.unit, which is used to optionally add a unit such as ms. The op field is used to record numeric values as a delta rather than as an absolute value.

Meanwhile, tracing already has a mechanism for attaching field values to spans, and recording new values for those fields, via the Span::record method. I think we should change state attributes for resources/async ops to use this instead.

Motivation

Here are some of the issues and questions I've noticed about the current approach.

  • It's fairly complicated

  • When spans for resources and async ops are created, there's no way to know all of the state attributes that will exist for that particular resource.

    This means that as individual state_update events occur, the attributes are added to a hashmap one-by-one as the console layer becomes aware of those attributes. This isn't very efficient, as we might reallocate the hashmap multiple times. Also, if these maps are moved out of the aggregator task and updated in the ConsoleLayer, as in fix(subscriber): only send *new* tasks/resources/etc over the event channel #238, they would have to be mutated to insert new attribute names, which would require adding a lock, even if the values themselves can all be updated atomically (since they're all integers or booleans).

    If all these fields were declared up front on the span, however, we would know about all the fields that exist when the resource/async op is created, even if their values aren't populated yet. That way, we could create a single hashmap (or array) with all the attributes already present, and insert it into the span's extensions. This would only need a single allocation. If the values of those attributes are updated atomically, we wouldn't need any kind of locking to update or read from them.

  • I don't understand the advantage of recording changes in numeric values as deltas rather than as absolute values (the .op field and "add"/"sub"). All the values will eventually be reported to the console as absolute values in the wire format anyway. If we want to use these to display the "direction" in which a value has changed since the last update, we would have to implement that logic in the UI anyway, because the delta value is lost when sending the message on the wire.

    What do we gain from recording these values as deltas? As far as I can tell, it just makes the implementation much more complex.

  • I'm not sure if recording units as a structured field really buys us anything.

    In fact, taking the units at each state update just complicates the recording logic, because it allows the unit to differ each time the state is updated. For example, if the previous state update event for a field was

    tracing::trace!(target: "runtime::resource::state_update", my_duration = 100, my_duration.op = "add", my_duration.unit = "ms"); 

    and then another one is recorded like this:

    tracing::trace!(target: "runtime::resource::state_update", my_duration = 360, my_duration.op = "add", my_duration.unit = "us"); 

    the console-subscriber recording code would have to implement the unit conversion to record the new value correctly. This seems unnecessarily complex to me.

    As far as I can tell, the units field is just to inform the UI about how to display the value. I think that we should change this so that the units are defined by the field name as whatever the lowest-granularity unit that we'll record the resource's state in. That way, console-subscriber doesn't need to be aware of units at all, and the console UI can detect base units from the name of the field, and use that to display the value however it wants to.

Proposal

Here's the change I think we should make. I think we should remove the runtime::resource::state_update and runtime::async_op::state_update events, and replace them with fields declared up front on the resource/async op's span. We should remove the op fields, and always record absolute values. Finally, rather than recording units as an additional field, we should just add a base unit to the end of the field name when appropriate, and allow the console UI to handle formatting based on units.

As an example of how we'd change this, consider tokio::time::Sleep. Currently, the instrumentation looks like this:

impl Sleep {
    pub(crate) fn new_timeout(
        deadline: Instant,
        location: Option<&'static Location<'static>>,
    ) -> Sleep {
       // ...

        #[cfg(all(tokio_unstable, feature = "tracing"))]
        let inner = {
            // ...
            let resource_span = tracing::trace_span!(
                "runtime.resource",
                concrete_type = "Sleep",
                kind = "timer",
                loc.file = location.file(),
                loc.line = location.line(),
                loc.col = location.column(),
            );

            let async_op_span = resource_span.in_scope(|| {
                tracing::trace!(
                    target: "runtime::resource::state_update",
                    duration = duration,
                    duration.unit = "ms",
                    duration.op = "override",
                );
                // ...
            });
         // ...
    }


    fn reset_inner(self: Pin<&mut Self>, deadline: Instant) {
       // ...

        #[cfg(all(tokio_unstable, feature = "tracing"))]
        {
            let _resource_enter = me.inner.ctx.resource_span.enter();
            // ...
            let duration =  /* ... */;

            tracing::trace!(
                target: "runtime::resource::state_update",
                duration = duration,
                duration.unit = "ms",
                duration.op = "override",
            );
        }
    }

    // ... and so on ...
}

After my proposed change, it would look like this instead, which I think is much simpler:

impl Sleep {
    pub(crate) fn new_timeout(
        deadline: Instant,
        location: Option<&'static Location<'static>>,
    ) -> Sleep {
       // ...

        #[cfg(all(tokio_unstable, feature = "tracing"))]
        let inner = {
            // ...
            let resource_span = tracing::trace_span!(
                "runtime.resource",
                concrete_type = "Sleep",
                kind = "timer",
                loc.file = location.file(),
                loc.line = location.line(),
                loc.col = location.column(),
                duration_ms = duration,
            );

            let async_op_span = resource_span.in_scope(|| {
                // ... we would still create the async op span, and it would now also 
                // declare its fields up front on the span ...
            });
         // ...
    }


    fn reset_inner(self: Pin<&mut Self>, deadline: Instant) {
       // ...

        #[cfg(all(tokio_unstable, feature = "tracing"))]
        {
            let _resource_enter = me.inner.ctx.resource_span.enter();
            // ...
            let duration =  /* ... */;
            // call `Span::record` with the new value.
            me.inner.ctx.resource_span.record("duration_ms", &duration);
        }
    }

    // ... and so on ...
}

Potential Issues

We would probably still want to handle the current state_update events as well, so that resources/async ops from tokio 1.15.x will still work with the console. This might limit some of the optimizations we can do if we use the new system instead, but I think we can at least make it so that most of those costs are only paid when using the older versions.

Metadata

Metadata

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions