Description
Currently, state attributes for resources and async ops are represented by tracing
events with a particular structure, within the resource's span. This is somewhat complicated, both in the instrumentation and in console-subscriber
. These events consist of a field with the attribute's name and a value, and two additional optional fields for numeric values: a field <attribute_name>.op
, which can be "add", "sub", or "override", and a field <attribute_name>.unit
, which is used to optionally add a unit such as ms
. The op
field is used to record numeric values as a delta rather than as an absolute value.
Meanwhile, tracing
already has a mechanism for attaching field values to spans, and recording new values for those fields, via the Span::record
method. I think we should change state attributes for resources/async ops to use this instead.
Motivation
Here are some of the issues and questions I've noticed about the current approach.
-
It's fairly complicated
-
When spans for resources and async ops are created, there's no way to know all of the state attributes that will exist for that particular resource.
This means that as individual
state_update
events occur, the attributes are added to a hashmap one-by-one as the console layer becomes aware of those attributes. This isn't very efficient, as we might reallocate the hashmap multiple times. Also, if these maps are moved out of the aggregator task and updated in theConsoleLayer
, as in fix(subscriber): only send *new* tasks/resources/etc over the event channel #238, they would have to be mutated to insert new attribute names, which would require adding a lock, even if the values themselves can all be updated atomically (since they're all integers or booleans).If all these fields were declared up front on the span, however, we would know about all the fields that exist when the resource/async op is created, even if their values aren't populated yet. That way, we could create a single hashmap (or array) with all the attributes already present, and insert it into the span's extensions. This would only need a single allocation. If the values of those attributes are updated atomically, we wouldn't need any kind of locking to update or read from them.
-
I don't understand the advantage of recording changes in numeric values as deltas rather than as absolute values (the
.op
field and "add"/"sub"). All the values will eventually be reported to theconsole
as absolute values in the wire format anyway. If we want to use these to display the "direction" in which a value has changed since the last update, we would have to implement that logic in the UI anyway, because the delta value is lost when sending the message on the wire.What do we gain from recording these values as deltas? As far as I can tell, it just makes the implementation much more complex.
-
I'm not sure if recording units as a structured field really buys us anything.
In fact, taking the units at each state update just complicates the recording logic, because it allows the unit to differ each time the state is updated. For example, if the previous state update event for a field was
tracing::trace!(target: "runtime::resource::state_update", my_duration = 100, my_duration.op = "add", my_duration.unit = "ms");
and then another one is recorded like this:
tracing::trace!(target: "runtime::resource::state_update", my_duration = 360, my_duration.op = "add", my_duration.unit = "us");
the console-subscriber recording code would have to implement the unit conversion to record the new value correctly. This seems unnecessarily complex to me.
As far as I can tell, the units field is just to inform the UI about how to display the value. I think that we should change this so that the units are defined by the field name as whatever the lowest-granularity unit that we'll record the resource's state in. That way,
console-subscriber
doesn't need to be aware of units at all, and theconsole
UI can detect base units from the name of the field, and use that to display the value however it wants to.
Proposal
Here's the change I think we should make. I think we should remove the runtime::resource::state_update
and runtime::async_op::state_update
events, and replace them with fields declared up front on the resource/async op's span. We should remove the op
fields, and always record absolute values. Finally, rather than recording units as an additional field, we should just add a base unit to the end of the field name when appropriate, and allow the console UI to handle formatting based on units.
As an example of how we'd change this, consider tokio::time::Sleep
. Currently, the instrumentation looks like this:
impl Sleep {
pub(crate) fn new_timeout(
deadline: Instant,
location: Option<&'static Location<'static>>,
) -> Sleep {
// ...
#[cfg(all(tokio_unstable, feature = "tracing"))]
let inner = {
// ...
let resource_span = tracing::trace_span!(
"runtime.resource",
concrete_type = "Sleep",
kind = "timer",
loc.file = location.file(),
loc.line = location.line(),
loc.col = location.column(),
);
let async_op_span = resource_span.in_scope(|| {
tracing::trace!(
target: "runtime::resource::state_update",
duration = duration,
duration.unit = "ms",
duration.op = "override",
);
// ...
});
// ...
}
fn reset_inner(self: Pin<&mut Self>, deadline: Instant) {
// ...
#[cfg(all(tokio_unstable, feature = "tracing"))]
{
let _resource_enter = me.inner.ctx.resource_span.enter();
// ...
let duration = /* ... */;
tracing::trace!(
target: "runtime::resource::state_update",
duration = duration,
duration.unit = "ms",
duration.op = "override",
);
}
}
// ... and so on ...
}
After my proposed change, it would look like this instead, which I think is much simpler:
impl Sleep {
pub(crate) fn new_timeout(
deadline: Instant,
location: Option<&'static Location<'static>>,
) -> Sleep {
// ...
#[cfg(all(tokio_unstable, feature = "tracing"))]
let inner = {
// ...
let resource_span = tracing::trace_span!(
"runtime.resource",
concrete_type = "Sleep",
kind = "timer",
loc.file = location.file(),
loc.line = location.line(),
loc.col = location.column(),
duration_ms = duration,
);
let async_op_span = resource_span.in_scope(|| {
// ... we would still create the async op span, and it would now also
// declare its fields up front on the span ...
});
// ...
}
fn reset_inner(self: Pin<&mut Self>, deadline: Instant) {
// ...
#[cfg(all(tokio_unstable, feature = "tracing"))]
{
let _resource_enter = me.inner.ctx.resource_span.enter();
// ...
let duration = /* ... */;
// call `Span::record` with the new value.
me.inner.ctx.resource_span.record("duration_ms", &duration);
}
}
// ... and so on ...
}
Potential Issues
We would probably still want to handle the current state_update
events as well, so that resources/async ops from tokio
1.15.x will still work with the console. This might limit some of the optimizations we can do if we use the new system instead, but I think we can at least make it so that most of those costs are only paid when using the older versions.