Internal logging safety: protocol buffer size limits#2792

Draft
jmacd wants to merge 26 commits into open-telemetry:main from jmacd:jmacd/logs_sdk

Conversation

@jmacd
Contributor

@jmacd commented Apr 30, 2026

Change Summary

Limits the size of protocol buffer generation. Logged attributes now use a fixed-size stack-allocated array (256 bytes).

Replaces try_into for OtapPayloads with try_into_with_options and try_into_with_default to apply default settings. The default size is large to avoid surprises (256MiB, the hard max implied by 4-byte placeholders). In a future PR we can tighten this and add configurable limits that apply by default anywhere OTLP protos are generated in a pipeline.

Implements dropped_attributes accounting for records in the internal logging pipeline.

Implements string truncation for string-valued fields, inserting [...] when possible. Truncated fields are counted in dropped_attributes. This applies a measure of fairness across the fields of a log record: when a string value is inserted, it may use up to 50% of the remaining bytes; beyond that it is truncated.
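The 50%-of-remaining rule can be sketched as follows. This is a minimal illustration, not the PR's actual API: `fair_truncate` is a hypothetical name, and the real encoder works on protobuf fields rather than bare strings.

```rust
/// Suffix appended to truncated string values, per the PR description.
const SUFFIX: &str = "[...]";

/// Hypothetical sketch: a string value may use at most half of the
/// remaining buffer bytes; if it does not fit, clip it at a UTF-8
/// character boundary and append "[...]".
/// Returns the (possibly truncated) value and whether truncation occurred.
fn fair_truncate(value: &str, remaining: usize) -> (String, bool) {
    let budget = remaining / 2; // at most 50% of the remaining bytes
    if value.len() <= budget {
        return (value.to_string(), false);
    }
    if budget <= SUFFIX.len() {
        return (String::new(), true); // no room even for the suffix: drop
    }
    let mut cut = budget - SUFFIX.len();
    while !value.is_char_boundary(cut) {
        cut -= 1; // never split a multi-byte UTF-8 sequence
    }
    (format!("{}{}", &value[..cut], SUFFIX), true)
}
```

A 1000-byte value inserted with 256 bytes remaining is clipped to the 128-byte half-budget, suffix included.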

What issue does this PR close?

Fixes #1746
Part of #2725

How are these changes tested?

A new benchmark was added: a heavy self-tracing workload with 10 string attributes and a body, deliberately overflowing the 256-byte inline budget. The old code re-allocated a Vec to log this event; the new code does not.

| Benchmark | upstream/main | jmacd/logs_sdk | Δ |
| --- | --- | --- | --- |
| new_record | 170.4 µs | 183.2 µs | +7.5% |
| encode_proto | 195.3 µs | 195.2 µs | ~0% |
| encode_proto_with_scope | 206.0 µs | 225.6 µs | +9.5% |
| encode_and_format | 862.1 µs | 559.6 µs | −35.1% |
| format_with_entity | 934.3 µs | 627.5 µs | −32.8% |

There is some cost for protocol writers; this is the price of safety. For console writers this is a win, because we now use a stack-allocated buffer for attribute encoding. Note that raw_error! is now allocation-free.

Are there any user-facing changes?

Yes. We begin to observe dropped log attributes instead of permitting very large log records in the internal logs pipeline.

jmacd and others added 25 commits April 17, 2026 17:26
Two features for issue open-telemetry#1746:

1. ProtoBuffer<const INLINE: usize = 0> is now generic over inline
   storage size. INLINE=0 (default) behaves like Vec<u8>; INLINE>0
   uses SmallVec to keep data on the stack during encoding, avoiding
   heap allocation in the safety-critical encoding phase.

2. DirectFieldVisitor now tracks dropped_count—the number of attribute
   fields dropped due to truncation. LogRecord exposes this as
   dropped_attributes_count. The new LogRecord::new_bounded()
   constructor reserves 3 bytes (tag + 2-byte varint) by reducing the
   buffer limit, guaranteeing space for the dropped_attributes_count
   protobuf field even when the buffer is full.

raw_error! is now fully zero-allocation: it encodes body/attributes
into a stack-allocated ProtoBuffer<256> and formats directly from
the raw bytes via ConsoleWriter::print_raw_log(), bypassing LogRecord
and Bytes entirely.

LogRecord::body_attrs_bytes remains Bytes for cheap reference-counted
cloning through channels and retention. The encoding phase uses
ProtoBuffer<256> on the stack; after encoding completes the result is
converted to Bytes with a correctly-sized allocation.
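The 3-byte reservation described above can be sketched like this. The field number 7 matches OTLP's `LogRecord.dropped_attributes_count`; the helper names are illustrative, and note that a 2-byte varint only covers counts up to 16383 (larger counts would need a wider reserve).

```rust
/// Reserve for the dropped_attributes_count field: 1 tag byte plus a
/// 2-byte varint (enough for counts up to 16383).
const DROPPED_FIELD_RESERVE: usize = 3;

/// Hypothetical sketch: lower the effective buffer limit so the dropped
/// count field can always be appended, even when attributes filled it.
fn bounded_limit(buffer_size: usize) -> usize {
    buffer_size - DROPPED_FIELD_RESERVE
}

/// Encode field 7 with varint wire type 0: tag = (7 << 3) | 0 = 0x38.
fn encode_dropped_count(out: &mut Vec<u8>, mut count: u32) {
    out.push(0x38);
    while count >= 0x80 {
        out.push((count as u8 & 0x7f) | 0x80); // low 7 bits, continuation set
        count >>= 7;
    }
    out.push(count as u8);
}
```

With a 256-byte buffer the attribute section sees a 253-byte limit, and a count like 300 still encodes in exactly the 3 reserved bytes.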

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three features for issue open-telemetry#1746:

1. ProtoBufferInline<const INLINE: usize> is the new generic protobuf
   encoding buffer backed by SmallVec. ProtoBuffer is a type alias
   for ProtoBufferInline<0> (heap-only, matching the old behavior).

2. Replaced truncated:bool with dropped:u32 — counts how many fields
   were dropped due to the size limit. try_encode(f) provides atomic
   field-level writes: either the entire field lands or the buffer is
   unchanged and the dropped counter increments by one.

3. DirectFieldVisitor uses try_encode() for concise, correct attribute
   encoding. No more manual checkpoint/rollback/count boilerplate.
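The try_encode contract, either the whole field lands or the buffer is untouched and the dropped counter increments, can be sketched as follows. Type and method names here are illustrative, not the crate's actual definitions.

```rust
/// Hypothetical sketch of a bounded buffer with transactional field writes.
struct BoundedBuffer {
    bytes: Vec<u8>,
    limit: usize,
    dropped: u32,
}

/// Error returned when a write would exceed the limit.
struct Overflow;

impl BoundedBuffer {
    fn push(&mut self, b: u8) -> Result<(), Overflow> {
        if self.bytes.len() >= self.limit {
            return Err(Overflow);
        }
        self.bytes.push(b);
        Ok(())
    }

    /// Atomic field-level write: on any error, roll back to the
    /// checkpoint and count the field as dropped.
    fn try_encode(
        &mut self,
        f: impl FnOnce(&mut Self) -> Result<(), Overflow>,
    ) -> bool {
        let checkpoint = self.bytes.len();
        match f(self) {
            Ok(()) => true,
            Err(Overflow) => {
                self.bytes.truncate(checkpoint); // discard partial bytes
                self.dropped += 1;
                false
            }
        }
    }
}
```

A field that overflows mid-write leaves no partial bytes behind, which is what keeps later appends well-formed.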

raw_error! is fully zero-allocation via __encode_event_impl! shared
with __log_record_impl!.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BorrowedLogRecord<'a> is a lightweight borrowed view of a log record
that carries &[u8] body/attrs bytes, a SavedCallsite, and a dropped
count. Both raw_error! (zero-alloc stack path) and LogRecord (owned
Bytes path) now format through the same ConsoleWriter::print_log()
and StyledBufWriter::format_log() methods via this shared type.

- LogRecord::as_view() -> BorrowedLogRecord<'_> for zero-copy access
- raw_error! constructs BorrowedLogRecord directly from stack buffer
- Removed print_raw_log() and format_raw_log() (replaced by print_log/format_log)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move the zero-alloc encode+format logic into ConsoleWriter::print_event(),
eliminating __encode_event_impl! and the _private::ProtoBufferInline
re-export. raw_error! now just defines its callsite and calls
print_event(&event) — same pattern as __log_record_impl! but without
the Bytes allocation.

Also removes LogRecord::from_encoded() which is no longer needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
__log_record_impl! now returns StackLogRecord — an encoded log record
that lives on the stack. Callers choose how to consume it:
- .into_record(context) for owned LogRecord with Bytes

This makes raw_error! a one-line change from the original: just
print_log(now, &record.as_view(), ...) instead of print_log_record().

Removes ConsoleWriter::print_event() and LogRecord::from_encoded()
which are no longer needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move LOG_RECORD_DROPPED_ATTRIBUTES_COUNT encoding from LogRecord
construction to encode_log_record(), where timestamp/severity are
encoded. The dropped count is retained as a LogRecord field until
the final OTLP encoding.

Also make new_bounded() generic on INLINE, with new() delegating
as new_bounded::<ENCODE_INLINE>().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace all proto_encode_len_delimited_unknown_size! macro invocations
with ProtoBufferInline::encode_len_delimited() method calls, which
auto-select the narrowest placeholder width based on remaining buffer
space.

Remove the _unknown_size, _small, and _large macro variants. Retain
proto_encode_len_delimited_of_size! only for tests needing explicit
width control.

Add test_encode_len_delimited_auto_width to verify width selection
at boundary values (127 vs 128 byte buffers).
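The width-selection idea can be sketched as follows: an n-byte protobuf varint placeholder can describe payloads up to 2^(7n) − 1 bytes, so when the remaining buffer space cannot hold a larger payload anyway, a narrower placeholder suffices. This is an illustration of the principle only; the exact thresholds and the 127-vs-128 boundary behavior belong to the real implementation.

```rust
/// Hypothetical sketch: pick the smallest placeholder width (in bytes)
/// whose varint range covers the largest payload that could still fit
/// in the remaining buffer space after the placeholder itself.
fn placeholder_width(remaining: usize) -> usize {
    for n in 1..=5usize {
        // Payload after an n-byte placeholder is at most remaining - n.
        let max_payload = remaining.saturating_sub(n);
        let varint_max = (1usize << (7 * n)) - 1; // n-byte varint maximum
        if max_payload <= varint_max {
            return n;
        }
    }
    5 // 5 bytes covers any u32-sized length
}
```

Small buffers get a 1-byte placeholder; only multi-kilobyte remainders pay for wider ones.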

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When encoding multiple attributes from a tracing event into a bounded
inline buffer, a single large value used to be able to consume all
remaining space, dropping every subsequent attribute entirely. Now each
attribute is allocated at most half of the currently-remaining buffer
space, so a long sequence of large values truncates progressively rather
than blowing out the buffer on the first one.

String attributes additionally use a new truncating encoder
(`encode_string_truncating`, `encode_len_delimited_partial`) so that
oversize values are clipped with the existing `[...]` suffix and still
appear (in truncated form) in the OTLP output, while still being counted
in `dropped_attributes_count`.

For a 256-byte inline buffer with 4 attributes whose values are 1000
bytes each, all four values are now preserved in truncated form with
lengths roughly [118, 55, 23, 7] bytes and `dropped_attributes_count` is
4.
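The progressive-halving budget above can be sketched as a simple loop. Per-field overheads (tags, key bytes, the [...] suffix) are simplified away here, which is why the exact lengths differ from the commit's [118, 55, 23, 7] figures; `budgets` is a hypothetical helper.

```rust
/// Hypothetical sketch: each attribute value may claim at most half of
/// the currently-remaining space, so no single large value can exhaust
/// the buffer and drop every subsequent attribute.
fn budgets(mut remaining: usize, value_lens: &[usize]) -> Vec<usize> {
    value_lens
        .iter()
        .map(|&len| {
            let grant = (remaining / 2).min(len); // half-remaining cap
            remaining -= grant;
            grant
        })
        .collect()
}
```

Four 1000-byte values against a 256-byte budget yield successively halved grants rather than one value taking everything, and a small value takes only what it needs.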

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…corruption

Two related fixes for visibility into truncated/dropped attributes when
log encoding overflows the inline ENCODE_INLINE buffer:

1. encoder.rs: wrap encode_body_string and encode_body_debug in
   try_encode. The previous code used encode_len_delimited directly,
   which on overflow leaves an unpatched length placeholder followed
   by partial body content bytes in the inline buffer. That malformed
   tail caused the parser to misread subsequent fields appended by
   encode_log_record (notably field 7 dropped_attributes_count), so
   downstream views always reported 0 drops. With try_encode, partial
   body bytes are rolled back cleanly on overflow, leaving the buffer
   well-formed for later appends.

2. formatter.rs: render dropped_attributes_count as a ' (N dropped)'
   suffix in write_body_and_attrs when it is non-zero, so operators
   can see when fair-budget halving caused truncation/drops.

Adds a tracing_init regression test that drives many oversized
attributes through the full ITS encode path (encode_export_logs_request)
and parses them back via RawLogsData, confirming the dropped count
round-trips through encode/parse.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds focused unit tests in pdata/otlp/common.rs for the transactional
and bounded encoding mechanisms introduced for self-tracing:

- try_encode_rolls_back_buffer_on_error: bytes are byte-identical
  before and after a failing closure.
- encode_len_delimited_does_not_patch_on_inner_err: documents the
  trap (unpatched placeholder + partial bytes) that the body-encoder
  fix in the previous commit had to work around, and shows the safe
  pattern of wrapping in try_encode.
- encode_len_delimited_partial_patches_length_on_err: partial-mode
  produces a valid LEN-prefixed wire field even when the inner
  closure returns Err, so subsequent fields parse correctly.
- encode_string_truncating_three_outcomes: Ok(false) full fit,
  Ok(true) truncated with [...] suffix, Err(Dropped) hard fail
  leaves buffer unchanged.
- encode_string_truncating_round_trips_via_prost: truncated bytes
  decode as a valid AnyValue ending in the suffix.
- truncate_at_utf8_boundary_does_not_panic: encoding 4-byte UTF-8
  characters near the truncation boundary is safe.
- with_max_remaining_{restores_outer_limit,nests_correctly,
  saturates_and_honors_outer}: scoped limit lifecycle, nested
  composition, and saturating math against the outer limit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the per-line stack buffer fills up mid-format, the underlying
Cursor write_all returns Err(WriteZero) and the terminating '\n' is
silently dropped (`let _ = self.write_all(b"\\n")`). The next log
line is appended directly after, producing garbled output like
'…RESOURCE[02026-04-27T15:48:52…│ SCOPE[2026-04-27T15:48:52…'.

Add StyledBufWriter::finish_line which always terminates the line:
if the buffer is already full, back up one byte and overwrite with
'\n' rather than losing the newline. format_line_impl now calls
finish_line for both log lines and RESOURCE/SCOPE header lines.

Includes a unit test covering the overflow, normal, and exactly-full
cases.
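The finish_line idea can be sketched on a tiny fixed buffer. `LineBuf` is illustrative, not the real StyledBufWriter: the point is only that an exactly-full buffer overwrites its last byte with '\n' instead of silently dropping the terminator.

```rust
/// Hypothetical fixed-size line buffer (8 bytes for illustration).
struct LineBuf {
    buf: [u8; 8],
    len: usize,
}

impl LineBuf {
    /// Write as much as fits; excess bytes are dropped (the real code
    /// surfaces this as a WriteZero error from Cursor::write_all).
    fn write(&mut self, data: &[u8]) {
        let n = data.len().min(self.buf.len() - self.len);
        self.buf[self.len..self.len + n].copy_from_slice(&data[..n]);
        self.len += n;
    }

    /// Always terminate the line, even when the buffer is full.
    fn finish_line(&mut self) {
        if self.len < self.buf.len() {
            self.buf[self.len] = b'\n';
            self.len += 1;
        } else {
            // Full buffer: back up one byte and overwrite with '\n'
            // rather than losing the newline entirely.
            self.buf[self.len - 1] = b'\n';
        }
    }
}
```

This is what keeps consecutive truncated lines from running together in the console output.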

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ckProtoBuffer

The self_tracing log encoding path used SmallVec<[u8; 256]> for inline
storage, which incurs a per-call inline-vs-spilled branch on every push
and caused a ~4.5x regression on the encode hot path.

Refactor splits storage into two concrete types sharing encoding logic
via a BoundedBuf trait with default methods:

- ProtoBuffer: Vec-backed, used for unbounded (or large-capacity) encodes.
  New with_capacity_and_limit() pre-allocates to the bound so the bounded
  log path performs at most one terminal alloc.
- StackProtoBuffer<const N: usize>: array-backed, fully stack-resident,
  zero-alloc; preserves the raw_error! zero-alloc guarantee.

All pdata encoders, the self_tracing encoder, and the OTLP exporters
are migrated from <const INLINE: usize> + ProtoBufferInline<INLINE> to
<B: BoundedBuf> generics. ProtoBufferInline is removed.

Bench results vs upstream/main (self_tracing realistic):
  new_record              163 -> 170 us  (+4%)
  encode_proto            177 -> 191 us  (+8%)
  encode_proto_with_scope 194 -> 230 us  (+19%)
  encode_and_format       799 -> 540 us  (-32%)
  format_with_entity      851 -> 548 us  (-36%)

Formatter benches now beat upstream because the bounded encode path
avoids upstream's allocate-then-truncate work at format time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Buf imports

After the BoundedBuf trait refactor, callers holding a concrete
ProtoBuffer needed to import the trait just to call len()/clear()/
is_empty(). Restore parity with the upstream public API by exposing
these as inherent methods that forward to the underlying Vec.

This also lets the self_tracing benchmark file compile unchanged on
both upstream/main and this branch (cross-branch perf comparison).

Removes now-unused 'use crate::otlp::common::BoundedBuf' from
decoder.rs, encode/mod.rs, two common.rs tests, and both OTLP
exporters.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions bot added the labels rust (Pull requests that update Rust code), query-engine (Query Engine / Transform related tasks), and query-engine-columnar (Columnar query engine which uses DataFusion to process OTAP Batches) on Apr 30, 2026
@codecov

codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.27%. Comparing base (524b78d) to head (c99f8e3).

❌ Your project check has failed because the head coverage (82.27%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2792       +/-   ##
===========================================
- Coverage   86.09%   82.27%    -3.82%     
===========================================
  Files         694      181      -513     
  Lines      263421    52878   -210543     
===========================================
- Hits       226789    43505   -183284     
+ Misses      36108     8849    -27259     
  Partials      524      524               
| Components | Coverage Δ |
| --- | --- |
| otap-dataflow | ∅ <ø> (∅) |
| query_abstraction | 80.61% <ø> (ø) |
| query_engine | 90.76% <ø> (ø) |
| otel-arrow-go | 52.45% <ø> (ø) |
| quiver | ∅ <ø> (∅) |

Development

Successfully merging this pull request may close these issues.

Improvements in OTLP encoding logic required for internal logging safety
