Add mock LA server for local Azure Monitor Exporter testing#4
Open

cijothomas wants to merge 394 commits into main from
Conversation
…1946) # Change Summary - Avoid unnecessary conversion of bytes to `&str` for `input()` method - Minor edits
…r CI diagnosability (open-telemetry#1937) # Change Summary Switch pattern from `assert!(result.is_ok())` to `result.unwrap()` in exporter tests. This is to improve diagnostics for flaky tests in CI. Currently, failures output the following, which is not actionable: ``` thread 'parquet_exporter::test::test_traces' (2500) panicked at crates\otap\src\parquet_exporter.rs:1299:21: assertion failed: exporter_result.is_ok() note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace ``` With the change above, the error string from the result will be properly logged. ## What issue does this PR close? n/a ## How are these changes tested? n/a ## Are there any user-facing changes? No --------- Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
…en-telemetry#1942) # Change Summary Modified the Idle State Test to run on 1/2/4/8/16/32 cores and confirm if the memory growth (idle state) is predictable. ## What issue does this PR close? Part of the comment https://github.com/open-telemetry/otel-arrow/pull/1528/changes#r2710193083 ## How are these changes tested? Ran locally. ## Are there any user-facing changes? No
…metry#1936) # Change Summary Removes the OTel logging SDK since we have otap-dataflow-internal logging configurable in numerous ways. Updates OTel feature settings to disable the OTel logging SDK from the build. ## What issue does this PR close? Removes `ProviderMode::OpenTelemetry`, the OTel logging SDK and its associated configuration (service::telemetry::logs::processors::*). Fixes open-telemetry#1576. ## Are there any user-facing changes? Yes. **Note: this removes the potential to use the OpenTelemetry tracing support via the opentelemetry tracing appender. However, we view tracing instrumentation as having limited value until otap-dataflow is properly instrumented for tracing. When this happens, we are likely to use an internal tracing pipeline.** --------- Co-authored-by: Utkarsh Umesan Pillai <66651184+utpilla@users.noreply.github.com>
# Change Summary - Update Syslog CEF Receiver README --------- Co-authored-by: Cijo Thomas <cijo.thomas@gmail.com>
# Change Summary This PR restructures the OTLP receiver configuration to support flexible protocol deployment modes, aligning with the Go collector's otlpreceiver model: - gRPC only - Configure only protocols.grpc - HTTP only - Configure only protocols.http (new!) - Both protocols - Configure both with a global concurrency cap ## Key Changes ### Configuration restructure: - Moved from flat config to protocols.grpc / protocols.http structure - TLS configuration is now per-protocol (under each protocol's config) - At least one protocol must be configured (validated at startup) ### Concurrency model for dual-protocol mode: - Each protocol enforces its own max_concurrent_requests limit - When both protocols are enabled, an additional global semaphore caps combined load to prevent exceeding downstream capacity - Permits acquired in consistent order (global -> local) to prevent deadlocks ## What issue does this PR close? * Closes open-telemetry#1893 ## How are these changes tested? Manual tested, along with unit tests. ## Are there any user-facing changes?⚠️ Breaking change: The OTLP receiver configuration format has changed. **_Before_**: ```yaml config: listening_addr: "127.0.0.1:4317" tls: cert_file: "/path/to/cert" http: listening_addr: "127.0.0.1:4318" ``` **_After_**: ```yaml config: protocols: grpc: listening_addr: "127.0.0.1:4317" tls: cert_file: "/path/to/cert" http: listening_addr: "127.0.0.1:4318" tls: cert_file: "/path/to/cert" ``` Refer to `otlp_receiver.md` (updated in this PR) for more details. --------- Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
## Fan-out Processor Implementation Implements all four discussed scenarios: | Scenario | Config | Description | |----------|--------|-------------| | 1 | `mode: parallel, await_ack: primary` | Duplicate to all, wait for primary only | | 2 | `mode: parallel, await_ack: all` | Duplicate to all, wait for all (with per-destination timeout) | | 3 | `mode: sequential` | Send one-by-one, advance after ack | | 4 | `fallback_for: <port>` | Failover to backup on nack/timeout | ### Why Stateful (not Stateless like Go collector) The Go Collector's fanout is stateless because it uses **synchronous, blocking calls**: ```go err := consumer.ConsumeLogs(ctx, ld) // blocks until complete, error returns directly ``` Our OTAP engine uses async message passing with explicit ack/nack routing: ```rust effect_handler.send_message_to(port, pdata).await?; // returns immediately // ack arrives later as separate NodeControlMsg::Ack ``` I explored making scenarios 1 and 3 stateless but hit three blockers: 1. **`subscribe_to()` mutates context** - Fanout must subscribe to receive acks, which pushes a frame onto the context stack. For correct upstream routing, we need the *original* pdata (pre-subscription). We cannot use `ack.accepted` from downstream. 2. **Downstream may mutate/drop payloads** - `into_parts()`, transformers, and filters mean we can't rely on getting intact pdata back in ack/nack messages. 3. **Sequential/fallback/timeout require coordination** - Need to know which destination is active, when to advance to the next, and when to trigger fallbacks or finish. Even if downstream guaranteed returning intact payloads, we'd still need state for `await_all` completion tracking, fallback chains, and sequential advancement. The only gain would be a minor memory optimization (not storing `original_pdata`), not true statelessness. Adopting Go's synchronous model would require fundamental engine architecture changes, not just fanout changes. 
### Memory Optimizations While full statelessness isn't possible, I have implemented fast paths to minimize allocations for common configurations: | Configuration | Fast Path | State Per Request | |-----------------------------------------------------------|------------------|------------------------------------------------| | `await_ack: none` | Fire-and-forget | None (zero inflight tracking) | | `parallel + primary + no fallback + no timeout` | Slim primary | Minimal (`request_id → original_pdata`) | | All other configs | Full | Complete endpoint tracking | #### Fast Path Details - **Fire-and-forget (`await_ack: none`)** Bypasses all inflight state. Clone, send, and ACK upstream immediately. Zero allocations per request. - **Slim primary path** Uses a tiny `HashMap<u64, OtapPdata>` instead of the full `Inflight` struct with `EndpointVec`. Ignores non-primary ACKs and NACKs. - **Full path** Required for: - Sequential mode - `await_all` - Any fallback - Any timeout Tracks all endpoints and request state. ### Code Structure `Inflight` holds per-request state: - `original_pdata` - pre-subscription pdata, used for all upstream acks/nacks - `endpoints[]` - per-destination status (`Acked`/`Nacked`/`InFlight`/`PendingSend`) - `next_send_queue` - drives sequential mode advancement - `completed_origins` - tracks completion for `await_ack: all` - `timeout_at` - per-destination deadlines for timeout/fallback triggering Not all fields are used for every scenario, but the overhead is minimal - empty HashSets don't allocate, SmallVec is inline for ≤4 items, and clone cost is O(1) for `bytes::Bytes`. ### Documentation See [`crates/otap/src/fanout_processor/README.md`](crates/otap/src/fanout_processor/README.md) for configuration examples and behavior details. --------- Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
…1965) # Change Summary Follow-up from 2026-02-05 SIG meeting Requested to add `clippy` and `fmt` for the 4 OS targets already targeted in `test_and_coverage` ## What issue does this PR close? n/a ## How are these changes tested? CI runs ## Are there any user-facing changes? No
# Change Summary Standardized otap-df-otap node URNs to the canonical urn:<namespace>:<id>:<kind> format, added strict parsing/normalization (including OTel shortcut support), updated component constants/configs/templates/docs to match, and documented otelcol config compatibility design and URN rules. ## What issue does this PR close? - Closes open-telemetry#1831 ## How are these changes tested? - cargo test (per local confirmation) - Added unit/config tests for URN normalization and legacy URN rejection in otap_df_config ## Are there any user-facing changes? Yes. Configuration now enforces canonical URN format and accepts the OTel shortcut form; legacy URNs are rejected with a doc-linked error message.
This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [go](https://go.dev/) ([source](https://redirect.github.com/golang/go)) | toolchain | patch | `1.25.6` → `1.25.7` | --- ### Release Notes <details> <summary>golang/go (go)</summary> ### [`v1.25.7`](https://redirect.github.com/golang/go/compare/go1.25.6...go1.25.7) </details> --- ### Configuration 📅 **Schedule**: Branch creation - "before 8am every weekday" (UTC), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR was generated by [Mend Renovate](https://mend.io/renovate/). View the [repository job log](https://developer.mend.io/github/open-telemetry/otel-arrow). Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> Co-authored-by: albertlockett <a.lockett@f5.com>
…open-telemetry#1976) # Change Summary Improve the reliability of `test_durable_buffer_recovery_after_outage` so that it is not subject to minor timing differences across runs that may lead to test failure. Make the test more precise by validating the exact number of signals persisted and received by the exporter. ## What issue does this PR close? * Closes open-telemetry#1975 ## How are these changes tested? * Code inspection, manually running the test to attempt failure repro. ## Are there any user-facing changes? No, this change only affects test code.
# Change Summary This is part of open-telemetry#1926 phase 1 and re-implements the concatenate function. I've fixed a lot of bugs, improved performance over the existing concatenate function quite a bit (mostly by simplifying implementation) and took some steps to prepare for phase 2 which will hopefully improve the interface for this module. Major changes: - Moved this to the transform module, I plan to treat concatenation, splitting, and maybe also re-indexing as additional transformations on OtapArrowRecords similar to removing transport optimized encodings - Added an (in my opinion) better `record_batch!` macro that supports dictionaries - Many bugs fixed for schema selection (especially structs) and nullability - Bugs fixed for boundary conditions for dictionary cardinality selection - Turned a lot of potential panics into schema mismatch errors - Lots of additional unit tests Things I deferred: - Changing the interface to concatenate to be based on OtapBatchStore deferred to phase 2 because we can't plug it into the existing code easily - Moving reindexing into the concatenate operation deferred to phase 2 because we couldn't plug it into the existing code easily - Some performance improvements mentioned in TODOs Benchmark results: | Configuration | New Implementation | Old Implementation | Speedup | |---------------|-------------------|-------------------|---------| | 10 inputs, 5 points | 28.18 us | 101.03 us | **3.58x** | | 10 inputs, 50 points | 29.82 us | 110.47 us | **3.70x** | | 100 inputs, 5 points | 246.37 us | 951.29 us | **3.86x** | | 100 inputs, 50 points | 267.27 us | 1,020.0 us | **3.82x** | | 1000 inputs, 5 points | 4.47 ms | 16.62 ms | **3.72x** | | 1000 inputs, 50 points | 4.98 ms | 17.44 ms | **3.50x** | ## What issue does this PR close? Related to open-telemetry#1926 ## How are these changes tested? Many unit tests ## Are there any user-facing changes? No.
# Change Summary Based on the discussion in today's SIG meeting, add a CI task that applies the label `triage:deciding` to new issues for later triage.
…-telemetry#1963) # Change Summary When processing OTLP metrics, calling `OtlpProtoBytes::num_items()` panics with the message `ToDo`. This happens because metrics_data_view was previously unimplemented, but has since been added without the corresponding counter logic for `num_items()`. This PR implements that logic. Note that the implementation counts data points, since `otap.rs` does the same thing in its definition of `num_items`. https://github.com/open-telemetry/otel-arrow/blob/8c726ba2cb1ff2463db6c67ed0f03b102d322a54/rust/otap-dataflow/crates/pdata/src/otap.rs#L423-L430 ## What issue does this PR close? * open-telemetry#1923 ## How are these changes tested? TODO ## Are there any user-facing changes? No. --------- Co-authored-by: albertlockett <a.lockett@f5.com>
# Change Summary Troubleshooting some transient Auth errors using `azure_monitor_exporter` component. We should expose the error coming from `azure_core` crate. ## What issue does this PR close? n/a ## How are these changes tested? Unit tests ## Are there any user-facing changes? No --------- Co-authored-by: Cijo Thomas <cijo.thomas@gmail.com>
# Change Summary Add WAL replay support for crash recovery in Quiver. On engine startup, `QuiverEngine::open()` now replays any WAL entries that were written but not yet finalized to segments, ensuring recovery of data which had been written to the WAL but not yet finalized in a segment file. The implementation includes a new `MultiFileWalReader` that reads entries across rotated WAL files in global position order, and a `ReplayBundle` type that decodes WAL entries back into `RecordBundle` implementations for replay through the normal ingest path. The replay logic respects the persisted cursor to skip already-finalized entries and handles edge cases like truncated entries (crash mid-write) and corrupted entries (CRC mismatch) by stopping replay at the first invalid entry rather than failing startup. ## What issue does this PR close? * Closes open-telemetry#1951 ## How are these changes tested? - Added unit tests for MultiFileWalReader covering single-file reads, multi-file iteration, mid-stream starts, and WAL position preservation - Added unit tests for ReplayBundle verifying IPC payload decoding, multi-slot reconstruction, timestamp handling, and error cases - Added tests for end-to-end WAL replay scenarios including recovery of unfinalized bundles, cursor-based deduplication, empty/missing WAL handling, segment finalization during replay, multi-file replay after rotation, and graceful recovery from truncated and corrupted WAL entries. ## Are there any user-facing changes? No.
# Change Summary Adding "message" attribute to the otel_debug macro. Part of open-telemetry#1972 ## What issue does this PR close? * Closes NA ## How are these changes tested? Building the package ## Are there any user-facing changes? No
…pen-telemetry#1962) # Change Summary 1. Proper Shutdown Deadline Handling: Both TCP and UDP now capture the deadline from `NodeControlMsg::Shutdown` and return `TerminalState::new(deadline, [snapshot])` instead of `TerminalState::default()` 2. UDP Graceful Flush: On shutdown, flushes any pending records in `arrow_records_builder` using `try_send_message_with_source_node()` before returning. Uses `try_send` (non-blocking) since we're shutting down and can't wait indefinitely 3. TCP Task Shutdown Signaling: - Added `Rc<Cell<bool>>` shutdown flag to signal spawned connection tasks to flush and exit - Tasks check `shutdown_flag.get()` at the top of each loop iteration (cheap bool read, no locks) - When the flag is set, tasks flush pending records via `try_send` and exit cleanly 4. TCP Task Tracking & Graceful Drain: - Added `Rc<Cell<usize>>` to track active spawned tasks - Tasks increment the counter when starting and decrement it at all exit points (shutdown, EOF, read error, TLS handshake failure) - On shutdown, waits for tasks to finish with a timeout: - Uses 90% of the time until the deadline, capped at 1 second (`MAX_TASK_DRAIN_WAIT`) - Busy-spins with `yield_now()`; this should be rare (and is acceptable during shutdown) - Takes the final metrics snapshot only after the drain wait completes # Key Design Decisions - Used `Rc<Cell<T>>` instead of `CancellationToken` - simpler, no external dependency, cheaper (just a pointer deref + bool read) - Used `try_send` during shutdown flush - non-blocking, won't hang if downstream is full - Rare case: if all tasks handling active connections during shutdown are awaiting I/O, we could busy-spin during the drain wait, repeatedly checking whether the active task count is zero. I think this is acceptable behavior during shutdown. ## What issue does this PR close? Related to open-telemetry#1149 ## How are these changes tested? ## Are there any user-facing changes?
… quiver & durable_buffer processor (open-telemetry#1961) # Change Summary This PR implements time-based segment retention (`max_age`) for the quiver storage engine, allowing segments to be automatically deleted after a configurable duration regardless of subscriber consumption status. The feature is *opt-in* (`max_age: None` by default) to avoid unexpected data loss. Segments are timestamped using file modification time when finalized, and expired segments are cleaned up both during startup (without loading them) and during periodic maintenance. The implementation coordinates with the subscriber registry to force-complete expired segments before deletion, ensuring subscribers don't attempt to read from deleted files. Also updates the `durable_buffer` processor to pass its existing `max_age` config option through to quiver, replacing the previous placeholder implementation. ## What issue does this PR close? * Closes open-telemetry#1960 ## How are these changes tested? Comprehensive unit tests cover the new functionality. ## Are there any user-facing changes? After this change, the user-facing `max_age` setting on the `durable_buffer` processor will work as expected. (A `max_age` setting is being added to the Quiver configuration.) --------- Co-authored-by: Lalit Kumar Bhasin <lalit_fin@yahoo.com>
Reverting open-telemetry#1973 Fixing the empty "" emitted by our internal macros, which caused the `message="user friendly message here"` attribute to be omitted from stdout! Taking https://github.com/open-telemetry/otel-arrow/blob/main/rust/otap-dataflow/crates/controller/src/lib.rs#L668-L671 as an example: ```rust otel_warn!( "core_affinity.set_failed", message = "Failed to set core affinity for pipeline thread. Performance may be less predictable." ); ``` Before ```txt 2026-02-06T22:15:09.891Z WARN otap-df-controller::core_affinity.set_failed (crates/controller/src/lib.rs:668): ``` (Missing message!) After (i.e. with this PR) ```txt 2026-02-06T22:11:19.095Z WARN otap-df-controller::core_affinity.set_failed (crates/controller/src/lib.rs:668): Failed to set core affinity for pipeline thread. Performance may be less predictable. ``` (Message is back) "message" is already special-cased in this repo, the OTel Rust repo, and `tracing` itself. Passing the user-friendly string as an attribute named "message" is *[faster](https://github.com/open-telemetry/opentelemetry-rust/pull/2001/changes)* too! Also, we avoid the less friendly syntax - open-telemetry#1981 (comment)
# Change Summary Minor but important improvement to logging: include the endpoint where the OTLP receiver is listening. Also downgraded a registry log from INFO to DEBUG - it's polluting startup with a ton of logs that I don't find useful. It's too much for the INFO level, IMHO.
# Change Summary Removes the line number from the event name so that the name is fixed. ## What issue does this PR close? Part of open-telemetry#1972 * Closes #N/A ## How are these changes tested? Making local calls. ## Are there any user-facing changes? The event name produced for internal telemetry no longer includes the line number.
# Change Summary Saturation tests were initially run continuously while we were figuring out the right inputs. We still haven't finalized them, but I think they're now stable enough to be moved to nightly. These tests take 20+ minutes of a scarce resource (the perf machine!), so moving them to nightly! ## How are these changes tested? N/A ## Are there any user-facing changes? None.
…buffer tests (open-telemetry#1986) # Change Summary Minor test reliability improvement. In the durable_buffer_tests, allow for expected "Channel is closed" errors during shutdown. (We are seeing these errors occasionally during PR checks.) ## What issue does this PR close? n/a ## How are these changes tested? n/a ## Are there any user-facing changes? No. This is minor test reliability improvement.
…1984) # Change Summary During WAL replay, entries older than the configured `max_age` retention are now skipped rather than replayed into new segments. Without this filtering, replaying expired WAL entries would effectively reset their age to zero, causing data to be retained longer than intended by the configured policy. The cutoff is computed once before the replay loop and compared against each entry's ingestion_time (with no assumption about WAL ordering). Skipped entries advance the cursor so they won't be retried, and the expired_bundles counter is incremented so operators have visibility into filtered data. When *all* replayed entries are expired (nothing is replayed), the cursor is explicitly persisted to the sidecar to avoid redundant re-scanning on subsequent restarts. ## What issue does this PR close? * Closes open-telemetry#1980 ## How are these changes tested? Two new tests cover the mixed old/fresh filtering case and the all-expired edge case, the latter including a third engine reopen to verify cursor persistence. ## Are there any user-facing changes? No, this is an optimization to the WAL recovery behavior. No config or user-facing changes.
This PR contains the following updates: | Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) | |---|---|---|---| | [grpcio](https://redirect.github.com/grpc/grpc) | `==1.76.0` → `==1.78.0` |  |  | --- ### Release Notes <details> <summary>grpc/grpc (grpcio)</summary> ### [`v1.78.0`](https://redirect.github.com/grpc/grpc/releases/tag/v1.78.0) [Compare Source](https://redirect.github.com/grpc/grpc/compare/v1.76.0...v1.78.0) This is release 1.78.0 ([gutsy](https://redirect.github.com/grpc/grpc/blob/master/doc/g_stands_for.md)) of gRPC Core. For gRPC documentation, see [grpc.io](https://grpc.io/). For previous releases, see [Releases](https://redirect.github.com/grpc/grpc/releases). This release contains refinements, improvements, and bug fixes, with highlights listed below. ## C++ - adding address\_sorting dep in naming test build. ([#​41045](https://redirect.github.com/grpc/grpc/pull/41045)) ## Objective-C - \[Backport]\[v1.78.x]\[Fix]\[Compiler] Plugins fall back to the edition 2023 for older protobuf. ([#​41358](https://redirect.github.com/grpc/grpc/pull/41358)) ## Python - \[python] aio: fix race condition causing `asyncio.run()` to hang forever during the shutdown process. ([#​40989](https://redirect.github.com/grpc/grpc/pull/40989)) - \[Python] Migrate to pyproject.toml build system from setup.py builds. ([#​40833](https://redirect.github.com/grpc/grpc/pull/40833)) - \[Python] Log error details when ExecuteBatchError occurs (at DEBUG level). ([#​40921](https://redirect.github.com/grpc/grpc/pull/40921)) - \[Python] Update setuptools min version to 77.0.1 . ([#​40931](https://redirect.github.com/grpc/grpc/pull/40931)) ## Ruby - \[ruby] Fix version comparison for the ruby\_abi\_version symbol for ruby 4 compatibility. 
([#​41061](https://redirect.github.com/grpc/grpc/pull/41061)) </details> --- ### Configuration 📅 **Schedule**: Branch creation - "before 8am on Monday" (UTC), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR was generated by [Mend Renovate](https://mend.io/renovate/). View the [repository job log](https://developer.mend.io/github/open-telemetry/otel-arrow). Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
### Description: Updates the rand dependency from 0.9.2 to 0.10.0. The main breaking change affecting this codebase is the trait rename `Rng` -> `RngExt`, as described in the [migration guide](https://rust-random.github.io/book/update-0.10.html): "Users of rand will often need to import `rand::RngExt` and may need to migrate from `R: RngCore` to `R: Rng` (noting that where `R: Rng` was previously used it may be preferable to keep `R: Rng` even though the direct replacement would be `R: RngExt`; the two bounds are equivalent for `R: Sized`)." Note - this supersedes open-telemetry#1997, which is failing due to these breaking changes in the newer version.
### Description: Follow-up from open-telemetry#1653 review comments from @utpilla: - Remove redundant `exports_failed` metric (already tracked per-signal in `pdata_metrics`) - Use `upload_batches_concurrent` return value for log count instead of `batches.len()` - Rename "OTLP fallback" → "OTLP path" (it's the direct path, not a fallback) - Use array instead of `vec!` for fixed-size `TerminalState` metrics
# Change Summary Improve the command-line parsing error message, making it more verbose and clearer for users. ## What issue does this PR close? * Closes open-telemetry#1992 ## How are these changes tested? Tested with a local run. ## Are there any user-facing changes?
…pen-telemetry#2306) # Implement zero-copy view for OTAP Traces Fixes open-telemetry#2053 (Traces portion). This PR introduces a `.traces` sub-module inside `pdata/src/views/otap.rs`, implementing the `OtapTracesView`. ## Changes made - Created `OtapTracesView` in `crates/pdata/src/views/otap/traces.rs`. - Added zero-copy traversal elements mirroring the OTLP traces model: - `OtapResourceSpansView` - `OtapScopeSpansView` - `OtapSpanView` - `OtapEventView`, `OtapLinkView`, `OtapStatusView` - Exposed the `traces` module in `otap.rs`. - Adapted array access patterns to use standard traits like `ByteArrayAccessor` and `StringArrayAccessor`. - Modified `OtapAttributeView` in `logs.rs` to expose its `key` and `value` fields as `pub(crate)` so it can be reused by `traces.rs`. ## Validation results - `cargo test -p otap-df-pdata` passes. - No memory leaks introduced; logic is completely zero-copy across all RecordBatch abstractions for traces. - Unit tests (`test_create_otap_traces_view`, `test_span_fields`, `test_span_status`, `test_missing_optional_columns`, `test_events_iteration`) were run and executed successfully without lifetime or compilation errors. Co-authored-by: albertlockett <a.lockett@f5.com>
…elemetry#2351) # Change Summary As noted in [this workflow run](https://github.com/open-telemetry/otel-arrow/actions/runs/23181837916/job/67356323365?pr=2306) - recent Renovate updates to some python dependencies for perf testing didn't work properly. open-telemetry#2336 updated `pydantic-core` from `2.41.5` to `2.42.0`. However, this is an indirect dependency of `pydantic`, which is a direct dependency in the requirements file: https://github.com/open-telemetry/otel-arrow/blob/59ef72fdbe2003a1425bb5c700d3de0579ffb050/tools/pipeline_perf_test/orchestrator/requirements.txt#L6 `pydantic` `2.12.5` requires EXACTLY `pydantic-core` `2.41.5`, so it was a bad Renovate update. Based on Renovate docs, we should be able to disable indirect dependency updates like this by matching on `matchDepTypes: ["indirect"]` and disabling. In addition, it seems we have some mismatch between python versions used in the repo: `3.11` and `3.14`. This can also lead to bad side effects if lock files are generated with a version different from the one being used at runtime during workflow runs.
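Based on the Renovate docs, the rule described above might look like this in `renovate.json` (a sketch; exact placement depends on the repo's existing configuration, and `packageRules` entries merge with whatever rules are already defined):

```json
{
  "packageRules": [
    {
      "description": "Do not propose updates for indirect (transitive) dependencies",
      "matchDepTypes": ["indirect"],
      "enabled": false
    }
  ]
}
```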
…rtup stalls (open-telemetry#2335) # Change Summary Fixes a startup race where pipeline cores could get stuck in `Pending` on high-core machines. Engine lifecycle events (`Admitted`, `Ready`) shared the same fixed-size bounded observed-state channel as lossy async log events. Under startup burst, `Admitted` could be dropped when `send_timeout(1ms)` expired. When `Ready` arrived later, the state machine rejected the `Pending -> Ready` transition as invalid, leaving the core stuck. This change separates reliability classes: engine lifecycle events now go through a dedicated unbounded channel, while async log events stay on the existing bounded lossy path. The unbounded channel is intentional: engine events are low-volume, correctness-critical, and naturally bounded by pipeline/core lifecycle activity. **Alternate approaches considered:** - Increasing the bounded channel size would only reduce the probability of failure under burst; it would not guarantee delivery of lifecycle events. - Making the state machine accept `Pending` -> `Ready` would mask dropped lifecycle events instead of fixing delivery. ## What issue does this PR close? * Closes open-telemetry#2328 ## How are these changes tested? ```bash $ cargo test -p otap-df-state -- --nocapture $ cargo check -p otap-df-controller -p otap-df-state -p otap-df-telemetry -p otap-df-config ``` ## Are there any user-facing changes? No
… crate (open-telemetry#2339) # Change Summary Next part of open-telemetry#1847 and open-telemetry#2086 ## How are these changes tested? * Unit tests / CLI * Compiled and ran `df_engine` and confirmed all nodes are still available ## Are there any user-facing changes? No
…2346) # Change Summary I previously added payload definitions in open-telemetry#2240 which were mostly focused on solving the dictionary key size problem. This PR builds on that work, but with a more refined implementation that also includes things like required vs optional columns as a pre-req for open-telemetry#2289. The major changes: - Added an otap `Schema` type and redefined the payloads according to that. This is a better construct that is somewhat symmetrical to arrow Schemas, allows for defining recursive Struct and List types, and subsequently removes the requirement to have nested lookup tables. - Added deep equality checks between record batches and `Schema` types - Updated some small bits of the `transform` module for the changes ## What issue does this PR close? * Part of open-telemetry#2289 ## How are these changes tested? Unit ## Are there any user-facing changes? No --------- Co-authored-by: albertlockett <a.lockett@f5.com>
…erator (open-telemetry#2347) Need to see pipeline performance when using large payloads, etc. It's repeating the same string, so it compresses well; addressing that (with truly random values) would be a future addition.
…elemetry#2211) Add a `node.processor` metric set with `process.success.duration` and `process.failed.duration` `Mmsc` instruments for measuring the wall-clock duration of the work done in a process() call. A closure is used to prevent inclusion of async-await points in the measurement. The metric is registered via the node telemetry context. This is intended to be gated by MetricLevel >= Normal. Fixes open-telemetry#2210. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Utkarsh Umesan Pillai <66651184+utpilla@users.noreply.github.com>
# Change Summary
Downgrade from ERR to WARN for export failures.

Closes open-telemetry#2348

## How are these changes tested?

## Are there any user-facing changes?
Yes, lower severity for internal logs.
…es (open-telemetry#2349)

# Change Summary
This is another PR on the path to open-telemetry#2289. During my initial implementation, I realized that a lot of tests were generating invalid otap batches. To make that easier to avoid, I've updated the logs/metrics/traces macros in the pdata crate to automatically fill in anything required for spec compliance based on what's been specified. I had a half-baked version of this before, but now that we have the spec we can do this properly.

Additional changes:
- Added an `Into<OtapArrowRecords>` trait bound to `OtapBatchStore`. `From<Logs/Metrics/Traces>` was already implemented; it's just useful to have in the bound.
- Added a `testing` feature which lets these be consumed across other crates.

## What issue does this PR close?
* Part of open-telemetry#2289

## How are these changes tested?
Unit tests.

## Are there any user-facing changes?
No

---------

Co-authored-by: albertlockett <a.lockett@f5.com>
…ng dependency (open-telemetry#2343)

# Change Summary
Feature-gate the `weaver` crates behind an optional `dev-tools` feature flag to eliminate the transitive `ring` dependency from production builds.

The `weaver` crates are only used by the `fake_data_generator` (traffic generator) receiver, a dev/testing/benchmarking tool. By making them optional, builds with `--no-default-features --features jemalloc,crypto-openssl` no longer pull in `ring` through weaver. `dev-tools` is included in default features for backward compatibility, so existing builds, benchmarks, and configs are unaffected.

## What issue does this PR close?
* Closes open-telemetry#2322
* Closes open-telemetry#2358

## How are these changes tested?
- `cargo check` with default features (includes `dev-tools`): passes, all existing functionality preserved.
- `cargo check --no-default-features --features jemalloc,crypto-openssl`: passes, verifies the binary builds without weaver/fake_data_generator.
- Added a `no_default_features_check` CI job that runs `cargo check --no-default-features` against the workspace, validation crate, and benchmarks on every PR to prevent regressions (closes open-telemetry#2358).

## Are there any user-facing changes?
Yes:
- **New feature flag**: `dev-tools` (enabled by default) controls whether the `fake_data_generator` / `traffic_generator` receiver is included in the build.
- **Default behavior is unchanged**: `cargo build` still includes the traffic generator and all dev tooling.
- **Builds** that want to eliminate the `ring` dependency can now use:
  ```
  cargo build --no-default-features --features <feature1>,<feature2>
  ```
  This excludes weaver and its transitive `ring` dependency entirely.
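The feature-gating pattern looks roughly like the following `Cargo.toml` fragment. The dependency name and version are placeholders, not the exact manifest from this PR:

```toml
# Hypothetical Cargo.toml fragment; crate/version names are placeholders.
[features]
default = ["dev-tools"]      # traffic generator stays in by default
dev-tools = ["dep:weaver"]   # opting out drops weaver (and transitively `ring`)

[dependencies]
weaver = { version = "0.1", optional = true }
```

With this shape, `cargo build --no-default-features` never resolves the optional dependency at all, which is what removes `ring` from the dependency graph.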
…y#2357)

## Summary
Extracts the backend-agnostic metrics view trait definitions from `otap-df-pdata/views/metrics.rs` to `otap-df-pdata-views/views/metrics.rs`, following the same pattern established in PR open-telemetry#2130 for logs and traces. This was called out as an explicit follow-up in open-telemetry#2130:

> Metrics view traits are intentionally excluded for now - the metrics view hierarchy is significantly more complex and can be extracted in a follow-up once the pattern is validated for logs and traces.

## Changes
- **New file** `crates/pdata-views/src/views/metrics.rs`: all trait definitions (`MetricsView`, `ResourceMetricsView`, `ScopeMetricsView`, `MetricView`, `DataView`, `GaugeView`, `SumView`, `NumberDataPointView`, `HistogramView`, `HistogramDataPointView`, `ExponentialHistogramView`, `ExponentialHistogramDataPointView`, `SummaryView`, `SummaryDataPointView`, `ExemplarView`, `BucketsView`, `ValueAtQuantileView`) and enums (`DataType`, `Value`, `AggregationTemporality`, `DataPointFlags`)
- **Modified** `crates/pdata-views/src/views/mod.rs`: added `pub mod metrics`
- **Modified** `crates/pdata/src/views/metrics.rs`: replaced with a `pub use otap_df_pdata_views::views::metrics::*` re-export plus proto-specific `From` impls

## What stays in pdata
Proto-specific `From` impls that depend on `crate::proto::opentelemetry::metrics::v1`:
- `From<&proto::metric::Data> for DataType`
- `From<&proto::exemplar::Value> for Value`
- `From<&proto::number_data_point::Value> for Value`
- `From<proto::AggregationTemporality> for AggregationTemporality`

## Verification
- `cargo tree -p otap-df-pdata-views` confirms zero external dependencies
- `cargo build --all-targets`: all crates compile
- `cargo clippy --package otap-df-pdata-views -- -D warnings`: clean
- `cargo clippy --package otap-df-pdata -- -D warnings`: clean
- `cargo fmt --all --check`: clean
- All existing consumers (`core-nodes`, `contrib-nodes`, `pdata`) work unchanged through re-exports

Co-authored-by: Gyan Ranjan Panda <gyanranjanpanda@users.noreply.github.com>
Co-authored-by: albertlockett <a.lockett@f5.com>
…ry#2359) The new CI machine's legacy Docker builder doesn't support `--build-context`. Switch to `docker buildx build --load` which uses BuildKit natively.
Just a simple CI step to learn more about the perf test machine from CNCF, primarily to find out whether it has NUMA so that the engine's NUMA awareness can be tested and proved on it. If not, we need to request different machine(s).
…lemetry#2365)

# Change Summary
Two tests had a race condition where the flip thread redundantly waited for delivery (`counter > 0`) with a 5-second timeout, while the pipeline shutdown condition already waited for the same thing with a 15-second ceiling. After permanent NACKs (which drop data rather than retry), the pipeline must generate, buffer, and export entirely new data before the counter increments. On slow CI this path can exceed 5 seconds, causing the flip thread's assertion to fire before the pipeline has had a chance to deliver.

The fix removes the redundant delivery wait from the flip threads in `test_durable_buffer_permanent_nack_rejects_without_retry` and `test_durable_buffer_mixed_transient_and_permanent_nacks`, matching the pattern already used in `test_durable_buffer_retries_on_nack`. Delivery is still verified by the post-pipeline `assert!(delivered > 0)` in each test.

* Closes open-telemetry#2354

## How are these changes tested?
Validated that the tests pass when run locally in a loop. Will monitor CI results after merging to confirm stability.

## Are there any user-facing changes?
No.
…y#2360)

# Change Summary
This is the final piece of the nodes refactor from `crates/otap` into `crates/core-nodes` and `crates/contrib-nodes`. Each discrete node implementation has now been moved out of `crates/otap`, leaving only shared helpers / test infrastructure. Created open-telemetry#2362 to track remaining cleanup.

## What issue does this PR close?
* Part of open-telemetry#1847
* Closes open-telemetry#2086

## How are these changes tested?
* Unit tests / CI
* Compiled and ran `df_engine` and confirmed all nodes are still available

## Are there any user-facing changes?
No

---------

Co-authored-by: albertlockett <a.lockett@f5.com>
…n-telemetry#2293)

# Change Summary
This PR adds a design proposal describing the extension system for the **OTel Dataflow Engine**. The document introduces a capability-based extension architecture allowing receivers, processors, and exporters to access non-pdata functionality through well-defined capability interfaces maintained in the engine core.

The proposal covers:
* core concepts such as **capabilities**, **extension providers**, and **extension instances**
* integration of extensions into the **existing configuration model**
* the **user experience** for declaring extensions and binding capabilities
* the **developer experience** for implementing extension providers
* the **runtime architecture** for resolving and instantiating extensions
* the **execution models** supported by extensions (local vs. shared)
* comparison with the **Go Collector extension model**
* a **phased evolution plan** (native extensions → hierarchical placement → WASM extensions)
* implementation recommendations for building **high-performance extensions aligned with the engine's thread-per-core design**

The goal of this document is to provide maintainers with a clear architectural proposal to review before implementing the extension system.

## What issue does this PR close?
* Related to open-telemetry#2267, open-telemetry#2230, open-telemetry#2141, open-telemetry#2113

## How are these changes tested?
This PR introduces **documentation only** and does not modify runtime code.

## Are there any user-facing changes?
Yes. This proposal describes a **future extension system** that will introduce new configuration capabilities such as:
* an `extensions` section in pipeline configurations
* a `capabilities` section in node definitions

These changes are not implemented yet but outline the intended user-facing configuration model for extensions.

---------

Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
# Change Summary
If configured, sets the resource id header for the Log Analytics API.

## What issue does this PR close?

## How are these changes tested?

## Are there any user-facing changes?
New config field `azure_monitor_source_resourceid` under the `api` config section of the Azure Monitor exporter.

---------

Co-authored-by: Drew Relmas <drewrelmas@gmail.com>
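Illustratively, the new field might be set as below. Only the field name `azure_monitor_source_resourceid` comes from this PR; the surrounding exporter keys and the placeholder resource id are assumptions:

```yaml
# Illustrative only: the field name comes from this PR, but the
# surrounding exporter keys are assumptions.
exporters:
  azure_monitor:
    api:
      azure_monitor_source_resourceid: "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/<...>"
```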
open-telemetry#2359 alone didn't fix the issue, so this reverts it and tries a different fix. The new CI machine defaults to the legacy Docker builder, which doesn't support `--build-context` or `FROM --platform`. Add a `DOCKER_BUILDKIT=1` prefix to enable the built-in BuildKit engine (available since Docker 18.09) without requiring the buildx plugin.

---------

Co-authored-by: Laurent Quérel <l.querel@f5.com>
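The resulting invocation pattern, with an example image tag and build context:

```shell
# Enable BuildKit for a legacy `docker build` (Docker 18.09+), no buildx
# plugin required; the tag and context here are examples.
DOCKER_BUILDKIT=1 docker build -t df-engine:local .
```

Setting the variable per-invocation like this avoids changing the daemon configuration on the CI machine.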
…porter

Move mock-la-server from a standalone crate to a contrib-nodes example to comply with project structure requirements (otap-df- naming, README).
Adds a mock Azure Monitor Logs Ingestion API server and a local test config for performance testing without incurring Log Analytics costs. See crates/mock-la-server/src/main.rs for usage and error-simulation flags.
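As a rough illustration of the idea only (not the actual `mock-la-server` implementation, which lives in crates/mock-la-server/src/main.rs), a minimal std-only mock that answers with the Logs Ingestion API's success status could look like:

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

/// Accept one connection and answer like a successful Logs Ingestion
/// call (the real API returns 204 No Content on success).
fn serve_one(listener: TcpListener) {
    let (mut stream, _) = listener.accept().expect("accept failed");
    let mut buf = [0u8; 4096];
    let _ = stream.read(&mut buf).expect("read failed");
    stream
        .write_all(b"HTTP/1.1 204 No Content\r\ncontent-length: 0\r\n\r\n")
        .expect("write failed");
}

fn main() {
    // Bind to an ephemeral local port, like a test harness would.
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind failed");
    let addr = listener.local_addr().unwrap();
    let server = thread::spawn(move || serve_one(listener));

    // Act as the exporter-side client to exercise the mock.
    let mut client = TcpStream::connect(addr).expect("connect failed");
    client
        .write_all(b"POST /streams/example HTTP/1.1\r\nhost: localhost\r\ncontent-length: 2\r\n\r\n{}")
        .unwrap();
    let mut response = String::new();
    client.read_to_string(&mut response).unwrap();
    server.join().unwrap();
    assert!(response.starts_with("HTTP/1.1 204"));
}
```

Pointing the exporter's endpoint at such a local listener is what lets performance runs exercise the full export path without a real workspace.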