Initial implementation of OTLP HTTP Exporter #2070
albertlockett merged 26 commits into open-telemetry:main
Conversation
Codecov Report ❌

Additional details and impacted files:

@@ Coverage Diff @@
## main #2070 +/- ##
==========================================
+ Coverage 86.97% 87.04% +0.06%
==========================================
Files 539 542 +3
Lines 173124 174232 +1108
==========================================
+ Hits 150571 151653 +1082
- Misses 22019 22045 +26
Partials 534 534
Keeping this as Draft for now, since it seems to be breaking the pipeline perf tests.
lquerel left a comment:

I've done a first pass on this draft. No big issue, just a few suggestions here and there.
    Ok(service_resp) => service_resp.partial_success.map(|partial_success| {
        format!(
            "{} ({} rejected)",
            partial_success.error_message, partial_success.rejected
        )
    }),
This might be the right approach, but I'm not sure.
I'm wondering if a partial success should always result in a Nack (as in this impl). Do you know what the Go collector does in this case?
As per the specs -
If the request is only partially accepted (i.e. when the server accepts only parts of the data and rejects the rest), the server response MUST be the same ExportServiceResponse message as in the Full Success case.
Sending a Nack here will make the upstream retry the full batch.
Oops, yes you're right - we shouldn't send the Nack here because we don't want to retry. The spec also says:
https://opentelemetry.io/docs/specs/otlp/#partial-success-1
The client MUST NOT retry the request when it receives a partial success response where the partial_success
Looks like there are also semantics for when to retry:
The server SHOULD use HTTP response status codes to indicate retryable and not-retryable errors for a particular erroneous situation. The client SHOULD honour HTTP response status codes as retryable or not-retryable.
Retryable Response Codes

The requests that receive a response status code listed in following table SHOULD be retried. All other 4xx or 5xx response status codes MUST NOT be retried.

HTTP response status code:
- 429 Too Many Requests
- 502 Bad Gateway
- 503 Service Unavailable
- 504 Gateway Timeout
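In code, the spec's table above amounts to a small classifier like this (a sketch, not the PR's actual code):

```rust
/// Classify an HTTP response status per the OTLP spec: only 429, 502, 503
/// and 504 SHOULD be retried; all other 4xx/5xx MUST NOT be retried.
/// (Sketch; the function name is hypothetical.)
fn is_retryable_status(status: u16) -> bool {
    matches!(status, 429 | 502 | 503 | 504)
}
```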
I'll fix this
@lalitb & @lquerel I've updated the way we handle errors/produce Nacks in 3bb9a33
The behaviour is: when we receive a 200 response code w/ a partial success, we emit a Nack, unless rejected signals == 0. This Nack will have permanent = true, meaning the payload should not be retried.
When partial success rejects 0 signals, we treat this as a success. According to the spec, the server may send a partial_success even if it fully accepted the request:
servers MAY also use the partial_success field to convey warnings/suggestions to clients even when it fully accepts the request. In such cases, the rejected_ field MUST have a value of 0, and the error_message field MUST be non-empty.
In the case where we receive a 200 response, but couldn't decode the body, we also send a Nack with permanent=true. The assumption here is that either there was a success, or a partial success, but either way we're not supposed to retry the request.
When we receive a non-200 response, we now follow the suggestions in the spec vis-à-vis retries: we send a Nack, and IF the response status wasn't 429, 502, 503 or 504, we set permanent=true meaning don't retry it.
For non-http errors, the spec only gives some limited clarity about what to do:
https://opentelemetry.io/docs/specs/otlp/#all-other-responses
All other HTTP responses that are not explicitly listed in this document should be treated according to HTTP specifications.
If the server disconnects without returning a response, the client SHOULD retry and send the same request. The client SHOULD implement an exponential backoff strategy between retries to avoid overwhelming the server.
The behaviour I have for these is to retry (e.g. send Nack w/ permanent=false) if the error was a connection error or a TCP timeout. My thinking was that it's reasonable to retry in these cases which may be intermittent networking errors.
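Taken together, the behaviour described above could be sketched like this (types and names are hypothetical, not the exporter's actual API):

```rust
/// Hypothetical summary of one export attempt (sketch only).
enum ExportOutcome {
    /// HTTP 200 with a decoded body; `rejected` taken from partial_success
    /// (0 when absent, or when the server only sent a warning).
    Success { rejected: u64 },
    /// HTTP 200 but the response body could not be decoded.
    UndecodableBody,
    /// Non-200 HTTP status.
    HttpError { status: u16 },
    /// Connection error or TCP timeout before any response arrived.
    TransportError,
}

/// Returns (send_nack, permanent) following the behaviour described above.
fn nack_decision(outcome: &ExportOutcome) -> (bool, bool) {
    match outcome {
        // Full success, or partial_success carrying only a warning: Ack.
        ExportOutcome::Success { rejected: 0 } => (false, false),
        // Partial success with rejected signals: Nack, but never retry.
        ExportOutcome::Success { .. } => (true, true),
        // 200 with an undecodable body: it was a success or a partial
        // success either way, so don't retry.
        ExportOutcome::UndecodableBody => (true, true),
        // Non-200: retryable only for 429/502/503/504.
        ExportOutcome::HttpError { status } => {
            (true, !matches!(*status, 429 | 502 | 503 | 504))
        }
        // Possibly intermittent networking error: retry.
        ExportOutcome::TransportError => (true, false),
    }
}
```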
        self.pdata_metrics.add_failed(signal_type, 1);
        continue;
    } else {
        Bytes::copy_from_slice(proto_buffer.as_ref())
The same applies here, I think.
I don't think we can do something similar to what is suggested here.
Internally this proto_buffer has a Vec, not Bytes.
otel-arrow/rust/otap-dataflow/crates/pdata/src/otlp/common.rs
Lines 330 to 333 in 3d35559
We also clear and reuse this instance for each OTAP batch that we encode to proto.
    &effect_handler,
    &mut self.pdata_metrics,
)
.await;
The drain loop ignores the deadline already in scope. With http.timeout = None (default), a slow/hung server will block shutdown indefinitely past the deadline. Worth racing next_completion() against sleep_until(deadline) here?
I created #2099 to track this. The OTLP gRPC exporter has the same behaviour.
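The race suggested above could look roughly like this (a sketch: std's blocking `recv_timeout` stands in for racing the async `next_completion()` against `sleep_until(deadline)`; all names are hypothetical):

```rust
use std::sync::mpsc;
use std::time::{Duration, Instant};

/// Drain request completions but give up at `deadline`, instead of blocking
/// on a slow/hung server indefinitely.
fn drain_until<T>(rx: &mpsc::Receiver<T>, deadline: Instant) -> usize {
    let mut drained = 0;
    loop {
        // How long until the deadline? None means it already passed.
        let Some(remaining) = deadline.checked_duration_since(Instant::now()) else {
            break;
        };
        match rx.recv_timeout(remaining) {
            Ok(_completion) => drained += 1,
            // Timed out at the deadline, or all senders are gone.
            Err(_) => break,
        }
    }
    drained
}
```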
Nice work. High level looks good! Left a few inline comments. I believe what's worth addressing before merge would be the partial success handling, for OTel spec compliance.
d4f451b to 03166d9
Co-authored-by: Utkarsh Umesan Pillai <66651184+utpilla@users.noreply.github.com>
    // ensure exit success
    result.unwrap();

    // validate we received three Nacks
This comment can be removed/updated for these tests as well:
- test_handles_connection_refused_errors
- test_handles_response_body_too_large
- test_handles_invalid_otap_payloads
- test_nacks_for_otap_payloads_when_context_indicates_no_payload_return
- test_nacks_for_otlp_payloads_when_context_indicates_no_payload_return
# Change Summary

Small followup from #2070. Adds new config options for each signal type to override the endpoint to which the OTLP HTTP exporter sends data. This is to aid with parity between this implementation and the analogous Go collector component, which also has these options: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlphttpexporter#otlp-http-exporter

## What issue does this PR close?

* Part of #1145

## How are these changes tested?

A new unit test is added.

## Are there any user-facing changes?

Users can configure the component with these new options.
…etry#2082)

# Change Summary

Collects the telemetry returned by the component in the exporter `TestRuntime`. A new function is added to manually invoke the telemetry loop one time on `InternalCollector`. This is called if the exporter returned a terminal state with some metrics snapshots.

## What issue does this PR close?

* Closes open-telemetry#2081

## How are these changes tested?

I tested these manually in a unit test I had on a feature branch: open-telemetry@2722031

More test coverage will be added later on: I plan to use this in a followup to PR open-telemetry#2070 to add a comprehensive suite of unit tests to the metrics reported by the OTLP HTTP exporter.

## Are there any user-facing changes?

No

Co-authored-by: utpilla <utpilla@users.noreply.github.com>
## Summary

Renames the gRPC-based OTLP exporter module and URN to distinguish it from the newly-added HTTP-based exporter (#2070).

**URN change:** `urn:otel:otlp:exporter` → `urn:otel:otlp_grpc:exporter`
**Module rename:** `otlp_exporter.rs` → `otlp_grpc_exporter.rs`

Fixes #2107

## Changes

### Rust Source (3 files)
- **Renamed** `otlp_exporter.rs` → `otlp_grpc_exporter.rs` and updated URN constant value
- **Updated** `lib.rs` module declaration: `pub mod otlp_exporter` → `pub mod otlp_grpc_exporter`
- **Updated** `urn.rs` test case URN reference

### Config Files (8 files)
All in `rust/otap-dataflow/configs/` — replaced `plugin_urn` from `urn:otel:otlp:exporter` → `urn:otel:otlp_grpc:exporter`

### Perf Test Templates (9 files)
All in `tools/pipeline_perf_test/test_suites/integration/templates/configs/` — same URN replacement

### Documentation (3 files)
- `crates/quiver/ARCHITECTURE.md` — updated node names + URN in config examples
- `docs/self_tracing_architecture.md` — updated node names in config example
- `docs/telemetry/metrics-guide.md` — updated metric set name

## What Was NOT Changed (by design)
- **Test function names** (e.g. `otlp_exporter_connects_with_mtls`) — describe behavior, not module path
- **Test file names** (`otlp_exporter_tls.rs`, `otlp_exporter_proxy_tls.rs`) — no module imports depend on them
- **Telemetry crate** (`otlp_exporter_provider`, `configure_grpc_otlp_exporter`) — separate OTel SDK, not the pipeline exporter
- **Constant/struct names** (`OTLP_EXPORTER_URN`, `OTLPExporter`) — kept per issue scope

## Verification
- ✅ `cargo build --workspace` — passed
- ✅ `cargo test --workspace` — all tests passed, zero failures
- ✅ `grep -r "urn:otel:otlp:exporter"` — zero matches remain

Co-authored-by: Drew Relmas <drewrelmas@gmail.com>
Co-authored-by: albertlockett <a.lockett@f5.com>
Change Summary
This PR adds a new exporter for exporting telemetry via OTLP over HTTP.
Many of the same patterns from the OTLP gRPC exporter are followed, including the optimizations that were added in #1474.
This PR adds a new `otlp_http::client_settings` to the otap crate, which contains `HttpClientSettings`, a reusable configuration for an HTTP client, which can produce a configured `reqwest::ClientBuilder`. This is analogous to `otlp_grpc::client_settings::GrpcClientSettings`.

The exporter uses the `otlp_exporter::InFlightExports` type for managing a limited number of concurrent in-flight requests and handling request completions.

The exporter uses a pool of `reqwest::Client`s, implemented as the `HttpClientPool` type. The intention is to force requests to be distributed over multiple connections to the OTLP server. In our case, there may be many servers listening on the same port using SO_REUSEPORT (for example, one per instance of a pipeline, each on different cores). The intention is to hopefully get better load balancing. When using HTTP version 1, this is likely unnecessary, as with a high enough concurrent request volume, multiple connections will be opened anyway by the internal reqwest Client connection pool. However, this will likely be needed when we add TLS, because HTTP version 2 may be negotiated, in which case multiple requests can be multiplexed over the same connection, leading to the same issue we had in the OTLP gRPC exporter (which also uses a client pool).

There are a variety of additional tasks that I will handle in followup PRs:

- [ ] TLS/mTLS support
- [ ] Proxy support
- [ ] Compression - compressed request bodies / accepting compressed responses
- [ ] Allow endpoint overrides for each signal type (similar to Go collector implementation)
- [ ] Unit test metrics produced by component
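The pool idea described above, reduced to a minimal round-robin sketch (the real `HttpClientPool` holds `reqwest::Client`s; the generic parameter just keeps this sketch self-contained, and all names here are illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Minimal sketch of a round-robin client pool.
struct ClientPool<C> {
    clients: Vec<C>,
    next: AtomicUsize,
}

impl<C> ClientPool<C> {
    fn new(clients: Vec<C>) -> Self {
        assert!(!clients.is_empty(), "pool needs at least one client");
        Self { clients, next: AtomicUsize::new(0) }
    }

    /// Hand out clients in round-robin order, so concurrent requests are
    /// spread across distinct connections (and thus distinct servers when
    /// many listeners share a port via SO_REUSEPORT).
    fn next_client(&self) -> &C {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.clients.len();
        &self.clients[i]
    }
}
```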
What issue does this PR close?

Relates to #1145
How are these changes tested?
A suite of unit tests covering successful requests for both OTLP & OTAP pdatas, plus tests covering various errors including connection refused, non-200 responses, OTLP service responses w/ partial success, and malformed OTAP batches.
Are there any user-facing changes?

Users now have a new component available for their telemetry pipelines.

Co-authored-by: Lalit Kumar Bhasin <lalit_fin@yahoo.com>
Co-authored-by: Utkarsh Umesan Pillai <66651184+utpilla@users.noreply.github.com>