Skip to content

feat(coprocessor): standardize OTLP init + shutdown guard#1942

Merged
mergify[bot] merged 22 commits intomainfrom
codex/coprocessor-otel-phase-a-main
Feb 10, 2026
Merged

feat(coprocessor): standardize OTLP init + shutdown guard#1942
mergify[bot] merged 22 commits intomainfrom
codex/coprocessor-otel-phase-a-main

Conversation

@Eikix
Copy link
Copy Markdown
Contributor

@Eikix Eikix commented Feb 9, 2026

Summary

  • unify coprocessor OTEL init and shutdown lifecycle in fhevm-engine-common::telemetry
  • centralize sns-worker subscriber + optional OTEL layer wiring in common telemetry
  • keep behavior minimal: global tracer provider only, no extra propagator config

Validation

  • pre-commit passed (cargo fmt, cargo check, clippy)
  • targeted check passed:
    SQLX_OFFLINE=true cargo check -p fhevm-engine-common -p sns-worker -p gw-listener -p host-listener -p tfhe-worker -p transaction-sender -p zkproof-worker
  • telemetry unit tests pass in fhevm-engine-common

Scope

  • no gRPC migration
  • no operation-level span migration

Closes https://github.com/zama-ai/fhevm-internal/issues/1006

@Eikix Eikix requested a review from a team as a code owner February 9, 2026 08:54
@cla-bot cla-bot bot added the cla-signed label Feb 9, 2026
@Eikix
Copy link
Copy Markdown
Contributor Author

Eikix commented Feb 9, 2026

@claude /pr-review
is a drop guard good practice or outdated?

@claude
Copy link
Copy Markdown

claude bot commented Feb 9, 2026

Claude finished @Eikix's task in 1m 17s —— View job


PR Review: feat(coprocessor): standardize OTLP init + shutdown guard

  • Gather context and read changed files
  • Analyze drop guard pattern usage
  • Answer question about drop guard best practices
  • Provide comprehensive PR review feedback

Question: Is a Drop Guard Good Practice or Outdated?

Short answer: Drop guards are absolutely still good practice in Rust, especially for your use case.

Why Drop Guards are the Right Choice Here

  1. RAII Principle: Drop guards are the idiomatic Rust way to implement Resource Acquisition Is Initialization (RAII). This pattern ensures cleanup happens even when:

    • The program panics
    • Early returns occur
    • Errors cause stack unwinding
  2. OpenTelemetry's Flush Requirement: The SdkTracerProvider::shutdown() method flushes pending spans before shutting down. Without proper shutdown, you risk losing traces that haven't been exported yet. A drop guard guarantees this happens.

  3. Modern Rust Standard Library Uses This Pattern: Examples include:

    • MutexGuard<T> - releases the lock on drop
    • JoinHandle<T> - can detach or join on drop
    • File - flushes and closes on drop
  4. Your Implementation is Clean: The Option::take() pattern in shutdown_once() is the correct way to ensure idempotent shutdown - this prevents double-free issues if shutdown is called manually before drop.

When Drop Guards Might Be "Outdated"

The only scenario where drop guards are considered problematic is in async contexts where the destructor needs to perform async work. In your case, SdkTracerProvider::shutdown() is synchronous, so this isn't an issue.


PR Review Summary

✅ What's Good

  1. Unified OTLP Initialization API: All binaries now use the same init_otlp() or init_otlp_tracing() path, eliminating drift.

  2. Idempotent Shutdown: The Option::take() pattern ensures shutdown_once() is safe to call multiple times.

  3. Good Test Coverage: The unit tests verify the guard semantics (shutdown_once() disarms, tracer presence/absence).

  4. Graceful Degradation: Empty service_name returns Ok(None) instead of failing, allowing opt-out of OTLP.

  5. Consistent Error Handling Across Binaries: All binaries follow the same pattern:

    let _otlp_runtime = match telemetry::init_otlp(&config.service_name) {
        Ok(runtime) => runtime,
        Err(err) => {
            error!(error = %err, "Failed to setup OTLP");
            None
        }
    };

💡 Suggestions (Non-Blocking)

1. Consider Adding Timeout to Shutdown

telemetry.rs:35-40

The shutdown could potentially hang if the OTLP collector is unresponsive. Consider wrapping with a timeout:

fn shutdown_once(&mut self) {
    if let Some(provider) = self.provider.take() {
        // Consider: timeout around shutdown to prevent hanging on exit
        if let Err(err) = provider.shutdown() {
            warn!(error = %err, "Failed to shutdown OTLP tracer provider");
        }
    }
}

Add timeout →

2. Document Why tracer() Returns a Clone

telemetry.rs:67-69

pub fn tracer(&self) -> Option<opentelemetry_sdk::trace::Tracer> {
    self.tracer.clone()  // Why clone? Is Tracer expensive to clone?
}

This is probably fine since Tracer is likely cheap to clone, but a brief comment would help future readers understand the design choice.

3. Consider #[must_use] for OtlpRuntime

telemetry.rs:51-70

Adding #[must_use] would catch cases where callers accidentally discard the runtime:

#[must_use = "OtlpRuntime must be kept alive for the duration of the process to ensure proper shutdown"]
pub struct OtlpRuntime {
    // ...
}

Add must_use →

✅ Overall Assessment

This PR is well-designed and ready to merge. The drop guard pattern is appropriate, the implementation is clean, and it successfully standardizes OTLP initialization across all coprocessor binaries. The review comments from the self-review have been addressed properly (renaming shutdown_if_neededshutdown_once, removing redundant public shutdown(), etc.).


|

@mergify
Copy link
Copy Markdown

mergify bot commented Feb 9, 2026

🧪 CI Insights

Here's what we observed from your CI run for 60f4a1f.

🟢 All jobs passed!

But CI Insights is watching 👀

@Eikix Eikix marked this pull request as draft February 9, 2026 11:01
@Eikix Eikix marked this pull request as ready for review February 9, 2026 15:04
@Eikix
Copy link
Copy Markdown
Contributor Author

Eikix commented Feb 9, 2026

@claude /pr-review

@claude
Copy link
Copy Markdown

claude bot commented Feb 9, 2026

Claude finished @Eikix's task in 5m 47s —— View job


PR Review: feat(coprocessor): standardize OTLP init + shutdown guard

  • Pre-review validation
  • Gather context (PR summary + guidelines)
  • Run parallel review agents (bug hunter, guideline compliance, error handling)
  • Validate high-confidence issues
  • Post final review

Summary

This PR successfully standardizes OTLP initialization across the coprocessor binaries with a clean TracerProviderGuard pattern that ensures proper shutdown. The implementation is sound and follows OpenTelemetry best practices.

Verdict: ✅ Ready to merge with one minor consistency fix suggested below.


What's Good

  1. Clean Drop Guard Pattern: The TracerProviderGuard using Option::take() ensures idempotent shutdown - safe to call multiple times, guaranteed cleanup on drop.

  2. Graceful Degradation: init_json_subscriber correctly initializes JSON logs even when OTLP export fails, ensuring logs are never lost.

  3. Unified API: All binaries now follow the same initialization pattern, eliminating drift.

  4. Good Test Coverage: Unit tests verify guard semantics (shutdown_once disarms, tracer presence/absence).

  5. Correct Clone Semantics: SdkTracerProvider uses Arc internally, so cloning shares state - this is intentional and ensures shutdown affects all references.


Issues Found

🟡 Minor: Inconsistent Log Level for OTLP Failure

Confidence: 95%

File: coprocessor/fhevm-engine/host-listener/src/poller/mod.rs:96

The poller uses warn! while all other services use error! for OTLP initialization failures:

// poller/mod.rs:96 - uses warn!
warn!(error = %err, "Failed to setup OTLP");

// All other services use error!
error!(error = %err, "Failed to setup OTLP");

Impact: Operators monitoring for ERROR-level logs may miss OTLP failures in the poller service. Since OTLP failure means no distributed tracing, this should be treated consistently across all services.

Fix this →


Notes

Drop Guard Pattern - Answering the Original Question

Q: Is a drop guard good practice or outdated?

A: Drop guards are absolutely still good practice, especially for this use case:

  1. RAII Principle: This is idiomatic Rust for resource cleanup that must happen even during panics or early returns.

  2. Standard Library Uses This: MutexGuard<T>, JoinHandle<T>, File all use drop guards.

  3. OpenTelemetry Requirement: SdkTracerProvider::shutdown() flushes pending spans - without proper shutdown, traces can be lost.

  4. Your Implementation is Correct: The Option::take() pattern prevents double-shutdown issues.

The only scenario where drop guards are problematic is for async cleanup, but SdkTracerProvider::shutdown() is synchronous, so this is not an issue here.


View job

@Eikix
Copy link
Copy Markdown
Contributor Author

Eikix commented Feb 10, 2026

@mergify queue

@mergify
Copy link
Copy Markdown

mergify bot commented Feb 10, 2026

Merge Queue Status

Rule: main


This pull request spent 2 hours 37 minutes 15 seconds in the queue, including 1 hour 35 minutes 58 seconds running CI.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
  • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
  • check-success = run-e2e-tests / fhevm-e2e-test
  • any of [🛡 GitHub branch protection]:
    • check-success = common-pull-request/lint (bpr)
    • check-neutral = common-pull-request/lint (bpr)
    • check-skipped = common-pull-request/lint (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = coprocessor-cargo-listener-tests/cargo-tests (bpr)
    • check-neutral = coprocessor-cargo-listener-tests/cargo-tests (bpr)
    • check-success = coprocessor-cargo-listener-tests/cargo-tests (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-success = coprocessor-cargo-test/cargo-tests (bpr)
    • check-neutral = coprocessor-cargo-test/cargo-tests (bpr)
    • check-skipped = coprocessor-cargo-test/cargo-tests (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-success = coprocessor-dependency-analysis/dependencies-check (bpr)
    • check-neutral = coprocessor-dependency-analysis/dependencies-check (bpr)
    • check-skipped = coprocessor-dependency-analysis/dependencies-check (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = gateway-contracts-deployment-tests/sc-deploy (bpr)
    • check-neutral = gateway-contracts-deployment-tests/sc-deploy (bpr)
    • check-success = gateway-contracts-deployment-tests/sc-deploy (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = kms-connector-tests/test-connector (bpr)
    • check-neutral = kms-connector-tests/test-connector (bpr)
    • check-success = kms-connector-tests/test-connector (bpr)

@mergify mergify bot added the queued label Feb 10, 2026
mergify bot added a commit that referenced this pull request Feb 10, 2026
@mergify mergify bot merged commit 4dafed3 into main Feb 10, 2026
72 of 76 checks passed
@mergify mergify bot deleted the codex/coprocessor-otel-phase-a-main branch February 10, 2026 14:47
@mergify mergify bot removed the queued label Feb 10, 2026
mergify bot added a commit that referenced this pull request Feb 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants