feat(coprocessor): otel cardinality cleanup, dfg pattern attribution, spanmetrics#2019

Open
Eikix wants to merge 16 commits into main from otel-metrics-improvement

Conversation

@Eikix
Contributor

@Eikix Eikix commented Feb 23, 2026

Summary

This PR finishes the coprocessor OTEL cleanup in four parts:

  • Low-cardinality tracing cleanup across host-listener, tfhe-worker, sns-worker, zkproof-worker, and transaction-sender
  • Scheduler pattern attribution via bounded operation_pattern_id and transaction_pattern_id span attributes
  • Spanmetrics plumbing via an OTEL Collector + Prometheus pipeline for the local stack
  • Pattern integration test hardening so span assertions remain reliable under the shared global subscriber used by the test suite

Closes zama-ai/fhevm-internal#1044, zama-ai/fhevm-internal#1013, zama-ai/fhevm-internal#1012.

What changed

1. Coprocessor span cardinality cleanup

  • Removes high-cardinality child-span identifiers such as handle, txn_id, request_id, and similar per-item fields from runtime spans
  • Keeps transaction-scoped identifiers on root/parent spans where they remain useful for debugging
  • Replaces a few parent: &span call sites with span.in_scope() / equivalent scoped flow so parent-child relationships stay correct
  • Propagates span context correctly across spawned transaction tasks
  • Updates the affected SQLx metadata under coprocessor/fhevm-engine/.sqlx
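
To illustrate why per-item identifiers on child spans matter once span-derived metrics are in the pipeline, here is a minimal std-only sketch. It models a span as a name plus attributes and counts how many distinct label combinations a span-derived metric could produce. The `Span` type, the `compute_op` span name, and the attribute values are hypothetical and are not taken from this PR; the real code uses the `tracing` crate.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical stand-in for a recorded span: a name plus key/value attributes.
#[derive(Clone, Debug)]
pub struct Span {
    pub name: &'static str,
    pub attrs: BTreeMap<&'static str, String>,
}

/// Count distinct (name, attributes) combinations. Each combination would be
/// its own time series if every attribute were used as a metric dimension.
pub fn series_count(spans: &[Span]) -> usize {
    spans
        .iter()
        .map(|s| (s.name, s.attrs.clone()))
        .collect::<HashSet<_>>()
        .len()
}

/// Child spans tagged with a per-item `handle`: every span is unique.
pub fn spans_with_handles(n: usize) -> Vec<Span> {
    (0..n)
        .map(|i| Span {
            name: "compute_op",
            attrs: BTreeMap::from([("handle", format!("0x{i:04x}"))]),
        })
        .collect()
}

/// Child spans tagged only with a bounded attribute (two possible values).
pub fn spans_bounded(n: usize) -> Vec<Span> {
    (0..n)
        .map(|i| {
            let op = if i % 2 == 0 { "add" } else { "mul" };
            Span {
                name: "compute_op",
                attrs: BTreeMap::from([("operation", op.to_string())]),
            }
        })
        .collect()
}
```

With 1000 spans, `series_count(&spans_with_handles(1000))` grows with the number of items, while `series_count(&spans_bounded(1000))` stays at the number of distinct attribute values.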

Main files:

  • coprocessor/fhevm-engine/host-listener/src/database/tfhe_event_propagate.rs
  • coprocessor/fhevm-engine/tfhe-worker/src/tfhe_worker.rs
  • coprocessor/fhevm-engine/sns-worker/src/aws_upload.rs
  • coprocessor/fhevm-engine/sns-worker/src/executor.rs
  • coprocessor/fhevm-engine/sns-worker/src/squash_noise.rs
  • coprocessor/fhevm-engine/transaction-sender/src/ops/*.rs
  • coprocessor/fhevm-engine/zkproof-worker/src/verifier.rs

2. Scheduler pattern attribution

  • Adds deterministic operation_pattern_id and transaction_pattern_id attribution in the scheduler
  • Computes pattern IDs from logical-operation grouping on the pre-partition transaction graph
  • Uses a bounded encoding strategy so span attributes stay low-cardinality and size-safe
  • Splits the pattern implementation into focused modules:
    • scheduler/src/dfg/pattern/encoding.rs
    • scheduler/src/dfg/pattern/grouping.rs
    • scheduler/src/dfg/pattern/types.rs
    • scheduler/src/dfg/pattern/tests.rs
  • Updates scheduler execution spans in scheduler/src/dfg/scheduler.rs to emit the new pattern attributes while keeping transaction_id on the transaction root span
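
The bounded encoding idea can be sketched as a two-tier scheme: short operation sequences are encoded directly, and longer ones fall back to a fixed-width digest so the attribute never exceeds a known size. This is a rough illustration under stated assumptions, not the layout in `encoding.rs`: the `v1:`/`v2:` prefixes, the `MAX_COMPACT_OPS` threshold, and the opcode-as-byte representation are all hypothetical, and FNV-1a is used only as a std-only stand-in for the hash used by the real implementation.

```rust
/// Hypothetical size limit for the compact form (not taken from the PR).
pub const MAX_COMPACT_OPS: usize = 16;

/// FNV-1a 64-bit hash. Used here only as a std-only stand-in for a
/// cryptographic hash; it is NOT what the PR uses.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Encode a sequence of opcodes into a size-bounded, deterministic ID.
/// Small groups keep a readable hex form; large groups collapse to a digest.
pub fn encode_pattern_id(opcodes: &[u8]) -> String {
    if opcodes.len() <= MAX_COMPACT_OPS {
        let hex: String = opcodes.iter().map(|b| format!("{b:02x}")).collect();
        format!("v1:{hex}")
    } else {
        format!("v2:{:016x}", fnv1a64(opcodes))
    }
}
```

In this sketch the output is at most 35 characters (`v1:` plus 32 hex digits), and identical operation sequences always map to the same ID, so the number of distinct values is bounded by the number of distinct operation shapes rather than by the number of transactions.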

Main files:

  • coprocessor/fhevm-engine/scheduler/src/dfg.rs
  • coprocessor/fhevm-engine/scheduler/src/dfg/scheduler.rs
  • coprocessor/fhevm-engine/scheduler/src/dfg/pattern/*

3. Local spanmetrics pipeline

  • Adds OTEL Collector config with a spanmetrics connector
  • Wires the collector into the local tracing docker-compose stack
  • Updates Prometheus scrape config for the collector/spanmetrics path

Main files:

  • test-suite/fhevm/config/otel-collector/otel-collector-config.yaml
  • test-suite/fhevm/docker-compose/tracing-docker-compose.yml
  • test-suite/fhevm/config/prometheus/prometheus.yml
  • .gitignore

4. tfhe-worker pattern integration test hardening

  • Adds the end-to-end pattern_integration test module for realistic encrypted-transfer shapes
  • Installs a shared in-memory exporter in tfhe-worker test setup so tracing setup is no longer order-dependent
  • Filters captured spans by the real transaction root transaction_id emitted on execute_transaction, so the assertions ignore cross-test span pollution without serializing the suite
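
The filtering step can be sketched as follows, assuming the in-memory exporter yields a flat list of finished spans. `CapturedSpan` and its fields are hypothetical simplifications of exported span data, and linking child spans to the root through a shared `trace_id` is an assumption of this sketch, not a statement about how `pattern_integration.rs` does it.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of a span captured by an in-memory exporter.
#[derive(Clone, Debug, PartialEq)]
pub struct CapturedSpan {
    pub trace_id: u128,
    pub name: String,
    /// Present only on the transaction root span in this sketch.
    pub transaction_id: Option<String>,
}

/// Keep only spans that belong to the trace whose `execute_transaction`
/// root carries the given transaction ID. Spans emitted by other tests
/// running in parallel are dropped.
pub fn spans_for_transaction(all: &[CapturedSpan], tx_id: &str) -> Vec<CapturedSpan> {
    // Step 1: find the trace(s) rooted at the transaction we care about.
    let trace_ids: HashSet<u128> = all
        .iter()
        .filter(|s| s.name == "execute_transaction")
        .filter(|s| s.transaction_id.as_deref() == Some(tx_id))
        .map(|s| s.trace_id)
        .collect();

    // Step 2: keep every span that shares one of those trace IDs.
    all.iter()
        .filter(|s| trace_ids.contains(&s.trace_id))
        .cloned()
        .collect()
}
```

Because the filter keys on an ID the test itself generated, assertions stay stable even when other tests write into the same exporter at the same time.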

Main files:

  • coprocessor/fhevm-engine/tfhe-worker/src/tests/pattern_integration.rs
  • coprocessor/fhevm-engine/tfhe-worker/src/tests/utils.rs
  • coprocessor/fhevm-engine/tfhe-worker/src/tests/mod.rs

Follow-up

  • Log/trace correlation was split out to #2055 to keep this PR focused on span attribution and spanmetrics.

Commit map

  • feat(telemetry): add JSON log-trace correlation
  • refactor(coprocessor): standardize low-cardinality span attribution
  • feat(infra): add OTEL collector spanmetrics pipeline
  • feat(scheduler): add DFG pattern attribution with scalable encoding
  • fix(deps): pin tfhe-cuda-backend to 0.13.0 in lockfile
  • test(tfhe-worker): harden pattern integration tracing assertions
  • fix(tfhe-worker): align compressed ciphertext plumbing with scheduler
  • refactor(coprocessor): keep tx ids on root spans only

Validation

  • Pre-commit cargo check and clippy passed while rewriting the branch history
  • SQLX_OFFLINE=true cargo check -p fhevm-engine-common -p scheduler -p tfhe-worker -p transaction-sender -p sns-worker -p zkproof-worker -p host-listener --tests --quiet
  • cargo test -p scheduler pattern::tests --quiet
  • SQLX_OFFLINE=true cargo test -p tfhe-worker test_erc20_transaction_pattern_ids --no-run

Review notes

  • The branch was history-rewritten into a smaller themed stack; the final tree matches the pre-rewrite PR head plus the follow-up scope reduction that moved log/trace correlation into a separate issue.

@cla-bot cla-bot bot added the cla-signed label Feb 23, 2026
@Eikix Eikix force-pushed the otel-metrics-improvement branch from ceb3dff to a4f023f Compare February 23, 2026 10:49
@Eikix Eikix changed the title from "feat(coprocessor): OTEL cardinality cleanup, DFG pattern attribution, spanmetrics" to "feat(coprocessor): otel cardinality cleanup, dfg pattern attribution, spanmetrics" Feb 23, 2026
@mergify

mergify bot commented Feb 23, 2026

🧪 CI Insights

Here's what we observed from your CI run for e9579f1.

🟢 All jobs passed!

But CI Insights is watching 👀

@Eikix Eikix force-pushed the otel-metrics-improvement branch 11 times, most recently from 762d58e to 4d407b1 Compare February 25, 2026 13:02
@Eikix Eikix marked this pull request as ready for review February 25, 2026 13:42
@Eikix Eikix requested review from a team as code owners February 25, 2026 13:42
@Eikix Eikix force-pushed the otel-metrics-improvement branch from c750e92 to 5f4e105 Compare February 25, 2026 15:01
@Eikix Eikix force-pushed the otel-metrics-improvement branch 2 times, most recently from 0e498f2 to 256b446 Compare February 26, 2026 15:57
@Eikix Eikix marked this pull request as draft February 27, 2026 07:51
@Eikix Eikix marked this pull request as ready for review February 27, 2026 08:40
@Eikix Eikix force-pushed the otel-metrics-improvement branch from 011fb1d to 81ed9e8 Compare March 4, 2026 13:08
claude[bot]

This comment was marked as resolved.

@zama-ai zama-ai deleted a comment from claude bot Mar 5, 2026
@zama-ai zama-ai deleted a comment from claude bot Mar 5, 2026
@Eikix Eikix force-pushed the otel-metrics-improvement branch 2 times, most recently from fe28ad9 to efbcbd1 Compare March 6, 2026 10:04
@claude

claude bot commented Mar 17, 2026

Claude finished @PanGan21's request. View run

@PanGan21 PanGan21 force-pushed the otel-metrics-improvement branch from 82f366a to d0b9ad0 Compare March 17, 2026 15:41

@claude claude bot left a comment

[AUTOMATED]

Code Review Summary

This is a well-structured PR that accomplishes four distinct goals: OTEL cardinality cleanup, DFG pattern attribution, spanmetrics plumbing, and test hardening. The code quality is high — the pattern encoding module is well-documented, the union-find implementation is correct, the binary layout is sound, and the tracing refactors (replacing parent: &span with span.in_scope() and .instrument()) are semantically correct.

What was reviewed

| Agent | Focus | Findings |
|---|---|---|
| Bug Hunter ×2 (opus) | Runtime bugs, security | No high-confidence bugs found |
| Error Handling Auditor | Silent failures | 1 minor observability nit (see inline) |
| Guideline Compliance | Project conventions | No violations; new code follows established patterns |

Highlights

  • Pattern encoding (encoding.rs): Clean binary layout with proper bounds checking. Encode/decode round-trips are verified by extensive tests. The two-tier strategy (v1 compact encoding → v2 Keccak hash for large groups) is well-designed.
  • Union-find grouping (grouping.rs): Correct path-splitting implementation. The cone-tracing and merge algorithm is well-documented with worked examples.
  • Tracing refactors: The parent: &span → span.in_scope() / .instrument() changes are semantically correct and properly propagate span context across async boundaries.
  • Test coverage: Comprehensive unit tests for pattern encoding/decoding, plus an end-to-end integration test (pattern_integration.rs) that validates real encrypted-transfer shapes.

One minor observability nit is noted inline. Overall this is a solid, well-tested change.
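
For readers unfamiliar with the path-splitting variant of union-find mentioned above, here is a minimal generic sketch. It is not the code from `grouping.rs`; the `UnionFind` type, the use of `usize` node indices, and union by size are assumptions made for illustration.

```rust
/// Minimal disjoint-set structure over nodes `0..n` (illustrative only).
pub struct UnionFind {
    parent: Vec<usize>,
    size: Vec<usize>,
}

impl UnionFind {
    /// Start with every node in its own singleton set.
    pub fn new(n: usize) -> Self {
        Self { parent: (0..n).collect(), size: vec![1; n] }
    }

    /// Find the representative of `x` using path splitting: each node
    /// visited on the way up is re-pointed to its grandparent.
    pub fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            let next = self.parent[x];
            self.parent[x] = self.parent[next];
            x = next;
        }
        x
    }

    /// Merge the sets containing `a` and `b`. Returns `false` if they
    /// were already in the same set.
    pub fn union(&mut self, a: usize, b: usize) -> bool {
        let (mut ra, mut rb) = (self.find(a), self.find(b));
        if ra == rb {
            return false;
        }
        // Union by size: attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb] {
            std::mem::swap(&mut ra, &mut rb);
        }
        self.parent[rb] = ra;
        self.size[ra] += self.size[rb];
        true
    }
}
```

Path splitting needs only a single pass and no recursion, which keeps `find` simple while still flattening the trees over time.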

@claude

claude bot commented Mar 17, 2026

The review has been posted successfully. Here's a summary of what was done:

Review Complete ✅

PR #2019: feat(coprocessor): otel cardinality cleanup, dfg pattern attribution, spanmetrics

Review Process

  • 4 specialized agents ran in parallel (2× Bug Hunter opus, Error Handling Auditor sonnet, Guideline Compliance sonnet)
  • 3 validation agents independently verified flagged issues
  • Strict confidence threshold (≥80) applied, with validator cross-checks

Results

| Check | Result |
|---|---|
| Bugs & Security (2× opus) | ✅ No high-confidence bugs found |
| Error Handling | ✅ 1 minor nit (inline comment posted) |
| Guideline Compliance | ✅ No violations; code follows established patterns |

Filtered Issues

  • Encode/decode opcode enum asymmetry → False positive (enums have identical discriminant values)
  • Mutex poisoning risk → False positive (lock scope contains only non-panicking operations)
  • Silent toposort failure → Posted as non-blocking nit (missing warn! log for consistency with scheduler.rs)

Posted

  • 1 summary review comment with findings table
  • 1 inline nit on grouping.rs:118 suggesting a warn! log for the silent toposort fallback

@PanGan21 PanGan21 force-pushed the otel-metrics-improvement branch 2 times, most recently from 68a953c to e9579f1 Compare March 18, 2026 10:20
antoniupop previously approved these changes Mar 23, 2026
Collaborator

@antoniupop antoniupop left a comment

Looks good - and very well tested. Pattern encoding looks nice!

@Eikix
Contributor Author

Eikix commented Mar 23, 2026

@Mergifyio queue

@mergify

mergify bot commented Mar 23, 2026

Merge Queue Status

  • 🟠 Waiting for queue conditions
  • ⏳ Enter queue
  • ⏳ Run checks
  • ⏳ Merge
Required conditions to enter a queue
  • any of [🔀 queue conditions]:
    • all of [📌 queue conditions of queue rule main]:
      • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
      • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
      • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
      • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
      • base = main
      • label!=do-not-merge
      • any of [🛡 GitHub branch protection]:
        • check-success = common-pull-request/lint (bpr)
        • check-neutral = common-pull-request/lint (bpr)
        • check-skipped = common-pull-request/lint (bpr)
      • any of [🛡 GitHub branch protection]:
        • check-success = coprocessor-cargo-test/cargo-tests (bpr)
        • check-neutral = coprocessor-cargo-test/cargo-tests (bpr)
        • check-skipped = coprocessor-cargo-test/cargo-tests (bpr)
      • any of [🛡 GitHub branch protection]:
        • check-success = coprocessor-dependency-analysis/dependencies-check (bpr)
        • check-neutral = coprocessor-dependency-analysis/dependencies-check (bpr)
        • check-skipped = coprocessor-dependency-analysis/dependencies-check (bpr)
      • any of [🛡 GitHub branch protection]:
        • check-skipped = gateway-contracts-deployment-tests/sc-deploy (bpr)
        • check-neutral = gateway-contracts-deployment-tests/sc-deploy (bpr)
        • check-success = gateway-contracts-deployment-tests/sc-deploy (bpr)
      • any of [🛡 GitHub branch protection]:
        • check-skipped = kms-connector-tests/test-connector (bpr)
        • check-neutral = kms-connector-tests/test-connector (bpr)
        • check-success = kms-connector-tests/test-connector (bpr)
  • -closed [📌 queue requirement]
  • -conflict [📌 queue requirement]
  • -draft [📌 queue requirement]
  • any of [📌 queue -> configuration change requirements]:
    • -mergify-configuration-changed
    • check-success = Configuration changed

@Eikix
Contributor Author

Eikix commented Mar 25, 2026

CI seems to pass now.

@Eikix Eikix force-pushed the otel-metrics-improvement branch from f8bf5f8 to 4519d5f Compare March 27, 2026 08:27

3 participants