Skip to content

feat(coprocessor): otel cardinality cleanup, dfg pattern attribution, spanmetrics#2019

Open
Eikix wants to merge 9 commits intomainfrom
otel-metrics-improvement
Open

feat(coprocessor): otel cardinality cleanup, dfg pattern attribution, spanmetrics#2019
Eikix wants to merge 9 commits intomainfrom
otel-metrics-improvement

Conversation

@Eikix
Copy link
Contributor

@Eikix Eikix commented Feb 23, 2026

Summary

This PR finishes the coprocessor OTEL cleanup in four parts:

  • Low-cardinality tracing cleanup across host-listener, tfhe-worker, sns-worker, zkproof-worker, and transaction-sender
  • Scheduler pattern attribution via bounded operation_pattern_id and transaction_pattern_id span attributes
  • Spanmetrics plumbing via an OTEL Collector + Prometheus pipeline for the local stack
  • Pattern integration test hardening so span assertions remain reliable under the shared global subscriber used by the test suite

Closes zama-ai/fhevm-internal#1044, zama-ai/fhevm-internal#1013, zama-ai/fhevm-internal#1012.

What changed

1. Coprocessor span cardinality cleanup

  • Removes high-cardinality child-span identifiers such as handle, txn_id, request_id, and similar per-item fields from runtime spans
  • Keeps transaction-scoped identifiers on root/parent spans where they remain useful for debugging
  • Replaces a few parent: &span call sites with span.in_scope() / equivalent scoped flow so parent-child relationships stay correct
  • Propagates span context correctly across spawned transaction tasks
  • Updates the affected SQLx metadata under coprocessor/fhevm-engine/.sqlx

Main files:

  • coprocessor/fhevm-engine/host-listener/src/database/tfhe_event_propagate.rs
  • coprocessor/fhevm-engine/tfhe-worker/src/tfhe_worker.rs
  • coprocessor/fhevm-engine/sns-worker/src/aws_upload.rs
  • coprocessor/fhevm-engine/sns-worker/src/executor.rs
  • coprocessor/fhevm-engine/sns-worker/src/squash_noise.rs
  • coprocessor/fhevm-engine/transaction-sender/src/ops/*.rs
  • coprocessor/fhevm-engine/zkproof-worker/src/verifier.rs

2. Scheduler pattern attribution

  • Adds deterministic operation_pattern_id and transaction_pattern_id attribution in the scheduler
  • Computes pattern IDs from logical-operation grouping on the pre-partition transaction graph
  • Uses a bounded encoding strategy so span attributes stay low-cardinality and size-safe
  • Splits the pattern implementation into focused modules:
    • scheduler/src/dfg/pattern/encoding.rs
    • scheduler/src/dfg/pattern/grouping.rs
    • scheduler/src/dfg/pattern/types.rs
    • scheduler/src/dfg/pattern/tests.rs
  • Updates scheduler execution spans in scheduler/src/dfg/scheduler.rs to emit the new pattern attributes while keeping transaction_id on the transaction root span

Main files:

  • coprocessor/fhevm-engine/scheduler/src/dfg.rs
  • coprocessor/fhevm-engine/scheduler/src/dfg/scheduler.rs
  • coprocessor/fhevm-engine/scheduler/src/dfg/pattern/*

3. Local spanmetrics pipeline

  • Adds OTEL Collector config with a spanmetrics connector
  • Wires the collector into the local tracing docker-compose stack
  • Updates Prometheus scrape config for the collector/spanmetrics path

Main files:

  • test-suite/fhevm/config/otel-collector/otel-collector-config.yaml
  • test-suite/fhevm/docker-compose/tracing-docker-compose.yml
  • test-suite/fhevm/config/prometheus/prometheus.yml
  • .gitignore

4. tfhe-worker pattern integration test hardening

  • Adds the end-to-end pattern_integration test module for realistic encrypted-transfer shapes
  • Installs a shared in-memory exporter in tfhe-worker test setup so tracing setup is no longer order-dependent
  • Filters captured spans by the real transaction root transaction_id emitted on execute_transaction, so the assertions ignore cross-test span pollution without serializing the suite

Main files:

  • coprocessor/fhevm-engine/tfhe-worker/src/tests/pattern_integration.rs
  • coprocessor/fhevm-engine/tfhe-worker/src/tests/utils.rs
  • coprocessor/fhevm-engine/tfhe-worker/src/tests/mod.rs

Follow-up

  • Log/trace correlation was split out to #2055 to keep this PR focused on span attribution and spanmetrics.

Commit map

  • feat(telemetry): add JSON log-trace correlation
  • refactor(coprocessor): standardize low-cardinality span attribution
  • feat(infra): add OTEL collector spanmetrics pipeline
  • feat(scheduler): add DFG pattern attribution with scalable encoding
  • fix(deps): pin tfhe-cuda-backend to 0.13.0 in lockfile
  • test(tfhe-worker): harden pattern integration tracing assertions
  • fix(tfhe-worker): align compressed ciphertext plumbing with scheduler
  • refactor(coprocessor): keep tx ids on root spans only

Validation

  • Pre-commit cargo check and clippy passed while rewriting the branch history
  • SQLX_OFFLINE=true cargo check -p fhevm-engine-common -p scheduler -p tfhe-worker -p transaction-sender -p sns-worker -p zkproof-worker -p host-listener --tests --quiet
  • cargo test -p scheduler pattern::tests --quiet
  • SQLX_OFFLINE=true cargo test -p tfhe-worker test_erc20_transaction_pattern_ids --no-run

Review notes

  • The branch was history-rewritten into a smaller themed stack; the final tree matches the pre-rewrite PR head plus the follow-up scope reduction that moved log/trace correlation into a separate issue.

@cla-bot cla-bot bot added the cla-signed label Feb 23, 2026
@Eikix Eikix force-pushed the otel-metrics-improvement branch from ceb3dff to a4f023f Compare February 23, 2026 10:49
@Eikix Eikix changed the title feat(coprocessor): OTEL cardinality cleanup, DFG pattern attribution, spanmetrics feat(coprocessor): otel cardinality cleanup, dfg pattern attribution, spanmetrics Feb 23, 2026
@mergify
Copy link

mergify bot commented Feb 23, 2026

🧪 CI Insights

Here's what we observed from your CI run for 5f61c98.

🟢 All jobs passed!

But CI Insights is watching 👀

@Eikix Eikix force-pushed the otel-metrics-improvement branch 11 times, most recently from 762d58e to 4d407b1 Compare February 25, 2026 13:02
@Eikix Eikix marked this pull request as ready for review February 25, 2026 13:42
@Eikix Eikix requested review from a team as code owners February 25, 2026 13:42
@Eikix Eikix force-pushed the otel-metrics-improvement branch from c750e92 to 5f4e105 Compare February 25, 2026 15:01
@Eikix Eikix force-pushed the otel-metrics-improvement branch 2 times, most recently from 0e498f2 to 256b446 Compare February 26, 2026 15:57
@Eikix Eikix marked this pull request as draft February 27, 2026 07:51
@Eikix Eikix marked this pull request as ready for review February 27, 2026 08:40
@Eikix Eikix force-pushed the otel-metrics-improvement branch from 011fb1d to 81ed9e8 Compare March 4, 2026 13:08
claude[bot]

This comment was marked as resolved.

@zama-ai zama-ai deleted a comment from claude bot Mar 5, 2026
@zama-ai zama-ai deleted a comment from claude bot Mar 5, 2026
@Eikix Eikix force-pushed the otel-metrics-improvement branch from 91edb09 to fe28ad9 Compare March 6, 2026 09:56
@Eikix Eikix force-pushed the otel-metrics-improvement branch from fe28ad9 to efbcbd1 Compare March 6, 2026 10:04
@Eikix
Copy link
Contributor Author

Eikix commented Mar 6, 2026

@claude /pr-review

@claude
Copy link

claude bot commented Mar 6, 2026

Claude finished @Eikix's task in 3m 20s —— View job


[AUTOMATED]

Code Review — OTel Cardinality Cleanup, DFG Pattern Attribution, Spanmetrics

  • Pre-review validation
  • Gather context (PR files, guidelines, all 37 changed files)
  • Analyze all changed files in detail
  • Post final review

Summary

This is a high-quality, well-structured PR that delivers comprehensive OTel observability improvements. The cardinality cleanup is thorough and consistent, the pattern attribution algorithm is clever and well-tested, and the infrastructure changes are production-ready. The previous review's P2 (request_id on verify_task span) has been addressed — it is now logged inside the span rather than being a span attribute.

Verdict: No blocking issues found. A few observations and one minor nit below.


Findings

Nit — handle remains as span attribute on sns-worker task span (executor.rs:487-491)

The handle field (a unique 32-byte hash) is on the task span in sns-worker. This is a parent/root span for the upload pipeline, so by the PR's own convention (keep identifiers on root spans, remove from children) this is intentional and correct. Just flagging for awareness — it won't create spanmetric label series since it's not in the OTEL collector's dimensions allowlist, but it does inflate exported trace span payloads. Completely fine for the current architecture.

Observation — HASH_LOG_SEEN global mutable state uses Mutex

The LRU dedup cache (encoding.rs:30-34) uses std::sync::Mutex behind a LazyLock. Since pattern finalization happens on the scheduler's compute thread pool (rayon/sync context), this is appropriate. The lock contention should be negligible given the LRU operations are O(1) and the critical section is tiny. Worth noting if the scheduler ever moves to a multi-process architecture.

Observation — inject_otel_fields JSON re-serialization per log line

The OtelJsonFormat (telemetry.rs:75-79) parses and re-serializes JSON for every log event when an OTel context is active. This is a reasonable trade-off for correctness (handles arbitrary field ordering from tracing_subscriber). The no-context fast path (skips injection entirely) covers init and non-traced paths. For extremely hot logging paths this could become measurable, but the current architecture only creates spans around FHE operations and I/O tasks, so volume should be manageable.

Observation — Pattern stability depends on node insertion order

The encoding (encode_subgraph, encoding.rs:265-327) relies on topological order within sorted node indices. The code correctly sorts operations by output_handle in dfg.rs:168 before building the graph, making the encoding deterministic. Any future change to this sorting would break pattern stability — the doc comments correctly document this caveat.


Acceptance Criteria Matrix

# Criterion Verdict Notes
1 Child spans stripped of high-cardinality IDs ✅ PASS txn_id removed from all child spans (scheduler, sns-worker, zkproof-worker, transaction-sender). request_id moved to log message. handle removed from child spans. dependence_chain_id removed.
2 Root/parent spans retain transaction_hash ✅ PASS handle_tfhe_event, handle_acl_event, prepare_add_ciphertext, prepare_verify_proof_resp, prepare_allow_account, task (sns-worker) all carry transaction_hash.
3 operation_pattern_id on FHE operation spans ✅ PASS Present on fhe_operation, compress_ciphertext, rerandomise_op_inputs spans as base64url. 22+ unit tests verify stability.
4 transaction_pattern_id on tx-level spans ✅ PASS Present on execute_transaction span. Propagated via PartitionTask struct.
5 Three-tier encoding (v1 raw / v1-hashed / wide-hashed) ✅ PASS Threshold configurable via FHEVM_PATTERN_HASH_THRESHOLD. All tiers tested explicitly.
6 Cross-thread span parenting ✅ PASS tokio::spawn + .instrument(span) in tfhe-worker. span.set_parent(ctx) in sns-worker. squash_noise.rs improved to span.enter() pattern.
7 Spanmetrics low-cardinality dimensions ✅ PASS Explicit allowlist in OTEL collector config. span.kind excluded. metrics_expiration: 5m for auto-cleanup.
8 OtelJsonFormat for log→trace correlation ✅ PASS trace_id/span_id injected into JSON logs. Correctly skips when no OTel context or parent mismatch. 6 unit tests cover edge cases.
9 E2E integration test coverage ✅ PASS test_erc20_transaction_pattern_ids with InMemorySpanExporter covers 3 transaction shapes with structural assertions.
10 Removed deprecated helpers ✅ PASS short_hex_id, record_short_hex, record_short_hex_if_some removed from telemetry. All callers migrated.

Architecture Highlights

  1. PartitionTask struct (scheduler.rs): Nice refactor from the previous 4-tuple to a named struct. Much more readable and extensible.

  2. execute_transaction_span! macro (scheduler.rs): The feature-gated macro for test-span-attrs is a clean way to include test_transaction_id in test builds without polluting production spans.

  3. Union-find grouping (grouping.rs): The algorithm is well-documented with ASCII-art examples. The reverse-topo walk for cone propagation is correct — allowed nodes act as boundaries, shared intermediates trigger merges.

  4. Wide-format fallback (encoding.rs:333-373): Good engineering to handle oversized groups (>255 nodes or >127 internal-ref) without silently dropping them. The 0xFE marker distinguishes the hash input format.


Minor Residual Risks

  1. Adversarial pattern cardinality: An attacker submitting unique graph structures could grow operation_pattern_id / transaction_pattern_id dimensions unboundedly. The 5m metrics_expiration mitigates but doesn't cap concurrent cardinality. A metrics_cardinality_limit processor in the OTEL collector pipeline would provide a hard cap.

  2. Prometheus scrape config is test-suite only: The otel-spanmetrics job in prometheus.yml correctly uses OpenMetricsText1.0.0 for exemplar ingestion. Production Helm charts would need similar configuration.

Overall: This is a thorough, well-tested PR that significantly improves the coprocessor's observability posture. No blocking issues. Ready for merge.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant