This document lists and describes the metrics supported by FHEVM services. It is intended to help operators monitor these services, configure alarms based on the metrics, and act on those alarms when issues arise.
We also recommend alarm thresholds for each metric, where applicable. The suggested thresholds are conservative and can be adjusted to the operator's environment and requirements.
Note that the recommendations assume a smoke test that runs transactions/requests at a rate of approximately 1 per 30 seconds. These include proof verifications, FHE computations, ACL updates, and decryptions.
- Type: Counter
- Description: Counts the number of successful verify or reject proof transactions in the transaction-sender.
- Alarm: If the counter is a flat line over a period of time.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
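To make the flat-line condition concrete, here is a minimal Python sketch of what `increase(counter[1m]) == 0` checks. It is a simplification: the real PromQL `increase()` also extrapolates to the window boundaries, which this sketch omits. The function names and sample values are illustrative assumptions, not part of any FHEVM service.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole new value counts.
        total += curr - prev if curr >= prev else curr
    return total


def is_flat(samples: Sequence[float]) -> bool:
    """True when the counter did not move within the window."""
    return counter_increase(samples) == 0


# Hypothetical samples scraped every 15 s over one minute.
print(is_flat([42, 42, 42, 42, 42]))  # True: nothing succeeded, the alarm would fire
print(is_flat([42, 42, 43, 43, 44]))  # False: smoke test traffic is flowing
```

With the smoke test producing roughly one transaction every 30 seconds, a healthy service should show about two increments per minute, so a full minute without any is a reasonable signal that something is stuck.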
- Type: Counter
- Description: Counts the number of failed verify or reject proof transactions in the transaction-sender.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
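When adapting a failure threshold to a different window, it can help to think in per-second rates: in PromQL, `increase(counter[1m])` is `rate(counter[1m])` multiplied by the window length in seconds, so 60 failures per minute corresponds to one failure per second. The sketch below shows that conversion; the function names are hypothetical helpers, not part of any FHEVM tooling.

```python
def threshold_as_rate(max_increase: float, window_seconds: float) -> float:
    """Per-second rate equivalent to an increase threshold over a window."""
    return max_increase / window_seconds


def rescale_threshold(max_increase: float, window_seconds: float, new_window_seconds: float) -> float:
    """Increase threshold for a new window that keeps the per-second rate unchanged."""
    return threshold_as_rate(max_increase, window_seconds) * new_window_seconds


print(threshold_as_rate(60, 60))       # 1.0 failure per second
print(rescale_threshold(60, 60, 300))  # 300.0 failures over 5 minutes
```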
- Type: Counter
- Description: Counts the number of successful add ciphertext material transactions in the transaction-sender.
- Alarm: If the counter is a flat line over a period of time.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- Type: Counter
- Description: Counts the number of failed add ciphertext material transactions in the transaction-sender.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
- Type: Gauge
- Description: Tracks the number of unsent allow handle transactions in the transaction-sender.
- Alarm: If the gauge value exceeds a predefined threshold.
  - Recommendation: more than 100 unsent over 2 minutes, i.e. `min_over_time(gauge[2m]) > 100`.
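The choice of `min_over_time` for gauge alarms means the alarm fires only when the backlog stays above the threshold for the whole window, so a short spike that drains quickly is ignored. A minimal sketch of that behaviour, with hypothetical function names and sample values:

```python
from typing import Sequence


def min_over_time(samples: Sequence[float]) -> float:
    """Smallest gauge value observed in the window."""
    return min(samples)


def backlog_alarm(samples: Sequence[float], threshold: float = 100) -> bool:
    """Fire only if every sample in the window is above the threshold."""
    return min_over_time(samples) > threshold


# Hypothetical gauge samples scraped over two minutes.
print(backlog_alarm([120, 250, 90, 130]))   # False: the backlog dipped to 90 within the window
print(backlog_alarm([120, 250, 180, 130]))  # True: the backlog stayed above 100 throughout
```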
- Type: Gauge
- Description: Tracks the number of unsent add ciphertext material transactions in the transaction-sender.
- Alarm: If the gauge value exceeds a predefined threshold.
  - Recommendation: more than 100 unsent over 2 minutes, i.e. `min_over_time(gauge[2m]) > 100`.
- Type: Gauge
- Description: Tracks the number of unsent verify proof response transactions in the transaction-sender.
- Alarm: If the gauge value exceeds a predefined threshold.
  - Recommendation: more than 100 unsent over 2 minutes, i.e. `min_over_time(gauge[2m]) > 100`.
- Type: Gauge
- Description: Tracks the number of pending verify proofs (pending on the zkproof-worker).
- Alarm: If the gauge value exceeds a predefined threshold.
  - Recommendation: more than 100 pending over 2 minutes, i.e. `min_over_time(gauge[2m]) > 100`.
- Type: Counter
- Description: Counts the number of successful verify proof request events in GW listener.
- Alarm: If the counter is a flat line over a period of time.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- Type: Counter
- Description: Counts the number of failed verify proof request events in GW listener.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
- Type: Counter
- Description: Counts the number of failed get block number requests in GW listener.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
- Type: Counter
- Description: Counts the number of successful get logs requests in GW listener.
- Alarm: If the counter is a flat line over a period of time.
- Type: Counter
- Description: Counts the number of failed get logs requests in GW listener.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
- Type: Counter
- Description: Counts the number of successful activate CRS requests in GW listener.
- Alarm: N/A - no alarm needed as activate CRS is an infrequent event.
- Type: Counter
- Description: Counts the number of failed activate CRS requests in GW listener.
- Alarm: If the counter increases from 0. Activate CRS is an important event that should not fail.
  - Recommendation: alarm on any failures over a 1 minute period, i.e. `increase(counter[1m]) > 0`.
- Type: Counter
- Description: Counts the number of CRS digest mismatches in GW listener.
- Alarm: If the counter increases from 0. CRS digest mismatch is not something that is supposed to happen in normal circumstances.
  - Recommendation: alarm on any failures over a 1 minute period, i.e. `increase(counter[1m]) > 0`.
- Type: Counter
- Description: Counts the number of successful activate key requests in GW listener.
- Alarm: N/A - no alarm needed as activate key is an infrequent event.
- Type: Counter
- Description: Counts the number of failed activate key requests in GW listener.
- Alarm: If the counter increases from 0. Activate key is an important event that should not fail.
  - Recommendation: alarm on any failures over a 1 minute period, i.e. `increase(counter[1m]) > 0`.
- Type: Counter
- Description: Counts the number of key digest mismatches in GW listener.
- Alarm: If the counter increases from 0. Key digest mismatch is not something that is supposed to happen in normal circumstances.
  - Recommendation: alarm on any failures over a 1 minute period, i.e. `increase(counter[1m]) > 0`.
- Type: Counter
- Description: Number of handles where coprocessor digests diverged. Does not discriminate whether divergence comes from the local coprocessor or another coprocessor in the network.
- Type: Counter
- Description: Number of handles that timed out without a consensus event. This includes both handles where no consensus was ever observed and handles where all expected coprocessors submitted but the gateway never emitted a consensus event.
- Type: Counter
- Description: Number of handles where consensus was reached but some expected coprocessors never submitted their ciphertext material before the post-consensus grace period expired.
- Type: Histogram
- Description: Block distance between the first observed submission and the consensus event for a handle. Diagnostic metric for understanding on-chain latency; timeouts are wall-clock based and configured via `--drift-no-consensus-timeout`. Bucket boundaries: 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144.
- Type: Histogram
- Description: Block distance between the consensus event and seeing all expected submissions for a handle. Diagnostic metric for understanding on-chain completion latency; the grace window is wall-clock based and configured via `--drift-post-consensus-grace`. Bucket boundaries: 0, 1, 2, 3, 5, 8, 13, 21, 34.
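To read the bucket boundaries above, recall that Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit `+Inf` bucket counts everything. The sketch below illustrates this with the pre-consensus boundaries; the function name and the observed block distances are made up for illustration.

```python
from bisect import bisect_left
from typing import Dict, Sequence

# Bucket upper bounds taken from the first histogram description above.
BOUNDS = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]


def cumulative_buckets(observations: Sequence[float], bounds: Sequence[float] = BOUNDS) -> Dict[str, int]:
    """Count observations per cumulative "less than or equal" bucket."""
    counts = [0] * (len(bounds) + 1)  # the last slot is the +Inf overflow bucket
    for value in observations:
        # bisect_left finds the first bound >= value, i.e. the smallest bucket containing it.
        counts[bisect_left(bounds, value)] += 1
    result: Dict[str, int] = {}
    running = 0
    for label, count in zip([str(b) for b in bounds] + ["+Inf"], counts):
        running += count
        result[label] = running
    return result


# Hypothetical block distances for four handles.
buckets = cumulative_buckets([1, 4, 5, 200])
print(buckets["1"], buckets["5"], buckets["144"], buckets["+Inf"])  # 1 3 3 4
```

A growing gap between the largest finite bucket and `+Inf` would indicate handles whose block distance exceeds the largest boundary.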
Metrics for zkproof-worker will be added in future releases if and when needed. Currently, the transaction-sender handles ZK-proof-related metrics; please see its section.
- Type: Counter
- Description: Counts tasks executed by sns-worker successfully.
- Alarm: If the counter is a flat line over a period of time.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- Type: Counter
- Description: Counts task errors in sns-worker.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 240 failures in 1 minute, i.e. `increase(counter[1m]) > 240`.
- Type: Counter
- Description: Counts AWS uploads by sns-worker.
- Alarm: If the counter is a flat line over a period of time.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- Type: Counter
- Description: Counts AWS upload errors in sns-worker.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 240 failures in 1 minute, i.e. `increase(counter[1m]) > 240`.
- Type: Gauge
- Description: Tracks the number of incomplete tasks in sns-worker.
- Alarm: If the gauge value exceeds a predefined threshold.
  - Recommendation: more than 100 incomplete over 2 minutes, i.e. `min_over_time(gauge[2m]) > 100`.
- Type: Gauge
- Description: Tracks the number of incomplete AWS uploads in sns-worker.
- Alarm: If the gauge value exceeds a predefined threshold.
  - Recommendation: more than 100 incomplete over 2 minutes, i.e. `min_over_time(gauge[2m]) > 100`.
- Type: Counter
- Description: Counts TFHE worker errors.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 240 failures in 1 minute, i.e. `increase(counter[1m]) > 240`.
- Type: Counter
- Description: Counts work items polled from the database.
- Alarm: N/A - if work usually arrives via notifications, polling is expected to be low.
- Type: Counter
- Description: Counts the number of instant notifications for work items received from the DB.
- Alarm: If the counter is a flat line over a period of time.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- Type: Counter
- Description: Counts the work items queried from the DB.
- Alarm: If the counter is a flat line over a period of time.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- Type: Counter
- Description: Counts the work items successfully processed and stored in the DB.
- Alarm: If the counter is a flat line over a period of time.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- Type: Counter
- Labels:
  - `event_type`: can be used to filter by event type (`public_decryption_request`, `user_decryption_request`, `crsgen_request`, ...).
- Description: Counts the number of events received by the GW listener.
- Alarm: If the counter is a flat line over a period of time, only for `event_type` `public_decryption_request` and `user_decryption_request`.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter{event_type="..."}[1m]) == 0`.
- Type: Counter
- Labels:
  - `contract`: can be used to filter by contract (`decryption`, `kmsgeneration`).
- Description: Counts the number of errors encountered by the GW listener while listening for events.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.
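The two label-aware alarms above treat labels differently: the flat-line alarm is evaluated per `event_type` and only for the decryption request types, while the error alarm sums the increase over all label values. A minimal sketch of both checks follows; the function names and the per-label increases are hypothetical.

```python
from typing import Dict, List, Sequence


def flat_event_types(increase_by_type: Dict[str, float], watched: Sequence[str]) -> List[str]:
    """Watched event types whose counter did not increase in the window."""
    return [t for t in watched if increase_by_type.get(t, 0) == 0]


def errors_exceed(increase_by_label: Dict[str, float], threshold: float = 60) -> bool:
    """Sum the increase over all label values, like sum(increase(counter[1m]))."""
    return sum(increase_by_label.values()) > threshold


WATCHED = ["public_decryption_request", "user_decryption_request"]

# Hypothetical one-minute increases per label value.
events = {"public_decryption_request": 2, "user_decryption_request": 0, "crsgen_request": 0}
errors = {"decryption": 45, "kmsgeneration": 20}

print(flat_event_types(events, WATCHED))  # ['user_decryption_request']
print(errors_exceed(errors))              # True: 65 errors in total, above 60
```

One difference worth knowing: a PromQL comparison yields no result for a series that has never been exported, so treating a missing label value as zero, as this sketch does, is an assumption rather than what Prometheus does.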
- Type: Counter
- Labels:
  - `event_type`: see description
- Description: Counts the number of events received by the KMS worker.
- Alarm: If the counter is a flat line over a period of time, only for `event_type` `public_decryption_request` and `user_decryption_request`.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter{event_type="..."}[1m]) == 0`.
- Type: Counter
- Labels:
  - `event_type`: see description
- Description: Counts the number of errors encountered while listening for events in the KMS worker.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.
- Type: Counter
- Labels:
  - `event_type`: see description
- Description: Counts the number of successful gRPC requests sent by the KMS worker to the KMS Core.
- Alarm: If the counter is a flat line over a period of time, only for `event_type` `public_decryption_request` and `user_decryption_request`.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter{event_type="..."}[1m]) == 0`.
- Type: Counter
- Labels:
  - `event_type`: see description
- Description: Counts the number of errors encountered by the KMS worker while sending gRPC requests to the KMS Core.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.
- Type: Counter
- Labels:
  - `event_type`: see description
- Description: Counts the number of responses successfully polled from the KMS Core via gRPC.
- Alarm: If the counter is a flat line over a period of time, only for `event_type` `public_decryption_request` and `user_decryption_request`.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter{event_type="..."}[1m]) == 0`.
- Type: Counter
- Labels:
  - `event_type`: see description
- Description: Counts the number of errors encountered by the KMS worker while polling responses from the KMS Core.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.
- Type: Counter
- Description: Counts the number of ciphertexts retrieved by the KMS worker from S3.
- Alarm: If the counter is a flat line over a period of time.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- Type: Counter
- Description: Counts the number of errors encountered by the KMS worker while retrieving ciphertexts from S3.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.
- Type: Histogram
- Labels:
  - `event_type`: see description
- Description: Measures the latency of decryptions at the KMS worker level, from event creation to processing. Only applies to `public_decryption_request` and `user_decryption_request` event types. Bucket boundaries (in seconds): 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0.
- Alarm: None for now. Need more experience with this metric first.
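While no alarm is defined for this histogram yet, operators gathering experience with it may want to look at latency quantiles. PromQL's `histogram_quantile()` estimates a quantile by linear interpolation inside the bucket that contains the requested rank. The sketch below is a simplified version of that idea using the bucket boundaries above; the function name and the bucket counts are invented for illustration.

```python
from typing import Sequence

# Bucket upper bounds in seconds, taken from the description above.
BOUNDS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0]


def estimate_quantile(q: float, bounds: Sequence[float], cumulative: Sequence[int], total: int) -> float:
    """Estimate the q-quantile from cumulative bucket counts.

    cumulative[i] is the number of observations <= bounds[i];
    total also includes observations above the largest bound.
    """
    if not 0 < q <= 1 or total <= 0:
        raise ValueError("q must be in (0, 1] and total must be positive")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for bound, count in zip(bounds, cumulative):
        if count >= rank:
            # Interpolate linearly between the bucket's lower and upper bound.
            return lower_bound + (bound - lower_bound) * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = bound, count
    # The rank falls in the overflow bucket: report the largest finite bound.
    return bounds[-1]


# Hypothetical cumulative counts for 100 decryptions.
counts = [0, 0, 10, 40, 80, 95, 99, 100, 100, 100]
print(estimate_quantile(0.5, BOUNDS, counts, 100))  # 0.3125
```

Because the estimate is interpolated, its precision is limited by the bucket width around the quantile of interest.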
- Type: Counter
- Labels:
  - `response_type`: can be used to filter by response type (`public_decryption_response`, `user_decryption_response`, `crsgen_response`, ...).
- Description: Counts the number of responses received by the TX sender.
- Alarm: If the counter is a flat line over a period of time, only for `response_type` `public_decryption_response` and `user_decryption_response`.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter{response_type="..."}[1m]) == 0`.
- Type: Counter
- Labels:
  - `response_type`: see description
- Description: Counts the number of errors encountered by the TX sender while listening for responses.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.
- Type: Counter
- Labels:
  - `response_type`: see description
- Description: Counts the number of transactions sent to the Gateway by the TX sender.
- Alarm: If the counter is a flat line over a period of time, only for `response_type` `public_decryption_response` and `user_decryption_response`.
  - Recommendation: 0 for more than 1 minute, i.e. `increase(counter{response_type="..."}[1m]) == 0`.
- Type: Counter
- Labels:
  - `response_type`: see description
- Description: Counts the number of errors encountered by the TX sender while sending transactions to the Gateway.
- Alarm: If the counter increases over a period of time.
  - Recommendation: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.
- Type: Gauge
- Labels:
  - `event_type`: see description (only available for decryption right now!)
- Description: Tracks the number of Gateway events not yet processed in the kms-connector's DB.
- Alarm: Need more experience with this metric first.
- Type: Gauge
- Labels:
  - `response_type`: see description (only available for decryption right now!)
- Description: Tracks the number of KMS responses not yet sent to the Gateway in the kms-connector's DB.
- Alarm: Need more experience with this metric first.
- Type: Histogram
- Labels:
  - `response_type`: see description
- Description: Measures the latency from response creation in DB to successful blockchain transaction confirmation. Bucket boundaries (in seconds): 0.1, 0.5, 1.0, 2.5, 5.0, 10.0, 15.0, 30.0.
- Alarm: Need more experience with this metric first.