From 0563fbff1522a9dd0cc71826b6fef5ce14cfd032 Mon Sep 17 00:00:00 2001 From: Avi Cohen Date: Tue, 26 May 2026 16:00:37 +0300 Subject: [PATCH] starknet_transaction_prover: document observability surface in README Adds a top-level Observability section to the README covering the HTTP-only side endpoints (`/health`, `/metrics`), the metric names and label sets, request-id propagation rules, the per-request log shape, the startup banner, and the graceful-shutdown / panic-hook contract. Updates the Environment variables table with `LOG_FORMAT` and `HEALTH_MAX_SATURATED_MS`, the JSON config table with `health_max_saturated_ms`, and the Logging subsection with the JSON format option and URL redaction note. Co-Authored-By: Claude Opus 4.7 (1M context) --- crates/starknet_transaction_prover/README.md | 132 ++++++++++++++++++- 1 file changed, 129 insertions(+), 3 deletions(-) diff --git a/crates/starknet_transaction_prover/README.md b/crates/starknet_transaction_prover/README.md index aab5cf92e86..514d79fcd38 100644 --- a/crates/starknet_transaction_prover/README.md +++ b/crates/starknet_transaction_prover/README.md @@ -26,7 +26,8 @@ Expected response: ## API Reference The service exposes JSON-RPC 2.0 on the root path (`/`). The full machine-readable spec is at -`resources/proving_api_openrpc.json`. +`resources/proving_api_openrpc.json`. Two HTTP-only side endpoints are also served (`GET /health` +and `GET /metrics`) — see the [Observability](#observability) section. ### `starknet_specVersion` @@ -170,6 +171,8 @@ and environment variables override values from the config file. | `MAX_REQUEST_BODY_SIZE` | `--max-request-body-size` | `5242880` (5 MiB) | Maximum size of an incoming JSON-RPC request body in bytes. Requests exceeding this limit are rejected before parsing. | | `CONFIG_FILE` | `--config-file` | — | Path to a JSON config file. Fields use snake_case names matching `resources/example-config.json`. Values in the file are overridden by env vars and CLI flags. | | `RUST_LOG` | — | _(see Logging)_ | Controls log verbosity via `tracing-subscriber`. | +| `LOG_FORMAT` | `--log-format` | `text` | Log output format. Use `json` in production so aggregators (e.g. Datadog) parse fields directly. Accepts `text` or `json`. | +| `HEALTH_MAX_SATURATED_MS` | `--health-max-saturated-ms` | `10000` | How long the service must be continuously rejecting proving requests before `GET /health` flips to 503. See [Observability → /health](#health). | ### TLS / HTTPS @@ -212,6 +215,7 @@ built-in defaults. See `resources/example-config.json` for a template. | `cors_allow_origin` | `CORS_ALLOW_ORIGIN` | array of strings | | `tls_cert_file` | `TLS_CERT_FILE` | file path or null | | `tls_key_file` | `TLS_KEY_FILE` | file path or null | +| `health_max_saturated_ms` | `HEALTH_MAX_SATURATED_MS` | integer (ms) | ### Docker example with common options @@ -226,10 +230,11 @@ docker run --rm -p 3000:3000 \ ### Logging -The service uses the `RUST_LOG` environment variable (via `tracing-subscriber`). +The service uses the `RUST_LOG` environment variable (via `tracing-subscriber`) to control verbosity, +and `LOG_FORMAT` to switch between human-readable text and machine-readable JSON. ```bash -# Default — service logs at debug, noisy proving libraries at warn: +# Default — service logs at debug, noisy proving libraries at warn, text format: docker run ... # Verbose — all crates at debug: @@ -237,8 +242,16 @@ docker run -e RUST_LOG=debug ... # Quiet — warnings and errors only: docker run -e RUST_LOG=warn ... + +# Production — JSON output so log aggregators parse fields directly: +docker run -e LOG_FORMAT=json ... ``` +`LOG_FORMAT=json` emits one JSON object per line with `timestamp`, `level`, `target`, `fields`, +and `span` keys. URLs that may contain credentials in the userinfo component (`rpc_node_url`, +`blocking_check_url`) are redacted to `scheme://host[:port]` everywhere they appear in logs, +including the startup banner and CLI-override messages. + ## Compression **HTTP response compression** — The server automatically compresses responses when the client @@ -257,6 +270,119 @@ curl -H 'Accept-Encoding: zstd' -s -X POST http://localhost:3000 \ -d '{"jsonrpc":"2.0","id":1,"method":"starknet_proveTransaction","params":{...}}' | zstd -d ``` +## Observability + +The service exposes three observability surfaces alongside the JSON-RPC API: HTTP probe endpoints +(`/health`, `/metrics`), structured per-request logs with request-id propagation, and graceful +shutdown semantics. All are unauthenticated by design — they're intended for load balancers, +scrapers, and orchestrators, and contain no service state, no transaction data, and no URLs with +credentials. + +### `/health` + +`GET /health` returns the current health of the service. + +| Status | Body | Meaning | +|---|---|---| +| `200 OK` | `{"status":"ok"}` | Service is accepting requests. | +| `503 Service Unavailable` | `{"status":"unhealthy","reason":"saturated"}` | Service has been continuously rejecting proving requests for at least `HEALTH_MAX_SATURATED_MS` (default 10 seconds). Load balancer should drain this pod. | + +The 503 path is driven by the same in-process semaphore that returns JSON-RPC error `-32005` +(Service busy). A single rejection does not trigger 503 — the saturation window must hold for the +full `HEALTH_MAX_SATURATED_MS`. A successful request clears the window immediately, so 503 → 200 +transitions are fast once load drops. + +The response body is intentionally opaque (no timestamps, counters, or upstream URLs) because +`/health` is unauthenticated. The endpoint is served by the outermost tower middleware so probes +bypass CORS, compression, and JSON-RPC parsing. + +### `/metrics` + +`GET /metrics` returns the Prometheus text-format scrape with all metrics emitted by the service. +Like `/health`, scrapes bypass CORS and JSON-RPC parsing, and the endpoint is unauthenticated. + +| Metric | Type | Labels | Description | +|---|---|---|---| +| `prover_build_info` | gauge | `version`, `git_sha` | Always 1. Used to identify the running build from a scrape. | +| `prover_http_requests_total` | counter | `method`, `status` | HTTP request count. `method` is a bounded enum (`get`/`post`/`other`); `status` is the HTTP status class (`2xx`/`4xx`/`5xx`). Excludes `/health` and `/metrics` probes. | +| `prover_http_request_duration_seconds` | histogram | `method`, `status` | End-to-end HTTP request latency. Same label set as `prover_http_requests_total`. | +| `prover_http_inflight_requests` | gauge | — | Current count of HTTP requests being handled. Decremented via RAII so panics and cancellations don't leak. | +| `prover_prove_transaction_outcome_total` | counter | `outcome` | Proving request outcomes. `outcome` is a bounded enum: `success`, `failure_validation`, `failure_blocked`, `failure_runner`, `failure_output_parse`, `failure_proving`. | +| `prover_prove_transaction_duration_seconds` | histogram | `outcome` | End-to-end proving duration, including admission, virtual OS run, and STWO proving. | +| `prover_os_run_duration_seconds` | histogram | — | Virtual OS execution time, recorded for successful runs only. | +| `prover_stwo_prove_duration_seconds` | histogram | — | STWO proving time, recorded for successful runs only (requires `stwo_proving` feature). | +| `prover_concurrency_rejected_total` | counter | — | Requests rejected because the concurrency semaphore was full (i.e. JSON-RPC error `-32005`). | +| `prover_panics_total` | counter | — | Process panics caught by the global panic hook. Useful for alerting on panic rate without log search. | + +Label cardinality is bounded — no user-controlled values become labels. + +### Request IDs + +Every HTTP request is logged with a `request_id` field. The id is taken from the inbound +`x-request-id` header when it's a short, printable-ASCII token (max 128 bytes); otherwise the +service generates a UUID v4. The (possibly generated) id is echoed back on the response in the +same header so a caller that triggered a failure can quote a single id when reporting it. + +Hostile inputs (whitespace, non-printable bytes, oversized values) are dropped and replaced with +a freshly generated id rather than being trusted — this prevents header smuggling and log-field +explosion. + +**OHTTP traffic uses two distinct ids, by design.** For an Oblivious HTTP request the id above is +assigned to the *outer envelope*: it is echoed on the (relay-visible) `message/ohttp-res` response +and appears in the outermost access-log line. The *decapsulated inner request* is given a separate, +freshly generated id that is bound to its content-level logs and **never echoed back**. Any +client-supplied id inside the envelope is discarded. This is a security property, not a logging +inconsistency: keeping the relay-visible envelope id distinct from the gateway's content-log id +means a party holding both the relay's records (id → client) and the gateway's logs (id → request +contents) has no shared key to join them, preserving OHTTP unlinkability. Do not "fix" the two ids +to match — that would reintroduce the join key. + +### Per-request log line + +Each HTTP request produces a single structured log line at `info` level with +`event="http_request"`: + +```json +{ + "timestamp": "...", + "level": "INFO", + "fields": { + "event": "http_request", + "request_id": "a1b2c3d4...", + "method": "POST", + "path": "/", + "status": 200, + "latency_ms": 1247, + "message": "HTTP request handled." + } +} +``` + +Request bodies are never inspected — transaction calldata is private user data and never reaches +the log stream. + +### Startup banner + +At startup the service emits a single `info` log with version, git SHA, chain id, redacted RPC +host, and feature flags. Credentials embedded in URLs (e.g. `https://user:secret@rpc.example.com/`) +are stripped down to `scheme://host[:port]` before logging. No fee token address, TLS path, or +transaction-scoped data appears in the banner. + +### Shutdown and panics + +`SIGTERM` and `SIGINT` trigger a graceful shutdown — the service stops accepting new requests and +waits for in-flight proofs to finish. Both signals produce structured log events +(`shutdown_started`, `shutdown_complete`) so deployment events are visible in the log stream. + +A **second** termination signal during graceful shutdown forces `exit(1)` so an operator can +always reclaim a stuck process. This works around a known tokio behavior where dropping the +`Signal` handle silently swallows subsequent signals on the same channel. + +Process panics are captured by a global panic hook that emits a structured `event="panic"` log +with location, message, and a forced backtrace, then increments the `prover_panics_total` metric. +The hook does not call `process::abort()` — the existing runtime abort-on-panic behavior is +preserved. + ## Limitations - Invoke V3 only — Declare and DeployAccount transactions are not supported.