Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
132 changes: 129 additions & 3 deletions crates/starknet_transaction_prover/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,8 @@ Expected response:
## API Reference

The service exposes JSON-RPC 2.0 on the root path (`/`). The full machine-readable spec is at
`resources/proving_api_openrpc.json`.
`resources/proving_api_openrpc.json`. Two HTTP-only side endpoints are also served (`GET /health`
and `GET /metrics`) — see the [Observability](#observability) section.

### `starknet_specVersion`

Expand Down Expand Up @@ -170,6 +171,8 @@ and environment variables override values from the config file.
| `MAX_REQUEST_BODY_SIZE` | `--max-request-body-size` | `5242880` (5 MiB) | Maximum size of an incoming JSON-RPC request body in bytes. Requests exceeding this limit are rejected before parsing. |
| `CONFIG_FILE` | `--config-file` | — | Path to a JSON config file. Fields use snake_case names matching `resources/example-config.json`. Values in the file are overridden by env vars and CLI flags. |
| `RUST_LOG` | — | _(see Logging)_ | Controls log verbosity via `tracing-subscriber`. |
| `LOG_FORMAT` | `--log-format` | `text` | Log output format. Use `json` in production so aggregators (e.g. Datadog) parse fields directly. Accepts `text` or `json`. |
| `HEALTH_MAX_SATURATED_MS` | `--health-max-saturated-ms` | `10000` | How long the service must be continuously rejecting proving requests before `GET /health` flips to 503. See [Observability → /health](#health). |

### TLS / HTTPS

Expand Down Expand Up @@ -212,6 +215,7 @@ built-in defaults. See `resources/example-config.json` for a template.
| `cors_allow_origin` | `CORS_ALLOW_ORIGIN` | array of strings |
| `tls_cert_file` | `TLS_CERT_FILE` | file path or null |
| `tls_key_file` | `TLS_KEY_FILE` | file path or null |
| `health_max_saturated_ms` | `HEALTH_MAX_SATURATED_MS` | integer (ms) |

### Docker example with common options

Expand All @@ -226,19 +230,28 @@ docker run --rm -p 3000:3000 \

### Logging

The service uses the `RUST_LOG` environment variable (via `tracing-subscriber`).
The service uses the `RUST_LOG` environment variable (via `tracing-subscriber`) to control verbosity,
and `LOG_FORMAT` to switch between human-readable text and machine-readable JSON.

```bash
# Default — service logs at debug, noisy proving libraries at warn:
# Default — service logs at debug, noisy proving libraries at warn, text format:
docker run ... <IMAGE>

# Verbose — all crates at debug:
docker run -e RUST_LOG=debug ... <IMAGE>

# Quiet — warnings and errors only:
docker run -e RUST_LOG=warn ... <IMAGE>

# Production — JSON output so log aggregators parse fields directly:
docker run -e LOG_FORMAT=json ... <IMAGE>
```

`LOG_FORMAT=json` emits one JSON object per line with `timestamp`, `level`, `target`, `fields`,
and `span` keys. URLs that may contain credentials in the userinfo component (`rpc_node_url`,
`blocking_check_url`) are redacted to `scheme://host[:port]` everywhere they appear in logs,
including the startup banner and CLI-override messages.

## Compression

**HTTP response compression** — The server automatically compresses responses when the client
Expand All @@ -257,6 +270,119 @@ curl -H 'Accept-Encoding: zstd' -s -X POST http://localhost:3000 \
-d '{"jsonrpc":"2.0","id":1,"method":"starknet_proveTransaction","params":{...}}' | zstd -d
```

## Observability

The service exposes three observability surfaces alongside the JSON-RPC API: HTTP probe endpoints
(`/health`, `/metrics`), structured per-request logs with request-id propagation, and graceful
shutdown semantics. All are unauthenticated by design — they're intended for load balancers,
scrapers, and orchestrators, and contain no service state, no transaction data, and no URLs with
credentials.

### `/health`

`GET /health` returns the current health of the service.

| Status | Body | Meaning |
|---|---|---|
| `200 OK` | `{"status":"ok"}` | Service is accepting requests. |
| `503 Service Unavailable` | `{"status":"unhealthy","reason":"saturated"}` | Service has been continuously rejecting proving requests for at least `HEALTH_MAX_SATURATED_MS` (default 10 seconds). Load balancer should drain this pod. |

The 503 path is driven by the same in-process semaphore that returns JSON-RPC error `-32005`
(Service busy). A single rejection does not trigger 503 — the saturation window must hold for the
full `HEALTH_MAX_SATURATED_MS`. A successful request clears the window immediately, so 503 → 200
transitions are fast once load drops.

The response body is intentionally opaque (no timestamps, counters, or upstream URLs) because
`/health` is unauthenticated. The endpoint is served by the outermost tower middleware so probes
bypass CORS, compression, and JSON-RPC parsing.

### `/metrics`

`GET /metrics` returns the Prometheus text-format scrape with all metrics emitted by the service.
Like `/health`, scrapes bypass CORS and JSON-RPC parsing, and the endpoint is unauthenticated.

| Metric | Type | Labels | Description |
|---|---|---|---|
| `prover_build_info` | gauge | `version`, `git_sha` | Always 1. Used to identify the running build from a scrape. |
| `prover_http_requests_total` | counter | `method`, `status` | HTTP request count. `method` is a bounded enum (`get`/`post`/`other`); `status` is the HTTP status class (`2xx`/`4xx`/`5xx`). Excludes `/health` and `/metrics` probes. |
| `prover_http_request_duration_seconds` | histogram | `method`, `status` | End-to-end HTTP request latency. Same label set as `prover_http_requests_total`. |
| `prover_http_inflight_requests` | gauge | — | Current count of HTTP requests being handled. Decremented via RAII so panics and cancellations don't leak. |
| `prover_prove_transaction_outcome_total` | counter | `outcome` | Proving request outcomes. `outcome` is a bounded enum: `success`, `failure_validation`, `failure_blocked`, `failure_runner`, `failure_output_parse`, `failure_proving`. |
| `prover_prove_transaction_duration_seconds` | histogram | `outcome` | End-to-end proving duration, including admission, virtual OS run, and STWO proving. |
| `prover_os_run_duration_seconds` | histogram | — | Virtual OS execution time, recorded for successful runs only. |
| `prover_stwo_prove_duration_seconds` | histogram | — | STWO proving time, recorded for successful runs only (requires `stwo_proving` feature). |
| `prover_concurrency_rejected_total` | counter | — | Requests rejected because the concurrency semaphore was full (i.e. JSON-RPC error `-32005`). |
| `prover_panics_total` | counter | — | Process panics caught by the global panic hook. Useful for alerting on panic rate without log search. |

Label cardinality is bounded — no user-controlled values become labels.

### Request IDs

Every HTTP request is logged with a `request_id` field. The id is taken from the inbound
`x-request-id` header when it's a short, printable-ASCII token (max 128 bytes); otherwise the
service generates a UUID v4. The (possibly generated) id is echoed back on the response in the
same header so a caller that triggered a failure can quote a single id when reporting it.

Hostile inputs (whitespace, non-printable bytes, oversized values) are dropped and replaced with
a freshly generated id rather than being trusted — this prevents header smuggling and log-field
explosion.

**OHTTP traffic uses two distinct ids, by design.** For an Oblivious HTTP request the id above is
assigned to the *outer envelope*: it is echoed on the (relay-visible) `message/ohttp-res` response
and appears in the outermost access-log line. The *decapsulated inner request* is given a separate,
freshly generated id that is bound to its content-level logs and **never echoed back**. Any
client-supplied id inside the envelope is discarded. This is a security property, not a logging
inconsistency: keeping the relay-visible envelope id distinct from the gateway's content-log id
means a party holding both the relay's records (id → client) and the gateway's logs (id → request
contents) has no shared key to join them, preserving OHTTP unlinkability. Do not "fix" the two ids
to match — that would reintroduce the join key.

### Per-request log line

Each HTTP request produces a single structured log line at `info` level with
`event="http_request"`:

```json
{
"timestamp": "...",
"level": "INFO",
"fields": {
"event": "http_request",
"request_id": "a1b2c3d4...",
"method": "POST",
"path": "/",
"status": 200,
"latency_ms": 1247,
"message": "HTTP request handled."
}
}
```

Request bodies are never inspected — transaction calldata is private user data and never reaches
the log stream.

### Startup banner

At startup the service emits a single `info` log with version, git SHA, chain id, redacted RPC
host, and feature flags. Credentials embedded in URLs (e.g. `https://user:secret@rpc.example.com/`)
are stripped down to `scheme://host[:port]` before logging. No fee token address, TLS path, or
transaction-scoped data appears in the banner.

### Shutdown and panics

`SIGTERM` and `SIGINT` trigger a graceful shutdown — the service stops accepting new requests and
waits for in-flight proofs to finish. Both signals produce structured log events
(`shutdown_started`, `shutdown_complete`) so deployment events are visible in the log stream.

A **second** termination signal during graceful shutdown forces `exit(1)` so an operator can
always reclaim a stuck process. This works around a known tokio behavior where dropping the
`Signal` handle silently swallows subsequent signals on the same channel.

Process panics are captured by a global panic hook that emits a structured `event="panic"` log
with location, message, and a forced backtrace, then increments the `prover_panics_total` metric.
The hook does not call `process::abort()` — the existing runtime abort-on-panic behavior is
preserved.

## Limitations

- Invoke V3 only — Declare and DeployAccount transactions are not supported.
Expand Down
Loading