Merged
49 changes: 30 additions & 19 deletions infra/status-page/README.md
@@ -14,9 +14,9 @@ following the `infra/iris-iap-proxy/` pattern.

- **Server** — Node 20 + TypeScript + [Hono](https://hono.dev). Exposes
`/api/ferry`, `/api/builds`, `/api/iris`, `/api/workers`,
`/api/control-plane/health`, `/api/workers/history`, `/api/jobs`,
`/api/probes`, `/api/health`, and serves the built web UI from
`web/dist`.
`/api/control-plane/health`, `/api/workers/history`,
`/api/provisioning/history`, `/api/jobs`, `/api/probes`, `/api/health`,
and serves the built web UI from `web/dist`.
- **Web** — Vite + React 18 + TypeScript + Jotai + `@tanstack/react-query`
+ Tailwind.
- Single `package.json`, multi-stage Dockerfile, single service account,
@@ -28,14 +28,15 @@ following the `infra/iris-iap-proxy/` pattern.
server/
  main.ts                Hono app: routes, sampler, static serving
  cache.ts               TTL cache with in-flight coalesce
  history.ts             ring buffer for worker-count history
  history.ts             ring buffers for the in-process iris-ping + control-plane series
  sources/
    github.ts            shared REPO + auth header helper
    githubActions.ts     Ferry workflow runs (REST API)
    githubCommits.ts     Build panel: per-commit CI rollup on main (GraphQL)
    iris.ts              iris controller /health caller
    serviceHealth.ts     active env Iris + finelog /health probes (+ finelog URL)
    workers.ts           iris worker counts via the ListWorkers RPC
    clusterHistory.ts    24h worker + provisioning history from finelog canary rows
    jobs.ts              iris 24h job-state breakdown via ExecuteRawQuery
    probes.ts            synthetic-canary checks + provisioning from finelog
    finelogQuery.ts      finelog StatsService SQL query → Arrow IPC decode
@@ -55,14 +56,16 @@ web/
useControlPlaneHealth.ts
useWorkers.ts
useWorkersHistory.ts
useProvisioningHistory.ts
useJobs.ts
useProbes.ts
components/
FerryPanel.tsx
BuildPanel.tsx GitHub CI, last 100 runs on main
IrisPanel.tsx wraps reachability + WorkersPanel + ControlPlanePanel + JobsPanel
ControlPlanePanel.tsx active env Iris + finelog latency chart
WorkersPanel.tsx
WorkersPanel.tsx live worker counts + side-by-side availability & provisioning history
ProvisioningHistoryChart.tsx per-region + fleet-average provisioning success ratio
JobsPanel.tsx
ProbesPanel.tsx synthetic-canary health checks + provisioning rollup
style.css Tailwind entry
@@ -200,7 +203,9 @@ down by in-flight builds.
The Probes panel renders the synthetic-canary telemetry the
`infra/probes/` daemon writes to the finelog `infra.canary.metrics`
namespace (one flat `{metric, value, labels, collected_at}` row per
sample). Two bounded DuckDB queries run against the **active
sample). Two bounded SQL queries (Apache DataFusion, finelog's read
engine — note: no JSON functions, so labels are decoded app-side) run
against the **active
environment's** finelog log-server through its `StatsService.Query`
Connect RPC — the same JSON-over-HTTP shape the controller's
`ExecuteRawQuery` uses, except the result is an Arrow IPC stream, which
@@ -257,7 +262,8 @@ plus the dev controller discovery settings.
| Iris | 15s | 15s | current only |
| Control plane | in-memory | 30s | 24h ring buffer |
| Workers | 15s | 30s | current only |
| Workers history | in-memory | 30s | 24h ring buffer |
| Workers history | 60s | 60s | 24h from finelog |
| Provisioning history | 60s | 60s | 24h from finelog |
| Jobs | 60s | 60s | 24h window |
| Probes | 60s | 60s | latest cycle |

@@ -266,10 +272,17 @@ frontend polling can be tuned without affecting upstream. Concurrent
backend requests for the same key coalesce into one upstream call via
`server/cache.ts`.
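
The coalescing behaviour described above can be sketched as follows. This is a minimal illustration, not the contents of `server/cache.ts`: the class name `CoalescingTTLCache`, its fields, and the injectable `now` clock are assumptions made for the example. Only the `get(key, loader)` call shape mirrors how `main.ts` uses its cache.

```typescript
// Hypothetical sketch of a TTL cache with in-flight coalescing.
// Names and structure are illustrative; the real server/cache.ts may differ.
type Entry<T> = { value: T; expiresAt: number };

class CoalescingTTLCache<T> {
  private readonly ttlMs: number;
  private readonly now: () => number;
  private readonly entries = new Map<string, Entry<T>>();
  private readonly inFlight = new Map<string, Promise<T>>();

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string, load: () => Promise<T>): Promise<T> {
    // Fresh entry: serve it without touching upstream.
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return Promise.resolve(hit.value);
    }

    // A load for this key is already running: share its promise.
    const pending = this.inFlight.get(key);
    if (pending !== undefined) {
      return pending;
    }

    // Otherwise start exactly one upstream call and remember it until it settles.
    const promise = load()
      .then((value) => {
        this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
        return value;
      })
      .finally(() => {
        this.inFlight.delete(key);
      });
    this.inFlight.set(key, promise);
    return promise;
  }
}
```

In this sketch a failed load is not cached: the rejection reaches every caller that shared the in-flight promise, and the next request retries upstream.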

The workers history is a 2880-slot ring buffer (`server/history.ts`)
filled by a background sampler on a 30s cadence — 24h worth of points.
The sampler runs on a fixed interval, not off request traffic, so
history keeps ticking even when nobody is looking at the dashboard.
The Workers panel renders two finelog-backed history charts side by
side: per-region healthy worker counts (the `worker_healthy` gauge the
canary writes every 60s) and the provisioning create-success ratio
(a fleet average plus per-region lines, derived from the per-pool
`provision_ready` / `provision_outcomes` gauges; zones roll up to
regions). Both query the trailing 24h via `server/sources/clusterHistory.ts`
and survive Cloud Run restarts since the history lives in finelog, not in
process. The remaining in-process ring buffers (`server/history.ts`) back
only the iris-ping and control-plane latency series, filled by a
background sampler on a fixed cadence so they keep ticking even when
nobody is looking at the dashboard.
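
For the in-process series, the buffer sizing and overwrite behaviour can be sketched as below. This is an illustrative stand-in rather than the `RingBuffer` in `server/history.ts`: the class name `SketchRingBuffer` and its internals are assumptions, while the `push` / `samples` method names and the 30s interval follow the calls and constants visible in `main.ts`.

```typescript
// Hypothetical sketch of a fixed-capacity ring buffer sized for a 24h window.
// The real RingBuffer in server/history.ts may be implemented differently.
class SketchRingBuffer<T> {
  private readonly capacity: number;
  private readonly slots: (T | undefined)[];
  private next = 0; // index the next push will write to
  private size = 0; // number of valid samples currently held

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
    this.slots = new Array<T | undefined>(capacity);
  }

  // Once the buffer is full, each push overwrites the oldest slot.
  push(sample: T): void {
    this.slots[this.next] = sample;
    this.next = (this.next + 1) % this.capacity;
    this.size = Math.min(this.size + 1, this.capacity);
  }

  // Returns the held samples ordered oldest to newest.
  samples(): T[] {
    const start = (this.next - this.size + this.capacity) % this.capacity;
    const out: T[] = [];
    for (let i = 0; i < this.size; i++) {
      out.push(this.slots[(start + i) % this.capacity] as T);
    }
    return out;
  }
}

// 24h of samples at a 30s cadence: 86_400_000 / 30_000 = 2880 slots.
const SKETCH_INTERVAL_MS = 30_000;
const SKETCH_WINDOW_MS = 24 * 60 * 60 * 1000;
const sketchCapacity = Math.ceil(SKETCH_WINDOW_MS / SKETCH_INTERVAL_MS);
```

Because the capacity is derived from the window and the interval, changing the sampler cadence resizes the buffer without touching the class, and memory stays bounded for any uptime.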

## Controller data

@@ -292,14 +305,12 @@ break** — we'll need to plumb a service-account bearer token.

## Known limitations

- **Workers history is in-process.** The ring buffer is lost on Cloud
  Run restart (deploys, migrations), so the chart shows a 24h warm-up
  window after each restart. Follow-ups to consider:
  1. Persist samples to a small GCS object (rewrite on each sample).
  2. Bump retention on the controller's `worker_resource_history` table
     — currently ~45min — and aggregate from there.
  3. Add a proper `worker_count_history` table in the controller schema
     so history lives authoritatively next to the workers table.
- **History depends on the canary.** Worker and provisioning history are
  read from the `infra.canary.metrics` finelog namespace the `infra/probes`
  daemon writes, so both charts are durable across Cloud Run restarts — but
  they only have data for an environment whose canary is running. Point the
  dashboard at an environment with no canary and both charts show their
  empty state rather than data.
- **Iris panel reachability row** is still `/health`-only. Worker counts
  and job-state breakdowns are surfaced in the Workers and Jobs
  subsections via `ExecuteRawQuery` SQL. Tasks, autoscaler, and detailed
17 changes: 7 additions & 10 deletions infra/status-page/server/history.ts
@@ -1,15 +1,14 @@
// Fixed-capacity circular buffer for worker-count history.
// Fixed-capacity circular buffers for the in-process sample histories.
//
// Capacity is sized so the buffer holds 24h of samples at a 30s cadence
// (2880 slots). Each append overwrites the oldest slot once full, so
// memory stays bounded regardless of how long the server runs. History
// is in-process and lost on restart — see infra/status-page/README.md
// "Known limitations" for the follow-up plan (persist to GCS or grow a
// worker_count_history table in the controller).
// Capacity is sized so a buffer holds 24h of samples at its sampler's cadence.
// Each append overwrites the oldest slot once full, so memory stays bounded
// regardless of how long the server runs. These histories are in-process and
// lost on restart; worker-count history moved out to finelog (see
// server/sources/clusterHistory.ts), but the iris-ping and control-plane
// latency series are still sampled in-process here.

import type { IrisPingSample } from "./sources/iris.js";
import type { ServiceHealthHistorySample } from "./sources/serviceHealth.js";
import type { WorkerSample } from "./sources/workers.js";

export class RingBuffer<T> {
  private readonly capacity: number;
@@ -44,8 +43,6 @@ export class RingBuffer<T> {
  }
}

export class WorkerHistory extends RingBuffer<WorkerSample> {}

export class IrisPingHistory extends RingBuffer<IrisPingSample> {}

export class ServiceHealthHistory extends RingBuffer<ServiceHealthHistorySample> {}
61 changes: 26 additions & 35 deletions infra/status-page/server/main.ts
@@ -6,7 +6,8 @@
// GET /api/iris — iris controller reachability (15s cache)
// GET /api/control-plane/health — active env Iris + finelog health history
// GET /api/workers — current iris worker counts (15s cache)
// GET /api/workers/history — in-memory 24h worker count ring buffer
// GET /api/workers/history — 24h per-region worker count from finelog (60s cache)
// GET /api/provisioning/history — 24h provisioning success ratio from finelog (60s cache)
// GET /api/jobs — iris job counts for last 24h by state (60s cache)
// GET /api/probes — synthetic-canary checks + provisioning from finelog (60s cache)
// GET /api/health — liveness probe, no upstream calls
@@ -19,7 +20,13 @@ import { serve } from "@hono/node-server";
import { serveStatic } from "@hono/node-server/serve-static";
import { Hono } from "hono";
import { TTLCache } from "./cache.js";
import { IrisPingHistory, ServiceHealthHistory, WorkerHistory } from "./history.js";
import { IrisPingHistory, ServiceHealthHistory } from "./history.js";
import {
  provisioningHistory,
  workersHistory,
  type ProvisioningHistoryResponse,
  type WorkersHistoryResponse,
} from "./sources/clusterHistory.js";
import {
  FERRY_GROUPS,
  fetchTierStatus,
@@ -47,6 +54,11 @@ const ferryCache = new TTLCache<FerryTierStatus>(60_000);
const buildCache = new TTLCache<BuildsResponse>(60_000);
const workersCache = new TTLCache<WorkersSnapshot>(15_000);
const jobsCache = new TTLCache<JobsSnapshot>(60_000);
// Worker (60s cadence) and provisioning (15min cadence) history come from the
// canary's finelog rows; a 60s shield keeps finelog query load low without
// lagging the worker series.
const workersHistoryCache = new TTLCache<WorkersHistoryResponse>(60_000);
const provisioningHistoryCache = new TTLCache<ProvisioningHistoryResponse>(60_000);
// Probe metrics turn over slowly — health checks every ≤5min, provisioning
// every 15min — so a 60s shield is plenty and keeps finelog query load low.
const probesCache = new TTLCache<ProbesSnapshot>(60_000);
@@ -62,34 +74,14 @@ const IRIS_PING_CAPACITY = Math.ceil(IRIS_PING_WINDOW_MS / IRIS_PING_INTERVAL_MS
const irisPingHistory = new IrisPingHistory(IRIS_PING_CAPACITY);
let lastIrisPing: IrisPingResult | null = null;

// Ring buffer for worker-count history. Sized so the buffer holds 24h of
// samples at the configured cadence. The sampler runs on a fixed interval
// below — not lazily off request traffic — so history keeps ticking even
// when nobody's watching the dashboard.
// In-process sampler cadence + buffer sizing for the control-plane latency
// history (worker-count history now lives in finelog — see clusterHistory.ts).
const SAMPLE_INTERVAL_MS = 30_000;
const HISTORY_CAPACITY = Math.ceil((24 * 60 * 60 * 1000) / SAMPLE_INTERVAL_MS);
const workerHistory = new WorkerHistory(HISTORY_CAPACITY);
const serviceHealthHistory = new ServiceHealthHistory(HISTORY_CAPACITY);
const SERVICE_HEALTH_WINDOW_MS = 24 * 60 * 60 * 1000;
let lastServiceHealth: ServiceHealthSnapshot[] = [];

async function sampleWorkers(): Promise<void> {
  const snapshot = await workersCache.get("workers", () => workerSnapshot());
  if (snapshot.error) {
    console.error("worker sampler: snapshot error:", snapshot.error);
    // Don't pollute history with zeros when the controller is unreachable.
    return;
  }
  const regions: Record<string, number> = {};
  for (const r of snapshot.byRegion) {
    regions[r.region] = r.healthy;
  }
  workerHistory.push({
    t: Date.parse(snapshot.fetchedAt),
    regions,
  });
}

async function sampleIrisPing(): Promise<void> {
  const result = await pingIris();
  lastIrisPing = result;
@@ -111,15 +103,6 @@ async function sampleServiceHealth(): Promise<void> {

// Kick off immediately, then on a fixed cadence. unref() lets the process
// exit cleanly during tests without waiting on the timer.
void sampleWorkers().catch((err) => {
  console.error("worker sampler error", err);
});
setInterval(() => {
  void sampleWorkers().catch((err) => {
    console.error("worker sampler error", err);
  });
}, SAMPLE_INTERVAL_MS).unref();

void sampleIrisPing().catch((err) => {
  console.error("iris ping sampler error", err);
});
@@ -180,8 +163,16 @@ app.get("/api/workers", async (c) => {
  return c.json(snapshot);
});

app.get("/api/workers/history", (c) => {
  return c.json({ samples: workerHistory.samples() });
app.get("/api/workers/history", async (c) => {
  const snapshot = await workersHistoryCache.get("workers-history", () => workersHistory());
  return c.json(snapshot);
});

app.get("/api/provisioning/history", async (c) => {
  const snapshot = await provisioningHistoryCache.get("provisioning-history", () =>
    provisioningHistory(),
  );
  return c.json(snapshot);
});

app.get("/api/jobs", async (c) => {