infra-dashboard: cloud run status page for marin#4649

Open
ravwojdyla-agent wants to merge 11 commits into main from worktree-rav-status-page

Conversation

@ravwojdyla-agent
Contributor

@ravwojdyla-agent ravwojdyla-agent commented Apr 11, 2026

  • one iap-gated cloud run service (marin-infra-dashboard) aggregating ferry ci, main-branch build health, iris reachability, worker counts, and job state — single place to check marin's status without hopping between the actions tab, ssh tunnels, and the iris dashboard
  • stack: node 20 + hono + ts server, vite + react 18 + jotai + @tanstack/react-query + tailwind 4 + recharts frontend, single package.json, multi-stage dockerfile, ~1.3k hand-written lines (the rest of the diff is package-lock.json)
  • panels
    • ferries: marin-canary-ferry.yaml + marin-datakit-smoke.yaml, last 30 workflow runs each via the github rest api, colored dot strip + success rate
    • github build: last 100 commits on main with their aggregate statusCheckRollup.state via graphql (footnote 1), full-width flex-1 dot strip, latest commit author avatar with a state-driven decoration (tilted crown for SUCCESS, 💩 for FAILURE / ERROR)
    • iris: /health reachability row, then two nested subsections:
      • workers — current healthy/active via ExecuteRawQuery raw sql against the controller's sqlite, 24h line chart fed by a 30s-cadence in-process ring buffer
      • jobs — two buckets, "right now" (pending / building / running, no time filter, matches the iris fleet tab) and "last 24h" (terminal states filtered by finished_at_ms), both scoped to root jobs via parent_job_id IS NULL to avoid child-job fan-out inflating the counts
  • deploy mirrors infra/iris-iap-proxy/: ./deploy.sh wraps gcloud beta run deploy --source=., direct vpc egress so the service can reach the controller's internal ip, native iap auth, min/max-instances=1 to keep the in-process ttl cache and worker ring buffer warm, github token in secret manager
  • controller data path uses iris's raw sql rpc in null-auth mode (footnote 2)
  • new .github/workflows/marin-infra-dashboard.yaml gated on infra/status-page/** runs npm ci, npm run lint, npm run typecheck, npm run build with eslint 9 flat config scoped per env (node globals for server/, browser + react-hooks for web/)
  • known limitations captured in infra/status-page/README.md
    • workers history is in-process and lost on cloud run restart (24h warm-up after each deploy — follow-up options: gcs object, bumped retention on worker_resource_history, or a proper worker_count_history table in the controller)
    • single cluster only (marin); marin-dev deferred
    • max one instance — more would split the ttl cache and N× upstream traffic

Footnotes

  1. graphql requires auth even for public repos, so GITHUB_TOKEN is a hard requirement for the build panel, not just a rate-limit lift. the `marin-community` org enforces a 366-day fine-grained-pat lifetime policy; the secret in secret manager is `marin-status-page-github-token`

  2. `ExecuteRawQuery` is admin-only but `NullAuthInterceptor` promotes anonymous callers to admin on the marin controller. if auth ever gets enabled, both workers and jobs panels will error until a service-account bearer is plumbed

Provides a single IAP-gated view of canary ferry / datakit smoke CI
health and iris controller reachability, so the team can check Marin's
state without hopping between the GitHub Actions tab and SSH tunnels
into the VPC. Follows the iris-iap-proxy deploy pattern: Cloud Run with
native IAP, Direct VPC egress, min/max-instances=1 to keep the
in-process TTL cache warm.
@ravwojdyla-agent ravwojdyla-agent added the agent-generated (Created by automation/agent) label Apr 11, 2026
@ravwojdyla-agent
Contributor Author

🤖 Specification (required for >500 LOC PRs; most of the line count is package-lock.json, hand-written source is closer to ~1300 LOC but still over the threshold).

Problem

The team has no single place to check Marin's health. Ferry CI status lives in the GitHub Actions tab; iris controller state is behind the VPC and currently requires SSH tunnels or the iris-iap-proxy dashboard (infra/iris-iap-proxy/, see #4630). A viewer checking "is anything broken" has to hop between tabs and tools.

Approach

One Cloud Run service (marin-status-page) deployed with the same pattern as infra/iris-iap-proxy/: native IAP, Direct VPC egress so it can reach the controller, min/max-instances=1 to keep the in-process TTL cache warm. Language is TypeScript end-to-end so the service is a single language; acknowledged deviation from lib/iris/dashboard/ (Vue + rsbuild) and infra/iris-iap-proxy/ (Python).

Server: Node 20 + Hono. Three JSON endpoints (/api/ferry, /api/orch, /api/health) and static serving of the built web assets from web/dist. Two sources (server/sources/githubActions.ts and server/sources/orch.ts) feed into a hand-rolled TTLCache (server/cache.ts) with in-flight promise coalescing, so concurrent requests for the same key hit upstream at most once per TTL window.
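The TTL-cache-with-coalescing behaviour described above can be sketched as follows. The `get(key, loader)` shape mirrors the description of server/cache.ts, but all names here are illustrative, not the PR's actual code:

```typescript
// Sketch of a TTL cache with in-flight promise coalescing (hypothetical names).
type Entry<T> = { value: Promise<T>; expiresAt: number };

class TTLCache<T> {
  private entries = new Map<string, Entry<T>>();
  constructor(private ttlMs: number) {}

  // Concurrent callers for the same key share one in-flight promise, so
  // upstream is hit at most once per TTL window.
  get(key: string, loader: () => Promise<T>): Promise<T> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;
    const value = loader().catch((err) => {
      // Don't cache failures: drop the entry so the next caller retries.
      this.entries.delete(key);
      throw err;
    });
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}
```

Storing the promise itself (rather than the resolved value) is what gives the coalescing: a second request arriving mid-fetch reuses the pending promise instead of issuing another upstream call.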

Web: Vite + React 18 + TypeScript + Jotai (UI state) + @tanstack/react-query (network state) + Tailwind 4. Two panels: FerryPanel shows latest run, 30-run colored-square history, and rolling success rate per workflow. OrchPanel shows reachability dot, round-trip latency, and expandable raw /health response.

Key decisions (details in the README):

  • The marin repo is public, so GITHUB_TOKEN exists only to lift the GH rate limit from 60/hr (unauth, per Cloud Run egress IP) to 5000/hr. The token grants nothing beyond what's public. Stored in Secret Manager as marin-status-page-github-token.
  • Controller discovery is a direct TS port of infra/iris-iap-proxy/discovery.py (same CONTROLLER_LABEL lookup, same 60s cache, same CONTROLLER_URL env var override for local dev).
  • orch panel is reachability + latency only in v1. The iris controller's only easily-callable JSON endpoint is /health (lib/iris/src/iris/cluster/controller/dashboard.py:396); richer worker/job data lives behind Connect RPC and would require generated TS stubs. Called out as a known limitation in the README.
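A minimal sketch of the reachability + latency probe the orch panel describes, assuming a plain fetch of /health with a timeout. The helper name and timeout value are invented for illustration:

```typescript
// Hypothetical probe: reports reachability, round-trip latency, and the raw
// /health body, matching the v1 panel described above.
async function probeHealth(baseUrl: string): Promise<
  | { reachable: boolean; latencyMs: number; body: unknown }
  | { reachable: false; latencyMs: number; error: string }
> {
  const start = Date.now();
  try {
    const res = await fetch(`${baseUrl}/health`, {
      // Bound the wait so a dead VPC path doesn't hang the panel.
      signal: AbortSignal.timeout(5_000),
    });
    return { reachable: res.ok, latencyMs: Date.now() - start, body: await res.json() };
  } catch (err) {
    return { reachable: false, latencyMs: Date.now() - start, error: String(err) };
  }
}
```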

Scope (from earlier design iterations, captured in scratch/projects/marin-status-page.md locally):

  • Ferry workflows: marin-canary-ferry.yaml and marin-datakit-smoke.yaml. Adding more = one line in FERRY_WORKFLOWS.
  • History window: last 30 runs on main.
  • Clusters: marin only. marin-dev deferred.
  • wandb panels: deferred (see "Future work" in README).

Tests

No automated test suite for this PR. Verified manually:

  • npm install clean (210 packages).
  • tsc --noEmit on both tsconfig.server.json and tsconfig.web.json — no errors.
  • vite build produces web/dist (199 KB, 64 KB gzipped), tsc -p tsconfig.server.json produces server/dist.
  • node server/dist/main.js boots; /api/health returns {"status":"ok"}.
  • /api/orch with ORCH_FIXTURE=1 returns canned data (used for UI dev without a VPC tunnel).
  • /api/ferry against real GitHub returns real history for both configured workflows, with correct conclusion/duration/SHA parsing.
  • Dev path (npx tsx server/main.ts) also boots cleanly.

Not exercised in this PR: Dockerfile build (no Docker in the sandbox; multi-stage layout follows Cloud Run buildpack conventions and mirrors iris-iap-proxy) and deploy.sh end-to-end (needs gcloud + Cloud Build + IAP bindings). Worth a docker build . and a staging deploy before merging if you want full confidence; otherwise the first ./deploy.sh catches anything runtime-specific.

Follow-ups (intentionally out of scope)

  • Richer orch data once the iris controller grows JSON endpoints or we vendor Connect TS stubs.
  • wandb panels (rendered metrics via wandb.Api() or scheduled screenshot via Playwright + GCS).
  • marin-dev cluster.
  • Historical ferry trends beyond the cache TTL.
  • Slack alert on ferry red streak.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e0f5e81d8b


<span className="ml-auto text-slate-400">
{successRate === null
? "—"
: `${Math.round(successRate * 100)}% success over ${wf.history.length}`}


P2: Show success-rate denominator for completed runs

successRate is computed on completed runs only in server/sources/githubActions.ts (successes / completed.length), but the UI text here reports the denominator as wf.history.length. When the history includes queued/in-progress runs, the panel displays a mathematically incorrect “X% success over N” value and understates reliability.
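One way to fix this is to compute both the percentage and the displayed denominator over completed runs only; field and function names here are hypothetical, not the PR's actual code:

```typescript
// A run with a null conclusion is still queued or in progress.
type Run = { conclusion: "success" | "failure" | null };

// Percentage and denominator both derive from the same completed subset,
// so the label stays mathematically consistent.
function successLabel(history: Run[]): string {
  const completed = history.filter((r) => r.conclusion !== null);
  if (completed.length === 0) return "—";
  const successes = completed.filter((r) => r.conclusion === "success").length;
  return `${Math.round((successes / completed.length) * 100)}% success over ${completed.length}`;
}
```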


Comment on lines +27 to +29
const results = await Promise.all(
FERRY_WORKFLOWS.map((wf) => ferryCache.get(wf.file, () => fetchWorkflowStatus(wf))),
);


P2: Isolate ferry workflow failures to avoid full API 500

This route aggregates workflow calls with Promise.all, so any thrown exception from one fetchWorkflowStatus call (for example a transient GitHub network error or JSON parse failure) rejects the whole request and returns a 500 for /api/ferry. That prevents the dashboard from showing partial results and per-workflow error fields during upstream flakiness.
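A common way to get the per-workflow isolation the comment asks for is Promise.allSettled, sketched here with hypothetical names:

```typescript
// Each workflow resolves to either data or a per-workflow error field, so
// one upstream failure no longer turns the whole route into a 500.
type WorkflowResult<T> = { file: string; data?: T; error?: string };

async function fetchAll<T>(
  files: string[],
  fetchOne: (file: string) => Promise<T>,
): Promise<WorkflowResult<T>[]> {
  const settled = await Promise.allSettled(files.map(fetchOne));
  return settled.map((r, i) =>
    r.status === "fulfilled"
      ? { file: files[i], data: r.value }
      : { file: files[i], error: String(r.reason) },
  );
}
```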


Rename header to "Marin Infra Status". Restructure so the Iris section
wraps reachability plus two new subsections:

- Workers: current healthy/active count via ExecuteRawQuery against the
  controller's SQLite, plus a 24h recharts line chart fed by an
  in-process ring buffer (2880 samples @ 30s). History is lost on
  restart; follow-ups to a persistent backing are noted in the README.
- Jobs: 24h breakdown grouped by JobState enum, rendered as per-state
  horizontal bars with counts and percentages.

Add a GitHub Build panel above Iris that shows per-commit aggregate CI
rollup for the last 100 commits on main, via the GraphQL
statusCheckRollup field. One request, 100 commits, GITHUB_TOKEN required
(GraphQL needs auth even for public repos). Success rate is computed
over finalized commits only so in-flight builds don't drag it down.
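The one-request/100-commit query described above might look roughly like this. The endpoint and the statusCheckRollup field are GitHub's public GraphQL API; the function shape and variable names are illustrative, not the PR's actual code:

```typescript
// statusCheckRollup.state aggregates all checks on a commit into one value
// (SUCCESS, FAILURE, ERROR, PENDING, EXPECTED).
const QUERY = `
  query ($owner: String!, $name: String!) {
    repository(owner: $owner, name: $name) {
      ref(qualifiedName: "main") {
        target {
          ... on Commit {
            history(first: 100) {
              nodes {
                oid
                author { user { avatarUrl } }
                statusCheckRollup { state }
              }
            }
          }
        }
      }
    }
  }
`;

async function fetchBuildsOnMain(owner: string, name: string, token: string) {
  // GraphQL requires a token even for public repos, unlike the REST API.
  const res = await fetch("https://api.github.com/graphql", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ query: QUERY, variables: { owner, name } }),
  });
  if (!res.ok) throw new Error(`GitHub GraphQL ${res.status}`);
  return res.json();
}
```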

Rename orch → iris throughout (module, types, hook, route, env var,
cache, README) to match the fact that Iris *is* the orchestrator.
@ravwojdyla ravwojdyla changed the title [status-page] Add Cloud Run status dashboard for ferry + orch Marin Infra Status page for the office Apr 11, 2026
Rename the Cloud Run service to marin-infra-dashboard (package name,
startup log, GitHub user-agent, deploy.sh SERVICE, README title). The
service account and GitHub token secret keep their historical
marin-status-page* names — GCP does not support renaming either, and
the deploy.sh comment now documents the discrepancy.

Add .github/workflows/marin-infra-dashboard.yaml gated on
infra/status-page/** so PRs touching the dashboard get lint +
typecheck + build as a first-class check in branch protection,
matching the marin-unit-tests two-job changes/build pattern.

Wire ESLint 9 flat config covering server (Node) and web (React +
hooks) with scoped globals per environment. Lint runs in CI before
typecheck so the fastest failure case surfaces first.
@ravwojdyla ravwojdyla requested a review from rjpower April 11, 2026 03:27
The deployed contract is that the controller URL is reachable and
GITHUB_TOKEN is set — fixture mode added dead code paths to every
source file for an offline dev loop nobody uses. Drop the *_FIXTURE
env vars, the per-source fixture snapshot helpers, the ring-buffer
fixture prefill, and WorkerHistory.seed() which only existed for that
prefill. Panels that depend on the controller now surface a real
error instead of masking it behind synthetic data.

Two findings from the PR 4649 review bot are fixed at the same time
since they live in adjacent files:
- FerryPanel's success-rate denominator used wf.history.length while
  the server computes successRate over completed.length only, so the
  percentage and the denominator shown next to it were inconsistent
  whenever the history contained in-progress runs.
- fetchWorkflowStatus and fetchBuildsOnMain did not wrap their outer
  fetch() calls in try/catch, so a network-level throw (DNS, TLS,
  connection refused) would propagate out of Promise.all in the route
  handler and turn /api/ferry or /api/builds into a 500 instead of a
  per-source error snapshot.
The Build panel now fetches the latest commit author's avatar via the
GraphQL API and overlays a state-driven decoration: a gold crown
tilted 12 degrees over its bottom pivot when the rollup is SUCCESS, a
poop emoji with the same tilt when it is FAILURE or ERROR. Other
states fall back to a plain avatar. Easier to read the health of main
at a glance than a color dot alone.
Build-status strip now uses flex-1 per dot so the 100 commits span the
full card width instead of leaving empty space on the right. Renamed
the "Datakit smoke" workflow label to "Datakit ferry" and the
FerryPanel heading from "Ferry workflows" to just "Ferries" to keep
the terminology consistent across the dashboard.
The previous query filtered every state by submitted_at_ms in the
last 24h, which hid long-running experiments that started earlier —
the panel said 40 running while the iris Fleet dashboard showed 58.
Split the SQL so in-flight states (pending / building / running) are
always counted regardless of submission time while terminal states
(succeeded / failed / killed / worker_failed / unschedulable) are
filtered to finished_at_ms in the last 24h. Also scope to root jobs
(parent_job_id IS NULL) so child-job fan-out does not inflate the
totals.

Render the two buckets as separate subsections inside the Jobs card
with their own totals and state bars so viewers can distinguish
"what is in flight right now" from "how did today's work resolve".
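The split described above can be sketched as a single query. Table, column, and state names follow the commit message and should be treated as assumptions about the controller's schema; state values are written as names for readability, though the controller may store them as JobState enum integers:

```typescript
// In-flight states have no time filter (matching the iris Fleet tab);
// terminal states are bounded to the last 24h via finished_at_ms; both
// buckets are scoped to root jobs so child-job fan-out doesn't inflate
// the totals. ":cutoff_ms" is a bind parameter, now - 24h in epoch ms.
const JOBS_SQL = `
  SELECT state, COUNT(*) AS n
  FROM jobs
  WHERE parent_job_id IS NULL
    AND (
      state IN ('pending', 'building', 'running')
      OR (
        state IN ('succeeded', 'failed', 'killed',
                  'worker_failed', 'unschedulable')
        AND finished_at_ms >= :cutoff_ms
      )
    )
  GROUP BY state
`;
```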
Dropped recharts' ResponsiveContainer in favour of a hand-rolled
useContainerSize hook that measures the chart div via ResizeObserver
and passes explicit pixel dimensions to LineChart. ResponsiveContainer
emitted a "width(-1) / height(-1)" warning under React StrictMode
because its internal effect could run before the container's first
layout committed. The hook uses a callback ref — a useRef-based
version ran its effect once on mount while the div was still hidden
behind the data-loading gate, missed the ref entirely, and never
re-set the observer when the element finally appeared.
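The core of the measuring approach, stripped of React specifics, can be sketched as below. In the real hook this logic would live inside a useCallback-based ref so the observer attaches when the chart div actually mounts; names here are hypothetical:

```typescript
type Size = { width: number; height: number };

// Attach a ResizeObserver and report explicit pixel dimensions; the chart
// then receives concrete width/height instead of relying on
// ResponsiveContainer's internal measurement effect.
function observeContainerSize(
  node: Element,
  onSize: (size: Size) => void,
): () => void {
  const observer = new ResizeObserver((entries) => {
    for (const entry of entries) {
      const { width, height } = entry.contentRect;
      onSize({ width, height });
    }
  });
  observer.observe(node);
  // Cleanup mirrors a React callback-ref/effect teardown.
  return () => observer.disconnect();
}
```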
@ravwojdyla-agent ravwojdyla-agent changed the title Marin Infra Status page for the office infra-dashboard: cloud run status page for marin Apr 11, 2026
Vite serves anything under web/public/ at the site root, so dropping
marin-logo.svg there and referencing it from index.html is all the
wiring needed. The SVG was run through svgo at integer precision
(1396-unit viewBox → 1-unit = 0.07% of image width, still subpixel
at every favicon size a browser renders) to shrink it from 4.5 KiB
to 1.4 KiB.
Iris reachability (cluster name, reachable/unreachable, latency,
controllerUrl) moves inline into the section header next to the Iris
title instead of sitting in its own boxed card — one less visual
enclosure and the info reads at the same level as its section.

Workers panel reports a single `healthy` count (workers where
healthy=1, no active filter) instead of the confusing
available/total pair that made it look like "has capacity" when it
meant "in the fleet". Resource line now carries CPU and memory as
currently-free amounts plus TPU as raw chip count across healthy
workers — iris schedules TPU at whole-VM granularity so "available
chips" collapses to a misleading near-zero number on a busy cluster.

Workers history chart is now per-region. The ring buffer samples
store a `regions: Record<string, number>` map instead of a flat
available/total pair; the chart renders one colored line per region
using a stable palette and a bottom legend. ResponsiveContainer is
replaced by a useContainerSize hook with a callback ref so the chart
mounts correctly after data loads without triggering recharts'
"width(-1)/height(-1)" warning under React StrictMode.

Favicon: add infra/status-page/web/public/marin-logo.svg (svgo'd
down to 1.4 KiB) and reference it from index.html.
Collaborator

@rjpower rjpower left a comment


Seems good to me (without looking at the actual output, of course)...

I don't want to be too precious about a vibe dashboard, but let's try to be consistent on UI and servers. E.g. a Python/uvicorn server here could just use the existing Iris tunnel & discovery libraries, and would automatically get flagged if we broke the types or APIs. I want to spend zero cycles maintaining the dashboard if we change something in Iris.

Ditto for the frontend, I'd prefer if we just stuck with Vue & rsbuild a la Iris rather than rolling our own.

But I'm not going to block an experiment...

@@ -0,0 +1,76 @@
// Iris controller reachability.
//
// In v1, the controller's only easily-callable JSON endpoint is /health
Collaborator


I think claude might be confused, all of the Connect RPC methods support JSON output natively.

(which is of course how the raw SQL endpoint is exported as well)

// UNSCHEDULABLE=8) are filtered to jobs that finished in the last
// 24h via finished_at_ms, which is always populated for terminal
// jobs (verified on the marin controller).
const BREAKDOWN_SQL = `
Collaborator


It doesn't need to be done now, but using the ListJobs API will likely/definitely be more stable.
