infra-dashboard: cloud run status page for marin#4649
ravwojdyla-agent wants to merge 11 commits into main from
Conversation
Provides a single IAP-gated view of canary ferry / datakit smoke CI health and iris controller reachability, so the team can check Marin's state without hopping between the GitHub Actions tab and SSH tunnels into the VPC. Follows the iris-iap-proxy deploy pattern: Cloud Run with native IAP, Direct VPC egress, min/max-instances=1 to keep the in-process TTL cache warm.
🤖 Specification (required for >500 LOC PRs; most of the line count is package-lock.json, and hand-written source is closer to ~1300 LOC, but still over the threshold).

Problem

The team has no single place to check Marin's health. Ferry CI status lives in the GitHub Actions tab; iris controller state is behind the VPC and currently requires SSH tunnels or the iris-iap-proxy dashboard (infra/iris-iap-proxy/, see #4630). A viewer checking "is anything broken?" has to hop between tabs and tools.

Approach

One Cloud Run service (marin-status-page) deployed with the same pattern as infra/iris-iap-proxy/: native IAP, Direct VPC egress so it can reach the controller, and min/max-instances=1 to keep the in-process TTL cache warm. The language is TypeScript end-to-end so the service is a single language; this is an acknowledged deviation from lib/iris/dashboard/ (Vue + rsbuild) and infra/iris-iap-proxy/ (Python).

Server: Node 20 + Hono. Three JSON endpoints (/api/ferry, /api/orch, /api/health) and static serving of the built web assets from web/dist. Two sources (server/sources/githubActions.ts and server/sources/orch.ts) feed a hand-rolled TTLCache (server/cache.ts) with in-flight promise coalescing, so concurrent requests for the same key hit upstream at most once per TTL window.

Web: Vite + React 18 + TypeScript + Jotai (UI state) + @tanstack/react-query (network state) + Tailwind 4. Two panels: FerryPanel shows the latest run, a 30-run colored-square history, and a rolling success rate per workflow; OrchPanel shows a reachability dot, round-trip latency, and an expandable raw /health response.

Key decisions (details in the README):
Scope (from earlier design iterations, captured in scratch/projects/marin-status-page.md locally):
Tests

No automated test suite in this PR. Verified manually:
Not exercised in this PR: the Dockerfile build (no Docker in the sandbox; the multi-stage layout follows Cloud Run buildpack conventions and mirrors iris-iap-proxy) and deploy.sh end-to-end (needs gcloud + Cloud Build + IAP bindings). Worth a `docker build .` and a staging deploy before merging if you want full confidence; otherwise the first ./deploy.sh run catches anything runtime-specific.

Follow-ups (intentionally out of scope)
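The in-flight promise coalescing described in the approach could look roughly like the sketch below. This is a hedged illustration, not the actual server/cache.ts API: the class and method names are assumptions.

```typescript
// Minimal sketch of a TTL cache with in-flight promise coalescing,
// along the lines of what server/cache.ts is described as doing.
// TTLCache and get() are illustrative names, not the real API.
type Entry<V> = { value: Promise<V>; expiresAt: number };

class TTLCache<V> {
  private entries = new Map<string, Entry<V>>();
  constructor(private ttlMs: number) {}

  // Returns the cached promise if still fresh; otherwise starts one fetch
  // and shares it with every concurrent caller, so upstream is hit at most
  // once per key per TTL window.
  get(key: string, fetch: () => Promise<V>): Promise<V> {
    const now = Date.now();
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > now) return hit.value;
    const value = fetch().catch((err) => {
      // Don't cache failures: let the next caller retry immediately.
      this.entries.delete(key);
      throw err;
    });
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
    return value;
  }
}
```

The point of caching the promise (rather than the resolved value) is that two requests arriving in the same tick share one upstream call instead of racing.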
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e0f5e81d8b
```tsx
<span className="ml-auto text-slate-400">
  {successRate === null
    ? "—"
    : `${Math.round(successRate * 100)}% success over ${wf.history.length}`}
```
Show success-rate denominator for completed runs
successRate is computed on completed runs only in server/sources/githubActions.ts (successes / completed.length), but the UI text here reports the denominator as wf.history.length. When the history includes queued/in-progress runs, the panel displays a mathematically incorrect “X% success over N” value and understates reliability.
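A minimal sketch of the fix this suggestion implies: derive both the rate and the displayed denominator from the same completed-run subset. The `Run` shape and the `successLabel` helper are hypothetical, for illustration only.

```typescript
// Hypothetical fix sketch: numerator and denominator come from the same
// completed-run subset, so the label can never be inconsistent.
// conclusion === null stands in for queued/in-progress runs.
type Run = { conclusion: "success" | "failure" | null };

function successLabel(history: Run[]): string {
  const completed = history.filter((r) => r.conclusion !== null);
  if (completed.length === 0) return "—";
  const successes = completed.filter((r) => r.conclusion === "success").length;
  const rate = successes / completed.length;
  // Report the completed count, not history.length, as the denominator.
  return `${Math.round(rate * 100)}% success over ${completed.length}`;
}
```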
infra/status_page/server/main.ts
```ts
const results = await Promise.all(
  FERRY_WORKFLOWS.map((wf) => ferryCache.get(wf.file, () => fetchWorkflowStatus(wf))),
);
```
Isolate ferry workflow failures to avoid full API 500
This route aggregates workflow calls with Promise.all, so any thrown exception from one fetchWorkflowStatus call (for example a transient GitHub network error or JSON parse failure) rejects the whole request and returns a 500 for /api/ferry. That prevents the dashboard from showing partial results and per-workflow error fields during upstream flakiness.
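One way to act on this suggestion is `Promise.allSettled`, which turns each rejection into a per-workflow error field instead of failing the whole route. The workflow and status shapes below are assumptions for illustration.

```typescript
// Sketch of isolating per-workflow failures, per the review suggestion:
// allSettled instead of all, so one rejected fetch becomes a per-workflow
// error entry rather than a 500 for the whole /api/ferry response.
type Workflow = { file: string };
type WorkflowStatus = { file: string; ok: boolean; error?: string };

async function fetchAllIsolated(
  workflows: Workflow[],
  fetchOne: (wf: Workflow) => Promise<WorkflowStatus>,
): Promise<WorkflowStatus[]> {
  const settled = await Promise.allSettled(workflows.map(fetchOne));
  return settled.map((res, i) =>
    res.status === "fulfilled"
      ? res.value
      : { file: workflows[i].file, ok: false, error: String(res.reason) },
  );
}
```

The dashboard can then render the healthy workflows normally and show an error chip for the flaky one.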
Rename header to "Marin Infra Status". Restructure so the Iris section wraps reachability plus two new subsections:

- Workers: current healthy/active count via ExecuteRawQuery against the controller's SQLite, plus a 24h recharts line chart fed by an in-process ring buffer (2880 samples @ 30s). History is lost on restart; follow-ups toward a persistent backing are noted in the README.
- Jobs: 24h breakdown grouped by the JobState enum, rendered as per-state horizontal bars with counts and percentages.

Add a GitHub Build panel above Iris that shows a per-commit aggregate CI rollup for the last 100 commits on main, via the GraphQL statusCheckRollup field. One request, 100 commits, GITHUB_TOKEN required (GraphQL needs auth even for public repos). Success rate is computed over finalized commits only, so in-flight builds don't drag it down.

Rename orch → iris throughout (module, types, hook, route, env var, cache, README) to match the fact that Iris *is* the orchestrator.
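The single-request rollup presumably looks something like the sketch below. `statusCheckRollup` is a documented field on GitHub's GraphQL `Commit` type; the owner/name placeholders and the exact query shape here are assumptions, not the dashboard's actual source.

```typescript
// Hypothetical query shape for the last-100-commit CI rollup on main.
// OWNER/REPO are placeholders; the real query may differ.
const BUILDS_QUERY = `
  query {
    repository(owner: "OWNER", name: "REPO") {
      defaultBranchRef {
        target {
          ... on Commit {
            history(first: 100) {
              nodes {
                oid
                messageHeadline
                statusCheckRollup { state }  # SUCCESS / FAILURE / ERROR / PENDING / EXPECTED
              }
            }
          }
        }
      }
    }
  }
`;

// GraphQL needs auth even for public repos, hence the hard GITHUB_TOKEN
// requirement for this panel.
async function fetchBuilds(token: string): Promise<unknown> {
  const res = await fetch("https://api.github.com/graphql", {
    method: "POST",
    headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/json" },
    body: JSON.stringify({ query: BUILDS_QUERY }),
  });
  if (!res.ok) throw new Error(`GitHub GraphQL ${res.status}`);
  return res.json();
}
```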
Rename the Cloud Run service to marin-infra-dashboard (package name, startup log, GitHub user-agent, deploy.sh SERVICE, README title). The service account and GitHub token secret keep their historical marin-status-page* names — GCP does not support renaming either — and the deploy.sh comment now documents the discrepancy.

Add .github/workflows/marin-infra-dashboard.yaml gated on infra/status-page/** so PRs touching the dashboard get lint + typecheck + build as a first-class check in branch protection, matching the marin-unit-tests two-job changes/build pattern.

Wire up ESLint 9 flat config covering server (Node) and web (React + hooks) with scoped globals per environment. Lint runs in CI before typecheck so the fastest failure surfaces first.
The deployed contract is that the controller URL is reachable and GITHUB_TOKEN is set — fixture mode added dead code paths to every source file for an offline dev loop nobody uses. Drop the *_FIXTURE env vars, the per-source fixture snapshot helpers, the ring-buffer fixture prefill, and WorkerHistory.seed(), which only existed for that prefill. Panels that depend on the controller now surface a real error instead of masking it behind synthetic data.

Two findings from the PR 4649 review bot are fixed at the same time, since they live in adjacent files:

- FerryPanel's success-rate denominator used wf.history.length while the server computes successRate over completed.length only, so the percentage and the denominator shown next to it were inconsistent whenever the history contained in-progress runs.
- fetchWorkflowStatus and fetchBuildsOnMain did not wrap their outer fetch() calls in try/catch, so a network-level throw (DNS, TLS, connection refused) would propagate out of Promise.all in the route handler and turn /api/ferry or /api/builds into a 500 instead of a per-source error snapshot.
The Build panel now fetches the latest commit author's avatar via the GraphQL API and overlays a state-driven decoration: a gold crown tilted 12 degrees about its bottom pivot when the rollup is SUCCESS, and a poop emoji with the same tilt when it is FAILURE or ERROR. Other states fall back to a plain avatar. This makes the health of main easier to read at a glance than a color dot alone.
Build-status strip now uses flex-1 per dot so the 100 commits span the full card width instead of leaving empty space on the right. Renamed the "Datakit smoke" workflow label to "Datakit ferry" and the FerryPanel heading from "Ferry workflows" to just "Ferries" to keep the terminology consistent across the dashboard.
The previous query filtered every state by submitted_at_ms in the last 24h, which hid long-running experiments that started earlier — the panel said 40 running while the iris Fleet dashboard showed 58. Split the SQL so in-flight states (pending / building / running) are always counted regardless of submission time while terminal states (succeeded / failed / killed / worker_failed / unschedulable) are filtered to finished_at_ms in the last 24h. Also scope to root jobs (parent_job_id IS NULL) so child-job fan-out does not inflate the totals. Render the two buckets as separate subsections inside the Jobs card with their own totals and state bars so viewers can distinguish "what is in flight right now" from "how did today's work resolve".
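The split query could be sketched roughly as below. Table and column names (jobs, state, parent_job_id, finished_at_ms) follow the commit message and the UNSCHEDULABLE=8 comment in the diff; the concrete JobState integer encodings are assumptions.

```typescript
// Hedged sketch of the two-bucket split described above; enum values are
// illustrative, not verified against the JobState proto.
const IN_FLIGHT_STATES = "1, 2, 3"; // e.g. PENDING, BUILDING, RUNNING
const TERMINAL_STATES = "4, 5, 6, 7, 8"; // e.g. SUCCEEDED .. UNSCHEDULABLE=8

// In-flight jobs are counted regardless of submission time, so
// long-running experiments started before the window still show up.
const IN_FLIGHT_SQL = `
  SELECT state, COUNT(*) AS n FROM jobs
  WHERE parent_job_id IS NULL AND state IN (${IN_FLIGHT_STATES})
  GROUP BY state`;

// Terminal jobs are windowed on finished_at_ms (always populated for
// terminal jobs), and scoped to root jobs to avoid child-job fan-out.
const terminalSql = (sinceMs: number) => `
  SELECT state, COUNT(*) AS n FROM jobs
  WHERE parent_job_id IS NULL AND state IN (${TERMINAL_STATES})
    AND finished_at_ms >= ${sinceMs}
  GROUP BY state`;
```

The `parent_job_id IS NULL` scope appears in both buckets; only the terminal bucket carries the time filter.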
Dropped recharts' ResponsiveContainer in favour of a hand-rolled useContainerSize hook that measures the chart div via ResizeObserver and passes explicit pixel dimensions to LineChart. ResponsiveContainer emitted a "width(-1) / height(-1)" warning under React StrictMode because its internal effect could run before the container's first layout committed. The hook uses a callback ref — a useRef-based version ran its effect once on mount while the div was still hidden behind the data-loading gate, missed the ref entirely, and never re-set the observer when the element finally appeared.
Vite serves anything under web/public/ at the site root, so dropping marin-logo.svg there and referencing it from index.html is all the wiring needed. The SVG was run through svgo at integer precision (1396-unit viewBox → 1-unit = 0.07% of image width, still subpixel at every favicon size a browser renders) to shrink it from 4.5 KiB to 1.4 KiB.
Iris reachability (cluster name, reachable/unreachable, latency, controllerUrl) moves inline into the section header next to the Iris title instead of sitting in its own boxed card — one less visual enclosure, and the info reads at the same level as its section.

The Workers panel reports a single `healthy` count (workers where healthy=1, no active filter) instead of the confusing available/total pair that made it look like "has capacity" when it meant "in the fleet". The resource line now carries CPU and memory as currently-free amounts, plus TPU as a raw chip count across healthy workers — iris schedules TPUs at whole-VM granularity, so "available chips" collapses to a misleading near-zero number on a busy cluster.

The workers history chart is now per-region. The ring buffer samples store a `regions: Record<string, number>` map instead of a flat available/total pair; the chart renders one colored line per region using a stable palette and a bottom legend. ResponsiveContainer is replaced by a useContainerSize hook with a callback ref so the chart mounts correctly after data loads, without triggering recharts' "width(-1)/height(-1)" warning under React StrictMode.

Favicon: add infra/status-page/web/public/marin-logo.svg (svgo'd down to 1.4 KiB) and reference it from index.html.
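The per-region samples and the bounded history they live in could look like the sketch below. A hedged illustration: the type and class names are assumptions, but the shape matches the commit descriptions (2880 samples at a 30s cadence ≈ 24h, `regions` map per sample).

```typescript
// Hypothetical sample shape for the per-region workers history.
type WorkerSample = { atMs: number; regions: Record<string, number> };

// Fixed-capacity ring buffer: pushing past capacity drops the oldest
// sample, keeping memory bounded for the 24h window.
class RingBuffer<T> {
  private buf: T[] = [];
  constructor(private capacity: number) {}
  push(item: T): void {
    this.buf.push(item);
    if (this.buf.length > this.capacity) this.buf.shift();
  }
  toArray(): T[] {
    return [...this.buf];
  }
  get length(): number {
    return this.buf.length;
  }
}

// 24h at one sample per 30s.
const history = new RingBuffer<WorkerSample>(2880);
```

History held this way is lost on restart, which is why the commits note a persistent backing as a follow-up.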
rjpower
left a comment
Seems good to me (without looking at the actual output, of course)...
I don't want to be too precious about a vibe dashboard, but let's try to be consistent on UI and servers. E.g. a Python/uvicorn server here could just use the existing Iris tunnel & discovery libraries, and would automatically get flagged if we broke the types or APIs. I want to spend zero cycles maintaining the dashboard if we change something in Iris.
Ditto for the frontend, I'd prefer if we just stuck with Vue & rsbuild a la Iris rather than rolling our own.
But I'm not going to block an experiment...
```
@@ -0,0 +1,76 @@
// Iris controller reachability.
//
// In v1, the controller's only easily-callable JSON endpoint is /health
```
I think claude might be confused; all of the Connect RPC methods support JSON output natively.
(which is of course how the raw SQL endpoint is exported as well)
```ts
// UNSCHEDULABLE=8) are filtered to jobs that finished in the last
// 24h via finished_at_ms, which is always populated for terminal
// jobs (verified on the marin controller).
const BREAKDOWN_SQL = `
```
It doesn't need to be done now, but using the ListJobs API will likely/definitely be more stable.
- (marin-infra-dashboard) aggregating ferry ci, main-branch build health, iris reachability, worker counts, and job state — a single place to check marin's status without hopping between the actions tab, ssh tunnels, and the iris dashboard
- hono + ts server; vite + react 18 + jotai + @tanstack/react-query + tailwind 4 + recharts frontend; single package.json; multi-stage dockerfile; ~1.3k hand-written lines (the rest of the diff is package-lock.json)
- marin-canary-ferry.yaml + marin-datakit-smoke.yaml, last 30 workflow runs each via github rest, colored dot strip + success rate
- last 100 commits on main with their aggregate statusCheckRollup.state via graphql¹, full-width flex-1 dot strip, latest commit author avatar with a state-driven decoration (tilted crown for SUCCESS, 💩 for FAILURE/ERROR)
- /health reachability row, then two nested subsections:
  - workers: ExecuteRawQuery² raw sql against the controller's sqlite, 24h line chart fed by a 30s-cadence in-process ring buffer
  - jobs: terminal states windowed on finished_at_ms, both buckets scoped to root jobs via parent_job_id IS NULL to avoid child-job fan-out inflating the counts
- same deploy pattern as infra/iris-iap-proxy/: ./deploy.sh → gcloud beta run deploy --source=., direct vpc egress so the service can reach the controller's internal ip, native iap auth, min/max-instances=1 to keep the in-process ttl cache and worker ring buffer warm, github token in secret manager
- .github/workflows/marin-infra-dashboard.yaml gated on infra/status-page/** runs npm ci → npm run lint → npm run typecheck → npm run build, with eslint 9 flat config scoped per env (node globals for server/, browser + react-hooks for web/)
- infra/status-page/README.md
- worker_resource_history, or a proper worker_count_history table in the controller)
- (marin); marin-dev deferred

Footnotes
1. graphql requires auth even for public repos, so GITHUB_TOKEN is a hard requirement for the build panel, not just a rate-limit lift. the marin-community org enforces a 366-day fine-grained-pat lifetime policy — the secret in manager is `marin-status-page-github-token` ↩
2. `ExecuteRawQuery` is admin-only but `NullAuthInterceptor` promotes anonymous callers to admin on the marin controller. if auth ever gets enabled, both workers and jobs panels will error until a service-account bearer is plumbed ↩