infra-dashboard: cloud run status page for marin#4649

Open
ravwojdyla-agent wants to merge 11 commits into main from worktree-rav-status-page

Conversation

@ravwojdyla-agent
Contributor

@ravwojdyla-agent ravwojdyla-agent commented Apr 11, 2026

  • one iap-gated cloud run service (marin-infra-dashboard) aggregating ferry ci, main-branch build health, iris reachability, worker counts, and job state — single place to check marin's status without hopping between the actions tab, ssh tunnels, and the iris dashboard
  • stack: node 20 + hono + ts server, vite + react 18 + jotai + @tanstack/react-query + tailwind 4 + recharts frontend, single package.json, multi-stage dockerfile, ~1.3k hand-written lines (the rest of the diff is package-lock.json)
  • panels
    • ferries: marin-canary-ferry.yaml + marin-datakit-smoke.yaml, last 30 workflow runs each via the github rest api, colored dot strip + success rate
    • github build: last 100 commits on main with their aggregate statusCheckRollup.state via graphql (footnote 1), full-width flex-1 dot strip, latest commit author avatar with a state-driven decoration (tilted crown for SUCCESS, 💩 for FAILURE / ERROR)
    • iris: /health reachability row, then two nested subsections:
      • workers — current healthy/active via ExecuteRawQuery raw sql against the controller's sqlite, 24h line chart fed by a 30s-cadence in-process ring buffer
      • jobs — two buckets, "right now" (pending / building / running, no time filter, matches the iris fleet tab) and "last 24h" (terminal states filtered by finished_at_ms), both scoped to root jobs via parent_job_id IS NULL to avoid child-job fan-out inflating the counts
  • deploy mirrors infra/iris-iap-proxy/: ./deploy.sh wraps gcloud beta run deploy --source=., direct vpc egress so the service can reach the controller's internal ip, native iap auth, min/max-instances=1 to keep the in-process ttl cache and worker ring buffer warm, github token in secret manager
  • controller data path uses iris's raw sql rpc in null-auth mode (footnote 2)
  • new .github/workflows/marin-infra-dashboard.yaml gated on infra/status-page/** runs npm ci, npm run lint, npm run typecheck, npm run build with eslint 9 flat config scoped per env (node globals for server/, browser + react-hooks for web/)
  • known limitations captured in infra/status-page/README.md
    • workers history is in-process and lost on cloud run restart (24h warm-up after each deploy — follow-up options: gcs object, bumped retention on worker_resource_history, or a proper worker_count_history table in the controller)
    • single cluster only (marin); marin-dev deferred
    • max one instance — more would split the ttl cache and N× upstream traffic

Footnotes

  1. graphql requires auth even for public repos, so GITHUB_TOKEN is a hard requirement for the build panel, not just a rate-limit lift. the `marin-community` org enforces a 366-day fine-grained-pat lifetime policy; the secret in secret manager is `marin-status-page-github-token`

  2. `ExecuteRawQuery` is admin-only but `NullAuthInterceptor` promotes anonymous callers to admin on the marin controller. if auth ever gets enabled, both workers and jobs panels will error until a service-account bearer is plumbed

Provides a single IAP-gated view of canary ferry / datakit smoke CI
health and iris controller reachability, so the team can check Marin's
state without hopping between the GitHub Actions tab and SSH tunnels
into the VPC. Follows the iris-iap-proxy deploy pattern: Cloud Run with
native IAP, Direct VPC egress, min/max-instances=1 to keep the
in-process TTL cache warm.
@ravwojdyla-agent ravwojdyla-agent added the agent-generated (Created by automation/agent) label Apr 11, 2026
@ravwojdyla-agent
Contributor Author

🤖 Specification (required for >500 LOC PRs; most of the line count is package-lock.json, hand-written source is closer to ~1300 LOC but still over the threshold).

Problem

The team has no single place to check Marin's health. Ferry CI status lives in the GitHub Actions tab; iris controller state is behind the VPC and currently requires SSH tunnels or the iris-iap-proxy dashboard (infra/iris-iap-proxy/, see #4630). A viewer checking "is anything broken" has to hop between tabs and tools.

Approach

One Cloud Run service (marin-status-page) deployed with the same pattern as infra/iris-iap-proxy/: native IAP, Direct VPC egress so it can reach the controller, min/max-instances=1 to keep the in-process TTL cache warm. Language is TypeScript end-to-end so the service is a single language; acknowledged deviation from lib/iris/dashboard/ (Vue + rsbuild) and infra/iris-iap-proxy/ (Python).

Server: Node 20 + Hono. Three JSON endpoints (/api/ferry, /api/orch, /api/health) and static serving of the built web assets from web/dist. Two sources (server/sources/githubActions.ts and server/sources/orch.ts) feed into a hand-rolled TTLCache (server/cache.ts) with in-flight promise coalescing, so concurrent requests for the same key hit upstream at most once per TTL window.
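The TTL-cache-with-coalescing behaviour described above can be sketched as follows. The `get(key, loader)` shape mirrors the description of server/cache.ts, but all names here are illustrative, not the PR's actual code:

```typescript
// Sketch of a TTL cache with in-flight promise coalescing (hypothetical names).
type Entry<T> = { value: Promise<T>; expiresAt: number };

class TTLCache<T> {
  private entries = new Map<string, Entry<T>>();
  constructor(private ttlMs: number) {}

  // Concurrent callers for the same key share one in-flight promise, so
  // upstream is hit at most once per TTL window.
  get(key: string, loader: () => Promise<T>): Promise<T> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;
    const value = loader().catch((err) => {
      // Don't cache failures: drop the entry so the next caller retries.
      this.entries.delete(key);
      throw err;
    });
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}
```

Storing the promise itself (rather than the resolved value) is what gives the coalescing: a second request arriving mid-fetch reuses the pending promise instead of issuing another upstream call.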

Web: Vite + React 18 + TypeScript + Jotai (UI state) + @tanstack/react-query (network state) + Tailwind 4. Two panels: FerryPanel shows latest run, 30-run colored-square history, and rolling success rate per workflow. OrchPanel shows reachability dot, round-trip latency, and expandable raw /health response.

Key decisions (details in the README):

  • The marin repo is public, so GITHUB_TOKEN exists only to lift the GH rate limit from 60/hr (unauth, per Cloud Run egress IP) to 5000/hr. The token grants nothing beyond what's public. Stored in Secret Manager as marin-status-page-github-token.
  • Controller discovery is a direct TS port of infra/iris-iap-proxy/discovery.py (same CONTROLLER_LABEL lookup, same 60s cache, same CONTROLLER_URL env var override for local dev).
  • orch panel is reachability + latency only in v1. The iris controller's only easily-callable JSON endpoint is /health (lib/iris/src/iris/cluster/controller/dashboard.py:396); richer worker/job data lives behind Connect RPC and would require generated TS stubs. Called out as a known limitation in the README.
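A minimal sketch of the reachability + latency probe the orch panel describes, assuming a plain fetch of /health with a timeout. The helper name and timeout value are invented for illustration:

```typescript
// Hypothetical probe: reports reachability, round-trip latency, and the raw
// /health body, matching the v1 panel described above.
async function probeHealth(baseUrl: string): Promise<
  | { reachable: boolean; latencyMs: number; body: unknown }
  | { reachable: false; latencyMs: number; error: string }
> {
  const start = Date.now();
  try {
    const res = await fetch(`${baseUrl}/health`, {
      // Bound the wait so a dead VPC path doesn't hang the panel.
      signal: AbortSignal.timeout(5_000),
    });
    return { reachable: res.ok, latencyMs: Date.now() - start, body: await res.json() };
  } catch (err) {
    return { reachable: false, latencyMs: Date.now() - start, error: String(err) };
  }
}
```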

Scope (from earlier design iterations, captured in scratch/projects/marin-status-page.md locally):

  • Ferry workflows: marin-canary-ferry.yaml and marin-datakit-smoke.yaml. Adding more = one line in FERRY_WORKFLOWS.
  • History window: last 30 runs on main.
  • Clusters: marin only. marin-dev deferred.
  • wandb panels: deferred (see "Future work" in README).

Tests

No automated test suite for this PR. Verified manually:

  • npm install clean (210 packages).
  • tsc --noEmit on both tsconfig.server.json and tsconfig.web.json — no errors.
  • vite build produces web/dist (199 KB, 64 KB gzipped), tsc -p tsconfig.server.json produces server/dist.
  • node server/dist/main.js boots; /api/health returns {"status":"ok"}.
  • /api/orch with ORCH_FIXTURE=1 returns canned data (used for UI dev without a VPC tunnel).
  • /api/ferry against real GitHub returns real history for both configured workflows, with correct conclusion/duration/SHA parsing.
  • Dev path (npx tsx server/main.ts) also boots cleanly.

Not exercised in this PR: Dockerfile build (no Docker in the sandbox; multi-stage layout follows Cloud Run buildpack conventions and mirrors iris-iap-proxy) and deploy.sh end-to-end (needs gcloud + Cloud Build + IAP bindings). Worth a docker build . and a staging deploy before merging if you want full confidence; otherwise the first ./deploy.sh catches anything runtime-specific.

Follow-ups (intentionally out of scope)

  • Richer orch data once the iris controller grows JSON endpoints or we vendor Connect TS stubs.
  • wandb panels (rendered metrics via wandb.Api() or scheduled screenshot via Playwright + GCS).
  • marin-dev cluster.
  • Historical ferry trends beyond the cache TTL.
  • Slack alert on ferry red streak.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e0f5e81d8b


<span className="ml-auto text-slate-400">
{successRate === null
? "—"
: `${Math.round(successRate * 100)}% success over ${wf.history.length}`}


P2: Show success-rate denominator for completed runs

successRate is computed on completed runs only in server/sources/githubActions.ts (successes / completed.length), but the UI text here reports the denominator as wf.history.length. When the history includes queued/in-progress runs, the panel displays a mathematically incorrect “X% success over N” value and understates reliability.
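One way to fix this is to compute both the percentage and the displayed denominator over completed runs only; field and function names here are hypothetical, not the PR's actual code:

```typescript
// A run with a null conclusion is still queued or in progress.
type Run = { conclusion: "success" | "failure" | null };

// Percentage and denominator both derive from the same completed subset,
// so the label stays mathematically consistent.
function successLabel(history: Run[]): string {
  const completed = history.filter((r) => r.conclusion !== null);
  if (completed.length === 0) return "—";
  const successes = completed.filter((r) => r.conclusion === "success").length;
  return `${Math.round((successes / completed.length) * 100)}% success over ${completed.length}`;
}
```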


Comment on lines +27 to +29
const results = await Promise.all(
FERRY_WORKFLOWS.map((wf) => ferryCache.get(wf.file, () => fetchWorkflowStatus(wf))),
);


P2: Isolate ferry workflow failures to avoid full API 500

This route aggregates workflow calls with Promise.all, so any thrown exception from one fetchWorkflowStatus call (for example a transient GitHub network error or JSON parse failure) rejects the whole request and returns a 500 for /api/ferry. That prevents the dashboard from showing partial results and per-workflow error fields during upstream flakiness.
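A common way to get the per-workflow isolation the comment asks for is Promise.allSettled, sketched here with hypothetical names:

```typescript
// Each workflow resolves to either data or a per-workflow error field, so
// one upstream failure no longer turns the whole route into a 500.
type WorkflowResult<T> = { file: string; data?: T; error?: string };

async function fetchAll<T>(
  files: string[],
  fetchOne: (file: string) => Promise<T>,
): Promise<WorkflowResult<T>[]> {
  const settled = await Promise.allSettled(files.map(fetchOne));
  return settled.map((r, i) =>
    r.status === "fulfilled"
      ? { file: files[i], data: r.value }
      : { file: files[i], error: String(r.reason) },
  );
}
```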


Rename header to "Marin Infra Status". Restructure so the Iris section
wraps reachability plus two new subsections:

- Workers: current healthy/active count via ExecuteRawQuery against the
  controller's SQLite, plus a 24h recharts line chart fed by an
  in-process ring buffer (2880 samples @ 30s). History is lost on
  restart; follow-ups to a persistent backing are noted in the README.
- Jobs: 24h breakdown grouped by JobState enum, rendered as per-state
  horizontal bars with counts and percentages.

Add a GitHub Build panel above Iris that shows per-commit aggregate CI
rollup for the last 100 commits on main, via the GraphQL
statusCheckRollup field. One request, 100 commits, GITHUB_TOKEN required
(GraphQL needs auth even for public repos). Success rate is computed
over finalized commits only so in-flight builds don't drag it down.
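The one-request/100-commit query described above might look roughly like this. The endpoint and the statusCheckRollup field are GitHub's public GraphQL API; the function shape and variable names are illustrative, not the PR's actual code:

```typescript
// statusCheckRollup.state aggregates all checks on a commit into one value
// (SUCCESS, FAILURE, ERROR, PENDING, EXPECTED).
const QUERY = `
  query ($owner: String!, $name: String!) {
    repository(owner: $owner, name: $name) {
      ref(qualifiedName: "main") {
        target {
          ... on Commit {
            history(first: 100) {
              nodes {
                oid
                author { user { avatarUrl } }
                statusCheckRollup { state }
              }
            }
          }
        }
      }
    }
  }
`;

async function fetchBuildsOnMain(owner: string, name: string, token: string) {
  // GraphQL requires a token even for public repos, unlike the REST API.
  const res = await fetch("https://api.github.com/graphql", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ query: QUERY, variables: { owner, name } }),
  });
  if (!res.ok) throw new Error(`GitHub GraphQL ${res.status}`);
  return res.json();
}
```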

Rename orch → iris throughout (module, types, hook, route, env var,
cache, README) to match the fact that Iris *is* the orchestrator.
@ravwojdyla ravwojdyla changed the title [status-page] Add Cloud Run status dashboard for ferry + orch Marin Infra Status page for the office Apr 11, 2026
Rename the Cloud Run service to marin-infra-dashboard (package name,
startup log, GitHub user-agent, deploy.sh SERVICE, README title). The
service account and GitHub token secret keep their historical
marin-status-page* names — GCP does not support renaming either, and
the deploy.sh comment now documents the discrepancy.

Add .github/workflows/marin-infra-dashboard.yaml gated on
infra/status-page/** so PRs touching the dashboard get lint +
typecheck + build as a first-class check in branch protection,
matching the marin-unit-tests two-job changes/build pattern.

Wire ESLint 9 flat config covering server (Node) and web (React +
hooks) with scoped globals per environment. Lint runs in CI before
typecheck so the fastest failure case surfaces first.
@ravwojdyla ravwojdyla requested a review from rjpower April 11, 2026 03:27
The deployed contract is that the controller URL is reachable and
GITHUB_TOKEN is set — fixture mode added dead code paths to every
source file for an offline dev loop nobody uses. Drop the *_FIXTURE
env vars, the per-source fixture snapshot helpers, the ring-buffer
fixture prefill, and WorkerHistory.seed() which only existed for that
prefill. Panels that depend on the controller now surface a real
error instead of masking it behind synthetic data.

Two findings from the PR 4649 review bot are fixed at the same time
since they live in adjacent files:
- FerryPanel's success-rate denominator used wf.history.length while
  the server computes successRate over completed.length only, so the
  percentage and the denominator shown next to it were inconsistent
  whenever the history contained in-progress runs.
- fetchWorkflowStatus and fetchBuildsOnMain did not wrap their outer
  fetch() calls in try/catch, so a network-level throw (DNS, TLS,
  connection refused) would propagate out of Promise.all in the route
  handler and turn /api/ferry or /api/builds into a 500 instead of a
  per-source error snapshot.
The Build panel now fetches the latest commit author's avatar via the
GraphQL API and overlays a state-driven decoration: a gold crown
tilted 12 degrees over its bottom pivot when the rollup is SUCCESS, a
poop emoji with the same tilt when it is FAILURE or ERROR. Other
states fall back to a plain avatar. Easier to read the health of main
at a glance than a color dot alone.
Build-status strip now uses flex-1 per dot so the 100 commits span the
full card width instead of leaving empty space on the right. Renamed
the "Datakit smoke" workflow label to "Datakit ferry" and the
FerryPanel heading from "Ferry workflows" to just "Ferries" to keep
the terminology consistent across the dashboard.
The previous query filtered every state by submitted_at_ms in the
last 24h, which hid long-running experiments that started earlier —
the panel said 40 running while the iris Fleet dashboard showed 58.
Split the SQL so in-flight states (pending / building / running) are
always counted regardless of submission time while terminal states
(succeeded / failed / killed / worker_failed / unschedulable) are
filtered to finished_at_ms in the last 24h. Also scope to root jobs
(parent_job_id IS NULL) so child-job fan-out does not inflate the
totals.

Render the two buckets as separate subsections inside the Jobs card
with their own totals and state bars so viewers can distinguish
"what is in flight right now" from "how did today's work resolve".
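The split described above can be sketched as a single query. Table, column, and state names follow the commit message and should be treated as assumptions about the controller's schema; state values are written as names for readability, though the controller may store them as JobState enum integers:

```typescript
// In-flight states have no time filter (matching the iris Fleet tab);
// terminal states are bounded to the last 24h via finished_at_ms; both
// buckets are scoped to root jobs so child-job fan-out doesn't inflate
// the totals. ":cutoff_ms" is a bind parameter, now - 24h in epoch ms.
const JOBS_SQL = `
  SELECT state, COUNT(*) AS n
  FROM jobs
  WHERE parent_job_id IS NULL
    AND (
      state IN ('pending', 'building', 'running')
      OR (
        state IN ('succeeded', 'failed', 'killed',
                  'worker_failed', 'unschedulable')
        AND finished_at_ms >= :cutoff_ms
      )
    )
  GROUP BY state
`;
```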
Dropped recharts' ResponsiveContainer in favour of a hand-rolled
useContainerSize hook that measures the chart div via ResizeObserver
and passes explicit pixel dimensions to LineChart. ResponsiveContainer
emitted a "width(-1) / height(-1)" warning under React StrictMode
because its internal effect could run before the container's first
layout committed. The hook uses a callback ref — a useRef-based
version ran its effect once on mount while the div was still hidden
behind the data-loading gate, missed the ref entirely, and never
re-set the observer when the element finally appeared.
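The core of the measuring approach, stripped of React specifics, can be sketched as below. In the real hook this logic would live inside a useCallback-based ref so the observer attaches when the chart div actually mounts; names here are hypothetical:

```typescript
type Size = { width: number; height: number };

// Attach a ResizeObserver and report explicit pixel dimensions; the chart
// then receives concrete width/height instead of relying on
// ResponsiveContainer's internal measurement effect.
function observeContainerSize(
  node: Element,
  onSize: (size: Size) => void,
): () => void {
  const observer = new ResizeObserver((entries) => {
    for (const entry of entries) {
      const { width, height } = entry.contentRect;
      onSize({ width, height });
    }
  });
  observer.observe(node);
  // Cleanup mirrors a React callback-ref/effect teardown.
  return () => observer.disconnect();
}
```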
@ravwojdyla-agent ravwojdyla-agent changed the title Marin Infra Status page for the office infra-dashboard: cloud run status page for marin Apr 11, 2026
Vite serves anything under web/public/ at the site root, so dropping
marin-logo.svg there and referencing it from index.html is all the
wiring needed. The SVG was run through svgo at integer precision
(1396-unit viewBox → 1-unit = 0.07% of image width, still subpixel
at every favicon size a browser renders) to shrink it from 4.5 KiB
to 1.4 KiB.
Iris reachability (cluster name, reachable/unreachable, latency,
controllerUrl) moves inline into the section header next to the Iris
title instead of sitting in its own boxed card — one less visual
enclosure and the info reads at the same level as its section.

Workers panel reports a single `healthy` count (workers where
healthy=1, no active filter) instead of the confusing
available/total pair that made it look like "has capacity" when it
meant "in the fleet". Resource line now carries CPU and memory as
currently-free amounts plus TPU as raw chip count across healthy
workers — iris schedules TPU at whole-VM granularity so "available
chips" collapses to a misleading near-zero number on a busy cluster.

Workers history chart is now per-region. The ring buffer samples
store a `regions: Record<string, number>` map instead of a flat
available/total pair; the chart renders one colored line per region
using a stable palette and a bottom legend. ResponsiveContainer is
replaced by a useContainerSize hook with a callback ref so the chart
mounts correctly after data loads without triggering recharts'
"width(-1)/height(-1)" warning under React StrictMode.

Favicon: add infra/status-page/web/public/marin-logo.svg (svgo'd
down to 1.4 KiB) and reference it from index.html.
Collaborator

@rjpower rjpower left a comment


Seems good to me (without looking at the actual output, of course)...

I don't want to be too precious about a vibe dashboard, but let's try to be consistent on UI and servers. E.g. a Python/uvicorn server here could just use the existing Iris tunnel & discovery libraries, and would automatically get flagged if we broke the types or APIs. I want to spend zero cycles maintaining the dashboard if we change something in Iris.

Ditto for the frontend, I'd prefer if we just stuck with Vue & rsbuild a la Iris rather than rolling our own.

But I'm not going to block an experiment...

@@ -0,0 +1,76 @@
// Iris controller reachability.
//
// In v1, the controller's only easily-callable JSON endpoint is /health
Collaborator


I think claude might be confused, all of the Connect RPC methods support JSON output natively.

(which is of course how the raw SQL endpoint is exported as well)

// UNSCHEDULABLE=8) are filtered to jobs that finished in the last
// 24h via finished_at_ms, which is always populated for terminal
// jobs (verified on the marin controller).
const BREAKDOWN_SQL = `
Collaborator


It doesn't need to be done now, but using the ListJobs API will likely/definitely be more stable.
