Dx local safety and pending status #29

Merged: robmsmt merged 2 commits into main from dx-local-safety-and-pending-status on May 21, 2026
@robmsmt robmsmt commented May 21, 2026

No description provided.

robmsmt and others added 2 commits May 20, 2026 11:04
Frontend
- Tier (24/7 vs Slurm badge) now derived from the peer's launched_by
  label instead of a hardcoded model list. Persistent launchers
  (k8s, cscs_L1) → 24/7; anything else (username from model-launch,
  empty) → Slurm. New helper getTierFromLaunchedBy replaces
  getModelTier in ModelCard and ModelList.
- Pending status surfaces on the collapsed card via a traffic-light
  dot (green/amber+pulsing/grey) AND a muted-grey tile treatment
  (grayscale logo+badges, gray-500/400 text, faint background wash).
  Amber dot stays vivid against the grey card.
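
A minimal sketch of the tier derivation described above. Only the helper name `getTierFromLaunchedBy` and the `k8s` / `cscs_L1` labels come from the commit message; the `Tier` type, the exact-match set, and the handling of missing values are assumptions made for illustration, not the merged implementation.

```typescript
// Hypothetical sketch: the Tier type and exact-match lookup are assumptions.
type Tier = "24/7" | "Slurm";

// Launchers treated as persistent, per the commit message.
const PERSISTENT_LAUNCHERS: ReadonlySet<string> = new Set(["k8s", "cscs_L1"]);

// Maps a peer's launched_by label to the badge tier.
// Anything that is not a persistent launcher (a username, an empty
// string, or a missing label) falls through to "Slurm".
function getTierFromLaunchedBy(launchedBy: string | null | undefined): Tier {
  const label = (launchedBy ?? "").trim();
  return PERSISTENT_LAUNCHERS.has(label) ? "24/7" : "Slurm";
}
```

With this shape, `getTierFromLaunchedBy("k8s")` yields `"24/7"`, while a username or an empty label yields `"Slurm"`, so new models no longer need to be added to a hardcoded list.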

Local-dev safety
- Makefile guards _guard-local-db and _guard-local-api refuse to run
  if .env DATABASE_URL or frontend/.env VITE_API_URL points at a
  non-localhost host. Closes a foot-gun where prod creds in .env let
  `make dummy-run` attempt alembic upgrade head against prod Neon.
- Committed .env.example and frontend/.env.example templates (with
  !.env.example in .gitignore so the un-ignore actually works) so a
  fresh clone bootstraps cleanly via `make run`.
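
The guards themselves live in the Makefile, which is not shown here. The check they perform can be sketched as follows, in TypeScript for consistency with the frontend code. The function name `isLocalUrl`, the set of hosts counted as local, and the choice to reject unparseable values are all assumptions.

```typescript
// Hypothetical sketch of the "is this URL local?" check; not the Makefile's code.
// Hosts counted as local are an assumption.
const LOCAL_HOSTS: ReadonlySet<string> = new Set(["localhost", "127.0.0.1", "[::1]"]);

// Returns true only when the URL parses and its host is a local address.
// Empty or malformed values are treated as unsafe (assumed policy).
function isLocalUrl(raw: string): boolean {
  try {
    const host = new URL(raw).hostname.toLowerCase();
    return LOCAL_HOSTS.has(host);
  } catch {
    return false;
  }
}
```

A guard built on this would refuse to continue when `isLocalUrl` returns false for `DATABASE_URL` or `VITE_API_URL`, which is what stops a migration from reaching a remote database by accident.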

Fixture
- dnt_table_dev_live.json gains 6 real k8s peer entries pulled from
  prod DNT, so `make dummy-run` shows a representative mix of k8s
  24/7 models + Slurm jobs (incl. a pending one) for UI iteration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

In a multi-node TP replica only rank-0 registers the `llm` service; the
other ranks run as background workers and their OCFs stay status=pending
forever. The frontend was picking the head as the first peer whose id
matched the model id — but every peer in the group shares that id (from
the served_model_name label), so rank-N could win and the whole replica
would render as pending despite serving traffic fine.

Surface `has_service` on each peer entry from the backend and prefer it
in the frontend's head selection. Same change also makes the expanded-
view "head" label match the node sglang actually runs the API server on.
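
The head selection described above can be sketched roughly as follows. The `has_service` field and the pending status come from the commit message; the `Peer` shape, the `pickHead` name, and the fallback to the first matching peer are assumptions.

```typescript
// Hypothetical peer shape: only has_service and status are named in the commit.
interface Peer {
  id: string;
  status: string;
  has_service?: boolean;
}

// Picks the head of a replica group. Every rank shares the model id, so
// matching on id alone can return a background worker stuck in pending.
// Prefer the peer that registered the service, and fall back to the
// first id match when no peer reports has_service (assumed fallback).
function pickHead(peers: readonly Peer[], modelId: string): Peer | undefined {
  const group = peers.filter((p) => p.id === modelId);
  return group.find((p) => p.has_service === true) ?? group[0];
}
```

Because rank-0 is the only peer with `has_service` set, the replica's displayed status follows the node that actually serves traffic, regardless of the order in which peers are listed.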
@robmsmt robmsmt merged commit cc680aa into main May 21, 2026
2 checks passed