
Release v34#35

Merged
wenaus merged 2 commits into main from infra/baseline-v34
Apr 21, 2026

Conversation


@wenaus wenaus commented Apr 21, 2026

v34 (2026-04-21)

Streaming MCP Moved Off mod_wsgi (swf-monitor)

The /swf-monitor/mcp/ endpoint now runs on a dedicated ASGI worker (uvicorn, swf-monitor-mcp-asgi.service on 127.0.0.1:8001) behind Apache ProxyPass. Everything else (/about/, /api/, /accounts/login/, PCS, static files) stays on mod_wsgi.

Why: django-mcp-server uses Starlette's StreamableHTTPSessionManager. Under WSGI, each streaming MCP session holds a thread via async_to_sync for the full session lifetime. A handful of concurrent MCP clients (OpenCode, Claude Code CLI, Ollama-backed scripts, python-httpx — any streamable-HTTP MCP client) was enough to saturate the pool and 503 every dynamic URL on the site. Isolating /mcp/ on an async worker removes that failure mode from the main app.

What changed operationally:

  • mod_wsgi tuned for burst resilience: threads=30, listen-backlog=500, queue-timeout=30, inactivity-timeout=300, graceful-timeout=15 — no request-timeout (would truncate /api/messages/stream/ SSE long-poll).
  • Proxy tuned for streaming: timeout=3600 keepalive=On disablereuse=On, proxy-sendchunked, no-gzip, CacheDisable on /mcp/.
  • swf-monitor-mcp-asgi.service systemd unit added (Restart=always, 2 uvicorn workers).
  • src/swf_monitor_project/asgi.py cleaned up — removed dead mcp_app.routing import (the module was replaced by the mcp_server package long ago; ASGI entrypoint was quietly broken).
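A hedged sketch of the split in Apache terms. The directive values come from the notes above, but the Location layout and backend path prefix are assumptions; the canonical conf lives in apache-swf-monitor.conf:

```apache
# Illustrative only — /mcp/ is proxied to the uvicorn ASGI worker on 8001;
# all other paths stay on mod_wsgi.
<Location "/swf-monitor/mcp/">
    SetEnv proxy-sendchunked 1    # stream chunked responses, don't buffer
    SetEnv no-gzip 1              # mod_deflate would break SSE/streaming
</Location>
ProxyPass        "/swf-monitor/mcp/" "http://127.0.0.1:8001/swf-monitor/mcp/" timeout=3600 keepalive=On disablereuse=On
ProxyPassReverse "/swf-monitor/mcp/" "http://127.0.0.1:8001/swf-monitor/mcp/"
CacheDisable "/swf-monitor/mcp/"
```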

Apache Config Auto-Sync on Deploy (swf-monitor)

apache-swf-monitor.conf in the repo is now the source of truth. deploy-swf-monitor.sh diffs it against the live /etc/httpd/conf.d/swf-monitor.conf on every deploy; if different, it backs up live, installs from the release, validates with httpd -t, and rolls back on failure. The Apache reload that happens every deploy (to recycle mod_wsgi for new Python code) picks up any conf change along with it.
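The diff/backup/validate/rollback flow can be sketched as a small shell function. This is a sketch under stated assumptions, not the actual deploy-swf-monitor.sh code: the function name, backup naming, and validator parameterization are illustrative.

```shell
# Hedged sketch of the conf-sync step. Paths are passed in; the validator
# defaults to "httpd -t" as described above.
sync_conf() {
    local repo_conf="$1" live_conf="$2" validate_cmd="${3:-httpd -t}"
    if cmp -s "$repo_conf" "$live_conf"; then
        echo "conf unchanged"
        return 0
    fi
    local backup="${live_conf}.bak.$(date +%Y%m%d%H%M%S)"
    cp "$live_conf" "$backup"        # back up the live conf first
    cp "$repo_conf" "$live_conf"     # install the release conf
    if $validate_cmd; then
        echo "conf updated"
    else
        cp "$backup" "$live_conf"    # roll back on validation failure
        echo "validation failed, rolled back" >&2
        return 1
    fi
}
```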

Why it matters: there was a six-week drift — the Mar 11 dce7abf fix for MCP IP restriction was committed to the repo but never reached live Apache because nothing copied it. setup-apache-deployment.sh regenerated the conf from a hardcoded heredoc (which had drifted from the repo canonical), and deploy-swf-monitor.sh didn't touch Apache conf at all. Closed: the setup script now copies apache-swf-monitor.conf and splits the dynamic LoadModule line out to /etc/httpd/conf.modules.d/20-swf-monitor-wsgi.conf.

ASGI worker is also recycled on every deploy — uvicorn loads code once at startup, so fresh Python code requires a restart. Bots already follow the same pattern (conditional on bot-specific code change).

PanDA Mattermost Bot — Multi-Server MCP with Progressive Tool Loading (swf-monitor)

The PanDA bot now orchestrates across seven external MCP servers plus the local swf-monitor MCP, selecting tools based on the user's question. New integrations:

  • LXR MCP server (github.com/BNLNPPS/lxr-mcp-server, new this release) — EIC code browser cross-reference. lxr_ident (definitions + references), lxr_search (ripgrep across repos), lxr_source (read source with line numbers), lxr_list (browse directories).
  • uproot MCP server (github.com/eic/uproot-mcp-server) — inspect ROOT files: list branches, read arrays, sample contents.
  • JLab-Rucio and BNL-Rucio MCP servers — query Rucio for EIC datasets, replicas, and rules.
  • GitHub MCP server — now uses the epic-capybara service account with write access for bot-driven automation on EIC repos.
  • epicdoc — RAG search over ePIC documentation (epic_doc_search, epic_doc_contents). Runs in-process inside the bot (not as a separate MCP server, and not inside WSGI — an initial attempt to host it in WSGI brought the monitor down, and it was moved out; see the debugging notes in the 2026-03-31 assessment).

With that many tools, "send the whole catalog to the LLM every turn" stops working. Two new techniques address that:

  • Progressive tool loading via semantic similarity. For each user question the bot embeds the question and ranks tools by server-prefixed cosine similarity, auto-truncating at a score cliff. The LLM sees a small, relevant slice rather than all hundreds of tools — and the rank is preserved through the display so the LLM can judge relevance.
  • 3-tier tool awareness. Every tool is visible by name + one-line catalog entry in the system prompt, so the LLM knows the full surface area exists at minimal token cost. Detailed schemas are fetched only for tools the LLM explicitly selects via select_tools. Server and suggestion context carries forward across thread turns, so follow-ups don't re-select from scratch.
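The ranking-and-cliff idea can be sketched in a few lines of Python. This is a minimal illustration, not the bot's code: the embed vectors, tool names, and cliff heuristic (cut at the largest score drop) are assumptions; the real implementation uses server-prefixed similarity over actual embeddings.

```python
# Sketch of progressive tool loading: rank tools by cosine similarity to the
# question embedding, then truncate at the steepest score drop ("cliff").
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_tools(question_vec, tool_vecs, min_keep=1):
    """Rank tools by similarity; keep everything above the largest score gap."""
    ranked = sorted(
        ((name, cosine(question_vec, vec)) for name, vec in tool_vecs.items()),
        key=lambda t: t[1], reverse=True,
    )
    if len(ranked) <= min_keep:
        return ranked
    # Find the largest gap between consecutive scores and cut there.
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cliff = max(range(len(gaps)), key=lambda i: gaps[i])
    return ranked[: max(min_keep, cliff + 1)]
```

The LLM then sees only the kept slice, in rank order, rather than the full catalog.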

Other bot improvements:

  • System prompt externalized to monitor_app/panda/system_prompt.txt and re-read on every message — prompt iteration no longer requires a bot restart.
  • DPID detection hardened. For job/task questions the bot verifies that any Data Provenance ID in the reply came from actual tool output before letting it through. Detection is now line-based and format-agnostic; trigger word AND a matching ID must both be present.
  • Bamboo log analysis integrated into panda_study_job for failed jobs — surfaces Harvester pilot-log analysis automatically when filebrowser lookup fails. Exposed to the LLM via an explicit log_analysis field the bot is instructed to surface.
  • Response style rules in the system prompt curb overenthusiastic replies (e.g., verbose explanations when a one-line answer suffices).
  • Server-side matplotlib plot rendering, nightly cron scripts to auto-update each MCP server repo.
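The line-based DPID check described above can be sketched as follows. The trigger words and ID regex here are illustrative assumptions, not the bot's actual patterns; the point is the AND condition (trigger word plus matching ID on the same line) and the verification against real tool output.

```python
# Sketch of the DPID hallucination check: an ID in the reply only counts if
# its line also contains a trigger word, and the ID must have appeared in
# actual tool output. TRIGGERS and ID_PATTERN are assumed, not the real ones.
import re

TRIGGERS = ("jeditaskid", "pandaid", "task", "job")   # assumed trigger words
ID_PATTERN = re.compile(r"\b\d{6,12}\b")              # assumed ID format

def unverified_ids(reply: str, tool_output: str) -> set:
    """Return IDs the reply asserts that never appeared in tool output."""
    seen = set(ID_PATTERN.findall(tool_output))
    suspect = set()
    for line in reply.splitlines():
        lower = line.lower()
        if not any(t in lower for t in TRIGGERS):
            continue  # no trigger word on this line: format-agnostic skip
        for ident in ID_PATTERN.findall(line):
            if ident not in seen:
                suspect.add(ident)
    return suspect
```

A non-empty result would block the reply from going through.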

New swf-monitor MCP Tool: panda_harvester_workers

Live Harvester pilot/worker counts via bamboo's askpanda_atlas. Useful for "what pilots are running right now?" without needing to grep through Harvester logs.

panda_harvester_workers(status='running', site='NERSC', resourcetype='SCORE', days=1)

Returns totals plus breakdown by status, site, and resourcetype. Clean, LLM-friendly response format.

PCS — Compose UX Polish + Programmatic Submission Path (swf-monitor)

Compose pages (Physics/EvGen/Simu/Reco tags, Datasets, Prod Configs, Prod Tasks):

  • Uniform button styling — all filled (solid) variants, dark-green accent on live edited values, consistent New-button placement in the left panel across all compose views.
  • Breadcrumbs and Cancel buttons point to compose views instead of the legacy list views.
  • Name-based URL params so compose views are bookmarkable and deep-linkable.
  • Owner-only edit enforcement on production configs (same discipline as tag edits).
  • Edit / Copy / New buttons no longer silently fail on prod config compose (previous type-argument mismatch fixed).
  • Compose panels for command and taskParamMap grow to fit content instead of forcing horizontal scroll.
  • Fixed type-argument mismatch in compose URL sync.

Production Tasks — submission artifacts:

A single read-only endpoint regenerates a task's submission artifact from current PCS state on every call (no DB writes):

GET /swf-monitor/pcs/api/prod-tasks/command/?name=<task_name>&fmt=<format>
fmt      Contents
condor   env-prefixed submit_csv.sh command
panda    prun command
jedi     taskParamMap for Client.insertTaskParams()
dump     Full view: task + dataset + all four tags + prod config + effective config

The parameter is fmt because DRF reserves format for its own content-negotiation.

New CLI pcs-task-cmd — stdlib-only Python client over that endpoint. The recommended way for production operators and automation to fetch submission artifacts (no Django import, no DB credentials):

# Inspect a task
pcs-task-cmd <task_name> --format dump

# Submit to JEDI (requires valid PanDA auth)
pcs-task-cmd <name> --format jedi | python -c '
import json, sys
from pandaclient import Client
print(Client.insertTaskParams(json.load(sys.stdin)))
'

# Pipe Condor command into bash
eval "$(pcs-task-cmd <name> --format condor)"

Environment: SWFMON_URL (default https://epic-devcloud.org/prod), optional SWFMON_TOKEN for non-public deployments.
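A stdlib-only client over that endpoint could look roughly like this. It is a sketch, not the pcs-task-cmd source: the function names are invented, and the token header scheme ("Token <value>", the usual DRF convention) is an assumption.

```python
# Minimal stdlib sketch of a client for the prod-tasks command endpoint.
# SWFMON_URL default and SWFMON_TOKEN come from the notes above; the
# Authorization header format is an assumed DRF token scheme.
import os
import urllib.parse
import urllib.request

def build_request(name, fmt="dump"):
    """Build the GET request for a task's submission artifact."""
    base = os.environ.get("SWFMON_URL", "https://epic-devcloud.org/prod")
    query = urllib.parse.urlencode({"name": name, "fmt": fmt})
    req = urllib.request.Request(
        f"{base}/swf-monitor/pcs/api/prod-tasks/command/?{query}"
    )
    token = os.environ.get("SWFMON_TOKEN")
    if token:
        req.add_header("Authorization", f"Token {token}")  # assumed scheme
    return req

def fetch_artifact(name, fmt="dump"):
    """Fetch and return the artifact body as text."""
    with urllib.request.urlopen(build_request(name, fmt)) as resp:
        return resp.read().decode("utf-8")
```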

JEDI taskParamMap now surfaced on task detail — build_task_params() renders the full param map users will submit, viewable and copyable directly from the compose page.

Deploy-Script Improvements (swf-monitor)

  • swf-monitor-mcp-asgi.service restart step — always restarts on deploy (uvicorn needs it).
  • Apache conf sync — described above.
  • Shared HuggingFace cache — deploy-swf-monitor.sh ensures /opt/swf-monitor/shared/hf_cache exists with open perms and appends HF_HOME= to production.env if missing. Bamboo and epicdoc reuse the cache across processes.
  • Bot restarts after health check, not before — avoids killing bots mid-request if Apache comes up broken.
  • Nightly cron (nightly-update-mcp-servers.sh, nightly-update-epicdoc.sh) — auto-updates sibling MCP-server repos and re-ingests ePIC documentation into epicdoc's ChromaDB store.
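The shared-cache step in the list above can be sketched as an idempotent shell function. The function name is invented, and the HF_HOME value written here is an assumption (the release notes elide it); the cache path matches the notes.

```shell
# Hedged sketch of the shared-HF-cache deploy step: create the cache dir
# with open perms and append HF_HOME to the env file exactly once.
sync_hf_cache() {
    local cache="$1" envfile="$2"
    mkdir -p "$cache"
    chmod 777 "$cache"    # open perms so web app, bots, and epicdoc share it
    if ! grep -q '^HF_HOME=' "$envfile" 2>/dev/null; then
        echo "HF_HOME=$cache" >> "$envfile"    # idempotent append
    fi
}
```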

PanDA Production Monitoring — Job Deep-Dive Enhancements (swf-monitor)

  • NERSC portal log URLs surfaced for Perlmutter jobs in panda_study_job — clickable links to the NERSC job portal alongside existing Harvester log URLs.
  • Bamboo log analysis runs on failed jobs automatically; LLM-friendly log_analysis field with fallback to Harvester URL when filebrowser fails.
  • Error field rename in /panda job output (source → component) — fixes a KeyError that surfaced on some job records.

Auth & API Changes (swf-monitor)

  • TunnelAuthMiddleware now requires an X-Remote-User header before auto-authenticating — anonymous proxy requests no longer get a free pass. Matches the threat model of the TunnelAuthentication DRF backend (also checks the header before acting).
  • /api/users/ response now includes email, first_name, last_name — enables richer devcloud account sync.
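The hardened tunnel-auth behavior amounts to a header gate, sketched here framework-agnostically. The function and its return convention are illustrative, not the actual TunnelAuthMiddleware code.

```python
# Sketch of the X-Remote-User gate: auto-authenticate only when the proxy
# actually forwarded an identity. Function name and semantics are illustrative.
from typing import Optional

def tunnel_auth_user(headers: dict) -> Optional[str]:
    """Return the remote username, or None when the header is absent/empty.

    Previously a request arriving via the tunnel was trusted unconditionally;
    now an anonymous proxy request (no X-Remote-User) gets no free pass.
    """
    user = headers.get("X-Remote-User", "").strip()
    return user or None
```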

Documentation

  • PRODUCTION_DEPLOYMENT.md refreshed for the two-backend layout, new setup-apache-deployment.sh behavior, and the full deploy step list (conf sync, ASGI worker restart).
  • MCP.md — ASGI/WSGI split documented, transport description corrected (it IS streamable HTTP), tool summary count corrected to 44, all tool categories added.
  • PCS.md — MCP Tools table corrected to the tools that actually exist.
  • JEDI design docs added: JEDI_INTEGRATION.md (architecture, field mapping, implementation plan) and JEDI_EPIC_PROPOSAL.md (technical proposal for PanDA team review) — roadmap for direct task submission to JEDI replacing the current prun CLI text generation.

Agent Resilience (swf-common-lib)

Further hardening of the BaseAgent lifecycle under unreliable infrastructure:

  • Agent-ID registration retries indefinitely on API failure (previously gave up after a bounded number of attempts). Agents starting into a partially-up monitor no longer silently fail to register.
  • Improved resilience to server restarts — agents survive transient monitor outages and resume their heartbeat loop cleanly on reconnection.

swf-testbed

No user-facing changes in v34 — administrative commits only (CLAUDE.md branch-reference updates, v33 release notes catch-up).


wenaus and others added 2 commits March 31, 2026 14:48
api_utils: extend retry delays to 137s total (was 34s), add 404 and
500 to retryable status codes — both appear transiently during deploy.

base_agent: wrap heartbeat calls so failures log a warning instead of
crashing the agent. Agent stays alive and retries next cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of raising RuntimeError after 6 attempts (~137s), retry with
capped backoff (5s, 10s, ... 60s). Transient API outages should not
permanently kill agents — they should wait for the API to come back.

Fixes the cascade where a monitor outage triggers supervisord FATAL
on all agents due to exhausted startretries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 21, 2026 12:56
@github-actions

Test Coverage Summary

Name                                  Stmts   Miss  Cover   Missing
-------------------------------------------------------------------
src/swf_common_lib/__init__.py            0      0   100%
src/swf_common_lib/api_utils.py          87     87     0%   8-198
src/swf_common_lib/base_agent.py        393    393     0%   5-780
src/swf_common_lib/config_utils.py       30     30     0%   7-102
src/swf_common_lib/logging_utils.py      33      0   100%
src/swf_common_lib/rest_logging.py       80     80     0%   8-157
-------------------------------------------------------------------
TOTAL                                   623    590     5%
============================== 5 passed in 0.43s ===============================

@wenaus wenaus merged commit 18311c0 into main Apr 21, 2026
5 checks passed
@wenaus wenaus deleted the infra/baseline-v34 branch April 21, 2026 12:57

Copilot AI left a comment


Pull request overview

This PR hardens swf-common-lib agent resilience during monitor/API restarts by making agent heartbeats non-fatal and making agent-ID registration retry instead of permanently failing.

Changes:

  • Wrap BaseAgent.run() heartbeats so transient API failures don’t crash the agent loop.
  • Expand API retry/backoff behavior and make get_next_agent_id() retry indefinitely with a capped delay.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File                              Description
src/swf_common_lib/base_agent.py  Prevents heartbeat failures from terminating the main agent loop.
src/swf_common_lib/api_utils.py   Adjusts retry policy and changes agent-ID acquisition to retry indefinitely with capped backoff.


Comment on lines +13 to 15
RETRY_DELAYS = (2, 5, 10, 20, 40, 60)
RETRYABLE_STATUS_CODES = {404, 500, 502, 503, 504}


Copilot AI Apr 21, 2026


RETRYABLE_STATUS_CODES now includes 404 (and 500), but api_request_with_retry() is documented to fail immediately on 4xx and only retry on 502/503/504 + connection/timeouts. Retrying 404s can mask misconfigured URLs/endpoints and adds ~137s of delay before surfacing a real client error. Either remove 404 from the retryable set (and consider whether 500 should be retried globally), or update the docstring to match the final retry policy.

Comment on lines +97 to +118
url = f"{monitor_url}/api/state/next-agent-id/"
attempt = 0
while True:
    try:
        response = api_request_with_retry('post', url, session=api_session, logger=logger)
        response.raise_for_status()

        data = response.json()
        if data.get('status') == 'success':
            agent_id = data.get('agent_id')
            logger.info(f"Got next agent ID from persistent state: {agent_id}")
            return str(agent_id)
        else:
            raise RuntimeError(f"API returned error: {data.get('error', 'Unknown error')}")

    except Exception as e:
        attempt += 1
        delay = min(60, 5 * attempt)  # 5, 10, 15, ... capped at 60s
        logger.warning(
            f"Failed to get agent ID (attempt {attempt}): {e} — retrying in {delay}s"
        )
        time.sleep(delay)

Copilot AI Apr 21, 2026


get_next_agent_id() retries indefinitely on any exception (including auth failures, persistent 4xx, JSON/schema errors, or server-side “status != success”). That can cause agents to hang forever on non-transient misconfiguration instead of failing loudly. Restrict the infinite retry loop to clearly transient conditions (e.g., connection/timeouts and selected 5xx/503 cases) and fail fast (or at least stop retrying) on 4xx/auth and non-retryable response payload errors.

Comment on lines +324 to +340
try:
    self.send_heartbeat()
except Exception:
    logging.warning("Initial heartbeat failed — server may be restarting, will retry")

logging.info(f"{self.agent_name} is running. Press Ctrl+C to stop.")
while True:
    time.sleep(60)  # Keep the main thread alive, heartbeats can be added here

    # Check connection status and attempt reconnection if needed
    if not self.mq_connected:
        self._attempt_reconnect()

    try:
        self.send_heartbeat()
    except Exception:
        logging.warning("Heartbeat failed — server may be restarting, will retry next cycle")

Copilot AI Apr 21, 2026


The heartbeat exception is swallowed without logging the underlying exception details, which makes it hard to distinguish transient server restarts from auth/config bugs. Capture the exception (e.g., except Exception as e) and include the error message and/or exc_info=True in the warning so operators can diagnose repeated failures.

@wenaus wenaus restored the infra/baseline-v34 branch April 21, 2026 13:13
