Release v34 #35
api_utils: extend retry delays to 137s total (was 34s), add 404 and 500 to retryable status codes — both appear transiently during deploy. base_agent: wrap heartbeat calls so failures log a warning instead of crashing the agent. Agent stays alive and retries next cycle. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of raising RuntimeError after 6 attempts (~137s), retry with capped backoff (5s, 10s, ... 60s). Transient API outages should not permanently kill agents — they should wait for the API to come back. Fixes the cascade where a monitor outage triggers supervisord FATAL on all agents due to exhausted startretries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR hardens swf-common-lib agent resilience during monitor/API restarts by making agent heartbeats non-fatal and making agent-ID registration retry instead of permanently failing.
Changes:
- Wrap `BaseAgent.run()` heartbeats so transient API failures don't crash the agent loop.
- Expand API retry/backoff behavior and make `get_next_agent_id()` retry indefinitely with a capped delay.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/swf_common_lib/base_agent.py` | Prevents heartbeat failures from terminating the main agent loop. |
| `src/swf_common_lib/api_utils.py` | Adjusts retry policy and changes agent-ID acquisition to retry indefinitely with capped backoff. |
```python
RETRY_DELAYS = (2, 5, 10, 20, 40, 60)
RETRYABLE_STATUS_CODES = {404, 500, 502, 503, 504}
```
`RETRYABLE_STATUS_CODES` now includes 404 (and 500), but `api_request_with_retry()` is documented to fail immediately on 4xx and retry only on 502/503/504 plus connection/timeout errors. Retrying 404s can mask misconfigured URLs/endpoints and add ~137s of delay before a real client error surfaces. Either remove 404 from the retryable set (and consider whether 500 should be retried globally), or update the docstring to match the final retry policy.
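A sketch of the narrower policy the comment suggests: retry only connection/timeout errors and transient gateway codes, and fail fast on 4xx. The names mirror `api_utils`; the predicate itself is illustrative, not the library's actual implementation.

```python
import requests

# Transient-only retry set, per the review suggestion: no 404, and no
# blanket 500 (whether 500 belongs here is the open question above).
RETRYABLE_STATUS_CODES = {502, 503, 504}

def is_retryable(exc):
    """Return True only for clearly transient failures."""
    if isinstance(exc, (requests.ConnectionError, requests.Timeout)):
        return True
    if isinstance(exc, requests.HTTPError) and exc.response is not None:
        return exc.response.status_code in RETRYABLE_STATUS_CODES
    return False
```

With a predicate like this, a 404 surfaces on the first attempt instead of after the full backoff schedule.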
```python
url = f"{monitor_url}/api/state/next-agent-id/"
attempt = 0
while True:
    try:
        response = api_request_with_retry('post', url, session=api_session, logger=logger)
        response.raise_for_status()

        data = response.json()
        if data.get('status') == 'success':
            agent_id = data.get('agent_id')
            logger.info(f"Got next agent ID from persistent state: {agent_id}")
            return str(agent_id)
        else:
            raise RuntimeError(f"API returned error: {data.get('error', 'Unknown error')}")

    except Exception as e:
        attempt += 1
        delay = min(60, 5 * attempt)  # 5, 10, 15, ... capped at 60s
        logger.warning(
            f"Failed to get agent ID (attempt {attempt}): {e} — retrying in {delay}s"
        )
        time.sleep(delay)
```
get_next_agent_id() retries indefinitely on any exception (including auth failures, persistent 4xx, JSON/schema errors, or server-side “status != success”). That can cause agents to hang forever on non-transient misconfiguration instead of failing loudly. Restrict the infinite retry loop to clearly transient conditions (e.g., connection/timeouts and selected 5xx/503 cases) and fail fast (or at least stop retrying) on 4xx/auth and non-retryable response payload errors.
```python
try:
    self.send_heartbeat()
except Exception:
    logging.warning("Initial heartbeat failed — server may be restarting, will retry")

logging.info(f"{self.agent_name} is running. Press Ctrl+C to stop.")
while True:
    time.sleep(60)  # Keep the main thread alive, heartbeats can be added here

    # Check connection status and attempt reconnection if needed
    if not self.mq_connected:
        self._attempt_reconnect()

    try:
        self.send_heartbeat()
    except Exception:
        logging.warning("Heartbeat failed — server may be restarting, will retry next cycle")
```
The heartbeat exception is swallowed without logging the underlying exception details, which makes it hard to distinguish transient server restarts from auth/config bugs. Capture the exception (e.g., except Exception as e) and include the error message and/or exc_info=True in the warning so operators can diagnose repeated failures.
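A minimal sketch of the suggested fix, keeping the non-fatal behavior while preserving error details in the log. The standalone wrapper function is illustrative; in `BaseAgent` this would remain an inline `try/except`.

```python
import logging

def safe_heartbeat(send_heartbeat):
    """Send a heartbeat; log-and-continue on failure, keeping the cause."""
    try:
        send_heartbeat()
        return True
    except Exception as e:
        # exc_info=True preserves the traceback so operators can tell a
        # transient server restart apart from an auth/config bug.
        logging.warning("Heartbeat failed — will retry next cycle: %s", e, exc_info=True)
        return False
```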
v34 (2026-04-21)
Streaming MCP Moved Off mod_wsgi (swf-monitor)
The `/swf-monitor/mcp/` endpoint now runs on a dedicated ASGI worker (uvicorn, `swf-monitor-mcp-asgi.service` on `127.0.0.1:8001`) behind Apache `ProxyPass`. Everything else (`/about/`, `/api/`, `/accounts/login/`, PCS, static files) stays on mod_wsgi.

Why: `django-mcp-server` uses Starlette's `StreamableHTTPSessionManager`. Under WSGI, each streaming MCP session holds a thread via `async_to_sync` for the full session lifetime. A handful of concurrent MCP clients (OpenCode, Claude Code CLI, Ollama-backed scripts, python-httpx — any streamable-HTTP MCP client) was enough to saturate the pool and 503 every dynamic URL on the site. Isolating `/mcp/` on an async worker removes that failure mode from the main app.

What changed operationally:
- mod_wsgi tuning: `threads=30`, `listen-backlog=500`, `queue-timeout=30`, `inactivity-timeout=300`, `graceful-timeout=15` — no `request-timeout` (it would truncate the `/api/messages/stream/` SSE long-poll).
- `ProxyPass` to the ASGI worker with `timeout=3600 keepalive=On disablereuse=On`; `proxy-sendchunked`, `no-gzip`, and `CacheDisable` set on `/mcp/`.
- `swf-monitor-mcp-asgi.service` systemd unit added (`Restart=always`, 2 uvicorn workers).
- `src/swf_monitor_project/asgi.py` cleaned up — removed the dead `mcp_app.routing` import (the module was replaced by the `mcp_server` package long ago; the ASGI entrypoint was quietly broken).

Apache Config Auto-Sync on Deploy (swf-monitor)
`apache-swf-monitor.conf` in the repo is now the source of truth. `deploy-swf-monitor.sh` diffs it against the live `/etc/httpd/conf.d/swf-monitor.conf` on every deploy; if they differ, it backs up the live conf, installs from the release, validates with `httpd -t`, and rolls back on failure. The Apache reload that happens on every deploy (to recycle mod_wsgi for new Python code) picks up any conf change along with it.
dce7abffix for MCP IP restriction was committed to the repo but never reached live Apache because nothing copied it.setup-apache-deployment.shregenerated the conf from a hardcoded heredoc (that had drifted from the repo canonical), anddeploy-swf-monitor.shdidn't touch Apache conf at all. Closed: setup script nowcpsapache-swf-monitor.confand splits the dynamicLoadModuleline out to/etc/httpd/conf.modules.d/20-swf-monitor-wsgi.conf.ASGI worker is also recycled on every deploy — uvicorn loads code once at startup, so fresh Python code requires a restart. Bots already follow the same pattern (conditional on bot-specific code change).
PanDA Mattermost Bot — Multi-Server MCP with Progressive Tool Loading (swf-monitor)
The PanDA bot now orchestrates across seven external MCP servers plus the local swf-monitor MCP, selecting tools based on the user's question. New integrations:
- LXR (`github.com/BNLNPPS/lxr-mcp-server`, new this release) — EIC code browser cross-reference. Tools: `lxr_ident` (definitions + references), `lxr_search` (ripgrep across repos), `lxr_source` (read source with line numbers), `lxr_list` (browse directories).
- uproot (`github.com/eic/uproot-mcp-server`) — inspect ROOT files: list branches, read arrays, sample contents.
- `epic-capybara` service account with write access for bot-driven automation on EIC repos.
- epicdoc (`epic_doc_search`, `epic_doc_contents`). Runs in-process inside the bot (not as a separate MCP server, not inside WSGI — an initial attempt to host it in WSGI brought the monitor down and was moved; see the debugging notes in the 2026-03-31 assessment).

With that many tools, "send the whole catalog to the LLM every turn" stops working. Two new techniques address that:
- `select_tools`. Server and suggestion context carries forward across thread turns, so follow-ups don't re-select from scratch.

Other bot improvements:
- `monitor_app/panda/system_prompt.txt` — the system prompt is re-read on every message; prompt iteration no longer requires a bot restart.
- `panda_study_job` for failed jobs — surfaces Harvester pilot-log analysis automatically when filebrowser lookup fails. Exposed to the LLM via an explicit `log_analysis` field the bot is instructed to surface.

New swf-monitor MCP Tool: `panda_harvester_workers`

Live Harvester pilot/worker counts via bamboo's `ask` (`panda_atlas`). Useful for "what pilots are running right now?" without needing to grep through Harvester logs. Returns totals plus a breakdown by status, site, and resourcetype, in a clean, LLM-friendly response format.
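The aggregation could be computed along these lines (a hypothetical sketch; the real tool's field names and response schema are not shown in these notes):

```python
from collections import Counter

def summarize_workers(workers):
    """Aggregate Harvester worker records into totals plus per-dimension breakdowns."""
    return {
        "total": len(workers),
        "by_status": dict(Counter(w["status"] for w in workers)),
        "by_site": dict(Counter(w["site"] for w in workers)),
        "by_resourcetype": dict(Counter(w["resourcetype"] for w in workers)),
    }
```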
PCS — Compose UX Polish + Programmatic Submission Path (swf-monitor)
Compose pages (Physics/EvGen/Simu/Reco tags, Datasets, Prod Configs, Prod Tasks):
- `command` and `taskParamMap` grow to fit content instead of forcing horizontal scroll.

Production Tasks — submission artifacts:
A single read-only endpoint regenerates a task's submission artifact from current PCS state on every call (no DB writes):
The `fmt` parameter selects the artifact:
- `fmt=condor` — `submit_csv.sh` command
- `fmt=panda` — `prun` command
- `fmt=jedi` — `taskParamMap` for a `Client.insertTaskParams()` dump
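A stdlib-only client for this endpoint can be sketched as follows. The URL path `/api/prodtasks/<id>/artifact/` is hypothetical; only the `fmt` parameter and the `SWFMON_URL`/`SWFMON_TOKEN` environment variables come from these notes.

```python
import os
import urllib.request

def build_artifact_url(base, task_id, fmt):
    # Path is illustrative; only the fmt query parameter is documented here.
    return f"{base.rstrip('/')}/api/prodtasks/{task_id}/artifact/?fmt={fmt}"

def fetch_artifact(task_id, fmt="jedi"):
    base = os.environ.get("SWFMON_URL", "https://epic-devcloud.org/prod")
    req = urllib.request.Request(build_artifact_url(base, task_id, fmt))
    token = os.environ.get("SWFMON_TOKEN")
    if token:
        # Token auth only needed for non-public deployments
        req.add_header("Authorization", f"Token {token}")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```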
fmtbecause DRF reservesformatfor its own content-negotiation.New CLI
pcs-task-cmd— stdlib-only Python client over that endpoint. The recommended way for production operators and automation to fetch submission artifacts (no Django import, no DB credentials):Environment:
`SWFMON_URL` (default `https://epic-devcloud.org/prod`), optional `SWFMON_TOKEN` for non-public deployments.

JEDI taskParamMap now surfaced on task detail — `build_task_params()` renders the full param map users will submit, viewable and copyable directly from the compose page.

Deploy-Script Improvements (swf-monitor)
- `swf-monitor-mcp-asgi.service` restart step — always restarts on deploy (uvicorn needs it).
- `deploy-swf-monitor.sh` ensures `/opt/swf-monitor/shared/hf_cache` exists with open perms and appends an `HF_HOME=` line to `production.env` if missing. Bamboo and epicdoc reuse the cache across processes.
- Nightly update scripts (`nightly-update-mcp-servers.sh`, `nightly-update-epicdoc.sh`) — auto-update sibling MCP-server repos and re-ingest ePIC documentation into epicdoc's ChromaDB store.

PanDA Production Monitoring — Job Deep-Dive Enhancements (swf-monitor)
- `panda_study_job` — clickable links to the NERSC job portal alongside existing Harvester log URLs.
- `log_analysis` field with fallback to the Harvester URL when filebrowser fails.
- `/panda job` output (source → component) — fixes a KeyError that surfaced on some job records.

Auth & API Changes (swf-monitor)
- `TunnelAuthMiddleware` now requires an `X-Remote-User` header before auto-authenticating — anonymous proxy requests no longer get a free pass. This matches the threat model of the TunnelAuthentication DRF backend (which also checks the header before acting).
- The `/api/users/` response now includes `email`, `first_name`, `last_name` — enables richer devcloud account sync.

Documentation
- `PRODUCTION_DEPLOYMENT.md` refreshed for the two-backend layout, the new `setup-apache-deployment.sh` behavior, and the full deploy step list (conf sync, ASGI worker restart).
- `MCP.md` — ASGI/WSGI split documented, transport description corrected (it IS streamable HTTP), tool summary count corrected to 44, all tool categories added.
- `PCS.md` — MCP Tools table corrected to the tools that actually exist.
- New: `JEDI_INTEGRATION.md` (architecture, field mapping, implementation plan) and `JEDI_EPIC_PROPOSAL.md` (technical proposal for PanDA team review) — roadmap for direct task submission to JEDI, replacing the current `prun` CLI text generation.

Agent Resilience (swf-common-lib)
Further hardening of the BaseAgent lifecycle under unreliable infrastructure: the non-fatal heartbeats and retrying agent-ID registration reviewed above.
swf-testbed
No user-facing changes in v34 — administrative commits only (CLAUDE.md branch-reference updates, v33 release notes catch-up).