Gofannon user-ready #577

Merged
andrewmusselman merged 29 commits into main from integrate-all-prs on May 5, 2026
@andrewmusselman

Collaborator

Sandbox observability + dev-stack hardening

A bundle of medium-sized features and stack hardening, accumulated while testing the dev stack against real workloads.

Sandbox Progress Log

The sandbox previously showed an agent's final result and a panel of data store ops; users had to read the api container's stdout to see what an agent was actually doing. Adds a structured per-run trace surfaced in a new Progress Log accordion in the sandbox right column.

Backend. services/agent_trace.py collects events (agent_start/end, llm_call, data_store, error, stdout, log). Bound via contextvar so nested layers (LLM service, data store proxy, GofannonClient.call recursion) emit without threading the collector through every signature. capture_user_io() routes stdout/stderr/logging into the trace with 4 KB per-event and 2000 events per-trace caps; streams restored on exit including on exception. GOFANNON_DISABLE_USER_TRACE=1 suppresses user-origin events; structural events still emit. The LLM call wrapper times each call so duration appears even when call_llm raises. Sandbox failure path returns a structured response with the partial trace instead of raising.

Frontend. SandboxProgressLog.jsx lists runs newest-first; each is a card with status chip and per-agent groups. Outcome icons (✓/✗/⏳), durations, "chained" badges for nested calls. Errors get red border + bg. In-memory history (lost on refresh).

Streaming. POST /agents/run-code/stream returns text/event-stream. Each Trace event becomes one SSE trace frame (~50 ms latency); final done frame carries result/error/opsLog/schemaWarnings. Trace gains an optional asyncio.Queue published to on each append. Frontend uses fetch + ReadableStream (not EventSource — POST + custom headers needed). 30s heartbeat comments + X-Accel-Buffering: no keep proxies from idling out the connection. Non-streaming endpoint stays for callers that want a bulk shape.

Bucketing. Long agents emit hundreds of lines; the panel got unwieldy. stdout/log events collapse into per-agent buckets ("47 lines of stdout/log output · click to view"), breaking at structural events so chronological flow is preserved. Click → Drawer side sheet with all lines in a scrollable monospace block, per-line error highlighting (lines with ERROR/FAIL/TRACEBACK), and a one-line preview of the latest error-flavored line in the bucket summary.

Side sheet for stack traces. Multi-line content truncated to 3 lines inline with a "more" link to the same side sheet.

Tests. test_agent_trace.py (33 unit tests) covers event collection, depth/agent stack, truncation cap, env-var disable, contextvar binding, line-buffering stdout wrapper, logging handler, and capture_user_io including stream restoration on exception. test_run_code_streaming.py (6 integration tests) covers the streaming endpoint end-to-end: success path, error path with structured done frame, opsLog/schemaWarnings in the done frame, response headers, friendly_name plumbing, SSE parser tolerance for heartbeat comments. agent_trace.py jumps from 0 % to 87 % coverage.

docs/developers/agent-trace.md covers the env var, leak vectors, caps, contextvar rationale, and how to add new event types.

Phase B (session) auth as the default dev mode

Session-cookie auth becomes the default dev-stack mode. dev-tail.sh no longer needs --phase-b; mockAuth is no longer the default frontend service.

Flow: GET /auth/login/dev_stub → backend redirects to picker → user clicks alice/bob/site_admin_1 → callback sets gofannon_sid httpOnly cookie → redirected back to frontend. .dev-auth.yaml committed as a dev fixture. docs/developers/local-auth.md documents the flow.

Five bugs surfaced and were fixed during validation:

  • CORS allowing wildcard with credentials (browsers reject) → honor FRONTEND_URL.
  • Backend redirect target was relative; resolved against port 8000 instead of frontend → prefix with FRONTEND_URL when relative.
  • AuthContext misrecognizing sessions because local.js had provider:'mock' → flipped to session.
  • Frontend POST'ing to non-existent /auth/dev_stub/login → use the real GET → picker → callback flow.
  • sessionAuth.onAuthStateChanged fired a synchronous null callback before /auth/me resolved, so PrivateRoute bounced to /login and LoginPage's "already logged in" effect bounced to home — refresh on /agent/<id> always landed on /. Fix: only emit synchronously if a user is already resolved; otherwise wait for _fetchMe and let _emit() send the real value.
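The redirect fix (second bullet above) can be sketched as follows. This is an illustration under assumptions, not the backend's code: `resolve_redirect` is a hypothetical helper name, and the port 3000 frontend URL is only an example value.

```python
from urllib.parse import urljoin, urlsplit


def resolve_redirect(target, frontend_url):
    """Return an absolute redirect URL, prefixing relative targets."""
    parts = urlsplit(target)
    if parts.scheme and parts.netloc:
        return target  # already absolute, leave untouched
    # Relative target: resolve against the frontend, not the API origin.
    return urljoin(frontend_url.rstrip("/") + "/", target.lstrip("/"))


print(resolve_redirect("/agent/42", "http://localhost:3000"))
```

Without the prefix, the browser resolves a relative Location header against the API origin (port 8000), which is the bug described in that bullet.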

E2E tests rewritten — global-setup.js now walks the dev_stub flow and saves storageState. CORS unit test updated to match the fixed allowed_origins. E2E api-keys.spec.js realigned with the new ApiKeysTab DOM (h5 not h6, no "Not configured" chip, no "About API Keys" alert, profile menu trimmed).

Smaller features and fixes

Refresh redirect: see the fifth bug above (the synchronous null callback in sessionAuth.onAuthStateChanged). Bonus: refresh on any deep route now stays on that route.

Stale namespace lists. HomePage and DataStoreConfigAccordion fetched namespaces only on mount with [] deps. A namespace created in another tab/page didn't appear until hard refresh. Refetch on visibilitychange so coming back to a tab gives fresh data.

webui readiness probe. run-all-tests.sh checked for a webui container in docker ps, but with dev-tail.sh the webui is vite on the host, not a container, so the check always failed and warned that the stack was misconfigured. Replace with a curl localhost:3000 probe.

Paste agent code without generation. Agent Code accordion was gated on hasCode; pasted-in code couldn't be saved without first running the LLM generator. Un-gate the accordion (default-expanded in creation flow or when code exists). Save validation reordered: code required first, description only required when code is absent (description is the prompt input for the generator; once code exists it's optional metadata).

Sandbox shows agent's data store config. The agent page renders DataStoreConfigAccordion with configured namespaces + record counts; the sandbox page only had SandboxDataPanel (ops from the most recent run), so users had no view of "what data does this agent have access to" until after running it. Add a readOnly prop to DataStoreConfigAccordion (hides edit/add/delete) and render it on the sandbox page above the ops panel. Reverted in a follow-up — the additional pane cluttered the sandbox view.

Profile menu cleanup. Profile menu had Basic Info / Usage / Billing / API Keys; only API Keys did anything. Drop the placeholders, collapse ProfilePage to render ApiKeysTab directly. Restyle ApiKeysTab to match other top-level pages (constrained max width, back-arrow + h5 title, single in-place TextField per row, "Configured" chip only when keyed, absence implies not configured).

friendly_name → trace events. Plumb the agent's friendly name through RunCodeRequest so the trace's per-event agent_name reflects the actual agent (e.g. test_agent) instead of a placeholder. Frontend sources from agentData.friendlyName / agentData.name or the creation-flow context.

Roadmap

  • Persistent run history (currently in-memory; lost on refresh).
  • Test coverage for the streaming endpoint's heartbeat path (currently only the parser-side is tested for heartbeat tolerance).

The end-to-end flow has been manually validated against a real Bedrock-backed ASVS auditor agent doing tarball ingest, multi-step LLM analysis, and bursty GitHub pushes.

Clicking a composer or invokable chip on ViewAgent opens the model
dialog pre-populated with that item's existing config, so users can
edit a previously-added model in place instead of deleting and
re-adding. Dialog title reflects whether it's an add or edit.

Adds OpenRouter alongside the existing providers with an 11-model
catalog: grok-code-fast-1, grok-4.1-fast, claude-sonnet-4.5,
claude-opus-4.1, gpt-5, gpt-5-mini, deepseek-v3.2, deepseek-chat-v3.1,
qwen3-coder, qwen3-coder-next, llama-3.3-70b-instruct.

- New config/openrouter/ module with _make_entry helper for the catalog
- provider_config.py registers the new provider
- models/user.py adds openrouter_api_key field
- services/user_service.py extracts PROVIDER_KEY_MAP constant
- ApiKeysTab.jsx adds the OpenRouter row

No llm_service.py changes needed — existing model_string routing
handles openrouter/* model ids.

Input fields in the agent Sandbox now match the declared schema type
instead of always rendering as text. Number fields get numeric input,
boolean gets a Switch, JSON gets a multiline textarea with inline
parse-error feedback. Adds 'json' as a schema type option in the
SchemaEditor. Values are cast on submit so backend receives
correctly-typed payloads.
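The cast-on-submit step can be illustrated with a language-neutral sketch, written here in Python although the real logic lives in the frontend. The schema type strings "number" and "boolean" and the helper name `cast_input` are assumptions; only 'json' is named in the message above.

```python
import json


def cast_input(raw, schema_type):
    """Cast a raw form value to the type declared in the schema."""
    if schema_type == "number":
        value = float(raw)
        # Keep whole numbers as ints so 3 is not sent as 3.0.
        return int(value) if value.is_integer() else value
    if schema_type == "boolean":
        return bool(raw)  # a Switch already yields a boolean
    if schema_type == "json":
        return json.loads(raw)  # raises json.JSONDecodeError on bad input
    return raw  # strings pass through unchanged


payload = {
    "limit": cast_input("3", "number"),
    "filters": cast_input('{"tag": "asvs"}', "json"),
}
```

The JSON branch raising on invalid input is what would feed the inline parse-error feedback.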
Backend portion of PR 6 only. SandboxScreen.jsx hunks deferred
because 3 of 5 hunks conflict with PR 5's handleRun restructuring
and partial apply would leave the frontend referencing undefined
variables (schemaWarnings, WarningAmberIcon).

Included:
- agent_factory/prompts.py: strengthens output directive prompts
  with three ✅/❌ examples so the LLM returns structured results
  matching the declared output_schema instead of wrapping in
  {outputText: ...}.
- dependencies.py: validate_output_against_schema() — checks dict
  shape, missing/extra keys, type mismatches (with bool-vs-int
  gotcha handling).
- models/agent.py: adds output_schema to RunCodeRequest and
  schema_warnings to RunCodeResponse.
- routes.py: sandbox route calls the validator, returns warnings.
- services/agentService.js: runCodeInSandbox signature extended
  with outputSchema parameter.
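A hedged sketch of the kind of check validate_output_against_schema() performs, including the bool-vs-int gotcha. The schema shape (a flat dict of key to type name), the type names, and the warning strings are all assumptions, not the shape used in dependencies.py.

```python
def _matches(value, type_name):
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "string":
        return isinstance(value, str)
    return True  # unknown type names are not checked in this sketch


def validate_output(result, schema):
    """Return advisory warnings; an empty list means the shapes match."""
    if not isinstance(result, dict):
        return [f"expected a dict, got {type(result).__name__}"]
    warnings = []
    for key in sorted(schema.keys() - result.keys()):
        warnings.append(f"missing key: {key}")
    for key in sorted(result.keys() - schema.keys()):
        warnings.append(f"unexpected key: {key}")
    for key in sorted(schema.keys() & result.keys()):
        if not _matches(result[key], schema[key]):
            warnings.append(f"type mismatch for {key}: expected {schema[key]}")
    return warnings
```

Because `isinstance(True, int)` is true in Python, a naive check would accept `True` for a number field; the explicit exclusion avoids that.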

Deferred to PR 8b:
- SandboxScreen.jsx: adds WarningAmberIcon import, schemaWarnings
  state, capture from response, advisory banner JSX, outputSchema
  arg on the service call. All land atomically when PR 8b rebuilds
  handleRun.

Adds a 'Chain View' accordion on ViewAgent showing the transitive
dependency tree of an agent: nested GofannonClient calls and MCP
servers, rendered as an indented MUI List. Root agent is expanded
by default. Cycles and missing agents are badged; depth capped at
8 to prevent runaway recursion.

- dependencies.py: build_agent_chain with ancestry-based cycle
  detection and missing-agent handling
- routes.py: GET /agents/{id}/chain
- New components/AgentChainView.jsx with Launch icon to navigate
  into a child agent
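Ancestry-based cycle detection with a depth cap can be sketched as below. This is not build_agent_chain itself: the `agents` mapping (id to child ids), the node fields, and the function name are hypothetical.

```python
MAX_DEPTH = 8  # depth cap from the description


def build_chain(agent_id, agents, ancestry=(), depth=0):
    """Build a nested dependency tree from a mapping of id -> child ids."""
    node = {"id": agent_id, "children": []}
    if agent_id not in agents:
        node["missing"] = True
        return node
    if agent_id in ancestry:
        node["cycle"] = True  # this id is its own ancestor
        return node
    if depth >= MAX_DEPTH:
        node["truncated"] = True
        return node
    for child in agents[agent_id]:
        node["children"].append(
            build_chain(child, agents, ancestry + (agent_id,), depth + 1)
        )
    return node


tree = build_chain("a", {"a": ["b", "ghost"], "b": ["a"]})
```

Tracking the ancestry of the current path, rather than a global visited set, means an agent reached through two different parents is rendered under both and is not mistaken for a cycle.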
User-facing browser for the persistent data store.

- models/data_store.py: DataStoreRecord, NamespaceStats,
  NamespaceListResponse, SetRecordRequest, ClearNamespaceResponse
- routes.py: 6 new endpoints for namespace/record CRUD with
  path-matched keys
- services/dataStoreService.js: API client wrapper
- pages/DataStoresPage.jsx: stats cards + namespace table + clear
  confirmation
- pages/DataStoreBrowser.jsx: prefix-grouped record table with
  search, right drawer with Value/Metadata/Copy/Edit/Delete,
  JSON-or-raw-string edit dialog
- HomePage adds a 3rd column for Data Stores at xl breakpoints
- New routes /data-stores and /data-stores/:namespace
…rs PR 6 frontend (item 12 part 2)

Per-agent data store configuration and a live sandbox panel showing
every data store op the agent performed during its run.

- services/data_store_service.py: AgentDataStoreProxy instrumented
  with an ops_log parameter; all 9 ops log structured entries
  {op, namespace, agent, ts, key?, valuePreview?, found?, count?}
  with 200-char value previews.
- dependencies.py: _execute_agent_code returns (result, ops_log);
  three internal callsites updated.
- models/agent.py: DataStoreNamespaceConfig + data_store_config
  field on Create/Update requests; ops_log on RunCodeResponse.
- components/SandboxDataPanel.jsx: right-side panel with
  Operations tab (color-coded READ/WRITE/DEL chips, expandable
  rows) + State tab (per-namespace aggregation).
- components/DataStoreConfigAccordion.jsx: flow preview
  (Reads From → agent → Writes To) + namespace table with
  Autocomplete suggestions.
- ViewAgent.jsx inserts the config accordion between Schemas and
  Model Config; data_store_config threaded through save payloads.
- SandboxScreen.jsx rebuilt to integrate PR 5's typed inputs,
  PR 6's schemaWarnings capture (deferred from PR 6), and PR 8b's
  opsLog capture + two-column layout with the data panel. Fixes
  the frontend half of PR 6 that was skipped due to hunk conflicts.
- Updates 5 unit tests (test_dependencies.py, test_context_window.py)
  to unpack the new (result, ops_log) tuple from _execute_agent_code.
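For illustration, a structured ops-log entry with the optional fields and the 200-character preview might be built like this. The helper name `make_op_entry` and the op names "set" and "get" are hypothetical; only the entry keys and the preview length come from the message above.

```python
import json
import time

PREVIEW_CHARS = 200  # preview length from the description


def make_op_entry(op, namespace, agent, key=None, value=None, found=None, count=None):
    """Build one log entry; optional fields appear only when supplied."""
    entry = {"op": op, "namespace": namespace, "agent": agent, "ts": time.time()}
    if key is not None:
        entry["key"] = key
    if value is not None:
        text = value if isinstance(value, str) else json.dumps(value, default=str)
        entry["valuePreview"] = text[:PREVIEW_CHARS]
    if found is not None:
        entry["found"] = found
    if count is not None:
        entry["count"] = count
    return entry


ops_log = []
ops_log.append(make_op_entry("set", "reports", "test_agent", key="r1", value="x" * 500))
ops_log.append(make_op_entry("get", "reports", "test_agent", key="missing", found=False))
```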
Backend-only. Adds the session-based auth system gated behind the
AUTH_CONFIG_PATH env var. No user-visible changes without operator
opt-in; legacy Firebase auth continues to work untouched.

- config/__init__.py: loads AUTH_CONFIG_PATH YAML into settings
- models/session.py, workspace.py, auth.py: data models
- auth/base.py: AuthProvider ABC with get_authorize_url,
  exchange_code, get_workspace_memberships, evaluate_login
- auth/ldap_client.py: ldap3 wrapper for ASF committer/PMC/banned
  queries with soft-fail on LDAP outage
- auth/providers/dev_stub.py: local-dev provider with YAML-configured
  test users and a plain HTML picker page
- auth/providers/asf.py: real oauth.apache.org + LDAP integration
  with ASF-specific banned/emeritus/site-admin policy
- auth/__init__.py: ProviderRegistry with startup init
- services/session_service.py: CRUD + refresh with diff computation
- services/audit_service.py: append-only log (scaffolding for B-3)
- routes_auth.py: /auth/providers, /auth/login/{type},
  /auth/callback/{type}, /auth/logout, /auth/refresh-workspaces,
  /auth/me, /auth/dev-stub-picker
- routes.py: get_current_user is dual-mode (session cookie first,
  Firebase bearer token fallback)
- app_factory.py: registry init and conditional auth router mount
- requirements.txt: +ldap3>=2.9

Three-tier role model: member (workspace), admin (workspace, from
LDAP PMC intersection), site_admin (global, from config allowlist).
Personal workspace auto-created per session. Soft-fail on LDAP
outage preserves existing memberships.
…ow tests

Infra hardening for the E2E test harness:
- playwright.config.js: fullyParallel:false, workers:1. Tests share
  a single backend user (local-dev-user) so parallel workers race
  on state mutations.
- packages/webui/vite.config.js: ignore test-results, playwright-
  report, .auth, coverage, htmlcov from watch. Prevents Vite HMR
  reloading the page mid-test.
- infra/docker/docker-compose.yml: uvicorn --reload excludes for
  *.pyc, __pycache__, tests, pytest_cache, htmlcov, coverage.
  Prevents uvicorn restarting mid-request during test runs.
- tests/e2e/api-keys.spec.js: skip 5 write-flow tests (add/update/
  remove/masked/success) that intermittently fail with 'Failed to
  fetch' despite the above fixes. Backend API-key write endpoints
  are covered by pytest integration tests; skipping gives a stable
  11/16 passing baseline while root cause is investigated in a
  separate project.

Teaches the UI to talk to the B-1 session backend.

- services/authService.js: new sessionAuth implementation alongside
  firebase/mock/cognito. Selected when appConfig.auth.provider is
  'session'. Uses the gofannon_sid cookie set by B-1's callback
  route; exposes refreshWorkspaces. Exports fetchAuthProviders
  helper for LoginPage.
- services/fetchInterceptor.js: wraps window.fetch once at load
  to auto-add credentials:'include' to same-origin API calls —
  avoids editing 19 fetch sites across 5 service files.
- contexts/AuthContext.jsx: adds refreshWorkspaces and
  isSessionAuth to context value.
- pages/LoginPage.jsx: fetches /auth/providers on mount. Renders
  one button per provider when Phase B is enabled; legacy Firebase
  form shown below when operator sets legacyFirebaseEnabled=true.
- components/ProfileMenu.jsx: user identity header with site-admin
  chip, workspace list preview (top 5 with admin chips, '+N more'),
  'Refresh workspaces' menu item with snackbar for the diff.
- App.jsx: imports fetchInterceptor at top.

Zero regression for teams not on Phase B: without AUTH_CONFIG_PATH
the backend 404s /auth/providers and LoginPage renders legacy form
identically to before.

Extends the Phase B pluggable auth infrastructure with three additional
identity providers. Backend-only; existing LoginPage (from B-2) picks
them up via /auth/providers.

New providers:
- auth/providers/google.py: Google Workspace OAuth + Admin SDK Directory
  API for Google Groups. Hosted-domain enforcement, allowlist default,
  OWNER/MANAGER -> admin role mapping.
- auth/providers/microsoft.py: Microsoft Entra ID OAuth + Graph
  /me/transitiveMemberOf for security group memberships. Tenant-scoped
  authorize, optional admin_groups subset for role promotion.
- auth/providers/github.py: GitHub OAuth + /user/memberships/orgs/{org}
  for org role. Numeric id for external_id (rename-stable), case-
  insensitive org normalization.

Common patterns:
- All three default to mode=allowlist (deny unless in configured
  group/org); operators opt into open_domain/open_tenant/open_github
  for public-style deployments.
- Access token stashed on UserInfo for the subsequent memberships call.
- 403 soft-fails to empty memberships.
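The common patterns above can be sketched roughly as follows. The `Forbidden` exception, the function signatures, and the rule that open modes admit any authenticated user are assumptions for illustration; the real providers implement these checks per identity provider.

```python
class Forbidden(Exception):
    """Stand-in for an HTTP 403 from the identity provider's API."""


def get_memberships(fetch_groups, allowed_groups):
    """Return the user's groups that are on the configured allowlist."""
    try:
        groups = fetch_groups()
    except Forbidden:
        return []  # soft-fail: a 403 becomes "no memberships"
    return [g for g in groups if g in allowed_groups]


def evaluate_login(memberships, mode="allowlist"):
    """Deny by default; open_* modes are an explicit operator opt-in."""
    if mode == "allowlist":
        return len(memberships) > 0
    return True  # open_domain / open_tenant / open_github (assumed behavior)
```

With the allowlist default, a soft-failed membership lookup results in a denied login rather than an accidental grant.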

Wired into the registry:
- auth/__init__.py: three new entries in _PROVIDER_CLASSES.
- auth/providers/__init__.py: re-exports new classes.

Tests: 46 mock-based unit tests across tests/unit/auth/, covering
config validation, authorize URL shape, exchange_code happy/error
paths, membership allowlist + role mapping + 403 soft-fail, and all
evaluate_login branches.

Docs: auth.example.yaml extended with disabled example blocks for
each new provider (same pattern as the existing asf + dev_stub blocks).

Reference configuration template. Operators copy this, fill in
secrets, and point AUTH_CONFIG_PATH at the copy. Includes blocks
for dev_stub (local dev), asf (PR B-1), and google/microsoft/
github (PR B-1.1). All providers disabled by default except
dev_stub for the example.

The couchdb-python library no longer accepts replica-count (n) and
shard-count (q) kwargs on Server.create(). These params only apply
to clustered CouchDB deployments anyway; the dev stack runs
single-node where they're ignored.

Symptom: 500 on any route that triggers DB auto-creation against a
fresh CouchDB instance ('TypeError: Server.create() got an
unexpected keyword argument n').

Pre-existing bug, surfaced when testing against a fresh CouchDB.

app_factory.py computed allowed_origins from FRONTEND_URL then
discarded it, hardcoding allow_origins=['*']. With
allow_credentials=True the browser blocks responses because the
CORS spec forbids wildcard + credentials.

Pre-existing bug. Exposed by PR B-2's fetchInterceptor adding
credentials:'include' to every API call to support session cookies.
Before B-2, nothing sent credentials, so the wildcard was tolerated.

Fix: honor the computed allowed_origins list.
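A small sketch of deriving an explicit origin allowlist from FRONTEND_URL. The helper name and the comma-separated parsing are assumptions; the port 3000 default follows the FRONTEND_URL default mentioned elsewhere in this PR.

```python
import os


def compute_allowed_origins(frontend_url):
    """Build an explicit origin list; refuse a wildcard with credentials."""
    origins = [o.strip().rstrip("/") for o in frontend_url.split(",") if o.strip()]
    if "*" in origins:
        raise ValueError("wildcard origin cannot be combined with credentials")
    return origins


allowed_origins = compute_allowed_origins(
    os.environ.get("FRONTEND_URL", "http://localhost:3000")
)
# In a Starlette/FastAPI-style app (assumed), this list is what would be
# passed as allow_origins alongside allow_credentials=True.
```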
…-tail

- Dockerfile.api: bump base image from python:3.10-slim to
  python:3.12-slim. Silences google.api_core's FutureWarning about
  Python 3.10 EOL in Oct 2026.
- Dockerfile.api: upgrade pip before installing requirements and
  pass --root-user-action=ignore to silence the 'running pip as
  root' warning in container builds.
- playwright.config.js: baseURL and dev-server command on port 3000
  to match the backend's FRONTEND_URL default (which drives the
  CORS allowlist).
- dev-tail.sh: new local dev runner that starts docker + vite in a
  single script with tailing logs. Uses isolated
  COMPOSE_PROJECT_NAME=gofannon-dev so other compose projects on
  the host aren't disturbed. Supports --phase-b for Phase B auth
  testing with a dev_stub-configured yaml, and --stop for clean
  teardown.
Inline the auth.yaml mount into docker-compose.yml, commit a working
.dev-auth.yaml at the repo root with three dev_stub users
(alice/bob/site_admin_1), and remove the override mechanism that
was conditionally generated by dev-tail.sh.

After this:
  ./dev-tail.sh         # auth is on, login works out of the box
  ./dev-tail.sh --stop  # unchanged

Operators deploying gofannon override AUTH_CONFIG_PATH and mount
their own auth.yaml at deploy time; the committed .dev-auth.yaml
exists only for local development. The header comment in that file
spells out the dev-only nature loudly enough that nobody copies it
into production by accident.

Personal customization without touching the committed file: copy
to .dev-auth.local.yaml (gitignored) and point the api service's
volume mount at that copy.

Documentation for the new flow lives at
docs/developers/local-auth.md — covers the committed fixtures, the
.dev-auth.local.yaml escape hatch, production guidance, and
troubleshooting.

Signed-off-by: Andrew Musselman <andrew.musselman@gmail.com>

The companion fix (a9add26 'fix(cors): use computed allowed_origins,
don't hardcode wildcard') made the production code honor FRONTEND_URL
instead of hardcoding ['*']. This test was asserting the old buggy
behavior; update to assert the new correct behavior.

The previous global-setup seeded localStorage with a mock user, which
worked when the frontend's authService picked mockAuth at module-load
time. With session auth as the default, mockAuth is not loaded; the
seeded localStorage is ignored; tests start unauthenticated and the
AccountCircle menu (only rendered when logged in) never appears.

Replace with a setup that walks the dev_stub login flow:
  - GET /auth/login/dev_stub kicks off the OAuth-shaped flow
  - Picker page (rendered by backend) lists configured users as <a>s
  - Click the alice link → backend callback → session cookie set
  - storageState now contains gofannon_sid; tests inherit it

Adds E2E_STUB_USER env var so a suite can run as bob (deny path)
or site_admin_1 (admin views) without editing the file. Adds a post-
login /auth/me sanity check so authentication failures produce a
useful error instead of cascading into 'AccountCircle not found'
timeouts in every test.

Three unrelated fixes bundled into one commit.

1. sessionAuth.onAuthStateChanged was firing callback(null)
   synchronously on first listener subscription, before /auth/me
   resolved. AuthContext set loading=false on the null callback;
   PrivateRoute saw {user: null, loading: false} and bounced to
   /login; LoginPage's 'if (user) navigate(/)' fired the moment
   /auth/me resolved with the real user; user landed on home with
   no idea what happened. Refresh on /agent/<id> always landed on
   home as a result. Fix: only emit synchronously if we already
   have a resolved user (covers later subscriptions); otherwise
   wait for _fetchMe to resolve and let _emit() send the real
   value, keeping AuthContext in loading=true until then.

2. HomePage and DataStoreConfigAccordion fetched the namespace
   list only on mount with [] deps. A namespace created in another
   tab/page didn't appear until hard refresh. Fix: refetch on
   visibilitychange so coming back to a tab gives fresh data.

3. run-all-tests.sh checked for a 'webui' container in 'docker ps'
   as the gate on running e2e tests. With dev-tail.sh the webui is
   vite on the host, not a container, so the check always failed.
   Fix: replace with a curl probe of localhost:3000 — answers the
   actual question (can e2e reach the frontend?) and works whether
   the frontend is vite, nginx-in-compose, or anything else.

The agent editor required users to click Generate Code before they
could see the code editor or save the agent. Two friction points:

1. The Agent Code accordion was gated on hasCode, so the editor
   was hidden entirely until code existed. Users with code already
   written (or copied from another agent) had no way to drop it in.

2. Save validation required a non-empty description even when code
   was already present. Description exists primarily as the prompt
   input for the LLM generator; once code exists, it's optional
   metadata.

Un-gate the accordion: always render, default-expanded in the
creation flow or when code exists, with a hint pointing at both
paths (paste here, or use Generate Code below). Reorder save
validation to check code first, and only require description
when code is absent.

The Generate Code button is unchanged and still requires a
description (correct — that operation does need it). This just
adds 'paste your own' as an equally valid path.

Sandbox Progress Log

Adds a structured per-run trace and a Progress Log accordion in the
sandbox right column above the data-store panel.

Backend (services/agent_trace.py): Trace collects events
(agent_start, agent_end, llm_call, data_store, error, stdout, log).
Bound to the asyncio task tree via contextvar so nested layers emit
without threading the collector through every signature.
capture_user_io() routes stdout/stderr/logging into the trace, with
4KB-per-event and 2000-event caps. GOFANNON_DISABLE_USER_TRACE=1
suppresses user-origin capture; structural events still emit.
Failure path returns a structured response with partial trace
instead of raising. friendly_name flows through the request to the
trace's per-event agent_name.
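A reduced sketch of the stdout half of capture_user_io(), showing line buffering and stream restoration on exception. It omits stderr, the logging handler, the caps, and the env-var switch; `_LineSink` and `capture_stdout_sketch` are hypothetical names.

```python
import contextlib
import io
import sys


class _LineSink(io.TextIOBase):
    """Line-buffering writer that forwards complete lines to a callback."""

    def __init__(self, on_line):
        self._on_line = on_line
        self._buf = ""

    def writable(self):
        return True

    def write(self, text):
        self._buf += text
        while "\n" in self._buf:
            line, self._buf = self._buf.split("\n", 1)
            self._on_line(line)
        return len(text)

    def flush_partial(self):
        """Forward any trailing text that never received a newline."""
        if self._buf:
            self._on_line(self._buf)
            self._buf = ""


@contextlib.contextmanager
def capture_stdout_sketch(events):
    original = sys.stdout
    sink = _LineSink(lambda line: events.append(("stdout", line)))
    sys.stdout = sink
    try:
        yield
    finally:
        sink.flush_partial()
        sys.stdout = original  # restored even if the body raised
```

Restoring in `finally` is what keeps a crashing agent from leaving the process with a hijacked stdout.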

Frontend (SandboxProgressLog): runs listed newest-first, each with
per-agent groups. Outcome icons, durations, 'chained' badges for
nested calls. Errors highlighted with red border + bg. Multi-line
content truncated to 3 lines with a 'more' link to a Drawer side
sheet (right anchor, 600px). In-memory history, refresh wipes.
Transport errors get a synthetic event so the log doesn't spin.

Trace ships in the bulk response — events appear at run completion,
not live. SSE streaming follow-up planned.

docs/developers/agent-trace.md covers the env var, caps, contextvar
rationale, and how to add new event types.

Profile Menu Cleanup

Drop Basic Info / Usage / Billing menu items (placeholders with no
content); ProfilePage now renders ApiKeysTab directly.

Restyle ApiKeysTab to match other top-level pages: constrained max
width, back-arrow + h5 title, single in-place TextField per row,
'Configured' chip only when keyed. Tests realigned.

Long-running agents (e.g. the ASVS auditor that emits hundreds of
print() lines per run) made the Progress Log accordion unwieldy:
the panel grew vertically without bound and the right column
spilled past its boundary. Reading the run shape — what LLM calls
happened, where errors hit — meant scrolling through walls of
debug output.

Bucket stdout/log events into per-agent collapsible rows. Each
contiguous run of stdout/log events between structural events
(llm_call, data_store, error, agent_end) becomes one row showing
'N lines of stdout/log output · click to view'. Buckets break at
structural events so chronological flow is preserved — you still
see 'agent prints → LLM call → agent prints more', just with each
'prints' segment collapsed.

Click → side sheet shows all the bucketed lines in a scrollable
monospace block, with timestamp gutter and per-line error
highlighting (heuristic: lines containing ERROR/FAIL/TRACEBACK
get a red left border).

The bucket summary row pops the latest error-flavored line into a
preview underneath ('Latest: ERROR: ...consolidated.md') and
shows an error count badge ('47 lines · 30 with errors'), so the
common 'agent caught an exception, printed it, kept going'
failure mode is visible without clicking.
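The bucketing and error preview described above, as a language-neutral sketch in Python (the real logic lives in the frontend component). Event and row shapes are assumptions, per-agent grouping is left out, and the token match is assumed to be a plain case-sensitive substring test.

```python
ERROR_TOKENS = ("ERROR", "FAIL", "TRACEBACK")
BUCKETED_KINDS = {"stdout", "log"}


def is_error_line(line):
    # Assumption: a plain case-sensitive substring match.
    return any(token in line for token in ERROR_TOKENS)


def bucket_events(events):
    """Collapse contiguous stdout/log events; other events break the bucket."""
    rows, current = [], None
    for event in events:
        if event["kind"] in BUCKETED_KINDS:
            if current is None:
                current = {"kind": "bucket", "lines": []}
                rows.append(current)
            current["lines"].append(event["message"])
        else:
            current = None  # a structural event closes the open bucket
            rows.append(event)
    return rows


def summarize(bucket):
    """Line count, error count, and latest error line for the summary row."""
    errors = [line for line in bucket["lines"] if is_error_line(line)]
    return {
        "line_count": len(bucket["lines"]),
        "error_count": len(errors),
        "latest_error": errors[-1] if errors else None,
    }
```

Because buckets are closed rather than merged across structural events, the rendered order still reads as prints, then LLM call, then more prints.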

Structural events (llm_call, data_store, error, agent_end) still
render inline. Errors keep their red border + bg highlight.
Stack traces still go to the existing single-event side sheet
view; bucket view is a sibling Drawer mode that's mutually
exclusive with the single-event view.

Sandbox runs waited for the agent to finish before showing any
trace events — for long agents (file ingest, multi-step LLM
flows) the Progress Log spun for minutes showing 'No events
recorded' while api logs filled with the agent's actual output.

Adds POST /agents/run-code/stream returning text/event-stream.
Each Trace event becomes one SSE 'trace' frame, dispatched to
the client within ~50ms. A final 'done' frame carries result,
error, schema_warnings, and ops_log.

Trace gains an optional asyncio.Queue. attach_queue() wires it
up before the agent runs; every Trace.append() also publishes
to the queue (put_nowait, never blocks emitters). The streaming
endpoint runs the agent in an asyncio.Task and pulls from the
queue, yielding SSE frames until a sentinel signals completion.
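The queue-and-sentinel pattern above can be sketched with plain asyncio. This is an assumption-laden illustration, not the endpoint's code: `stream_run`, `sse_frame`, and `demo_agent` are hypothetical names, the agent receives the queue directly instead of going through Trace.append(), and heartbeats are omitted.

```python
import asyncio
import json

_DONE = object()  # sentinel marking the end of the event stream


def sse_frame(event, data):
    """Format one Server-Sent Events frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def stream_run(agent, queue):
    """Run agent in a task and yield SSE frames as events arrive."""

    async def runner():
        try:
            return await agent(queue)
        finally:
            queue.put_nowait(_DONE)  # always signal completion

    task = asyncio.create_task(runner())
    while True:
        item = await queue.get()
        if item is _DONE:
            break
        yield sse_frame("trace", item)
    try:
        result = await task
        yield sse_frame("done", {"result": result, "error": None})
    except Exception as exc:
        yield sse_frame("done", {"result": None, "error": str(exc)})


async def demo_agent(queue):
    queue.put_nowait({"kind": "agent_start"})  # put_nowait never blocks
    await asyncio.sleep(0)
    queue.put_nowait({"kind": "agent_end"})
    return {"ok": True}


async def collect():
    return [frame async for frame in stream_run(demo_agent, asyncio.Queue())]


frames = asyncio.run(collect())
```

Putting the sentinel in a `finally` block means the consumer loop ends whether the agent returns or raises, so the final done frame is always sent.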

Frontend uses fetch + ReadableStream rather than EventSource
(POST and custom headers needed). agentService gains
runCodeInSandboxStreaming which parses SSE frames manually and
dispatches each event to an onEvent callback. SandboxScreen
wires onEvent into setRuns so events accumulate in the in-flight
'running' entry as they arrive.

Non-streaming /agents/run-code endpoint stays for callers that
want the bulk response (deployed agents, scripts, batch tests).
Both share _execute_agent_code; only the response shape differs.

30s heartbeat comment frames prevent proxy idle-timeout.
X-Accel-Buffering: no header tells nginx not to buffer the
response body.
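On the consuming side, heartbeat comments have to be skipped when parsing frames manually. A minimal parser sketch, written in Python rather than the frontend's JavaScript; the function name is hypothetical and only the `event` and `data` fields are handled.

```python
def parse_sse(text):
    """Parse SSE text into (event, data) pairs, ignoring comment lines."""
    frames = []
    event, data_lines = "message", []
    for line in text.split("\n"):
        if line == "":
            # A blank line ends the frame; frames without data are dropped.
            if data_lines:
                frames.append((event, "\n".join(data_lines)))
            event, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, e.g. a heartbeat
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Strip the single optional space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)
    return frames
```

A heartbeat arrives as a comment-only frame, so it produces no entry in the parsed output.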
The test navigated to /profile/basic (a placeholder route that's
gone) then clicked text=API Keys, which drifted onto the h5 page
title because ProfilePage now always renders ApiKeysTab regardless
of the path segment. Click hit the page title behind the still-
open menu backdrop, timed out.

Start from / instead, and use getByRole('menuitem') to target the
menu item unambiguously.
Signed-off-by: Andrew Musselman <andrew.musselman@gmail.com>

Companion to 9e113f9 which dropped lines but missed statements.

Signed-off-by: Andrew Musselman <andrew.musselman@gmail.com>
@andrewmusselman andrewmusselman merged commit 3f41a93 into main May 5, 2026
4 checks passed
@andrewmusselman andrewmusselman deleted the integrate-all-prs branch May 12, 2026 18:54