feat(ui-tars): ProofPilot - auditable, approval-gated, simulation-first agent execution #1911
Open
ManeeshJupalle wants to merge 9 commits into
…tcss + vitest setup
- Make the bytecode plugin in electron.vite.config.ts optional so local
builds without UI_TARS_APP_PRIVATE_KEY_BASE64 do not crash.
- Add a root postcss.config.cjs that delegates to the existing app config.
- Add apps/ui-tars/scripts/vitest-setup.ts for main-process service tests.
- Introduce CommandApprovalHooks (onApprovalRequired, getDecision, now) on
the commands MCP server so a host can subscribe to approval requests and
supply a decision before the gated command is executed.
- evaluateSafetyWithApprovals consults getDecision first to fast-path
already-approved or already-denied requests, otherwise publishes the
CommandApprovalRequest via onApprovalRequired and returns require_approval
so the call is blocked until a decision arrives.
- Extend safety-policy.ts with stable SHA-256 approval request ids, risk
level mapping, and a CommandApprovalRequest builder.
- Cover the new producer hook in tests/server-in-memory.test.ts.
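The decision-first flow described in this commit can be sketched as follows. This is a minimal, synchronous illustration, not the code in the PR: the hook names come from the commit message, while `gateRiskyCommand`, `approvalRequestId`, the request fields, and the id scheme (SHA-256 over rule id plus command text) are assumptions made for the example.

```typescript
import { createHash } from "node:crypto";

type Decision = "approved" | "denied";
type SafetyOutcome = "allow" | "deny" | "require_approval";

interface CommandApprovalRequest {
  id: string; // stable SHA-256 id, so retries map to the same request
  ruleId: string;
  command: string;
  createdAt: number;
}

interface CommandApprovalHooks {
  onApprovalRequired?: (request: CommandApprovalRequest) => void;
  getDecision?: (requestId: string) => Decision | undefined;
  now?: () => number;
}

// Hypothetical id scheme: SHA-256 over the rule id and the command text.
function approvalRequestId(ruleId: string, command: string): string {
  return createHash("sha256").update(`${ruleId}\n${command}`).digest("hex");
}

// Hypothetical helper, called only for commands a rule already flagged as risky.
function gateRiskyCommand(
  ruleId: string,
  command: string,
  hooks: CommandApprovalHooks,
): SafetyOutcome {
  const id = approvalRequestId(ruleId, command);

  // Fast path: the host has already decided on this exact request.
  const decision = hooks.getDecision?.(id);
  if (decision === "approved") return "allow";
  if (decision === "denied") return "deny";

  // No decision yet: publish the request and keep the command blocked.
  const createdAt = hooks.now ? hooks.now() : Date.now();
  hooks.onApprovalRequired?.({ id, ruleId, command, createdAt });
  return "require_approval";
}
```

Because the id is derived from the request content, a host can store a decision once and have every later retry of the same command resolve through the fast path without publishing the request again.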
…of Pack, Workflow Capsule and Compiled Workflow exports
- Record every agent run under userData/proofpilot/flight-records/<runId>
with a manifest.json, an append-only events.jsonl, and content-hashed
screenshots/SoM screenshots.
- Event types: run_started, status_changed, conversation_added,
action_planned, simulation_action_skipped, run_error, run_finished.
- Every event carries previousHash + eventHash (SHA-256 over canonical
JSON) so the chain can be re-validated as verified | failed | legacy.
- exportProofPack renders a self-contained HTML proof pack with embedded
artifacts plus a stripped JSON dump.
- exportWorkflowCapsule pairs action_planned events with their preceding
conversation_added events and emits a draft workflow.json artifact.
- compileWorkflow wraps the capsule with guardrails (simulation default,
integrity verification, approval for risky actions, stop conditions),
per-step retry policy, evidence hashes, and emits workflow.json +
workflow.md.
- Expose listProofPilotRuns, getProofPilotRun, exportProofPilotRun,
exportProofPilotWorkflowCapsule and compileProofPilotWorkflow via a new
proofPilot IPC route, registered in ipcRoutes.
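To make the previousHash + eventHash chain concrete, here is a minimal in-memory sketch. It is an illustration under assumptions rather than the recorder's actual code: the event shape, the all-zero genesis hash, the key-sorting canonicalization, and the helper names (`canonicalJson`, `appendEvent`, `verifyChain`) are all hypothetical, and the `legacy` state is left out because the commit message does not define it.

```typescript
import { createHash } from "node:crypto";

interface ChainedEvent {
  type: string;
  payload: Record<string, unknown>;
  previousHash: string;
  eventHash: string;
}

// Assumed starting value for the first event's previousHash.
const GENESIS_HASH = "0".repeat(64);

// Canonical JSON: object keys are sorted recursively so that equal values
// always serialize to the same string, regardless of insertion order.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`).join(",")}}`;
  }
  return JSON.stringify(value);
}

function hashEvent(type: string, payload: Record<string, unknown>, previousHash: string): string {
  return createHash("sha256").update(canonicalJson({ type, payload, previousHash })).digest("hex");
}

// Append an event whose hash covers its content and the previous event's hash.
function appendEvent(chain: ChainedEvent[], type: string, payload: Record<string, unknown>): ChainedEvent {
  const last = chain[chain.length - 1];
  const previousHash = last ? last.eventHash : GENESIS_HASH;
  const event: ChainedEvent = { type, payload, previousHash, eventHash: hashEvent(type, payload, previousHash) };
  chain.push(event);
  return event;
}

// Recompute every hash from the start; any edit or reordering breaks a link.
function verifyChain(chain: ChainedEvent[]): "verified" | "failed" {
  let previousHash = GENESIS_HASH;
  for (const event of chain) {
    if (event.previousHash !== previousHash) return "failed";
    if (event.eventHash !== hashEvent(event.type, event.payload, event.previousHash)) return "failed";
    previousHash = event.eventHash;
  }
  return "verified";
}
```

Since each hash covers the previous one, changing any earlier event invalidates that event and breaks the link to everything after it, which is what lets a reader re-validate a whole run from the log alone.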
…actions
- Persistent JSON queue at userData/proofpilot/approval-queue.json holding
ApprovalRequest records (id, title, reason, source, riskLevel, ruleId,
subject, status, createdAt, updatedAt, decidedAt, decisionReason).
- submit() is idempotent on a SHA-256 id derived from
title+source+ruleId+subject when an id is not supplied, so the commands MCP
producer hook can call it repeatedly without duplicating pending requests.
- resolve() marks a request approved or denied with an optional decision
reason; clearResolved() trims out non-pending entries.
- list() returns a snapshot sorted pending-first then by recency, along
with counts per status.
- Expose listApprovalRequests, submitApprovalRequest,
resolveApprovalRequest and clearResolvedApprovalRequests via a new approval
IPC route.
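A rough in-memory sketch of the idempotent submit() and the pending-first list() described above. Persistence is omitted and several details are assumptions: the class name `InMemoryApprovalQueue`, the reduced field set, the NUL-separated hash input, "recency" meaning updatedAt, and returning the existing record when the same id is submitted again.

```typescript
import { createHash } from "node:crypto";

type ApprovalStatus = "pending" | "approved" | "denied";

interface ApprovalRequest {
  id: string;
  title: string;
  source: string;
  ruleId: string;
  subject: string;
  status: ApprovalStatus;
  createdAt: number;
  updatedAt: number;
  decisionReason?: string;
}

type NewApprovalRequest = Pick<ApprovalRequest, "title" | "source" | "ruleId" | "subject"> & { id?: string };

class InMemoryApprovalQueue {
  private readonly requests = new Map<string, ApprovalRequest>();
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  submit(input: NewApprovalRequest): ApprovalRequest {
    // Assumed derivation: hash the four identifying fields, NUL-separated.
    const id = input.id ?? createHash("sha256")
      .update([input.title, input.source, input.ruleId, input.subject].join("\0"))
      .digest("hex");
    const existing = this.requests.get(id);
    if (existing) return existing; // repeated submissions do not duplicate
    const timestamp = this.now();
    const request: ApprovalRequest = {
      id,
      title: input.title,
      source: input.source,
      ruleId: input.ruleId,
      subject: input.subject,
      status: "pending",
      createdAt: timestamp,
      updatedAt: timestamp,
    };
    this.requests.set(id, request);
    return request;
  }

  resolve(id: string, status: "approved" | "denied", decisionReason?: string): ApprovalRequest {
    const request = this.requests.get(id);
    if (!request) throw new Error(`Unknown approval request: ${id}`);
    request.status = status;
    request.updatedAt = this.now();
    if (decisionReason !== undefined) request.decisionReason = decisionReason;
    return request;
  }

  // Snapshot sorted pending-first, then most recently updated first.
  list(): { requests: ApprovalRequest[]; counts: Record<ApprovalStatus, number> } {
    const requests = Array.from(this.requests.values()).sort((a, b) => {
      const rank = Number(a.status !== "pending") - Number(b.status !== "pending");
      return rank !== 0 ? rank : b.updatedAt - a.updatedAt;
    });
    const counts: Record<ApprovalStatus, number> = { pending: 0, approved: 0, denied: 0 };
    for (const request of requests) counts[request.status] += 1;
    return { requests, counts };
  }
}
```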
…ips real actions
- createSimulationOperator wraps any operator and replaces its execute
with a recorder-only path that calls an onSimulatedAction handler and
returns { simulated: true, actionType, actionInputs } with a status of
END for terminal actions (finished, call_user, error_env, user_stop)
and RUNNING otherwise.
- Lets the agent loop plan workflows end-to-end with no actual UI side
effects, so the flight recorder can capture a full simulation_action_skipped
trail for replay and workflow compilation.
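The record-and-skip wrapper can be sketched like this. The `Operator` interface here is a simplified, synchronous stand-in invented for the example (the real operator type is not shown in this PR text), and `wrapForSimulation` is a hypothetical name; only the returned shape and the terminal action list are taken from the commit message.

```typescript
type LoopStatus = "RUNNING" | "END";

interface PlannedAction {
  actionType: string;
  actionInputs: Record<string, unknown>;
}

interface SimulatedResult extends PlannedAction {
  simulated: true;
  status: LoopStatus;
}

// Simplified, synchronous stand-in for an operator interface (assumption).
interface Operator<R> {
  screenshot(): string;
  execute(action: PlannedAction): R;
}

// Terminal action types as listed in the commit message.
const TERMINAL_ACTIONS = new Set(["finished", "call_user", "error_env", "user_stop"]);

function wrapForSimulation<R>(
  operator: Operator<R>,
  onSimulatedAction: (action: PlannedAction) => void,
): Operator<SimulatedResult> {
  return {
    // Observation still goes to the real operator so the agent can plan.
    screenshot: () => operator.screenshot(),
    // Execution is replaced: record the planned action, never perform it.
    execute: (action) => {
      onSimulatedAction(action);
      return {
        simulated: true,
        actionType: action.actionType,
        actionInputs: action.actionInputs,
        status: TERMINAL_ACTIONS.has(action.actionType) ? "END" : "RUNNING",
      };
    },
  };
}
```

In this sketch the wrapped operator's own `execute` is never called, so the only side effect of a simulated step is the `onSimulatedAction` callback.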
…gent run loop
- Add activeSessionId and runMode ('live' | 'simulation') to AppState so the
main store tracks the originating session and selected mode for the
current run, and reset them on clearHistory.
- agentRoute.runAgent now accepts { sessionId?, mode? } and seeds the
store before delegating to runAgent().
- runAgent() starts a FlightRecording at the beginning of every run,
records agent data (with SoM screenshots) on each onData callback,
records errors on onError and on a thrown loop, and finishes the
recording with the final status in a try/finally so the chain always
closes cleanly.
- In simulation mode the operator is wrapped via createSimulationOperator,
the system prompt is amended to tell the model no real action will be
taken, and maxLoopCount is capped at min(setting, 5) to keep simulated
runs bounded.
- Recorder failures are swallowed via a recordFlight helper so a disk or
serialization issue cannot break a live agent run.
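The swallow-and-continue recording, the try/finally close, and the simulation loop cap can be sketched together. Everything except the `recordFlight` name and the `min(setting, 5)` rule is an assumption made for illustration: the `Recorder` interface, `runWithRecording`, `effectiveMaxLoopCount`, the synchronous loop, and the status strings are hypothetical.

```typescript
type RunMode = "live" | "simulation";

// Hypothetical minimal recorder surface.
interface Recorder {
  start(): void;
  error(message: string): void;
  finish(status: string): void;
}

// Run a recorder call; log and swallow any failure so the agent run continues.
function recordFlight(label: string, write: () => void, log: (message: string) => void = console.warn): void {
  try {
    write();
  } catch (error) {
    log(`flight recorder ${label} failed: ${String(error)}`);
  }
}

// Simulated runs are bounded at five loops; live runs keep the configured value.
function effectiveMaxLoopCount(mode: RunMode, setting: number): number {
  return mode === "simulation" ? Math.min(setting, 5) : setting;
}

function runWithRecording(recorder: Recorder, loop: () => string, log: (message: string) => void = console.warn): string {
  let status = "error";
  recordFlight("start", () => recorder.start(), log);
  try {
    status = loop();
    return status;
  } catch (error) {
    recordFlight("error", () => recorder.error(String(error)), log);
    throw error; // loop failures still propagate to the caller
  } finally {
    // Always close the recording with the last known status.
    recordFlight("finish", () => recorder.finish(status), log);
  }
}
```

Note the asymmetry: recorder failures are swallowed, while failures of the agent loop itself are recorded and then rethrown.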
…policies, and audit report export
- TeamWorkspaceStore persists workspace.json under userData/proofpilot/team
with workspaceName, plan (solo|team|business), seatLimit, members and
policies (defaultRunMode, approvalRequiredFor, retentionDays,
proofPackRequired, auditExportsEnabled, allowLiveRuns).
- Member roles: owner, admin, operator, viewer. The owner cannot be
removed. upsertMember preserves joinedAt on edit and validates required
fields.
- exportAuditReport pulls run summaries from the Flight Recorder and the
ApprovalQueue snapshot, computes commercial-readiness metrics and a
readiness checklist (simulation default, approval policy, proof pack
policy, audit export, team roster, run evidence), and writes
team-audit.json + team-audit.html under audit-reports/<reportId>.
- Expose getTeamWorkspace, updateTeamWorkspace, upsertTeamMember,
removeTeamMember, and exportTeamAuditReport via a new team IPC route.
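The member rules above (joinedAt preserved on edit, required-field validation, owner cannot be removed) could look roughly like this as pure functions. The `Member` shape, the choice of id and name as the required fields, and the function signatures are assumptions for the example; the store in the PR may differ.

```typescript
type Role = "owner" | "admin" | "operator" | "viewer";

interface Member {
  id: string;
  name: string;
  role: Role;
  joinedAt: number;
}

type MemberInput = Omit<Member, "joinedAt">;

// Insert a new member or update an existing one without touching joinedAt.
function upsertMember(members: Member[], input: MemberInput, now: () => number = Date.now): Member[] {
  // Assumed required fields: a non-empty id and name.
  if (!input.id.trim() || !input.name.trim()) {
    throw new Error("Member id and name are required");
  }
  const existing = members.find((m) => m.id === input.id);
  if (!existing) return [...members, { ...input, joinedAt: now() }];
  return members.map((m) => (m.id === input.id ? { ...input, joinedAt: existing.joinedAt } : m));
}

// Remove a member by id; the owner is protected.
function removeMember(members: Member[], id: string): Member[] {
  const target = members.find((m) => m.id === id);
  if (target?.role === "owner") throw new Error("The owner cannot be removed");
  return members.filter((m) => m.id !== id);
}
```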
… simulation toggle, and session-aware run dispatch
- App.tsx: register lazy /proofpilot, /approvals, /team routes inside the
existing MainLayout.
- SideBar/app-sidebar.tsx: add ProofPilot, Approvals, and Team nav items
with isActive styling and reuse the existing run-in-progress confirmation
dialog via pendingAction.
- ChatInput/index.tsx: add a Simulate switch (FlaskConical icon) that
forwards a 'simulation' run mode to useRunAgent.
- useRunAgent.ts: extend run() with sessionId and mode parameters that flow
through to api.runAgent({ sessionId, mode }); also fix a pre-existing
truthiness bug in the LocalBrowser || LocalComputer permission check.
- pages/proofpilot/index.tsx: timeline + replay UI listing recorded runs,
rendering events with integrity badge and SoM screenshots, and exposing
Proof Pack, Workflow Capsule, and Compiled Workflow exports.
- pages/approvals/index.tsx: review UI to submit, approve, deny, annotate,
and clear resolved approval requests.
- pages/team/index.tsx: Team Console for workspace name, plan, seat limit,
governance policies, member roster, and audit report export.
Summary
This PR introduces ProofPilot — a product layer on top of UI-TARS Desktop focused on safer, auditable, and reproducible agent execution. As GUI agents take real actions on a user's machine, teams increasingly need to prove what the agent did, gate risky actions behind a human, dry-run before going live, and turn a proven run into a reusable workflow. ProofPilot adds exactly those capabilities, plus an org-level governance foundation, while keeping everything fully local (no data leaves the machine).
It is additive and opt-in: live-mode behavior is unchanged apart from best-effort recording, and recording failures can never break an agent run.
Motivation
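Today UI-TARS Desktop renders conversations and screenshots in-session, but there is no built-in way to prove what the agent did, gate risky actions behind a human, dry-run before going live, or turn a proven run into a reusable workflow.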
ProofPilot fills these gaps so the agent can be used for work where mistakes are expensive and accountability matters.
What's included
1. Command safety gate + approval producer hook - packages/agent-infra/mcp-servers/commands
A safety policy evaluates run_command/run_script against rules (destructive file removal, disk/partition mutation, system power actions, privileged/recursive permission changes, destructive git operations, remote curl | sh execution). Risky commands resolve to require_approval, which now emits a CommandApprovalRequest via a producer hook (onApprovalRequired) and consults getDecision before executing, a clean seam any host can wire an approval UI into.

2. ProofPilot Flight Recorder - apps/ui-tars/src/main/services/flightRecorder.ts
Records every agent run as an append-only, SHA-256 hash-chained event log (run_started, status_changed, conversation_added, action_planned, simulation_action_skipped, run_error, run_finished). Each event carries previousHash + eventHash; the manifest holds a rolling recordingHash. Integrity is re-verified on read and reported as verified / failed / legacy. Screenshots and Set-of-Marks screenshots are persisted as content-addressed artifacts (deduplicated by SHA-256).

3. Proof Pack export - self-contained index.html + stripped proof-pack.json, with an embedded timeline, integrity badge, and inline screenshots; shareable as evidence of what the agent did.

4. Workflow Capsule export - converts a recorded run into a portable workflow-capsule.json, pairing each planned action with its source conversation and source event hash.

5. Workflow Compiler - compiles a run into workflow.json + human-readable workflow.md with guardrails (simulation-first default, integrity verification required, approval required for risky actions, explicit stop conditions), per-step retry policy, and source-hash evidence.

6. Approval Queue - apps/ui-tars/src/main/services/approvalQueue.ts
Persistent, idempotent queue (list / submit / resolve / clearResolved): the consumer side of the command safety gate's producer hook.

7. Sandbox / Simulation mode - apps/ui-tars/src/main/services/simulationOperator.ts
Wraps the operator to record-and-skip: the agent observes and plans normally, but no real UI action executes and skipped actions are recorded. The system prompt is annotated and loop count is capped in simulation.

8. Team Workspace - apps/ui-tars/src/main/services/teamWorkspace.ts
Plan/seats, member roles (owner/admin/operator/viewer), governance policies (default run mode, approval policy, retention, proof-pack requirement, audit exports), and an exportable HTML/JSON audit report with a commercial-readiness checklist computed from recorded runs and approvals.

9. UI - apps/ui-tars/src/renderer/src/pages/{proofpilot,approvals,team}
New ProofPilot timeline/replay, Approval Queue, and Team console pages; sidebar navigation; and a Simulate toggle on the chat input that dispatches runs in live or simulation mode (run dispatch is now session-aware).

10. Build/config - made the private-key bytecode plugin optional so local builds without UI_TARS_APP_PRIVATE_KEY_BASE64 don't crash (no effect on official signed builds); added a root postcss.config.cjs and a Vitest setup file for main-process service tests.

How it works
- …runAgent) via the existing onData/onError callbacks; recording is wrapped so a recorder error is logged and swallowed, never breaking the run.
- …execute to record the planned action and return a non-executing result.
- …@ui-tars/electron-ipc layer; three new routers (proofPilot, approval, team) are merged into the shared Router.
- …userData/proofpilot/:

Backward compatibility
Fully additive and opt-in. Simulation is off by default; live-mode agent behavior is unchanged apart from best-effort recording. No changes to existing operator or agent control flow. The bytecode-plugin change is a no-op for official signed builds (where the key env var is set).
Testing
- …run_script gate, prompt gate), 9 integration tests skipped by design.
- …flightRecorder, approvalQueue, simulationOperator, teamWorkspace): 9 passed.
- …typecheck:node + typecheck:web clean.
- …electron-vite build (main / preload / renderer) succeeds; only the pre-existing third-party eval / chunk-size / empty-private-chunk warnings remain.

Notes for reviewers
- …apps/ui-tars/src/main/services, apps/ui-tars/src/main/ipcRoutes, apps/ui-tars/src/renderer/src/pages, plus the commands MCP safety hook.