feat(ui-tars): ProofPilot - auditable, approval-gated, simulation-first agent execution by ManeeshJupalle · Pull Request #1911 · bytedance/UI-TARS-desktop

ManeeshJupalle · 2026-05-30T01:22:16Z

Summary

This PR introduces ProofPilot — a product layer on top of UI-TARS Desktop focused on safer, auditable, and reproducible agent execution. As GUI agents take real actions on a user's machine, teams increasingly need to prove what the agent did, gate risky actions behind a human, dry-run before going live, and turn a proven run into a reusable workflow. ProofPilot adds exactly those capabilities, plus an org-level governance foundation, while keeping everything fully local (no data leaves the machine).

It is additive and opt-in: live-mode behavior is unchanged apart from best-effort recording, and recorder errors are logged and swallowed so they do not interrupt an agent run.

Motivation

Today UI-TARS Desktop renders conversations and screenshots in-session, but there is:

no tamper-evident record of what happened (for audit, debugging, or sharing),
no way to require human approval before a risky action executes,
no simulation / dry-run mode to plan a workflow without touching the real UI,
no way to export a run as evidence or convert it into a reusable workflow,
no team-level policy surface for governing how agents are allowed to run.

ProofPilot fills these gaps so the agent can be used for work where mistakes are expensive and accountability matters.

What's included

1. Command safety gate + approval producer hook — packages/agent-infra/mcp-servers/commands
A safety policy evaluates run_command / run_script against rules (destructive file removal, disk/partition mutation, system power actions, privileged/recursive permission changes, destructive git operations, remote curl | sh execution). Risky commands resolve to require_approval, which now emits a CommandApprovalRequest via a producer hook (onApprovalRequired) and consults getDecision before executing — a clean seam any host can wire an approval UI into.

2. ProofPilot Flight Recorder — apps/ui-tars/src/main/services/flightRecorder.ts
Records every agent run as an append-only, SHA-256 hash-chained event log (run_started, status_changed, conversation_added, action_planned, simulation_action_skipped, run_error, run_finished). Each event carries previousHash + eventHash; the manifest holds a rolling recordingHash. Integrity is re-verified on read and reported as verified / failed / legacy. Screenshots and Set-of-Marks screenshots are persisted as content-addressed artifacts (deduplicated by SHA-256).

3. Proof Pack export — self-contained index.html + stripped proof-pack.json, with an embedded timeline, integrity badge, and inline screenshots — shareable as evidence of what the agent did.

4. Workflow Capsule export — converts a recorded run into a portable workflow-capsule.json, pairing each planned action with its source conversation and source event hash.

5. Workflow Compiler — compiles a run into workflow.json + human-readable workflow.md with guardrails (simulation-first default, integrity verification required, approval required for risky actions, explicit stop conditions), per-step retry policy, and source-hash evidence.

6. Approval Queue — apps/ui-tars/src/main/services/approvalQueue.ts
Persistent, idempotent queue (list / submit / resolve / clearResolved) — the consumer side of the command safety gate's producer hook.

7. Sandbox / Simulation mode — apps/ui-tars/src/main/services/simulationOperator.ts
Wraps the operator to record-and-skip: the agent observes and plans normally, but no real UI action executes and skipped actions are recorded. The system prompt is annotated and loop count is capped in simulation.

8. Team Workspace — apps/ui-tars/src/main/services/teamWorkspace.ts
Plan/seats, member roles (owner/admin/operator/viewer), governance policies (default run mode, approval policy, retention, proof-pack requirement, audit exports), and an exportable HTML/JSON audit report with a commercial-readiness checklist computed from recorded runs and approvals.

9. UI — apps/ui-tars/src/renderer/src/pages/{proofpilot,approvals,team}
New ProofPilot timeline/replay, Approval Queue, and Team console pages; sidebar navigation; and a Simulate toggle on the chat input that dispatches runs in live or simulation mode (run dispatch is now session-aware).

10. Build/config — made the private-key bytecode plugin optional so local builds without UI_TARS_APP_PRIVATE_KEY_BASE64 don't crash (no effect on official signed builds); added a root postcss.config.cjs and a Vitest setup file for main-process service tests.

How it works

The Flight Recorder hooks into the agent run loop (runAgent) via the existing onData / onError callbacks; recording is wrapped so a recorder error is logged and swallowed, never breaking the run.
Simulation mode wraps the selected operator's execute to record the planned action and return a non-executing result.
The command safety gate (producer) and the Approval Queue (consumer) communicate through a small hook contract, so approval is enforced before a gated command runs.
All renderer↔main calls use the existing type-safe @ui-tars/electron-ipc layer; three new routers (proofPilot, approval, team) are merged into the shared Router.
All state is persisted locally under userData/proofpilot/:

proofpilot/
├── flight-records/<runId>/
│   ├── manifest.json          # run metadata + rolling recordingHash
│   ├── events.jsonl           # append-only, hash-chained event log
│   ├── screenshots/           # content-addressed artifacts (sha256)
│   ├── proof-pack/            # exported HTML + JSON evidence
│   ├── workflow-capsule/      # exported reusable capsule
│   └── compiled-workflow/     # workflow.json + workflow.md
├── approval-queue.json
└── team/{workspace.json, audit-reports/<id>/...}
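
The content-addressed screenshots/ directory in the layout above can be pictured with a minimal sketch. It assumes an in-memory map instead of the real on-disk store, and the names `ArtifactStore`, `put`, and `get` are hypothetical, not taken from flightRecorder.ts. It only illustrates the deduplication idea: the SHA-256 of the bytes is the key, so identical screenshots are stored once.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical in-memory stand-in for the screenshots/ directory.
class ArtifactStore {
  private readonly blobs = new Map<string, Buffer>();

  // Store bytes under their SHA-256 digest; identical bytes reuse the same entry.
  put(bytes: Buffer): string {
    const digest = createHash('sha256').update(bytes).digest('hex');
    if (!this.blobs.has(digest)) {
      this.blobs.set(digest, bytes);
    }
    return digest;
  }

  get(digest: string): Buffer | undefined {
    return this.blobs.get(digest);
  }

  get size(): number {
    return this.blobs.size;
  }
}

const artifactStore = new ArtifactStore();
const firstRef = artifactStore.put(Buffer.from('frame-1'));
const repeatRef = artifactStore.put(Buffer.from('frame-1')); // same bytes, same key
const otherRef = artifactStore.put(Buffer.from('frame-2'));
```

Events would then reference a screenshot by its digest rather than embedding the bytes, which keeps events.jsonl small.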

Backward compatibility

Fully additive and opt-in. Simulation is off by default; live-mode agent behavior is unchanged apart from best-effort recording. In live mode the existing operator and agent control flow are unchanged; the operator is wrapped only when simulation is selected. The bytecode-plugin change is a no-op for official signed builds (where the key env var is set).

Testing

commands-MCP suite: 10 passed (full safety-gate + approval-hook coverage: approval-without-execute, producer-hook publish, approve-on-retry, run_script gate, prompt gate), 9 integration tests skipped by design.
ProofPilot service suites (flightRecorder, approvalQueue, simulationOperator, teamWorkspace): 9 passed.
Typecheck: typecheck:node + typecheck:web clean.
Production build: electron-vite build (main / preload / renderer) succeeds; only the pre-existing third-party eval / chunk-size / empty-private-chunk warnings remain.

Note: the broader app test suite includes cases that require a live Electron runtime / platform-native modules and are not runnable headless; those are unrelated to and unchanged by this PR.

Notes for reviewers

~6.4k LOC across 9 focused commits — reviewing commit-by-commit is recommended.
New code is isolated under apps/ui-tars/src/main/services, apps/ui-tars/src/main/ipcRoutes, apps/ui-tars/src/renderer/src/pages, plus the commands MCP safety hook.
Privacy by design: all recording, approval, and team state is stored locally; nothing is transmitted off-device.

Screenshots of the ProofPilot timeline, Approval Queue, and Team console can be attached here.

…tcss + vitest setup
- Make the bytecode plugin in electron.vite.config.ts optional so local builds without UI_TARS_APP_PRIVATE_KEY_BASE64 do not crash.
- Add a root postcss.config.cjs that delegates to the existing app config.
- Add apps/ui-tars/scripts/vitest-setup.ts for main-process service tests.

- Introduce CommandApprovalHooks (onApprovalRequired, getDecision, now) on the commands MCP server so a host can subscribe to approval requests and supply a decision before the gated command is executed.
- evaluateSafetyWithApprovals consults getDecision first to fast-path already-approved or already-denied requests, otherwise publishes the CommandApprovalRequest via onApprovalRequired and returns require_approval so the call is blocked until a decision arrives.
- Extend safety-policy.ts with stable SHA-256 approval request ids, risk level mapping, and a CommandApprovalRequest builder.
- Cover the new producer hook in tests/server-in-memory.test.ts.
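
The decision flow in this commit message can be sketched as follows. This is a simplified illustration, not the code in safety-policy.ts: the two regex rules, the `Verdict` type, the request shape, and the synchronous hook signatures are assumptions made for the example (the real hooks also include `now`, which is omitted here).

```typescript
import { createHash } from 'node:crypto';

type Decision = 'approved' | 'denied';
type Verdict = 'allow' | 'deny' | 'require_approval';

// Hypothetical, reduced request shape.
interface CommandApprovalRequest {
  id: string;
  command: string;
  ruleId: string;
}

interface CommandApprovalHooks {
  onApprovalRequired?: (request: CommandApprovalRequest) => void;
  getDecision?: (requestId: string) => Decision | undefined;
}

// Illustrative rules only; the real policy covers more cases.
const RULES: { id: string; pattern: RegExp }[] = [
  { id: 'destructive-rm', pattern: /\brm\s+-rf\b/ },
  { id: 'curl-pipe-sh', pattern: /\bcurl\b[^|]*\|\s*(?:ba)?sh\b/ },
];

// Stable id: the same rule and command always hash to the same request id.
function approvalRequestId(ruleId: string, command: string): string {
  return createHash('sha256').update(`${ruleId}\n${command}`).digest('hex');
}

function evaluateSafetyWithApprovals(command: string, hooks: CommandApprovalHooks): Verdict {
  const rule = RULES.find((candidate) => candidate.pattern.test(command));
  if (!rule) return 'allow';

  const request: CommandApprovalRequest = {
    id: approvalRequestId(rule.id, command),
    command,
    ruleId: rule.id,
  };

  // Fast path: a decision may already exist for this exact request.
  const decision = hooks.getDecision?.(request.id);
  if (decision === 'approved') return 'allow';
  if (decision === 'denied') return 'deny';

  // Otherwise publish the request and block the command.
  hooks.onApprovalRequired?.(request);
  return 'require_approval';
}

// A host wires the hooks to its own storage; here, plain in-memory structures.
const published: CommandApprovalRequest[] = [];
const decisions = new Map<string, Decision>();
const hooks: CommandApprovalHooks = {
  onApprovalRequired: (request) => published.push(request),
  getDecision: (requestId) => decisions.get(requestId),
};
```

Because the id is derived from the rule and the command, a retry of the same command after approval resolves through `getDecision` without publishing a new request.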

…of Pack, Workflow Capsule and Compiled Workflow exports
- Record every agent run under userData/proofpilot/flight-records/<runId> with a manifest.json, an append-only events.jsonl, and content-hashed screenshots/SoM screenshots.
- Event types: run_started, status_changed, conversation_added, action_planned, simulation_action_skipped, run_error, run_finished.
- Every event carries previousHash + eventHash (SHA-256 over canonical JSON) so the chain can be re-validated as verified | failed | legacy.
- exportProofPack renders a self-contained HTML proof pack with embedded artifacts plus a stripped JSON dump.
- exportWorkflowCapsule pairs action_planned events with their preceding conversation_added events and emits a draft workflow.json artifact.
- compileWorkflow wraps the capsule with guardrails (simulation default, integrity verification, approval for risky actions, stop conditions), per-step retry policy, evidence hashes, and emits workflow.json + workflow.md.
- Expose listProofPilotRuns, getProofPilotRun, exportProofPilotRun, exportProofPilotWorkflowCapsule and compileProofPilotWorkflow via a new proofPilot IPC route, registered in ipcRoutes.
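
To make the previousHash + eventHash chain concrete, here is a minimal sketch of appending and re-verifying events. The field layout, the all-zero genesis hash, and the sorted-key canonicalization are assumptions for illustration; the recorder's actual canonical JSON and the `legacy` status (presumably for records without hashes) are not reproduced here.

```typescript
import { createHash } from 'node:crypto';

interface ChainedEvent {
  type: string;
  payload: Record<string, unknown>;
  previousHash: string;
  eventHash: string;
}

// Assumed starting value for the first event's previousHash.
const GENESIS_HASH = '0'.repeat(64);

// Canonical JSON: object keys are sorted so equal values serialize identically.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalJson).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const record = value as Record<string, unknown>;
    const fields = Object.keys(record)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(record[key])}`);
    return `{${fields.join(',')}}`;
  }
  return JSON.stringify(value ?? null);
}

function hashEvent(type: string, payload: Record<string, unknown>, previousHash: string): string {
  return createHash('sha256')
    .update(canonicalJson({ type, payload, previousHash }))
    .digest('hex');
}

// Append-only: each new event commits to the hash of the one before it.
function appendEvent(log: ChainedEvent[], type: string, payload: Record<string, unknown>): ChainedEvent {
  const last = log[log.length - 1];
  const previousHash = last ? last.eventHash : GENESIS_HASH;
  const event: ChainedEvent = { type, payload, previousHash, eventHash: hashEvent(type, payload, previousHash) };
  log.push(event);
  return event;
}

// Recompute every hash and check every link; any edit or reorder breaks the chain.
function verifyChain(log: ChainedEvent[]): 'verified' | 'failed' {
  let expectedPrevious = GENESIS_HASH;
  for (const event of log) {
    if (event.previousHash !== expectedPrevious) return 'failed';
    if (event.eventHash !== hashEvent(event.type, event.payload, event.previousHash)) return 'failed';
    expectedPrevious = event.eventHash;
  }
  return 'verified';
}
```

In this sketch the last event's eventHash plays the role of a rolling hash for the whole log: changing any earlier event changes every hash after it.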

…actions
- Persistent JSON queue at userData/proofpilot/approval-queue.json holding ApprovalRequest records (id, title, reason, source, riskLevel, ruleId, subject, status, createdAt, updatedAt, decidedAt, decisionReason).
- submit() is idempotent on a SHA-256 id derived from title+source+ruleId+subject when an id is not supplied, so the commands MCP producer hook can call it repeatedly without duplicating pending requests.
- resolve() marks a request approved or denied with an optional decision reason; clearResolved() trims out non-pending entries.
- list() returns a snapshot sorted pending-first then by recency, along with counts per status.
- Expose listApprovalRequests, submitApprovalRequest, resolveApprovalRequest and clearResolvedApprovalRequests via a new approval IPC route.
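
The idempotent submit() described here could look roughly like the sketch below. It is an in-memory illustration with a reduced record shape, not approvalQueue.ts: persistence to approval-queue.json is omitted, `InMemoryApprovalQueue` is a hypothetical name, and two behaviors are assumptions, namely that re-submitting an already resolved request returns the existing record and that "recency" means newest createdAt first.

```typescript
import { createHash } from 'node:crypto';

type ApprovalStatus = 'pending' | 'approved' | 'denied';

interface ApprovalInput {
  id?: string;
  title: string;
  source: string;
  ruleId: string;
  subject: string;
}

interface ApprovalRequest extends ApprovalInput {
  id: string;
  status: ApprovalStatus;
  createdAt: number;
  decisionReason?: string | undefined;
}

class InMemoryApprovalQueue {
  private readonly requests = new Map<string, ApprovalRequest>();

  // Idempotent: the same title/source/ruleId/subject always maps to the same id.
  submit(input: ApprovalInput, now: number = Date.now()): ApprovalRequest {
    const id =
      input.id ??
      createHash('sha256')
        .update([input.title, input.source, input.ruleId, input.subject].join('\n'))
        .digest('hex');
    const existing = this.requests.get(id);
    if (existing) return existing; // assumption: resolved requests are also reused
    const request: ApprovalRequest = { ...input, id, status: 'pending', createdAt: now };
    this.requests.set(id, request);
    return request;
  }

  resolve(id: string, status: 'approved' | 'denied', decisionReason?: string): ApprovalRequest {
    const request = this.requests.get(id);
    if (!request) throw new Error(`Unknown approval request: ${id}`);
    request.status = status;
    request.decisionReason = decisionReason;
    return request;
  }

  // Drops every non-pending entry and reports how many were removed.
  clearResolved(): number {
    let removed = 0;
    for (const [id, request] of Array.from(this.requests.entries())) {
      if (request.status !== 'pending') {
        this.requests.delete(id);
        removed += 1;
      }
    }
    return removed;
  }

  // Snapshot sorted pending-first, then newest first, with per-status counts.
  list(): { requests: ApprovalRequest[]; counts: Record<ApprovalStatus, number> } {
    const requests = Array.from(this.requests.values()).sort((a, b) => {
      const rankA = a.status === 'pending' ? 0 : 1;
      const rankB = b.status === 'pending' ? 0 : 1;
      return rankA - rankB || b.createdAt - a.createdAt;
    });
    const counts: Record<ApprovalStatus, number> = { pending: 0, approved: 0, denied: 0 };
    for (const request of requests) counts[request.status] += 1;
    return { requests, counts };
  }
}
```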

…ips real actions
- createSimulationOperator wraps any operator and replaces its execute with a recorder-only path that calls an onSimulatedAction handler and returns { simulated: true, actionType, actionInputs } with a status of END for terminal actions (finished, call_user, error_env, user_stop) and RUNNING otherwise.
- Lets the agent loop plan workflows end-to-end with no actual UI side effects, so the flight recorder can capture a full simulation_action_skipped trail for replay and workflow compilation.
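
A record-and-skip wrapper of this kind might be sketched as below. The `Operator`, `ExecuteParams`, and `ExecuteResult` shapes are simplified, hypothetical stand-ins (the real operator interface is richer and presumably asynchronous); only the returned fields and the terminal action names come from the commit message.

```typescript
type RunStatus = 'RUNNING' | 'END';

// Hypothetical, simplified operator contract for illustration.
interface ExecuteParams {
  actionType: string;
  actionInputs: Record<string, unknown>;
}

interface ExecuteResult {
  status: RunStatus;
  simulated?: boolean;
  actionType?: string;
  actionInputs?: Record<string, unknown>;
}

interface Operator {
  execute(params: ExecuteParams): ExecuteResult;
}

// Terminal action names as listed in the commit message.
const TERMINAL_ACTIONS = new Set(['finished', 'call_user', 'error_env', 'user_stop']);

function createSimulationOperator(
  operator: Operator,
  onSimulatedAction: (params: ExecuteParams) => void,
): Operator {
  // `operator` is intentionally never called: that is what makes the run a dry run.
  void operator;
  return {
    execute(params: ExecuteParams): ExecuteResult {
      onSimulatedAction(params); // record the planned action instead of performing it
      return {
        simulated: true,
        actionType: params.actionType,
        actionInputs: params.actionInputs,
        status: TERMINAL_ACTIONS.has(params.actionType) ? 'END' : 'RUNNING',
      };
    },
  };
}

// Example: the wrapped operator's side effect never fires.
let realClicks = 0;
const realOperator: Operator = {
  execute: () => {
    realClicks += 1;
    return { status: 'RUNNING' };
  },
};
const skippedActions: ExecuteParams[] = [];
const simulated = createSimulationOperator(realOperator, (params) => skippedActions.push(params));
```

Returning END for terminal actions lets the agent loop stop normally even though nothing was executed.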

…gent run loop
- Add activeSessionId and runMode ('live' | 'simulation') to AppState so the main store tracks the originating session and selected mode for the current run, and reset them on clearHistory.
- agentRoute.runAgent now accepts { sessionId?, mode? } and seeds the store before delegating to runAgent().
- runAgent() starts a FlightRecording at the beginning of every run, records agent data (with SoM screenshots) on each onData callback, records errors on onError and on a thrown loop, and finishes the recording with the final status in a try/finally so the chain always closes cleanly.
- In simulation mode the operator is wrapped via createSimulationOperator, the system prompt is amended to tell the model no real action will be taken, and maxLoopCount is capped at min(setting, 5) to keep simulated runs bounded.
- Recorder failures are swallowed via a recordFlight helper so a disk or serialization issue cannot break a live agent run.
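
Two small pieces of this commit lend themselves to a sketch: the error-swallowing recordFlight helper and the simulation loop cap. The signatures below are assumptions (in particular the `label` and `log` parameters and the helper name `effectiveMaxLoopCount` are hypothetical); only the behavior, swallow-and-log plus min(setting, 5), comes from the text.

```typescript
type RunMode = 'live' | 'simulation';

// Run a recorder call; log and swallow any failure so the agent run continues.
function recordFlight(
  label: string,
  record: () => void,
  log: (message: string) => void = console.warn,
): void {
  try {
    record();
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    log(`[flight-recorder] ${label} failed: ${reason}`);
  }
}

// Simulated runs are bounded at 5 loops; live runs keep the configured limit.
function effectiveMaxLoopCount(mode: RunMode, configured: number): number {
  return mode === 'simulation' ? Math.min(configured, 5) : configured;
}
```

For example, `recordFlight('conversation_added', () => recorder.append(event))` would keep the run going even if the hypothetical `recorder.append` throws on a full disk.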

…policies, and audit report export
- TeamWorkspaceStore persists workspace.json under userData/proofpilot/team with workspaceName, plan (solo|team|business), seatLimit, members and policies (defaultRunMode, approvalRequiredFor, retentionDays, proofPackRequired, auditExportsEnabled, allowLiveRuns).
- Member roles: owner, admin, operator, viewer. The owner cannot be removed. upsertMember preserves joinedAt on edit and validates required fields.
- exportAuditReport pulls run summaries from the Flight Recorder and the ApprovalQueue snapshot, computes commercial-readiness metrics and a readiness checklist (simulation default, approval policy, proof pack policy, audit export, team roster, run evidence), and writes team-audit.json + team-audit.html under audit-reports/<reportId>.
- Expose getTeamWorkspace, updateTeamWorkspace, upsertTeamMember, removeTeamMember, and exportTeamAuditReport via a new team IPC route.
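
The member rules stated here (owner cannot be removed, joinedAt preserved on edit, required fields validated) can be illustrated with a small sketch over a plain array. The `Member` shape and the choice of `id` and `name` as the required fields are assumptions; the real TeamWorkspaceStore may validate different fields and persists to workspace.json.

```typescript
type Role = 'owner' | 'admin' | 'operator' | 'viewer';

// Hypothetical member record for illustration.
interface Member {
  id: string;
  name: string;
  role: Role;
  joinedAt: number;
}

type MemberInput = Omit<Member, 'joinedAt'>;

// Insert a new member or edit an existing one, keeping the original joinedAt.
function upsertMember(members: Member[], input: MemberInput, now: number): Member[] {
  if (input.id.trim() === '' || input.name.trim() === '') {
    throw new Error('Member id and name are required');
  }
  const existing = members.find((member) => member.id === input.id);
  const next: Member = { ...input, joinedAt: existing ? existing.joinedAt : now };
  return existing
    ? members.map((member) => (member.id === input.id ? next : member))
    : [...members, next];
}

// Remove a member by id; the owner is protected.
function removeMember(members: Member[], id: string): Member[] {
  const target = members.find((member) => member.id === id);
  if (target && target.role === 'owner') {
    throw new Error('The owner cannot be removed');
  }
  return members.filter((member) => member.id !== id);
}
```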

… simulation toggle, and session-aware run dispatch
- App.tsx: register lazy /proofpilot, /approvals, /team routes inside the existing MainLayout.
- SideBar/app-sidebar.tsx: add ProofPilot, Approvals, and Team nav items with isActive styling and reuse the existing run-in-progress confirmation dialog via pendingAction.
- ChatInput/index.tsx: add a Simulate switch (FlaskConical icon) that forwards a 'simulation' run mode to useRunAgent.
- useRunAgent.ts: extend run() with sessionId and mode parameters that flow through to api.runAgent({ sessionId, mode }); also fix a pre-existing truthiness bug in the LocalBrowser || LocalComputer permission check.
- pages/proofpilot/index.tsx: timeline + replay UI listing recorded runs, rendering events with integrity badge and SoM screenshots, and exposing Proof Pack, Workflow Capsule, and Compiled Workflow exports.
- pages/approvals/index.tsx: review UI to submit, approve, deny, annotate, and clear resolved approval requests.
- pages/team/index.tsx: Team Console for workspace name, plan, seat limit, governance policies, member roster, and audit report export.

netlify · 2026-05-30T01:22:20Z

✅ Deploy Preview for agent-tars-docs canceled.

Name	Link
🔨 Latest commit	`f22bff6`
🔍 Latest deploy log	https://app.netlify.com/projects/agent-tars-docs/deploys/6a1a3bca9733600008876e03

netlify · 2026-05-30T01:22:20Z

✅ Deploy Preview for tarko canceled.

Name	Link
🔨 Latest commit	`f22bff6`
🔍 Latest deploy log	https://app.netlify.com/projects/tarko/deploys/6a1a3bca22e2db000884b914

ManeeshJupalle added 9 commits May 26, 2026 02:37

feat: add command safety approval gate

3cab33e

ManeeshJupalle wants to merge 9 commits into bytedance:main from ManeeshJupalle:product/proofpilot-flight-recorder
