
feat(ui-tars): ProofPilot - auditable, approval-gated, simulation-first agent execution #1911

Open
ManeeshJupalle wants to merge 9 commits into bytedance:main from ManeeshJupalle:product/proofpilot-flight-recorder

Summary

This PR introduces ProofPilot — a product layer on top of UI-TARS Desktop focused on safer, auditable, and reproducible agent execution. As GUI agents take real actions on a user's machine, teams increasingly need to prove what the agent did, gate risky actions behind a human, dry-run before going live, and turn a proven run into a reusable workflow. ProofPilot adds exactly those capabilities, plus an org-level governance foundation, while keeping everything fully local (no data leaves the machine).

It is additive and opt-in: live-mode behavior is unchanged apart from best-effort recording, and recording failures can never break an agent run.

Motivation

Today UI-TARS Desktop renders conversations and screenshots in-session, but there is:

  • no tamper-evident record of what happened (for audit, debugging, or sharing),
  • no way to require human approval before a risky action executes,
  • no simulation / dry-run mode to plan a workflow without touching the real UI,
  • no way to export a run as evidence or convert it into a reusable workflow,
  • no team-level policy surface for governing how agents are allowed to run.

ProofPilot fills these gaps so the agent can be used for work where mistakes are expensive and accountability matters.

What's included

1. Command safety gate + approval producer hook (packages/agent-infra/mcp-servers/commands)
A safety policy evaluates run_command / run_script against rules (destructive file removal, disk/partition mutation, system power actions, privileged/recursive permission changes, destructive git operations, remote curl | sh execution). Risky commands resolve to require_approval, which now emits a CommandApprovalRequest via a producer hook (onApprovalRequired) and consults getDecision before executing — a clean seam any host can wire an approval UI into.
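
To make the rule evaluation concrete, here is a minimal sketch of a pattern-based policy that maps a command string to allow or require_approval. The rule ids, the regular expressions, and the `evaluateSafety` name are illustrative assumptions and are not taken from the actual safety-policy.ts.

```typescript
// Hypothetical rule table: ids and patterns are assumptions for illustration only.
type SafetyDecision = "allow" | "require_approval";

interface SafetyRule {
  id: string;
  description: string;
  pattern: RegExp;
}

const RULES: SafetyRule[] = [
  { id: "destructive-rm", description: "recursive or forced file removal", pattern: /\brm\s+(-[a-zA-Z]*[rf][a-zA-Z]*\s+)+/ },
  { id: "disk-mutation", description: "disk or partition mutation", pattern: /\b(mkfs(\.\w+)?|fdisk|diskpart)\b|\bdd\s+.*\bof=\/dev\// },
  { id: "power-action", description: "system power action", pattern: /\b(shutdown|reboot|poweroff|halt)\b/ },
  { id: "recursive-permissions", description: "recursive permission change", pattern: /\b(chmod|chown)\s+(-[a-zA-Z]*R[a-zA-Z]*)\b/ },
  { id: "destructive-git", description: "destructive git operation", pattern: /\bgit\s+(reset\s+--hard|clean\s+-[a-zA-Z]*f|push\s+.*--force)/ },
  { id: "remote-pipe-to-shell", description: "remote script piped to a shell", pattern: /\b(curl|wget)\b[^|]*\|\s*(sudo\s+)?(ba|z)?sh\b/ },
];

// Returns the first matching rule; commands that match no rule are allowed.
function evaluateSafety(command: string): { decision: SafetyDecision; ruleId?: string } {
  const hit = RULES.find((rule) => rule.pattern.test(command));
  return hit ? { decision: "require_approval", ruleId: hit.id } : { decision: "allow" };
}
```

Regex matching is only a first-line heuristic: quoting, aliases, and shell expansion can hide a risky command, which is one reason to pair it with a human approval step rather than rely on it alone.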

2. ProofPilot Flight Recorder (apps/ui-tars/src/main/services/flightRecorder.ts)
Records every agent run as an append-only, SHA-256 hash-chained event log (run_started, status_changed, conversation_added, action_planned, simulation_action_skipped, run_error, run_finished). Each event carries previousHash + eventHash; the manifest holds a rolling recordingHash. Integrity is re-verified on read and reported as verified / failed / legacy. Screenshots and Set-of-Marks screenshots are persisted as content-addressed artifacts (deduplicated by SHA-256).
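
The hash chain described above can be sketched as follows. This is a minimal in-memory illustration, assuming an all-zero genesis hash and a sorted-key canonical JSON encoding; the `FlightEvent` shape and the helper names are hypothetical, and the `legacy` status (for records without hashes) is omitted.

```typescript
import { createHash } from "node:crypto";

// Hypothetical event shape; the real recorder may carry more fields.
interface FlightEvent {
  type: string;
  payload: unknown;
  previousHash: string;
  eventHash: string;
}

// Assumption: the first event chains from an all-zero hash.
const GENESIS_HASH = "0".repeat(64);

// Canonical JSON: object keys are sorted so equal data always hashes equally.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map((item) => canonicalJson(item)).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).filter((key) => obj[key] !== undefined).sort();
    return `{${keys.map((key) => `${JSON.stringify(key)}:${canonicalJson(obj[key])}`).join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

function hashEvent(type: string, payload: unknown, previousHash: string): string {
  return createHash("sha256").update(canonicalJson({ type, payload, previousHash })).digest("hex");
}

// Appends an event whose hash covers its content and the previous event's hash.
function appendEvent(log: FlightEvent[], type: string, payload: unknown): FlightEvent {
  const last = log[log.length - 1];
  const previousHash = last ? last.eventHash : GENESIS_HASH;
  const event: FlightEvent = { type, payload, previousHash, eventHash: hashEvent(type, payload, previousHash) };
  log.push(event);
  return event;
}

// Recomputes every hash; any edited, removed, or reordered event breaks the chain.
function verifyChain(log: FlightEvent[]): "verified" | "failed" {
  let expectedPrevious = GENESIS_HASH;
  for (const event of log) {
    if (event.previousHash !== expectedPrevious) return "failed";
    if (event.eventHash !== hashEvent(event.type, event.payload, event.previousHash)) return "failed";
    expectedPrevious = event.eventHash;
  }
  return "verified";
}
```

In this sketch the hash of the last event plays the role of the rolling recordingHash: storing it in the manifest lets a reader detect truncation of the log as well as edits to individual events.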

3. Proof Pack export — self-contained index.html + stripped proof-pack.json, with an embedded timeline, integrity badge, and inline screenshots — shareable as evidence of what the agent did.

4. Workflow Capsule export — converts a recorded run into a portable workflow-capsule.json, pairing each planned action with its source conversation and source event hash.

5. Workflow Compiler — compiles a run into workflow.json + human-readable workflow.md with guardrails (simulation-first default, integrity verification required, approval required for risky actions, explicit stop conditions), per-step retry policy, and source-hash evidence.

6. Approval Queue (apps/ui-tars/src/main/services/approvalQueue.ts)
Persistent, idempotent queue (list / submit / resolve / clearResolved) — the consumer side of the command safety gate's producer hook.

7. Sandbox / Simulation mode (apps/ui-tars/src/main/services/simulationOperator.ts)
Wraps the operator to record-and-skip: the agent observes and plans normally, but no real UI action executes and skipped actions are recorded. The system prompt is annotated and loop count is capped in simulation.

8. Team Workspace (apps/ui-tars/src/main/services/teamWorkspace.ts)
Plan/seats, member roles (owner/admin/operator/viewer), governance policies (default run mode, approval policy, retention, proof-pack requirement, audit exports), and an exportable HTML/JSON audit report with a commercial-readiness checklist computed from recorded runs and approvals.

9. UI (apps/ui-tars/src/renderer/src/pages/{proofpilot,approvals,team})
New ProofPilot timeline/replay, Approval Queue, and Team console pages; sidebar navigation; and a Simulate toggle on the chat input that dispatches runs in live or simulation mode (run dispatch is now session-aware).

10. Build/config — made the private-key bytecode plugin optional so local builds without UI_TARS_APP_PRIVATE_KEY_BASE64 don't crash (no effect on official signed builds); added a root postcss.config.cjs and a Vitest setup file for main-process service tests.

How it works

  • The Flight Recorder hooks into the agent run loop (runAgent) via the existing onData / onError callbacks; recording is wrapped so a recorder error is logged and swallowed, never breaking the run.
  • Simulation mode wraps the selected operator's execute to record the planned action and return a non-executing result.
  • The command safety gate (producer) and the Approval Queue (consumer) communicate through a small hook contract, so approval is enforced before a gated command runs.
  • All renderer↔main calls use the existing type-safe @ui-tars/electron-ipc layer; three new routers (proofPilot, approval, team) are merged into the shared Router.
  • All state is persisted locally under userData/proofpilot/:
proofpilot/
├── flight-records/<runId>/
│   ├── manifest.json          # run metadata + rolling recordingHash
│   ├── events.jsonl           # append-only, hash-chained event log
│   ├── screenshots/           # content-addressed artifacts (sha256)
│   ├── proof-pack/            # exported HTML + JSON evidence
│   ├── workflow-capsule/      # exported reusable capsule
│   └── compiled-workflow/     # workflow.json + workflow.md
├── approval-queue.json
└── team/{workspace.json, audit-reports/<id>/...}

Backward compatibility

Fully additive and opt-in. Simulation is off by default; live-mode agent behavior is unchanged apart from best-effort recording. No changes to existing operator or agent control flow. The bytecode-plugin change is a no-op for official signed builds (where the key env var is set).

Testing

  • commands-MCP suite: 10 passed (full safety-gate + approval-hook coverage: approval-without-execute, producer-hook publish, approve-on-retry, run_script gate, prompt gate), 9 integration tests skipped by design.
  • ProofPilot service suites (flightRecorder, approvalQueue, simulationOperator, teamWorkspace): 9 passed.
  • Typecheck: typecheck:node + typecheck:web clean.
  • Production build: electron-vite build (main / preload / renderer) succeeds; only the pre-existing third-party eval / chunk-size / empty-private-chunk warnings remain.

Note: the broader app test suite includes cases that require a live Electron runtime / platform-native modules and are not runnable headless; those are unrelated to and unchanged by this PR.

Notes for reviewers

  • ~6.4k LOC across 9 focused commits — reviewing commit-by-commit is recommended.
  • New code is isolated under apps/ui-tars/src/main/services, apps/ui-tars/src/main/ipcRoutes, apps/ui-tars/src/renderer/src/pages, plus the commands MCP safety hook.
  • Privacy by design: all recording, approval, and team state is stored locally; nothing is transmitted off-device.

Screenshots of the ProofPilot timeline, Approval Queue, and Team console can be attached here.

…tcss + vitest setup

- Make the bytecode plugin in electron.vite.config.ts optional so local
  builds without UI_TARS_APP_PRIVATE_KEY_BASE64 do not crash.
- Add a root postcss.config.cjs that delegates to the existing app config.
- Add apps/ui-tars/scripts/vitest-setup.ts for main-process service tests.
- Introduce CommandApprovalHooks (onApprovalRequired, getDecision, now) on
  the commands MCP server so a host can subscribe to approval requests and
  supply a decision before the gated command is executed.
- evaluateSafetyWithApprovals consults getDecision first to fast-path
  already-approved or already-denied requests, otherwise publishes the
  CommandApprovalRequest via onApprovalRequired and returns require_approval
  so the call is blocked until a decision arrives.
- Extend safety-policy.ts with stable SHA-256 approval request ids, risk
  level mapping, and a CommandApprovalRequest builder.
- Cover the new producer hook in tests/server-in-memory.test.ts.
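
The hook contract in this commit can be illustrated with a short sketch. The hook names (onApprovalRequired, getDecision, now) come from the commit message; the field shapes, the synchronous signatures, the id derivation, and the `gateCommand` function are assumptions made for the example and may differ from the real evaluateSafetyWithApprovals.

```typescript
import { createHash } from "node:crypto";

type ApprovalDecision = "approved" | "denied";
type GateResult = "allow" | "deny" | "require_approval";

// Hypothetical request shape; only the hook names are taken from the PR text.
interface CommandApprovalRequest {
  id: string;
  command: string;
  ruleId: string;
  createdAt: number;
}

interface CommandApprovalHooks {
  onApprovalRequired?: (request: CommandApprovalRequest) => void;
  getDecision?: (requestId: string) => ApprovalDecision | undefined;
  now?: () => number;
}

// Stable id: the same rule and command always map to the same request.
function approvalRequestId(command: string, ruleId: string): string {
  return createHash("sha256").update(`${ruleId}\n${command}`).digest("hex");
}

// ruleId is the rule the safety policy matched, or undefined for a safe command.
function gateCommand(command: string, ruleId: string | undefined, hooks: CommandApprovalHooks = {}): GateResult {
  if (!ruleId) return "allow";
  const id = approvalRequestId(command, ruleId);
  // Fast path: a decision may already exist from an earlier attempt.
  const decision = hooks.getDecision ? hooks.getDecision(id) : undefined;
  if (decision === "approved") return "allow";
  if (decision === "denied") return "deny";
  // No decision yet: publish the request and block the call.
  const createdAt = hooks.now ? hooks.now() : Date.now();
  if (hooks.onApprovalRequired) hooks.onApprovalRequired({ id, command, ruleId, createdAt });
  return "require_approval";
}
```

Because the id is derived from the command and rule, a caller can retry the same command after a human resolves the request and hit the fast path instead of publishing a new request.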
…of Pack, Workflow Capsule and Compiled Workflow exports

- Record every agent run under userData/proofpilot/flight-records/<runId>
  with a manifest.json, an append-only events.jsonl, and content-hashed
  screenshots/SoM screenshots.
- Event types: run_started, status_changed, conversation_added,
  action_planned, simulation_action_skipped, run_error, run_finished.
- Every event carries previousHash + eventHash (SHA-256 over canonical JSON)
  so the chain can be re-validated as verified | failed | legacy.
- exportProofPack renders a self-contained HTML proof pack with embedded
  artifacts plus a stripped JSON dump.
- exportWorkflowCapsule pairs action_planned events with their preceding
  conversation_added events and emits a draft workflow.json artifact.
- compileWorkflow wraps the capsule with guardrails (simulation default,
  integrity verification, approval for risky actions, stop conditions),
  per-step retry policy, evidence hashes, and emits workflow.json +
  workflow.md.
- Expose listProofPilotRuns, getProofPilotRun, exportProofPilotRun,
  exportProofPilotWorkflowCapsule and compileProofPilotWorkflow via a new
  proofPilot IPC route, registered in ipcRoutes.
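
As a rough illustration of the capsule export, the sketch below pairs each action_planned event with the most recent conversation_added event before it and keeps the source event hash as evidence. The `RecordedEvent` and `CapsuleStep` shapes and the `buildCapsuleSteps` name are hypothetical.

```typescript
// Hypothetical shapes for illustration; the real capsule format may differ.
interface RecordedEvent {
  type: string;
  eventHash: string;
  payload: unknown;
}

interface CapsuleStep {
  index: number;
  action: unknown;
  sourceConversation: unknown;
  sourceEventHash: string;
}

// Walks the log once, remembering the latest conversation seen so far.
function buildCapsuleSteps(events: RecordedEvent[]): CapsuleStep[] {
  const steps: CapsuleStep[] = [];
  let lastConversation: unknown = null;
  for (const event of events) {
    if (event.type === "conversation_added") {
      lastConversation = event.payload;
    } else if (event.type === "action_planned") {
      steps.push({
        index: steps.length + 1,
        action: event.payload,
        sourceConversation: lastConversation,
        sourceEventHash: event.eventHash,
      });
    }
  }
  return steps;
}
```

Carrying the event hash into each step lets a compiled workflow be traced back to the exact recorded event it was derived from.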
…actions

- Persistent JSON queue at userData/proofpilot/approval-queue.json holding
  ApprovalRequest records (id, title, reason, source, riskLevel, ruleId,
  subject, status, createdAt, updatedAt, decidedAt, decisionReason).
- submit() is idempotent on a SHA-256 id derived from title+source+ruleId+
  subject when an id is not supplied, so the commands MCP producer hook can
  call it repeatedly without duplicating pending requests.
- resolve() marks a request approved or denied with an optional decision
  reason; clearResolved() trims out non-pending entries.
- list() returns a snapshot sorted pending-first then by recency, along
  with counts per status.
- Expose listApprovalRequests, submitApprovalRequest, resolveApprovalRequest
  and clearResolvedApprovalRequests via a new approval IPC route.
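
The queue semantics above can be sketched in memory as follows. Persistence to approval-queue.json is left out, only a subset of the listed fields is modeled, and the class name, the injected clock, and the choice to return an existing record unchanged on a repeated submit are assumptions for the example.

```typescript
import { createHash } from "node:crypto";

type ApprovalStatus = "pending" | "approved" | "denied";

// Subset of the fields listed in the commit message.
interface ApprovalRequest {
  id: string;
  title: string;
  source: string;
  ruleId: string;
  subject: string;
  status: ApprovalStatus;
  createdAt: number;
  updatedAt: number;
  decisionReason?: string;
}

interface ApprovalInput {
  id?: string;
  title: string;
  source: string;
  ruleId: string;
  subject: string;
}

// Derives a stable id from title + source + ruleId + subject.
function deriveId(input: ApprovalInput): string {
  const material = JSON.stringify([input.title, input.source, input.ruleId, input.subject]);
  return createHash("sha256").update(material).digest("hex");
}

class InMemoryApprovalQueue {
  private readonly requests = new Map<string, ApprovalRequest>();
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Idempotent: a repeated submit returns the existing record unchanged.
  submit(input: ApprovalInput): ApprovalRequest {
    const id = input.id ?? deriveId(input);
    const existing = this.requests.get(id);
    if (existing) return existing;
    const ts = this.now();
    const request: ApprovalRequest = {
      id, title: input.title, source: input.source, ruleId: input.ruleId,
      subject: input.subject, status: "pending", createdAt: ts, updatedAt: ts,
    };
    this.requests.set(id, request);
    return request;
  }

  resolve(id: string, status: "approved" | "denied", decisionReason?: string): ApprovalRequest {
    const request = this.requests.get(id);
    if (!request) throw new Error(`Unknown approval request: ${id}`);
    request.status = status;
    request.updatedAt = this.now();
    if (decisionReason !== undefined) request.decisionReason = decisionReason;
    return request;
  }

  clearResolved(): void {
    for (const request of Array.from(this.requests.values())) {
      if (request.status !== "pending") this.requests.delete(request.id);
    }
  }

  // Pending first, then most recently updated first, plus counts per status.
  list(): { items: ApprovalRequest[]; counts: Record<ApprovalStatus, number> } {
    const items = Array.from(this.requests.values()).sort((a, b) => {
      const rank = Number(a.status !== "pending") - Number(b.status !== "pending");
      return rank !== 0 ? rank : b.updatedAt - a.updatedAt;
    });
    const counts: Record<ApprovalStatus, number> = { pending: 0, approved: 0, denied: 0 };
    for (const item of items) counts[item.status] += 1;
    return { items, counts };
  }
}
```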
…ips real actions

- createSimulationOperator wraps any operator and replaces its execute
  with a recorder-only path that calls an onSimulatedAction handler and
  returns { simulated: true, actionType, actionInputs } with a status of
  END for terminal actions (finished, call_user, error_env, user_stop)
  and RUNNING otherwise.
- Lets the agent loop plan workflows end-to-end with no actual UI side
  effects, so the flight recorder can capture a full simulation_action_skipped
  trail for replay and workflow compilation.
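
A minimal sketch of the record-and-skip wrapper follows. The `Operator`, `ExecuteParams`, and `ExecuteResult` types here are simplified stand-ins rather than the SDK's real types, and the function is named `createSimulationOperatorSketch` to make clear it is not the PR's implementation.

```typescript
// Simplified stand-in types; the real operator interface is richer.
type OperatorStatus = "RUNNING" | "END";

interface ExecuteParams {
  actionType: string;
  actionInputs: Record<string, unknown>;
}

interface ExecuteResult {
  status: OperatorStatus;
  simulated?: boolean;
  actionType?: string;
  actionInputs?: Record<string, unknown>;
}

interface Operator {
  screenshot(): Promise<string>;
  execute(params: ExecuteParams): Promise<ExecuteResult>;
}

// Terminal action types, as listed in the commit message.
const TERMINAL_ACTIONS = new Set(["finished", "call_user", "error_env", "user_stop"]);

function createSimulationOperatorSketch(
  operator: Operator,
  onSimulatedAction: (params: ExecuteParams) => void,
): Operator {
  return {
    // Observation still goes to the real operator so the agent can plan.
    screenshot: () => operator.screenshot(),
    // Execution is replaced: record the planned action, never call the real execute.
    execute: async (params) => {
      onSimulatedAction(params);
      return {
        simulated: true,
        actionType: params.actionType,
        actionInputs: params.actionInputs,
        status: TERMINAL_ACTIONS.has(params.actionType) ? "END" : "RUNNING",
      };
    },
  };
}
```

Returning END for terminal actions lets the agent loop stop the same way it would in a live run, while every other action reports RUNNING so planning continues.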
…gent run loop

- Add activeSessionId and runMode ('live' | 'simulation') to AppState so the
  main store tracks the originating session and selected mode for the
  current run, and reset them on clearHistory.
- agentRoute.runAgent now accepts { sessionId?, mode? } and seeds the
  store before delegating to runAgent().
- runAgent() starts a FlightRecording at the beginning of every run,
  records agent data (with SoM screenshots) on each onData callback,
  records errors on onError and on a thrown loop, and finishes the
  recording with the final status in a try/finally so the chain always
  closes cleanly.
- In simulation mode the operator is wrapped via createSimulationOperator,
  the system prompt is amended to tell the model no real action will be
  taken, and maxLoopCount is capped at min(setting, 5) to keep simulated
  runs bounded.
- Recorder failures are swallowed via a recordFlight helper so a disk or
  serialization issue cannot break a live agent run.
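
The best-effort recording and the simulation loop cap can be sketched as below. The `recordFlight` name and the cap of 5 come from the commit message; the `Recorder` interface, the status strings, the synchronous signatures, and the choice to return a status instead of rethrowing are assumptions for the example.

```typescript
type RunMode = "live" | "simulation";

// Hypothetical recorder surface for the sketch.
interface Recorder {
  start(): void;
  error(reason: unknown): void;
  finish(status: string): void;
}

// Runs a recorder write and swallows any failure after logging it.
function recordFlight(
  label: string,
  write: () => void,
  log: (message: string, error: unknown) => void = console.warn,
): void {
  try {
    write();
  } catch (error) {
    log(`[flight-recorder] ${label} failed`, error);
  }
}

// Simulated runs are capped at 5 loops; live runs keep the configured value.
function effectiveMaxLoopCount(mode: RunMode, configured: number): number {
  return mode === "simulation" ? Math.min(configured, 5) : configured;
}

// try/finally guarantees a finish attempt even when the loop throws.
function runWithRecorder(
  recorder: Recorder,
  loop: () => void,
  log?: (message: string, error: unknown) => void,
): string {
  let status = "end";
  recordFlight("start", () => recorder.start(), log);
  try {
    loop();
  } catch (error) {
    status = "error";
    recordFlight("error", () => recorder.error(error), log);
  } finally {
    recordFlight("finish", () => recorder.finish(status), log);
  }
  return status;
}
```

The key property is that every recorder call goes through `recordFlight`, so a failing disk write shows up in the logs but never changes the outcome of the agent loop.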
…policies, and audit report export

- TeamWorkspaceStore persists workspace.json under userData/proofpilot/team
  with workspaceName, plan (solo|team|business), seatLimit, members and
  policies (defaultRunMode, approvalRequiredFor, retentionDays,
  proofPackRequired, auditExportsEnabled, allowLiveRuns).
- Member roles: owner, admin, operator, viewer. The owner cannot be
  removed. upsertMember preserves joinedAt on edit and validates required
  fields.
- exportAuditReport pulls run summaries from the Flight Recorder and the
  ApprovalQueue snapshot, computes commercial-readiness metrics and a
  readiness checklist (simulation default, approval policy, proof pack
  policy, audit export, team roster, run evidence), and writes
  team-audit.json + team-audit.html under audit-reports/<reportId>.
- Expose getTeamWorkspace, updateTeamWorkspace, upsertTeamMember,
  removeTeamMember, and exportTeamAuditReport via a new team IPC route.
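
The member rules described above (owner cannot be removed, joinedAt is preserved on edit, required fields are validated) could look roughly like this. The `Member` shape, the choice of required fields, and the pure-function style are assumptions; the real TeamWorkspaceStore also persists to workspace.json.

```typescript
type Role = "owner" | "admin" | "operator" | "viewer";

// Hypothetical member shape for illustration.
interface Member {
  id: string;
  name: string;
  email: string;
  role: Role;
  joinedAt: number;
}

type MemberInput = Omit<Member, "joinedAt">;

// Inserts a new member or updates an existing one, keeping the original joinedAt.
function upsertMember(members: Member[], input: MemberInput, now: number = Date.now()): Member[] {
  if (!input.id.trim() || !input.name.trim() || !input.email.trim()) {
    throw new Error("id, name, and email are required");
  }
  const existing = members.find((member) => member.id === input.id);
  const next: Member = { ...input, joinedAt: existing ? existing.joinedAt : now };
  return existing
    ? members.map((member) => (member.id === input.id ? next : member))
    : [...members, next];
}

// Removes a member by id; the owner is protected.
function removeMember(members: Member[], id: string): Member[] {
  const target = members.find((member) => member.id === id);
  if (target && target.role === "owner") throw new Error("The owner cannot be removed");
  return members.filter((member) => member.id !== id);
}
```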
… simulation toggle, and session-aware run dispatch

- App.tsx: register lazy /proofpilot, /approvals, /team routes inside the
  existing MainLayout.
- SideBar/app-sidebar.tsx: add ProofPilot, Approvals, and Team nav items
  with isActive styling and reuse the existing run-in-progress confirmation
  dialog via pendingAction.
- ChatInput/index.tsx: add a Simulate switch (FlaskConical icon) that
  forwards a 'simulation' run mode to useRunAgent.
- useRunAgent.ts: extend run() with sessionId and mode parameters that flow
  through to api.runAgent({ sessionId, mode }); also fix a pre-existing
  truthiness bug in the LocalBrowser || LocalComputer permission check.
- pages/proofpilot/index.tsx: timeline + replay UI listing recorded runs,
  rendering events with integrity badge and SoM screenshots, and exposing
  Proof Pack, Workflow Capsule, and Compiled Workflow exports.
- pages/approvals/index.tsx: review UI to submit, approve, deny, annotate,
  and clear resolved approval requests.
- pages/team/index.tsx: Team Console for workspace name, plan, seat limit,
  governance policies, member roster, and audit report export.
@netlify

netlify Bot commented May 30, 2026


Deploy Preview for agent-tars-docs canceled.

🔨 Latest commit: f22bff6
🔍 Latest deploy log: https://app.netlify.com/projects/agent-tars-docs/deploys/6a1a3bca9733600008876e03

@netlify

netlify Bot commented May 30, 2026


Deploy Preview for tarko canceled.

🔨 Latest commit: f22bff6
🔍 Latest deploy log: https://app.netlify.com/projects/tarko/deploys/6a1a3bca22e2db000884b914
