# migi

migi is a task-oriented desktop GUI vision automation CLI focused on skill-style integration for LLM agents.
- What It Does
- Current Model Support
- Install
- Quick Start
- CLI Usage
- Configuration
- Advanced: Custom Action Parser
- JSON Output Contract
- Platform and Dependencies
- Troubleshooting
- FAQ
- Roadmap
## What It Does

- Uses screenshots + multimodal model inference to understand the current desktop UI
- Supports instruction-driven automation via `see` (analyze only) and `act` (analyze + execute)
- Supports local image understanding via `image`/`vision`
- Returns machine-readable JSON for every command
- Includes a skill installer for agent platforms
## Current Model Support

migi currently ships with a Doubao-oriented action parser by default (`doubao`), and at this stage only `doubao-seed` is officially supported.

Notes:
- You can still pass custom model/base URL values, but the built-in action parsing logic is currently tuned for Doubao-style action outputs.
- If you need a different model, use `--action-parser custom` with your own parser callable.
## Install

```bash
pip install migi-cli
```

or:

```bash
uv pip install migi-cli
```

## Quick Start

- Configure credentials and model:

  ```bash
  migi setup --api-key "YOUR_API_KEY" --model "doubao-seed" --base-url "https://ark.cn-beijing.volces.com/api/v3"
  ```

- Analyze current screen (no execution):

  ```bash
  migi see "What apps are visible on the screen?"
  ```

- Analyze and execute:

  ```bash
  migi act "Click the search box and type Li Bai"
  ```

  If you prefer lower latency, use the faster runtime profile:

  ```bash
  migi act --performance fast "Click the search box and type Li Bai"
  ```

- Install skill package:

  ```bash
  migi install --target cursor
  ```

- Understand a local image file:

  ```bash
  migi image ./example.png "Describe the key objects and visible text."
  ```

## CLI Usage

```
migi <command> [options]
```

Core commands:
- `setup`/`init`: initialize or update model config
- `status`: show effective runtime config and dependency status
- `config show`: alias of `status`
- `see <instruction>`: analyze screen only
- `act <instruction>`: analyze and execute actions
- `image <image_path> [instruction]`/`vision`: analyze a local image file
- `install`: install skill package(s)
Performance profile:
- `--performance balanced` (default): a balanced speed/accuracy default for most GUI tasks
- `--performance fast`: smaller screenshots, tighter limits, lowest latency
- `--performance accurate`: larger screenshots and more generous outputs for tiny text / dense UIs
Multi-step execution:
- `migi act` now supports `--max-steps N` and defaults to `3`
- Use higher values for cross-screen tasks such as opening an app, searching, then sending a message
- App-targeted tasks such as "send a WeChat message" now try to bring the target app to the foreground before visual steps begin
- Recipient-targeted messaging instructions now carry the recipient hint forward so the model is less likely to send into the currently open chat by mistake
- Non-essential close/quit shortcuts such as `Cmd+W` are now blocked unless the instruction explicitly asks to close or quit something
- After the target app is brought to the foreground, `auto` capture can narrow back down to the front window for the remaining steps
- WeChat text-message instructions in the form `给 <recipient> 发送微信消息,说 <content>` now use a specialized flow: foreground WeChat, search the recipient, confirm the chat, then send
- That specialized flow now tries `Enter` on the first search result before falling back to visual contact clicking
Capture mode:
- `--capture-mode auto` (default): prefer the front window for in-app tasks, but keep full-screen capture for app-launch flows
- `--capture-mode window`: focus perception on the current front window
- `--capture-mode screen`: keep full-screen capture when you need Dock / desktop / cross-app context
## Configuration

For runtime values, priority is:
1. CLI flags (`--api-key`, `--model`, `--base-url`, etc.)
2. Config file (`~/.migi/config.json`)

Default path: `~/.migi/config.json`
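For reference, here is a minimal config sketch. The README does not document the file's schema, so the field names below are assumptions inferred from the `migi setup` flags, not a guaranteed format:

```json
{
  "api_key": "YOUR_API_KEY",
  "model": "doubao-seed",
  "base_url": "https://ark.cn-beijing.volces.com/api/v3"
}
```

If the default location is not writable, `--config-path` can point commands at an alternative file (see Troubleshooting).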
Run `migi setup` to write the config interactively, or set fields via CLI flags:

```bash
migi setup --api-key "YOUR_API_KEY" --model "doubao-seed" --base-url "https://ark.cn-beijing.volces.com/api/v3"
```

## Advanced: Custom Action Parser

When using non-Doubao model outputs, provide your own parser:
```bash
migi act "..." \
  --action-parser custom \
  --action-parser-callable "your_module:your_parser"
```

Your parser callable should accept:

```python
def your_parser(response: str, img_width: int, img_height: int, scale_factor: int):
    ...
```

Coordinate behavior in the executor:
- Recommended: normalized `0..1000` coordinate space (independent of screen resolution)
- Also accepted: `0..1` ratio coordinates
- Also accepted: absolute screenshot pixel coordinates (`migi` remaps screenshot coordinates to the actual pyautogui control space for DPI/scaling differences)
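The exact action structure your callable should return is not pinned down in this README, so treat the following as a minimal sketch only: it assumes a hypothetical `click(point='<point>x y</point>')` response format and a return value of plain action dicts in the recommended `0..1000` normalized space. Adapt both to your model's actual output and to what your `migi` version expects:

```python
import re

def your_parser(response: str, img_width: int, img_height: int, scale_factor: int):
    """Illustrative custom parser (assumed formats, not migi's official API).

    Maps `click(point='<point>x y</point>')` style model output to click
    actions in the normalized 0..1000 coordinate space the executor
    recommends. img_width / img_height / scale_factor are available if your
    model emits screenshot pixels instead and you want to normalize them
    yourself; per the coordinate rules above, raw pixels are also accepted
    and remapped by migi.
    """
    actions = []
    for x, y in re.findall(r"click\(point='<point>(\d+)\s+(\d+)</point>'\)", response):
        actions.append({
            "type": "click",  # hypothetical action shape for this sketch
            "x": int(x),      # already 0..1000-normalized in this assumed format
            "y": int(y),
        })
    return actions
```

Expose it as `your_module:your_parser` and pass that path via `--action-parser-callable`, as in the command above.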
## JSON Output Contract

All commands print exactly one JSON object to stdout.

- `compact` (default, token-efficient):
  - success: `ok`, `cmd`, `code`, `data`
  - failure: `ok`, `cmd`, `code`, `error` (and `data` when needed)
- `full` (debug-friendly): `ok`, `command`, `code`, `message`, `data`, `error`, `meta`

Switch mode:

```bash
migi status --json full
```
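Because each command prints exactly one JSON object, agent-side integration can be a plain subprocess call. A minimal sketch, assuming only the documented compact fields (`ok`, `cmd`, `code`, `data`, `error`); the concrete payload inside `data` is command-specific and not specified here:

```python
import json
import subprocess

def run_migi(*args: str) -> dict:
    """Run a migi command and parse its single JSON object from stdout.

    Assumes failures also emit the documented compact JSON error object
    rather than plain text.
    """
    proc = subprocess.run(["migi", *args], capture_output=True, text=True)
    return json.loads(proc.stdout)

result = run_migi("see", "What apps are visible on the screen?")
if result.get("ok"):
    print(result.get("data"))                        # command-specific payload
else:
    print(result.get("code"), result.get("error"))   # machine-readable failure
```

This is the same compact-mode consumption pattern recommended in the FAQ for agent integration.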
## Platform and Dependencies

Target runtime:
- Python: `>=3.11`
- OS: macOS / Linux / Windows (desktop environment required)
Runtime dependencies:
- Required package dependency: `httpx`
- Local image understanding (`image`/`vision`) requires: `pillow`
- Optional but practically required for GUI automation: `mss`, `pyautogui`, `pyperclip`, `pillow`

Install optional GUI dependencies:

```bash
pip install mss pyautogui pyperclip pillow
```

## Troubleshooting

- `CONFIG_MISSING` for API key/model/base URL
  - Run `migi setup` again, or set env vars directly.
- No action executed after `act`
  - Start with `migi see "..."` to inspect the model response first.
  - Ensure the model is `doubao-seed` and the parser is `doubao`.
- `act`/`image` feels slow
  - Run with `--performance fast` first. `migi` now downsizes screenshots and local images before upload; `accurate` keeps larger inputs when you need more detail.
  - Use `--json full` and inspect `timing.inference_ms` vs `timing.screenshot_ms` to see whether the slowdown is model-side or local.
- Complex tasks stop after only one visible step
  - Increase `--max-steps`, for example: `migi act --max-steps 3 "..."`. `migi` now carries forward action history between steps, but cross-screen flows still depend heavily on model quality and visible UI confirmation.
- The model keeps clicking the wrong small control in the active app
  - Prefer `--capture-mode window` so the model sees only the front window instead of the whole desktop.
  - Use `--capture-mode screen` only when you explicitly need desktop-wide context.
- Dependency error for GUI modules
  - Install the missing packages: `mss pyautogui pyperclip pillow`.
- `which <app>`/`where <app>` returns not found (exit code 1)
  - This is expected for many GUI apps (they are not in PATH).
  - For app launch tasks, `migi` now uses a 3-stage fallback chain:
    1. Direct command launch first (macOS `open`, Windows `Start-Process`)
    2. Then shortcut search (macOS `Command+Space`, Windows `Win+S`)
    3. Then a GUI-visible search fallback if the shortcut action fails
       - macOS: `Command+Space` -> type app name -> select the app entry under Applications -> Enter
       - Windows: `Win+S` -> type app name -> Enter
- Config path permission issue
  - Use `--config-path` to specify a writable location.
- Need to use another model
  - Switch to `--action-parser custom` and implement `module:function`.

## FAQ

- Is `migi` production-ready?
  - The current release is alpha and focuses on a stable CLI/JSON contract.
- Can I use OpenAI-compatible providers directly?
  - Yes, the request transport is OpenAI-compatible, but built-in parsing is currently optimized for Doubao-style outputs.
- Why is only `doubao-seed` officially supported right now?
  - The default parser backend is Doubao-oriented; parser behavior for other models is not officially guaranteed yet.
- How do I integrate with agents?
  - Use the stable compact JSON mode and install skills via `migi install`.

## Roadmap

- Multi-model official parser support
- Safer and richer action execution controls
- More robust cross-platform test coverage
- Better parser debug tooling and evaluation suites