migi

migi is a task-oriented desktop GUI vision automation CLI focused on skill-style integration for LLM agents.


What It Does

  • Uses screenshots + multimodal model inference to understand current desktop UI
  • Supports instruction-driven automation via see (analyze only) and act (analyze + execute)
  • Supports local image understanding via image / vision
  • Returns machine-readable JSON for every command
  • Includes a skill installer for agent platforms

Current Model Support

migi currently ships with a Doubao-oriented action parser by default (doubao), and at this stage only doubao-seed is officially supported.

Notes:

  • You can still pass custom model/base URL values, but the built-in action parsing logic is currently tuned for Doubao-style action outputs.
  • If you need a different model, use --action-parser custom with your own parser callable.

Install

pip install migi-cli

or:

uv pip install migi-cli

Quick Start

  1. Configure credentials and model:
migi setup --api-key "YOUR_API_KEY" --model "doubao-seed" --base-url "https://ark.cn-beijing.volces.com/api/v3"
  2. Analyze current screen (no execution):
migi see "What apps are visible on the screen?"
  3. Analyze and execute:
migi act "Click the search box and type Li Bai"

If you prefer lower latency, use the faster runtime profile:

migi act --performance fast "Click the search box and type Li Bai"
  4. Install the skill package:
migi install --target cursor
  5. Understand a local image file:
migi image ./example.png "Describe the key objects and visible text."

CLI Usage

migi <command> [options]

Core commands:

  • setup / init: initialize or update model config
  • status: show effective runtime config and dependency status
  • config show: alias of status
  • see <instruction>: analyze screen only
  • act <instruction>: analyze and execute actions
  • image <image_path> [instruction] / vision: analyze a local image file
  • install: install skill package(s)

Performance profile:

  • --performance balanced (default): balances speed and recognition stability for most GUI tasks
  • --performance fast: smaller screenshots, tighter limits, lowest latency
  • --performance accurate: larger screenshots and more generous outputs for tiny text / dense UIs
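
For example, to favor legibility over latency on a dense UI (the instruction text is illustrative):

migi act --performance accurate "Click the smallest icon in the toolbar"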

Multi-step execution:

  • migi act now supports --max-steps N and defaults to 3
  • Use higher values for cross-screen tasks such as opening an app, searching, then sending a message
  • App-targeted tasks such as "send a WeChat message" now try to bring the target app to the foreground before visual steps begin
  • Recipient-targeted messaging instructions now carry the recipient hint forward so the model is less likely to send into the currently open chat by mistake
  • Non-essential close/quit shortcuts such as Cmd+W are now blocked unless the instruction explicitly asks to close or quit something
  • After the target app is brought to the foreground, auto capture can narrow back down to the front window for the remaining steps
  • WeChat text-message instructions in the form 给 <recipient> 发送微信消息,说 <content> now use a specialized flow: foreground WeChat, search recipient, confirm the chat, then send
  • That specialized flow now tries Enter on the first search result before falling back to visual contact clicking
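
For example, a cross-screen task with a larger step budget (the instruction text is illustrative):

migi act --max-steps 5 "Open WeChat, search for Li Bai, and send hello"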

Capture mode:

  • --capture-mode auto (default): prefer the front window for in-app tasks, but keep full-screen capture for app-launch flows
  • --capture-mode window: focus perception on the current front window
  • --capture-mode screen: keep full-screen capture when you need Dock / desktop / cross-app context
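
For example, to keep perception on the active window when clicking small controls:

migi act --capture-mode window "Click the search box and type Li Bai"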

Configuration

Config Sources and Priority

For runtime values, priority is:

  1. CLI flags (--api-key, --model, --base-url, etc.)
  2. Config file (~/.migi/config.json)

Config File Location

Default path:

  • ~/.migi/config.json

Run migi setup to write the config interactively, or set fields via CLI flags:

migi setup --api-key "YOUR_API_KEY" --model "doubao-seed" --base-url "https://ark.cn-beijing.volces.com/api/v3"

Advanced: Custom Action Parser

When using non-Doubao model outputs, provide your own parser:

migi act "..." \
  --action-parser custom \
  --action-parser-callable "your_module:your_parser"

Your parser callable should accept:

def your_parser(response: str, img_width: int, img_height: int, scale_factor: int):
    ...

Coordinate behavior in executor:

  • Recommended: normalized 0..1000 coordinate space (independent of screen resolution)
  • Also accepted: 0..1 ratio coordinates
  • Also accepted: absolute screenshot pixel coordinates
    (migi remaps screenshot coordinates to the actual pyautogui control space for DPI/scaling differences)
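
For reference, a minimal parser sketch that maps model output to the recommended 0..1000 space. The click(...) response format and the returned list-of-dicts shape are assumptions for illustration only; align the return value with what migi's executor actually expects:

import re

# Sketch only: the regex and the action dict shape are illustrative
# assumptions, not migi's documented contract.
def your_parser(response: str, img_width: int, img_height: int, scale_factor: int):
    actions = []
    # Suppose the model emits lines like: click(512, 430) in screenshot pixels.
    for match in re.finditer(r"click\((\d+),\s*(\d+)\)", response):
        x_px, y_px = int(match.group(1)), int(match.group(2))
        # Remap absolute screenshot pixels to the resolution-independent
        # 0..1000 space recommended above.
        actions.append({
            "type": "click",
            "x": round(x_px / img_width * 1000),
            "y": round(y_px / img_height * 1000),
        })
    return actions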

JSON Output Contract

All commands print exactly one JSON object to stdout.

  • compact (default, token-efficient):
    • success: ok, cmd, code, data
    • failure: ok, cmd, code, error (and data when needed)
  • full (debug-friendly):
    • ok, command, code, message, data, error, meta

Switch mode:

migi status --json full
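
Since every command prints exactly one JSON object to stdout, an agent wrapper can shell out and parse the result directly. A minimal consumer sketch in Python (the instruction string is illustrative):

import json
import subprocess

# Run a migi command and parse its single JSON object from stdout.
result = subprocess.run(
    ["migi", "see", "What apps are visible on the screen?"],
    capture_output=True,
    text=True,
)
payload = json.loads(result.stdout)
if payload["ok"]:
    print(payload["data"])  # compact mode fields: ok, cmd, code, data
else:
    print(payload["code"], payload.get("error"))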

Platform and Dependencies

Target runtime:

  • Python: >=3.11
  • OS: macOS / Linux / Windows (desktop environment required)

Runtime dependencies:

  • Required package dependency: httpx
  • Local image understanding (image / vision) requires: pillow
  • Optional but practically required for GUI automation: mss, pyautogui, pyperclip, pillow

Install optional GUI dependencies:

pip install mss pyautogui pyperclip pillow

Troubleshooting

  • CONFIG_MISSING for API key/model/base URL
    • Run migi setup again, or set env vars directly.
  • No action executed after act
    • Start with migi see "..." to inspect the model response first.
    • Ensure model is doubao-seed and parser is doubao.
  • act / image feels slow
    • Run with --performance fast first.
    • migi now downsizes screenshots and local images before upload; accurate keeps larger inputs when you need more detail.
    • Use --json full and inspect timing.inference_ms vs timing.screenshot_ms to see whether the slowdown is model-side or local.
  • Complex tasks stop after only one visible step
    • Increase --max-steps above the default of 3, for example: migi act --max-steps 5 "..."
    • migi now carries forward action history between steps, but cross-screen flows still depend heavily on model quality and visible UI confirmation.
  • The model keeps clicking the wrong small control in the active app
    • Prefer --capture-mode window so the model sees only the front window instead of the whole desktop.
    • Use --capture-mode screen only when you explicitly need desktop-wide context.
  • Dependency error for GUI modules
    • Install missing packages: mss pyautogui pyperclip pillow.
  • which <app> / where <app> returns not found (exit code 1)
    • This is expected for many GUI apps (they are not in PATH).
    • For app launch tasks, migi now uses a 3-stage fallback chain (see the illustrative sketch after this list):
      • Direct command launch first (macOS open, Windows Start-Process)
      • Then shortcut search (macOS Command+Space, Windows Win+S):
        • macOS: Command+Space -> type app name -> select the app entry under Applications -> Enter
        • Windows: Win+S -> type app name -> Enter
      • Then a GUI-visible search fallback if the shortcut action fails
  • Config path permission issue
    • Use --config-path to specify a writable location.
  • Need to use another model
    • Switch to --action-parser custom and implement module:function.
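
The sketch below illustrates the idea behind the 3-stage launch fallback mentioned above. It is not migi's actual implementation, and the helper name launch_app is hypothetical:

import platform
import subprocess
import time

# Illustrative only: the 3-stage chain as described above, not migi's code.
def launch_app(name: str) -> bool:
    system = platform.system()
    # Stage 1: direct command launch.
    try:
        if system == "Darwin":
            subprocess.run(["open", "-a", name], check=True)
        else:
            subprocess.run(["powershell", "-Command", f"Start-Process '{name}'"], check=True)
        return True
    except (OSError, subprocess.CalledProcessError):
        pass
    # Stage 2: shortcut search (requires pyautogui).
    try:
        import pyautogui
        if system == "Darwin":
            pyautogui.hotkey("command", "space")  # Spotlight
        else:
            pyautogui.hotkey("win", "s")          # Windows search
        time.sleep(0.5)
        pyautogui.write(name, interval=0.05)
        pyautogui.press("enter")
        return True
    except Exception:
        # Stage 3 (the GUI-visible search fallback) would go through the
        # vision loop itself and is omitted here.
        return False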

FAQ

  • Is migi production-ready?
    • The current release is alpha and focuses on a stable CLI/JSON contract.
  • Can I use OpenAI-compatible providers directly?
    • Yes; the request transport is OpenAI-compatible, but built-in parsing is currently optimized for Doubao-style outputs.
  • Why is only doubao-seed officially supported right now?
    • The default parser backend is Doubao-oriented; parser behavior for other models is not officially guaranteed yet.
  • How do I integrate with agents?
    • Use the stable compact JSON mode and install skills via migi install.

Roadmap

  • Multi-model official parser support
  • Safer and richer action execution controls
  • More robust cross-platform test coverage
  • Better parser debug tooling and evaluation suites

