docs: add comparison for different model (#329)
---------

Co-authored-by: zhouxiao.shaw <[email protected]>
yuyutaotao and zhoushaw authored Jan 26, 2025
1 parent 839dc6c commit 57e8b48
Showing 12 changed files with 114 additions and 62 deletions.
@@ -1,4 +1,4 @@
# Cache
# Caching

Midscene.js provides AI caching features to improve the stability and speed of the entire AI execution process. The cache mainly refers to caching how AI recognizes page elements. Cached AI query results are used if page elements haven't changed.

@@ -10,10 +10,27 @@ Midscene.js uses general-purpose large language models (LLMs, like `gpt-4o`) as
You can also use open-source models like `UI-TARS` to improve the performance and data privacy.
:::

## Comparison between general-purpose LLMs and dedicated models

The table below compares general-purpose LLMs with dedicated models (like `UI-TARS`). We will discuss both in more detail later.

| | General-purpose LLMs (default) | Dedicated model like `UI-TARS` |
| --- | --- | --- |
| **What it is** | a model for general-purpose tasks | a model dedicated to UI automation |
| **How to get started** | easy, just get an API key | more complex, you need to deploy it on your own server |
| **Performance** | 3-10x slower compared to pure JavaScript automation | can be acceptable with proper deployment |
| **Who gets the page data** | the model provider | your own server |
| **Cost** | more expensive, you usually pay per token | less expensive, you pay for the server |
| **Prompting** | prefers step-by-step instructions | still prefers step-by-step instructions, but performs better in uncertain situations |

## Choose a general-purpose LLM

Midscene uses OpenAI `gpt-4o` as the default model, since it currently performs best among general-purpose LLMs.

To use the official `gpt-4o` from OpenAI, you can simply set the `OPENAI_API_KEY` in the environment variables. Refer to [Config Model and Provider](./model-provider) for more details.
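
For example, a minimal setup looks like this (the key value below is a placeholder):

```bash
# set the OpenAI API key before running Midscene scripts
export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"
```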

### Choose a model other than `gpt-4o`

If you want to use other models, please follow these steps:

1. A multimodal model is required, which means it must support image input.
@@ -22,7 +39,7 @@ If you want to use other models, please follow these steps:
1. If it does not work well after changing the model, try using short and clear prompts, or roll back to the previous model. See more details in [Prompting Tips](./prompting-tips).
1. Remember to follow the terms of use of each model and provider.
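
As a sketch, switching to an OpenAI-compatible provider usually comes down to a few environment variables. The variable names (`OPENAI_BASE_URL`, `OPENAI_API_KEY`, `MIDSCENE_MODEL_NAME`) come from [Config Model and Provider](./model-provider); the values below are placeholders, using one of the known supported models as an example:

```bash
# point Midscene at an OpenAI-compatible endpoint (placeholder values)
export OPENAI_BASE_URL="https://your-provider.example.com/v1"
export OPENAI_API_KEY="your-api-key"
export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
```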

### Known Supported General-Purpose Models
### Known supported general-purpose models

Besides `gpt-4o`, the known supported models are:

@@ -31,17 +48,33 @@ Besides `gpt-4o`, the known supported models are:
- `qwen-vl-max-latest`
- `doubao-vision-pro-32k`

### About the token cost

Image resolution and the number of elements (i.e., the size of the UI context created by Midscene) affect the token cost.

Here is some typical data with `gpt-4o-0806` without prompt caching.

|Task | Resolution | Prompt Tokens / Price | Completion Tokens / Price | Total Cost |
|-----|------------|--------------|---------------|-----------------|
|Plan and perform a search on eBay homepage| 1280x800 | 6005 / $0.0150125 |146 / $0.00146| $0.0164725 |
|Query the information about the item in the search results| 1280x800 | 9107 / $0.0227675 | 122 / $0.00122 | $0.0239875 |

> The price data was calculated in Nov 2024.
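> Back-of-the-envelope: the table implies roughly $2.5 per million prompt tokens and $10 per million completion tokens (e.g., 6005 × $2.5 / 1,000,000 ≈ $0.0150 and 146 × $10 / 1,000,000 ≈ $0.0015), so prompt tokens dominate the total cost.
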
## Choose `UI-TARS` (an open-source model dedicated to UI automation)

UI-TARS is an end-to-end GUI agent model based on VLM architecture. It solely perceives screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations), achieving state-of-the-art performance on 10+ GUI benchmarks.

UI-TARS is an open-source model and comes in different sizes. You can deploy it on your own server, which dramatically improves performance and data privacy.

For more details about UI-TARS, see [Github - UI-TARS](https://github.com/bytedance/ui-tars), [🤗 HuggingFace - UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT).
For more details about UI-TARS, see
* [Github - UI-TARS](https://github.com/bytedance/ui-tars)
* [🤗 HuggingFace - UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)
* [UI-TARS - Model Deployment Guide](https://juniper-switch-f10.notion.site/UI-TARS-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)

### What you will have after using UI-TARS

- **Speed**: a private-deployed UI-TARS model can be 5x faster than a general-purpose LLM. Each step of `.ai` call can be processed in 1-2 seconds.
- **Speed**: a privately deployed UI-TARS model can be 5x faster than a general-purpose LLM. Each step of an `.ai` call can be processed in 1-2 seconds on a high-performance GPU server.
- **Data privacy**: you can deploy it on your own server and your data will no longer be sent to the cloud.
- **More stable with short prompts**: UI-TARS is optimized for UI automation and can handle more complex tasks with target-driven prompts. You can use it with a shorter prompt (although this is not recommended), and it performs even better than a general-purpose LLM.
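
If you want a feel for what enabling it involves, here is a minimal configuration sketch. Only `MIDSCENE_USE_VLM_UI_TARS=1` appears verbatim in these docs; the remaining variables follow the general model-provider configuration, and all values are placeholders for a self-hosted deployment:

```bash
# tell Midscene.js to parse model output in UI-TARS format
export MIDSCENE_USE_VLM_UI_TARS=1
# point Midscene at your own OpenAI-compatible UI-TARS endpoint (placeholder values)
export OPENAI_BASE_URL="http://your-ui-tars-server:8000/v1"
export OPENAI_API_KEY="your-token"
export MIDSCENE_MODEL_NAME="ui-tars-7b-sft"
```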

@@ -76,4 +109,4 @@ Once you feel uncomfortable with the speed, the cost, the accuracy, or the data
## More

* [Config Model and Provider](./model-provider)
* [UI-TARS on Github](https://github.com/bytedance/ui-tars)
* [UI-TARS on Github](https://github.com/bytedance/ui-tars)
29 changes: 10 additions & 19 deletions apps/site/docs/en/faq.md
@@ -10,42 +10,33 @@ Related Docs: [Prompting Tips](./prompting-tips)

There are some limitations with Midscene. We are still working on them.

1. The interaction types are limited to only tap, type, keyboard press, and scroll.
1. The interaction types are limited to only tap, drag, type, keyboard press, and scroll.
2. LLM is not 100% stable. Even GPT-4o can't return the right answer all the time. Following the [Prompting Tips](./prompting-tips) will help improve stability.
3. Since we use JavaScript to retrieve elements from the page, the elements inside the cross-origin iframe cannot be accessed.
4. We cannot access the native elements of Chrome, like the right-click context menu or file upload dialog.
5. Do not use Midscene to bypass CAPTCHA. Some LLM services are set to decline requests that involve CAPTCHA-solving (e.g., OpenAI), while the DOM of some CAPTCHA pages is not accessible by regular web scraping methods. Therefore, using Midscene to bypass CAPTCHA is not a reliable method.

## Can I use a model other than `gpt-4o`?

Yes. You can [customize model and provider](./model-provider) if needed.
Of course. You can [choose a model](./choose-a-model) according to your needs.

## About the token cost

Image resolution and element numbers (i.e., a UI context size created by Midscene) will affect the token bill.

Here are some typical data with gpt-4o-0806 without prompt caching.

|Task | Resolution | Prompt Tokens / Price | Completion Tokens / Price | Total Cost |
|-----|------------|--------------|---------------|-----------------|
|Plan and perform a search on eBay homepage| 1280x800 | 6005 / $0.0150125 |146 / $0.00146| $0.0164725 |
|Query the information about the item in the search results| 1280x800 | 9107 / $0.0227675 | 122 / $0.00122 | $0.0239875 |

> The price data was calculated in Nov 2024.
## What data is sent to LLM ?
## What data is sent to the AI model?

Currently, the contents are:
1. the key information extracted from the DOM, such as text content, class name, tag name, coordinates;
2. a screenshot of the page.

If you are concerned about the data privacy, please refer to [Data Privacy](./data-privacy).

## The automation process is running more slowly than the traditional one

Since Midscene.js invokes AI for each planning and querying operation, the running time may increase by a factor of 3 to 10 compared to traditional Playwright scripts, for instance from 5 seconds to 20 seconds. This is currently inevitable but may improve with advancements in LLMs.
When using a general-purpose LLM in Midscene.js, the running time may increase by a factor of 3 to 10 compared to traditional Playwright scripts, for instance from 5 seconds to 20 seconds. To make the results more stable, this token and time cost is inevitable.


Despite the increased time and cost, Midscene stands out in practical applications due to its unique development experience and easy-to-maintain codebase. We are confident that incorporating automation scripts powered by Midscene will significantly enhance your project’s efficiency, cover many more situations, and boost overall productivity.

In short, it is worth the time and cost.
There are two ways to improve the running time:
1. Use a dedicated model, like UI-TARS. This is the recommended way. Read more about it in [Choose a model](./choose-a-model).
2. Use caching to reduce the token cost. Read more about it in [Caching](./caching).

## The webpage continues to flash when running in headed mode

6 changes: 3 additions & 3 deletions apps/site/docs/en/integrate-with-puppeteer.mdx
@@ -22,7 +22,7 @@ export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"

## Step 1. install dependencies

<PackageManagerTabs command="install @midscene/web puppeteer ts-node --save-dev" />
<PackageManagerTabs command="install @midscene/web puppeteer tsx --save-dev" />

## Step 2. write scripts

@@ -73,11 +73,11 @@ Promise.resolve(

## Step 3. run

Using ts-node to run, you will get the data of Headphones on eBay:
Using `tsx` to run, you will get the data of Headphones on eBay:

```bash
# run
npx ts-node demo.ts
npx tsx demo.ts

# it should print
# [
4 changes: 3 additions & 1 deletion apps/site/docs/en/prompting-tips.md
@@ -1,6 +1,6 @@
# Prompting Tips

The natural language parameter passed to Midscene will be part of the prompt sent to the LLM. There are certain techniques in prompt engineering that can help improve the understanding of user interfaces.
The natural language parameter passed to Midscene will be part of the prompt sent to the AI model. There are certain techniques in prompt engineering that can help improve the understanding of user interfaces.

## The purpose of optimization is to get a stable response from AI

@@ -51,6 +51,8 @@ To launch the local Playground server:
npx --yes @midscene/web
```

![Playground](/midescene-playground-entry.jpg)

## Infer or assert from the interface, not the DOM properties or browser status

All the data sent to the LLM is in the form of screenshots and element coordinates. The DOM and the browser instance are almost invisible to the LLM. Therefore, ensure everything you expect is visible on the screen.
File renamed without changes.
@@ -5,16 +5,33 @@
If you want to learn more about how to configure models and providers, see [Config Model and Provider](./model-provider).

:::info TL;DR
Midscene.js supports using the general-purpose LLM model `gpt-4o` as the default. This is the easiest way to get started.

You can also use open-source models like `UI-TARS` to improve performance and data privacy.
:::

## Choose a general-purpose LLM model
## Comparison between general-purpose LLM models and dedicated models (like `UI-TARS`)

Midscene.js uses OpenAI `gpt-4o` as the default model, since it is currently the best general-purpose LLM model.
Below is a comparison of general-purpose LLM models and dedicated models (like `UI-TARS`). We will discuss them in detail later.

If you want to use other models, follow these steps:
| | General-purpose LLM models (default) | Dedicated models (like `UI-TARS`) |
| --- | --- | --- |
| **What it is for** | general-purpose tasks | designed for UI automation |
| **How to get started** | easy, just get an API key | more complex, needs to be deployed on your own server |
| **Performance** | 3-10x slower than pure JavaScript automation | acceptable with proper deployment |
| **Who gets the page data** | the model provider | your own server |
| **Cost** | more expensive, usually billed per token | cheaper, you pay for the server |
| **Prompting** | prefers step-by-step instructions | still prefers step-by-step instructions, but performs better in uncertain situations |

## Choose a general-purpose LLM model

Midscene.js uses OpenAI `gpt-4o` as the default model, since it is currently the best among general-purpose LLM models.

To use the official `gpt-4o` from OpenAI, you only need to set `OPENAI_API_KEY` in the environment variables. See [Config Model and Provider](./model-provider) for more details.

### Choose a model other than `gpt-4o`

If you want to use another model, follow these steps:
1. A multimodal model is required, i.e., a model that supports image input.
1. The larger the model, the better it performs. However, it is also more expensive.
1. Find out how to call it in an OpenAI-SDK-compatible way. Providers usually offer such an endpoint; the variables you need to configure are `OPENAI_BASE_URL`, `OPENAI_API_KEY` and `MIDSCENE_MODEL_NAME`.
@@ -30,23 +47,40 @@ Midscene.js uses OpenAI `gpt-4o` as the default model, since it is currently the best
- `qwen-vl-max-latest` (Qwen)
- `doubao-vision-pro-32k` (Doubao)

### About token consumption

Image resolution and the number of elements (i.e., the size of the UI context created by Midscene) significantly affect token consumption.

Here is some typical data measured with the gpt-4o-08-06 model without prompt caching:

| Task | Resolution | Prompt Tokens / Price | Completion Tokens / Price | Total Cost |
|-----|------------|--------------|---------------|--------------|
| Plan and perform a search on the eBay homepage | 1280x800 | 6005 / $0.0150125 | 146 / $0.00146 | $0.0164725 |
| Query the item information from the eBay search results | 1280x800 | 9107 / $0.0227675 | 122 / $0.00122 | $0.0239875 |

> The price data was calculated in Nov 2024.

## Choose `UI-TARS` (an open-source model dedicated to UI automation)

UI-TARS is a GUI agent model based on a VLM architecture. It takes only screenshots as input and performs human-like interactions (such as keyboard and mouse operations), achieving state-of-the-art performance on 10+ GUI benchmarks.

UI-TARS is an open-source model and comes in different sizes. You can deploy it on your own server, and it can also be used in the browser extension.

For more details, see [Github - UI-TARS](https://github.com/bytedance/ui-tars), [🤗 HuggingFace - UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT).
For more details about UI-TARS, see:
* [Github - UI-TARS](https://github.com/bytedance/ui-tars)
* [🤗 HuggingFace - UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)
* [Chinese version: UI-TARS Model Deployment Guide](https://bytedance.sg.larkoffice.com/docx/TCcudYwyIox5vyxiSDLlgIsTgWf#U94rdCxzBoJMLex38NPlHL21gNb)

### What you get after using UI-TARS

- **Speed**: a privately deployed UI-TARS model can be 5x faster than a general-purpose LLM. Each step of an `.ai` call can be completed within 1-2 seconds.
- **Speed**: a privately deployed UI-TARS model can be 5x faster than a general-purpose LLM. When deployed on a high-performance GPU server, each step of an `.ai` call can be completed within 1-2 seconds.
- **Data privacy**: you can deploy it on your own server instead of sending your data to the cloud every time.
- **More stable with short prompts**: UI-TARS is optimized for UI automation and can handle more complex, target-driven tasks. You can use shorter prompts (although this is not recommended), and it performs better than a general-purpose LLM.

### Steps to configure UI-TARS

The output of UI-TARS differs from that of a general-purpose LLM. You need to add the following configuration to enable this feature:
The output of UI-TARS differs from that of a general-purpose LLM. You need to add the following configuration in Midscene.js to enable this feature:

```bash
MIDSCENE_USE_VLM_UI_TARS=1
29 changes: 9 additions & 20 deletions apps/site/docs/zh/faq.md
@@ -12,42 +12,31 @@ Midscene is an SDK for assisting UI automation, so runtime stability is critical ——

Midscene has some limitations, and we are still working to improve them.

1. Limited interaction types: currently only tap, type, keyboard press, and scroll are supported.
1. Limited interaction types: currently only tap, drag, type, keyboard press, and scroll are supported.
2. Stability risks: even GPT-4o cannot guarantee 100% correct answers. Following the [Prompting Tips](./prompting-tips) helps improve SDK stability.
3. Limited element access: since we use JavaScript to extract elements from the page, elements inside cross-origin iframes cannot be accessed.
4. No access to Chrome native elements: the right-click context menu, file upload dialogs, and similar are not accessible.
5. Cannot bypass CAPTCHA: some LLM services decline requests that involve CAPTCHA solving (e.g., OpenAI), and the DOM of some CAPTCHA pages cannot be accessed by regular web scraping methods. Using Midscene to bypass CAPTCHA is therefore not a reliable approach.

## Can I use a model other than `gpt-4o`?

Yes. You can [customize the model and provider](./model-provider) if needed.
Of course. You can [choose an AI model](./choose-a-model) according to your needs.

## About token cost

Image resolution and the number of elements (i.e., the size of the UI context created by Midscene) significantly affect token consumption.

Here is some typical data measured with the gpt-4o-08-06 model without prompt caching:

| Task | Resolution | Prompt Tokens / Price | Completion Tokens / Price | Total Cost |
|-----|------------|--------------|---------------|--------------|
| Plan and perform a search on the eBay homepage | 1280x800 | 6005 / $0.0150125 | 146 / $0.00146 | $0.0164725 |
| Query the item information from the eBay search results | 1280x800 | 9107 / $0.0227675 | 122 / $0.00122 | $0.0239875 |

> The price data was calculated in Nov 2024.
## What data is sent to the LLM?
## What data is sent to the AI model?

The following data:
1. Key information extracted from the DOM, such as text content, class name, tag name, and coordinates
2. A screenshot of the page

## Is the script running slowly?
If you are concerned about data privacy, see [Data Privacy](./data-privacy).

Since Midscene.js calls AI for every planning and query operation, its running time can be 3 to 10 times longer than traditional Playwright scripts, for instance from 5 seconds to 20 seconds. This is currently unavoidable, but performance may improve as large language models (LLMs) advance.
## Is the script running slowly?

Despite the longer running time, Midscene still performs well in practice. Its unique development experience keeps the codebase easy to maintain. We believe that automation scripts integrated with Midscene will significantly improve project iteration efficiency, cover more scenarios, and boost overall productivity.
When using a general-purpose LLM in Midscene.js, AI is called for every planning and query operation, so the running time can be 3 to 10 times longer than traditional Playwright scripts, for instance from 5 seconds to 20 seconds. To make the results more reliable, this token and time cost is unavoidable.

In short, although it is slower, the investment is well worth it.
There are two ways to improve running efficiency:
1. Use a dedicated model, such as UI-TARS. This is the recommended approach. See [Choose a model](./choose-a-model) for more details.
2. Use caching to reduce token consumption. See [Caching](./caching) for more details.

## The browser UI keeps flashing

6 changes: 3 additions & 3 deletions apps/site/docs/zh/integrate-with-puppeteer.mdx
@@ -22,7 +22,7 @@ export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"

## Step 1: install dependencies

<PackageManagerTabs command="install @midscene/web puppeteer ts-node --save-dev" />
<PackageManagerTabs command="install @midscene/web puppeteer tsx --save-dev" />

## Step 2: write the script

@@ -75,11 +75,11 @@ Promise.resolve(

## Step 3: run

Run it with `ts-node`, and you will see the headphone listings from eBay printed in the command line:
Run it with `tsx`, and you will see the headphone listings from eBay printed in the command line:

```bash
# run
npx ts-node demo.ts
npx tsx demo.ts

# it should print
# [
5 changes: 4 additions & 1 deletion apps/site/docs/zh/prompting-tips.md
@@ -43,13 +43,16 @@

### Debug with the visual report and Playground

The test report contains detailed information about each step. If you want to re-run a prompt against the UI state in the report, you can start the local Playground Server and then click "Send to Playground".

To start the local Playground Server:
```
npx --yes @midscene/web
```

![Playground](/midescene-playground-entry.jpg)


### Infer from the interface, not from DOM attributes or browser status

All the data passed to the LLM consists of screenshots and element coordinates. The DOM and the browser are almost invisible to the LLM. Therefore, make sure everything you want to extract is visible in the screenshot and can be "seen" by the LLM.