chore: merge main
yuyutaotao committed Jan 26, 2025
2 parents 469c7a2 + 57e8b48 commit 73b343e
Showing 31 changed files with 345 additions and 90 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ai.yml
@@ -21,7 +21,7 @@ jobs:
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
OPENAI_BASE_URL: ${{ secrets.OPENAI_BASE_URL }}
MIDSCENE_MODEL_NAME: gpt-4o-2024-11-20
MIDSCENE_MODEL_NAME: gpt-4o-2024-08-06
CI: 1
# MIDSCENE_DEBUG_AI_PROFILE: 1

45 changes: 36 additions & 9 deletions README.md
@@ -10,7 +10,7 @@ English | [简体中文](./README.zh.md)
</div>

<p align="center">
Joyful UI Automation
Let AI be your browser operator.
</p>

<p align="center">
@@ -22,10 +22,13 @@ English | [简体中文](./README.zh.md)
<a href="https://x.com/midscene_ai"><img src="https://img.shields.io/twitter/follow/midscene_ai?style=flat-square" alt="twitter" /></a>
</p>

Midscene.js is an AI-powered automation SDK with the abilities to control the page, perform assertions and extract data in JSON format using natural language.
Midscene.js lets AI be your browser operator 🤖. Just describe what you want to do in natural language, and it will help you operate web pages, validate content, and extract data. Whether you want a quick experience or deep development, you can get started easily.


## Showcases

The following demo video is based on the [UI-TARS 7B SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT) model and has not been sped up.

| Instruction | Video |
| :---: | :---: |
| Post a Tweet | <video src="https://github.com/user-attachments/assets/bb3d695a-fbff-4af1-b6cc-5e967c07ccee" height="300" /> |
@@ -37,13 +40,15 @@ Midscene.js is an AI-powered automation SDK with the abilities to control the pa
From version v0.10.0, we support a new open-source model named [`UI-TARS`](https://github.com/bytedance/ui-tars). Read more about it in [Choose a model](https://midscenejs.com/choose-a-model).

## 💡 Features

- **Natural Language Interaction 👆**: Describe the steps, and let Midscene plan and control the user interface for you
- **Understand UI, Answer in JSON 🔍**: Provide prompts regarding the desired data format, and then receive the expected response in JSON format.
- **Intuitive Assertion 🤔**: Make assertions in natural language; it’s all based on AI understanding.
- **Experience by Chrome Extension 🖥️**: Start immediately with the Chrome Extension. No code is needed while exploring.
- **Visualized Report for Debugging 🎞️**: With our visualized report file, you can easily understand and debug the whole process.
- **Totally Open Source! 🔥**: Experience a whole new world of automation development. Enjoy!
- **Natural Language Interaction 👆**: Just describe your goals and steps, and Midscene will plan and operate the user interface for you.
- **Chrome Extension Experience 🖥️**: Start experiencing immediately through the Chrome extension, no coding required.
- **Puppeteer/Playwright Integration 🔧**: Supports Puppeteer and Playwright integration, allowing you to combine AI capabilities with these powerful automation tools for easy automation.
- **Support Private Deployment 🤖**: Supports private deployment of [`UI-TARS`](https://github.com/bytedance/ui-tars) model, which outperforms closed-source models like GPT-4o and Claude in UI automation scenarios while better protecting data security.
- **Support General Models 🌟**: Supports general large models like GPT-4o and Claude, adapting to various scenario needs.
- **Visual Reports for Debugging 🎞️**: Through our test reports and Playground, you can easily understand, replay and debug the entire process.
- **Completely Open Source 🔥**: Experience a whole new world of automation development. Enjoy!
- **Understand UI, JSON Format Responses 🔍**: You can specify data format requirements and receive responses in JSON format.
- **Intuitive Assertions 🤔**: Express your assertions in natural language, and AI will understand and process them.

## ✨ Model Choices

@@ -80,6 +85,28 @@ There are so many UI automation tools out there, and each one seems to be all-po
* [Follow us on X](https://x.com/midscene_ai)
* [Lark Group](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=291q2b25-e913-411a-8c51-191e59aab14d)


## Citation

If you use Midscene.js in your research or project, please cite:

```bibtex
@software{Midscene.js,
author = {Zhou, Xiao and Yu, Tao},
title = {Midscene.js: Assign AI as your web operator.},
year = {2025},
publisher = {GitHub},
url = {https://github.com/web-infra-dev/midscene}
}
```


## 📝 License

Midscene.js is [MIT licensed](https://github.com/web-infra-dev/midscene/blob/main/LICENSE).

---

<div align="center">
If this project helps you or inspires you, please give us a ⭐️
</div>
38 changes: 32 additions & 6 deletions README.zh.md
@@ -10,7 +10,7 @@
</div>

<p align="center">
AI 加持,更愉悦的 UI 自动化
让 AI 成为你的浏览器操作员
</p>

<p align="center">
@@ -22,10 +22,12 @@
<a href="https://x.com/midscene_ai"><img src="https://img.shields.io/twitter/follow/midscene_ai" alt="twitter" /></a>
</p>

Midscene.js 是一个由 AI 驱动的自动化 SDK,能够使用自然语言对网页进行操作、验证,并提取 JSON 格式的数据
Midscene.js 让 AI 成为你的浏览器操作员 🤖。只需用自然语言描述你想做什么,它就能帮你操作网页、验证内容,并提取数据。无论你是想快速体验还是深度开发,都可以轻松上手。如果您在项目中使用了 Midscene.js,可以加入我们的 [社区](https://github.com/web-infra-dev/midscene?tab=readme-ov-file#-community) 来与我们交流和分享。

## 案例

下面的示例视频基于 [UI-TARS 7B SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT) 模型录制,视频没有任何加速。

| 指令 | 视频 |
| :---: | :---: |
| 发布一条 Twitter | <video src="https://github.com/user-attachments/assets/bb3d695a-fbff-4af1-b6cc-5e967c07ccee" height="300" /> |
@@ -39,12 +41,15 @@ Midscene.js 是一个由 AI 驱动的自动化 SDK,能够使用自然语言对

## 💡 特性

- **自然语言互动 👆**:只需描述你的步骤,Midscene 会为你规划和操作用户界面
- **理解UI、JSON格式回答 🔍**:你可以提出关于数据格式的要求,然后得到 JSON 格式的预期回应。
- **直观断言 🤔**:用自然语言表达你的断言,AI 会理解并处理。
- **自然语言互动 👆**:只需描述你的目标和步骤,Midscene 会为你规划和操作用户界面。
- **Chrome 插件体验 🖥️**:通过 Chrome 插件,你可以立即开始体验,无需编写代码。
- **用可视化报告来调试 🎞️**:通过我们的测试报告和 Playground,你可以轻松理解和调试整个过程。
- **Puppeteer/Playwright 集成 🔧**:支持 Puppeteer 和 Playwright 集成,让你能够结合 AI 能力和这些自动化工具的强大功能,轻松实现自动化操作。
- **支持私有化部署 🤖**:支持私有化部署 [`UI-TARS`](https://github.com/bytedance/ui-tars) 模型,相比 GPT-4o、Claude 等闭源模型,不仅在 UI 自动化场景下表现更加出色,还能更好地保护数据安全。
- **支持通用模型 🌟**:支持 GPT-4o、Claude 等通用大模型,适配多种场景需求。
- **用可视化报告来调试 🎞️**:通过我们的测试报告和 Playground,你可以轻松理解、回放和调试整个过程。
- **完全开源 🔥**:体验全新的自动化开发世界,尽情享受吧!
- **理解UI、JSON格式回答 🔍**:你可以提出关于数据格式的要求,然后得到 JSON 格式的预期回应。
- **直观断言 🤔**:用自然语言表达你的断言,AI 会理解并处理。

## ✨ 选择 AI 模型

@@ -83,7 +88,28 @@ Midscene.js 是一个由 AI 驱动的自动化 SDK,能够使用自然语言对

<img src="https://github.com/user-attachments/assets/211b05c9-3ccd-4f52-b798-f3a7f51330ed" alt="lark group link" width="300" />

## 引用

如果您在研究或项目中使用了 Midscene.js,请引用:

```bibtex
@software{Midscene.js,
author = {Zhou, Xiao and Yu, Tao},
title = {Midscene.js: Assign AI as your web operator.},
year = {2025},
publisher = {GitHub},
url = {https://github.com/web-infra-dev/midscene}
}
```


## 📝 授权许可

Midscene.js 遵循 [MIT 许可协议](https://github.com/web-infra-dev/midscene/blob/main/LICENSE)


---

<div align="center">
如果本项目对你有帮助或启发,请给我们一个 ⭐️
</div>
@@ -1,4 +1,4 @@
# Cache
# Caching

Midscene.js provides AI caching features to improve the stability and speed of the entire AI execution process. The cache mainly refers to caching how AI recognizes page elements. Cached AI query results are used if page elements haven't changed.
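As a sketch, enabling the cache for a run is a single environment switch. The variable name here is an assumption based on the Midscene caching docs; verify it against your installed version:

```shell
# Sketch: enable Midscene's AI cache for a script run.
# MIDSCENE_CACHE is an assumption based on the Midscene caching docs;
# verify the exact variable name against your installed version.
export MIDSCENE_CACHE=true

# then run your script as usual, for example:
# npx tsx demo.ts
```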

@@ -10,10 +10,27 @@ Midscene.js uses general-purpose large language models (LLMs, like `gpt-4o`) as
You can also use open-source models like `UI-TARS` to improve the performance and data privacy.
:::

## Comparison between general-purpose LLMs and dedicated models

The table below compares general-purpose LLMs with a dedicated model (like `UI-TARS`). We will talk about both in detail later.

| | General-purpose LLMs (default) | Dedicated model like `UI-TARS` |
| --- | --- | --- |
| **What it is** | a model for general-purpose tasks | a model dedicated to UI automation |
| **How to get started** | easy, just get an API key | a bit complex, you need to deploy it on your own server |
| **Performance** | 3-10x slower compared to pure JavaScript automation | could be acceptable with proper deployment |
| **Who will get the page data** | the model provider | your own server |
| **Cost** | more expensive, usually pay per token | less expensive, pay for the server |
| **Prompting** | prefers step-by-step instructions | still prefers step-by-step instructions, but performs better in uncertain situations |

## Choose a general-purpose LLM

Midscene uses OpenAI `gpt-4o` as the default model, since this model performs the best among all general-purpose LLMs at this moment.

To use the official `gpt-4o` from OpenAI, you can simply set the `OPENAI_API_KEY` in the environment variables. Refer to [Config Model and Provider](./model-provider) for more details.

### Choose a model other than `gpt-4o`

If you want to use other models, please follow these steps:

1. A multimodal model is required, which means it must support image input.
@@ -22,7 +39,7 @@ If you want to use other models, please follow these steps:
1. If you find it not working well after changing the model, you can try using some short and clear prompt, or roll back to the previous model. See more details in [Prompting Tips](./prompting-tips).
1. Remember to follow the terms of use of each model and provider.
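As a rough sketch, switching to another supported model is mostly environment configuration. The variable names below are the ones used elsewhere in this repository (for example in the CI workflow); the key and base URL are placeholders you must replace with your provider's values:

```shell
# Sketch: point Midscene at an OpenAI-compatible endpoint serving another
# multimodal model. The key and base URL below are placeholders.
export OPENAI_API_KEY="sk-..."                                 # placeholder
export OPENAI_BASE_URL="https://your-provider.example.com/v1"  # placeholder
export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"                # one of the known supported models
```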

### Known Supported General-Purpose Models
### Known supported general-purpose models

Besides `gpt-4o`, the known supported models are:

@@ -31,17 +48,33 @@ Besides `gpt-4o`, the known supported models are:
- `qwen-vl-max-latest`
- `doubao-vision-pro-32k`

### About the token cost

Image resolution and the number of page elements (i.e., the size of the UI context created by Midscene) affect the token cost.

Here are some typical figures for `gpt-4o-0806` without prompt caching.

|Task | Resolution | Prompt Tokens / Price | Completion Tokens / Price | Total Cost |
|-----|------------|--------------|---------------|-----------------|
|Plan and perform a search on eBay homepage| 1280x800 | 6005 / $0.0150125 |146 / $0.00146| $0.0164725 |
|Query the information about the item in the search results| 1280x800 | 9107 / $0.0227675 | 122 / $0.00122 | $0.0239875 |

> The price data was calculated in Nov 2024.
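The totals above can be reproduced from per-token prices. Here is a small sketch; the rates are assumptions inferred from the table (they match Nov 2024 `gpt-4o` pricing of $2.50 per 1M prompt tokens and $10.00 per 1M completion tokens):

```typescript
// Sketch: reproduce the totals in the table above from per-token prices.
// Rates are assumptions inferred from the table (Nov 2024 gpt-4o pricing).
const PROMPT_PRICE_PER_TOKEN = 2.5 / 1_000_000; // $2.50 per 1M prompt tokens
const COMPLETION_PRICE_PER_TOKEN = 10 / 1_000_000; // $10.00 per 1M completion tokens

function estimateCost(promptTokens: number, completionTokens: number): number {
  return (
    promptTokens * PROMPT_PRICE_PER_TOKEN +
    completionTokens * COMPLETION_PRICE_PER_TOKEN
  );
}

// The two tasks from the table:
console.log(estimateCost(6005, 146).toFixed(7)); // 0.0164725
console.log(estimateCost(9107, 122).toFixed(7)); // 0.0239875
```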

## Choose `UI-TARS` (an open-source model dedicated to UI automation)

UI-TARS is an end-to-end GUI agent model based on VLM architecture. It solely perceives screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations), achieving state-of-the-art performance on 10+ GUI benchmarks.

UI-TARS is an open-source model that is available in different sizes. You can deploy it on your own server, and it will dramatically improve the performance and data privacy.

For more details about UI-TARS, see [Github - UI-TARS](https://github.com/bytedance/ui-tars), [🤗 HuggingFace - UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT).
For more details about UI-TARS, see
* [Github - UI-TARS](https://github.com/bytedance/ui-tars)
* [🤗 HuggingFace - UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)
* [UI-TARS - Model Deployment Guide](https://juniper-switch-f10.notion.site/UI-TARS-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)

### What you will have after using UI-TARS

- **Speed**: a private-deployed UI-TARS model can be 5x faster than a general-purpose LLM. Each step of `.ai` call can be processed in 1-2 seconds.
- **Speed**: a privately deployed UI-TARS model can be 5x faster than a general-purpose LLM. Each step of an `.ai` call can be processed in 1-2 seconds on a high-performance GPU server.
- **Data privacy**: you can deploy it on your own server and your data will no longer be sent to the cloud.
- **More stable with short prompts**: UI-TARS is optimized for UI automation and is capable of handling more complex tasks with target-driven prompts. You can use it with a shorter prompt (although it is not recommended), and it performs even better when compared to a general-purpose LLM.

@@ -76,4 +109,4 @@ Once you feel uncomfortable with the speed, the cost, the accuracy, or the data
## More

* [Config Model and Provider](./model-provider)
* [UI-TARS on Github](https://github.com/bytedance/ui-tars)
* [UI-TARS on Github](https://github.com/bytedance/ui-tars)
29 changes: 10 additions & 19 deletions apps/site/docs/en/faq.md
@@ -10,42 +10,33 @@ Related Docs: [Prompting Tips](./prompting-tips)

There are some limitations with Midscene. We are still working on them.

1. The interaction types are limited to only tap, type, keyboard press, and scroll.
1. The interaction types are limited to only tap, drag, type, keyboard press, and scroll.
2. LLM is not 100% stable. Even GPT-4o can't return the right answer all the time. Following the [Prompting Tips](./prompting-tips) will help improve stability.
3. Since we use JavaScript to retrieve elements from the page, the elements inside the cross-origin iframe cannot be accessed.
4. We cannot access the native elements of Chrome, like the right-click context menu or file upload dialog.
5. Do not use Midscene to bypass CAPTCHA. Some LLM services are set to decline requests that involve CAPTCHA-solving (e.g., OpenAI), while the DOM of some CAPTCHA pages is not accessible by regular web scraping methods. Therefore, using Midscene to bypass CAPTCHA is not a reliable method.

## Can I use a model other than `gpt-4o`?

Yes. You can [config model and provider](./model-provider) if needed.
Of course. You can [choose a model](./choose-a-model) according to your needs.

## About the token cost

Image resolution and element numbers (i.e., a UI context size created by Midscene) will affect the token bill.

Here are some typical data with gpt-4o-0806 without prompt caching.

|Task | Resolution | Prompt Tokens / Price | Completion Tokens / Price | Total Cost |
|-----|------------|--------------|---------------|-----------------|
|Plan and perform a search on eBay homepage| 1280x800 | 6005 / $0.0150125 |146 / $0.00146| $0.0164725 |
|Query the information about the item in the search results| 1280x800 | 9107 / $0.0227675 | 122 / $0.00122 | $0.0239875 |

> The price data was calculated in Nov 2024.
## What data is sent to LLM ?
## What data is sent to the AI model?

Currently, the contents are:
1. the key information extracted from the DOM, such as text content, class name, tag name, coordinates;
2. a screenshot of the page.

If you are concerned about the data privacy, please refer to [Data Privacy](./data-privacy).

## The automation process is running more slowly than the traditional one

Since Midscene.js invokes AI for each planning and querying operation, the running time may increase by a factor of 3 to 10 compared to traditional Playwright scripts, for instance from 5 seconds to 20 seconds. This is currently inevitable but may improve with advancements in LLMs.
When using a general-purpose LLM in Midscene.js, the running time may increase by a factor of 3 to 10 compared to traditional Playwright scripts, for instance from 5 seconds to 20 seconds. To keep the results stable, this token and time cost is currently unavoidable.


Despite the increased time and cost, Midscene stands out in practical applications due to its unique development experience and easy-to-maintain codebase. We are confident that incorporating automation scripts powered by Midscene will significantly enhance your project’s efficiency, cover many more situations, and boost overall productivity.

In short, it is worth the time and cost.
There are two ways to improve the running time:
1. Use a dedicated model, like UI-TARS. This is the recommended way. Read more about it in [Choose a model](./choose-a-model).
2. Use caching to reduce the token cost. Read more about it in [Caching](./caching).

## The webpage continues to flash when running in headed mode

6 changes: 3 additions & 3 deletions apps/site/docs/en/integrate-with-puppeteer.mdx
@@ -22,7 +22,7 @@ export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"

## Step 1. install dependencies

<PackageManagerTabs command="install @midscene/web puppeteer ts-node --save-dev" />
<PackageManagerTabs command="install @midscene/web puppeteer tsx --save-dev" />

## Step 2. write scripts

@@ -73,11 +73,11 @@ Promise.resolve(

## Step 3. run

Using ts-node to run, you will get the data of Headphones on eBay:
Using `tsx` to run, you will get the data of Headphones on eBay:

```bash
# run
npx ts-node demo.ts
npx tsx demo.ts

# it should print
# [
4 changes: 3 additions & 1 deletion apps/site/docs/en/prompting-tips.md
@@ -1,6 +1,6 @@
# Prompting Tips

The natural language parameter passed to Midscene will be part of the prompt sent to the LLM. There are certain techniques in prompt engineering that can help improve the understanding of user interfaces.
The natural language parameter passed to Midscene will be part of the prompt sent to the AI model. There are certain techniques in prompt engineering that can help improve the understanding of user interfaces.

## The purpose of optimization is to get a stable response from AI

@@ -51,6 +51,8 @@ To launch the local Playground server:
npx --yes @midscene/web
```

![Playground](/midescene-playground-entry.jpg)

## Infer or assert from the interface, not the DOM properties or browser status

All the data sent to the LLM is in the form of screenshots and element coordinates. The DOM and the browser instance are almost invisible to the LLM. Therefore, ensure everything you expect is visible on the screen.
File renamed without changes.
