diff --git a/apps/site/docs/doc/faq.md b/apps/site/docs/doc/faq.md deleted file mode 100644 index 1d0653fe3..000000000 --- a/apps/site/docs/doc/faq.md +++ /dev/null @@ -1,22 +0,0 @@ -# FAQ - -#### About the token cost - -Image resolution and element numbers (i.e., a UI context size created by MidScene) form the token bill. - -Here are some typical data. - -|Task | Resolution | Input tokens | Output tokens | GPT-4o Price | -|-----|------------|--------------|---------------|----------------| -|Find the download button on the VSCode website| 1920x1080| 2011|54| $0.011| -|Split the Github status page| 1920x1080| 3609|1020| $0.034| - -The above price data was calculated in June 2024. - -#### The automation process is running more slowly than it did before - -Since MidScene.js will invoke the AI each time it performs planning and querying, the running time may increase by a factor of 5 to 10. This is inevitable for now, but it may improve with advancements in LLMs. - -Despite the increased time and cost, MidScene stands out in practical applications due to its unique development experience and easy-to-maintain codebase. We are confident that incorporating automation scripts powered by MidScene will significantly enhance your project’s efficiency, streamline complex tasks, and boost overall productivity. By integrating MidScene, your team can focus on more strategic and innovative activities, leading to faster development cycles and better outcomes. - -In short, it is worth the time and cost. 
diff --git a/apps/site/docs/_meta.json b/apps/site/docs/en/_meta.json similarity index 64% rename from apps/site/docs/_meta.json rename to apps/site/docs/en/_meta.json index 9bf26aca5..6b8548c77 100644 --- a/apps/site/docs/_meta.json +++ b/apps/site/docs/en/_meta.json @@ -1,8 +1,8 @@ [ { "text": "Docs", - "link": "/doc/getting-started/introduction", - "activeMatch": "/doc/" + "link": "/docs/getting-started/introduction", + "activeMatch": "/docs" }, { "text": "Visualization Tool", diff --git a/apps/site/docs/doc/_meta.json b/apps/site/docs/en/docs/_meta.json similarity index 69% rename from apps/site/docs/doc/_meta.json rename to apps/site/docs/en/docs/_meta.json index 61a8eb1af..30c2e079b 100644 --- a/apps/site/docs/doc/_meta.json +++ b/apps/site/docs/en/docs/_meta.json @@ -13,6 +13,11 @@ "collapsible": false, "collapsed": false }, - "prompting-tips.md", - "faq.md" + { + "type": "dir", + "name": "more", + "label": "More", + "collapsible": false, + "collapsed": false + } ] \ No newline at end of file diff --git a/apps/site/docs/doc/getting-started/_meta.json b/apps/site/docs/en/docs/getting-started/_meta.json similarity index 100% rename from apps/site/docs/doc/getting-started/_meta.json rename to apps/site/docs/en/docs/getting-started/_meta.json diff --git a/apps/site/docs/doc/getting-started/introduction.mdx b/apps/site/docs/en/docs/getting-started/introduction.mdx similarity index 68% rename from apps/site/docs/doc/getting-started/introduction.mdx rename to apps/site/docs/en/docs/getting-started/introduction.mdx index 32184cce7..0247a26f0 100644 --- a/apps/site/docs/doc/getting-started/introduction.mdx +++ b/apps/site/docs/en/docs/getting-started/introduction.mdx @@ -8,33 +8,31 @@ UI automation can be frustrating, often involving a maze of *#ids*, *data-test-x Introducing MidScene.js, an innovative SDK designed to bring joy back to programming by simplifying automation tasks. 
-MidScene.js leverages a multimodal Large Language Model (LLM) to intuitively “understand” your user interface and carry out the necessary actions. Rather than writing and maintaining complex selectors, you can simply describe the interaction steps or expected data formats using a screenshot, and the AI will handle the execution for you. - -By employing MidScene.js, you ensure a more streamlined, efficient, and enjoyable approach to UI automation. +MidScene.js leverages a multimodal Large Language Model (LLM) to intuitively “understand” your user interface and carry out the necessary actions. You can simply describe the interaction steps or expected data formats, and the AI will handle the execution for you. ## Features -### Public LLMs are Fine +### Out-of-box LLM -It is fine to use publicly available LLMs such as GPT-4. There is no need for custom training. To experience the out-of-the-box AI-driven automation, token is all you need. 😀 +It is fine to use publicly available LLMs such as GPT-4o. There is no need for custom training. To experience the brand new way of writing automation, token is all you need. 😀 -### Execute Actions +### Execute Actions by AI Use `.aiAction` to perform a series of actions by describing the steps. For example `.aiAction('Enter "Learn JS today" in the task box, then press Enter to create')`. -### Extract Data from Page +### Extract Data from Page by AI `.aiQuery` is the method to extract customized data from the UI. For example, by calling `const dataB = await agent.aiQuery('string[], task names in the list');`, you will get an array of strings containing the task names. -### Perform Assertions +### Perform Assertions by AI Call `.aiAssert` to perform assertions on the page. -#### Visualization Tool +### Visualization Tool With our visualization tool, you can easily debug the prompt and AI response. All intermediate data, such as queries, plans, and actions, can be visualized.
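Taken together, the `.aiAction` / `.aiQuery` / `.aiAssert` features above suggest an agent surface that can be sketched in TypeScript. The interface below is inferred from this page's examples, NOT from `@midscene/web`'s real typings, and the mock stands in for a browser plus an LLM so the flow can run anywhere:

```typescript
// Hypothetical shape of the three AI methods named above (inferred from
// this page's examples, not the SDK's actual typings).
interface AiAgent {
  aiAction(steps: string): Promise<void>;
  aiQuery(dataDemand: string): Promise<unknown>;
  aiAssert(condition: string, errorMsg?: string): Promise<void>;
}

// A trivial mock so the flow can be exercised without a browser or an LLM.
const mockAgent: AiAgent = {
  async aiAction(_steps) {
    // a real agent would plan and execute UI steps here
  },
  async aiQuery(_dataDemand) {
    return ['Learn JS today']; // canned, LLM-style answer
  },
  async aiAssert(_condition, _errorMsg) {
    // a real agent would ask the LLM to judge the condition
  },
};

// The call order mirrors the feature list above: act, extract, assert.
async function demoFlow(agent: AiAgent): Promise<string[]> {
  await agent.aiAction('Enter "Learn JS today" in the task box, then press Enter to create');
  const tasks = (await agent.aiQuery('string[], task names in the list')) as string[];
  await agent.aiAssert('there is a task named "Learn JS today" in the list');
  return tasks;
}
```

Swapping `mockAgent` for a real `PuppeteerAgent` would keep `demoFlow` unchanged, which is the point of describing steps in plain language.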
diff --git a/apps/site/docs/doc/getting-started/quick-start.md b/apps/site/docs/en/docs/getting-started/quick-start.md similarity index 79% rename from apps/site/docs/doc/getting-started/quick-start.md rename to apps/site/docs/en/docs/getting-started/quick-start.md index ab4c3a247..caf8d00cd 100644 --- a/apps/site/docs/doc/getting-started/quick-start.md +++ b/apps/site/docs/en/docs/getting-started/quick-start.md @@ -1,6 +1,8 @@ # Quick Start -In this example, we use OpenAI GPT-4o and Puppeteer.js to search headphones on ebay, and then get the result items and prices in JSON format. Remember to prepare an OpenAI key that is eligible for accessing GPT-4o before running. +In this example, we use OpenAI GPT-4o and Puppeteer.js to search for headphones on eBay, and then get the result items and prices in JSON format. + +Remember to prepare an OpenAI key that is eligible for accessing GPT-4o before running. > [Puppeteer](https://pptr.dev/) is a Node.js library which provides a high-level API to control Chrome or Firefox over the DevTools Protocol or WebDriver BiDi. Puppeteer runs in headless mode (no visible UI) by default but can be configured to run in a visible ("headful") browser. @@ -19,8 +21,7 @@ npm install @midscene/web --save-dev npm install puppeteer ts-node --save-dev ``` -Write a simple demo to **extract the main download button of vscode website**. -Save the following code as `./demo.ts`. +Write and save the following code as `./demo.ts`.
```typescript import puppeteer, { Viewport } from 'puppeteer'; @@ -39,20 +40,31 @@ await page.waitForNavigation({ }); const page = await launchPage(); -// init MidScene agent +// 👀 init MidScene agent const mid = new PuppeteerAgent(page); -// perform a search +// 👀 perform a search await mid.aiAction('type "Headphones" in search box, hit Enter'); await sleep(5000); -// find the items +// 👀 find the items const items = await mid.aiQuery( '{itemTitle: string, price: Number}[], find item in list and corresponding price', ); console.log('headphones in stock', items); ``` +:::tip +You may have noticed that the key code here consists of only two lines, both written in plain language. + +```typescript +await mid.aiAction('type "Headphones" in search box, hit Enter'); +await mid.aiQuery( + '{itemTitle: string, price: Number}[], find item in list and corresponding price', +); +``` +::: + Using ts-node to run, you will get the data of Headphones on ebay: ```bash diff --git a/apps/site/docs/en/docs/more/_meta.json b/apps/site/docs/en/docs/more/_meta.json new file mode 100644 index 000000000..526724ee2 --- /dev/null +++ b/apps/site/docs/en/docs/more/_meta.json @@ -0,0 +1,4 @@ +[ + "prompting-tips", + "faq" +] \ No newline at end of file diff --git a/apps/site/docs/en/docs/more/faq.md b/apps/site/docs/en/docs/more/faq.md new file mode 100644 index 000000000..fda89a7b9 --- /dev/null +++ b/apps/site/docs/en/docs/more/faq.md @@ -0,0 +1,43 @@ +# FAQ + +### Can MidScene smartly plan the actions according to my one-line goal? Like executing "Tweet 'hello world'" + +MidScene is an automation assistance SDK with a key feature of action stability — ensuring the same actions are performed in each run. To maintain this stability, we encourage you to provide detailed instructions to help the AI understand each step of your task. + +If you require a 'goal-to-task' AI planning tool, you can develop one based on MidScene.
+ +Related Docs: +* [Tips for Prompting](./prompting-tips.html) + +### Limitations + +There are some limitations with MidScene. We are still working on them. + +1. The interaction types are limited to only tap, type, keyboard press, and scroll. +2. It's not 100% stable. Even GPT-4o can't return the right answer all the time. Following the [Prompting Tips](./prompting-tips) will help improve stability. +3. Since we use JavaScript to retrieve items from the page, the elements inside an iframe cannot be accessed. + +### Which LLM should I choose? + +MidScene needs a multimodal Large Language Model (LLM) to understand the UI. Currently, we find that OpenAI's GPT-4o performs much better than others. + +### About the token cost + +The token bill is determined by the image resolution and the number of elements (i.e., the size of the UI context created by MidScene). + +Here is some typical data: + +|Task | Resolution | Input tokens | Output tokens | GPT-4o Price | +|-----|------------|--------------|---------------|----------------| +|Find the download button on the VSCode website| 1920x1080| 2011|54| $0.011| +|Split the Github status page| 1920x1080| 3609|1020| $0.034| + +> The price data was calculated in June 2024. + +### The automation process is running more slowly than it did before + +Since MidScene.js invokes AI for each planning and querying operation, the running time may increase by a factor of 3 to 10 compared to traditional Playwright scripts, for instance from 5 seconds to 20 seconds. This is currently inevitable but may improve with advancements in LLMs. + +Despite the increased time and cost, MidScene stands out in practical applications due to its unique development experience and easy-to-maintain codebase. We are confident that incorporating automation scripts powered by MidScene will significantly enhance your project’s efficiency, cover many more situations, and boost overall productivity. + +In short, it is worth the time and cost.
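The token-cost table in the FAQ above can be cross-checked with simple arithmetic. This sketch assumes GPT-4o's June-2024 list prices of $5 per 1M input tokens and $15 per 1M output tokens; the FAQ does not state the exact rates it used:

```typescript
// Assumed June-2024 GPT-4o list prices (an assumption; not stated in the FAQ):
// $5 per 1M input tokens, $15 per 1M output tokens.
const INPUT_USD_PER_TOKEN = 5 / 1_000_000;
const OUTPUT_USD_PER_TOKEN = 15 / 1_000_000;

// Estimated cost of one MidScene call from its token counts.
function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return inputTokens * INPUT_USD_PER_TOKEN + outputTokens * OUTPUT_USD_PER_TOKEN;
}
```

With these assumed rates, 2011/54 tokens come to roughly $0.0109 and 3609/1020 to roughly $0.0333, matching the table's figures to within about a tenth of a cent.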
diff --git a/apps/site/docs/doc/prompting-tips.md b/apps/site/docs/en/docs/more/prompting-tips.md similarity index 61% rename from apps/site/docs/doc/prompting-tips.md rename to apps/site/docs/en/docs/more/prompting-tips.md index aba220485..bf96bafdb 100644 --- a/apps/site/docs/doc/prompting-tips.md +++ b/apps/site/docs/en/docs/more/prompting-tips.md @@ -1,10 +1,10 @@ # Tips for Prompting -There are certain techniques in prompt engineering that can help improve the understanding of user interfaces. +The natural language parameter passed to MidScene will be part of the prompt sent to the LLM. There are certain techniques in prompt engineering that can help improve the understanding of user interfaces. ### The purpose of optimization is to get a stable response from AI -Since AI has the nature of heuristic, the purpose of prompt tuning should be to obtain stable responses from the AI model across runs. In most cases, to expect a consistent response from GPT-4 by using a good prompt is entirely feasible. +Since AI is heuristic by nature, the purpose of prompt tuning is to obtain stable responses from the AI model across runs. In most cases, it is entirely feasible to get a consistent response from an LLM by using a good prompt. ### Detailed description and samples are welcome @@ -12,9 +12,9 @@ Detailed descriptions and examples are always welcome. For example: -Good ✅: "Find the search box, along with a region switch such as 'domestic', 'international'" +Good ✅: "Find the search box (it should be along with a region switch, such as 'domestic' or 'international'), type 'headphone', and hit Enter." -Bad ❌: "Lower Part of page" +Bad ❌: "Search 'headphone'" ### Infer from the UI, not the DOM properties @@ -22,7 +22,7 @@ All the data sent to the LLM are the screenshots and element coordinates. The DO Ensure everything you expect from the LLM is visible in the screenshot.
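The "detailed description" and "infer from the UI" tips above can be folded into a toy pre-flight check for prompts. The heuristic and its length threshold below are purely illustrative and not part of MidScene:

```typescript
// Toy "prompt lint": detailed, screenshot-oriented instructions tend to get
// stable responses; terse prompts or DOM-based hooks (test ids, CSS ids) do
// not, since the LLM only sees screenshots and coordinates.
// The 30-character threshold is arbitrary, for illustration only.
function isLikelyStablePrompt(prompt: string): boolean {
  const detailedEnough = prompt.trim().length >= 30;
  const usesDomHooks = /(data-test|test-id|#[\w-]+)/.test(prompt);
  return detailedEnough && !usesDomHooks;
}
```

The good/bad examples above pass and fail this check respectively, which is all such a heuristic is meant to show.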
-### LLMs can NOT tell the exact number like coords or hex-color, give it some choices +### LLMs can NOT tell the exact number like coords or hex-style color, give it some choices For example: @@ -34,11 +34,11 @@ Bad ❌: "[number, number], the [x, y] coords of the main button" ### Use visualization tool to debug -Use the visualization tool to debug and understand how the AI parse the interface. Just upload the log, and view the AI's parse results. You can find [the tool](/visualization/index.html) on the navigation bar on this site. +Use the visualization tool to debug and understand each step of MidScene. Just upload the log, and view the AI's parse results. You can find [the tool](/visualization/) on the navigation bar on this site. ### non-English prompting is acceptable -⁠Since AI models can understand many languages, feel free to write the prompt in any language you like. +Since most AI models can understand many languages, feel free to write the prompt in any language you prefer. It usually works even if the prompt is in a language different from the page's language. Good ✅: "点击顶部左侧导航栏中的“首页”链接" diff --git a/apps/site/docs/doc/usage/API.md b/apps/site/docs/en/docs/usage/API.md similarity index 76% rename from apps/site/docs/doc/usage/API.md rename to apps/site/docs/en/docs/usage/API.md index 8dd5a9a59..69a877b9a 100644 --- a/apps/site/docs/doc/usage/API.md +++ b/apps/site/docs/en/docs/usage/API.md @@ -2,7 +2,7 @@ ## config AI vendor -MidScene uses the OpenAI SDK as the default AI service. Currently OpenAI GPT-4o seems to perform best. However, you can customize the caller configuration with environment variables. +MidScene uses the OpenAI SDK as the default AI service. You can customize the configuration using environment variables. There are the main configs, in which `OPENAI_API_KEY` is required. @@ -17,7 +17,7 @@ Optional: ```bash # optional, if you want to use a customized endpoint -export OPENAI_BASE_URL="..." +export OPENAI_BASE_URL="https://..." 
# optional, if you want to specify a model name other than gpt-4o export MIDSCENE_MODEL_NAME='claude-3-opus-20240229'; @@ -28,6 +28,8 @@ export MIDSCENE_OPENAI_INIT_CONFIG_JSON='{"baseURL":"....","defaultHeaders":{"ke ## Use in Puppeteer +To initialize: + ```typescript import { PuppeteerAgent } from '@midscene/web/puppeteer'; @@ -54,7 +56,7 @@ await page.waitForNavigation({ // init MidScene agent, perform actions const mid = new PuppeteerAgent(page); -await mid.ai('type "how much is the ferry ticket in Shanghai" in search box, hit Enter'); +await mid.ai('type "Headphones" in search box, hit Enter'); ``` ## Use in Playwright @@ -63,7 +65,7 @@ await mid.ai('type "how much is the ferry ticket in Shanghai" in search box, hit > In the following documentation, you may see functions called with the `mid.` prefix. If you use destructuring in Playwright, like `async ({ ai, aiQuery }) => { /* ... */}`, you can call the functions without this prefix. It's just a matter of syntax. -### `.aiAction(steps: string)` or `.ai(steps: string)` - perform your actions +### `.aiAction(steps: string)` or `.ai(steps: string)` - Control the page You can use `.aiAction` to perform a series of actions. It accepts a `steps: string` as a parameter, which describes the actions. In the prompt, you should clearly describe the steps. MidScene will take care of the rest. @@ -79,25 +81,23 @@ await mid.aiAction('Move your mouse over the second item in the task list and cl await mid.ai('Click the "completed" status button below the task list'); ``` -Steps should always be clearly described. A very brief prompt like 'Tweet "Hello World"' will result in unstable performance and a high likelihood of failure. +Steps should always be clearly and thoroughly described. A very brief prompt like 'Tweet "Hello World"' will result in unstable performance and a high likelihood of failure. Under the hood, MidScene will plan the detailed steps by sending your page context and a screenshot to the AI. 
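The environment variables documented earlier on this page (`OPENAI_BASE_URL`, `MIDSCENE_MODEL_NAME`, `MIDSCENE_OPENAI_INIT_CONFIG_JSON`) can be pictured as feeding one resolved configuration object. The variable names come from this doc; the loader itself is an illustrative sketch, not MidScene's actual implementation:

```typescript
// Illustrative reader for the env vars documented on this page.
// Precedence shown here (explicit OPENAI_BASE_URL wins over the JSON blob)
// is this sketch's choice, not a documented guarantee.
interface ResolvedAIConfig {
  baseURL?: string;
  model?: string;
  defaultHeaders?: Record<string, string>;
}

function resolveAIConfig(env: Record<string, string | undefined>): ResolvedAIConfig {
  const raw = env.MIDSCENE_OPENAI_INIT_CONFIG_JSON;
  const extra = raw ? (JSON.parse(raw) as ResolvedAIConfig) : {};
  return {
    baseURL: env.OPENAI_BASE_URL ?? extra.baseURL,
    model: env.MIDSCENE_MODEL_NAME,
    defaultHeaders: extra.defaultHeaders,
  };
}

const cfg = resolveAIConfig({
  MIDSCENE_OPENAI_INIT_CONFIG_JSON:
    '{"baseURL":"https://proxy.example","defaultHeaders":{"key":"value"}}',
  MIDSCENE_MODEL_NAME: 'claude-3-opus-20240229',
});
```

In a real process you would pass `process.env` instead of the inline object (which uses a hypothetical `proxy.example` endpoint purely for demonstration).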
After that, MidScene will execute the steps one by one. If MidScene deems it impossible to execute, an error will be thrown. -The main capabilities of MidScene are as follows, which can be seen in the visualization tools: -1. **Planning**: Determine the steps to accomplish the task -2. **Find**: Identify the target element using a natural language description -3. **Action**: Tap, scroll, keyboard input, hover -4. **Others**: Sleep - -Currently, MidScene can't plan steps that include conditions and loops. +The main capabilities of MidScene are as follows, and your task will be split into these types. You can see them in the visualization tools: -:::tip Why can't MidScene smartly plan the actions according to my one-line goal? +1. **Locator**: Identify the target element using a natural language description +2. **Action**: Tap, scroll, keyboard input, hover +3. **Others**: Sleep -MidScene aims to be an automation assistance SDK. Its action stability (i.e., perform the same actions on each run) is a key feature. To achieve this, we encourage you to write down detailed instructions to help the AI better understand each step of your task. If you want a 'goal-to-task' AI planning tool, you can build one on top of MidScene. +Currently, MidScene can't plan steps that include conditions and loops. -::: +Related Docs: +* [FAQ: Can MidScene smartly plan the actions according to my one-line goal? Like executing "Tweet 'hello world'"](../more/faq.html) +* [Tips for Prompting](../more/prompting-tips.html) -### `.aiQuery(dataShape: any)` - extract any data from page +### `.aiQuery(dataDemand: any)` - extract any data from page You can extract customized data from the UI. Provided that the multi-modal AI can perform inference, it can return both data directly written on the page and any data based on "understanding". The return value can be any valid primitive type, like String, Number, JSON, Array, etc. Just describe it in the `dataDemand`.
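A `dataDemand` such as `'{itemTitle: string, price: Number}[], find item in list and corresponding price'` promises a particular shape, but an LLM reply may still deviate. A small runtime guard for that example shape (illustrative, not part of MidScene) keeps bad replies from propagating:

```typescript
// Runtime guard for the quick-start's example demand:
// '{itemTitle: string, price: Number}[], ...'. Illustrative only.
interface Item {
  itemTitle: string;
  price: number;
}

function isItemList(value: unknown): value is Item[] {
  return (
    Array.isArray(value) &&
    value.every(
      (v) =>
        typeof v === 'object' &&
        v !== null &&
        typeof (v as Item).itemTitle === 'string' &&
        typeof (v as Item).price === 'number',
    )
  );
}
```

A caller could run `aiQuery`, check the result with `isItemList`, and retry or fail fast when the model returns something malformed.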
@@ -137,9 +137,6 @@ This method will soon be available in MidScene. LangSmith is a platform designed to debug the LLMs. To integrate LangSmith, please follow these steps: ```shell -# install langsmith dependency -npm i langsmith - # set env variables # Flag to enable debug diff --git a/apps/site/docs/doc/usage/_meta.json b/apps/site/docs/en/docs/usage/_meta.json similarity index 100% rename from apps/site/docs/doc/usage/_meta.json rename to apps/site/docs/en/docs/usage/_meta.json diff --git a/apps/site/docs/en/index.md b/apps/site/docs/en/index.md new file mode 100644 index 000000000..bee660ba7 --- /dev/null +++ b/apps/site/docs/en/index.md @@ -0,0 +1,37 @@ +--- +pageType: home + +hero: + name: MidScene.js + text: Joyful Automation by AI + tagline: + actions: + - theme: brand + text: Introduction + link: /docs/getting-started/introduction + - theme: alt + text: Quick Start + link: /docs/getting-started/quick-start + image: + src: /midscene.png + alt: MidScene Logo +features: + - title: Natural Language Interaction + details: Describe the steps, let MidScene plan and execute for you. + icon: 🔍 + - title: Understand UI, Answer in JSON + details: Provide prompts for the desired data format, and then receive the predictable answer in JSON format. + icon: 🤔 + - title: Intuitive Assertion + details: Make assertions in natural language. It’s all based on AI understanding. + icon: 🤔 + - title: Out-of-box LLM + details: It is fine to use public multimodal LLMs like GPT-4o. There is no need for any custom training. + icon: 🪓 + - title: Visualization + details: With our visualization tool, you can easily understand and debug the whole process. + icon: 🎞️ + - title: Brand New Experience! + details: Experience a whole new world of automation development. Enjoy!
icon: 🔥 +--- \ No newline at end of file diff --git a/apps/site/docs/visualization/index.mdx b/apps/site/docs/en/visualization/index.mdx similarity index 100% rename from apps/site/docs/visualization/index.mdx rename to apps/site/docs/en/visualization/index.mdx diff --git a/apps/site/docs/zh/_meta.json b/apps/site/docs/zh/_meta.json new file mode 100644 index 000000000..11ff96581 --- /dev/null +++ b/apps/site/docs/zh/_meta.json @@ -0,0 +1,12 @@ +[ + { + "text": "文档", + "link": "/docs/getting-started/introduction", + "activeMatch": "/docs" + }, + { + "text": "可视化工具", + "link": "/visualization/", + "activeMatch": "/visualization/" + } +] \ No newline at end of file diff --git a/apps/site/docs/zh/docs/_meta.json b/apps/site/docs/zh/docs/_meta.json new file mode 100644 index 000000000..b8757c024 --- /dev/null +++ b/apps/site/docs/zh/docs/_meta.json @@ -0,0 +1,23 @@ +[ + { + "type": "dir", + "name": "getting-started", + "label": "开始使用", + "collapsible": false, + "collapsed": false + }, + { + "type": "dir", + "name": "usage", + "label": "接口文档", + "collapsible": false, + "collapsed": false + }, + { + "type": "dir", + "name": "more", + "label": "更多", + "collapsible": false, + "collapsed": false + } +] \ No newline at end of file diff --git a/apps/site/docs/zh/docs/getting-started/_meta.json b/apps/site/docs/zh/docs/getting-started/_meta.json new file mode 100644 index 000000000..ba8c44235 --- /dev/null +++ b/apps/site/docs/zh/docs/getting-started/_meta.json @@ -0,0 +1,4 @@ +[ + "introduction", + "quick-start.md" +] \ No newline at end of file diff --git a/apps/site/docs/zh/docs/getting-started/introduction.mdx b/apps/site/docs/zh/docs/getting-started/introduction.mdx new file mode 100644 index 000000000..7eac839de --- /dev/null +++ b/apps/site/docs/zh/docs/getting-started/introduction.mdx @@ -0,0 +1,57 @@ +# 介绍 + + + +UI 自动化太难写了。自动化脚本里到处都是选择器,比如 `#ids`、`data-test-xxx`、`.selectors`。在页面重构的时候,维护自动化脚本更会是一场灾难。 + +我们在这里推出 MidScene.js。通过 AI 加持,它能让自动化脚本变得简单、可维护,助你重拾编码的乐趣。 +
+MidScene.js 采用了多模态大语言模型(LLM),能够直观地“理解”你的用户界面并执行必要的操作。你只需描述交互步骤或期望的数据格式,AI 就能为你完成任务。 + +## 特性 + +### 开箱即用的大型语言模型 (LLM) + +你可以直接使用公开可用的 LLM,例如 GPT-4o,而不需要任何定制训练。只需 API Key 和 token 额度,你就能体验全新的编码体验。😀 + +### 通过 AI 执行交互 + +使用 `.aiAction` 方法,你可以通过描述步骤来执行交互。 + +比如: + +```javascript +.aiAction('在任务框中输入 "Learn JS today",然后按 Enter 键创建任务') +``` + +### 通过 AI 从页面提取数据 + +`.aiQuery` 方法用于从 UI 中提取定制的数据。 + +例如: + +```javascript +const dataB = await agent.aiQuery('string[], 任务列表中的任务名'); +``` + +你会得到一个包含任务名字符串的数组。 + +### 通过 AI 执行断言 + +调用 `.aiAssert` 方法可以对页面进行断言。 + +### 可视化工具 + +我们提供的可视化工具,可以非常方便地调试提示和 AI 的响应。所有的中间数据,例如查询(Query)、计划(Planning)和动作(Actions),都可以被可视化。 + +你可以打开 [可视化工具](/visualization/) 来查看示例。 + +![可视化工具示例](/Visualizer.gif) + +## 流程图 + +下图展示了 MidScene 的核心流程。 + +![](/flow.png) \ No newline at end of file diff --git a/apps/site/docs/zh/docs/getting-started/quick-start.md b/apps/site/docs/zh/docs/getting-started/quick-start.md new file mode 100644 index 000000000..6d42ffc7a --- /dev/null +++ b/apps/site/docs/zh/docs/getting-started/quick-start.md @@ -0,0 +1,90 @@ +# 快速开始 + +在这个例子中,我们将使用 OpenAI GPT-4o 和 Puppeteer.js 在 eBay 上搜索 "耳机",并以 JSON 格式返回商品和价格结果。 + +在运行该示例之前,请确保你已经准备了有权限访问 GPT-4o 的 OpenAI key。 + +> [Puppeteer](https://pptr.dev/) 是一个 Node.js 库,它通过 DevTools Protocol 或 WebDriver BiDi 提供了用于控制 Chrome 或 Firefox 的高级 API。默认情况下,Puppeteer 运行在无头模式(headless mode, 即没有可见的 UI),但也可以配置为在有头模式(headed mode, 即有可见的浏览器界面)下运行。 + +配置 API Key + +```bash +# replace by your own +export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz" +``` + +安装依赖 + +```bash +npm install @midscene/web --save-dev +# for demo use +npm install puppeteer ts-node --save-dev +``` + +编写下方代码,保存为 `./demo.ts` + +```typescript +import puppeteer, { Viewport } from 'puppeteer'; +import { PuppeteerAgent } from '@midscene/web/puppeteer'; + +// 初始化 Puppeteer Page +const browser = await puppeteer.launch({ + headless: false, // here we use headed mode to help debug +}); + +const page = await browser.newPage(); +await
page.goto('https://www.ebay.com'); +await page.waitForNavigation({ + timeout: 20 * 1000, + waitUntil: 'networkidle0', +}); +const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms)); // 简单的 sleep 辅助函数 + +// 👀 初始化 MidScene agent +const mid = new PuppeteerAgent(page); + +// 👀 执行搜索 +await mid.aiAction('type "Headphones" in search box, hit Enter'); +await sleep(5000); + +// 👀 提取数据 +const items = await mid.aiQuery( + '{itemTitle: string, price: Number}[], find item in list and corresponding price', +); +console.log('headphones in stock', items); +``` + +:::tip + +你可能已经注意到了,上述文件中的关键代码只有两行,且都是用自然语言编写的。 + +```typescript +await mid.aiAction('type "Headphones" in search box, hit Enter'); +await mid.aiQuery( + '{itemTitle: string, price: Number}[], find item in list and corresponding price', +); +``` +::: + +使用 `ts-node` 来运行,你会看到命令行打印出了耳机的商品信息: + +```bash +# run +npx ts-node demo.ts + +# 命令行应该有如下输出 +# [ +# { +# itemTitle: 'JBL Tour Pro 2 - True wireless Noise Cancelling earbuds with Smart Charging Case', +# price: 551.21 +# }, +# { +# itemTitle: 'Soundcore Space One无线耳机40H ANC播放时间2XStronger语音还原', +# price: 543.94 +# } +# ] +``` + +运行 MidScene 之后,系统会生成一个日志文件,默认存放在 `./midscene_run/latest.web-dump.json`。然后,你可以把这个文件导入 [可视化工具](/visualization/),这样你就能更清楚地了解整个过程。 + +在 [可视化工具](/visualization/) 中,点击 `Load Demo` 按钮,你将能够看到上方代码的运行结果以及其他的一些示例。 \ No newline at end of file diff --git a/apps/site/docs/zh/docs/more/_meta.json b/apps/site/docs/zh/docs/more/_meta.json new file mode 100644 index 000000000..526724ee2 --- /dev/null +++ b/apps/site/docs/zh/docs/more/_meta.json @@ -0,0 +1,4 @@ +[ + "prompting-tips", + "faq" +] \ No newline at end of file diff --git a/apps/site/docs/zh/docs/more/faq.md b/apps/site/docs/zh/docs/more/faq.md new file mode 100644 index 000000000..07bfb47bc --- /dev/null +++ b/apps/site/docs/zh/docs/more/faq.md @@ -0,0 +1,39 @@ +# FAQ + +### MidScene 能否根据一句话指令实现智能规划?比如执行 "发一条微博" + +MidScene 是一个辅助 UI 自动化的 SDK,运行时稳定性很关键——即保证每次运行都执行相同的动作。为了保持这种稳定性,我们希望你提供详细的指令,以帮助 AI 清晰地理解并执行。 + +如果你需要一个 '目标到任务' 的 AI 规划工具,不妨基于 MidScene
自行开发一个。 + +关联文档: +* [编写提示词的技巧](./prompting-tips) + +### 局限性 + +MidScene 存在一些局限性,我们仍在努力改进。 + +1. 交互类型有限:目前仅支持点击、输入、键盘和滚动操作。 +2. 稳定性不足:即使是 GPT-4o 也无法确保 100% 返回正确答案。遵循 [编写提示词的技巧](./prompting-tips) 可以帮助提高 SDK 稳定性。 +3. 元素访问受限:由于我们使用 JavaScript 从页面提取元素,所以无法访问 iframe 内部的元素。 + +### 关于 token 成本 + +Token 消耗分为两部分:图像分辨率和元素数量(即 MidScene 创建的 UI 上下文大小)。 + +以下是一些典型数据: + +| 任务 | 分辨率 | 输入 token | 输出 token | GPT-4o 价格 | +|-------|----------|----------|----------|--------------| +| 在 VSCode 网站上找到下载按钮 | 1920x1080 | 2011 | 54 | $0.011 | +| 拆分 Github 状态页面 | 1920x1080 | 3609 | 1020 | $0.034 | + +> 这些价格数据是 2024 年 6 月计算所得 + +### 脚本运行偏慢? + +由于 MidScene.js 每次进行规划(Planning)和查询(Query)时都会调用 AI,其运行耗时可能比传统 Playwright 用例增加 3 到 10 倍,比如从 5 秒变成 20 秒。目前,这一点仍无法避免。但随着大型语言模型(LLM)的进步,未来性能可能会有所改善。 + +尽管运行时间较长,MidScene 在实际应用中依然表现出色。它独特的开发体验会让代码库易于维护。我们相信,集成了 MidScene 的自动化脚本能够显著提升项目迭代效率,覆盖更多场景,提高整体生产力。 + +简而言之,虽然偏慢,但这些时间投入一定都是值得的。 \ No newline at end of file diff --git a/apps/site/docs/zh/docs/more/prompting-tips.md b/apps/site/docs/zh/docs/more/prompting-tips.md new file mode 100644 index 000000000..8adf360f0 --- /dev/null +++ b/apps/site/docs/zh/docs/more/prompting-tips.md @@ -0,0 +1,55 @@ +# 编写提示词的技巧 + +你在 MidScene 编写的自然语言参数,最终都会变成提示词(Prompt)发送给大语言模型。以下是一些可以帮助提升效果的提示词工程(Prompt Engineering)技巧。 + +### 目的是获得更稳定的响应 + +由于 AI 常常会“幻想”,调优的目的是在多次运行中获得模型的稳定响应。大多数情况下,通过使用良好的提示,LLM 的响应效果可以变得更好。 + +### 提供更详细的描述并提供样例 + +提供详细描述和示例一直是非常有用的提示词技巧。 + +例如: + +正确示例 ✅: "找到搜索框(搜索框的上方应该有区域切换按钮,如 '国内', '国际'),输入'耳机',敲回车" + +错误示例 ❌: "搜'耳机'" + +### 从界面而不是 DOM 属性推断信息 + +所有传递给 LLM 的数据都是截图和元素坐标。DOM 对 LLM 来说几乎是不可见的。因此,不要指望 LLM 能从 DOM 中推断任何信息(比如 `test-id-*` 属性)。 + +务必确保你想提取的信息都在截图中有所体现且能被 LLM “看到”。 + +### LLM 无法准确辨别数值(比如坐标或十六进制颜色值),不妨提供一些选项 + +例如: + +正确示例 ✅:string,文本的颜色,返回:蓝色 / 红色 / 黄色 / 绿色 / 白色 / 黑色 / 其他 + +错误示例 ❌:string,文本颜色的十六进制值 + +错误示例 ❌:[number, number],主按钮的 [x, y] 坐标 + +### 使用可视化工具调试 + +使用可视化工具调试和理解 MidScene 的每个步骤。只需上传日志,就可以查看 AI 的解析结果。你可以在本站导航栏上找到 [可视化工具](/visualization/)。 + +### 中、英文提示词都是可行的 + +由于大多数 AI
模型可以理解多种语言,所以请随意用你喜欢的语言撰写提示指令。即使提示语言与页面语言不同,通常也是可行的。 + +### 通过断言交叉检查结果 + +LLM 可能会表现出错误的行为。更好的做法是运行操作后检查其结果。 + +例如,你可以在插入记录后检查待办应用的列表内容。 + +```typescript +await ai('在任务框中输入“后天学习 AI”,然后按 Enter 键创建'); + +// 检查结果 +const taskList = await aiQuery('string[], 列表中的任务'); +expect(taskList.length).toBe(1); +expect(taskList[0]).toBe('后天学习 AI'); +``` diff --git a/apps/site/docs/zh/docs/usage/API.md b/apps/site/docs/zh/docs/usage/API.md new file mode 100644 index 000000000..74364c2e7 --- /dev/null +++ b/apps/site/docs/zh/docs/usage/API.md @@ -0,0 +1,152 @@ +# SDK 接口文档 + +## 配置 AI 供应商 + +MidScene 默认集成了 OpenAI SDK 调用 AI 服务,你也可以通过环境变量来自定义配置。 + +主要配置项如下,其中 `OPENAI_API_KEY` 是必选项: + +必选项: + +```bash +# 替换为你自己的 API Key +export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz" +``` + +可选项: + +```bash +# 可选, 如果你想更换 base URL +export OPENAI_BASE_URL="https://..." + +# 可选, 如果你想指定模型名称 +export MIDSCENE_MODEL_NAME='claude-3-opus-20240229'; + +# 可选, 如果你想变更 SDK 的初始化参数 +export MIDSCENE_OPENAI_INIT_CONFIG_JSON='{"baseURL":"....","defaultHeaders":{"key": "value"}}' +``` + +## 在 Puppeteer 中使用 + +初始化方法: + +```typescript +import { PuppeteerAgent } from '@midscene/web/puppeteer'; + +const mid = new PuppeteerAgent(puppeteerPageInstance); +``` + +一个完整案例: + +```typescript +import puppeteer, { Viewport } from 'puppeteer'; +import { PuppeteerAgent } from '@midscene/web/puppeteer'; + +// 初始化 Puppeteer Page +const browser = await puppeteer.launch({ + headless: false, // here we use headed mode to help debug +}); + +const page = await browser.newPage(); +await page.goto('https://www.bing.com'); +await page.waitForNavigation({ + timeout: 20 * 1000, + waitUntil: 'networkidle0', +}); + +// 初始化 MidScene agent, 执行操作 +const mid = new PuppeteerAgent(page); +await mid.ai('type "Headphones" in search box, hit Enter'); +``` + +## 在 Playwright 中使用 + +## API + +> 在以下文档中,你可能会看到带有 `mid.` 前缀的函数调用。如果你在 Playwright 中使用了解构赋值(object destructuring),如 `async ({ ai, aiQuery }) => { /* ...
*/}`,你可以不带这个前缀进行调用。这只是语法的区别。 + +### `.aiAction(steps: string)` 或 `.ai(steps: string)` - 控制界面 + +你可以使用 `.aiAction` 来执行一系列操作。它接受一个参数 `steps: string` 用于描述这些操作。在这个参数中,你应该清楚地描述每一个步骤,然后 MidScene 会自动为你分析并执行。 + +`.ai` 是 `.aiAction` 的简写。 + +以下是一些优质示例: + +```typescript +await mid.aiAction('在任务框中输入 "Learn JS today",然后按回车键创建任务'); +await mid.aiAction('将鼠标移动到任务列表中的第二项,然后点击第二个任务右侧的删除按钮'); + +// 使用 `.ai` 简写 +await mid.ai('点击任务列表下方的 "completed" 状态按钮'); +``` + +务必使用清晰、详细的步骤描述。使用非常简略的指令(如 “发一条微博” )会导致非常不稳定的执行结果或运行失败。 + +在底层,MidScene 会将页面上下文和截图发送给 LLM,以详细规划步骤。随后,MidScene 会逐步执行这些步骤。如果 MidScene 认为无法执行,将抛出一个错误。 + +你的任务会被拆解成下述内置方法,你可以在可视化工具中看到它们: + +1. **定位(Locator)**:使用自然语言描述找到目标元素 +2. **操作(Action)**:点击、滚动、键盘输入、悬停(hover) +3. **其他**:等待(sleep) + +目前,MidScene 无法规划包含条件和循环的步骤。 + +关联文档: +* [FAQ: MidScene 能否根据一句话指令实现智能规划?比如执行 "发一条微博"](../more/faq.html) +* [编写提示词的技巧](../more/prompting-tips.html) + +### `.aiQuery(dataDemand: any)` - 从页面提取数据 + +这个方法可以从 UI 提取自定义数据。它不仅能返回页面上直接书写的数据,还能基于“理解”返回数据(前提是多模态 AI 能够推理)。返回值可以是任何合法的基本类型,比如字符串、数字、JSON、数组等。你只需在 `dataDemand` 中描述它,MidScene 就会给你满足格式的返回。 + +例如,从页面解析详细信息: + +```typescript +const dataA = await mid.aiQuery({ + time: '左上角展示的日期和时间,string', + userInfo: '用户信息,{name: string}', + tableFields: '表格的字段名,string[]', + tableDataRecord: '表格中的数据记录,{id: string, [fieldName]: string}[]', +}); +``` + +你也可以用纯字符串描述预期的返回值格式: + +```typescript +// dataB 将是一个字符串数组 +const dataB = await mid.aiQuery('string[],列表中的任务名称'); + +// dataC 将是一个包含对象的数组 +const dataC = await mid.aiQuery('{name: string, age: string}[], 表格中的数据记录'); +``` + +### `.aiAssert(conditionPrompt: string, errorMsg?: string)` - 进行断言 + +这个方法即将上线。 + +`.aiAssert` 的功能类似于一般的 `assert` 方法,但可以用自然语言编写条件参数 `conditionPrompt`。MidScene 会调用 AI 来判断条件是否为真。若条件不满足,AI 给出的详细原因会附加到 `errorMsg` 中。 + +## 使用 LangSmith (可选) + +LangSmith 是一个用于调试大语言模型的平台。想要集成 LangSmith,请按以下步骤操作: + + +```bash +# 设置环境变量 + +# 启用调试标志 +export MIDSCENE_LANGSMITH_DEBUG=1 + +# LangSmith 配置 +export LANGCHAIN_TRACING_V2=true +export LANGCHAIN_ENDPOINT="https://api.smith.langchain.com" +export
LANGCHAIN_API_KEY="your_key_here" +export LANGCHAIN_PROJECT="your_project_name_here" +``` + +启动 MidScene 后,你应该会看到类似如下的日志: + +```log +DEBUGGING MODE: langsmith wrapper enabled +``` + diff --git a/apps/site/docs/zh/docs/usage/_meta.json b/apps/site/docs/zh/docs/usage/_meta.json new file mode 100644 index 000000000..e1b87c9bc --- /dev/null +++ b/apps/site/docs/zh/docs/usage/_meta.json @@ -0,0 +1,3 @@ +[ + "API.md" +] \ No newline at end of file diff --git a/apps/site/docs/index.md b/apps/site/docs/zh/index.md similarity index 75% rename from apps/site/docs/index.md rename to apps/site/docs/zh/index.md index d573e7acd..f4b003fca 100644 --- a/apps/site/docs/index.md +++ b/apps/site/docs/zh/index.md @@ -8,25 +8,25 @@ hero: actions: - theme: brand text: Introduction - link: /doc/getting-started/introduction.html + link: /docs/getting-started/introduction - theme: alt text: Quick Start - link: /doc/getting-started/quick-start.html + link: /docs/getting-started/quick-start image: src: /midscene.png alt: MidScene Logo features: - - title: Interact by Natural Language + - title: 自然语言交互 details: Describe the steps, let MidScene plan and execute for you. icon: 🔍 - title: Understand UI, Answer in JSON details: Provide prompts for the desired data format, and then receive the predictable answer in JSON format. icon: 🤔 - - title: AI Assertion + - title: Intuitive Assertion details: Make assertions in natural language. It’s all based on AI understanding. icon: 🤔 - - title: Public LLMs are Fine - details: It is fine to use public LLMs like GPT-4o. There is no need for any custom training. + - title: Out-of-box LLM + details: It is fine to use public multimodal LLMs like GPT-4o. There is no need for any custom training. icon: 🪓 - title: Visualization details: With our visualization tool, you can easily understand and debug the whole process. 
diff --git a/apps/site/docs/zh/visualization/index.mdx b/apps/site/docs/zh/visualization/index.mdx new file mode 100644 index 000000000..65d049882 --- /dev/null +++ b/apps/site/docs/zh/visualization/index.mdx @@ -0,0 +1,6 @@ +--- +pageType: custom +--- +import Visualizer from '@midscene/visualizer'; + + diff --git a/apps/site/i18n.json b/apps/site/i18n.json new file mode 100644 index 000000000..56f57b20e --- /dev/null +++ b/apps/site/i18n.json @@ -0,0 +1,6 @@ +{ + "gettingStarted": { + "en": "Getting Started", + "zh": "开始" + } +} \ No newline at end of file diff --git a/apps/site/rspress.config.ts b/apps/site/rspress.config.ts index fc861838a..7dc58e204 100644 --- a/apps/site/rspress.config.ts +++ b/apps/site/rspress.config.ts @@ -12,7 +12,36 @@ export default defineConfig({ }, themeConfig: { darkMode: false, - socialLinks: [{ icon: 'gitlab', mode: 'link', content: 'https://github.com/web-infra-dev/midscene' }], + socialLinks: [{ icon: 'github', mode: 'link', content: 'https://github.com/web-infra-dev/midscene' }], + locales: [ + { + lang: 'en', + outlineTitle: 'On This Page', + label: 'On This Page', + }, + { + lang: 'zh', + outlineTitle: '大纲', + label: '大纲', + }, + ], }, globalStyles: path.join(__dirname, 'styles/index.css'), + locales: [ + { + lang: 'en', + // The label in nav bar to switch language + label: 'English', + title: 'MidScene.js', + description: 'MidScene.js', + }, + { + lang: 'zh', + // The label in nav bar to switch language + label: '简体中文', + title: 'MidScene.js', + description: 'MidScene.js', + }, + ], + lang: 'en', });