web-infra-dev · zhoushaw · Feb 21, 2025 · Feb 21, 2025 · Feb 21, 2025 · Feb 21, 2025
diff --git a/.github/workflows/ai-evaluation.yml b/.github/workflows/ai-evaluation.yml
@@ -54,4 +54,12 @@ jobs:
       run: |
         cd packages/evaluation
         pnpm run evaluate:locator
-        pnpm run evaluate:planning
+        pnpm run evaluate:planning
+
+    - name: Upload Logs
+      if: always()
+      uses: actions/upload-artifact@v4
+      with:
+        name: evaluation-logs
+        path: ${{ github.workspace }}/packages/evaluation/tests/__ai_responses__/
+        if-no-files-found: ignore
diff --git a/README.md b/README.md
@@ -43,7 +43,7 @@ Besides the default model *GPT-4o*, we have added two new recommended open-sourc
 - **Natural Language Interaction 👆**: Just describe your goals and steps, and Midscene will plan and operate the user interface for you.
 - **Chrome Extension Experience 🖥️**: Start experiencing immediately through the Chrome extension, no coding required.
 - **Puppeteer/Playwright Integration 🔧**: Supports Puppeteer and Playwright integration, allowing you to combine AI capabilities with these powerful automation tools for easy automation.
-- **Support Private Deployment 🤖**: Supports private deployment of [`UI-TARS`](https://github.com/bytedance/ui-tars) model, which outperforms closed-source models like GPT-4o and Claude in UI automation scenarios while better protecting data security.
+- **Support Open-Source Models 🤖**: Supports private deployment of [`UI-TARS`](https://github.com/bytedance/ui-tars) and [`Qwen2.5-VL`](https://github.com/QwenLM/Qwen2.5-VL), which outperforms closed-source models like GPT-4o and Claude in UI automation scenarios while better protecting data security.
 - **Support General Models 🌟**: Supports general large models like GPT-4o and Claude, adapting to various scenario needs.
 - **Visual Reports for Debugging 🎞️**: Through our test reports and Playground, you can easily understand, replay and debug the entire process.
 - **Support Caching 🔄**: The first time you execute a task through AI, it will be cached, and subsequent executions of the same task will significantly improve execution efficiency.

diff --git a/README.zh.md b/README.zh.md
@@ -34,7 +34,7 @@ Midscene.js 让 AI 成为你的浏览器操作员 🤖。只需用自然语言
 | 用 JS 代码驱动编排任务，搜集周杰伦演唱会的信息，并写入 Google Docs   | <video src="https://github.com/user-attachments/assets/75474138-f51f-4c54-b3cf-46d61d059999" height="300" />        |
 
 
-## 📢 支持了新的开源模型 - UI-TARS 和 Qwen2.5-VL
+## 📢 新增支持开源模型 - UI-TARS 和 Qwen2.5-VL（千问）
 
 除了默认的 `gpt-4o` 模型，我们还支持了两个新的开源模型：`UI-TARS` 和 `Qwen2.5-VL`。（是的，开源模型！）它们是专为 UI 自动化和图像识别设计的模型，在 UI 自动化场景下表现出色。更多信息请查看 [选择 AI 模型](https://midscenejs.com/zh/choose-a-model)。
 
@@ -43,7 +43,7 @@ Midscene.js 让 AI 成为你的浏览器操作员 🤖。只需用自然语言
 - **自然语言互动 👆**：只需描述你的目标和步骤，Midscene 会为你规划和操作用户界面。
 - **Chrome 插件体验 🖥️**：通过 Chrome 插件，你可以立即开始体验，无需编写代码。
 - **Puppeteer/Playwright 集成 🔧**：支持 Puppeteer 和 Playwright 集成，让你能够结合 AI 能力和这些自动化工具的强大功能，轻松实现自动化操作。
-- **支持私有化部署 🤖**：支持私有化部署 [`UI-TARS`](https://github.com/bytedance/ui-tars) 模型，相比 GPT-4o、Claude 等闭源模型，不仅在 UI 自动化场景下表现更加出色，还能更好地保护数据安全。
+- **支持开源模型 🤖**：支持开源模型 [`UI-TARS`](https://github.com/bytedance/ui-tars) 和 [千问 `Qwen2.5-VL`](https://github.com/QwenLM/Qwen2.5-VL)，相比 GPT-4o、Claude 等闭源模型，不仅在 UI 自动化场景下表现更加出色，还能更好地保护数据安全。
 - **支持通用模型 🌟**：支持 GPT-4o、Claude 等通用大模型，适配多种场景需求。
 - **用可视化报告来调试 🎞️**：通过我们的测试报告和 Playground，你可以轻松理解、回放和调试整个过程。
 - **支持缓存 🔄**：首次通过 AI 执行后任务会被缓存，后续执行相同任务时可显著提升执行效率。

diff --git a/apps/site/docs/en/choose-a-model.md b/apps/site/docs/en/choose-a-model.md
@@ -10,7 +10,6 @@ GPT-4o, Qwen-2.5-VL, and UI-TARS are the most recommended models for Midscene.js
 * [Qwen-2.5-VL](#qwen-25-vl): open-source VL model, almost same performance as GPT-4o, and cost less when using Aliyun service.
 * [UI-TARS](#ui-tars): open-source, end-to-end GUI agent model, good at target-driven tasks and error correction.
 
-
 You can also use other models, but you need to follow [the steps in the article](#choose-other-general-purpose-llms).
 
 :::info Which model should I choose to get started?

diff --git a/apps/site/docs/en/quick-experience.mdx b/apps/site/docs/en/quick-experience.mdx
@@ -8,7 +8,9 @@ Midscene.js provides a Chrome extension. By using it, you can quickly experience
 
 ## Preparation
 
-Prepare an OpenAI API key, we will use it soon.
+Prepare an API key from one of these models: OpenAI GPT-4o, Qwen-2.5-VL, UI-TARS, or any other supported providers. We will be using it soon.
+
+You can check the supported models in [Choose a model](./choose-a-model)
 
 ## Install and config
 
@@ -18,15 +20,16 @@ Start the extension (may be folded by Chrome extension icon), setup the config b
 
 ```shell
 OPENAI_API_KEY="sk-replace-by-your-own"
+# ...all other configs here (if any)
 ```
 
-You can also paste the configuration as described in [config model and provider](./model-provider) here.
-
 ## Start experiencing
 
-After the configuration, you can immediately experience Midscene. You can use actions to interact with the page, use queries to extract JSON data, or use assertions to validate. 
+After the configuration, you can immediately experience Midscene. There are three main tabs in the extension:
 
-You may notice that the extension will provide a playback of actions and a report file to review. This is the same report file you will receive from your automation scripts.
+- **Action**: use action to interact with the web page, like "type Midscene in the search box" or "click the login button".
+- **Query**: use query to extract JSON data from the web page, like "extract the user id from the page, return in {id: string}".
+- **Assert**: use assert to validate the web page, like "the page title is "Midscene"".
 
 Enjoy !
 
@@ -39,7 +42,6 @@ After experiencing, you may want to write some code to integrate Midscene. There
 * [Integrate with Puppeteer](./integrate-with-puppeteer)
 * [Integrate with Playwright](./integrate-with-playwright)
 
-
 ## FAQ
 
 * Extension fails to run and shows 'Cannot access a chrome-extension:// URL of different extension'

diff --git a/apps/site/docs/zh/choose-a-model.md b/apps/site/docs/zh/choose-a-model.md
@@ -4,10 +4,10 @@
 
 如果你想了解更多关于模型服务的配置项，请查看 [配置模型和服务商](./model-provider)。
 
-Midscene.js 推荐使用的三种模型是 GPT-4o，Qwen2.5-VL 和 UI-TARS。它们的的主要特性是：
+Midscene.js 推荐使用的三种模型是 GPT-4o，Qwen2.5-VL（千问）和 UI-TARS。它们的的主要特性是：
 
 * [GPT-4o](#gpt-4o): 表现比较平衡，需要使用较多 token。
-* [Qwen-2.5-VL](#qwen-25-vl): 开源的 VL 模型，几乎与 GPT-4o 表现相同，使用阿里云部署的版本时成本很低。
+* [千问 Qwen-2.5-VL](#qwen-25-vl): 开源的 VL 模型，几乎与 GPT-4o 表现相同，使用阿里云部署的版本时成本很低。
 * [UI-TARS](#ui-tars): 开源的端到端 GUI 代理模型，擅长执行目标驱动的任务，有错误纠正能力。
 
 你也可以使用其他模型，但需要按照[文章中的步骤](#选择其他通用-llm-模型)去配置。
@@ -47,7 +47,7 @@ MIDSCENE_MODEL_NAME="gpt-4o-2024-11-20" # 可选，默认是 "gpt-4o"。
 
 ### Qwen-2.5-VL
 
-从 0.12.0 版本开始，Midscene.js 支持 Qwen-2.5-VL 模型。
+从 0.12.0 版本开始，Midscene.js 支持千问 Qwen-2.5-VL 模型。
 
 Qwen-2.5-VL 是一个专为图像识别设计的开源模型，由阿里巴巴开发。在大多数情况下，它的表现与 GPT-4o 相当，有时甚至更好。我们推荐使用最大参数的 72B 版本。
 

diff --git a/apps/site/docs/zh/quick-experience.mdx b/apps/site/docs/zh/quick-experience.mdx
@@ -9,7 +9,9 @@
 
 ## 准备工作
 
-请先准备好 OpenAI 的 API 密钥，我们稍后将用到。
+请先准备好以下任意模型的 API 密钥：OpenAI GPT 4o, Qwen-2.5-VL, UI-TARS 或任何其他支持的模型。我们稍后会用到。
+
+你可以在 [选择模型](./choose-a-model) 文档中查看 Midscene.js 支持的模型和配置。
 
 ## 安装与配置
 
@@ -19,6 +21,7 @@
 
 ```shell
 OPENAI_API_KEY="sk-replace-by-your-own"
+# ...可能还有其他配置项，一并贴入
 ```
 
 ## 开始体验

diff --git a/packages/evaluation/tests/llm-locator.test.ts b/packages/evaluation/tests/llm-locator.test.ts
@@ -4,7 +4,7 @@ import {
   MIDSCENE_MODEL_NAME,
   getAIConfig,
 } from '@midscene/core';
-import { MATCH_BY_POSITION } from '@midscene/core/env';
+import { MIDSCENE_USE_QWEN_VL, getAIConfigInBoolean } from '@midscene/core/env';
 import { sleep } from '@midscene/core/utils';
 import { saveBase64Image } from '@midscene/shared/img';
 import dotenv from 'dotenv';
@@ -17,7 +17,6 @@ dotenv.config({
   override: true,
 });
 
-const failCaseThreshold = process.env.CI ? 1 : 0;
 const testSources = [
   'antd-carousel',
   'todo',
@@ -28,14 +27,19 @@ const testSources = [
   'aweme-play',
 ];
 
-const positionModeTag = getAIConfig(MATCH_BY_POSITION)
+const positionModeTag = getAIConfigInBoolean(MIDSCENE_USE_QWEN_VL)
   ? 'by_coordinates'
   : 'by_element';
 const resultCollector = new TestResultCollector(
   positionModeTag,
   getAIConfig(MIDSCENE_MODEL_NAME) || 'unspecified',
 );
 
+let failCaseThreshold = 0;
+if (process.env.CI && !getAIConfigInBoolean(MIDSCENE_USE_QWEN_VL)) {
+  failCaseThreshold = 3;
+}
+
 afterAll(async () => {
   await resultCollector.analyze(failCaseThreshold);
 });

diff --git a/packages/evaluation/tests/test-analyzer.ts b/packages/evaluation/tests/test-analyzer.ts
@@ -146,7 +146,7 @@ ${errorMsg ? `Error: ${errorMsg}` : ''}
       (item) => item.fail > allowFailCaseCount,
     );
     let errMsg = '';
-    if (failedCaseGroups.length > 0) {
+    if (failedCaseGroups.length > allowFailCaseCount) {
       errMsg = `Failed case groups: ${failedCaseGroups.map((item) => item.caseGroup).join(', ')}`;
       console.log(errMsg);
       console.log('error log file:', this.failedCaseLogPath);

diff --git a/packages/midscene/src/ai-model/prompt/llm-planning.ts b/packages/midscene/src/ai-model/prompt/llm-planning.ts
@@ -67,7 +67,7 @@ You are a versatile professional in software UI automation. Your outstanding con
 - All the actions you composed MUST be based on the page context information you get.
 - Trust the "What have been done" field about the task (if any), don't repeat actions in it.
 - Respond only with valid JSON. Do not write an introduction or summary or markdown prefix like \`\`\`json\`\`\`.
-- If you cannot plan any action at all (i.e. empty actions array), set reason in the \`error\` field.
+- If the screenshot and the instruction are totally irrelevant, set reason in the \`error\` field.
 
 ## About the \`actions\` field
 
@@ -218,7 +218,8 @@ export const planSchema: ResponseFormatJSONSchema = {
               },
               type: {
                 type: 'string',
-                description: 'Type of action, like "Tap", "Hover", etc.',
+                description:
+                  'Type of action, one of "Tap", "Hover" , "Input", "KeyboardPress", "Scroll", "ExpectedFalsyCondition", "Sleep"',
               },
               param: {
                 anyOf: [
@@ -245,6 +246,12 @@ export const planSchema: ResponseFormatJSONSchema = {
                     required: ['direction', 'scrollType', 'distance'],
                     additionalProperties: false,
                   },
+                  {
+                    type: 'object',
+                    properties: { reason: { type: 'string' } },
+                    required: ['reason'],
+                    additionalProperties: false,
+                  },
                 ],
                 description:
                   'Parameter of the action, can be null ONLY when the type field is Tap or Hover',

diff --git a/packages/midscene/src/insight/utils.ts b/packages/midscene/src/insight/utils.ts
@@ -32,6 +32,7 @@ export function emitInsightDump(
   } else if (getAIConfigInBoolean(MIDSCENE_USE_QWEN_VL)) {
     modelDescription = 'qwen-vl mode';
   }
+
   const baseData: DumpMeta = {
     sdkVersion: getVersion(),
     logTime: Date.now(),

diff --git a/packages/midscene/tests/ai/llm-planning/__snapshots__/basic.test.ts.snap b/packages/midscene/tests/ai/llm-planning/__snapshots__/basic.test.ts.snap
@@ -1,12 +1,12 @@
 // Vitest Snapshot v1, https://vitest.dev/guide/snapshot.html
 
-exports[`automation - planning > basic run 1`] = `
+exports[`automation - llm planning > basic run 1`] = `
 {
   "timeMs": 3500,
 }
 `;
 
-exports[`automation - planning > basic run 2`] = `
+exports[`automation - llm planning > basic run 2`] = `
 {
   "value": "Enter",
 }

diff --git a/packages/midscene/tests/ai/llm-planning/basic.test.ts b/packages/midscene/tests/ai/llm-planning/basic.test.ts
@@ -1,4 +1,5 @@
 import { plan } from '@/ai-model';
+import { MIDSCENE_USE_QWEN_VL, getAIConfigInBoolean } from '@/env';
 import { getContextFromFixture } from '@/evaluation';
 /* eslint-disable max-lines-per-function */
 import { describe, expect, it, vi } from 'vitest';
@@ -8,7 +9,9 @@ vi.setConfig({
   hookTimeout: 30 * 1000,
 });
 
-describe('automation - planning', () => {
+const qwenMode = getAIConfigInBoolean(MIDSCENE_USE_QWEN_VL);
+
+describe.skipIf(qwenMode)('automation - llm planning', () => {
   it('basic run', async () => {
     const { context } = await getContextFromFixture('todo');
 
@@ -18,14 +21,33 @@ describe('automation - planning', () => {
         context,
       },
     );
-    expect(actions.length).toBe(3);
-    expect(actions[0].type).toBe('Input');
-    expect(actions[1].type).toBe('Sleep');
-    expect(actions[1].param).toMatchSnapshot();
-    expect(actions[2].type).toBe('KeyboardPress');
-    expect(actions[2].param).toMatchSnapshot();
+    expect(actions).toBeTruthy();
+    expect(actions!.length).toBe(3);
+    expect(actions![0].type).toBe('Input');
+    expect(actions![1].type).toBe('Sleep');
+    expect(actions![1].param).toMatchSnapshot();
+    expect(actions![2].type).toBe('KeyboardPress');
+    expect(actions![2].param).toMatchSnapshot();
   });
 
+  it('scroll page', async () => {
+    const { context } = await getContextFromFixture('todo');
+    const { actions } = await plan(
+      'Scroll down the page by 200px, scroll up the page by 100px, scroll right the second item of the task list by 300px',
+      { context },
+    );
+    expect(actions).toBeTruthy();
+    expect(actions!.length).toBe(3);
+    expect(actions![0].type).toBe('Scroll');
+    expect(actions![0].locate).toBeNull();
+    expect(actions![0].param).toBeDefined();
+
+    expect(actions![2].locate).toBeTruthy();
+    expect(actions![2].param).toBeDefined();
+  });
+});
+
+describe('planning', () => {
   const todoInstructions = [
     {
       name: 'input first todo item',
@@ -59,7 +81,9 @@ describe('automation - planning', () => {
       const { context } = await getContextFromFixture('todo');
       const { actions } = await plan(instruction, { context });
       expect(actions).toBeTruthy();
-      expect(actions[0].locate?.id).toBeTruthy();
+      expect(actions![0].locate).toBeTruthy();
+      expect(actions![0].locate?.prompt).toBeTruthy();
+      expect(actions![0].locate?.id || actions![0].locate?.bbox).toBeTruthy();
     });
   });
 
@@ -72,66 +96,29 @@ describe('automation - planning', () => {
       },
     );
     expect(actions).toBeTruthy();
-    expect(actions[0].type).toBe('Scroll');
-    expect(actions[0].locate).toBeTruthy();
+    expect(actions![0].type).toBe('Scroll');
+    expect(actions![0].locate).toBeTruthy();
   });
 
-  it('scroll page', async () => {
-    const { context } = await getContextFromFixture('todo');
-    const { actions } = await plan(
-      'Scroll down the page by 200px, scroll up the page by 100px, scroll right the second item of the task list by 300px',
-      { context },
-    );
-    expect(actions.length).toBe(3);
-    expect(actions).toBeTruthy();
-    expect(actions[0].type).toBe('Scroll');
-    expect(actions[0].locate).toBeNull();
-    expect(actions[0].param).toBeDefined();
-
-    expect(actions[2].locate).toBeTruthy();
-    expect(actions[2].param).toBeDefined();
-  });
-
-  // it('throw error when instruction is not feasible', async () => {
-  //   const { context } = await getPageDataOfTestName('todo');
-  //   await expect(async () => {
-  //     await plan('close Cookie Prompt', {
-  //       context,
-  //     });
-  //   }).rejects.toThrow();
-  // });
-
   it('should not throw in an "if" statement', async () => {
     const { context } = await getContextFromFixture('todo');
     const { actions, error } = await plan(
       'If there is a cookie prompt, close it',
       { context },
     );
 
-    expect(actions.length === 1).toBeTruthy();
-    expect(actions[0]!.type).toBe('FalsyConditionStatement');
+    expect(actions?.length === 1).toBeTruthy();
+    expect(actions?.[0]!.type).toBe('ExpectedFalsyCondition');
   });
 
-  it('should give a further plan when something is not found', async () => {
+  it('should make mark unfinished when something is not found', async () => {
     const { context } = await getContextFromFixture('todo');
     const res = await plan(
       'click the input box, wait 300ms, click the close button of the cookie prompt',
       { context },
     );
-    // console.log(res);
-    expect(res.furtherPlan).toBeTruthy();
-    expect(res.furtherPlan?.whatToDoNext).toBeTruthy();
-    expect(res.furtherPlan?.log).toBeTruthy();
-  });
 
-  it.skip('partial error', async () => {
-    const { context } = await getContextFromFixture('todo');
-    const res = await plan(
-      'click the input box, click the close button of the cookie prompt',
-      { context },
-    );
-    expect(res.furtherPlan).toBeTruthy();
-    expect(res.furtherPlan?.whatToDoNext).toBeTruthy();
-    expect(res.furtherPlan?.log).toBeTruthy();
+    expect(res.finish).toBeFalsy();
+    expect(res.log).toBeDefined();
   });
 });
diff --git a/packages/web-integration/src/common/tasks.ts b/packages/web-integration/src/common/tasks.ts
@@ -1003,7 +1003,9 @@ export class PageTaskExecutor {
         };
       }
 
-      errorThought = output?.thought || 'unknown error';
+      errorThought =
+        output?.thought ||
+        `unknown error when waiting for assertion: ${assertion}`;
       const now = Date.now();
       if (now - startTime < checkIntervalMs) {
         const timeRemaining = checkIntervalMs - (now - startTime);

diff --git a/packages/web-integration/src/puppeteer/agent-launcher.ts b/packages/web-integration/src/puppeteer/agent-launcher.ts
@@ -9,7 +9,7 @@ export const defaultUA =
 export const defaultViewportWidth = 1440;
 export const defaultViewportHeight = 900;
 export const defaultViewportScale = process.platform === 'darwin' ? 2 : 1;
-export const defaultWaitForNetworkIdleTimeout = 10 * 1000;
+export const defaultWaitForNetworkIdleTimeout = 6 * 1000;
 
 interface FreeFn {
   name: string;