feat: add Chinese docs
yuyutaotao committed Jul 31, 2024
1 parent 7222816 commit 56fbef3
Showing 27 changed files with 628 additions and 74 deletions.
22 changes: 0 additions & 22 deletions apps/site/docs/doc/faq.md

This file was deleted.

4 changes: 2 additions & 2 deletions apps/site/docs/_meta.json → apps/site/docs/en/_meta.json
@@ -1,8 +1,8 @@
[
{
"text": "Docs",
"link": "/doc/getting-started/introduction",
"activeMatch": "/doc/"
"link": "/docs/getting-started/introduction",
"activeMatch": "/docs"
},
{
"text": "Visualization Tool",
@@ -13,6 +13,11 @@
"collapsible": false,
"collapsed": false
},
"prompting-tips.md",
"faq.md"
{
"type": "dir",
"name": "more",
"label": "More",
"collapsible": false,
"collapsed": false
}
]
Expand Up @@ -8,33 +8,31 @@ UI automation can be frustrating, often involving a maze of *#ids*, *data-test-x

Introducing MidScene.js, an innovative SDK designed to bring joy back to programming by simplifying automation tasks.

MidScene.js leverages a multimodal Large Language Model (LLM) to intuitively “understand” your user interface and carry out the necessary actions. Rather than writing and maintaining complex selectors, you can simply describe the interaction steps or expected data formats using a screenshot, and the AI will handle the execution for you.

By employing MidScene.js, you ensure a more streamlined, efficient, and enjoyable approach to UI automation.
MidScene.js leverages a multimodal Large Language Model (LLM) to intuitively “understand” your user interface and carry out the necessary actions. You can simply describe the interaction steps or expected data formats, and the AI will handle the execution for you.

## Features

### Public LLMs are Fine
### Out-of-box LLM

It is fine to use publicly available LLMs such as GPT-4. There is no need for custom training. To experience the out-of-the-box AI-driven automation, token is all you need. 😀
It is fine to use publicly available LLMs such as GPT-4o. There is no need for custom training. To experience the brand new way of writing automation, token is all you need. 😀

### Execute Actions
### Execute Actions by AI

Use `.aiAction` to perform a series of actions by describing the steps.

For example `.aiAction('Enter "Learn JS today" in the task box, then press Enter to create')`.
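
For instance, a minimal sketch of how this could look with the Puppeteer integration shown later in the docs (the to-do app URL is a hypothetical placeholder):

```typescript
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

// open a page with a to-do app (hypothetical URL, replace with your own)
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/todo');

// describe the steps in plain language and let MidScene plan and execute them
const agent = new PuppeteerAgent(page);
await agent.aiAction('Enter "Learn JS today" in the task box, then press Enter to create');
```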

### Extract Data from Page
### Extract Data from Page by AI

`.aiQuery` is the method to extract customized data from the UI.

For example, by calling `const dataB = await agent.aiQuery('string[], task names in the list');`, you will get an array of strings containing the task names.

### Perform Assertions
### Perform Assertions by AI

Call `.aiAssert` to perform assertions on the page.
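
Continuing the sketch above, an assertion might look like this (assuming `.aiAssert` accepts a natural-language statement about the current page):

```typescript
// assert in natural language; the AI judges the statement against the page
await agent.aiAssert('The task "Learn JS today" appears in the task list');
```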

#### Visualization Tool
### Visualization Tool

With our visualization tool, you can easily debug the prompt and AI response. All intermediate data, such as queries, plans, and actions, can be visualized.

@@ -1,6 +1,8 @@
# Quick Start

In this example, we use OpenAI GPT-4o and Puppeteer.js to search headphones on ebay, and then get the result items and prices in JSON format. Remember to prepare an OpenAI key that is eligible for accessing GPT-4o before running.
In this example, we use OpenAI GPT-4o and Puppeteer.js to search headphones on ebay, and then get the result items and prices in JSON format.

Remember to prepare an OpenAI key that is eligible for accessing GPT-4o before running.

> [Puppeteer](https://pptr.dev/) is a Node.js library which provides a high-level API to control Chrome or Firefox over the DevTools Protocol or WebDriver BiDi. Puppeteer runs in headless mode (no visible UI) by default but can be configured to run in a visible ("headful") browser.
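
If you want to watch the demo run in a visible browser window, here is a small sketch using plain Puppeteer options (this is standard Puppeteer, not something specific to MidScene):

```typescript
import puppeteer from 'puppeteer';

// launch a visible ("headful") browser instead of the default headless mode
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://www.ebay.com');
// ...drive the page here, then clean up
await browser.close();
```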
@@ -19,8 +21,7 @@ npm install @midscene/web --save-dev
npm install puppeteer ts-node --save-dev
```

Write a simple demo to **extract the main download button of vscode website**.
Save the following code as `./demo.ts`.
Write and save the following code as `./demo.ts`.

```typescript
import puppeteer, { Viewport } from 'puppeteer';
@@ -39,20 +40,31 @@ await page.waitForNavigation({
});
const page = await launchPage();

// init MidScene agent
// 👀 init MidScene agent
const mid = new PuppeteerAgent(page);

// perform a search
// 👀 perform a search
await mid.aiAction('type "Headphones" in search box, hit Enter');
await sleep(5000);

// find the items
// 👀 find the items
const items = await mid.aiQuery(
'{itemTitle: string, price: Number}[], find item in list and corresponding price',
);
console.log('headphones in stock', items);
```

:::tip
You may have noticed that the key part of this script is just two lines of code, both written in plain language.

```typescript
await mid.aiAction('type "Headphones" in search box, hit Enter');
await mid.aiQuery(
'{itemTitle: string, price: Number}[], find item in list and corresponding price',
);
```
:::

Run it with ts-node, and you will get the headphone data from eBay:

```bash
4 changes: 4 additions & 0 deletions apps/site/docs/en/docs/more/_meta.json
@@ -0,0 +1,4 @@
[
"prompting-tips",
"faq"
]
43 changes: 43 additions & 0 deletions apps/site/docs/en/docs/more/faq.md
@@ -0,0 +1,43 @@
# FAQ

### Can MidScene smartly plan the actions according to my one-line goal? Like executing "Tweet 'hello world'"

MidScene is an automation assistance SDK with a key feature of action stability — ensuring the same actions are performed in each run. To maintain this stability, we encourage you to provide detailed instructions to help the AI understand each step of your task.

If you require a 'goal-to-task' AI planning tool, you can develop one based on MidScene.

Related Docs:
* [Tips for Prompting](./prompting-tips.html)

### Limitations

There are some limitations with MidScene. We are still working on them.

1. The interaction types are limited to tap, type, keyboard press, and scroll.
2. It's not 100% stable. Even GPT-4o can't return the right answer all the time. Following the [Prompting Tips](./prompting-tips) will help improve stability.
3. Since we use JavaScript to retrieve items from the page, the elements inside the iframe cannot be accessed.

### Which LLM should I choose?

MidScene needs a multimodal Large Language Model (LLM) to understand the UI. Currently, we find that OpenAI's GPT-4o performs much better than others.

### About the token cost

The image resolution and the number of elements (i.e., the size of the UI context created by MidScene) determine the token cost.

Here are some typical data.

|Task | Resolution | Input tokens | Output tokens | GPT-4o Price |
|-----|------------|--------------|---------------|----------------|
|Find the download button on the VSCode website| 1920x1080| 2011|54| $0.011|
|Split the Github status page| 1920x1080| 3609|1020| $0.034|

> The price data was calculated in June 2024.
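
As a rough cross-check (assuming GPT-4o's June 2024 list price of roughly $5 per million input tokens and $15 per million output tokens), the first row works out to 2011 × $5 / 1,000,000 + 54 × $15 / 1,000,000 ≈ $0.011, which matches the table.
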
### The automation process is running more slowly than it did before

Since MidScene.js invokes AI for each planning and querying operation, the running time may increase by a factor of 3 to 10 compared to traditional Playwright scripts, for instance from 5 seconds to 20 seconds. This is currently inevitable but may improve with advancements in LLMs.

Despite the increased time and cost, MidScene stands out in practical applications due to its unique development experience and easy-to-maintain codebase. We are confident that incorporating automation scripts powered by MidScene will significantly enhance your project’s efficiency, cover many more situations, and boost overall productivity.

In short, it is worth the time and cost.
@@ -1,28 +1,28 @@
# Tips for Prompting

There are certain techniques in prompt engineering that can help improve the understanding of user interfaces.
The natural language parameter passed to MidScene will be part of the prompt sent to the LLM. There are certain techniques in prompt engineering that can help improve the understanding of user interfaces.

### The purpose of optimization is to get a stable response from AI

Since AI has the nature of heuristic, the purpose of prompt tuning should be to obtain stable responses from the AI model across runs. In most cases, to expect a consistent response from GPT-4 by using a good prompt is entirely feasible.
Since AI is heuristic by nature, the purpose of prompt tuning should be to obtain stable responses from the AI model across runs. In most cases, expecting a consistent response from the LLM by using a good prompt is entirely feasible.

### Detailed description and samples are welcome

Detailed descriptions and examples are always welcome.

For example:

Good ✅: "Find the search box, along with a region switch such as 'domestic', 'international'"
Good ✅: "Find the search box (it should be along with a region switch, such as 'domestic' or 'international'), type 'headphone', and hit Enter."

Bad ❌: "Lower Part of page"
Bad ❌: "Search 'headphone'"

### Infer from the UI, not the DOM properties

All the data sent to the LLM consists of screenshots and element coordinates. The DOM is almost invisible to the LLM, so do not expect the LLM to infer any information from the DOM (such as `test-id-*` properties).

Ensure everything you expect from the LLM is visible in the screenshot.
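
For example (an illustrative pair):

Good ✅: "Click the blue 'Sign in' button in the top-right corner"

Bad ❌: "Click the element whose data-test-id is 'login-btn'"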

### LLMs can NOT tell the exact number like coords or hex-color, give it some choices
### LLMs can NOT tell exact numbers like coordinates or hex-style colors, so give it some choices

For example:

@@ -34,11 +34,11 @@ Bad ❌: "[number, number], the [x, y] coords of the main button"

### Use visualization tool to debug

Use the visualization tool to debug and understand how the AI parse the interface. Just upload the log, and view the AI's parse results. You can find [the tool](/visualization/index.html) on the navigation bar on this site.
Use the visualization tool to debug and understand each step of MidScene. Just upload the log and view the AI's parsing results. You can find [the tool](/visualization/) in the navigation bar of this site.

### Non-English prompting is acceptable

Since AI models can understand many languages, feel free to write the prompt in any language you like.
Since most AI models can understand many languages, feel free to write the prompt in any language you prefer. It usually works even if the prompt is in a language different from the page's language.

Good ✅: "点击顶部左侧导航栏中的“首页”链接"

@@ -2,7 +2,7 @@

## config AI vendor

MidScene uses the OpenAI SDK as the default AI service. Currently OpenAI GPT-4o seems to perform best. However, you can customize the caller configuration with environment variables.
MidScene uses the OpenAI SDK as the default AI service. You can customize the configuration using environment variables.

These are the main configs, among which `OPENAI_API_KEY` is required.
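
For example (the key value below is a placeholder):

```bash
# required
export OPENAI_API_KEY="sk-..."
```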

@@ -17,7 +17,7 @@ Optional:

```bash
# optional, if you want to use a customized endpoint
export OPENAI_BASE_URL="..."
export OPENAI_BASE_URL="https://..."

# optional, if you want to specify a model name other than gpt-4o
export MIDSCENE_MODEL_NAME='claude-3-opus-20240229';
@@ -28,6 +28,8 @@ export MIDSCENE_OPENAI_INIT_CONFIG_JSON='{"baseURL":"....","defaultHeaders":{"ke

## Use in Puppeteer

To initialize:

```typescript
import { PuppeteerAgent } from '@midscene/web/puppeteer';

@@ -54,7 +56,7 @@ await page.waitForNavigation({

// init MidScene agent, perform actions
const mid = new PuppeteerAgent(page);
await mid.ai('type "how much is the ferry ticket in Shanghai" in search box, hit Enter');
await mid.ai('type "Headphones" in search box, hit Enter');
```

## Use in Playwright
@@ -63,7 +65,7 @@ await mid.ai('type "how much is the ferry ticket in Shanghai" in search box, hit

> In the following documentation, you may see functions called with the `mid.` prefix. If you use destructuring in Playwright, like `async ({ ai, aiQuery }) => { /* ... */}`, you can call the functions without this prefix. It's just a matter of syntax.
### `.aiAction(steps: string)` or `.ai(steps: string)` - perform your actions
### `.aiAction(steps: string)` or `.ai(steps: string)` - Control the page

You can use `.aiAction` to perform a series of actions. It accepts a `steps: string` as a parameter, which describes the actions. In the prompt, you should clearly describe the steps. MidScene will take care of the rest.

@@ -79,25 +81,23 @@ await mid.aiAction('Move your mouse over the second item in the task list and cl
await mid.ai('Click the "completed" status button below the task list');
```

Steps should always be clearly described. A very brief prompt like 'Tweet "Hello World"' will result in unstable performance and a high likelihood of failure.
Steps should always be clearly and thoroughly described. A very brief prompt like 'Tweet "Hello World"' will result in unstable performance and a high likelihood of failure.

Under the hood, MidScene will plan the detailed steps by sending your page context and a screenshot to the AI. After that, MidScene will execute the steps one by one. If MidScene deems it impossible to execute, an error will be thrown.

The main capabilities of MidScene are as follows, which can be seen in the visualization tools:
1. **Planning**: Determine the steps to accomplish the task
2. **Find**: Identify the target element using a natural language description
3. **Action**: Tap, scroll, keyboard input, hover
4. **Others**: Sleep

Currently, MidScene can't plan steps that include conditions and loops.
The main capabilities of MidScene are as follows; your task will be split into these types, which you can see in the visualization tool:

:::tip Why can't MidScene smartly plan the actions according to my one-line goal?
1. **Locator**: Identify the target element using a natural language description
2. **Action**: Tap, scroll, keyboard input, hover
3. **Others**: Sleep

MidScene aims to be an automation assistance SDK. Its action stability (i.e., perform the same actions on each run) is a key feature. To achieve this, we encourage you to write down detailed instructions to help the AI better understand each step of your task. If you want a 'goal-to-task' AI planning tool, you can build one on top of MidScene.
Currently, MidScene can't plan steps that include conditions and loops.

:::
Related Docs:
* [FAQ: Can MidScene smartly plan the actions according to my one-line goal? Like executing "Tweet 'hello world'"](../more/faq.html)
* [Tips for Prompting](../more/prompting-tips.html)

### `.aiQuery(dataShape: any)` - extract any data from page
### `.aiQuery(dataDemand: any)` - extract any data from page

You can extract customized data from the UI. Provided that the multimodal AI can perform the inference, it can return both data directly written on the page and any data based on "understanding". The return value can be any legal type, such as String, Number, JSON, or Array. Just describe the expected shape in `dataDemand`.
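
For example, a small sketch reusing the `mid` agent created above (the prompt and field names are illustrative):

```typescript
// describe the expected shape in plain language and get structured data back
const pageInfo = await mid.aiQuery(
  '{title: string, itemCount: number}, the page title and how many items are listed',
);
console.log(pageInfo.title, pageInfo.itemCount);
```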

@@ -137,9 +137,6 @@ This method will soon be available in MidScene.
LangSmith is a platform for debugging LLM applications. To integrate LangSmith, follow these steps:

```shell
# install langsmith dependency
npm i langsmith

# set env variables

# Flag to enable debug
File renamed without changes.
37 changes: 37 additions & 0 deletions apps/site/docs/en/index.md
@@ -0,0 +1,37 @@
---
pageType: home

hero:
name: MidScene.js
text: Joyful Automation by AI
tagline:
actions:
- theme: brand
text: Introduction
link: /docs/getting-started/introduction
- theme: alt
text: Quick Start
link: /docs/getting-started/quick-start
image:
src: /midscene.png
alt: MidScene Logo
features:
- title: Natural Language Interaction
details: Describe the steps, let MidScene plan and execute for you.
icon: 🔍
- title: Understand UI, Answer in JSON
details: Provide a prompt describing the desired data format, and receive a predictable answer in JSON format.
icon: 🤔
- title: Intuitive Assertion
details: Make assertions in natural language. It’s all based on AI understanding.
icon: 🤔
- title: Out-of-box LLM
details: It is fine to use public multimodal LLMs like GPT-4o. There is no need for any custom training.
icon: 🪓
- title: Visualization
details: With our visualization tool, you can easily understand and debug the whole process.
icon: 🎞️
- title: Brand New Experience!
details: Experience a whole new world of automation development. Enjoy!
icon: 🔥
---
File renamed without changes.
12 changes: 12 additions & 0 deletions apps/site/docs/zh/_meta.json
@@ -0,0 +1,12 @@
[
{
"text": "文档",
"link": "/docs/getting-started/introduction",
"activeMatch": "/docs"
},
{
"text": "可视化工具",
"link": "/visualization/",
"activeMatch": "/visualization/"
}
]
