
Commit 56fbef3

feat: add Chinese docs

1 parent 7222816 commit 56fbef3

File tree

27 files changed: +628, -74 lines

apps/site/docs/doc/faq.md

Lines changed: 0 additions & 22 deletions
This file was deleted.

apps/site/docs/_meta.json renamed to apps/site/docs/en/_meta.json

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
 [
   {
     "text": "Docs",
-    "link": "/doc/getting-started/introduction",
-    "activeMatch": "/doc/"
+    "link": "/docs/getting-started/introduction",
+    "activeMatch": "/docs"
   },
   {
     "text": "Visualization Tool",

apps/site/docs/doc/_meta.json renamed to apps/site/docs/en/docs/_meta.json

Lines changed: 7 additions & 2 deletions
@@ -13,6 +13,11 @@
     "collapsible": false,
     "collapsed": false
   },
-  "prompting-tips.md",
-  "faq.md"
+  {
+    "type": "dir",
+    "name": "more",
+    "label": "More",
+    "collapsible": false,
+    "collapsed": false
+  }
 ]

apps/site/docs/doc/getting-started/introduction.mdx renamed to apps/site/docs/en/docs/getting-started/introduction.mdx

Lines changed: 7 additions & 9 deletions
@@ -8,33 +8,31 @@ UI automation can be frustrating, often involving a maze of *#ids*, *data-test-x

 Introducing MidScene.js, an innovative SDK designed to bring joy back to programming by simplifying automation tasks.

-MidScene.js leverages a multimodal Large Language Model (LLM) to intuitively “understand” your user interface and carry out the necessary actions. Rather than writing and maintaining complex selectors, you can simply describe the interaction steps or expected data formats using a screenshot, and the AI will handle the execution for you.
-
-By employing MidScene.js, you ensure a more streamlined, efficient, and enjoyable approach to UI automation.
+MidScene.js leverages a multimodal Large Language Model (LLM) to intuitively “understand” your user interface and carry out the necessary actions. You can simply describe the interaction steps or expected data formats, and the AI will handle the execution for you.

 ## Features

-### Public LLMs are Fine
+### Out-of-box LLM

-It is fine to use publicly available LLMs such as GPT-4. There is no need for custom training. To experience the out-of-the-box AI-driven automation, token is all you need. 😀
+It is fine to use publicly available LLMs such as GPT-4o. There is no need for custom training. To experience the brand new way of writing automation, token is all you need. 😀

-### Execute Actions
+### Execute Actions by AI

 Use `.aiAction` to perform a series of actions by describing the steps.

 For example `.aiAction('Enter "Learn JS today" in the task box, then press Enter to create')`.

-### Extract Data from Page
+### Extract Data from Page by AI

 `.aiQuery` is the method to extract customized data from the UI.

 For example, by calling `const dataB = await agent.aiQuery('string[], task names in the list');`, you will get an array with string of the task names.

-### Perform Assertions
+### Perform Assertions by AI

 Call `.aiAssert` to perform assertions on the page.

-#### Visualization Tool
+### Visualization Tool

 With our visualization tool, you can easily debug the prompt and AI response. All intermediate data, such as queries, plans, and actions, can be visualized.
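For orientation, here is a minimal sketch of how the three AI methods named in this page (`aiAction`, `aiQuery`, `aiAssert`) might combine in one script. The Puppeteer setup mirrors the Quick Start below; the to-do app URL is a placeholder, not from the docs:

```typescript
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/todo'); // placeholder: any to-do app page

const agent = new PuppeteerAgent(page);

// Execute Actions by AI: describe the steps in plain language
await agent.aiAction('Enter "Learn JS today" in the task box, then press Enter to create');

// Extract Data from Page by AI: describe the shape of the data you want back
const taskNames = await agent.aiQuery('string[], task names in the list');
console.log(taskNames);

// Perform Assertions by AI: state what should now be true on the page
await agent.aiAssert('the task list contains an item named "Learn JS today"');

await browser.close();
```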

apps/site/docs/doc/getting-started/quick-start.md renamed to apps/site/docs/en/docs/getting-started/quick-start.md

Lines changed: 18 additions & 6 deletions
@@ -1,6 +1,8 @@
 # Quick Start

-In this example, we use OpenAI GPT-4o and Puppeteer.js to search headphones on ebay, and then get the result items and prices in JSON format. Remember to prepare an OpenAI key that is eligible for accessing GPT-4o before running.
+In this example, we use OpenAI GPT-4o and Puppeteer.js to search headphones on ebay, and then get the result items and prices in JSON format.
+
+Remember to prepare an OpenAI key that is eligible for accessing GPT-4o before running.

 > [Puppeteer](https://pptr.dev/) is a Node.js library which provides a high-level API to control Chrome or Firefox over the DevTools Protocol or WebDriver BiDi. Puppeteer runs in the headless (no visible UI) by default but can be configured to run in a visible ("headful") browser.

@@ -19,8 +21,7 @@ npm install @midscene/web --save-dev
 npm install puppeteer ts-node --save-dev
 ```

-Write a simple demo to **extract the main download button of vscode website**.
-Save the following code as `./demo.ts`.
+Write and save the following code as `./demo.ts`.

 ```typescript
 import puppeteer, { Viewport } from 'puppeteer';

@@ -39,20 +40,31 @@ await page.waitForNavigation({
 });
 const page = await launchPage();

-// init MidScene agent
+// 👀 init MidScene agent
 const mid = new PuppeteerAgent(page);

-// perform a search
+// 👀 perform a search
 await mid.aiAction('type "Headphones" in search box, hit Enter');
 await sleep(5000);

-// find the items
+// 👀 find the items
 const items = await mid.aiQuery(
   '{itemTitle: string, price: Number}[], find item in list and corresponding price',
 );
 console.log('headphones in stock', items);
 ```

+:::tip
+You may have noticed that the key lines of code for this only consist of two lines. They are all written in plain language.
+
+```typescript
+await mid.aiAction('type "Headphones" in search box, hit Enter');
+await mid.aiQuery(
+  '{itemTitle: string, price: Number}[], find item in list and corresponding price',
+);
+```
+:::
+
 Using ts-node to run, you will get the data of Headphones on ebay:

 ```bash
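The hunks above show `demo.ts` only in part: `launchPage()` and `sleep()` are referenced but their bodies fall outside the shown context. A sketch of what they plausibly look like, reconstructed from the visible fragments; the viewport values and `waitUntil` choice are assumptions, not from the commit:

```typescript
import puppeteer, { Viewport } from 'puppeteer';

// sketch: open ebay with a desktop viewport, matching the demo's fragments
async function launchPage() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const viewport: Viewport = { width: 1280, height: 768, deviceScaleFactor: 1 };
  await page.setViewport(viewport);
  await page.goto('https://www.ebay.com', { waitUntil: 'networkidle2' });
  return page;
}

// sketch of the sleep(ms) helper the demo uses to wait for search results
function sleep(ms: number) {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}
```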
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+[
+  "prompting-tips",
+  "faq"
+]

apps/site/docs/en/docs/more/faq.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+# FAQ
+
+### Can MidScene smartly plan the actions according to my one-line goal? Like executing "Tweet 'hello world'"
+
+MidScene is an automation assistance SDK with a key feature of action stability — ensuring the same actions are performed in each run. To maintain this stability, we encourage you to provide detailed instructions to help the AI understand each step of your task.
+
+If you require a 'goal-to-task' AI planning tool, you can develop one based on MidScene.
+
+Related Docs:
+* [Tips for Prompting](./prompting-tips.html)
+
+### Limitations
+
+There are some limitations with MidScene. We are still working on them.
+
+1. The interaction types are limited to only tap, type, keyboard press, and scroll.
+2. It's not 100% stable. Even GPT-4o can't return the right answer all the time. Following the [Prompting Tips](./prompting-tips) will help improve stability.
+3. Since we use JavaScript to retrieve items from the page, the elements inside the iframe cannot be accessed.
+
+### Which LLM should I choose ?
+
+MidScene needs a multimodal Large Language Model (LLM) to understand the UI. Currently, we find that OpenAI's GPT-4o performs much better than others.
+
+### About the token cost
+
+Image resolution and element numbers (i.e., a UI context size created by MidScene) form the token bill.
+
+Here are some typical data.
+
+| Task | Resolution | Input tokens | Output tokens | GPT-4o Price |
+|------|------------|--------------|---------------|--------------|
+| Find the download button on the VSCode website | 1920x1080 | 2011 | 54 | $0.011 |
+| Split the Github status page | 1920x1080 | 3609 | 1020 | $0.034 |
+
+> The price data was calculated in June 2024.
+
+### The automation process is running more slowly than it did before
+
+Since MidScene.js invokes AI for each planning and querying operation, the running time may increase by a factor of 3 to 10 compared to traditional Playwright scripts, for instance from 5 seconds to 20 seconds. This is currently inevitable but may improve with advancements in LLMs.
+
+Despite the increased time and cost, MidScene stands out in practical applications due to its unique development experience and easy-to-maintain codebase. We are confident that incorporating automation scripts powered by MidScene will significantly enhance your project’s efficiency, cover many more situations, and boost overall productivity.
+
+In short, it is worth the time and cost.
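The prices in the table line up with GPT-4o's public rates from that period, assuming $5 per million input tokens and $15 per million output tokens (an assumption; the rates are not stated in the doc). A quick check:

```typescript
// assumed June-2024 GPT-4o rates (not from the doc): $5 / 1M input, $15 / 1M output
const INPUT_RATE = 5 / 1_000_000;
const OUTPUT_RATE = 15 / 1_000_000;

const cost = (inputTokens: number, outputTokens: number) =>
  inputTokens * INPUT_RATE + outputTokens * OUTPUT_RATE;

console.log(cost(2011, 54));   // ≈ $0.011 — the VSCode download-button row
console.log(cost(3609, 1020)); // ≈ $0.033 — the Github status-page row ($0.034 after rounding)
```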

apps/site/docs/doc/prompting-tips.md renamed to apps/site/docs/en/docs/more/prompting-tips.md

Lines changed: 7 additions & 7 deletions
@@ -1,28 +1,28 @@
 # Tips for Prompting

-There are certain techniques in prompt engineering that can help improve the understanding of user interfaces.
+The natural language parameter passed to MidScene will be part of the prompt sent to the LLM. There are certain techniques in prompt engineering that can help improve the understanding of user interfaces.

 ### The purpose of optimization is to get a stable response from AI

-Since AI has the nature of heuristic, the purpose of prompt tuning should be to obtain stable responses from the AI model across runs. In most cases, to expect a consistent response from GPT-4 by using a good prompt is entirely feasible.
+Since AI has the nature of heuristic, the purpose of prompt tuning should be to obtain stable responses from the AI model across runs. In most cases, to expect a consistent response from LLM by using a good prompt is entirely feasible.

 ### Detailed description and samples are welcome

 Detailed descriptions and examples are always welcome.

 For example:

-Good ✅: "Find the search box, along with a region switch such as 'domestic', 'international'"
+Good ✅: "Find the search box (it should be along with a region switch, such as 'domestic' or 'international'), type 'headphone', and hit Enter."

-Bad ❌: "Lower Part of page"
+Bad ❌: "Search 'headphone'"

 ### Infer from the UI, not the DOM properties

 All the data sent to the LLM are the screenshots and element coordinates. The DOM is almost invisible to the LLM. So do not expect the LLM infer any information from the DOM (such as `test-id-*` properties).

 Ensure everything you expect from the LLM is visible in the screenshot.

-### LLMs can NOT tell the exact number like coords or hex-color, give it some choices
+### LLMs can NOT tell the exact number like coords or hex-style color, give it some choices

 For example:

@@ -34,11 +34,11 @@ Bad ❌: "[number, number], the [x, y] coords of the main button"

 ### Use visualization tool to debug

-Use the visualization tool to debug and understand how the AI parse the interface. Just upload the log, and view the AI's parse results. You can find [the tool](/visualization/index.html) on the navigation bar on this site.
+Use the visualization tool to debug and understand each step of MidScene. Just upload the log, and view the AI's parse results. You can find [the tool](/visualization/) on the navigation bar on this site.

 ### non-English prompting is acceptable

-Since AI models can understand many languages, feel free to write the prompt in any language you like.
+Since most AI models can understand many languages, feel free to write the prompt in any language you prefer. It usually works even if the prompt is in a language different from the page's language.

 Good ✅: "点击顶部左侧导航栏中的“首页”链接"

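Applied to an agent call, the Good/Bad contrast above might look like this (a sketch; `mid` is a `PuppeteerAgent` as in the API docs, and the prompts are the ones from the diff):

```typescript
// ❌ too brief: the AI must guess which box is meant and what to do afterwards
await mid.aiAction("Search 'headphone'");

// ✅ detailed: names the target, gives a distinguishing hint, spells out each step
await mid.aiAction(
  "Find the search box (it should be along with a region switch, such as " +
    "'domestic' or 'international'), type 'headphone', and hit Enter.",
);
```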

apps/site/docs/doc/usage/API.md renamed to apps/site/docs/en/docs/usage/API.md

Lines changed: 16 additions & 19 deletions
@@ -2,7 +2,7 @@

 ## config AI vendor

-MidScene uses the OpenAI SDK as the default AI service. Currently OpenAI GPT-4o seems to perform best. However, you can customize the caller configuration with environment variables.
+MidScene uses the OpenAI SDK as the default AI service. You can customize the configuration using environment variables.

 There are the main configs, in which `OPENAI_API_KEY` is required.

@@ -17,7 +17,7 @@ Optional:

 ```bash
 # optional, if you want to use a customized endpoint
-export OPENAI_BASE_URL="..."
+export OPENAI_BASE_URL="https://..."

 # optional, if you want to specify a model name other than gpt-4o
 export MIDSCENE_MODEL_NAME='claude-3-opus-20240229';

@@ -28,6 +28,8 @@ export MIDSCENE_OPENAI_INIT_CONFIG_JSON='{"baseURL":"....","defaultHeaders":{"ke

 ## Use in Puppeteer

+To initialize:
+
 ```typescript
 import { PuppeteerAgent } from '@midscene/web/puppeteer';

@@ -54,7 +56,7 @@

 // init MidScene agent, perform actions
 const mid = new PuppeteerAgent(page);
-await mid.ai('type "how much is the ferry ticket in Shanghai" in search box, hit Enter');
+await mid.ai('type "Headphones" in search box, hit Enter');
 ```

 ## Use in Playwright

@@ -63,7 +65,7 @@ await mid.ai('type "how much is the ferry ticket in Shanghai" in search box, hit

 > In the following documentation, you may see functions called with the `mid.` prefix. If you use destructuring in Playwright, like `async ({ ai, aiQuery }) => { /* ... */}`, you can call the functions without this prefix. It's just a matter of syntax.

-### `.aiAction(steps: string)` or `.ai(steps: string)` - perform your actions
+### `.aiAction(steps: string)` or `.ai(steps: string)` - Control the page

 You can use `.aiAction` to perform a series of actions. It accepts a `steps: string` as a parameter, which describes the actions. In the prompt, you should clearly describe the steps. MidScene will take care of the rest.

@@ -79,25 +81,23 @@ await mid.aiAction('Move your mouse over the second item in the task list and cl
 await mid.ai('Click the "completed" status button below the task list');
 ```

-Steps should always be clearly described. A very brief prompt like 'Tweet "Hello World"' will result in unstable performance and a high likelihood of failure.
+Steps should always be clearly and thoroughly described. A very brief prompt like 'Tweet "Hello World"' will result in unstable performance and a high likelihood of failure.

 Under the hood, MidScene will plan the detailed steps by sending your page context and a screenshot to the AI. After that, MidScene will execute the steps one by one. If MidScene deems it impossible to execute, an error will be thrown.

-The main capabilities of MidScene are as follows, which can be seen in the visualization tools:
-1. **Planning**: Determine the steps to accomplish the task
-2. **Find**: Identify the target element using a natural language description
-3. **Action**: Tap, scroll, keyboard input, hover
-4. **Others**: Sleep
-
-Currently, MidScene can't plan steps that include conditions and loops.
+The main capabilities of MidScene are as follows, and your task will be split into these types. You can see them in the visualization tools:

-:::tip Why can't MidScene smartly plan the actions according to my one-line goal?
+1. **Locator**: Identify the target element using a natural language description
+2. **Action**: Tap, scroll, keyboard input, hover
+3. **Others**: Sleep

-MidScene aims to be an automation assistance SDK. Its action stability (i.e., perform the same actions on each run) is a key feature. To achieve this, we encourage you to write down detailed instructions to help the AI better understand each step of your task. If you want a 'goal-to-task' AI planning tool, you can build one on top of MidScene.
+Currently, MidScene can't plan steps that include conditions and loops.

-:::
+Related Docs:
+* [FAQ: Can MidScene smartly plan the actions according to my one-line goal? Like executing "Tweet 'hello world'](../more/faq.html)
+* [Tips for Prompting](../more/prompting-tips.html)

-### `.aiQuery(dataShape: any)` - extract any data from page
+### `.aiQuery(dataDemand: any)` - extract any data from page

 You can extract customized data from the UI. Provided that the multi-modal AI can perform inference, it can return both data directly written on the page and any data based on "understanding". The return value can be any valid primitive type, like String, Number, JSON, Array, etc. Just describe it in the `dataDemand`.

@@ -137,9 +137,6 @@ This method will soon be available in MidScene.

 LangSmith is a platform designed to debug the LLMs. To integrate LangSmith, please follow these steps:

 ```shell
-# install langsmith dependency
-npm i langsmith
-
 # set env variables

 # Flag to enable debug
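Putting the config and `aiQuery` sections of this file together, a sketch of a complete query call. It assumes `OPENAI_API_KEY` is already exported as the config section requires; the search-results URL is illustrative, not from the docs:

```typescript
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

// assumes OPENAI_API_KEY is exported in the environment
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.ebay.com/sch/i.html?_nkw=Headphones'); // illustrative URL

const mid = new PuppeteerAgent(page);

// dataDemand describes the shape you want back; primitives, arrays, and objects all work
const items = await mid.aiQuery(
  '{itemTitle: string, price: Number}[], find item in list and corresponding price',
);
console.log(items);

await browser.close();
```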
