feat: add Chinese docs
yuyutaotao committed Jul 31, 2024
1 parent 7222816 commit 56fbef3
Showing 27 changed files with 628 additions and 74 deletions.
22 changes: 0 additions & 22 deletions apps/site/docs/doc/faq.md

This file was deleted.

4 changes: 2 additions & 2 deletions apps/site/docs/_meta.json → apps/site/docs/en/_meta.json
@@ -1,8 +1,8 @@
[
{
"text": "Docs",
"link": "/doc/getting-started/introduction",
"activeMatch": "/doc/"
"link": "/docs/getting-started/introduction",
"activeMatch": "/docs"
},
{
"text": "Visualization Tool",
@@ -13,6 +13,11 @@
"collapsible": false,
"collapsed": false
},
"prompting-tips.md",
"faq.md"
{
"type": "dir",
"name": "more",
"label": "More",
"collapsible": false,
"collapsed": false
}
]
Expand Up @@ -8,33 +8,31 @@ UI automation can be frustrating, often involving a maze of *#ids*, *data-test-x

Introducing MidScene.js, an innovative SDK designed to bring joy back to programming by simplifying automation tasks.

MidScene.js leverages a multimodal Large Language Model (LLM) to intuitively “understand” your user interface and carry out the necessary actions. Rather than writing and maintaining complex selectors, you can simply describe the interaction steps or expected data formats using a screenshot, and the AI will handle the execution for you.

By employing MidScene.js, you ensure a more streamlined, efficient, and enjoyable approach to UI automation.
MidScene.js leverages a multimodal Large Language Model (LLM) to intuitively “understand” your user interface and carry out the necessary actions. You can simply describe the interaction steps or expected data formats, and the AI will handle the execution for you.

## Features

### Public LLMs are Fine
### Out-of-box LLM

It is fine to use publicly available LLMs such as GPT-4. There is no need for custom training. To experience the out-of-the-box AI-driven automation, token is all you need. 😀
It is fine to use publicly available LLMs such as GPT-4o. There is no need for custom training. To experience the brand new way of writing automation, token is all you need. 😀

### Execute Actions
### Execute Actions by AI

Use `.aiAction` to perform a series of actions by describing the steps.

For example `.aiAction('Enter "Learn JS today" in the task box, then press Enter to create')`.
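
For instance, a minimal sketch of how this could look with the Puppeteer integration shown later in the docs (the to-do app URL is a hypothetical placeholder):

```typescript
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

// open a page with a to-do app (hypothetical URL, replace with your own)
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/todo');

// describe the steps in plain language and let MidScene plan and execute them
const agent = new PuppeteerAgent(page);
await agent.aiAction('Enter "Learn JS today" in the task box, then press Enter to create');
```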

### Extract Data from Page
### Extract Data from Page by AI

`.aiQuery` is the method to extract customized data from the UI.

For example, by calling `const dataB = await agent.aiQuery('string[], task names in the list');`, you will get an array of strings containing the task names.

### Perform Assertions
### Perform Assertions by AI

Call `.aiAssert` to perform assertions on the page.
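
Continuing the sketch above, an assertion might look like this (assuming `.aiAssert` accepts a natural-language statement about the current page):

```typescript
// assert in natural language; the AI judges the statement against the page
await agent.aiAssert('The task "Learn JS today" appears in the task list');
```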

#### Visualization Tool
### Visualization Tool

With our visualization tool, you can easily debug the prompt and AI response. All intermediate data, such as queries, plans, and actions, can be visualized.

@@ -1,6 +1,8 @@
# Quick Start

In this example, we use OpenAI GPT-4o and Puppeteer.js to search headphones on ebay, and then get the result items and prices in JSON format. Remember to prepare an OpenAI key that is eligible for accessing GPT-4o before running.
In this example, we use OpenAI GPT-4o and Puppeteer.js to search headphones on ebay, and then get the result items and prices in JSON format.

Remember to prepare an OpenAI key that is eligible for accessing GPT-4o before running.

> [Puppeteer](https://pptr.dev/) is a Node.js library which provides a high-level API to control Chrome or Firefox over the DevTools Protocol or WebDriver BiDi. Puppeteer runs in headless mode (no visible UI) by default but can be configured to run in a visible ("headful") browser.
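
If you want to watch the demo run in a visible browser window, here is a small sketch using plain Puppeteer options (this is standard Puppeteer, not something specific to MidScene):

```typescript
import puppeteer from 'puppeteer';

// launch a visible ("headful") browser instead of the default headless mode
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://www.ebay.com');
// ...drive the page here, then clean up
await browser.close();
```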
@@ -19,8 +21,7 @@ npm install @midscene/web --save-dev
npm install puppeteer ts-node --save-dev
```

Write a simple demo to **extract the main download button of vscode website**.
Save the following code as `./demo.ts`.
Write and save the following code as `./demo.ts`.

```typescript
import puppeteer, { Viewport } from 'puppeteer';
@@ -39,20 +40,31 @@ await page.waitForNavigation({
});
const page = await launchPage();

// init MidScene agent
// 👀 init MidScene agent
const mid = new PuppeteerAgent(page);

// perform a search
// 👀 perform a search
await mid.aiAction('type "Headphones" in search box, hit Enter');
await sleep(5000);

// find the items
// 👀 find the items
const items = await mid.aiQuery(
'{itemTitle: string, price: Number}[], find item in list and corresponding price',
);
console.log('headphones in stock', items);
```

:::tip
You may have noticed that the key part of this script is just two lines of code, both written in plain language.

```typescript
await mid.aiAction('type "Headphones" in search box, hit Enter');
await mid.aiQuery(
'{itemTitle: string, price: Number}[], find item in list and corresponding price',
);
```
:::

Run it with ts-node, and you will get the headphone data from eBay:

```bash
4 changes: 4 additions & 0 deletions apps/site/docs/en/docs/more/_meta.json
@@ -0,0 +1,4 @@
[
"prompting-tips",
"faq"
]
43 changes: 43 additions & 0 deletions apps/site/docs/en/docs/more/faq.md
@@ -0,0 +1,43 @@
# FAQ

### Can MidScene smartly plan the actions according to my one-line goal? Like executing "Tweet 'hello world'"

MidScene is an automation assistance SDK with a key feature of action stability — ensuring the same actions are performed in each run. To maintain this stability, we encourage you to provide detailed instructions to help the AI understand each step of your task.

If you require a 'goal-to-task' AI planning tool, you can develop one based on MidScene.

Related Docs:
* [Tips for Prompting](./prompting-tips.html)

### Limitations

There are some limitations with MidScene. We are still working on them.

1. The interaction types are limited to tap, type, keyboard press, and scroll.
2. It's not 100% stable. Even GPT-4o can't return the right answer all the time. Following the [Prompting Tips](./prompting-tips) will help improve stability.
3. Since we use JavaScript to retrieve items from the page, the elements inside the iframe cannot be accessed.

### Which LLM should I choose?

MidScene needs a multimodal Large Language Model (LLM) to understand the UI. Currently, we find that OpenAI's GPT-4o performs much better than others.

### About the token cost

The image resolution and the number of elements (i.e., the size of the UI context created by MidScene) determine the token cost.

Here are some typical data.

|Task | Resolution | Input tokens | Output tokens | GPT-4o Price |
|-----|------------|--------------|---------------|----------------|
|Find the download button on the VSCode website| 1920x1080| 2011|54| $0.011|
|Split the Github status page| 1920x1080| 3609|1020| $0.034|

> The price data was calculated in June 2024.
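
As a rough cross-check (assuming GPT-4o's June 2024 list price of roughly $5 per million input tokens and $15 per million output tokens), the first row works out to 2011 × $5 / 1,000,000 + 54 × $15 / 1,000,000 ≈ $0.011, which matches the table.
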
### The automation process is running more slowly than it did before

Since MidScene.js invokes AI for each planning and querying operation, the running time may increase by a factor of 3 to 10 compared to traditional Playwright scripts, for instance from 5 seconds to 20 seconds. This is currently inevitable but may improve with advancements in LLMs.

Despite the increased time and cost, MidScene stands out in practical applications due to its unique development experience and easy-to-maintain codebase. We are confident that incorporating automation scripts powered by MidScene will significantly enhance your project’s efficiency, cover many more situations, and boost overall productivity.

In short, it is worth the time and cost.
@@ -1,28 +1,28 @@
# Tips for Prompting

There are certain techniques in prompt engineering that can help improve the understanding of user interfaces.
The natural language parameter passed to MidScene will be part of the prompt sent to the LLM. There are certain techniques in prompt engineering that can help improve the understanding of user interfaces.

### The purpose of optimization is to get a stable response from AI

Since AI has the nature of heuristic, the purpose of prompt tuning should be to obtain stable responses from the AI model across runs. In most cases, to expect a consistent response from GPT-4 by using a good prompt is entirely feasible.
Since AI is heuristic by nature, the purpose of prompt tuning should be to obtain stable responses from the AI model across runs. In most cases, expecting a consistent response from the LLM by using a good prompt is entirely feasible.

### Detailed description and samples are welcome

Detailed descriptions and examples are always welcome.

For example:

Good ✅: "Find the search box, along with a region switch such as 'domestic', 'international'"
Good ✅: "Find the search box (it should be along with a region switch, such as 'domestic' or 'international'), type 'headphone', and hit Enter."

Bad ❌: "Lower Part of page"
Bad ❌: "Search 'headphone'"

### Infer from the UI, not the DOM properties

All the data sent to the LLM consists of screenshots and element coordinates. The DOM is almost invisible to the LLM, so do not expect the LLM to infer any information from the DOM (such as `test-id-*` properties).

Ensure everything you expect from the LLM is visible in the screenshot.
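
For example (an illustrative pair):

Good ✅: "Click the blue 'Sign in' button in the top-right corner"

Bad ❌: "Click the element whose data-test-id is 'login-btn'"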

### LLMs can NOT tell the exact number like coords or hex-color, give it some choices
### LLMs can NOT tell exact numbers like coordinates or hex-style colors, so give it some choices

For example:

@@ -34,11 +34,11 @@ Bad ❌: "[number, number], the [x, y] coords of the main button"

### Use visualization tool to debug

Use the visualization tool to debug and understand how the AI parse the interface. Just upload the log, and view the AI's parse results. You can find [the tool](/visualization/index.html) on the navigation bar on this site.
Use the visualization tool to debug and understand each step of MidScene. Just upload the log and view the AI's parsing results. You can find [the tool](/visualization/) in the navigation bar of this site.

### Non-English prompting is acceptable

Since AI models can understand many languages, feel free to write the prompt in any language you like.
Since most AI models can understand many languages, feel free to write the prompt in any language you prefer. It usually works even if the prompt is in a language different from the page's language.

Good ✅: "点击顶部左侧导航栏中的“首页”链接"

@@ -2,7 +2,7 @@

## config AI vendor

MidScene uses the OpenAI SDK as the default AI service. Currently OpenAI GPT-4o seems to perform best. However, you can customize the caller configuration with environment variables.
MidScene uses the OpenAI SDK as the default AI service. You can customize the configuration using environment variables.

These are the main configs, among which `OPENAI_API_KEY` is required.
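
For example (the key value below is a placeholder):

```bash
# required
export OPENAI_API_KEY="sk-..."
```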

@@ -17,7 +17,7 @@ Optional:

```bash
# optional, if you want to use a customized endpoint
export OPENAI_BASE_URL="..."
export OPENAI_BASE_URL="https://..."

# optional, if you want to specify a model name other than gpt-4o
export MIDSCENE_MODEL_NAME='claude-3-opus-20240229';
@@ -28,6 +28,8 @@ export MIDSCENE_OPENAI_INIT_CONFIG_JSON='{"baseURL":"....","defaultHeaders":{"ke

## Use in Puppeteer

To initialize:

```typescript
import { PuppeteerAgent } from '@midscene/web/puppeteer';

@@ -54,7 +56,7 @@ await page.waitForNavigation({

// init MidScene agent, perform actions
const mid = new PuppeteerAgent(page);
await mid.ai('type "how much is the ferry ticket in Shanghai" in search box, hit Enter');
await mid.ai('type "Headphones" in search box, hit Enter');
```

## Use in Playwright
@@ -63,7 +65,7 @@ await mid.ai('type "how much is the ferry ticket in Shanghai" in search box, hit

> In the following documentation, you may see functions called with the `mid.` prefix. If you use destructuring in Playwright, like `async ({ ai, aiQuery }) => { /* ... */}`, you can call the functions without this prefix. It's just a matter of syntax.
### `.aiAction(steps: string)` or `.ai(steps: string)` - perform your actions
### `.aiAction(steps: string)` or `.ai(steps: string)` - Control the page

You can use `.aiAction` to perform a series of actions. It accepts a `steps: string` as a parameter, which describes the actions. In the prompt, you should clearly describe the steps. MidScene will take care of the rest.

@@ -79,25 +81,23 @@ await mid.aiAction('Move your mouse over the second item in the task list and cl
await mid.ai('Click the "completed" status button below the task list');
```

Steps should always be clearly described. A very brief prompt like 'Tweet "Hello World"' will result in unstable performance and a high likelihood of failure.
Steps should always be clearly and thoroughly described. A very brief prompt like 'Tweet "Hello World"' will result in unstable performance and a high likelihood of failure.

Under the hood, MidScene will plan the detailed steps by sending your page context and a screenshot to the AI. After that, MidScene will execute the steps one by one. If MidScene deems it impossible to execute, an error will be thrown.

The main capabilities of MidScene are as follows, which can be seen in the visualization tools:
1. **Planning**: Determine the steps to accomplish the task
2. **Find**: Identify the target element using a natural language description
3. **Action**: Tap, scroll, keyboard input, hover
4. **Others**: Sleep

Currently, MidScene can't plan steps that include conditions and loops.
The main capabilities of MidScene are as follows; your task will be split into these types, which you can see in the visualization tool:

:::tip Why can't MidScene smartly plan the actions according to my one-line goal?
1. **Locator**: Identify the target element using a natural language description
2. **Action**: Tap, scroll, keyboard input, hover
3. **Others**: Sleep

MidScene aims to be an automation assistance SDK. Its action stability (i.e., perform the same actions on each run) is a key feature. To achieve this, we encourage you to write down detailed instructions to help the AI better understand each step of your task. If you want a 'goal-to-task' AI planning tool, you can build one on top of MidScene.
Currently, MidScene can't plan steps that include conditions and loops.

:::
Related Docs:
* [FAQ: Can MidScene smartly plan the actions according to my one-line goal? Like executing "Tweet 'hello world'"](../more/faq.html)
* [Tips for Prompting](../more/prompting-tips.html)

### `.aiQuery(dataShape: any)` - extract any data from page
### `.aiQuery(dataDemand: any)` - extract any data from page

You can extract customized data from the UI. Provided that the multimodal AI can perform the inference, it can return both data directly written on the page and any data based on "understanding". The return value can be any legal type, such as String, Number, JSON, or Array. Just describe the expected shape in `dataDemand`.
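
For example, a small sketch reusing the `mid` agent created above (the prompt and field names are illustrative):

```typescript
// describe the expected shape in plain language and get structured data back
const pageInfo = await mid.aiQuery(
  '{title: string, itemCount: number}, the page title and how many items are listed',
);
console.log(pageInfo.title, pageInfo.itemCount);
```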

@@ -137,9 +137,6 @@ This method will soon be available in MidScene.
LangSmith is a platform for debugging LLM applications. To integrate LangSmith, follow these steps:

```shell
# install langsmith dependency
npm i langsmith

# set env variables

# Flag to enable debug
File renamed without changes.
37 changes: 37 additions & 0 deletions apps/site/docs/en/index.md
@@ -0,0 +1,37 @@
---
pageType: home

hero:
name: MidScene.js
text: Joyful Automation by AI
tagline:
actions:
- theme: brand
text: Introduction
link: /docs/getting-started/introduction
- theme: alt
text: Quick Start
link: /docs/getting-started/quick-start
image:
src: /midscene.png
alt: MidScene Logo
features:
- title: Natural Language Interaction
details: Describe the steps, let MidScene plan and execute for you.
icon: 🔍
- title: Understand UI, Answer in JSON
details: Provide a prompt describing the desired data format, and receive a predictable answer in JSON format.
icon: 🤔
- title: Intuitive Assertion
details: Make assertions in natural language. It’s all based on AI understanding.
icon: 🤔
- title: Out-of-box LLM
details: It is fine to use public multimodal LLMs like GPT-4o. There is no need for any custom training.
icon: 🪓
- title: Visualization
details: With our visualization tool, you can easily understand and debug the whole process.
icon: 🎞️
- title: Brand New Experience!
details: Experience a whole new world of automation development. Enjoy!
icon: 🔥
---
File renamed without changes.
12 changes: 12 additions & 0 deletions apps/site/docs/zh/_meta.json
@@ -0,0 +1,12 @@
[
{
"text": "文档",
"link": "/docs/getting-started/introduction",
"activeMatch": "/docs"
},
{
"text": "可视化工具",
"link": "/visualization/",
"activeMatch": "/visualization/"
}
]
