feat: update docs, and some tiny bugfix

yuyutaotao committed Jul 29, 2024
1 parent 4ec3607 commit 5417632

Showing 25 changed files with 253 additions and 663 deletions.
7 changes: 0 additions & 7 deletions apps/site/docs/doc/_meta.json
@@ -14,12 +14,5 @@
"collapsed": false
},
"prompting-tips.md",
{
"type": "dir",
"name": "integration",
"label": "Integration",
"collapsible": false,
"collapsed": false
},
"faq.md"
]
19 changes: 5 additions & 14 deletions apps/site/docs/doc/faq.md
@@ -1,4 +1,4 @@
# Q & A
# FAQ

#### About the token cost

@@ -13,19 +13,10 @@ Here are some typical data.

The above price data was calculated in June 2024.

#### How can I do assertions with MidScene ?
#### The automation process is running more slowly than it did before

MidScene.js is an SDK for UI understanding, rather than a testing framework. You should integrate it with a familiar testing framework.
Since MidScene.js will invoke the AI each time it performs planning and querying, the running time may increase by a factor of 5 to 10. This is inevitable for now, but it may improve with advancements in LLMs.

Here are some feasible ways:
* Using Playwright, see [Integrate with Playwright](/doc/integration/playwright)
* Using [Vitest](https://vitest.dev/) + [puppeteer](https://pptr.dev/), see [Integrate with Puppeteer](/doc/integration/puppeteer)
Despite the increased time and cost, MidScene stands out in practical applications thanks to its unique development experience and easy-to-maintain codebase. We are confident that automation scripts powered by MidScene will streamline complex tasks and boost productivity, freeing your team to focus on more strategic work and shortening development cycles.


#### What's the "element" in MidScene ?

An element in MidScene is an object defined by MidScene. Currently, it contains only text elements, primarily consisting of text content and coordinates. It is different from elements in the browser, so you cannot call browser methods on it.

#### Failed to interact with web page ?

The coordinates returned from MidScene only represent their positions at the time they are collected. You should check the latest UI style when interacting with the UI.
In short, it is worth the time and cost.
198 changes: 37 additions & 161 deletions apps/site/docs/doc/getting-started/introduction.mdx
@@ -4,168 +4,44 @@
<source src="/MidScene_L.mp4" type="video/mp4" />
</video>

Writing UI automation is often an annoying task. Understanding the characteristics of the DOM while writing code is not easy. Worse, writing tests for an existing web page that lacks predefined `#id` or `data-test-xxx` properties can make your selectors unmanageable and the entire test file impossible to maintain.

### Using high-level understanding of UI to reshape the automation

With MidScene.js, we harness the power of AI’s multi-modality to turn your UI into consistent and well-organized outputs. All you have to do is describe the expected data shape from a screenshot; the AI will do the magical reasoning for you, and TypeScript will ensure a first-class development experience. There won't be any `.selector`s in your script any longer.

Finally, the joy of programming will come back!


### Flow Chart

Here is a flowchart illustrating the main process of MidScene.

![](/flow.png)

### Features

#### Locate - Find by natural language

Using GPT-4o, you can now locate elements by natural language, just as if someone were viewing your page. DOM selectors are no longer necessary.

```typescript
const downloadBtns = await insight.locate('download buttons on the page', {multi: true});
console.log(downloadBtns);
```

The result will look like this:
```typescript
[
  { content: 'Download', rect: { left: 1451, top: 78, width: 74, height: 22 } },
  { content: 'Download Mac Universal', rect: { left: 432, top: 328, width: 232, height: 65 } }
]
```

#### Understand - And answer in JSON

Besides basic locator and segmentation, MidScene can help you to understand the UI.
By providing the AI with the data shape you want, you will receive a predictable answer, both in terms of data structure and value. You may have never thought about UI automation in this way.

Use `query` to achieve this.

For example, if you want to understand some properties while locating elements:

```typescript
const downloadBtns = await insight.locate(query('download buttons on the page', {
  textsOnButton: 'string',
  backgroundColor: 'string, color of text, one of blue / red / yellow / green / white / black / others',
  type: '`major` or `minor`. The Bigger one is major and the others are minor',
  platform: 'string. Say `unknown` when it is not clear on the element',
}), {multi: true});
```

The result will look like this:
```typescript
[
  {
    content: 'Download Mac Universal',
    rect: { left: 432, top: 328, width: 232, height: 65 },
    textsOnButton: 'Download Mac Universal', // <------ the data mentioned in the prompt
    backgroundColor: 'blue',
    type: 'major',
    platform: 'Mac'
  },
  {
    content: 'Download',
    type: 'minor',
    platform: 'unknown'
    // ...
  }
]
```

You can also extract data from a section by using `query`.
For example, if you want to get the service status from the github status page:

```typescript
const result = await insightStatus.segment({
  'services': query( // They are all the prompts being sent to the AI
    'a list with service names and status',
    { items: '{service: "service name as string", status: "string, like normal"}[]' },
  ),
});
```

Here is the return value:
```typescript
[
  { service: 'Git Operations', status: 'Normal' },
  { service: 'API Requests', status: 'Normal' },
  { service: 'Webhooks', status: 'Normal' },
  // ...
]
```

#### Typed - Out-of-box TypeScript definitions

The custom data shape you defined can have types assigned automatically. Simply use dot notation to access them.

Let's take the `result` above as a sample. TypeScript will give you the basic type hint:
```typescript
const result: {
  services: UISection<{
    items: unknown;
  }>;
}
```
By providing a generic type parameter, the return value can be explicitly specified.
```typescript
const result = await insight.segment({
  'services': query<{items: {service: string, status: string}[]}>(
    'a list with service names and status',
    { items: '{service: "service name as string", status: "string, like normal"}[]' },
  ),
});

const { items } = result.services;
```

TypeScript will give you the following definition:

```typescript
const items: {
  service: string;
  status: string;
}[]
```

#### Segment - Customized UI splitting

Describe the sections inside a page, and let AI find them for you.

```typescript
// The param map is also a prompt being sent to the AI model.
const manySections = await insight.segment({
  cookiePrompt: 'cookie prompt with its action buttons on the top of the page',
  topRightWidgets: 'widgets on the top right corner',
});
```

The data is as follows.

```typescript
{
  cookiePrompt: {
    texts: [ [Object], [Object], [Object], [Object], [Object], [Object] ],
    rect: { left: 144, top: 8, width: 1655, height: 49 }
  },
  topRightWidgets: {
    texts: [ [Object], [Object] ],
    rect: { left: 1241, top: 64, width: 284, height: 50 }
  },
}
```

#### Online Visualization - Help to visualize your prompt

With our visualization tool, you can easily debug the prompt and AI response.

All intermediate data, such as the query, coordinates, split reason, and custom data, can be visualized.
UI automation can be quite frustrating. Scripts are always full of *#id*, *data-test-xxx* and *.selectors* that are hard to maintain, and it only gets worse when the page is refactored.

Introducing MidScene.js, an SDK that aims to restore the joy of programming by automating tasks.

With MidScene.js, we harness the power of multimodal LLMs to make your UI outputs consistent and well-organized. All you need to do is describe the interaction steps or the expected data format based on a screenshot, and the AI will execute these tasks for you. Finally, it will bring back the joy of programming!

## Features

### Public LLMs are Fine

It is fine to use publicly available LLMs such as GPT-4. There is no need for custom training. To experience the out-of-the-box AI-driven automation, a token is all you need. 😀

### Execute Actions

Use `.aiAction` to perform a series of actions by describing the steps.

For example, `.aiAction('Enter "Learn JS today" in the task box, then press Enter to create')`.
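
A minimal sketch of what this could look like in a full script (assuming a `PuppeteerAgent` wired to an existing Puppeteer page, as in the Quick Start):

```typescript
import { PuppeteerAgent } from '@midscene/web/puppeteer';

// `page` is an already-initialized Puppeteer page (see the Quick Start)
const agent = new PuppeteerAgent(page);

// describe the steps in natural language; the AI plans and performs them
await agent.aiAction('Enter "Learn JS today" in the task box, then press Enter to create');
```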

### Extract Data from Page

`.aiQuery` is the method to extract customized data from the UI.

For example, by calling `const dataB = await agent.aiQuery('string[], task names in the list');`, you will get an array of strings containing the task names.
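
Reusing the `agent` from the sketch above, a hedged example (the exact values naturally depend on the page):

```typescript
// describe the expected data shape in plain language;
// the AI answers with JSON matching that description
const dataB = await agent.aiQuery('string[], task names in the list');
console.log(dataB); // e.g. [ 'Learn JS today', ... ]
```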

### Perform Assertions

Call `.aiAssert` to perform assertions on the page.
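
As a sketch, assuming `.aiAssert` accepts a natural-language condition and its promise rejects when the condition does not hold:

```typescript
// assert a condition described in natural language;
// we assume a failed assertion rejects the promise
await agent.aiAssert('the task list contains "Learn JS today"');
```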

### Visualization Tool

With our visualization tool, you can easily debug the prompt and AI response. All intermediate data, such as queries, plans, and actions, can be visualized.

You may open the [Online Visualization Tool](/visualization/index.html) to see the showcase.

![](/Visualizer.gif)

## Flow Chart

Here is a flowchart illustrating the core process of MidScene.

![](/flow.png)
62 changes: 27 additions & 35 deletions apps/site/docs/doc/getting-started/quick-start.md
@@ -1,19 +1,20 @@
# Quick Start

Currently we use OpenAI GPT-4o as the default engine. So prepare an OpenAI key that is eligible for accessing GPT-4o.
In this example, we use OpenAI GPT-4o and Puppeteer.js to drive a simple browser automation demo. Remember to prepare an OpenAI key that is eligible for accessing GPT-4o before running.

> [Puppeteer](https://pptr.dev/) is a Node.js library which provides a high-level API to control Chrome or Firefox over the DevTools Protocol or WebDriver BiDi. Puppeteer runs in headless mode (no visible UI) by default, but can be configured to run in a visible ("headful") browser.
Configure the API key:

```bash
# replace by your own
export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"

# optional, if you use a proxy
# export OPENAI_BASE_URL="..."
```

Install

```bash
npm install midscene --save-dev
npm install @midscene/web --save-dev
# for demo use
npm install puppeteer ts-node --save-dev
```
@@ -22,35 +23,26 @@ Write a simple demo to **extract the main download button of vscode website**.
Save the following code as `./demo.ts`.

```typescript
import puppeteer from 'puppeteer';
import Insight, { query } from 'midscene';

Promise.resolve(
  (async () => {
    // launch vscode website
    const browser = await puppeteer.launch();
    const page = (await browser.pages())[0];
    await page.setViewport({ width: 1920, height: 1080 });
    await page.goto('https://code.visualstudio.com/');

    // wait for 5s
    console.log('Wait for 5 seconds. After that, the demo will begin.');
    await new Promise((resolve) => setTimeout(resolve, 5 * 1000));

    // ⭐ find the main download button and its backgroundColor ⭐
    const insight = await Insight.fromPuppeteerBrowser(browser);
    const downloadBtn = await insight.locate(
      query('main download button on the page', {
        textsOnButton: 'string',
        backgroundColor: 'string, color of text, one of blue / red / yellow / green / white / black / others',
      }),
    );
    console.log(`backgroundColor of main download button is: `, downloadBtn!.backgroundColor);
    console.log(`text on the button is: `, downloadBtn!.textsOnButton);

    // clean up
    await browser.close();
  })(),
);
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

// init Puppeteer page
const browser = await puppeteer.launch({
  headless: false, // here we use headed mode to help debug
});

const page = await browser.newPage();
await page.goto('https://www.bing.com');
await page.waitForNavigation({
  timeout: 20 * 1000,
  waitUntil: 'networkidle0',
});

// init MidScene agent
const agent = new PuppeteerAgent(page);
await agent.aiAction('type "how much is the ferry ticket in Shanghai" in search box, hit Enter');

// clean up
await browser.close();
```

Using ts-node to run:
@@ -62,5 +54,5 @@ npx ts-node demo.ts
# it should print '... is blue'
```

After running, MidScene will generate a log dump, which is placed in `./midscene_run/latest.insight.json` by default. Then put this file into [Visualization Tool](/visualization/), and you will have a clearer understanding of the process.
After running, MidScene will generate a log dump, which is placed in `./midscene_run/latest.web-dump.json` by default. Then put this file into [Visualization Tool](/visualization/), and you will have a clearer understanding of the process.

6 changes: 0 additions & 6 deletions apps/site/docs/doc/integration/_meta.json

This file was deleted.

33 changes: 0 additions & 33 deletions apps/site/docs/doc/integration/others.md

This file was deleted.

