**apps/site/docs/en/caching.md** (+1 −1)

````diff
@@ -1,4 +1,4 @@
-# Cache
+# Caching
 
 Midscene.js provides AI caching features to improve the stability and speed of the entire AI execution process. The cache mainly refers to caching how AI recognizes page elements. Cached AI query results are used if page elements haven't changed.
````
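To make the caching behavior above concrete, here is a minimal sketch. It assumes the `MIDSCENE_CACHE` environment flag and the agent's `cacheId` option described in the caching doc itself; verify both names against that page.

```ts
// Minimal caching sketch (assumes MIDSCENE_CACHE and cacheId as documented
// in apps/site/docs/en/caching.md; check that page for the exact names).
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

process.env.MIDSCENE_CACHE = 'true'; // persist AI element-recognition results

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.ebay.com');

// cacheId keys the cache file; a rerun reuses cached matches
// as long as the page elements haven't changed
const agent = new PuppeteerAgent(page, { cacheId: 'ebay-search' });
await agent.ai('type "Headphones" in the search box, then press Enter');

await browser.close();
```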
**apps/site/docs/en/choose-a-model.md** (+37 −4)

````diff
@@ -10,10 +10,27 @@ Midscene.js uses general-purpose large language models (LLMs, like `gpt-4o`) as
 You can also use open-source models like `UI-TARS` to improve the performance and data privacy.
 :::
 
+## Comparison between general-purpose LLMs and dedicated model
+
+This is a table for comparison between general-purpose LLMs and dedicated model (like `UI-TARS`). We will talk about them in detail later.
+
+| | General-purpose LLMs (default) | Dedicated model like `UI-TARS` |
+| --- | --- | --- |
+| **What it is** | for general-purpose tasks | dedicated for UI automation |
+| **How to get started** | easy, just to get an API key | a bit complex, you need to deploy it on your own server |
+| **Performance** | 3-10x slower compared to pure JavaScript automation | could be acceptable with proper deployment |
+| **Who will get the page data** | the model provider | your own server |
+| **Cost** | more expensive, usually pay for the token | less expensive, pay for the server |
+| **Prompting** | prefer step-by-step instructions | still prefer step-by-step instructions, but performs better in uncertainty situations |
+
 ## Choose a general-purpose LLM
 
 Midscene uses OpenAI `gpt-4o` as the default model, since this model performs the best among all general-purpose LLMs at this moment.
 
+To use the official `gpt-4o` from OpenAI, you can simply set the `OPENAI_API_KEY` in the environment variables. Refer to [Config Model and Provider](./model-provider) for more details.
+
+### Choose a model other than `gpt-4o`
+
 If you want to use other models, please follow these steps:
 
 1. A multimodal model is required, which means it must support image input.
````
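Since the added text configures everything through environment variables, a small sketch may help. `OPENAI_API_KEY` is confirmed by the doc above; the other variable names follow the [Config Model and Provider](./model-provider) page and should be treated as assumptions to verify there.

```ts
// Hedged sketch: configure the model via environment variables before any
// Midscene agent is created (e.g. in a test setup file).
process.env.OPENAI_API_KEY = 'sk-...'; // official gpt-4o, the default model

// To try another known supported multimodal model instead (assumed names,
// see the Config Model and Provider doc):
// process.env.OPENAI_BASE_URL = 'https://your-provider.example.com/v1';
// process.env.MIDSCENE_MODEL_NAME = 'qwen-vl-max-latest';
```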
````diff
@@ -22,7 +39,7 @@ If you want to use other models, please follow these steps:
 1. If you find it not working well after changing the model, you can try using some short and clear prompt, or roll back to the previous model. See more details in [Prompting Tips](./prompting-tips).
 1. Remember to follow the terms of use of each model and provider.
 
-### Known Supported General-Purpose Models
+### Known supported general-purpose models
 
 Besides `gpt-4o`, the known supported models are:
 
````
````diff
@@ -31,17 +48,33 @@ Besides `gpt-4o`, the known supported models are:
 - `qwen-vl-max-latest`
 - `doubao-vision-pro-32k`
 
+### About the token cost
+
+Image resolution and element numbers (i.e., a UI context size created by Midscene) will affect the token bill.
+
+Here are some typical data with gpt-4o-0806 without prompt caching.
+
+| Task | Resolution | Prompt tokens / price | Completion tokens / price | Total price |
+| --- | --- | --- | --- | --- |
+| Plan and perform a search on eBay homepage | 1280x800 | 6005 / $0.0150125 | 146 / $0.00146 | $0.0164725 |
+| Query the information about the item in the search results | 1280x800 | 9107 / $0.0227675 | 122 / $0.00122 | $0.0239875 |
+
+> The price data was calculated in Nov 2024.
+
 ## Choose `UI-TARS` (an open-source model dedicated for UI automation)
 
 UI-TARS is an end-to-end GUI agent model based on VLM architecture. It solely perceives screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations), achieving state-of-the-art performance on 10+ GUI benchmarks.
 
 UI-TARS is an open-source model, and is available in different sizes. You can deploy it on your own server, and it will dramatically improve the performance and data privacy.
 
-For more details about UI-TARS, see [Github - UI-TARS](https://github.com/bytedance/ui-tars), [🤗 HuggingFace - UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT).
+For more details about UI-TARS, see:
+* [Github - UI-TARS](https://github.com/bytedance/ui-tars)
+* [🤗 HuggingFace - UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)
+* [UI-TARS - Model Deployment Guide](https://juniper-switch-f10.notion.site/UI-TARS-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)
 
 ### What you will have after using UI-TARS
 
-- **Speed**: a private-deployed UI-TARS model can be 5x faster than a general-purpose LLM. Each step of `.ai` call can be processed in 1-2 seconds.
+- **Speed**: a private-deployed UI-TARS model can be 5x faster than a general-purpose LLM. Each step of `.ai` call can be processed in 1-2 seconds on a high-performance GPU server.
 - **Data privacy**: you can deploy it on your own server and your data will no longer be sent to the cloud.
 - **More stable with short prompt**: UI-TARS is optimized for UI automation and is capable of handling more complex tasks with target-driven prompts. You can use it with a shorter prompt (although it is not recommended), and it performs even better when compared to a general-purpose LLM.
````
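As a quick sanity check on the added table (an inference from the numbers, not a claim made by the diff): the figures match gpt-4o-2024-08-06's published Nov 2024 rates of $2.50 per 1M prompt tokens and $10 per 1M completion tokens, e.g. 6005 × $2.50/1M = $0.0150125 for the prompt plus 146 × $10/1M = $0.00146 for the completion, giving the $0.0164725 total.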
````diff
@@ -76,4 +109,4 @@ Once you feel uncomfortable with the speed, the cost, the accuracy, or the data
 ## More
 
 * [Config Model and Provider](./model-provider)
-* [UI-TARS on Github](https://github.com/bytedance/ui-tars)
+* [UI-TARS on Github](https://github.com/bytedance/ui-tars)
````
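For the UI-TARS path above, the wiring is again environment variables. A hedged sketch, assuming the variable names from the [Config Model and Provider](./model-provider) doc; the endpoint URL and model name are placeholders for your own deployment:

```ts
// Point Midscene at a self-hosted UI-TARS endpoint. MIDSCENE_MODEL_NAME,
// OPENAI_BASE_URL and MIDSCENE_USE_VLM_UI_TARS are assumed from the
// model-provider doc; verify them there before relying on this.
process.env.OPENAI_BASE_URL = 'http://your-gpu-server:8000/v1'; // your deployment
process.env.OPENAI_API_KEY = 'EMPTY'; // many self-hosted endpoints ignore the key
process.env.MIDSCENE_MODEL_NAME = 'ui-tars-7b-sft';
process.env.MIDSCENE_USE_VLM_UI_TARS = '1'; // switch prompting to UI-TARS mode
```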
**apps/site/docs/en/faq.md** (+10 −19)

````diff
@@ -10,42 +10,33 @@ Related Docs: [Prompting Tips](./prompting-tips)
 There are some limitations with Midscene. We are still working on them.
 
-1. The interaction types are limited to only tap, type, keyboard press, and scroll.
+1. The interaction types are limited to only tap, drag, type, keyboard press, and scroll.
 2. LLM is not 100% stable. Even GPT-4o can't return the right answer all the time. Following the [Prompting Tips](./prompting-tips) will help improve stability.
 3. Since we use JavaScript to retrieve elements from the page, the elements inside the cross-origin iframe cannot be accessed.
 4. We cannot access the native elements of Chrome, like the right-click context menu or file upload dialog.
 5. Do not use Midscene to bypass CAPTCHA. Some LLM services are set to decline requests that involve CAPTCHA-solving (e.g., OpenAI), while the DOM of some CAPTCHA pages is not accessible by regular web scraping methods. Therefore, using Midscene to bypass CAPTCHA is not a reliable method.
 
 ## Can I use a model other than `gpt-4o`?
 
-Yes. You can [customize model and provider](./model-provider) if needed.
+Of course. You can [choose a model](./choose-a-model) according to your needs.
 
-## About the token cost
-
-Image resolution and element numbers (i.e., a UI context size created by Midscene) will affect the token bill.
-
-Here are some typical data with gpt-4o-0806 without prompt caching.
-
-| Task | Resolution | Prompt tokens / price | Completion tokens / price | Total price |
-| --- | --- | --- | --- | --- |
-| Plan and perform a search on eBay homepage | 1280x800 | 6005 / $0.0150125 | 146 / $0.00146 | $0.0164725 |
-| Query the information about the item in the search results | 1280x800 | 9107 / $0.0227675 | 122 / $0.00122 | $0.0239875 |
-
-> The price data was calculated in Nov 2024.
-
-## What data is sent to LLM ?
+## What data is sent to AI model?
 
 Currently, the contents are:
 1. the key information extracted from the DOM, such as text content, class name, tag name, coordinates;
 2. a screenshot of the page.
 
+If you are concerned about the data privacy, please refer to [Data Privacy](./data-privacy).
+
 ## The automation process is running more slowly than the traditional one
 
-Since Midscene.js invokes AI for each planning and querying operation, the running time may increase by a factor of 3 to 10 compared to traditional Playwright scripts, for instance from 5 seconds to 20 seconds. This is currently inevitable but may improve with advancements in LLMs.
+When using general-purpose LLM in Midscene.js, the running time may increase by a factor of 3 to 10 compared to traditional Playwright scripts, for instance from 5 seconds to 20 seconds. To make the result more stable, the token and time cost is inevitable.
+
 
-Despite the increased time and cost, Midscene stands out in practical applications due to its unique development experience and easy-to-maintain codebase. We are confident that incorporating automation scripts powered by Midscene will significantly enhance your project’s efficiency, cover many more situations, and boost overall productivity.
 
-In short, it is worth the time and cost.
+There are two ways to improve the running time:
+1. Use a dedicated model, like UI-TARS. This is the recommended way. Read more about it in [Choose a model](./choose-a-model).
+2. Use caching to reduce the token cost. Read more about it in [Caching](./caching).
 
 ## The webpage continues to flash when running in headed mode
````
**apps/site/docs/en/prompting-tips.md** (+3 −1)

````diff
@@ -1,6 +1,6 @@
 # Prompting Tips
 
-The natural language parameter passed to Midscene will be part of the prompt sent to the LLM. There are certain techniques in prompt engineering that can help improve the understanding of user interfaces.
+The natural language parameter passed to Midscene will be part of the prompt sent to the AI model. There are certain techniques in prompt engineering that can help improve the understanding of user interfaces.
 
 ## The purpose of optimization is to get a stable response from AI
 
@@ -51,6 +51,8 @@ To launch the local Playground server:
 npx --yes @midscene/web
 ```
 
+
+
 ## Infer or assert from the interface, not the DOM properties or browser status
 
 All the data sent to the LLM is in the form of screenshots and element coordinates. The DOM and the browser instance are almost invisible to the LLM. Therefore, ensure everything you expect is visible on the screen.
````
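The last point lends itself to a short example. A hedged sketch of asserting on what is visible rather than on DOM or browser state, using the `PuppeteerAgent` and `aiAssert` APIs shown in the Midscene docs (check the API reference for exact signatures):

```ts
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.ebay.com');

const agent = new PuppeteerAgent(page);

// Good: the claim is verifiable from the screenshot alone.
await agent.aiAssert('there is a search box at the top of the page');

// Avoid: DOM properties or browser status invisible in a screenshot,
// e.g. "the search input has the class s-input" or
// "localStorage contains a session token".

await browser.close();
```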