Commit e2d6510

docs: add lightpanda browser automation guide

Author: Cowork 3P
1 parent 8dd2610 commit e2d6510

21 files changed

Lines changed: 648 additions & 34 deletions

astro.config.mjs

Lines changed: 8 additions & 0 deletions
@@ -292,6 +292,7 @@ export default defineConfig({
 {
   label: 'Output Schema',
   slug: 'developer-guide/worker-definition/output-schema',
+  badge: { text: 'Required', variant: 'danger' },
   translations: {
     'zh-CN': '输出配置',
   },
@@ -355,6 +356,13 @@ export default defineConfig({
     'zh-CN': 'Playwright',
   },
 },
+{
+  label: 'Lightpanda',
+  slug: 'developer-guide/worker-definition/browser-automation/lightpanda',
+  translations: {
+    'zh-CN': 'Lightpanda',
+  },
+},
 {
   label: 'Puppeteer',
   slug: 'developer-guide/worker-definition/browser-automation/puppeteer',

src/content/docs/developer-guide/builds-and-runs.md

Lines changed: 2 additions & 1 deletion
@@ -22,7 +22,7 @@ Your Code (ZIP) → Auto Dependency Install → Script Runtime → Remote Browse
 CoreClaw eliminates the Build step through **platform-level hosting**:

 - **Runtime is pre-provisioned**: Python/Node.js runtimes and base dependencies are already installed in the shared environment. You don't need to build or configure a runtime image.
-- **Browser is remotely hosted**: No need to package a browser into your project. Connect to the remote fingerprint browser pool via the `ChromeWs` environment variable (CDP/WebSocket). See [Browser Fingerprinting](/developer-guide/worker-definition/platform-features/browser-fingerprinting/) for details.
+- **Browser is remotely hosted**: No need to package a browser into your project. Connect to the remote fingerprint browser pool via the `ChromeWs` environment variable, or to Lightpanda via `LightpandaDomain` (CDP/WebSocket). See [Browser Fingerprinting](/developer-guide/worker-definition/platform-features/browser-fingerprinting/) for details.
 - **Dependencies install automatically**: The platform reads your `requirements.txt` or `package.json` and installs dependencies before execution. No manual image building required.
 - **Network is sandboxed**: The runtime is an isolated network sandbox. HTTP request scripts must use the built-in SOCKS5 proxy (via `PROXY_AUTH` environment variable). See [Proxy Support](/developer-guide/worker-definition/platform-features/proxy-support/) for details.

@@ -64,6 +64,7 @@ Each run executes in a lightweight, process-isolated environment with:
 - **Environment variables**:
   - `PROXY_AUTH`: SOCKS5 proxy credentials (username:password) for HTTP requests
   - `ChromeWs`: WebSocket address for connecting to the remote fingerprint browser
+  - `LightpandaDomain`: CDP domain or endpoint for connecting to Lightpanda
 - **SDK communication**: gRPC channel to the CoreClaw platform (127.0.0.1:20086)
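
As an illustration of the environment variables listed in this hunk, the following minimal sketch reads them at Worker startup. The `RuntimeEnv` class and `load_runtime_env` function are hypothetical names introduced here, not part of the CoreClaw SDK, and treating `PROXY_AUTH` as mandatory is an assumption that may not hold for every Worker.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RuntimeEnv:
    """Hypothetical container for the platform-injected variables."""
    proxy_username: str
    proxy_password: str
    chrome_ws: Optional[str]
    lightpanda_domain: Optional[str]


def load_runtime_env(env: Mapping[str, str] = os.environ) -> RuntimeEnv:
    proxy_auth = env.get("PROXY_AUTH", "")
    # Split on the first colon only, so a password may itself contain ":".
    username, separator, password = proxy_auth.partition(":")
    if not separator or not username:
        # Assumption: this sketch requires PROXY_AUTH; relax if not needed.
        raise RuntimeError("PROXY_AUTH must be set as username:password")
    return RuntimeEnv(
        proxy_username=username,
        proxy_password=password,
        # Empty strings are treated the same as missing variables.
        chrome_ws=env.get("ChromeWs") or None,
        lightpanda_domain=env.get("LightpandaDomain") or None,
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to exercise locally, where the platform does not inject these variables.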

 ## Run Management

src/content/docs/developer-guide/deployment.md

Lines changed: 43 additions & 4 deletions
@@ -11,7 +11,36 @@ Deploy your Worker to the CoreClaw platform.

 ## Upload Requirements

-Currently, **only ZIP archive files** are supported for uploading scripts. Please ensure the file format is correct before uploading.
+CoreClaw supports two ways to upload your Worker scripts:
+
+### Method 1: ZIP Archive Upload
+
+Upload your Worker as a ZIP archive file. This is the quickest way to get started.
+
+1. Compress all project files into a ZIP archive
+2. Ensure the runtime entry is at the root of the ZIP: `main.py` for Python, `main.js` for Node.js, and the compiled Linux amd64 executable `main` for Go
+3. Upload the ZIP archive to the platform
+
+### Method 2: GitHub Import
+
+Import your Worker directly from a GitHub repository. This method supports **version management**, allowing you to track and manage different versions of your Worker.
+
+**Supported URL formats:**
+
+- **HTTPS**: `https://github.com/username/repository.git`
+- **SSH**: `git@github.com:username/repository.git`
+
+**Version management:**
+
+When importing from GitHub, you can specify which version of your code to deploy:
+
+- **Branch**: Deploy the latest code from a specific branch (e.g., `main`, `develop`)
+- **Tag**: Deploy a specific tagged release (e.g., `v1.0.0`)
+- **Commit**: Deploy an exact commit by its SHA hash
+
+This allows you to maintain multiple versions, roll back to previous releases, and manage your Worker's lifecycle effectively.
+
+---

 All script files **must strictly follow platform specifications**.

@@ -43,7 +72,7 @@ Ensure your project includes the required files before packaging:

 **Go:**
 ```
-├── main.go # Entry file
+├── main.go # Source entry file
 ├── go.mod # Dependencies
 ├── go.sum # Dependency checksums
 ├── input_schema.json # Input configuration
@@ -54,11 +83,21 @@ Ensure your project includes the required files before packaging:
 └── sdk_grpc.pb.go
 ```

+For Go Workers, keep the three layers distinct:
+
+- **Source project**: contains `main.go`, `go.mod`, `go.sum`, `GoSdk/`, `input_schema.json`, and `output_schema.json`.
+- **Uploaded ZIP**: must contain a Linux amd64 executable named `main` at the ZIP root. The source entry is `main.go`; the upload/runtime entry is the compiled `main`.
+- **Platform runtime**: does not guarantee that source files such as `main.go`, `go.mod`, `go.sum`, or `GoSdk/` still exist in the current working directory. Only rely on files deliberately included for runtime use.
+
 ### Packaging

 1. Compress all project files into a ZIP archive
-2. Ensure the entry file (`main.py` / `main.js` / `main.go`) is at the root of the ZIP
-3. Upload the ZIP archive to the platform
+2. Ensure the runtime entry (`main.py` / `main.js` / compiled Go executable `main`) is at the root of the ZIP
+3. Upload the ZIP archive to the platform, or push to GitHub and import via repository URL
+
+:::caution[Windows packaging]
+Some ordinary Windows compression tools can drop the Linux executable bit from the Go `main` binary. If that bit is lost, the Worker may fail before user code starts, sometimes without Worker logs. For Go ZIP uploads, prefer creating the final archive in Linux or WSL after running `chmod +x main`.
+:::
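
One way to avoid depending on a particular compression tool is to write the archive with a script that sets the executable bit explicitly. The sketch below uses Python's standard `zipfile` module; `build_worker_zip` is a hypothetical helper name, and it assumes the platform's unpacker honors Unix permission bits stored in the ZIP, which the caution above implies but does not spell out.

```python
import zipfile
from pathlib import Path

UNIX = 3                     # create_system value meaning "made on Unix"
EXECUTABLE_MODE = 0o100755   # regular file with rwxr-xr-x permissions


def build_worker_zip(project_dir: str, zip_path: str, executable: str = "main") -> None:
    """Zip project_dir so that `executable` sits at the root with +x set.

    zip_path should live outside project_dir, otherwise the archive
    would try to include itself.
    """
    root = Path(project_dir)
    with zipfile.ZipFile(zip_path, "w") as archive:
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            arcname = path.relative_to(root).as_posix()
            info = zipfile.ZipInfo.from_file(path, arcname)
            info.compress_type = zipfile.ZIP_DEFLATED
            if arcname == executable:
                # Store Unix mode bits even when this script runs on Windows.
                info.create_system = UNIX
                info.external_attr = EXECUTABLE_MODE << 16
            archive.writestr(info, path.read_bytes())
```

This assumes the Go binary was already cross-compiled for Linux amd64, for example with `GOOS=linux GOARCH=amd64 go build -o main`. The script only controls how the file is stored in the archive, not how it was built.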

 ---

src/content/docs/developer-guide/develop-worker/quick-start.md

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ Edit `main.py` to implement your scraping logic.
 CoreClaw's runtime is an **isolated network sandbox**: your script cannot access the internet directly. You must route outbound traffic through the platform's built-in proxy:

 - **HTTP request scripts**: Proxy configuration is **required**. Read the proxy address from the `PROXY_AUTH` environment variable and configure your HTTP client to use the SOCKS5 proxy. See [Proxy Support](/developer-guide/worker-definition/platform-features/proxy-support/) for details.
-- **Browser automation scripts**: Connect to the remote browser via the `ChromeWs` environment variable (WebSocket address). Proxy is handled automatically by the browser, so no manual proxy configuration is needed. See [Browser Fingerprinting](/developer-guide/worker-definition/platform-features/browser-fingerprinting/) for details.
+- **Browser automation scripts**: Connect to the remote browser via the `ChromeWs` environment variable, or to Lightpanda via `LightpandaDomain` (CDP/WebSocket). Proxy is handled automatically by the browser, so no manual proxy configuration is needed. See [Browser Fingerprinting](/developer-guide/worker-definition/platform-features/browser-fingerprinting/) for details.
 :::

 ```python

src/content/docs/developer-guide/developer-faq/how-to-deploy.md

Lines changed: 2 additions & 2 deletions
@@ -80,5 +80,5 @@ Before publishing, test your Worker:
 **Cause:** Incorrect file names

 **Solution:**
-- Entry file must be `main.py` / `main.js` / `main.go`
-- Check file paths in code
+- Runtime entry must be `main.py` for Python, `main.js` for Node.js, or the compiled Linux amd64 executable `main` for Go
+- Check file paths in code
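
The runtime-entry rule above can be checked locally before uploading. The following is a minimal sketch using the standard `zipfile` module; `check_runtime_entry` and the runtime keys `"python"`, `"nodejs"`, and `"go"` are hypothetical names chosen for this example, and the platform's own validation may differ.

```python
import zipfile
from typing import List

# Runtime entry names as listed in the solution above.
RUNTIME_ENTRIES = {"python": "main.py", "nodejs": "main.js", "go": "main"}


def check_runtime_entry(zip_path: str, runtime: str) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    entry = RUNTIME_ENTRIES[runtime]
    problems: List[str] = []
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        if entry not in names:
            # A common mistake is zipping the parent folder, which nests the entry.
            nested = [name for name in names if name.endswith("/" + entry)]
            hint = f" (found nested copy: {nested[0]})" if nested else ""
            problems.append(f"{entry} is not at the ZIP root{hint}")
        elif runtime == "go":
            # The upper 16 bits of external_attr hold the Unix file mode.
            mode = archive.getinfo(entry).external_attr >> 16
            if not mode & 0o111:
                problems.append("main has no executable bit in the archive")
    return problems
```

Returning a list rather than raising on the first failure makes it straightforward to add further checks later, such as the presence of `input_schema.json`.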
Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
---
title: Lightpanda
description: Connect to the platform-hosted Lightpanda browser through CDP
sidebar:
  order: 2
---

Lightpanda is a lightweight browser backend exposed by CoreClaw through the Chrome DevTools Protocol (CDP). Use it when your Worker needs browser-level navigation, JavaScript execution, or page rendering without packaging or launching a browser locally.

## Positioning

Lightpanda is **not** a browser automation framework. It is a platform-hosted browser endpoint.

You still write automation logic with a client library such as Playwright. The difference is that Playwright connects to the Lightpanda CDP endpoint instead of starting a local browser process.

## Environment Variables

| Variable | Description |
| --- | --- |
| `LightpandaDomain` | Platform-injected Lightpanda CDP domain or endpoint |
| `PROXY_AUTH` | Platform-injected credentials in `username:password` format |

:::danger[Important]
- Never hardcode `PROXY_AUTH` or Lightpanda credentials in Worker code.
- Read `LightpandaDomain` from the runtime environment.
- Pass `PROXY_AUTH` as a Basic `Authorization` header when connecting to Lightpanda.
:::

## Endpoint and Authentication Rules

Apply these rules before calling `connect_over_cdp`:

- If `LightpandaDomain` is a bare domain, normalize it to `ws://<domain>/devtools/browser/new`.
- If `LightpandaDomain` is already a full `ws://`, `wss://`, `http://`, or `https://` CDP endpoint, use it as provided.
- Authentication uses the HTTP `Authorization` header with the Basic scheme. Build the header from `PROXY_AUTH` in `username:password` format.

## Recommended Playwright Connection

Use this version when `LightpandaDomain` may be injected as either a bare domain or a full CDP endpoint.

```python
import asyncio
import base64
import os

from playwright.async_api import async_playwright


def basic_auth_header(auth: str) -> str:
    token = base64.b64encode(auth.encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def lightpanda_cdp_endpoint(value: str) -> str:
    endpoint = value.rstrip("/")
    if endpoint.startswith(("ws://", "wss://", "http://", "https://")):
        return endpoint
    return f"ws://{endpoint}/devtools/browser/new"


async def main() -> None:
    auth = os.environ["PROXY_AUTH"]
    cdp_endpoint = lightpanda_cdp_endpoint(os.environ["LightpandaDomain"])

    async with async_playwright() as playwright:
        browser = await playwright.chromium.connect_over_cdp(
            cdp_endpoint,
            headers={"Authorization": basic_auth_header(auth)},
            timeout=60000,
        )
        try:
            page = await browser.new_page()
            await page.goto(
                "https://ipinfo.io/ip",
                wait_until="domcontentloaded",
                timeout=60000,
            )
            print((await page.text_content("body") or "").strip())
        finally:
            await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
```

## HTTP CDP Endpoint

The recommended helper above already supports full HTTP CDP endpoints. If you know `LightpandaDomain` is always injected as a full HTTP endpoint, this shorter form is equivalent:

```python
import asyncio
import base64
import os

from playwright.async_api import async_playwright


def basic_auth_header(auth: str) -> str:
    token = base64.b64encode(auth.encode("utf-8")).decode("ascii")
    return f"Basic {token}"


async def main() -> None:
    auth = os.environ["PROXY_AUTH"]
    # Used as provided: this form assumes a full HTTP CDP endpoint.
    cdp_endpoint = os.environ["LightpandaDomain"]

    async with async_playwright() as playwright:
        browser = await playwright.chromium.connect_over_cdp(
            cdp_endpoint,
            headers={"Authorization": basic_auth_header(auth)},
            timeout=60000,
        )
        try:
            page = await browser.new_page()
            # Explicit navigation timeout, matching the recommended version.
            await page.goto(
                "https://ipinfo.io/ip",
                wait_until="domcontentloaded",
                timeout=60000,
            )
            print((await page.text_content("body") or "").strip())
        finally:
            await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
```

## Complete Worker Example with CoreSDK

Use this structure in `main.py` when packaging Lightpanda automation as a CoreClaw Worker. The SDK handles input parameters, logs, table headers, and result delivery back to the platform.

```python
import asyncio
import base64
import os

from playwright.async_api import async_playwright
from sdk import CoreSDK


def basic_auth_header(auth: str) -> str:
    token = base64.b64encode(auth.encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def lightpanda_cdp_endpoint(value: str) -> str:
    endpoint = value.rstrip("/")
    if endpoint.startswith(("ws://", "wss://", "http://", "https://")):
        return endpoint
    return f"ws://{endpoint}/devtools/browser/new"


async def run() -> None:
    CoreSDK.Log.info("Starting Lightpanda Worker...")

    headers = [
        {"label": "url", "key": "url", "format": "text"},
        {"label": "ip", "key": "ip", "format": "text"},
        {"label": "html", "key": "html", "format": "text"},
        {"label": "resp_status", "key": "resp_status", "format": "text"},
    ]
    CoreSDK.Result.set_table_header(headers)

    input_json = CoreSDK.Parameter.get_input_json_dict()
    url = input_json.get("url") or "https://ipinfo.io/ip"

    auth = os.environ["PROXY_AUTH"]
    cdp_endpoint = lightpanda_cdp_endpoint(os.environ["LightpandaDomain"])

    result = {
        "url": url,
        "ip": "",
        "html": "",
        "resp_status": "200",
    }

    try:
        async with async_playwright() as playwright:
            CoreSDK.Log.info("Connecting to Lightpanda...")
            browser = await playwright.chromium.connect_over_cdp(
                cdp_endpoint,
                headers={"Authorization": basic_auth_header(auth)},
                timeout=60000,
            )
            try:
                page = await browser.new_page()
                response = await page.goto(
                    url, wait_until="domcontentloaded", timeout=60000
                )
                # Record the real HTTP status when Playwright reports one.
                if response is not None:
                    result["resp_status"] = str(response.status)

                result["html"] = await page.content()
                result["ip"] = (await page.text_content("body") or "").strip()
                CoreSDK.Log.info("Lightpanda page loaded successfully")
            finally:
                # Close the remote browser while the Playwright connection
                # is still open, before the async context manager exits.
                await browser.close()
    except Exception as exc:
        result["resp_status"] = "500"
        result["html"] = str(exc)
        CoreSDK.Log.error(f"Lightpanda run failed: {exc}")
    finally:
        # Always deliver one row, whether the run succeeded or failed.
        CoreSDK.Result.push_data(result)


if __name__ == "__main__":
    asyncio.run(run())
```

## Best Practices

- Use Lightpanda for CDP-based browser automation where a lightweight hosted browser is sufficient.
- Keep page-level logic in Playwright selectors and navigation APIs.
- Set explicit timeouts for browser connection and page navigation.
- In Worker code, use `CoreSDK.Parameter` for inputs, `CoreSDK.Log` for progress, and `CoreSDK.Result` for output.
- Close the remote browser in a `finally` block.
- Do not configure SOCKS5 proxy manually for browser pages; the remote browser handles outbound access.

src/content/docs/developer-guide/worker-definition/browser-automation/overview.md

Lines changed: 13 additions & 1 deletion
@@ -215,7 +215,19 @@ Business Logic & Data Processing
 └── Local storage or real-time delivery
 ```

-## 6. Conclusion
+## 6. Browser Backends and Connection Endpoints
+
+Browser automation frameworks and browser backends are separate layers:
+
+| Layer | Examples | Responsibility |
+| --- | --- | --- |
+| Automation framework | Playwright, Puppeteer, Selenium, DrissionPage | Provides the API used by Worker code |
+| Browser backend | Remote fingerprint browser, Lightpanda | Runs the actual browser process and network environment |
+| Platform runtime | `ChromeWs`, `ChromeHttp`, `LightpandaDomain`, `PROXY_AUTH` | Injects connection endpoints and credentials |
+
+For example, Playwright can connect to either the remote fingerprint browser through `ChromeWs` or the Lightpanda CDP endpoint through `LightpandaDomain`. The scraping logic still uses Playwright APIs; only the remote browser endpoint changes.
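
To make the layering concrete, the sketch below isolates backend selection in one small function so the rest of a Worker stays backend-agnostic. `select_cdp_endpoint` and the backend labels `"lightpanda"` and `"fingerprint"` are hypothetical names for this example. The bare-domain normalization follows the rule given in the Lightpanda guide added by this commit.

```python
from typing import Mapping

CDP_SCHEMES = ("ws://", "wss://", "http://", "https://")


def select_cdp_endpoint(env: Mapping[str, str], backend: str) -> str:
    """Return the CDP endpoint for the chosen backend (labels are illustrative)."""
    if backend == "lightpanda":
        value = env["LightpandaDomain"].rstrip("/")
        if value.startswith(CDP_SCHEMES):
            return value
        # A bare domain is expanded to the documented WebSocket path.
        return f"ws://{value}/devtools/browser/new"
    if backend == "fingerprint":
        # ChromeWs is described as a ready-to-use WebSocket address.
        return env["ChromeWs"]
    raise ValueError(f"unknown backend: {backend}")
```

The returned string would be passed to Playwright's `connect_over_cdp`. The Lightpanda guide pairs it with a Basic `Authorization` header built from `PROXY_AUTH`; the changes shown here do not state whether the fingerprint browser needs the same header.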
+
+## 7. Conclusion

 When the target website is a **modern Web application** rather than a traditional static page, **using a real browser environment is not an optimization; it is a prerequisite**.
