Provisioning disks to agents offers several benefits:
* **Stateful Computations**: Allows agents to retain data between computations, enabling the execution of complex, multi-step tasks.
* **Improved Efficiency**: Reduces the need to repeatedly fetch data from external sources, resulting in faster processing times.
* **Data Persistence**: Ensures that important data is not lost in case of agent restarts or failures.
The agent must declare the `resources.disk` section to automatically request a persistent disk.

Disks are automatically provided to the agents at runtime by LangStream: the provisioned disks are isolated from other agents, and each agent can request different disk sizes and types.
```yaml
- name: "Stateful processing using Python"
  resources:
    disk:
      enabled: true
      size: 50M
  id: "my-python-processor"
  type: "python-processor"
```
The `disk` section provides these parameters:
- `enabled` (boolean): whether to provision the disk or not
- `size` (string): size of the disk to provision, e.g. 100K, 100M, 100G
- `type` (string): type of the disk
At runtime, LangStream converts the disk specification into the actual disk request for the storage provisioner configured in the LangStream cluster.

The `type` option is statically mapped to a Kubernetes Storage Class. The value `default` means to use the default Storage Class configured in Kubernetes.
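For example, the agent definition from the snippet above can request the disk type explicitly. The value `"default"` selects the cluster's default Storage Class, while any other value must be mapped to a Storage Class in the LangStream cluster configuration (a minimal sketch, reusing the names from the earlier example):

```yaml
- name: "Stateful processing using Python"
  id: "my-python-processor"
  type: "python-processor"
  resources:
    disk:
      enabled: true
      size: 50M
      type: "default"   # mapped to the default Kubernetes Storage Class
```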
Once the agent requests the disk, the disk is mounted in the local file system of the agent.

In Python, you can access the directory by calling `AgentContext.get_persistent_state_directory()`.
```python
from langstream import SimpleRecord, Processor, AgentContext
```
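The rest of this example is not shown on this page. As a rough, illustrative sketch only (the `CountingProcessor` class, the `counter.txt` file, and the assumption that the `init` hook receives the `AgentContext` are not taken from the original), a processor could persist state across restarts like this:

```python
import os

from langstream import SimpleRecord, Processor, AgentContext


class CountingProcessor(Processor):
    """Illustrative only: counts processed records across agent restarts."""

    def init(self, config, context: AgentContext):
        # Assumption: the init hook receives the AgentContext; check the
        # LangStream Python API for the exact lifecycle signature.
        state_dir = context.get_persistent_state_directory()
        self.counter_file = os.path.join(state_dir, "counter.txt")

    def process(self, record):
        count = 0
        if os.path.exists(self.counter_file):
            with open(self.counter_file) as f:
                count = int(f.read().strip() or "0")
        count += 1
        # The file lives on the disk declared in resources.disk, so the
        # counter survives agent restarts.
        with open(self.counter_file, "w") as f:
            f.write(str(count))
        return [SimpleRecord(f"{record.value()} (record #{count})")]
```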
## pipeline-agents/input-and-output/webcrawler-source.md
The webcrawler-source agent crawls a website and outputs the site's URL and an HTML document. Crawling a website is an ideal first step in a [text embeddings pipeline](https://github.com/LangStream/langstream/tree/main/examples/applications/webcrawler-source).
This agent keeps the status of the crawling in persistent storage. The storage won't contain a copy of the crawl data, only a single JSON file whose name is computed from the name of the agent and the id of the LangStream application.

By default, it requires an S3-compatible bucket, defined using the `bucketName`, `endpoint`, `access-key`, `secret-key` and `region` properties.
Alternatively, the status can be stored on a [persistent disk provided by LangStream](../../building-applications/stateful-agents.md). This is achieved by setting `state-storage: disk`.
### Example

The example below shows the relevant part of a `webcrawler-source` agent's configuration in a pipeline:
```yaml
http-timeout: 10000
handle-cookies: true
max-unflushed-pages: 100
state-storage: disk
```
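If the crawler status is kept in an S3-compatible bucket (the default) rather than on a LangStream-provided disk, the bucket properties are set instead of `state-storage: disk`, typically through secrets:

```yaml
bucketName: "${secrets.s3.bucket-name}"   # defaults to "langstream-source"
endpoint: "${secrets.s3.endpoint}"        # URL of the S3 service, e.g. a MinIO endpoint
access-key: "${secrets.s3.access-key}"    # defaults to "minioadmin"
secret-key: "${secrets.s3.secret}"        # defaults to "minioadmin"
region: "${secrets.s3.region}"            # region of the S3 bucket
```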
#### Multiple URLs in pipeline

* Structured text (JSON) [?](../agent-messaging.md)

### Configuration

Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/agents.md#webcrawler-source). Among them, `handle-cookies` (boolean) controls whether the crawler handles cookies, and `max-unflushed-pages` (integer) sets the maximum number of unflushed pages before the agent persists the crawl data.
The webcrawler itself uses the [Jsoup](https://jsoup.org/) library to parse HTML with the [WHATWG HTML spec](https://html.spec.whatwg.org/multipage/). The webcrawler explores the web starting from a list of seed URLs and follows links within pages to discover more content. 
The webcrawler then passes the document on to the next agent.