Commit f0dffb9

Document stateful agents (#130)
Co-authored-by: Mendon Kissling <[email protected]>
1 parent c186a47 commit f0dffb9

File tree

3 files changed: +61 −37 lines changed

SUMMARY.md

+1

````diff
@@ -34,6 +34,7 @@
 * [Secrets](building-applications/secrets.md)
 * [YAML templating](building-applications/yaml-templates.md)
 * [Error Handling](building-applications/error-handling.md)
+* [Stateful agents](building-applications/stateful-agents.md)
 * [.langstreamignore](building-applications/langstreamignore.md)
 * [Sample App](building-applications/build-a-sample-app.md)
 * [Develop, test and deploy](building-applications/development-workflow.md)
````
building-applications/stateful-agents.md

+51 (new file)
# Stateful agents

Provisioning disks to agents offers several benefits:

* **Stateful Computations**: Allows agents to retain data between computations, enabling the execution of complex, multi-step tasks.
* **Improved Efficiency**: Reduces the need to repeatedly fetch data from external sources, resulting in faster processing times.
* **Data Persistence**: Ensures that important data is not lost if an agent restarts or fails.

An agent declares the `resources.disk` section to request a persistent disk.
Disks are automatically provided to agents at runtime by LangStream: each disk is isolated from other agents, and each agent can request a different disk size and type.

```yaml
- name: "Stateful processing using Python"
  resources:
    disk:
      enabled: true
      size: 50M
  id: "my-python-processor"
  type: "python-processor"
```
The `disk` section provides these parameters:

- `enabled` (boolean): whether to provision the disk
- `size` (string): size of the disk to provision, e.g. 100K, 100M, 100G
- `type` (string): type of the disk

At runtime, LangStream converts the disk specification into an actual disk request for the storage provisioner configured in the LangStream cluster.
The `type` option is statically mapped to a Kubernetes StorageClass. The value `default` means the default StorageClass configured in Kubernetes is used.
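For instance, a disk request that also pins the storage type might look like the following sketch (the `enabled` and `size` values mirror the example above; `type: default` simply selects the cluster's default StorageClass):

```yaml
resources:
  disk:
    enabled: true
    size: 50M
    type: default
```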
Once the agent requests the disk, the disk is mounted in the local file system of the agent.
In Python, you can access the directory by calling `AgentContext.get_persistent_state_directory()`.

```python
from langstream import SimpleRecord, Processor, AgentContext
import os

class Exclamation(Processor):
    def init(self, config, context: AgentContext):
        self.context = context

    def process(self, record):
        directory = self.context.get_persistent_state_directory()
        counter_file = os.path.join(directory, "counter.txt")
        ...
```
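The body of `process()` is elided above. As a hypothetical, minimal sketch of what such a handler might do with the persistent directory (plain file I/O, independent of the LangStream SDK), a counter that survives agent restarts could look like this:

```python
import os
import tempfile


def increment_counter(directory: str, filename: str = "counter.txt") -> int:
    """Read the last value stored on the persistent disk, add one, write it back.

    Because the file lives on the agent's persistent disk, the count
    survives restarts and failures of the agent.
    """
    path = os.path.join(directory, filename)
    count = 0
    if os.path.exists(path):
        with open(path) as f:
            count = int(f.read().strip() or 0)
    count += 1
    with open(path, "w") as f:
        f.write(str(count))
    return count


if __name__ == "__main__":
    # A temporary directory stands in for get_persistent_state_directory().
    with tempfile.TemporaryDirectory() as d:
        print(increment_counter(d))  # 1
        print(increment_counter(d))  # 2
```

Inside a real processor, `directory` would come from `self.context.get_persistent_state_directory()` as shown in the snippet above.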

pipeline-agents/input-and-output/webcrawler-source.md

+9 −37
````diff
@@ -2,7 +2,10 @@
 
 The webcrawler-source agent crawls a website and outputs the site's URL and an HTML document. Crawling a website is an ideal first step in a [text embeddings pipeline](https://github.com/LangStream/langstream/tree/main/examples/applications/webcrawler-source).
 
-The S3 bucket only stores metadata about the website and the status of the crawler - it won’t contain a copy of the crawl data, but a single JSON file with a name computed from the name of the agent and the id of the LangStream application.
+This agent keeps the status of the crawl in persistent storage. The storage won’t contain a copy of the crawl data, only a single JSON file with a name computed from the name of the agent and the id of the LangStream application.
+
+By default, it requires an S3-compatible bucket, defined using the `bucketName`, `endpoint`, `access-key`, `secret-key` and `region` properties.
+Alternatively, the status can be stored in a [persistent disk provided by LangStream](../../building-applications/stateful-agents.md) by setting `state-storage: disk`.
 
 ### Example
````

````diff
@@ -27,11 +30,7 @@ pipeline:
     http-timeout: 10000
     handle-cookies: true
     max-unflushed-pages: 100
-    bucketName: "${secrets.s3.bucket-name}"
-    endpoint: "${secrets.s3.endpoint}"
-    access-key: "${secrets.s3.access-key}"
-    secret-key: "${secrets.s3.secret}"
-    region: "${secrets.s3.region}"
+    state-storage: disk
 ```
 
 #### Multiple URLs in pipeline
````
````diff
@@ -60,30 +59,12 @@ allowed-domains:
 * Structured text (JSON) [?](../agent-messaging.md)
 * Implicit topic [?](../agent-messaging.md#implicit-input-and-output-topics)
 
-### **Configuration**
-
-| Label | Type | Description |
-| ------------------------- | ---------------------- | -------------------------------------------------------------------------------------------------------- |
-| seed-urls | List of Strings | The starting URLs for the crawl. |
-| allowed-domains | List of Strings | Domains that the crawler is allowed to access. |
-| forbidden-paths | List of Strings | Paths that the crawler is not allowed to access. |
-| min-time-between-requests | Integer (milliseconds) | Minimum time between two requests to the same domain. |
-| reindex-interval-seconds | Integer (seconds) | Time interval between reindexing of the pages. |
-| max-error-count | Integer | Maximum number of errors allowed before stopping. |
-| max-urls | Integer | Maximum number of URLs that can be crawled. Defaults to 1000. |
-| max-depth | Integer | Maximum depth of the crawl. |
-| handle-robots-file | Boolean | Whether to scan the HTML documents to find links to other pages (defaults to true) |
-| user-agent | String | User-agent string, computed automatically unless overridden. Defaults to "langstream.ai-webcrawler/1.0". |
-| scan-html-documents | Boolean | Whether to scan HTML documents for links to other sites. Defaults to true. |
-| http-timeout | Integer (milliseconds) | Timeout for HTTP requests. |
-| handle-cookies | Boolean | Whether to handle cookies. |
-| max-unflushed-pages | Integer | Maximum number of unflushed pages before the agent persists the crawl data. |
+### Configuration
 
-### S3 credentials
+Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/agents.md#webcrawler-source).
 
-<table><thead><tr><th width="147.33333333333331">Label</th><th width="165">Type</th><th>Description</th></tr></thead><tbody><tr><td>bucketName</td><td>string (required)</td><td>The name of the bucket. Defaults to "langstream-source".</td></tr><tr><td>endpoint</td><td>string (required)</td><td>The URL of the S3 service. Defaults to "<a href="http://minio-endpoint.-not-set:9090">http://minio-endpoint.-not-set:9090</a>".</td></tr><tr><td>access-key</td><td>string (optional)</td><td>Optional user name credential. Defaults to "minioadmin".</td></tr><tr><td>secret-key</td><td>string (optional)</td><td>Optional password credential. Defaults to "minioadmin".</td></tr><tr><td>region</td><td>string</td><td>Region of S3 bucket.</td></tr></tbody></table>
 
-### Webcrawler-status
+### Webcrawler Status
 
 | Label | Type | Description |
 | ------------- | ------ | --------------------------------------------------------------------- |
````
````diff
@@ -123,11 +104,7 @@ pipeline:
     http-timeout: 10000
     handle-cookies: true
     max-unflushed-pages: 100
-    bucketName: "${secrets.s3.bucket-name}"
-    endpoint: "${secrets.s3.endpoint}"
-    access-key: "${secrets.s3.access-key}"
-    secret-key: "${secrets.s3.secret}"
-    region: "${secrets.s3.region}"
+    state-storage: disk
 ```
 
 The webcrawler itself uses the [Jsoup](https://jsoup.org/) library to parse HTML with the [WHATWG HTML spec](https://html.spec.whatwg.org/multipage/). The webcrawler explores the web starting from a list of seed URLs and follows links within pages to discover more content.
````
````diff
@@ -242,8 +219,3 @@ The webcrawler then passes the document on to the next agent.
     keyspace: "documents"
     mapping: "filename=value.filename, chunk_id=value.chunk_id, language=value.language, text=value.text, embeddings_vector=value.embeddings_vector, num_tokens=value.chunk_num_tokens"
 ```
-
-
-### Configuration
-
-Checkout the full configuration properties in the [API Reference page](../../building-applications/api-reference/agents.md#webcrawler-source).
````
