Provisioning disks to agents offers several benefits:
* **Stateful Computations**: Allows agents to retain data between computations, enabling the execution of complex, multi-step tasks.
* **Improved Efficiency**: Reduces the need to repeatedly fetch data from external sources, resulting in faster processing times.
* **Data Persistence**: Ensures that important data is not lost in case of agent restarts or failures.
The agent must declare the `resources.disk` section to automatically request a persistent disk.

Disks are automatically provided to the agents at runtime by LangStream: the provisioned disks are isolated from other agents, and each agent can request different disk sizes and types.
```yaml
- name: "Stateful processing using Python"
  resources:
    disk:
      enabled: true
      size: 50M
  id: "my-python-processor"
  type: "python-processor"
```
The `disk` section provides these parameters:
- `enabled` (boolean): whether to provision the disk or not
- `size` (string): size of the disk to provision, e.g. 100K, 100M, 100G
- `type` (string): type of the disk
At runtime, LangStream converts the disk specification into the actual disk request for the storage provisioner configured in the LangStream cluster.

The `type` option is statically mapped to a Kubernetes Storage Class. The value `default` means to use the default Storage Class configured in Kubernetes.
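For example, the agent definition from the snippet above can request the disk type explicitly. The value `"default"` selects the cluster's default Storage Class, while any other value must be mapped to a Storage Class in the LangStream cluster configuration (a minimal sketch, reusing the names from the earlier example):

```yaml
- name: "Stateful processing using Python"
  id: "my-python-processor"
  type: "python-processor"
  resources:
    disk:
      enabled: true
      size: 50M
      type: "default"   # mapped to the default Kubernetes Storage Class
```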
Once the agent requests the disk, the disk is mounted in the local file system of the agent.

In Python, you can access the directory by calling `AgentContext.get_persistent_state_directory()`.
```python
from langstream import SimpleRecord, Processor, AgentContext
```
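The rest of this example is not shown on this page. As a rough, illustrative sketch only (the `CountingProcessor` class, the `counter.txt` file, and the assumption that the `init` hook receives the `AgentContext` are not taken from the original), a processor could persist state across restarts like this:

```python
import os

from langstream import SimpleRecord, Processor, AgentContext


class CountingProcessor(Processor):
    """Illustrative only: counts processed records across agent restarts."""

    def init(self, config, context: AgentContext):
        # Assumption: the init hook receives the AgentContext; check the
        # LangStream Python API for the exact lifecycle signature.
        state_dir = context.get_persistent_state_directory()
        self.counter_file = os.path.join(state_dir, "counter.txt")

    def process(self, record):
        count = 0
        if os.path.exists(self.counter_file):
            with open(self.counter_file) as f:
                count = int(f.read().strip() or "0")
        count += 1
        # The file lives on the disk declared in resources.disk, so the
        # counter survives agent restarts.
        with open(self.counter_file, "w") as f:
            f.write(str(count))
        return [SimpleRecord(f"{record.value()} (record #{count})")]
```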
## pipeline-agents/input-and-output/webcrawler-source.md
The webcrawler-source agent crawls a website and outputs the site's URL and an HTML document. Crawling a website is an ideal first step in a [text embeddings pipeline](https://github.com/LangStream/langstream/tree/main/examples/applications/webcrawler-source).
This agent keeps the status of the crawling in persistent storage. The storage won't contain a copy of the crawl data, only a single JSON file whose name is computed from the name of the agent and the id of the LangStream application.

By default, it requires an S3-compatible bucket, defined using the `bucketName`, `endpoint`, `access-key`, `secret-key` and `region` properties.
Alternatively, the status can be stored on a [persistent disk provided by LangStream](../../building-applications/stateful-agents.md). This is achieved by setting `state-storage: disk`.
### Example

The example below shows the relevant part of a `webcrawler-source` agent's configuration in a pipeline:
```yaml
http-timeout: 10000
handle-cookies: true
max-unflushed-pages: 100
state-storage: disk
```
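If the crawler status is kept in an S3-compatible bucket (the default) rather than on a LangStream-provided disk, the bucket properties are set instead of `state-storage: disk`, typically through secrets:

```yaml
bucketName: "${secrets.s3.bucket-name}"   # defaults to "langstream-source"
endpoint: "${secrets.s3.endpoint}"        # URL of the S3 service, e.g. a MinIO endpoint
access-key: "${secrets.s3.access-key}"    # defaults to "minioadmin"
secret-key: "${secrets.s3.secret}"        # defaults to "minioadmin"
region: "${secrets.s3.region}"            # region of the S3 bucket
```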
#### Multiple URLs in pipeline

* Structured text (JSON) [?](../agent-messaging.md)

### Configuration

Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/agents.md#webcrawler-source). Among them, `handle-cookies` (boolean) controls whether the crawler handles cookies, and `max-unflushed-pages` (integer) sets the maximum number of unflushed pages before the agent persists the crawl data.
The webcrawler itself uses the [Jsoup](https://jsoup.org/) library to parse HTML with the [WHATWG HTML spec](https://html.spec.whatwg.org/multipage/). The webcrawler explores the web starting from a list of seed URLs and follows links within pages to discover more content. 
The webcrawler then passes the document on to the next agent.