
Commit 4c316e0

v0.2.5 - Structured Outputs
1 parent 33fec21 commit 4c316e0

5 files changed (+66 -7 lines)


docs/api/getting-started.md

Lines changed: 3 additions & 3 deletions
@@ -31,7 +31,7 @@ curl https://agents.parsera.org/v1/generate \
|----------------|----------|--------------|--------------------------------------------|
| `name` | `string` | - | Name of the agent |
| `url` | `string` | - | Website URL |
-| `attributes` | `array` | - | A map of `name` - `description` pairs of data fields to extract from the webpage |
+| `attributes` | `object` | - | A map of `name` - `description` pairs of data fields to extract from the webpage |
| `proxy_country` | `string` | `UnitedStates` | Proxy country, see [Proxy Countries](proxy.md) |
| `cookies` | `array` | Empty | Cookies to use during extraction, see [Cookies](cookies.md) |

@@ -109,7 +109,7 @@ It's recommended to set the `proxy_country` parameter to a specific country sinc
| Parameter | Type | Default | Description |
|----------------|----------|--------------|--------------------------------------------------|
| `url` | `string` | - | URL of the webpage to extract data from |
-| `attributes` | `array` | - | A map of `name` - `description` pairs of data fields to extract from the webpage |
+| `attributes` | `object` | - | A map of `name` - `description` pairs of data fields to extract from the webpage |
| `mode` | `string` | `standard` | Mode of the extractor, `standard` or `precision`. For details, see [Precision mode](precision-mode.md) |
| `proxy_country`| `string` | `UnitedStates`| Proxy country, see [Proxy Countries](proxy.md) |
| `cookies` | `array` | Empty | Cookies to use during extraction, see [Cookies](cookies.md) |
@@ -136,7 +136,7 @@ curl https://api.parsera.org/v1/parse \
| Parameter | Type | Default | Description |
|----------------|----------|--------------|--------------------------------------------------|
| `content` | `string` | - | Raw HTML or text content to extract data from |
-| `attributes` | `array` | - | A map of `name` - `description` pairs of data fields to extract from the webpage |
+| `attributes` | `object` | - | A map of `name` - `description` pairs of data fields to extract from the webpage |
| `mode` | `string` | `standard` | Mode of the extractor, `standard` or `precision`. For details, see [Precision mode](precision-mode.md) |

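Across all three tables, `attributes` is now documented as an `object`: a map of field names to descriptions rather than an array. A minimal sketch of a `/v1/parse` request body in that shape; the endpoint and parameters come from the tables above, while the field names are illustrative and authentication headers are omitted because they are not part of this diff:

```python
import requests  # any HTTP client works; requests is used only for illustration

# `attributes` is an object mapping field names to their descriptions,
# matching the `name` - `description` pairs described in the tables above.
payload = {
    "content": "<html><body><h1>Example product</h1><span>42 reviews</span></body></html>",
    "attributes": {
        "Title": "Main heading of the page",
        "Reviews": "Number of reviews",
    },
    "mode": "standard",
}

# Authentication headers are omitted here; they are not shown in this diff.
response = requests.post("https://api.parsera.org/v1/parse", json=payload)
print(response.json())
```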
docs/features/extractors.md

Lines changed: 31 additions & 0 deletions
@@ -54,6 +54,37 @@ def count_tokens(text):
scraper = Parsera(extractor=ExtractorType.CHUNKS_TABULAR, chunk_size=12000, token_counter=count_tokens)
```

+## Structured Extractor
+An extension of `ChunksTabularExtractor` that uses structured output to return values of the specified types.
+It is used by default inside `Parsera` when an extended elements schema is provided:
+```python
+from parsera import Parsera
+from parsera.engine.chunks_extractor import ChunksTabularExtractor
+
+url = "https://news.ycombinator.com/"
+elements = {
+    "Title": {
+        "description": "News title",
+        "type": "string",
+    },
+    "Points": {
+        "description": "Number of points",
+        "type": "integer",
+    },
+    "Comments": {
+        "description": "Number of comments",
+        "type": "integer",
+    },
+}
+
+extractor = StructuredExtractor()
+scraper = Parsera(extractor=extractor)  # With elements structured as above, this extractor is used by default.
+
+result = scraper.run(url=url, elements=elements)
+```
+
+Note: use this extractor only with models that support Structured Outputs.

## List Extractor
```python

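Since the note above restricts this extractor to models with Structured Outputs support, here is a minimal sketch of wiring such a model in, assuming `Parsera` accepts a LangChain chat model through the `model` argument as described on the Custom models page referenced in the nav change below; the specific model name is an assumption, not part of this diff:

```python
from langchain_openai import ChatOpenAI

from parsera import Parsera

# Assumption: the chosen model supports Structured Outputs and Parsera
# accepts a LangChain chat model via the `model` argument
# (see features/custom-models.md in the nav below).
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

elements = {
    "Title": {"description": "News title", "type": "string"},
    "Points": {"description": "Number of points", "type": "integer"},
}

scraper = Parsera(model=llm)
result = scraper.run(url="https://news.ycombinator.com/", elements=elements)
```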
docs/getting-started.md

Lines changed: 28 additions & 0 deletions
@@ -54,6 +54,34 @@ There is also `arun` async method available:
result = await scrapper.arun(url=url, elements=elements)
```

+## Specify output types
+
+You can specify the output types using the following schema:
+```python
+from parsera import Parsera
+
+url = "https://news.ycombinator.com/"
+elements = {
+    "Title": {
+        "description": "News title",
+        "type": "string",
+    },
+    "Points": {
+        "description": "Number of points",
+        "type": "integer",
+    },
+    "Comments": {
+        "description": "Number of comments",
+        "type": "integer",
+    },
+}
+
+scraper = Parsera()
+result = scraper.run(url=url, elements=elements)
+```
+
+When a schema with types is used, `Parsera` switches to the [Structured Extractor](/features/extractors/#structured-extractor).

## Running with CLI

Before you run `Parsera` as a command line tool, don't forget to put your `OPENAI_API_KEY` into environment variables or a `.env` file

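For contrast, a sketch of the plain schema without types: a flat `name` - `description` mapping, the same shape the API tables above describe for `attributes`. With this shape `Parsera` does not switch to the Structured Extractor; which extractor it uses by default is not shown in this diff:

```python
from parsera import Parsera

url = "https://news.ycombinator.com/"

# Flat `name` -> `description` mapping without types; per the section above,
# Parsera only switches to the Structured Extractor when types are given.
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

scraper = Parsera()
result = scraper.run(url=url, elements=elements)
print(result)
```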
mkdocs.yml

Lines changed: 3 additions & 3 deletions
@@ -27,13 +27,13 @@ nav:
  - Home: index.md
  - Getting started: getting-started.md
  - Features:
+    - Custom models: features/custom-models.md
+    - Extractors: features/extractors.md
+    - Proxy: features/proxy.md
     - Custom browser: features/custom-browser.md
     - Custom cookies: features/custom-cookies.md
     - Custom playwright: features/custom-playwright.md
-    - Custom models: features/custom-models.md
-    - Proxy: features/proxy.md
     - Scrolling: features/scrolling.md
-    - Extractors: features/extractors.md
     - Docker: features/docker.md
  - API:
    - Getting started: api/getting-started.md

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
[tool.poetry]
name = "parsera"
-version = "0.2.4"
+version = "0.2.5"
description = "Lightweight library for scraping web-sites with LLMs"
authors = ["Mikhail Zanka <raznem@gmail.com>", "Danila Paddubny <danilapoddubny26@gmail.com>"]
license = "GPL-2.0-or-later"
