
Commit 4c316e0

v0.2.5 - Structured Outputs
1 parent 33fec21 commit 4c316e0

5 files changed (+66 -7 lines)


docs/api/getting-started.md

Lines changed: 3 additions & 3 deletions
@@ -31,7 +31,7 @@ curl https://agents.parsera.org/v1/generate \
|----------------|----------|--------------|--------------------------------------------|
| `name` | `string` | - | Name of the agent |
| `url` | `string` | - | Website URL |
-| `attributes` | `array` | - | A map of `name` - `description` pairs of data fields to extract from the webpage |
+| `attributes` | `object` | - | A map of `name` - `description` pairs of data fields to extract from the webpage |
| `proxy_country` | `string` | `UnitedStates` | Proxy country, see [Proxy Countries](proxy.md) |
| `cookies` | `array` | Empty | Cookies to use during extraction, see [Cookies](cookies.md) |

@@ -109,7 +109,7 @@ It's recommended to set the `proxy_country` parameter to a specific country sinc
| Parameter | Type | Default | Description |
|----------------|----------|--------------|--------------------------------------------------|
| `url` | `string` | - | URL of the webpage to extract data from |
-| `attributes` | `array` | - | A map of `name` - `description` pairs of data fields to extract from the webpage |
+| `attributes` | `object` | - | A map of `name` - `description` pairs of data fields to extract from the webpage |
| `mode` | `string` | `standard` | Mode of the extractor, `standard` or `precision`. For details, see [Precision mode](precision-mode.md) |
| `proxy_country`| `string` | `UnitedStates`| Proxy country, see [Proxy Countries](proxy.md) |
| `cookies` | `array` | Empty | Cookies to use during extraction, see [Cookies](cookies.md) |
@@ -136,7 +136,7 @@ curl https://api.parsera.org/v1/parse \
| Parameter | Type | Default | Description |
|----------------|----------|--------------|--------------------------------------------------|
| `content` | `string` | - | Raw HTML or text content to extract data from |
-| `attributes` | `array` | - | A map of `name` - `description` pairs of data fields to extract from the webpage |
+| `attributes` | `object` | - | A map of `name` - `description` pairs of data fields to extract from the webpage |
| `mode` | `string` | `standard` | Mode of the extractor, `standard` or `precision`. For details, see [Precision mode](precision-mode.md) |

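Across all three tables, `attributes` is now documented as an `object`: a map of field names to descriptions rather than an array. A minimal sketch of a `/v1/parse` request body in that shape; the endpoint and parameters come from the tables above, while the field names are illustrative and authentication headers are omitted because they are not part of this diff:

```python
import requests  # any HTTP client works; requests is used only for illustration

# `attributes` is an object mapping field names to their descriptions,
# matching the `name` - `description` pairs described in the tables above.
payload = {
    "content": "<html><body><h1>Example product</h1><span>42 reviews</span></body></html>",
    "attributes": {
        "Title": "Main heading of the page",
        "Reviews": "Number of reviews",
    },
    "mode": "standard",
}

# Authentication headers are omitted here; they are not shown in this diff.
response = requests.post("https://api.parsera.org/v1/parse", json=payload)
print(response.json())
```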
docs/features/extractors.md

Lines changed: 31 additions & 0 deletions
@@ -54,6 +54,37 @@ def count_tokens(text):
scraper = Parsera(extractor=ExtractorType.CHUNKS_TABULAR, chunk_size=12000, token_counter=count_tokens)
```

+## Structured Extractor
+An extension of `ChunksTabularExtractor` that uses structured output to return values of the specified types.
+It is used by default inside `Parsera` when an extended elements schema is provided:
+```python
+from parsera import Parsera
+from parsera.engine.chunks_extractor import ChunksTabularExtractor
+
+url = "https://news.ycombinator.com/"
+elements = {
+    "Title": {
+        "description": "News title",
+        "type": "string",
+    },
+    "Points": {
+        "description": "Number of points",
+        "type": "integer",
+    },
+    "Comments": {
+        "description": "Number of comments",
+        "type": "integer",
+    },
+}
+
+extractor = StructuredExtractor()
+scraper = Parsera(extractor=extractor)  # With elements structured as above, this extractor is used by default.
+
+result = scraper.run(url=url, elements=elements)
+```
+
+Note: use this extractor only with models that support Structured Outputs.

## List Extractor
```python

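Since the note above restricts this extractor to models with Structured Outputs support, here is a minimal sketch of wiring such a model in, assuming `Parsera` accepts a LangChain chat model through the `model` argument as described on the Custom models page referenced in the nav change below; the specific model name is an assumption, not part of this diff:

```python
from langchain_openai import ChatOpenAI

from parsera import Parsera

# Assumption: the chosen model supports Structured Outputs and Parsera
# accepts a LangChain chat model via the `model` argument
# (see features/custom-models.md in the nav below).
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

elements = {
    "Title": {"description": "News title", "type": "string"},
    "Points": {"description": "Number of points", "type": "integer"},
}

scraper = Parsera(model=llm)
result = scraper.run(url="https://news.ycombinator.com/", elements=elements)
```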
docs/getting-started.md

Lines changed: 28 additions & 0 deletions
@@ -54,6 +54,34 @@ There is also `arun` async method available:
result = await scrapper.arun(url=url, elements=elements)
```

+## Specify output types
+
+You can specify the output types using the following schema:
+```python
+from parsera import Parsera
+
+url = "https://news.ycombinator.com/"
+elements = {
+    "Title": {
+        "description": "News title",
+        "type": "string",
+    },
+    "Points": {
+        "description": "Number of points",
+        "type": "integer",
+    },
+    "Comments": {
+        "description": "Number of comments",
+        "type": "integer",
+    },
+}
+
+scraper = Parsera()
+result = scraper.run(url=url, elements=elements)
+```
+
+When a schema with types is used, `Parsera` switches to the [Structured Extractor](/features/extractors/#structured-extractor).

## Running with CLI

Before you run `Parsera` as a command line tool, don't forget to put your `OPENAI_API_KEY` into environment variables or a `.env` file

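For contrast, a sketch of the plain schema without types: a flat `name` - `description` mapping, the same shape the API tables above describe for `attributes`. With this shape `Parsera` does not switch to the Structured Extractor; which extractor it uses by default is not shown in this diff:

```python
from parsera import Parsera

url = "https://news.ycombinator.com/"

# Flat `name` -> `description` mapping without types; per the section above,
# Parsera only switches to the Structured Extractor when types are given.
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

scraper = Parsera()
result = scraper.run(url=url, elements=elements)
print(result)
```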
mkdocs.yml

Lines changed: 3 additions & 3 deletions
@@ -27,13 +27,13 @@ nav:
  - Home: index.md
  - Getting started: getting-started.md
  - Features:
+    - Custom models: features/custom-models.md
+    - Extractors: features/extractors.md
+    - Proxy: features/proxy.md
     - Custom browser: features/custom-browser.md
     - Custom cookies: features/custom-cookies.md
     - Custom playwright: features/custom-playwright.md
-    - Custom models: features/custom-models.md
-    - Proxy: features/proxy.md
     - Scrolling: features/scrolling.md
-    - Extractors: features/extractors.md
     - Docker: features/docker.md
  - API:
    - Getting started: api/getting-started.md

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
[tool.poetry]
name = "parsera"
-version = "0.2.4"
+version = "0.2.5"
description = "Lightweight library for scraping web-sites with LLMs"
authors = ["Mikhail Zanka <raznem@gmail.com>", "Danila Paddubny <danilapoddubny26@gmail.com>"]
license = "GPL-2.0-or-later"
