+The LandingAI [Agentic Document Extraction tool](https://va.landing.ai/demo/doc-extraction) extracts structured information from visually complex documents with text, tables, pictures, charts, and other information. The API returns the extracted data in a hierarchical format and pinpoints the exact location of each element.

-The LandingAI [Agentic Document Extraction](https://va.landing.ai/demo/doc-extraction) tool extracts structured information from visually complex documents with text, tables, pictures, charts, and other information. The API returns the extracted data in a hierarchical format and pinpoints the exact location of each element.
+This `agentic-doc` Python library wraps around the Agentic Document Extraction API to add more features and support to the document extraction process. For example, using this library allows you to process much longer documents.

-This `agentic-doc` Python library wraps around the [Agentic Document Extraction](https://va.landing.ai/demo/doc-extraction) API to add more features and support to the document extraction process. For example, using this library allows you to process much longer documents.
-
-Learn more about the Agentic Document Extraction API [here](https://support.landing.ai/docs/document-extraction).
+For advanced users or for troubleshooting purposes, you can refer to the Agentic Document Extraction API [here](https://support.landing.ai/docs/document-extraction).
+- **Simplified Setup:** No need to manage API keys or handle low-level REST calls.
+- **Automatic Large File Processing:** Splits large PDFs into manageable parts and processes them in parallel.
+- **Built-In Error Handling:** Automatically retries requests with exponential backoff and jitter for common HTTP errors.
+- **Parallel Processing:** Efficiently parse multiple documents at once with configurable parallelism.

 ## Main Features

@@ -74,20 +79,18 @@ This section describes some of the key features this library offers.

 We've used this library to successfully parse PDFs that are 1000+ pages long.
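
One way to picture how a very long PDF could be divided before parsing is to compute bounded page ranges. The sketch below is only an illustration: the helper name `split_page_ranges` and the 10-page part size are assumptions made for this example, not the library's actual implementation or limits.

```python
def split_page_ranges(total_pages: int, pages_per_part: int) -> list[tuple[int, int]]:
    """Return inclusive, zero-based (start, end) page ranges covering the document."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if pages_per_part <= 0:
        raise ValueError("pages_per_part must be positive")
    return [
        (start, min(start + pages_per_part, total_pages) - 1)
        for start in range(0, total_pages, pages_per_part)
    ]


# A hypothetical 1,024-page PDF split into parts of at most 10 pages each.
ranges = split_page_ranges(total_pages=1024, pages_per_part=10)
print(len(ranges))  # 103
print(ranges[0])    # (0, 9)
print(ranges[-1])   # (1020, 1023)
```

Each range could then be parsed independently and the results merged in page order.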

-
 ### Parse Multiple Files in a Batch

 You can parse multiple files in a single function call with this library. The library processes files in parallel.

-NOTE: You can change the parallelism by setting the `batch_size` setting.
+> **NOTE:** You can change the parallelism by setting the `batch_size` setting.

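
The parallel batch behavior described above can be sketched with the standard library alone. In this sketch `fake_parse` is a stand-in for a real parse call and `parse_batch` is a hypothetical helper. The only idea taken from the text is that a `batch_size` value caps how many files are processed at once.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_parse(file_path: str) -> str:
    """Stand-in for a real parse call; it only labels the input."""
    return f"parsed:{file_path}"


def parse_batch(file_paths: list[str], batch_size: int = 4) -> list[str]:
    """Process files in parallel, at most `batch_size` at a time, preserving input order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(fake_parse, file_paths))


results = parse_batch(["a.pdf", "b.pdf", "c.png"], batch_size=2)
print(results)  # ['parsed:a.pdf', 'parsed:b.pdf', 'parsed:c.png']
```

A thread pool suits this kind of workload because each task spends most of its time waiting on network I/O.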
 ### Automatically Handle API Errors and Rate Limits with Retries

 The REST API endpoint imposes rate limits per API key. This library automatically handles the rate limit error or other intermittent HTTP errors with retries.

 For more information, see [Error Handling](#error-handling) and [Configuration Options](#configuration-options).
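
As a rough illustration of retrying with exponential backoff and jitter, the following sketch wraps an arbitrary callable. It is not the library's retry code: `TransientError`, `call_with_retries`, and the default delays are all hypothetical names and values chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as HTTP 429 or 5xx."""


def call_with_retries(func, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call `func`, retrying transient failures with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the last error
            # The backoff ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random time in [0, ceiling].
            sleep(random.uniform(0, ceiling))


# Usage: a flaky call that succeeds on the third attempt.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("rate limited")
    return "ok"


# The sleep function is replaced with a no-op so the example runs instantly.
print(call_with_retries(flaky, sleep=lambda seconds: None))  # ok
```

Randomizing the wait keeps many clients that were rate limited at the same moment from retrying in lockstep.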

-
 ### Error Handling

 This library implements a retry mechanism for handling API failures:
@@ -105,7 +108,6 @@ If the REST API encounters an unrecoverable error during parsing, the library in
 Each error chunk contains the error message and corresponding page index.
 Error chunks can be identified in the `ParsedDocument` by checking for `chunk_type=ChunkType.error`.
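
The check described above can be sketched with stand-in types. The classes below only mimic the shape implied by the text. The field names `text`, `page`, and `chunks` are assumptions for this example, and real code would use the library's own `ParsedDocument` and `ChunkType` rather than these definitions.

```python
from dataclasses import dataclass, field
from enum import Enum


class ChunkType(str, Enum):
    """Stand-in for the library's chunk type; only these two values are assumed."""
    text = "text"
    error = "error"


@dataclass
class Chunk:
    """Stand-in chunk; `text` and `page` are illustrative field names."""
    text: str
    chunk_type: ChunkType
    page: int


@dataclass
class ParsedDocument:
    """Stand-in result object holding a list of chunks."""
    chunks: list[Chunk] = field(default_factory=list)


def error_chunks(doc: ParsedDocument) -> list[Chunk]:
    """Return only the chunks that record a parsing failure."""
    return [c for c in doc.chunks if c.chunk_type == ChunkType.error]


doc = ParsedDocument(chunks=[
    Chunk("Quarterly report", ChunkType.text, page=0),
    Chunk("Server error while parsing", ChunkType.error, page=1),
])
for chunk in error_chunks(doc):
    print(f"page {chunk.page}: {chunk.text}")  # page 1: Server error while parsing
```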

-
 ## Configuration Options

 The library uses a [`Settings`](./agentic_doc/config.py) object to manage configuration. You can customize these settings either through environment variables or a `.env` file:
@@ -139,16 +141,14 @@ The optimal values for `MAX_WORKERS` and `BATCH_SIZE` depend on your API rate li

 You can find your REST API latency in the logs. If you want to increase your rate limit, schedule a time to meet with us [here](https://scheduler.zoom.us/d/56i81uc2/landingai-document-extraction).

-
 ### Set `RETRY_LOGGING_STYLE`

 The `RETRY_LOGGING_STYLE` setting controls how the library logs the retry attempts.

 - `log_msg`: Log the retry attempts as log messages. Each attempt is logged as a separate message. This is the default setting.
-- `inline_block`: Print a yellow progress block ('█') on the same line. Each block represents one retry attempt. Choose this if you don't want to see the verbose retry logging message and still want to track the number of retries has been made.
+- `inline_block`: Print a yellow progress block ('█') on the same line. Each block represents one retry attempt. Choose this if you don't want to see the verbose retry logging message and still want to track the number of retries that have been made.
 - `none`: Do not log the retry attempts.
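
To make the three options concrete, here is a small sketch that reads and validates the value from the environment. The helper `read_retry_logging_style` is hypothetical and is not part of the library's `Settings` object. Only the variable name, the three allowed values, and the `log_msg` default come from the list above.

```python
import os

ALLOWED_STYLES = ("log_msg", "inline_block", "none")


def read_retry_logging_style(env=None) -> str:
    """Read RETRY_LOGGING_STYLE from a mapping, defaulting to `log_msg`."""
    env = os.environ if env is None else env
    style = env.get("RETRY_LOGGING_STYLE", "log_msg")
    if style not in ALLOWED_STYLES:
        raise ValueError(
            f"RETRY_LOGGING_STYLE must be one of {ALLOWED_STYLES}, got {style!r}"
        )
    return style


print(read_retry_logging_style({}))  # log_msg
print(read_retry_logging_style({"RETRY_LOGGING_STYLE": "inline_block"}))  # inline_block
```

Passing a plain dictionary instead of `os.environ` keeps the example free of side effects.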

-
 ## API Reference

 ### Main Functions
@@ -184,13 +184,13 @@ Parse a single document and optionally save results.
 - `file_path`: Path to document
 - `result_save_dir`: Optional directory to save results
 - **Returns:**
-- If `result_save_dir` provided: Path to saved result file
+- If `result_save_dir` provided: Path to saved result file
 - If no `result_save_dir`: ParsedDocument object
 - **Raises:**
-- `FileNotFoundError`: If input file doesn't exist
+- `FileNotFoundError`: If input file doesn't exist
 - `ValueError`: If file type is not supported

-###Result Schema
+## Result Schema

 #### ParsedDocument

@@ -210,3 +210,15 @@ Represents a parsed content chunk with the following attributes:
 - `grounding`: list[Grounding] - List of content locations in document
 - `chunk_type`: Literal["text", "error"] - Type of chunk
 - `chunk_id`: Optional[str] - ID of the chunk
+
+## Troubleshooting & FAQ
+
+### Common Issues
+- **API Key Errors:**
+Ensure your API key is correctly set as an environment variable.
+- **Rate Limits:**
+The library automatically retries requests if you hit the API rate limit. Adjust `BATCH_SIZE` or `MAX_WORKERS` if you encounter frequent rate limit errors.
+- **Parsing Failures:**
+If a document fails to parse, an error chunk will be included in the result, detailing the error message and page index.