Commit 99d766b

Update README.md (#13)
SDK over REST API highlight
1 parent fa0fe14 commit 99d766b

1 file changed

+27
-15
lines changed

README.md

Lines changed: 27 additions & 15 deletions
@@ -2,14 +2,13 @@
22
![ci_status](https://github.com/landing-ai/agentic-doc/actions/workflows/ci_cd.yml/badge.svg)
33
[![PyPI version](https://badge.fury.io/py/agentic-doc.svg)](https://badge.fury.io/py/agentic-doc)
44

5+
# Agentic Document Extraction Python Library
56

6-
# agentic-doc
7+
The LandingAI [Agentic Document Extraction tool](https://va.landing.ai/demo/doc-extraction) extracts structured information from visually complex documents with text, tables, pictures, charts, and other information. The API returns the extracted data in a hierarchical format and pinpoints the exact location of each element.
78

8-
The LandingAI [Agentic Document Extraction](https://va.landing.ai/demo/doc-extraction) tool extracts structured information from visually complex documents with text, tables, pictures, charts, and other information. The API returns the extracted data in a hierarchical format and pinpoints the exact location of each element.
9+
This `agentic-doc` Python library wraps around the Agentic Document Extraction API to add more features and support to the document extraction process. For example, using this library allows you to process much longer documents.
910

10-
This `agentic-doc` Python library wraps around the [Agentic Document Extraction](https://va.landing.ai/demo/doc-extraction) API to add more features and support to the document extraction process. For example, using this library allows you to process much longer documents.
11-
12-
Learn more about the Agentic Document Extraction API [here](https://support.landing.ai/docs/document-extraction).
11+
For advanced users or for troubleshooting purposes, you can refer to the Agentic Document Extraction API [here](https://support.landing.ai/docs/document-extraction).
1312

1413
## Quick Start
1514

@@ -62,6 +61,12 @@ result_paths = parse_and_save_documents(file_paths, result_save_dir=result_save_
6261
# result_paths: ["path/to/save/results/document1_20250313_070305.json", "path/to/save/results/document2_20250313_070408.json"]
6362
```
6463

64+
## Why Use It?
65+
66+
- **Simplified Setup:** No need to manage API keys or handle low-level REST calls.
67+
- **Automatic Large File Processing:** Splits large PDFs into manageable parts and processes them in parallel.
68+
- **Built-In Error Handling:** Automatically retries requests with exponential backoff and jitter for common HTTP errors.
69+
- **Parallel Processing:** Efficiently parse multiple documents at once with configurable parallelism.
6570

6671
## Main Features
6772

@@ -74,20 +79,18 @@ This section describes some of the key features this library offers.
7479

7580
We've used this library to successfully parse PDFs that are 1000+ pages long.
7681

77-
7882
### Parse Multiple Files in a Batch
7983

8084
You can parse multiple files in a single function call with this library. The library processes files in parallel.
8185

82-
NOTE: You can change the parallelism by setting the `batch_size` setting.
86+
> **NOTE:** You can change the parallelism by setting the `batch_size` setting.
8387
8488
### Automatically Handle API Errors and Rate Limits with Retries
8589

8690
The REST API endpoint imposes rate limits per API key. This library automatically handles the rate limit error or other intermittent HTTP errors with retries.
8791

8892
For more information, see [Error Handling](#error-handling) and [Configuration Options](#configuration-options).
8993

90-
9194
### Error Handling
9295

9396
This library implements a retry mechanism for handling API failures:
@@ -105,7 +108,6 @@ If the REST API encounters an unrecoverable error during parsing, the library in
105108
Each error chunk contains the error message and corresponding page index.
106109
Error chunks can be identified in the `ParsedDocument` by checking for `chunk_type=ChunkType.error`.
107110

108-
109111
## Configuration Options
110112

111113
The library uses a [`Settings`](./agentic_doc/config.py) object to manage configuration. You can customize these settings either through environment variables or a `.env` file:
@@ -139,16 +141,14 @@ The optimal values for `MAX_WORKERS` and `BATCH_SIZE` depend on your API rate li
139141

140142
You can find your REST API latency in the logs. If you want to increase your rate limit, schedule a time to meet with us [here](https://scheduler.zoom.us/d/56i81uc2/landingai-document-extraction).
141143

142-
143144
### Set `RETRY_LOGGING_STYLE`
144145

145146
The `RETRY_LOGGING_STYLE` setting controls how the library logs the retry attempts.
146147

147148
- `log_msg`: Log the retry attempts as a log messages. Each attempt is logged as a separate message. This is the default setting.
148-
- `inline_block`: Print a yellow progress block ('█') on the same line. Each block represents one retry attempt. Choose this if you don't want to see the verbose retry logging message and still want to track the number of retries has been made.
149+
- `inline_block`: Print a yellow progress block ('█') on the same line. Each block represents one retry attempt. Choose this if you don't want to see the verbose retry logging message and still want to track the number of retries that have been made.
149150
- `none`: Do not log the retry attempts.
150151

151-
152152
## API Reference
153153

154154
### Main Functions
@@ -184,13 +184,13 @@ Parse a single document and optionally save results.
184184
- `file_path`: Path to document
185185
- `result_save_dir`: Optional directory to save results
186186
- **Returns:**
187-
- If `result_save_dir` provided: Path to saved result file
187+
- If `result_save_dir` provided: Path to saved result file
188188
- If no `result_save_dir`: ParsedDocument object
189189
- **Raises:**
190-
- `FileNotFoundError`: If input file doesn't exist
190+
- `FileNotFoundError`: If input file doesn't exist
191191
- `ValueError`: If file type is not supported
192192

193-
### Result Schema
193+
## Result Schema
194194

195195
#### ParsedDocument
196196

@@ -210,3 +210,15 @@ Represents a parsed content chunk with the following attributes:
210210
- `grounding`: list[Grounding] - List of content locations in document
211211
- `chunk_type`: Literal["text", "error"] - Type of chunk
212212
- `chunk_id`: Optional[str] - ID of the chunk
213+
214+
## Troubleshooting & FAQ
215+
216+
### Common Issues
217+
- **API Key Errors:**
218+
Ensure your API key is correctly set as an environment variable.
219+
- **Rate Limits:**
220+
The library automatically retries requests if you hit the API rate limit. Adjust `BATCH_SIZE` or `MAX_WORKERS` if you encounter frequent rate limit errors.
221+
- **Parsing Failures:**
222+
If a document fails to parse, an error chunk will be included in the result, detailing the error message and page index.
223+
224+
---
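
The "Built-In Error Handling" bullet added in this commit mentions retries with exponential backoff and jitter. As a rough illustration of that general technique (not the `agentic-doc` implementation, which this diff does not show), the sketch below uses only the standard library. The names `backoff_delays` and `call_with_retries`, the default values, and the choice of retriable exception types are all assumptions made for the example.

```python
import random
import time


def backoff_delays(max_retries, base=1.0, cap=60.0, rng=random.random):
    """Yield one sleep duration per retry: capped exponential backoff with full jitter.

    Hypothetical helper for illustration; not part of agentic-doc.
    """
    for attempt in range(max_retries):
        # The ceiling doubles on every attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random delay between 0 and the ceiling.
        yield rng() * ceiling


def call_with_retries(func, max_retries=5,
                      retriable=(ConnectionError, TimeoutError),
                      sleep=time.sleep, **backoff_kwargs):
    """Call `func`, retrying up to `max_retries` times on retriable errors."""
    delays = backoff_delays(max_retries, **backoff_kwargs)
    while True:
        try:
            return func()
        except retriable:
            delay = next(delays, None)
            if delay is None:
                # Retries exhausted: re-raise the last error to the caller.
                raise
            sleep(delay)
```

The jitter spreads retries from parallel workers over time, so they are less likely to hit a rate-limited endpoint at the same moment. Passing `sleep` and `rng` as parameters keeps the sketch easy to test without real waiting.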
