Merged
42 changes: 27 additions & 15 deletions README.md
@@ -2,14 +2,13 @@
![ci_status](https://github.com/landing-ai/agentic-doc/actions/workflows/ci_cd.yml/badge.svg)
[![PyPI version](https://badge.fury.io/py/agentic-doc.svg)](https://badge.fury.io/py/agentic-doc)

# Agentic Document Extraction Python Library

# agentic-doc
The LandingAI [Agentic Document Extraction tool](https://va.landing.ai/demo/doc-extraction) extracts structured information from visually complex documents with text, tables, pictures, charts, and other information. The API returns the extracted data in a hierarchical format and pinpoints the exact location of each element.

The LandingAI [Agentic Document Extraction](https://va.landing.ai/demo/doc-extraction) tool extracts structured information from visually complex documents with text, tables, pictures, charts, and other information. The API returns the extracted data in a hierarchical format and pinpoints the exact location of each element.
This `agentic-doc` Python library wraps around the Agentic Document Extraction API to add more features and support to the document extraction process. For example, using this library allows you to process much longer documents.

This `agentic-doc` Python library wraps around the [Agentic Document Extraction](https://va.landing.ai/demo/doc-extraction) API to add more features and support to the document extraction process. For example, using this library allows you to process much longer documents.

Learn more about the Agentic Document Extraction API [here](https://support.landing.ai/docs/document-extraction).
For advanced users or for troubleshooting purposes, you can refer to the Agentic Document Extraction API [here](https://support.landing.ai/docs/document-extraction).

## Quick Start

@@ -62,6 +61,12 @@ result_paths = parse_and_save_documents(file_paths, result_save_dir=result_save_
# result_paths: ["path/to/save/results/document1_20250313_070305.json", "path/to/save/results/document2_20250313_070408.json"]
```

## Why Use It?

- **Simplified Setup:** Set your API key once as an environment variable; the library handles the low-level REST calls for you.
- **Automatic Large File Processing:** Splits large PDFs into manageable parts and processes them in parallel.
- **Built-In Error Handling:** Automatically retries requests with exponential backoff and jitter for common HTTP errors.
- **Parallel Processing:** Efficiently parse multiple documents at once with configurable parallelism.

## Main Features

@@ -74,20 +79,18 @@ This section describes some of the key features this library offers.

We've used this library to successfully parse PDFs that are 1000+ pages long.


### Parse Multiple Files in a Batch

You can parse multiple files in a single function call with this library. The library processes files in parallel.

NOTE: You can change the parallelism by setting the `batch_size` setting.
> **NOTE:** You can change the parallelism by setting the `batch_size` setting.
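
As a rough illustration of what a `batch_size`-style limit means, the sketch below parses several files concurrently with a bounded thread pool. It is not the library's implementation: `parse_one` and `parse_in_parallel` are hypothetical stand-ins, and the file names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_one(path: str) -> str:
    """Hypothetical stand-in for a single-document parse call."""
    return f"parsed:{path}"


def parse_in_parallel(paths: list[str], batch_size: int = 4) -> list[str]:
    """Parse up to `batch_size` files at the same time, keeping input order."""
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        # Executor.map yields results in the order of `paths`,
        # even when individual tasks finish out of order.
        return list(pool.map(parse_one, paths))


results = parse_in_parallel(["a.pdf", "b.pdf", "c.pdf"], batch_size=2)
```

A smaller limit sends fewer concurrent requests, which is the usual lever when rate-limit errors become frequent.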

### Automatically Handle API Errors and Rate Limits with Retries

The REST API endpoint imposes rate limits per API key. This library automatically retries on rate-limit errors and other intermittent HTTP errors.

For more information, see [Error Handling](#error-handling) and [Configuration Options](#configuration-options).
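
To make the retry behavior concrete, here is a minimal sketch of retrying with exponential backoff and jitter. It is a generic illustration under assumed parameters, not the library's actual code: `TransientError`, `backoff_delay`, `call_with_retries`, and all default values are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as a rate-limit response."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  max_jitter: float = 1.0) -> float:
    """Seconds to wait before retry `attempt` (0-based): capped exponential plus random jitter."""
    return min(cap, base * 2**attempt) + random.uniform(0.0, max_jitter)


def call_with_retries(func: Callable[[], T], max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `func`, retrying on TransientError up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

The jitter spreads out retries from parallel workers so they do not all hit the endpoint again at the same instant.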


### Error Handling

This library implements a retry mechanism for handling API failures:
@@ -105,7 +108,6 @@ If the REST API encounters an unrecoverable error during parsing, the library in
Each error chunk contains the error message and corresponding page index.
Error chunks can be identified in the `ParsedDocument` by checking for `chunk_type=ChunkType.error`.
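
The check described above can be sketched as follows. The classes here are simplified stand-ins so the example runs on its own; only `ChunkType.error` and `chunk_type` come from the text above, while the `chunks` attribute and the `page` field are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ChunkType(str, Enum):
    text = "text"
    error = "error"


@dataclass
class Chunk:
    text: str              # for an error chunk, the error message
    chunk_type: ChunkType
    page: int              # hypothetical field standing in for the page index


@dataclass
class ParsedDocument:
    chunks: list[Chunk] = field(default_factory=list)  # assumed attribute name


def find_error_chunks(doc: ParsedDocument) -> list[Chunk]:
    """Return only the chunks that record a parsing failure."""
    return [c for c in doc.chunks if c.chunk_type == ChunkType.error]


doc = ParsedDocument(chunks=[
    Chunk("Quarterly revenue table", ChunkType.text, page=0),
    Chunk("Upstream request timed out", ChunkType.error, page=1),
])
failed_pages = [c.page for c in find_error_chunks(doc)]
```

Collecting the failed pages this way makes it straightforward to re-run only the parts of a document that did not parse.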


## Configuration Options

The library uses a [`Settings`](./agentic_doc/config.py) object to manage configuration. You can customize these settings either through environment variables or a `.env` file:
@@ -139,16 +141,14 @@ The optimal values for `MAX_WORKERS` and `BATCH_SIZE` depend on your API rate li

You can find your REST API latency in the logs. If you want to increase your rate limit, schedule a time to meet with us [here](https://scheduler.zoom.us/d/56i81uc2/landingai-document-extraction).


### Set `RETRY_LOGGING_STYLE`

The `RETRY_LOGGING_STYLE` setting controls how the library logs the retry attempts.

- `log_msg`: Log the retry attempts as log messages. Each attempt is logged as a separate message. This is the default setting.
- `inline_block`: Print a yellow progress block ('█') on the same line. Each block represents one retry attempt. Choose this if you don't want to see the verbose retry logging message and still want to track the number of retries has been made.
- `inline_block`: Print a yellow progress block ('█') on the same line. Each block represents one retry attempt. Choose this if you don't want to see the verbose retry logging message and still want to track the number of retries that have been made.
- `none`: Do not log the retry attempts.
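
A minimal sketch of choosing one of these styles through an environment variable from Python. `set_retry_logging_style` is a hypothetical helper, not part of the library, and it assumes the variable is set before the library reads its settings.

```python
import os

# The three values documented above.
ALLOWED_RETRY_LOGGING_STYLES = {"log_msg", "inline_block", "none"}


def set_retry_logging_style(style: str) -> None:
    """Validate `style` against the documented values, then export it."""
    if style not in ALLOWED_RETRY_LOGGING_STYLES:
        raise ValueError(f"unsupported RETRY_LOGGING_STYLE: {style!r}")
    os.environ["RETRY_LOGGING_STYLE"] = style


set_retry_logging_style("inline_block")
```

Validating up front catches a misspelled value at the point where it is set rather than later, when the settings are loaded.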


## API Reference

### Main Functions
@@ -184,13 +184,13 @@ Parse a single document and optionally save results.
  - `file_path`: Path to document
  - `result_save_dir`: Optional directory to save results
- **Returns:**
  - If `result_save_dir` provided: Path to saved result file
  - If no `result_save_dir`: ParsedDocument object
- **Raises:**
  - `FileNotFoundError`: If input file doesn't exist
  - `ValueError`: If file type is not supported
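
One way a caller might handle the two return shapes and the two documented exceptions is sketched below. `try_parse` is a hypothetical wrapper: it takes the parse function as an argument rather than naming one, and it assumes that function accepts a path plus an optional `result_save_dir` keyword.

```python
from pathlib import Path
from typing import Any, Callable, Optional, Union


def try_parse(parse_fn: Callable[..., Any], file_path: Union[str, Path],
              result_save_dir: Optional[Union[str, Path]] = None) -> Optional[Any]:
    """Call `parse_fn` and turn the two documented exceptions into a None result."""
    try:
        if result_save_dir is None:
            # No save directory: the parsed document object comes back directly.
            return parse_fn(file_path)
        # With a save directory: the path of the saved result file comes back.
        return parse_fn(file_path, result_save_dir=result_save_dir)
    except FileNotFoundError:
        print(f"Input file not found: {file_path}")
    except ValueError as exc:
        print(f"Unsupported file type for {file_path}: {exc}")
    return None
```

Returning `None` on failure keeps a batch loop running; re-raising would be the better choice when a missing input should stop the whole job.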

### Result Schema
## Result Schema

#### ParsedDocument

@@ -210,3 +210,15 @@ Represents a parsed content chunk with the following attributes:
- `grounding`: list[Grounding] - List of content locations in document
- `chunk_type`: Literal["text", "error"] - Type of chunk
- `chunk_id`: Optional[str] - ID of the chunk

## Troubleshooting & FAQ

### Common Issues
- **API Key Errors:**
Ensure your API key is correctly set as an environment variable.
- **Rate Limits:**
The library automatically retries requests if you hit the API rate limit. Adjust `BATCH_SIZE` or `MAX_WORKERS` if you encounter frequent rate limit errors.
- **Parsing Failures:**
If a document fails to parse, an error chunk will be included in the result, detailing the error message and page index.

---