Commit 921defd

Merge pull request #191 from enoch3712/49-documentloaderconfig
49 documentloaderconfig
2 parents c89ca62 + e51887c commit 921defd

37 files changed

+3375
-599
lines changed
Lines changed: 67 additions & 49 deletions
Original file line number | Diff line number | Diff line change
@@ -1,74 +1,92 @@
11
# AWS Textract Document Loader
22

3-
> AWS Textract provides advanced OCR and document analysis capabilities, extracting text, forms, and tables from documents.
4-
5-
## Installation
6-
7-
Install the required dependencies:
8-
9-
```bash
10-
pip install boto3
11-
```
12-
13-
## Prerequisites
14-
15-
1. An AWS account
16-
2. AWS credentials with access to Textract service
17-
3. AWS region where Textract is available
3+
The AWS Textract loader uses Amazon's Textract service to extract text, forms, and tables from documents. It supports both image files and PDFs.
184

195
## Supported Formats
206

21-
- Images: jpeg/jpg, png, tiff
22-
- Documents: pdf
7+
- pdf
8+
- jpeg
9+
- png
10+
- tiff
2311

2412
## Usage
2513

14+
### Basic Usage
15+
2616
```python
2717
from extract_thinker import DocumentLoaderAWSTextract
2818

29-
# Initialize the loader with AWS credentials
19+
# Initialize with AWS credentials
3020
loader = DocumentLoaderAWSTextract(
31-
    aws_access_key_id="your-access-key",
32-
    aws_secret_access_key="your-secret-key",
33-
    region_name="your-region"
21+
    aws_access_key_id="your_access_key",
22+
    aws_secret_access_key="your_secret_key",
23+
    region_name="your_region"
3424
)
3525

36-
# Load document content
37-
result = loader.load_content_from_file("document.pdf")
38-
```
26+
# Load document
27+
pages = loader.load("path/to/your/document.pdf")
3928

40-
## Response Structure
29+
# Process extracted content
30+
for page in pages:
31+
    # Access text content
32+
    text = page["content"]
33+
    # Access tables if extracted
34+
    tables = page.get("tables", [])
35+
```
4136

42-
The loader returns a dictionary with the following structure:
37+
### Configuration-based Usage
4338

4439
```python
45-
{
46-
    "pages": [
47-
        {
48-
            "paragraphs": ["text content..."],
49-
            "lines": ["line1", "line2"],
50-
            "words": ["word1", "word2"]
51-
        }
52-
    ],
53-
    "tables": [
54-
        [["cell1", "cell2"], ["cell3", "cell4"]]
55-
    ],
56-
    "forms": [
57-
        {"key": "value"}
58-
    ],
59-
    "layout": {
60-
        # Document layout information
61-
    }
62-
}
40+
from extract_thinker import DocumentLoaderAWSTextract, TextractConfig
41+
42+
# Create configuration
43+
config = TextractConfig(
44+
    aws_access_key_id="your_access_key",
45+
    aws_secret_access_key="your_secret_key",
46+
    region_name="your_region",
47+
    feature_types=["TABLES", "FORMS", "SIGNATURES"],  # Specify features to extract
48+
    cache_ttl=600,  # Cache results for 10 minutes
49+
    max_retries=3  # Number of retry attempts
50+
)
51+
52+
# Initialize loader with configuration
53+
loader = DocumentLoaderAWSTextract(config)
54+
55+
# Load and process document
56+
pages = loader.load("path/to/your/document.pdf")
6357
```
6458

65-
## Supported Formats
59+
## Configuration Options
60+
61+
The `TextractConfig` class supports the following options:
6662

67-
`PDF`, `JPEG`, `PNG`
63+
| Option | Type | Default | Description |
64+
|--------|------|---------|-------------|
65+
| `content` | Any | None | Initial content to process |
66+
| `cache_ttl` | int | 300 | Cache time-to-live in seconds |
67+
| `aws_access_key_id` | str | None | AWS access key ID |
68+
| `aws_secret_access_key` | str | None | AWS secret access key |
69+
| `region_name` | str | None | AWS region name |
70+
| `textract_client` | boto3.client | None | Pre-configured Textract client |
71+
| `feature_types` | List[str] | [] | Features to extract (TABLES, FORMS, LAYOUT, SIGNATURES) |
72+
| `max_retries` | int | 3 | Maximum number of retry attempts |
6873

6974
## Features
7075

71-
- Text extraction with layout preservation
76+
- Text extraction from images and PDFs
7277
- Table detection and extraction
73-
- Support for multiple document formats
74-
- Automatic retries on API failures
78+
- Form field detection
79+
- Layout analysis
80+
- Signature detection
81+
- Configurable feature selection
82+
- Automatic retry on failure
83+
- Caching support
84+
- Support for pre-configured clients
85+
86+
## Notes
87+
88+
- Raw text extraction is the default when no feature types are specified
89+
- "QUERIES" feature type is not supported
90+
- Vision mode is supported for image formats
91+
- AWS credentials are required unless using a pre-configured client
92+
- Rate limits and quotas apply based on your AWS account
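
The notes above say that an empty `feature_types` list falls back to raw text extraction and that `QUERIES` is not supported. The following is a minimal sketch of that rule, not the loader's actual implementation: `validate_feature_types` and `extraction_mode` are hypothetical helper names, and the allowed set is taken only from the configuration table in this document.

```python
from typing import List

# Feature types listed in the TextractConfig options table.
SUPPORTED_FEATURES = {"TABLES", "FORMS", "LAYOUT", "SIGNATURES"}


def validate_feature_types(feature_types: List[str]) -> List[str]:
    """Hypothetical helper: reject anything outside the documented set."""
    unsupported = [f for f in feature_types if f not in SUPPORTED_FEATURES]
    if unsupported:
        raise ValueError(f"Unsupported feature types: {unsupported}")
    return list(feature_types)


def extraction_mode(feature_types: List[str]) -> str:
    """Hypothetical helper: 'raw_text' when no features are requested."""
    return "analysis" if validate_feature_types(feature_types) else "raw_text"
```

Under these assumptions, `extraction_mode([])` selects raw text, while passing `["QUERIES"]` raises a `ValueError` before any request would be sent.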
Lines changed: 61 additions & 24 deletions
@@ -1,34 +1,23 @@
1-
# Azure Document Intelligence Document Loader
1+
# Azure Document Intelligence Loader
22

3-
The Azure Document Intelligence loader (formerly known as Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and layout information from documents.
4-
5-
## Installation
6-
7-
Install the required dependencies:
8-
9-
```bash
10-
pip install azure-ai-formrecognizer
11-
```
12-
13-
## Prerequisites
14-
15-
1. An Azure subscription
16-
2. A Document Intelligence resource created in your Azure portal
17-
3. The endpoint URL and subscription key from your Azure resource
3+
The Azure Document Intelligence loader (formerly Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and structured information from documents.
184

195
## Supported Formats
206

217
Supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.
228

239
## Usage
2410

11+
### Basic Usage
12+
2513
```python
2614
from extract_thinker import DocumentLoaderAzureForm
2715

28-
# Initialize the loader
16+
# Initialize with Azure credentials
2917
loader = DocumentLoaderAzureForm(
30-
    subscription_key="your-subscription-key",
31-
    endpoint="your-endpoint-url"
18+
    endpoint="your_endpoint",
19+
    key="your_api_key",
20+
    model="prebuilt-document"  # Use prebuilt document model
3221
)
3322

3423
# Load document
@@ -38,14 +27,62 @@ pages = loader.load("path/to/your/document.pdf")
3827
for page in pages:
3928
    # Access text content
4029
    text = page["content"]
41-
42-
    # Access tables (if any)
43-
    tables = page["tables"]
30+
    # Access tables if available
31+
    tables = page.get("tables", [])
4432
```
4533

34+
### Configuration-based Usage
35+
36+
```python
37+
from extract_thinker import DocumentLoaderAzureForm, AzureConfig
38+
39+
# Create configuration
40+
config = AzureConfig(
41+
    endpoint="your_endpoint",
42+
    key="your_api_key",
43+
    model="prebuilt-read",  # Use the read model for text extraction
44+
    language="en",  # Specify document language
45+
    pages=[1, 2, 3],  # Process specific pages
46+
    cache_ttl=600  # Cache results for 10 minutes
47+
)
48+
49+
# Initialize loader with configuration
50+
loader = DocumentLoaderAzureForm(config)
51+
52+
# Load and process document
53+
pages = loader.load("path/to/your/document.pdf")
54+
```
55+
56+
## Configuration Options
57+
58+
The `AzureConfig` class supports the following options:
59+
60+
| Option | Type | Default | Description |
61+
|--------|------|---------|-------------|
62+
| `content` | Any | None | Initial content to process |
63+
| `cache_ttl` | int | 300 | Cache time-to-live in seconds |
64+
| `endpoint` | str | None | Azure endpoint URL |
65+
| `key` | str | None | Azure API key |
66+
| `model` | str | "prebuilt-document" | Model ID to use |
67+
| `language` | str | None | Document language code |
68+
| `pages` | List[int] | None | Specific pages to process |
69+
| `reading_order` | str | "natural" | Text reading order |
70+
4671
## Features
4772

4873
- Text extraction with layout preservation
4974
- Table detection and extraction
50-
- Support for multiple document formats
51-
- Automatic table content deduplication from text
75+
- Form field recognition
76+
- Multiple model support (document, layout, read)
77+
- Language specification
78+
- Page selection
79+
- Reading order control
80+
- Caching support
81+
- Support for pre-configured clients
82+
83+
## Notes
84+
85+
- Available models: "prebuilt-document", "prebuilt-layout", "prebuilt-read"
86+
- Vision mode is supported for image formats
87+
- Azure credentials are required
88+
- Rate limits and quotas apply based on your Azure subscription
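
The usage examples read `page["content"]` and `page.get("tables", [])` from each returned page. As a rough sketch of how that output might be consumed, the helper below joins the text and flattens the tables. `collect_content` and `sample_pages` are hypothetical, and the table shape (a list of rows of cell strings) is an assumption rather than something this diff specifies.

```python
from typing import Any, Dict, List, Tuple


def collect_content(pages: List[Dict[str, Any]]) -> Tuple[str, List[Any]]:
    """Join page text and gather every table, tolerating pages without tables."""
    texts = [page["content"] for page in pages]
    tables = [table for page in pages for table in page.get("tables", [])]
    return "\n\n".join(texts), tables


# Stand-in for the list returned by loader.load(...); the data is made up.
sample_pages = [
    {"content": "Invoice 001", "tables": [[["Item", "Qty"], ["Widget", "2"]]]},
    {"content": "Terms and conditions"},  # no "tables" key on this page
]

text, tables = collect_content(sample_pages)
```

Using `page.get("tables", [])` rather than `page["tables"]` matches the change in this commit and avoids a `KeyError` on pages where no table was detected.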
Lines changed: 51 additions & 14 deletions
@@ -1,14 +1,6 @@
1-
# Microsoft Word Document Loader (Doc2txt)
1+
# Doc2txt Document Loader
22

3-
The Doc2txt loader is designed to handle Microsoft Word documents (`.doc` and `.docx` files). It uses the `docx2txt` library to extract text content from Word documents.
4-
5-
## Installation
6-
7-
Install the required dependencies:
8-
9-
```bash
10-
pip install docx2txt
11-
```
3+
The Doc2txt loader extracts text from Microsoft Word documents. It supports both legacy (.doc) and modern (.docx) file formats.
124

135
## Supported Formats
146

@@ -17,10 +9,12 @@ pip install docx2txt
179

1810
## Usage
1911

12+
### Basic Usage
13+
2014
```python
2115
from extract_thinker import DocumentLoaderDoc2txt
2216

23-
# Initialize the loader
17+
# Initialize with default settings
2418
loader = DocumentLoaderDoc2txt()
2519

2620
# Load document
@@ -32,9 +26,52 @@ for page in pages:
3226
    text = page["content"]
3327
```
3428

29+
### Configuration-based Usage
30+
31+
```python
32+
from extract_thinker import DocumentLoaderDoc2txt, Doc2txtConfig
33+
34+
# Create configuration
35+
config = Doc2txtConfig(
36+
    page_separator="\n\n---\n\n",  # Custom page separator
37+
    preserve_whitespace=True,  # Preserve original whitespace
38+
    extract_images=True,  # Extract embedded images
39+
    cache_ttl=600  # Cache results for 10 minutes
40+
)
41+
42+
# Initialize loader with configuration
43+
loader = DocumentLoaderDoc2txt(config)
44+
45+
# Load and process document
46+
pages = loader.load("path/to/your/document.docx")
47+
```
48+
49+
## Configuration Options
50+
51+
The `Doc2txtConfig` class supports the following options:
52+
53+
| Option | Type | Default | Description |
54+
|--------|------|---------|-------------|
55+
| `content` | Any | None | Initial content to process |
56+
| `cache_ttl` | int | 300 | Cache time-to-live in seconds |
57+
| `page_separator` | str | "\n\n" | Text to use as page separator |
58+
| `preserve_whitespace` | bool | False | Whether to preserve whitespace |
59+
| `extract_images` | bool | False | Whether to extract embedded images |
60+
3561
## Features
3662

3763
- Text extraction from Word documents
38-
- Support for both .doc and .docx formats
39-
- Automatic page detection
40-
- Preserves basic text formatting
64+
- Support for both .doc and .docx
65+
- Custom page separation
66+
- Whitespace preservation
67+
- Image extraction (optional)
68+
- Caching support
69+
- No cloud service required
70+
71+
## Notes
72+
73+
- Vision mode is not supported
74+
- Image extraction requires additional memory
75+
- Local processing with no external service dependencies
76+
- May not preserve complex formatting
77+
- Handles both legacy and modern Word formats
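
The `page_separator` option is described only as "text to use as page separator". One plausible reading is that extracted text is split on that string into page dictionaries, which the sketch below illustrates. `split_pages` is a hypothetical function, and the real loader may instead use the separator when joining pages.

```python
from typing import Dict, List


def split_pages(text: str, page_separator: str = "\n\n") -> List[Dict[str, str]]:
    """Split extracted text on the separator, skipping empty chunks."""
    return [
        {"content": chunk}
        for chunk in text.split(page_separator)
        if chunk.strip()
    ]


# Made-up extracted text using the custom separator from the example config.
pages = split_pages("Page one\n\n---\n\nPage two", page_separator="\n\n---\n\n")
```

Each resulting item exposes a `"content"` key, matching how pages are read in the usage examples.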
