Commit 242a5f3

Merge pull request #86 from enoch3712/85-documentloader-aws-google-documentation

done

2 parents b3ca396 + f78cb2d commit 242a5f3

7 files changed: +324 −1 lines changed
# AWS Textract Document Loader

> AWS Textract provides advanced OCR and document analysis capabilities, extracting text, forms, and tables from documents.

## Prerequisites

You need AWS credentials with access to the Textract service:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_DEFAULT_REGION`

```python
%pip install --upgrade --quiet extract_thinker boto3
```
## Basic Usage

Here's a simple example of using the AWS Textract loader:

```python
import os

from extract_thinker import DocumentLoaderTextract

# Initialize the loader with credentials from the environment
loader = DocumentLoaderTextract(
    aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
    region_name=os.getenv('AWS_DEFAULT_REGION')
)

# Load document content
result = loader.load_content_from_file("document.pdf")
```
## Response Structure

The loader returns a dictionary with the following structure:

```python
{
    "pages": [
        {
            "paragraphs": ["text content..."],
            "lines": ["line1", "line2"],
            "words": ["word1", "word2"]
        }
    ],
    "tables": [
        [["cell1", "cell2"], ["cell3", "cell4"]]
    ],
    "forms": [
        {"key": "value"}
    ],
    "layout": {
        # Document layout information
    }
}
```
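A quick sketch of working with this structure: the sample `result` below is hypothetical, shaped like the dictionary above, so the snippet runs without calling Textract.

```python
# Hypothetical sample result, shaped like the structure above
result = {
    "pages": [{"paragraphs": ["Invoice #123"], "lines": ["Invoice #123"], "words": ["Invoice", "#123"]}],
    "tables": [[["Item", "Price"], ["Widget", "9.99"]]],
    "forms": [{"Total": "9.99"}],
    "layout": {},
}

# Join all paragraph text across pages
full_text = "\n".join(
    paragraph
    for page in result["pages"]
    for paragraph in page["paragraphs"]
)

# Render each table as rows of comma-separated cells
table_rows = [", ".join(row) for table in result["tables"] for row in table]

print(full_text)   # Invoice #123
print(table_rows)  # ['Item, Price', 'Widget, 9.99']
```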
## Best Practices

1. **Document Preparation**
    - Use high-quality scans
    - Supported formats: PDF, JPEG, PNG
    - Consider file size limits

2. **Performance**
    - Cache results when possible
    - Process pages individually for large documents
    - Monitor API quotas and costs

For more examples and implementation details, check out the [examples directory](examples/) in the repository.
# Google Document AI Document Loader

> Google Document AI transforms unstructured document data into structured, actionable insights using machine learning.

## Prerequisites

You need Google Cloud credentials and a Document AI processor:

- `DOCUMENTAI_GOOGLE_CREDENTIALS`
- `DOCUMENTAI_LOCATION`
- `DOCUMENTAI_PROCESSOR_NAME`

```python
%pip install --upgrade --quiet extract_thinker google-cloud-documentai
```
## Basic Usage

Here's a simple example of using the Google Document AI loader:

```python
import os

from extract_thinker import DocumentLoaderDocumentAI

# Initialize the loader with credentials from the environment
loader = DocumentLoaderDocumentAI(
    credentials=os.getenv("DOCUMENTAI_GOOGLE_CREDENTIALS"),
    location=os.getenv("DOCUMENTAI_LOCATION"),
    processor_name=os.getenv("DOCUMENTAI_PROCESSOR_NAME")
)

# Load CV/Resume content
content = loader.load_content_from_file("CV_Candidate.pdf")
```
## Response Structure

The loader returns a dictionary containing:

```python
{
    "pages": [
        {
            "content": "Full text content of the page",
            "paragraphs": ["Paragraph 1", "Paragraph 2"],
            "tables": [
                [
                    ["Header 1", "Header 2"],
                    ["Value 1", "Value 2"]
                ]
            ]
        }
    ]
}
```
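As a sketch of consuming this structure, the snippet below turns each table into a list of dicts keyed by its header row. The sample `content` dict is hypothetical, shaped like the structure above, so no Document AI call is needed.

```python
# Hypothetical sample content, shaped like the structure above
content = {
    "pages": [
        {
            "content": "Name: Ada Lovelace",
            "paragraphs": ["Name: Ada Lovelace"],
            "tables": [
                [
                    ["Header 1", "Header 2"],
                    ["Value 1", "Value 2"],
                ]
            ],
        }
    ]
}

# Turn each table into a list of dicts keyed by its header row
records = []
for page in content["pages"]:
    for table in page["tables"]:
        header, *rows = table
        records.extend(dict(zip(header, row)) for row in rows)

print(records)  # [{'Header 1': 'Value 1', 'Header 2': 'Value 2'}]
```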
## Processing Different Document Types

```python
# Process forms with tables
content = loader.load_content_from_file("form_with_tables.pdf")

# Process from stream
with open("document.pdf", "rb") as f:
    content = loader.load_content_from_stream(
        stream=f,
        mime_type="application/pdf"
    )
```
## Best Practices

1. **Document Types**
    - Use the appropriate processor for the document type
    - Ensure the correct MIME type for streams
    - Validate the content structure

2. **Performance**
    - Process in batches when possible
    - Cache results for repeated access
    - Monitor API quotas

Document AI supports PDF, TIFF, GIF, JPEG, and PNG with a maximum file size of 20MB or 2,000 pages.

For more examples and implementation details, check out the [examples directory](examples/) in the repository.
# PDF Plumber Document Loader

PDF Plumber is a Python library for extracting text and tables from PDFs. ExtractThinker's PDF Plumber loader provides a simple interface for working with this library.

## Basic Usage

Here's how to use the PDF Plumber loader:

```python
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderPdfPlumber

# Initialize the loader
loader = DocumentLoaderPdfPlumber()

# Load document content
result = loader.load_content_from_file("document.pdf")

# Access extracted content
text = result["text"]      # List of text content by page
tables = result["tables"]  # List of tables found in the document
```
## Features

- Text extraction with positioning
- Table detection and extraction
- Image location detection
- Character-level text properties

## Best Practices

1. **Document Preparation**
    - Ensure PDFs are not scanned images
    - Use well-structured PDFs
    - Check for text encoding issues

2. **Performance**
    - Process pages individually for large documents
    - Cache results for repeated access
    - Consider memory usage for large files
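Caching results for repeated access can be sketched with a small path-keyed cache. The `fake_load` function below is a hypothetical stand-in for the loader call, so the sketch runs without a PDF.

```python
# Minimal sketch of caching loader results by file path.
_cache = {}

def cached_load(path, load):
    """Return the cached result for `path`, loading it once on a miss."""
    if path not in _cache:
        _cache[path] = load(path)
    return _cache[path]

calls = []

def fake_load(path):
    # Hypothetical stand-in for loader.load_content_from_file
    calls.append(path)  # record each real load
    return {"text": [f"page 1 of {path}"], "tables": []}

first = cached_load("document.pdf", fake_load)
second = cached_load("document.pdf", fake_load)  # served from the cache

print(first is second)  # True
print(len(calls))       # 1
```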
For more examples and implementation details, check out the [examples directory](examples/) in the repository.
# PyPDF Document Loader

PyPDF is a pure-Python library for reading and writing PDFs. ExtractThinker's PyPDF loader provides a simple interface for text extraction.

## Basic Usage

Here's how to use the PyPDF loader:

```python
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderPyPdf

# Initialize the loader
loader = DocumentLoaderPyPdf()

# Load document content
content = loader.load_content_from_file("document.pdf")

# Access text content
text = content["text"]  # List of text content by page
```
## Features

- Basic text extraction
- Page-by-page processing
- Metadata extraction
- Low memory footprint

## Best Practices

1. **Document Handling**
    - Use for text-based PDFs
    - Consider alternatives for scanned documents
    - Check PDF version compatibility

2. **Performance**
    - Process large documents in chunks
    - Cache results when appropriate
    - Monitor memory usage
For more examples and implementation details, check out the [examples directory](examples/) in the repository.
# Spreadsheet Document Loader

The Spreadsheet loader in ExtractThinker handles Excel, CSV, and other tabular data formats.

## Basic Usage

Here's how to use the Spreadsheet loader:

```python
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderSpreadsheet

# Initialize the loader
loader = DocumentLoaderSpreadsheet()

# Load Excel file
excel_content = loader.load_content_from_file("data.xlsx")

# Load CSV file
csv_content = loader.load_content_from_file("data.csv")
```
## Features

- Excel file support (.xlsx, .xls)
- CSV file support
- Multiple sheet handling
- Data type preservation

## Best Practices

1. **Data Preparation**
    - Use consistent data formats
    - Clean data before processing
    - Handle missing values appropriately

2. **Performance**
    - Process large files in chunks
    - Use appropriate data types
    - Consider memory limitations
For more examples and implementation details, check out the [examples directory](examples/) in the repository.
# Web Document Loader

The Web loader in ExtractThinker uses BeautifulSoup to extract content from web pages and HTML documents.

## Basic Usage

Here's how to use the Web loader:

```python
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderBeautifulSoup

# Initialize the loader
loader = DocumentLoaderBeautifulSoup(
    header_handling="summarize"  # Options: summarize, extract, ignore
)

# Load content from URL
content = loader.load_content_from_file("https://example.com")

# Access extracted content
text = content["content"]
```
## Features

- HTML content extraction
- Header/footer handling
- Link extraction
- Image reference extraction

## Best Practices

1. **URL Handling**
    - Validate URLs before processing
    - Handle redirects appropriately
    - Respect robots.txt

2. **Content Processing**
    - Clean HTML before extraction
    - Handle different character encodings
    - Consider rate limiting for multiple URLs
For more examples and implementation details, check out the [examples directory](examples/) in the repository.

mkdocs.yml

Lines changed: 0 additions & 1 deletion

```diff
@@ -38,7 +38,6 @@ nav:
       - PyPDF: core-concepts/document-loaders/pypdf.md
       - Spreadsheet: core-concepts/document-loaders/spreadsheet.md
       - Web Loader: core-concepts/document-loaders/web-loader.md
-      - Docling: core-concepts/document-loaders/docling.md
       - Adobe PDF Services: '#'
       - ABBYY FineReader: '#'
       - PaddleOCR: '#'
```
