Commit 242a5f3

Merge pull request #86 from enoch3712/85-documentloader-aws-google-documentation

done

2 parents b3ca396 + f78cb2d commit 242a5f3

7 files changed: +324 −1 lines changed
# AWS Textract Document Loader

> AWS Textract provides advanced OCR and document analysis capabilities, extracting text, forms, and tables from documents.

## Prerequisites

You need AWS credentials with access to the Textract service:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_DEFAULT_REGION`

```python
%pip install --upgrade --quiet extract_thinker boto3
```
## Basic Usage

Here's a simple example of using the AWS Textract loader:

```python
import os

from extract_thinker import DocumentLoaderTextract

# Initialize the loader with credentials from the environment
loader = DocumentLoaderTextract(
    aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
    region_name=os.getenv('AWS_DEFAULT_REGION')
)

# Load document content
result = loader.load_content_from_file("document.pdf")
```
## Response Structure

The loader returns a dictionary with the following structure:

```python
{
    "pages": [
        {
            "paragraphs": ["text content..."],
            "lines": ["line1", "line2"],
            "words": ["word1", "word2"]
        }
    ],
    "tables": [
        [["cell1", "cell2"], ["cell3", "cell4"]]
    ],
    "forms": [
        {"key": "value"}
    ],
    "layout": {
        # Document layout information
    }
}
```
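A quick sketch of working with this structure: the sample `result` below is hypothetical, shaped like the dictionary above, so the snippet runs without calling Textract.

```python
# Hypothetical sample result, shaped like the structure above
result = {
    "pages": [{"paragraphs": ["Invoice #123"], "lines": ["Invoice #123"], "words": ["Invoice", "#123"]}],
    "tables": [[["Item", "Price"], ["Widget", "9.99"]]],
    "forms": [{"Total": "9.99"}],
    "layout": {},
}

# Join all paragraph text across pages
full_text = "\n".join(
    paragraph
    for page in result["pages"]
    for paragraph in page["paragraphs"]
)

# Render each table as rows of comma-separated cells
table_rows = [", ".join(row) for table in result["tables"] for row in table]

print(full_text)   # Invoice #123
print(table_rows)  # ['Item, Price', 'Widget, 9.99']
```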
## Best Practices

1. **Document Preparation**
    - Use high-quality scans
    - Supported formats: PDF, JPEG, PNG
    - Consider file size limits

2. **Performance**
    - Cache results when possible
    - Process pages individually for large documents
    - Monitor API quotas and costs

For more examples and implementation details, check out the [examples directory](examples/) in the repository.
# Google Document AI Document Loader

> Google Document AI transforms unstructured document data into structured, actionable insights using machine learning.

## Prerequisites

You need Google Cloud credentials and a Document AI processor:

- `DOCUMENTAI_GOOGLE_CREDENTIALS`
- `DOCUMENTAI_LOCATION`
- `DOCUMENTAI_PROCESSOR_NAME`

```python
%pip install --upgrade --quiet extract_thinker google-cloud-documentai
```
## Basic Usage

Here's a simple example of using the Google Document AI loader:

```python
import os

from extract_thinker import DocumentLoaderDocumentAI

# Initialize the loader with credentials from the environment
loader = DocumentLoaderDocumentAI(
    credentials=os.getenv("DOCUMENTAI_GOOGLE_CREDENTIALS"),
    location=os.getenv("DOCUMENTAI_LOCATION"),
    processor_name=os.getenv("DOCUMENTAI_PROCESSOR_NAME")
)

# Load CV/Resume content
content = loader.load_content_from_file("CV_Candidate.pdf")
```
## Response Structure

The loader returns a dictionary containing:

```python
{
    "pages": [
        {
            "content": "Full text content of the page",
            "paragraphs": ["Paragraph 1", "Paragraph 2"],
            "tables": [
                [
                    ["Header 1", "Header 2"],
                    ["Value 1", "Value 2"]
                ]
            ]
        }
    ]
}
```
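As a sketch of consuming this structure, the snippet below turns each table into a list of dicts keyed by its header row. The sample `content` dict is hypothetical, shaped like the structure above, so no Document AI call is needed.

```python
# Hypothetical sample content, shaped like the structure above
content = {
    "pages": [
        {
            "content": "Name: Ada Lovelace",
            "paragraphs": ["Name: Ada Lovelace"],
            "tables": [
                [
                    ["Header 1", "Header 2"],
                    ["Value 1", "Value 2"],
                ]
            ],
        }
    ]
}

# Turn each table into a list of dicts keyed by its header row
records = []
for page in content["pages"]:
    for table in page["tables"]:
        header, *rows = table
        records.extend(dict(zip(header, row)) for row in rows)

print(records)  # [{'Header 1': 'Value 1', 'Header 2': 'Value 2'}]
```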
## Processing Different Document Types

```python
# Process forms with tables
content = loader.load_content_from_file("form_with_tables.pdf")

# Process from stream
with open("document.pdf", "rb") as f:
    content = loader.load_content_from_stream(
        stream=f,
        mime_type="application/pdf"
    )
```
## Best Practices

1. **Document Types**
    - Use the appropriate processor for the document type
    - Ensure the correct MIME type for streams
    - Validate the content structure

2. **Performance**
    - Process in batches when possible
    - Cache results for repeated access
    - Monitor API quotas

Document AI supports PDF, TIFF, GIF, JPEG, and PNG with a maximum file size of 20MB or 2,000 pages.

For more examples and implementation details, check out the [examples directory](examples/) in the repository.
# PDF Plumber Document Loader

PDF Plumber is a Python library for extracting text and tables from PDFs. ExtractThinker's PDF Plumber loader provides a simple interface for working with this library.

## Basic Usage

Here's how to use the PDF Plumber loader:

```python
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderPdfPlumber

# Initialize the loader
loader = DocumentLoaderPdfPlumber()

# Load document content
result = loader.load_content_from_file("document.pdf")

# Access extracted content
text = result["text"]      # List of text content by page
tables = result["tables"]  # List of tables found in the document
```
## Features

- Text extraction with positioning
- Table detection and extraction
- Image location detection
- Character-level text properties

## Best Practices

1. **Document Preparation**
    - Ensure PDFs are not scanned images
    - Use well-structured PDFs
    - Check for text encoding issues

2. **Performance**
    - Process pages individually for large documents
    - Cache results for repeated access
    - Consider memory usage for large files
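Caching results for repeated access can be sketched with a small path-keyed cache. The `fake_load` function below is a hypothetical stand-in for the loader call, so the sketch runs without a PDF.

```python
# Minimal sketch of caching loader results by file path.
_cache = {}

def cached_load(path, load):
    """Return the cached result for `path`, loading it once on a miss."""
    if path not in _cache:
        _cache[path] = load(path)
    return _cache[path]

calls = []

def fake_load(path):
    # Hypothetical stand-in for loader.load_content_from_file
    calls.append(path)  # record each real load
    return {"text": [f"page 1 of {path}"], "tables": []}

first = cached_load("document.pdf", fake_load)
second = cached_load("document.pdf", fake_load)  # served from the cache

print(first is second)  # True
print(len(calls))       # 1
```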
For more examples and implementation details, check out the [examples directory](examples/) in the repository.
# PyPDF Document Loader

PyPDF is a pure-Python library for reading and writing PDFs. ExtractThinker's PyPDF loader provides a simple interface for text extraction.

## Basic Usage

Here's how to use the PyPDF loader:

```python
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderPyPdf

# Initialize the loader
loader = DocumentLoaderPyPdf()

# Load document content
content = loader.load_content_from_file("document.pdf")

# Access text content
text = content["text"]  # List of text content by page
```
## Features

- Basic text extraction
- Page-by-page processing
- Metadata extraction
- Low memory footprint

## Best Practices

1. **Document Handling**
    - Use for text-based PDFs
    - Consider alternatives for scanned documents
    - Check PDF version compatibility

2. **Performance**
    - Process large documents in chunks
    - Cache results when appropriate
    - Monitor memory usage
For more examples and implementation details, check out the [examples directory](examples/) in the repository.
# Spreadsheet Document Loader

The Spreadsheet loader in ExtractThinker handles Excel, CSV, and other tabular data formats.

## Basic Usage

Here's how to use the Spreadsheet loader:

```python
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderSpreadsheet

# Initialize the loader
loader = DocumentLoaderSpreadsheet()

# Load Excel file
excel_content = loader.load_content_from_file("data.xlsx")

# Load CSV file
csv_content = loader.load_content_from_file("data.csv")
```
## Features

- Excel file support (.xlsx, .xls)
- CSV file support
- Multiple sheet handling
- Data type preservation

## Best Practices

1. **Data Preparation**
    - Use consistent data formats
    - Clean data before processing
    - Handle missing values appropriately

2. **Performance**
    - Process large files in chunks
    - Use appropriate data types
    - Consider memory limitations
For more examples and implementation details, check out the [examples directory](examples/) in the repository.
# Web Document Loader

The Web loader in ExtractThinker uses BeautifulSoup to extract content from web pages and HTML documents.

## Basic Usage

Here's how to use the Web loader:

```python
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderBeautifulSoup

# Initialize the loader
loader = DocumentLoaderBeautifulSoup(
    header_handling="summarize"  # Options: summarize, extract, ignore
)

# Load content from URL
content = loader.load_content_from_file("https://example.com")

# Access extracted content
text = content["content"]
```
## Features

- HTML content extraction
- Header/footer handling
- Link extraction
- Image reference extraction

## Best Practices

1. **URL Handling**
    - Validate URLs before processing
    - Handle redirects appropriately
    - Respect robots.txt

2. **Content Processing**
    - Clean HTML before extraction
    - Handle different character encodings
    - Consider rate limiting for multiple URLs
For more examples and implementation details, check out the [examples directory](examples/) in the repository.

mkdocs.yml

Lines changed: 0 additions & 1 deletion

```diff
@@ -38,7 +38,6 @@ nav:
       - PyPDF: core-concepts/document-loaders/pypdf.md
       - Spreadsheet: core-concepts/document-loaders/spreadsheet.md
       - Web Loader: core-concepts/document-loaders/web-loader.md
-      - Docling: core-concepts/document-loaders/docling.md
       - Adobe PDF Services: '#'
       - ABBYY FineReader: '#'
       - PaddleOCR: '#'
```
