
Commit 57c0e8c

Merge pull request #2 from enoch3712/extractThinker
Extract thinker 0.0.1
2 parents cb5d01d + 741b9aa commit 57c0e8c


64 files changed: 3,707 additions and 44 deletions

.flake8

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+[flake8]
+ignore = E501

.github/workflows/workflow.yml

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+name: Python package workflow
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
+
+jobs:
+  build-and-test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python
+        uses: actions/setup-python@v3
+        with:
+          python-version: '3.8'
+
+      - name: Install dependencies
+        run: |
+          pip install poetry
+          poetry install
+
+      - name: Run tests
+        run: poetry run pytest
+
+      - name: Build package
+        run: poetry build
+
+      - name: Publish to PyPI
+        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
+        uses: pypa/[email protected]
+        with:
+          user: __token__
+          password: ${{ secrets.PYPI_API_TOKEN }}
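The workflow's test step runs `poetry run pytest`. A minimal sketch of a test it would collect; `tests/test_contract.py` and the field values are hypothetical, and it assumes `Contract` behaves like a Pydantic model:

```python
# tests/test_contract.py -- hypothetical test module, for illustration only
from extract_thinker import Contract


class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str


def test_contract_fields():
    # Assumes Contract accepts keyword fields like a Pydantic model.
    invoice = InvoiceContract(invoice_number="INV-001", invoice_date="2024-05-01")
    assert invoice.invoice_number == "INV-001"
    assert invoice.invoice_date == "2024-05-01"
```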

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -162,3 +162,6 @@ Scripts/unstructured-ingest-script.py
 Scripts/unstructured-ingest.exe
 Scripts/uvicorn.exe
 Scripts/vba_extract.py
+
+# VSCode settings
+.vscode/

.pre-commit-config.yaml

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.1.7  # Ruff version
+    hooks:
+      - id: ruff  # Run the linter.
+        name: Run Linter Check (Ruff)
+        args: [ --fix ]
+        files: ^(extractthinker|tests|examples)/
+      - id: ruff-format  # Run the formatter.
+        name: Run Formatter (Ruff)
+  - repo: local
+    hooks:
+      - id: ci_type_mypy
+        name: Run Type Check (Mypy)
+        entry: >
+          bash -c 'set -o pipefail;
+          export CUSTOM_PACKAGES="extractthinker/_types/_alias.py extractthinker/cli/cli.py extractthinker/cli/files.py extractthinker/cli/usage.py extractthinker/exceptions.py" &&
+          export CUSTOM_FLAGS="--python-version=3.9 --color-output --no-pretty --follow-imports=skip" &&
+          curl -sSL https://raw.githubusercontent.com/gao-hongnan/omniverse/2fd5de1b8103e955cd5f022ab016b72fa901fa8f/scripts/devops/continuous-integration/type_mypy.sh |
+          bash'
+        language: system
+        types: [python]
+        pass_filenames: false
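The local hook streams a mypy runner script and type-checks the listed packages. A sketch of the kind of mismatch such a check surfaces (hypothetical snippet, not code from this commit):

```python
def parse_total(raw: str) -> float:
    return float(raw)


# mypy: Incompatible types in assignment (expression has type "float",
# variable has type "int")
total: int = parse_total("12.50")
```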

.ruff.toml

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+# Exclude a variety of commonly ignored directories.
+exclude = [
+    ".bzr",
+    ".direnv",
+    ".eggs",
+    ".git",
+    ".git-rewrite",
+    ".hg",
+    ".mypy_cache",
+    ".nox",
+    ".pants.d",
+    ".pytype",
+    ".ruff_cache",
+    ".svn",
+    ".tox",
+    ".venv",
+    "__pypackages__",
+    "_build",
+    "buck-out",
+    "build",
+    "dist",
+    "node_modules",
+    "venv",
+]
+
+# Same as Black.
+line-length = 88
+output-format = "grouped"
+
+target-version = "py39"
+
+[lint]
+select = [
+    # bugbear rules
+    "B",
+    # remove unused imports
+    "F401",
+    # bare except statements
+    "E722",
+    # unused arguments
+    "ARG",
+    # redefined variables
+    "ARG005",
+]
+ignore = [
+    # mutable defaults
+    "B006",
+    "B018",
+]
+
+unfixable = [
+    # disable auto fix for print statements
+    "T201",
+    "T203",
+]
+ignore-init-module-imports = true
+
+[extend-per-file-ignores]
+"instructor/distil.py" = ["ARG002"]
+"tests/test_distil.py" = ["ARG001"]
+"tests/test_patch.py" = ["ARG001"]
+"examples/task_planner/task_planner_topological_sort.py" = ["ARG002"]
+"examples/citation_with_extraction/main.py" = ["ARG001"]
+
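The selected rules map to concrete antipatterns. A sketch of code each selected rule would flag (illustrative only, not from this commit):

```python
import os  # F401: unused import (auto-removed when the linter runs with --fix)


def divide(a, unused):  # ARG001: unused function argument `unused`
    try:
        return 1 / a
    except:  # E722: bare except statement
        return None
```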

README.md

Lines changed: 111 additions & 43 deletions
@@ -1,65 +1,133 @@
-# Open-DocLLM
+<p align="center">
+  <img src="https://github.com/enoch3712/Open-DocLLM/assets/9283394/41d9d151-acb5-44da-9c10-0058f76c2512" alt="Extract Thinker Logo" width="200"/>
+</p>
+<p align="center">
+  <a href="https://medium.com/@enoch3712">
+    <img alt="Medium" src="https://img.shields.io/badge/Medium-12100E?style=flat&logo=medium&logoColor=white" />
+  </a>
+  <img alt="GitHub Last Commit" src="https://img.shields.io/github/last-commit/enoch3712/Open-DocLLM" />
+  <img alt="Github License" src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" />
+</p>

-## Introduction
-This project aims to tackle the challenges of data extraction and processing using OCR and LLM. It is inspired by JP Morgan's DocLLM but is fully open-source and offers a larger context window size. The project is divided into two parts: the OCR and LLM layer.
+# ExtractThinker

-![image](https://github.com/enoch3712/Open-DocLLM/assets/9283394/2612cc9e-fc66-401e-912d-3acaef42d9cc)
+Library to extract data from files and documents agnostically using LLMs. `extract_thinker` provides ORM-style interaction between files and LLMs, allowing for flexible and powerful document extraction workflows.

-## OCR Layer
-The OCR layer is responsible for reading all the content from a document. It involves the following steps:
+## Features

-1. **Convert pages to images**: Any type of file is converted into an image so that all the content in the document can be read.
+- Supports multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI.
+- Customizable extraction using contract definitions.
+- Asynchronous processing for efficient document handling.
+- Built-in support for various document formats.
+- ORM-style interaction between files and LLMs.

-2. **Preprocess image for OCR**: The image is adjusted to improve its quality and readability.
+<p align="center">
+  <img src="https://github.com/enoch3712/Open-DocLLM/assets/9283394/b1b8800c-3c55-4ee5-92fe-b8b663c7a81f" alt="Extract Thinker Features Diagram" width="300"/>
+</p>

-3. **Tesseract OCR**: The Tesseract OCR, the most popular open-source OCR in the world, is used to read the content from the images.
+## Installation

-## LLM Layer
-The LLM layer is responsible for extracting specific content from the document in a structured way. It involves defining an extraction contract and extracting the JSON data.
+To install `extract_thinker`, you can use `pip`:

-## Running Locally
-You can run the models on-premises using LLM studio or Ollama. This project uses LlamaIndex and Ollama.
+```bash
+pip install extract_thinker
+```

-## Running the Code
-The repo includes a FastAPI app with one endpoint for testing. Make sure to point to the proper Tesseract executable and change the key in the config.py file.
+## Usage
+Here's a quick example to get you started with `extract_thinker`. It demonstrates how to load a document using Tesseract OCR and extract specific fields defined in a contract.

-1. Install Tessaract
-https://github.com/tesseract-ocr/tesseract
+```python
+import os
+from dotenv import load_dotenv
+from extract_thinker import DocumentLoaderTesseract, Extractor, Contract

-2. Install the required Python packages.
-```sh
-pip install -r requirements.txt
-```
+load_dotenv()
+cwd = os.getcwd()

-3. Run fast api
-```sh
-uvicorn main:app --reload
-```
+class InvoiceContract(Contract):
+    invoice_number: str
+    invoice_date: str
+
+tesseract_path = os.getenv("TESSERACT_PATH")
+test_file_path = os.path.join(cwd, "test_images", "invoice.png")
+
+extractor = Extractor()
+extractor.load_document_loader(
+    DocumentLoaderTesseract(tesseract_path)
+)
+extractor.load_llm("claude-3-haiku-20240307")

-4. go to the Swgger page:
-http://localhost:8000/docs
+result = extractor.extract(test_file_path, InvoiceContract)

-## Running with Docker
-1. Build the Docker image.
-```sh
-docker build -t your-image-name .
+print("Invoice Number: ", result.invoice_number)
+print("Invoice Date: ", result.invoice_date)
```

-2. Run the Docker container.
-```sh
-docker run -p 8000:8000 your-image-name
+## Splitting Files Example
+You can also split and process documents using `extract_thinker`. Here's how:
+
+```python
+import os
+from dotenv import load_dotenv
+from extract_thinker import DocumentLoaderTesseract, Extractor, Process, Classification, ImageSplitter, Contract
+
+load_dotenv()
+
+class DriverLicense(Contract):
+    # Define your DriverLicense contract fields here
+    pass
+
+class InvoiceContract(Contract):
+    invoice_number: str
+    invoice_date: str
+
+extractor = Extractor()
+extractor.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
+extractor.load_llm("gpt-3.5-turbo")
+
+classifications = [
+    Classification(name="Driver License", description="This is a driver license", contract=DriverLicense, extractor=extractor),
+    Classification(name="Invoice", description="This is an invoice", contract=InvoiceContract, extractor=extractor)
+]
+
+process = Process()
+process.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
+process.load_splitter(ImageSplitter())
+
+path = "..."
+
+split_content = process.load_file(path)\
+    .split(classifications)\
+    .extract()
+
+# Process the split_content as needed
```

-3. go to the Swgger page:
-http://localhost:8000/docs
+## Infrastructure
+
+The `extract_thinker` project is inspired by the LangChain ecosystem, featuring a modular infrastructure with templates, components, and core functions to facilitate robust document extraction and processing.
+
+<p align="center">
+  <img src="https://github.com/enoch3712/Open-DocLLM/assets/9283394/996fb2de-0558-4f13-ab3d-7ea56a593951" alt="Extract Thinker Infrastructure Diagram" width="400"/>
+</p>
+
+## Why Not Just LangChain?
+While LangChain is a generalized framework designed for a wide array of use cases, `extract_thinker` is specifically focused on Intelligent Document Processing (IDP). Although achieving 100% accuracy in IDP remains a challenge, leveraging LLMs brings us significantly closer to that goal.
+
+## Additional Examples
+You can find more examples in the repository. They cover various use cases and demonstrate the flexibility of `extract_thinker`. The author's Medium also contains several examples using the library.

+## Contributing
+We welcome contributions from the community! If you would like to contribute, please follow these steps:

-## Advanced Cases: 1 Million token context
-The project also explores advanced cases like a 1 million token context using LLM Lingua and Mistral Yarn 128k context window.
+1. Fork the repository.
+2. Create a new branch for your feature or bugfix.
+3. Write tests for your changes.
+4. Run tests to ensure everything is working correctly.
+5. Submit a pull request with a description of your changes.

-## Conclusion
-The integration of OCR and LLM technologies in this project marks a pivotal advancement in analyzing unstructured data. The combination of open-source projects like Tesseract and Mistral makes a perfect implementation that could be used in an on-premise use case.
+## License
+This project is licensed under the Apache License 2.0. See the LICENSE file for more details.

-## References & Documents
-1. [DOCLLM: A LAYOUT-AWARE GENERATIVE LANGUAGE MODEL FOR MULTIMODAL DOCUMENT UNDERSTANDING](https://arxiv.org/pdf/2401.00908.pdf)
-2. [YaRN: Efficient Context Window Extension of Large Language Models](https://arxiv.org/pdf/2309.00071.pdf)
+## Contact
+For any questions or issues, please open an issue on the GitHub repository.
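The contracts in the README examples are declared like Pydantic models; assuming that holds, nested and richer schemas should also be expressible. A sketch with hypothetical field names (`total`, `lines`, and `LineItem` are illustrations, not part of the documented API):

```python
from typing import List

from extract_thinker import Contract


class LineItem(Contract):
    description: str  # hypothetical field, for illustration
    amount: float     # hypothetical field, for illustration


class DetailedInvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    total: float           # hypothetical field
    lines: List[LineItem]  # nested contract, assuming Pydantic-style models
```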

examples/extractor_basic.py

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+import os
+
+from dotenv import load_dotenv
+
+from extract_thinker import DocumentLoaderTesseract, Extractor, Contract
+
+load_dotenv()
+cwd = os.getcwd()
+
+
+class InvoiceContract(Contract):
+    invoice_number: str
+    invoice_date: str
+
+
+tesseract_path = os.getenv("TESSERACT_PATH")
+test_file_path = os.path.join(cwd, "tests", "test_images", "invoice.png")
+
+extractor = Extractor()
+extractor.load_document_loader(
+    DocumentLoaderTesseract(tesseract_path)
+)
+extractor.load_llm("claude-3-haiku-20240307")
+
+result = extractor.extract(test_file_path, InvoiceContract)
+
+if result is not None:
+    print("Extraction successful.")
+    # result is None on failure, so only read fields on success.
+    print("Invoice Number: ", result.invoice_number)
+    print("Invoice Date: ", result.invoice_date)
+else:
+    print("Extraction failed.")

extract_thinker/__init__.py

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+from .extractor import Extractor
+from .document_loader.document_loader import DocumentLoader
+from .document_loader.cached_document_loader import CachedDocumentLoader
+from .document_loader.document_loader_tesseract import DocumentLoaderTesseract
+from .models import classification, classification_response
+from .process import Process
+from .splitter import Splitter
+from .image_splitter import ImageSplitter
+from .models.classification import Classification
+from .models.contract import Contract
+
+
+__all__ = ['Extractor', 'DocumentLoader', 'CachedDocumentLoader', 'DocumentLoaderTesseract', 'classification', 'classification_response', 'Process', 'Splitter', 'ImageSplitter', 'Classification', 'Contract']
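A quick sanity check that the re-exports resolve as promised by `__all__`; a minimal sketch, assuming the package is installed:

```python
import extract_thinker

# Every name listed in __all__ should be importable from the package root.
for name in extract_thinker.__all__:
    assert hasattr(extract_thinker, name), f"missing export: {name}"
print("all exports present")
```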
