<p align="center">
  <img src="https://github.com/enoch3712/Open-DocLLM/assets/9283394/41d9d151-acb5-44da-9c10-0058f76c2512" alt="Extract Thinker Logo" width="200"/>
</p>
<p align="center">
<a href="https://medium.com/@enoch3712">
  <img alt="Medium" src="https://img.shields.io/badge/Medium-12100E?style=flat&logo=medium&logoColor=white" />
</a>
<img alt="GitHub Last Commit" src="https://img.shields.io/github/last-commit/enoch3712/Open-DocLLM" />
<img alt="GitHub License" src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" />
</p>

# ExtractThinker

Library to extract structured data from files and documents, agnostic of the document source, using LLMs. `extract_thinker` provides ORM-style interaction between files and LLMs, enabling flexible and powerful document extraction workflows.
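
The ORM-style idea (declare the fields you want as a contract, get back a typed object) can be sketched in plain Python. This is a conceptual illustration only, with a regex standing in for the LLM so the snippet runs on its own; it is not the library's implementation:

```python
import re
from dataclasses import dataclass

# Stand-in for an extract_thinker Contract: just a typed field declaration.
@dataclass
class InvoiceContract:
    invoice_number: str
    invoice_date: str

def extract(raw_text: str) -> InvoiceContract:
    # In extract_thinker an LLM fills the fields; a regex stands in here
    # so the example is self-contained and runnable.
    number = re.search(r"Invoice\s*#?\s*(\S+)", raw_text).group(1)
    date = re.search(r"Date:\s*([\d-]+)", raw_text).group(1)
    return InvoiceContract(invoice_number=number, invoice_date=date)

result = extract("Invoice #INV-001\nDate: 2024-05-01")
print(result.invoice_number)  # INV-001
```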

## Features

- Supports multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI.
- Customizable extraction using contract definitions.
- Asynchronous processing for efficient document handling.
- Built-in support for various document formats.
- ORM-style interaction between files and LLMs.

<p align="center">
  <img src="https://github.com/enoch3712/Open-DocLLM/assets/9283394/b1b8800c-3c55-4ee5-92fe-b8b663c7a81f" alt="Extract Thinker Features Diagram" width="300"/>
</p>

## Installation

To install `extract_thinker`, you can use `pip`:

```bash
pip install extract_thinker
```

## Usage

Here's a quick example to get you started with `extract_thinker`. It loads a document using Tesseract OCR and extracts the fields defined in a contract.

```python
import os
from dotenv import load_dotenv
from extract_thinker import DocumentLoaderTesseract, Extractor, Contract

load_dotenv()
cwd = os.getcwd()

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

tesseract_path = os.getenv("TESSERACT_PATH")
test_file_path = os.path.join(cwd, "test_images", "invoice.png")

extractor = Extractor()
extractor.load_document_loader(
    DocumentLoaderTesseract(tesseract_path)
)
extractor.load_llm("claude-3-haiku-20240307")

result = extractor.extract(test_file_path, InvoiceContract)

print("Invoice Number: ", result.invoice_number)
print("Invoice Date: ", result.invoice_date)
```

## Splitting Files Example

You can also split and process documents using `extract_thinker`. Here's how you can do it:

```python
import os
from dotenv import load_dotenv
from extract_thinker import DocumentLoaderTesseract, Extractor, Process, Classification, ImageSplitter, Contract

load_dotenv()

class DriverLicense(Contract):
    # Define your DriverLicense contract fields here
    pass

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

extractor = Extractor()
extractor.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
extractor.load_llm("gpt-3.5-turbo")

classifications = [
    Classification(name="Driver License", description="This is a driver license", contract=DriverLicense, extractor=extractor),
    Classification(name="Invoice", description="This is an invoice", contract=InvoiceContract, extractor=extractor)
]

process = Process()
process.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
process.load_splitter(ImageSplitter())

path = "..."

split_content = process.load_file(path)\
    .split(classifications)\
    .extract()

# Process the split_content as needed
```
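
What you do with `split_content` depends on your pipeline. A plain-Python sketch of dispatching items by contract type follows; the stand-in dataclasses are illustrative only, and the library's actual return shape may differ by version:

```python
from dataclasses import dataclass

# Stand-in result types; extract_thinker would return instances of the
# contracts passed in the classifications (exact shape is version-dependent).
@dataclass
class InvoiceContract:
    invoice_number: str
    invoice_date: str

@dataclass
class DriverLicense:
    license_number: str

def route(items):
    """Group extracted items by their contract type for downstream handling."""
    invoices, licenses = [], []
    for item in items:
        if isinstance(item, InvoiceContract):
            invoices.append(item)
        elif isinstance(item, DriverLicense):
            licenses.append(item)
    return invoices, licenses

invoices, licenses = route([
    InvoiceContract("INV-1", "2024-05-01"),
    DriverLicense("DL-99"),
])
print(len(invoices), len(licenses))  # 1 1
```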

## Infrastructure

The `extract_thinker` project is inspired by the LangChain ecosystem, featuring a modular infrastructure with templates, components, and core functions to facilitate robust document extraction and processing.

<p align="center">
  <img src="https://github.com/enoch3712/Open-DocLLM/assets/9283394/996fb2de-0558-4f13-ab3d-7ea56a593951" alt="Extract Thinker Infrastructure Diagram" width="400"/>
</p>

## Why Not Just LangChain?

While LangChain is a generalized framework designed for a wide array of use cases, `extract_thinker` is specifically focused on Intelligent Document Processing (IDP). Although achieving 100% accuracy in IDP remains a challenge, leveraging LLMs brings us significantly closer to that goal.

## Additional Examples

You can find more examples in the repository. They cover various use cases and demonstrate the flexibility of `extract_thinker`. Also check the author's Medium, which contains several articles about the library.

## Contributing

We welcome contributions from the community! If you would like to contribute, please follow these steps:

1. Fork the repository.
2. Create a new branch for your feature or bugfix.
3. Write tests for your changes.
4. Run the tests to ensure everything is working correctly.
5. Submit a pull request with a description of your changes.

## License

This project is licensed under the Apache License 2.0. See the LICENSE file for more details.

## Contact

For any questions or issues, please open an issue on the GitHub repository.