emreyesilyurt/deepsheet

DeepSheet

Project Overview

DeepSheet extracts structured data from PDF and HTML files and is designed specifically for electronic datasheets. It runs OCR on scanned documents and uses LLMs to parse the extracted content.
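The flow described above (read the document text, fall back to OCR for scans, then hand the text to an LLM) could be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's implementation: `extract_datasheet`, its three callable parameters, and the `min_chars` threshold are hypothetical names, and the stages are passed in as plain functions so the sketch runs without any OCR or LLM dependency.

```python
from typing import Callable, Dict


def extract_datasheet(
    path: str,
    extract_text: Callable[[str], str],
    run_ocr: Callable[[str], str],
    parse_with_llm: Callable[[str], Dict[str, str]],
    min_chars: int = 50,
) -> Dict[str, str]:
    """Hypothetical pipeline: embedded text first, OCR fallback, then LLM parsing."""
    text = extract_text(path)
    # A scanned document usually carries little or no embedded text, so a
    # near-empty result is treated as the signal to fall back to OCR.
    if len(text.strip()) < min_chars:
        text = run_ocr(path)
    # The parser is expected to turn free text into named fields.
    return parse_with_llm(text)
```

Passing the stages as callables keeps the control flow testable on its own. In a real setup they would presumably wrap the modules under `src/extraction/`.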

Installation

git clone <repo-url>
cd deepsheet
pip install -r requirements.txt

Running the API

uvicorn src.api.server:app --host 0.0.0.0 --port 8000 --reload
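Once the server is started with the command above, a script can wait for it to come up before sending work. The sketch below polls the `GET /health` endpoint using only the standard library. It assumes the server is reachable at `http://localhost:8000` and that a healthy server answers with HTTP 200. `wait_until_healthy` is a hypothetical helper, not part of the project.

```python
import time
import urllib.request


def wait_until_healthy(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 10.0,
    interval_s: float = 0.5,
) -> bool:
    """Poll GET /health until it returns HTTP 200 or the timeout expires."""
    url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s + 1) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP error responses
            # (urllib's URLError and HTTPError are OSError subclasses).
            pass
        time.sleep(interval_s)
    return False
```

Calling `wait_until_healthy()` returns `True` as soon as the health check succeeds, and `False` if the API never responds within the timeout.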

Directory Structure

DeepSheet/
│── src/
│   ├── extraction/
│   │   ├── pdf_extractor.py   # PDF extraction logic
│   │   ├── html_extractor.py  # HTML extraction logic
│   │   ├── ocr_handler.py     # OCR processing for scanned PDFs
│   │   ├── llm_handler.py     # LLM-based extraction functions
│   │   └── __init__.py
│   ├── utils/
│   │   ├── file_utils.py      # File handling
│   │   ├── text_processing.py # Text processing utilities
│   │   ├── validation.py      # Data validation
│   │   ├── config.py          # Configuration settings
│   │   └── __init__.py
│   ├── api/
│   │   ├── server.py          # FastAPI server
│   │   ├── routes.py          # API endpoints
│   │   ├── models.py          # Data models
│   │   └── __init__.py
│   ├── tests/
│   │   ├── test_pdf_extractor.py
│   │   ├── test_html_extractor.py
│   │   ├── test_ocr_handler.py
│   │   ├── test_llm_handler.py
│   │   └── __init__.py
│── notebooks/         # Jupyter notebooks for testing
│── examples/          # Sample PDFs and HTML files
│── data/              # Extracted structured data
│── docs/              # Documentation
│── .env.example       # Example environment variables file
│── .gitignore         # Ignore unnecessary files
│── Dockerfile         # Containerization setup
│── requirements.txt   # Dependencies
│── README.md          # Project overview
│── setup.py           # Package setup (optional)
│── LICENSE            # MIT License file
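The tree lists a `.env.example` file and a `config.py` module, which suggests that settings come from environment variables. A minimal sketch of that pattern is shown below. The variable names `LLM_MODEL` and `OCR_LANGUAGE`, the `Settings` class, and `load_settings` are assumptions made for illustration, and the actual contents of `config.py` may differ. `OPENAI_API_KEY` is the variable the `openai` client reads by default.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    llm_model: Optional[str]  # hypothetical setting; None means "use the default"
    ocr_language: str         # Tesseract language code, e.g. "eng"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables (names are assumptions)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        # Fail early instead of at the first LLM call.
        raise RuntimeError("OPENAI_API_KEY is not set")
    return Settings(
        openai_api_key=api_key,
        llm_model=env.get("LLM_MODEL"),
        ocr_language=env.get("OCR_LANGUAGE", "eng"),
    )
```

The function only reads variables that are already in the environment. Loading them from a `.env` file would need an extra step, for example a tool such as python-dotenv.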

Dependencies

fastapi
uvicorn
pdfplumber
beautifulsoup4
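tesseract (the OCR engine itself, installed at system level; Python code usually calls it through a wrapper such as pytesseract)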
openai
langchain

API Endpoints

  • POST /extract/pdf/ - Uploads a PDF and extracts structured data.
  • POST /extract/html/ - Uploads an HTML file and extracts structured data.
  • GET /health - Checks API health status.
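The upload endpoints listed above can be called from any HTTP client. The sketch below builds a `multipart/form-data` request with the standard library only. The form field name `file` is an assumption, because the source does not state which field the server expects, and `build_upload_request` is a hypothetical helper rather than part of the project.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_upload_request(
    base_url: str,
    endpoint: str,
    file_path: str,
    field_name: str = "file",  # assumed field name; adjust to the server's
) -> urllib.request.Request:
    """Build a multipart/form-data POST request carrying a single file."""
    path = Path(file_path)
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    # Part header, then the raw file bytes, then the closing boundary.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{path.name}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + endpoint,
        data=head + path.read_bytes() + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

To send a PDF, pass the result to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_upload_request("http://localhost:8000", "/extract/pdf/", "datasheet.pdf"))`. An HTML file works the same way with `/extract/html/`. The response format is not documented here, so inspect the body before relying on its shape.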

Contribution Guidelines

  1. Fork the repository.
  2. Create a feature branch.
  3. Submit a pull request.

License

MIT License
