DeepSheet

Project Overview

DeepSheet is an AI-driven tool for extracting structured data from PDF and HTML files, designed specifically for electronic datasheets. It applies OCR to scanned documents and uses LLMs to parse the extracted content.
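As a rough illustration of the flow described above, the sketch below routes a file to an extraction path by its extension and falls back to OCR when a PDF yields no embedded text. The function name `choose_extraction_path`, the path labels, and the fallback rule are all assumptions made for illustration; they are not taken from the DeepSheet source.

```python
from pathlib import Path

# Hypothetical labels for the extraction paths; not DeepSheet's actual names.
PDF_TEXT, PDF_OCR, HTML = "pdf_text", "pdf_ocr", "html"


def choose_extraction_path(filename: str, embedded_text: str = "") -> str:
    """Pick an extraction path from the file extension.

    `embedded_text` is whatever text a PDF parser recovered. An empty or
    whitespace-only value is treated as a scanned document that needs OCR.
    This rule is an assumption, not DeepSheet's documented behavior.
    """
    suffix = Path(filename).suffix.lower()
    if suffix in {".html", ".htm"}:
        return HTML
    if suffix == ".pdf":
        # A PDF with no text layer is assumed to be a scan.
        return PDF_TEXT if embedded_text.strip() else PDF_OCR
    raise ValueError(f"unsupported file type: {suffix or filename}")


if __name__ == "__main__":
    print(choose_extraction_path("regulator.pdf", "Output voltage: 1.25 V"))
    print(choose_extraction_path("scanned_sheet.pdf"))
    print(choose_extraction_path("part_page.html"))
```

In a pipeline like the one described, the LLM parsing step would presumably run on the text produced by whichever path is chosen.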

Installation

git clone <repo-url>
cd DeepSheet
pip install -r requirements.txt

Running the API

uvicorn src.api.server:app --host 0.0.0.0 --port 8000 --reload

Directory Structure

DeepSheet/
├── src/
│   ├── extraction/
│   │   ├── pdf_extractor.py   # PDF extraction logic
│   │   ├── html_extractor.py  # HTML extraction logic
│   │   ├── ocr_handler.py     # OCR processing for scanned PDFs
│   │   ├── llm_handler.py     # LLM-based extraction functions
│   │   └── __init__.py
│   ├── utils/
│   │   ├── file_utils.py      # File handling
│   │   ├── text_processing.py # Text processing utilities
│   │   ├── validation.py      # Data validation
│   │   ├── config.py          # Configuration settings
│   │   └── __init__.py
│   ├── api/
│   │   ├── server.py          # FastAPI server
│   │   ├── routes.py          # API endpoints
│   │   ├── models.py          # Data models
│   │   └── __init__.py
│   └── tests/
│       ├── test_pdf_extractor.py
│       ├── test_html_extractor.py
│       ├── test_ocr_handler.py
│       ├── test_llm_handler.py
│       └── __init__.py
├── notebooks/                 # Jupyter notebooks for testing
├── examples/                  # Sample PDFs and HTML files
├── data/                      # Extracted structured data
├── docs/                      # Documentation
├── .env.example               # Example environment variables file
├── .gitignore                 # Ignore unnecessary files
├── Dockerfile                 # Containerization setup
├── requirements.txt           # Dependencies
├── README.md                  # Project overview
├── setup.py                   # Package setup (optional)
└── LICENSE                    # MIT License file

Dependencies

fastapi
uvicorn
pdfplumber
beautifulsoup4
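tesseract (OCR engine, installed as a system package; Python code typically calls it through a wrapper such as pytesseract)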
openai
langchain
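
A quick way to confirm that the dependencies are importable after installation is sketched below. The import names are assumptions where they differ from the package names: `bs4` is the import name of beautifulsoup4, and `pytesseract` is assumed here as the Python wrapper for Tesseract, which the project's requirements may or may not actually use.

```python
from importlib.util import find_spec

# Import names assumed for the packages listed above.
IMPORT_NAMES = [
    "fastapi",
    "uvicorn",
    "pdfplumber",
    "bs4",          # import name of beautifulsoup4
    "pytesseract",  # assumed wrapper for the Tesseract engine
    "openai",
    "langchain",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(IMPORT_NAMES)
    print("missing:", ", ".join(missing) if missing else "none")
```

This only checks that the Python modules can be located; it does not verify that the Tesseract binary itself is installed on the system.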

API Endpoints

POST /extract/pdf/ - Uploads a PDF and extracts structured data.
POST /extract/html/ - Uploads an HTML file and extracts structured data.
GET /health - Checks API health status.
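
A minimal client sketch for the upload endpoint, using only the Python standard library, might look like the following. The form field name `file`, the base URL, and the helper names are assumptions; check src/api/routes.py for the field name the server actually expects.

```python
import urllib.request
import uuid

# Assumed base URL, matching port 8000 from the uvicorn command.
BASE_URL = "http://localhost:8000"


def build_multipart(field, filename, payload, content_type):
    """Encode a single file as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def build_pdf_request(filename, payload, base_url=BASE_URL):
    """Build (but do not send) a POST request for /extract/pdf/.

    The form field name "file" is an assumption, not confirmed by the README.
    """
    body, header_value = build_multipart(
        "file", filename, payload, "application/pdf"
    )
    return urllib.request.Request(
        f"{base_url}/extract/pdf/",
        data=body,
        headers={"Content-Type": header_value},
        method="POST",
    )
```

With the server running, passing the request returned by `build_pdf_request` to `urllib.request.urlopen` sends the upload, and `urllib.request.urlopen(BASE_URL + "/health")` exercises the health check. The response format is not documented here, so the sketch leaves parsing to the caller.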

Contribution Guidelines

Fork the repository.
Create a feature branch.
Submit a pull request.

License

This project is released under the MIT License (see the LICENSE file).

Repository: emreyesilyurt/deepsheet
