DeepSheet extracts structured data from PDF and HTML files and is designed specifically for electronic datasheets. It applies OCR to scanned documents and uses LLMs to parse the extracted text into structured fields.
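The OCR step for scanned documents can be pictured as a fallback: try the PDF's embedded text layer first, and run OCR only when that layer is missing or nearly empty. The sketch below is illustrative and not taken from the repository's code. `extract_page_text`, `read_text`, `run_ocr`, and the `MIN_CHARS` threshold are hypothetical names and values; in practice the two callables might wrap a PDF text extractor and an OCR engine.

```python
from typing import Any, Callable

# Assumption: pages with fewer characters than this are treated as scanned images.
MIN_CHARS = 20


def extract_page_text(
    page: Any,
    read_text: Callable[[Any], Any],
    run_ocr: Callable[[Any], str],
    min_chars: int = MIN_CHARS,
) -> str:
    """Return the page's embedded text, falling back to OCR when it is too short."""
    # read_text may return None for image-only pages, so normalise to a string.
    text = (read_text(page) or "").strip()
    if len(text) >= min_chars:
        return text
    # Little or no embedded text: assume a scanned page and run OCR instead.
    return run_ocr(page).strip()
```

Passing the extractor and the OCR function in as arguments keeps the fallback decision separate from any particular library, which also makes it easy to test with stubs.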
git clone <repo-url>
cd DeepSheet
pip install -r requirements.txt

uvicorn src.api.server:app --host 0.0.0.0 --port 8000 --reload

DeepSheet/
│── src/
│ ├── extraction/
│ │ ├── pdf_extractor.py # PDF extraction logic
│ │ ├── html_extractor.py # HTML extraction logic
│ │ ├── ocr_handler.py # OCR processing for scanned PDFs
│ │ ├── llm_handler.py # LLM-based extraction functions
│ │ └── __init__.py
│ ├── utils/
│ │ ├── file_utils.py # File handling
│ │ ├── text_processing.py # Text processing utilities
│ │ ├── validation.py # Data validation
│ │ ├── config.py # Configuration settings
│ │ └── __init__.py
│ ├── api/
│ │ ├── server.py # FastAPI server
│ │ ├── routes.py # API endpoints
│ │ ├── models.py # Data models
│ │ └── __init__.py
│ ├── tests/
│ │ ├── test_pdf_extractor.py
│ │ ├── test_html_extractor.py
│ │ ├── test_ocr_handler.py
│ │ ├── test_llm_handler.py
│ │ └── __init__.py
│── notebooks/ # Jupyter notebooks for testing
│── examples/ # Sample PDFs and HTML files
│── data/ # Extracted structured data
│── docs/ # Documentation
│── .env.example # Example environment variables file
│── .gitignore # Ignore unnecessary files
│── Dockerfile # Containerization setup
│── requirements.txt # Dependencies
│── README.md # Project overview
│── setup.py # Package setup (optional)
│── LICENSE # MIT License file
fastapi
uvicorn
pdfplumber
beautifulsoup4
tesseract
openai
langchain

- POST /extract/pdf/: Uploads a PDF and extracts structured data.
- POST /extract/html/: Uploads an HTML file and extracts structured data.
- GET /health: Checks API health status.
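A minimal client sketch for the endpoints listed above, using only the Python standard library. Several details are assumptions rather than documented behaviour: the base URL follows the port in the uvicorn command, the upload is sent as multipart/form-data under a form field named `file`, the responses are assumed to be JSON, and `examples/sample.pdf` is a hypothetical file name.

```python
import json
import urllib.request
import uuid

# Assumption: the server runs locally on the port used in the uvicorn command.
BASE_URL = "http://localhost:8000"


def build_upload_request(path, filename, content, content_type, field="file"):
    """Build a multipart/form-data POST request carrying a single file.

    The form field name ("file") is an assumption about the API.
    """
    boundary = uuid.uuid4().hex
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"'.encode(),
        f"Content-Type: {content_type}".encode(),
        b"",  # blank line separates the part headers from the file bytes
        content,
        f"--{boundary}--".encode(),
        b"",
    ])
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def send(request):
    """Send a request and decode the response, assuming the API returns JSON."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


def demo():
    """Check the health endpoint, then upload a sample PDF (hypothetical path)."""
    print(send(urllib.request.Request(BASE_URL + "/health")))
    with open("examples/sample.pdf", "rb") as fh:
        request = build_upload_request(
            "/extract/pdf/", "sample.pdf", fh.read(), "application/pdf"
        )
    print(send(request))
```

With the server running, calling `demo()` exercises both endpoints. For HTML files, the same helper can be pointed at `/extract/html/` with a `text/html` content type.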
- Fork the repository.
- Create a feature branch.
- Submit a pull request.
MIT License