ParseForge: Automating Text Extraction from PDFs and URLs

Introduction

This project demonstrates an extraction tool that pulls structured information from unstructured sources such as PDFs and web pages. Users can run open-source and enterprise-grade parsers on the same input and compare their efficiency, accuracy, and feasibility.

Key Features

- Dual Processing Modes:
  - 🐍 Open-Source Stack (PyMuPDF, BeautifulSoup, Docling).
  - 🚀 Enterprise Solutions (LlamaParser, Firecrawl).
- Multi-Format Support: Extracts text, images, tables, and metadata from PDFs and web pages.
- Smart Output Options: Choose between Markdown-only or bundled ZIP files containing multiple components.
- Cloud Integration: Uses AWS S3 for secure storage of raw inputs and processed outputs.
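
The dual-mode selection described above can be pictured as a small lookup from processing mode and input type to the parser libraries named in the feature list. This is a minimal sketch, not ParseForge's actual routing: the `PARSERS` table, the pairing of each library with PDFs or web pages, and the function names `detect_source_type` and `candidate_parsers` are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Assumed mapping of (mode, source type) to the parser libraries listed above.
# Which library handles PDFs versus web pages is an assumption based on what
# each library is commonly used for, not ParseForge's real configuration.
PARSERS: Dict[Tuple[str, str], Tuple[str, ...]] = {
    ("open-source", "pdf"): ("PyMuPDF", "Docling"),
    ("open-source", "url"): ("BeautifulSoup",),
    ("enterprise", "pdf"): ("LlamaParser",),
    ("enterprise", "url"): ("Firecrawl",),
}


def detect_source_type(source: str) -> str:
    """Classify an input as 'url' or 'pdf' from its prefix or file extension."""
    lowered = source.strip().lower()
    # Anything that looks like a web address is treated as a URL,
    # even if the address itself ends in .pdf.
    if lowered.startswith(("http://", "https://")):
        return "url"
    if lowered.endswith(".pdf"):
        return "pdf"
    raise ValueError(f"Unsupported source: {source!r}")


def candidate_parsers(mode: str, source: str) -> Tuple[str, ...]:
    """Return the parser names that could handle this source in this mode."""
    key = (mode, detect_source_type(source))
    if key not in PARSERS:
        raise ValueError(f"Unknown processing mode: {mode!r}")
    return PARSERS[key]


if __name__ == "__main__":
    print(candidate_parsers("open-source", "report.pdf"))
    print(candidate_parsers("enterprise", "https://example.com/article"))
```

Keeping the mapping in one table makes it easy to compare the two stacks on the same input, which is the comparison the project is built around.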

Initial Setup

Prerequisites

Install Python (>= 3.8) on your system.
Install Docker (for containerized deployment).
Set up an AWS account for S3 storage (if running locally).
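
Before installing, the first two prerequisites can be checked with a short script. This is a hedged sketch that uses only the standard library; the function name `check_prerequisites` is hypothetical, and the Docker check only confirms that a `docker` executable is on the `PATH`, not that the daemon is running.

```python
import shutil
import sys
from typing import Dict

# Minimum Python version stated in the prerequisites.
MIN_PYTHON = (3, 8)


def check_prerequisites() -> Dict[str, bool]:
    """Report whether the local Python and Docker prerequisites look satisfied."""
    return {
        # Compare the running interpreter against the documented minimum.
        "python": sys.version_info >= MIN_PYTHON,
        # shutil.which returns None when no docker executable is on PATH.
        "docker": shutil.which("docker") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```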

Installation

Clone the repository:

git clone https://github.com/your-repo/parse-forge.git
cd parse-forge

Install dependencies:

pip install -r requirements.txt

Set up environment variables by creating a .env file in the root directory (see sample below).
Run the application:

Frontend (Streamlit):
```
streamlit run app.py
```
Backend (FastAPI):
```
uvicorn api:app --reload
```

Access the application:

Streamlit Frontend: http://localhost:8501
FastAPI Backend: http://localhost:8000/docs

Sample `.env` File

Create a .env file in the root directory with the following structure:

AWS Configuration

AWS_ACCESS_KEY_ID=<your_aws_access_key>
AWS_SECRET_ACCESS_KEY=<your_aws_secret_key>
AWS_REGION=<your_aws_region>
S3_BUCKET_NAME=<your_s3_bucket_name>

FastAPI Configuration

FASTAPI_URL=<link_to_FASTAPI>

Enterprise API Keys

LLAMAPARSER_API_KEY=<your_llamaparser_api_key>
FIRECRAWL_API_KEY=<your_firecrawl_api_key>

Replace <your_aws_access_key> and other placeholders with your actual credentials.
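
A quick way to catch a placeholder that was never filled in is to parse the file and list the keys that are still missing. The sketch below is an illustration under stated assumptions: it parses `KEY=VALUE` lines with the standard library rather than whatever loader the project actually uses, the helper names `parse_env_text` and `missing_keys` are hypothetical, and treating every key as required is an assumption (the enterprise API keys may only matter when the enterprise parsers are selected).

```python
from typing import Dict, List

# Keys taken from the sample .env file above. Treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "S3_BUCKET_NAME",
    "FASTAPI_URL",
    "LLAMAPARSER_API_KEY",
    "FIRECRAWL_API_KEY",
]


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return required keys that are absent, empty, or still <placeholders>."""
    missing = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(key)
    return missing


if __name__ == "__main__":
    with open(".env", encoding="utf-8") as handle:
        problems = missing_keys(parse_env_text(handle.read()))
    print("Missing or placeholder keys:", problems or "none")
```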

Links

Assignment 1 Links:

FastAPI Backend: https://nehadevarapalli-parseforge.hf.space/
Streamlit Frontend: https://parse-forge.streamlit.app
GitHub Project: https://github.com/users/nehadevarapalli/projects/2
Codelabs Documentation: https://codelabs-preview.appspot.com/?file_id=1SZHxAEETpt6-INannVHcy-WhCiZ-rmFsuChKF19sKO8#0
Demo Video: YouTube

How It Works

Upload a PDF or input a webpage URL.
Choose between open-source or enterprise parsers.
Select specific components to extract (e.g., text, images, tables).
Process the content and download the output as Markdown or a ZIP bundle.
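
The final step, returning either plain Markdown or a ZIP bundle, can be sketched with the standard library's `zipfile` module. This is an assumption-laden illustration rather than the project's implementation: the function name `bundle_output`, the archive entry `content.md`, and the example image path are all hypothetical.

```python
import io
import zipfile
from typing import Dict


def bundle_output(markdown: str, extras: Dict[str, bytes], as_zip: bool) -> bytes:
    """Return Markdown bytes, or a ZIP archive holding the Markdown plus extras.

    `extras` maps archive paths (for example extracted images or tables)
    to their raw bytes. It is ignored when `as_zip` is False.
    """
    if not as_zip:
        return markdown.encode("utf-8")

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Hypothetical file name for the main Markdown document.
        archive.writestr("content.md", markdown)
        for name, data in extras.items():
            archive.writestr(name, data)
    return buffer.getvalue()


if __name__ == "__main__":
    payload = bundle_output(
        "# Extracted text",
        {"images/page1.png": b"placeholder image bytes"},
        as_zip=True,
    )
    with open("output.zip", "wb") as handle:
        handle.write(payload)
```

Building the archive in memory keeps the function easy to call from a web backend, which can return the bytes directly or store them elsewhere.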
