SFI Data Information Extractor

An LLM-powered document intelligence pipeline built during an internship at SFI Data. The system processes corporate financial PDFs — Second Party Opinion (SPO) reports, sustainability frameworks, and term sheets — and outputs structured, analysis-ready Excel files.

Problem Context

Financial analysts at SFI Data spent significant manual effort reading dense, inconsistently formatted corporate PDFs to extract a standard set of data fields. The documents were long (50–200+ pages), varied in layout, and mixed free-form narrative with embedded tables, a combination that existing PDF parsers handled poorly.

This pipeline automates both extraction pathways and reduces document processing to a single command.

System Architecture

The pipeline is split into two parallel tracks depending on data type:

Track 1 — Textual Data Pipeline

PDF(s)
  │
  ▼
extractor.py       ← PyPDF2-based text extraction per page
  │
  ▼
parser.py          ← Chunks text → builds vector DB → retrieves relevant chunks
  │                   per field → queries LLM with structured JSON prompt
  ▼
writer.py          ← Parses LLM JSON output → writes to Excel via openpyxl
  │
  ▼
output.xlsx

Track 2 — Tabular Data Pipeline

PDF(s)
  │
  ▼
table_extractor.py  ← LLMWhisperer API extracts tables with layout preservation
  │
  ▼
table_parser.py     ← Full table passed as context (no chunking — avoids
  │                    row/column boundary loss) → LLM extracts structured data
  ▼
table_writer.py     ← Writes structured output to Excel
  │
  ▼
output.xlsx

Why different chunking strategies for text vs. tables? Text fields are long and diffuse — chunking + vector retrieval focuses the LLM on the relevant passage without exceeding context limits. Tables are short but spatially structured: chunking destroys row-column relationships. For tables, the entire extracted content is passed as a single context block.
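The text-track strategy described above can be sketched roughly as follows. This is a minimal illustration and not the project's parser.py: the function names (`chunk_text`, `top_chunks`), the chunk size and overlap defaults, and the bag-of-words cosine similarity (a simple stand-in for a real embedding model and vector DB) are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def _vector(text: str) -> Counter:
    # Hypothetical stand-in for an embedding: lowercase token counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(chunks: list[str], query: str, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the field query."""
    q = _vector(query)
    ranked = sorted(chunks, key=lambda c: _cosine(_vector(c), q), reverse=True)
    return ranked[:k]
```

In this sketch, only the output of `top_chunks` would be placed in the LLM prompt for a text field, whereas a table would skip both functions and be passed whole.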

LLM Integration

Prompts are stored as structured JSON files in Prompts/, not hardcoded. This decouples field definitions from pipeline logic, making it straightforward to add new extraction targets without modifying code.
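As a rough illustration of how JSON-defined prompts keep field definitions out of the code, the sketch below loads a prompt file and builds one prompt per field. The schema (a flat JSON object mapping field name to instruction) and the helper names `load_field_prompts` and `build_prompt` are assumptions for the example; the actual files in Prompts/ may be structured differently.

```python
import json
from pathlib import Path


def load_field_prompts(path: Path) -> dict[str, str]:
    """Load a {field_name: instruction} mapping from a JSON prompt file.

    The flat-object schema is hypothetical, chosen only for illustration.
    """
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, dict):
        raise ValueError(f"{path} must contain a JSON object")
    return {str(name): str(instruction) for name, instruction in data.items()}


def build_prompt(field: str, instruction: str, context: str) -> str:
    """Combine one field definition with the retrieved context."""
    return (
        f"Extract the field '{field}'.\n"
        f"Instruction: {instruction}\n"
        "Return only valid JSON with no preamble.\n\n"
        f"Context:\n{context}"
    )
```

Under this assumed layout, adding a new extraction target means adding one key to the JSON file; the two functions stay unchanged.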

The system is provider-agnostic by design: it supports OpenAI, Gemini, and Groq interchangeably. This was intentional, because providers performed differently across document types during testing, and the internship scope required the flexibility to swap providers without refactoring.
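One common way to get this kind of provider swapping is a small registry of callables that share a single signature. The sketch below is an assumption about how such a layer could look, not the project's actual code: `register_provider`, `get_provider`, and the `echo` provider are hypothetical names, and no real OpenAI, Gemini, or Groq client is called.

```python
from typing import Callable

# A provider is any callable that takes a prompt and returns the model's raw text.
Provider = Callable[[str], str]

_PROVIDERS: dict[str, Provider] = {}


def register_provider(name: str, fn: Provider) -> None:
    """Register a provider under a case-insensitive name."""
    _PROVIDERS[name.lower()] = fn


def get_provider(name: str) -> Provider:
    """Look up a provider by name, failing loudly on unknown names."""
    try:
        return _PROVIDERS[name.lower()]
    except KeyError:
        known = sorted(_PROVIDERS)
        raise ValueError(f"unknown provider {name!r}; known: {known}") from None


def _echo_provider(prompt: str) -> str:
    # Hypothetical stand-in for a real client wrapper; it only reports the
    # prompt length so the example runs without network access or API keys.
    return '{"echo": "%d chars"}' % len(prompt)


register_provider("echo", _echo_provider)
```

A real wrapper for each vendor would be registered the same way, so the rest of the pipeline only ever calls `get_provider(name)(prompt)`.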

Prompts instruct the model to return only valid JSON with no preamble, which is then parsed and validated before writing to Excel. Malformed responses are caught and logged rather than silently written.
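The parse-validate-log step described above might look something like the following. It is a minimal sketch under stated assumptions: the function name `parse_llm_json` and the idea of a `required` field set are illustrative, and the project's own validation rules may be stricter or different.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def parse_llm_json(raw: str, required: set[str]) -> Optional[dict]:
    """Return the parsed object, or None (after logging) if it is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        logger.warning("Malformed LLM response: %s", exc)
        return None
    if not isinstance(data, dict):
        logger.warning("Expected a JSON object, got %s", type(data).__name__)
        return None
    missing = required - set(data)
    if missing:
        logger.warning("LLM response is missing fields: %s", sorted(missing))
        return None
    return data
```

Returning `None` lets the caller skip the Excel write for that field or document while the log keeps a record of what went wrong.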

Document Types Supported

| Type | Input | Prompt File |
|---|---|---|
| SPO + Framework | Two PDFs per company subfolder | `Prompts/prompts_spo_framework.json` |
| Term Sheets | One PDF per entry | `Prompts/prompts_term_sheet/` |

Setup

git clone https://github.com/CandyButcher27/SFI-Data-Project.git
cd SFI-Data-Project
python -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install -r requirements.txt

Set API keys:

export LLMWHISPERER_API_KEY='...'
export OPENAI_API_KEY='...'       # or GEMINI_API_KEY / GROQ_API_KEY

Run:

# SPO + Framework pipeline
cd Python_spo_framework && python main.py

# Term Sheet pipeline
cd Python_term_sheet && python main.py

Output: structured .xlsx files with one row per company/document.

Configuration Notes

- The LLMWhisperer endpoint is region-configurable (US: us-central, EU: eu-west).
- Each company's documents go in a subfolder under Main_spo_framework/, with the framework PDF first and the SPO PDF second.
- Prompts are the primary configuration surface: no code changes are needed to add new fields.
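The input folder convention can be illustrated with a short discovery routine. This is a sketch only: the helper name `find_company_pdfs` is hypothetical, and it assumes "first" and "second" mean alphabetical filename order, which the notes above do not state explicitly.

```python
from pathlib import Path


def find_company_pdfs(root: Path) -> dict[str, tuple[Path, Path]]:
    """Map each company subfolder name to its (framework_pdf, spo_pdf) pair.

    Assumption: within a subfolder, the framework PDF sorts before the
    SPO PDF by filename.
    """
    pairs: dict[str, tuple[Path, Path]] = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        pdfs = sorted(folder.glob("*.pdf"))
        if len(pdfs) != 2:
            raise ValueError(f"{folder.name}: expected 2 PDFs, found {len(pdfs)}")
        pairs[folder.name] = (pdfs[0], pdfs[1])
    return pairs
```

Failing on any subfolder that does not hold exactly two PDFs surfaces a misplaced file before any API calls are made.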

Documentation

Full API documentation is generated with Doxygen. To view it locally, open:

docs/html - Python_spo_framework/index.html
docs/html - Python_term_sheet/index.html

Regenerate:

cd Python_spo_framework && doxygen Doxyfile

Project Structure

SFI-Data-Project/
├── Python_spo_framework/    # Textual + tabular pipeline for SPO/framework docs
├── Python_term_sheet/       # Pipeline variant for term sheets
├── Main_spo_framework/      # Input PDFs (per-company subfolders)
├── Main_term_sheet/         # Input PDFs for term sheets
├── Prompts/                 # JSON prompt definitions (field extraction targets)
└── requirements.txt

Built During

Internship at SFI Data
