CLI tool to extract tables from PDFs using Camelot (default, free), Mistral OCR (Pixtral vision model), AWS Textract, or pdfplumber and convert them to machine-readable CSV files.
Dedicated to Alice Corona and Marco Corona, and to the entire onData community.
- Four extraction engines: Camelot (free, local, native PDFs), Mistral (schema-driven, scanned PDFs), AWS Textract (managed service), or pdfplumber (robust, works on both native and scanned PDFs)
- Extract tables from multi-page PDFs
- Support page selection (ranges or lists)
- Optional YAML schema for improved extraction accuracy (Mistral only)
- CSV output per page or merged into single file
- Configurable DPI and engine-specific options
Prerequisites: Python 3.8+.
Quick install (pip): pip install -U alice-pdf
Install globally from PyPI (choose one):
- pip install alice-pdf
- uv tool install alice-pdf (requires uv)
Upgrade to the latest release at any time:
pip install -U alice-pdf
# or
uv tool upgrade alice-pdf
For Camelot engine:
- Python 3.8+
- camelot-py library (included in install)
- Works with native PDFs (not scanned images)
For Mistral engine:
- Python 3.8+
- Mistral API key (https://console.mistral.ai/)
- Best for scanned PDFs and complex tables
For pdfplumber engine:
- Python 3.8+
- pdfplumber library (included in install)
- Works on both native and scanned PDFs
- Handles complex table structures better than Camelot
- Free and local extraction
For Textract engine:
- Python 3.8+
- AWS credentials with Textract permissions
- boto3 library (included in install)
Camelot (default, no setup needed):
No API key required! Just install and use.
Mistral:
Option 1 - Environment variables (recommended for uv run):
export MISTRAL_API_KEY="your-api-key"
Option 2 - CLI parameters (recommended for uv tool install):
alice-pdf input.pdf output/ --engine mistral --api-key "your-api-key"
# alias: --mistral-api-key
Option 3 - .env file (only works with uv run, not with uv tool install):
# Create .env file in project directory
echo 'MISTRAL_API_KEY="your-api-key"' > .env
uv run alice-pdf input.pdf output/ --engine mistral
Textract:
Option 1 - Environment variables (recommended):
export AWS_ACCESS_KEY_ID="your-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="eu-west-1"
Option 2 - CLI parameters:
alice-pdf input.pdf output/ --engine textract \
  --aws-region eu-west-1 \
  --aws-access-key-id "your-key-id" \
  --aws-secret-access-key "your-secret-key"
Costs: the Textract engine here uses only FeatureTypes=["TABLES"], which keeps the cost at about 0.015 USD per page. The FORMS feature (about 0.050 USD per page) is not enabled.
Note: .env file support is only available for Mistral and only when running with uv run.
For Textract, always use environment variables or CLI parameters.
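The per-page Textract figures quoted above translate into a quick budget check before running a large batch. The sketch below is only back-of-the-envelope arithmetic: the function name `estimate_textract_cost` is hypothetical (it is not part of alice-pdf), and the rates are the approximate values from this README, not an authoritative AWS price list.

```python
# Approximate rates taken from this README; real AWS pricing may differ
# by region and volume tier.
TABLES_USD_PER_PAGE = 0.015
FORMS_USD_PER_PAGE = 0.050  # not enabled by alice-pdf, shown for comparison


def estimate_textract_cost(pages: int, usd_per_page: float = TABLES_USD_PER_PAGE) -> float:
    """Return a rough cost estimate in USD, rounded to cents (hypothetical helper)."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return round(pages * usd_per_page, 2)


if __name__ == "__main__":
    # A 200-page PDF with TABLES only, then the same PDF at the FORMS rate.
    print(estimate_textract_cost(200))
    print(estimate_textract_cost(200, FORMS_USD_PER_PAGE))
```

Combining this with --pages to process only the pages that contain tables is the simplest way to keep the bill down.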
# Extract with Camelot (default, free, no API)
alice-pdf input.pdf output/
# Extract with Mistral (for scanned PDFs)
alice-pdf input.pdf output/ --engine mistral
# Extract with Textract
alice-pdf input.pdf output/ --engine textract --aws-region eu-west-1
# Extract with pdfplumber (robust, works on both native and scanned PDFs)
alice-pdf input.pdf output/ --engine pdfplumber
# Extract with pdfplumber with minimum table size constraints
alice-pdf input.pdf output/ --engine pdfplumber --pdfplumber-min-rows 2 --pdfplumber-min-cols 3
# Extract with Camelot (local, fast for native PDFs)
alice-pdf input.pdf output/ --engine camelot --camelot-flavor stream
# Camelot: fix for tables with merged cells
alice-pdf input.pdf output/ --engine camelot --camelot-split-text --merge
# Specific pages
alice-pdf input.pdf output/ --pages "1-3,5"
# Merge all tables into one CSV
alice-pdf input.pdf output/ --merge
# With table schema for better accuracy (Mistral only)
alice-pdf input.pdf output/ --engine mistral --schema table_schema.yaml
# Debug mode
alice-pdf input.pdf output/ --debug
Common:
- --engine {mistral,textract,camelot,pdfplumber}: Extraction engine (default: camelot)
- --pages: Pages to process (default: all). Examples: "1", "1-3", "1,3,5"
- --dpi: Image resolution (default: 150)
- -m, --merge: Merge all tables into single CSV
- --no-resume: Clear output and reprocess all pages
- -d, --debug: Enable debug logging
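The --pages values shown above ("1", "1-3", "1,3,5") mix single pages and ranges. As an illustration of how such a spec can be expanded, here is a minimal sketch. The function `parse_pages` is hypothetical and is not alice-pdf's actual parser; in particular, treating `None` as "all pages" and rejecting out-of-range pages are assumptions made for the example.

```python
from typing import List, Optional


def parse_pages(spec: Optional[str], total_pages: int) -> List[int]:
    """Expand a spec such as "1-3,5" into sorted, de-duplicated 1-based page numbers.

    Hypothetical helper: None stands in for the default (all pages).
    """
    if spec is None:
        return list(range(1, total_pages + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"invalid page selection: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("1-3,5", total_pages=10))  # [1, 2, 3, 5]
```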
Mistral-specific:
- --model: Mistral model (default: pixtral-12b-2409)
- --schema: Path to YAML/JSON schema file for custom prompt generation
- --prompt: Custom prompt (overrides --schema)
- --api-key: Mistral API key (alternative to env var)
- --timeout-ms: HTTP timeout in milliseconds (default: 60000)
Textract-specific:
- --aws-region: AWS region (or set AWS_DEFAULT_REGION)
- --aws-access-key-id: AWS access key (or set AWS_ACCESS_KEY_ID)
- --aws-secret-access-key: AWS secret key (or set AWS_SECRET_ACCESS_KEY)
Camelot-specific:
- --camelot-flavor {lattice,stream}: Extraction mode (default: lattice)
  - lattice: For tables with visible borders
  - stream: For tables without borders (whitespace-based)
- --camelot-split-text: Split text spanning multiple cells (useful for complex tables with merged cells)
pdfplumber-specific:
- --pdfplumber-min-rows: Minimum number of rows for table detection (default: 1)
- --pdfplumber-min-cols: Minimum number of columns for table detection (default: 1)
- --pdfplumber-strip-text/--no-pdfplumber-strip-text: Enable/disable whitespace stripping in extracted text (default: strip)
To improve extraction accuracy, create a YAML file describing the table structure:
name: "housing_properties"
description: "Housing properties table"
columns:
  - name: "PROPERTY"
    description: "Property owner name"
    examples:
      - "ATER DI VENEZIA"
      - "COMUNE DI VENEZIA"
  - name: "UNIT"
    description: "Housing unit number"
    examples:
      - "2950010"
      - "170"
notes:
  - "Keep columns separate"
  - "Do NOT merge adjacent cells"
  - "All rows should have exactly N columns"

Mistral engine:
- Converts PDF pages to raster images (150 DPI default)
- Sends images to Mistral API with structured prompt
- Mistral API (Pixtral) analyzes image and extracts tables as JSON
- Converts JSON to pandas DataFrame
- Saves CSV per page + optional merge
- Adds 'page' column for traceability
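The "structured prompt" step is where a table schema pays off: column names, descriptions, examples, and notes become explicit instructions for the vision model. The sketch below shows one plausible way to turn a schema into prompt text. It is not the code in prompt_generator.py: the function `build_prompt` and the prompt wording are assumptions, and the schema is written as a Python dict (mirroring the YAML example) so the snippet runs without a YAML library.

```python
def build_prompt(schema: dict) -> str:
    """Render a table schema as plain-text instructions (hypothetical helper)."""
    lines = [
        f"Extract the table '{schema['name']}' ({schema['description']}) as JSON.",
        "Columns, in order:",
    ]
    for column in schema.get("columns", []):
        line = f"- {column['name']}: {column['description']}"
        examples = column.get("examples", [])
        if examples:
            line += " (e.g. " + ", ".join(examples) + ")"
        lines.append(line)
    notes = schema.get("notes", [])
    if notes:
        lines.append("Rules:")
        lines.extend(f"- {note}" for note in notes)
    return "\n".join(lines)


# Same content as the YAML schema example in this README.
SCHEMA = {
    "name": "housing_properties",
    "description": "Housing properties table",
    "columns": [
        {"name": "PROPERTY", "description": "Property owner name",
         "examples": ["ATER DI VENEZIA", "COMUNE DI VENEZIA"]},
        {"name": "UNIT", "description": "Housing unit number",
         "examples": ["2950010", "170"]},
    ],
    "notes": ["Keep columns separate", "Do NOT merge adjacent cells"],
}

if __name__ == "__main__":
    print(build_prompt(SCHEMA))
```

Concrete examples per column tend to matter most, since they show the model what a well-formed cell value looks like.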
Progressive Timeout Retry:
When a page times out, the tool automatically retries with doubled timeouts:
- Attempt 1: 60 seconds (default timeout)
- Attempt 2: 120 seconds (2x timeout, if first attempt times out)
- Attempt 3: 240 seconds (4x timeout, if second attempt times out)
After 3 failed attempts, the page is skipped and processing continues with the next page. Non-timeout errors (authentication, rate limits, etc.) skip retry and move to the next page immediately.
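The retry policy above can be sketched in a few lines. This is an illustration of the described behaviour, not the tool's actual implementation: `PageTimeout`, `extract_with_retry`, and the `call(timeout_ms)` callback are hypothetical names introduced for the example.

```python
from typing import Callable, Optional


class PageTimeout(Exception):
    """Stand-in for whatever timeout error the HTTP client raises (assumption)."""


def extract_with_retry(
    call: Callable[[int], str],
    base_timeout_ms: int = 60000,
    max_attempts: int = 3,
) -> Optional[str]:
    """Call `call(timeout_ms)` with a timeout that doubles after each timeout.

    Returns the result, or None when the page should be skipped.
    """
    for attempt in range(max_attempts):
        timeout_ms = base_timeout_ms * (2 ** attempt)  # 60 s, 120 s, 240 s
        try:
            return call(timeout_ms)
        except PageTimeout:
            continue  # retry with a longer timeout
        except Exception:
            return None  # auth errors, rate limits, etc.: skip without retrying
    return None  # every attempt timed out: skip the page


if __name__ == "__main__":
    tried = []

    def slow_page(timeout_ms: int) -> str:
        # Simulated page that only succeeds with a 240 s budget.
        tried.append(timeout_ms)
        if timeout_ms < 240000:
            raise PageTimeout(f"no response within {timeout_ms} ms")
        return "table data"

    print(extract_with_retry(slow_page), tried)
```

Doubling only on timeouts avoids wasting time on errors that a longer wait cannot fix, such as an invalid API key.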
Textract engine:
- Converts PDF pages to raster images (150 DPI default)
- Sends images to AWS Textract API
- Textract analyzes document structure and extracts tables
- Converts Textract response to pandas DataFrame
- Saves CSV per page + optional merge
- Adds 'page' column for traceability
Note: Textract does not support schema/prompt customization. Use Mistral if you need custom prompts.
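To make the "converts Textract response" step more concrete: Textract returns a flat list of blocks, where a TABLE block points to its CELL blocks, and each CELL (with 1-based RowIndex and ColumnIndex) points to its WORD blocks through CHILD relationships. The sketch below rebuilds row lists from that structure. The function names are hypothetical and the sample blocks are hand-written for illustration, not real API output; merged cells and selection elements are ignored for brevity.

```python
from typing import Dict, List, Tuple


def child_ids(block: dict) -> List[str]:
    """Collect the Ids referenced by a block's CHILD relationships."""
    ids: List[str] = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            ids.extend(relationship["Ids"])
    return ids


def tables_from_blocks(blocks: List[dict]) -> List[List[List[str]]]:
    """Return one list of rows per TABLE block (hypothetical helper)."""
    by_id = {block["Id"]: block for block in blocks}
    tables = []
    for table in blocks:
        if table["BlockType"] != "TABLE":
            continue
        cells: Dict[Tuple[int, int], str] = {}
        for cell_id in child_ids(table):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [by_id[i]["Text"] for i in child_ids(cell)
                     if by_id[i]["BlockType"] == "WORD"]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        if not cells:
            continue
        n_rows = max(row for row, _ in cells)
        n_cols = max(col for _, col in cells)
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


def _cell(cell_id: str, row: int, col: int, word_ids: List[str]) -> dict:
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": word_ids}]}


# Hand-written sample mimicking the shape of a Textract response.
SAMPLE_BLOCKS = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    _cell("c1", 1, 1, ["w1"]), _cell("c2", 1, 2, ["w2"]),
    _cell("c3", 2, 1, ["w3", "w4", "w5"]), _cell("c4", 2, 2, ["w6"]),
    {"Id": "w1", "BlockType": "WORD", "Text": "PROPERTY"},
    {"Id": "w2", "BlockType": "WORD", "Text": "UNIT"},
    {"Id": "w3", "BlockType": "WORD", "Text": "COMUNE"},
    {"Id": "w4", "BlockType": "WORD", "Text": "DI"},
    {"Id": "w5", "BlockType": "WORD", "Text": "VENEZIA"},
    {"Id": "w6", "BlockType": "WORD", "Text": "170"},
]

if __name__ == "__main__":
    print(tables_from_blocks(SAMPLE_BLOCKS))
```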
Camelot engine:
- Reads native PDF structure (no image conversion needed)
- Detects tables using borders (lattice) or whitespace (stream)
- Converts to pandas DataFrame
- Saves CSV per page + optional merge
- Adds 'page' column for traceability
Best for: Native PDFs (not scanned) with clear table structure. Fast and free (local processing).
Each extracted table is saved as:
- {pdf_name}_page{N}_table{i}.csv: CSV per table
- {pdf_name}_merged.csv: All tables merged (if --merge)
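The naming pattern and the merge step can be illustrated with a small sketch. The tool itself works with pandas DataFrames; this example uses only the standard library `csv` module so it runs without extra dependencies. The helper names are hypothetical, and two details are assumptions: whether the table index `i` starts at 0 or 1, and where the 'page' column is placed (here it is appended last).

```python
import csv
import io
from typing import Dict, List, Tuple


def table_filename(pdf_name: str, page: int, table_index: int) -> str:
    """Build a per-table file name following the pattern above."""
    return f"{pdf_name}_page{page}_table{table_index}.csv"


def merge_tables(tables: List[Tuple[int, List[Dict[str, str]]]]) -> List[Dict[str, object]]:
    """Concatenate rows from all tables, tagging each row with its source page."""
    merged: List[Dict[str, object]] = []
    for page, rows in tables:
        for row in rows:
            merged.append({**row, "page": page})
    return merged


def to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize rows that share the same keys to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    tables = [
        (1, [{"PROPERTY": "ATER DI VENEZIA", "UNIT": "2950010"}]),
        (2, [{"PROPERTY": "COMUNE DI VENEZIA", "UNIT": "170"}]),
    ]
    print(table_filename("document", 1, 0))
    print(to_csv(merge_tables(tables)), end="")
```

Keeping the page number on every row means a suspicious value in the merged file can be traced back to the page it came from.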
alice-pdf document.pdf output/

alice-pdf document.pdf output/ \
  --engine mistral \
  --merge

alice-pdf document.pdf output/ \
  --engine pdfplumber \
  --pdfplumber-min-rows 2 \
  --pdfplumber-min-cols 3 \
  --merge

alice-pdf document.pdf output/ \
  --engine textract \
  --aws-region eu-west-1 \
  --merge

alice-pdf document.pdf output/ \
  --engine mistral \
  --schema table_schema.yaml \
  --pages "2-10" \
  --merge

alice-pdf document.pdf output/ \
  --dpi 300 \
  --debug

Use Mistral when:
- You need custom prompts or schema-driven extraction
- Tables have complex structure requiring specific instructions
- You want fine control over extraction behavior
Use Textract when:
- You need fast, reliable extraction on standard tables
- You prefer managed AWS infrastructure
- Schema customization is not required
Use Camelot when:
- PDF is native (not scanned)
- Tables have clear structure (borders or consistent spacing)
- You want local, free extraction (no API costs)
- Speed is critical for simple PDFs
Use pdfplumber when:
- PDF can be native or scanned
- Tables have complex structures or inconsistent borders
- You want robust local extraction (no API costs)
- Camelot fails to detect tables properly
alice-pdf/
├── alice_pdf/ # Main package source code
│ ├── cli.py # CLI entry point and argument parsing
│ ├── extractor.py # Mistral engine implementation
│ ├── textract_extractor.py # AWS Textract engine
│ ├── camelot_extractor.py # Camelot engine
│ ├── pdfplumber_extractor.py # pdfplumber engine
│ └── prompt_generator.py # YAML schema to prompt converter
├── docs/ # Documentation
│ └── best-practices.md # Comprehensive usage guide
├── sample/ # Example PDFs and schemas
│ ├── *.pdf # Sample PDF files for testing
│ └── *.yaml # Example table schemas
├── openspec/ # OpenSpec specifications
│ ├── AGENTS.md # Agent instructions
│ └── specs/ # Change proposals and documentation
├── tests/ # Unit tests
└── tmp/ # Temporary test outputs (gitignored)
Key directories:
- alice_pdf/: Core library code
- docs/: User guides and best practices
- sample/: Example files and schemas for testing
- openspec/: Project specifications using OpenSpec format
- tmp/: Temporary directory for test outputs (not tracked in git)
MIT License - Copyright (c) 2025 Andrea Borruso
