Universal File Processor with AI-Powered Capabilities
A Python package that provides unified access to PDF text extraction, image processing, audio transcription, and text summarization using AI models. Install it as a package with optional dependency extras to enable only the modules you need.
Extract text from PDF documents with streaming support for large files.
- Documentation: anyfile_to_ai/pdf_extractor/README.md
- Usage: CLI and Python API for text extraction
Process images with Vision Language Models to generate descriptive text.
- Documentation: anyfile_to_ai/image_processor/README.md
- Usage: CLI and Python API for AI-powered image description
Transcribe audio files using MLX-optimized Whisper models for Apple Silicon.
- Documentation: anyfile_to_ai/audio_processor/README.md
- Usage: CLI and Python API for audio-to-text transcription with multilingual support
Summarize text using LLM models with automatic language detection and intelligent chunking.
- Documentation: anyfile_to_ai/text_summarizer/README.md
- Usage: CLI and Python API for AI-powered text summarization with pipeline support
Convert a local file or an HTTP/HTTPS URL with deterministic backend routing across PDF, image, audio, and MarkItDown-backed document formats.
- Usage: CLI and Python API with a stable source/route/content output contract
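The deterministic routing described above can be pictured as a fixed extension-to-backend table; the following is only a sketch (`route_for` and the route names are hypothetical, the real rules live in anyfile_to_ai.document_converter):

```python
from pathlib import Path

# Illustrative extension-to-backend table; the actual routing rules
# live in anyfile_to_ai.document_converter.
ROUTES = {
    ".pdf": "pdf",
    ".png": "image",
    ".jpg": "image",
    ".mp3": "audio",
    ".wav": "audio",
    ".docx": "markitdown",
    ".xlsx": "markitdown",
}

def route_for(source: str) -> str:
    """Return the backend route for a local path or URL (hypothetical helper)."""
    # Strip a query string so URLs route by their path's extension.
    suffix = Path(source.split("?", 1)[0]).suffix.lower()
    try:
        return ROUTES[suffix]
    except KeyError:
        raise ValueError(f"unsupported format: {suffix!r}")

print(route_for("/tmp/report.docx"))  # markitdown
```

Because the table is a plain mapping, the same input always routes to the same backend, which is what makes the routing deterministic.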
Persistent task state storage for long-running operations with checkpoint-based resume capability.
- Documentation: anyfile_to_ai/task_manager/README.md
- Usage: Python API for task creation, checkpointing, and resume
pip install anyfile_to_ai

# PDF processing only
pip install anyfile_to_ai[pdf]
# Image processing only
pip install anyfile_to_ai[image]
# Audio transcription only
pip install anyfile_to_ai[audio]
# Text summarization only
pip install anyfile_to_ai[text]
# All modules
pip install anyfile_to_ai[all]

pip install anyfile_to_ai[dev]

# Extract text from PDF
pdf-extractor extract document.pdf --format json
# Extract with streaming for large files
pdf-extractor extract large-document.pdf --stream --progress

# Process images with AI description
image-processor photo.jpg --style detailed
# Batch process multiple images
image-processor *.jpg --style brief --format json

# Transcribe audio file
audio-processor podcast.mp3 --format json --verbose
# Transcribe with specific model
audio-processor interview.wav --model base --language en

# Summarize text file
text-summarizer article.txt --format markdown
# Summarize from stdin
cat document.txt | text-summarizer --stdin --format json

# Convert a local Office file via MarkItDown route
document-converter /tmp/report.docx
# Convert with metadata enabled for specialized routes
document-converter /tmp/file.pdf --include-metadata

# Audio to Summary Pipeline
audio-processor podcast.mp3 --format plain | \
text-summarizer --stdin --format markdown > summary.md
# PDF to Summary Pipeline
pdf-extractor extract document.pdf --format plain | \
text-summarizer --stdin --format json > summary.json
# PDF with image descriptions (provider-aware vision backend)
pdf-extractor extract document.pdf --include-images \
--provider lmstudio \
--base-url http://127.0.0.1:1234/v1 \
  --vision-model qwen/qwen3-vl-8b

from anyfile_to_ai.pdf_extractor import extract_text
from anyfile_to_ai.image_processor import process_image
from anyfile_to_ai.audio_processor import transcribe_audio
from anyfile_to_ai.text_summarizer import summarize_text
from anyfile_to_ai.document_converter import convert_document
from anyfile_to_ai.task_manager import TaskManager, TaskState

result = extract_text("document.pdf", format="json")
print(result.text)

result = process_image("image.jpg", style="detailed")
print(result.description)

result = transcribe_audio("audio.mp3", format="json")
print(result.text)

result = summarize_text("long_text.txt", format="markdown")
print(result.summary)

result = convert_document("/tmp/report.docx")
print(result.route, result.content)

# Create task with checkpoint-based resume
manager = TaskManager()
task = manager.create_task("job-001", "/data/file.pdf", total_pages=100)
# Checkpoint after each page
for page in range(1, 101):
    process_page(page)
    manager.checkpoint("job-001", page)
# Resume from checkpoint on restart
task = manager.load_task("job-001")
remaining = [p for p in range(1, task.total_pages + 1) if p not in task.processed_pages]

Since ML models are not included in the package, install them separately:
# For image processing (VLM models)
pip install mlx-vlm
# For audio transcription (Whisper models)
pip install lightning-whisper-mlx
# For text summarization (LLM client)
pip install httpx

# Unified provider configuration
export PROVIDER=ollama
export BASE_URL=http://127.0.0.1:11434
# Text and vision model selection
export TEXT_MODEL=qwen/qwen3-14b
export VISION_MODEL=qwen/qwen3-vl-8b

# Per-command overrides (highest priority)
text-summarizer article.txt --provider lmstudio --base-url http://127.0.0.1:1234/v1 --text-model qwen/qwen3-14b
image-processor photo.jpg --provider lmstudio --base-url http://127.0.0.1:1234/v1 --vision-model qwen/qwen3-vl-8b
pdf-extractor extract paper.pdf --include-images --provider lmstudio --base-url http://127.0.0.1:1234/v1 --vision-model qwen/qwen3-vl-8b

- Python 3.11+
- UV package manager (recommended)
- Apple Silicon Mac (for MLX-optimized features)
# Clone and enter directory
git clone <repo-url>
cd anyfile-to-ai
# Install development dependencies
uv sync
# Install pre-commit hooks
uv run pre-commit install

# Run tests
uv run pytest
# Run comprehensive human review test suite (quick integration test)
./tests/human_review_quick_test
# Code formatting and linting
uv run ruff check .
uv run ruff format .
# Pre-commit hooks (auto-run on git commit)
uv run pre-commit install # Install hooks (one-time setup)
uv run pre-commit run --all-files # Run manually on all files
# Check file length compliance
uv run python check_file_lengths.py

Pre-commit hooks automatically run linting and formatting checks when you commit. These hooks:
- Fix simple issues automatically (imports, whitespace, formatting)
- Report complex issues that require manual fixes (complexity, undefined names)
When to bypass hooks (use git commit --no-verify):
- Emergency hotfixes that need immediate deployment
- Pre-commit tool malfunction or configuration issues
- Work-in-progress commits during local experimentation
- Dependency updates that may temporarily break checks
When NOT to bypass hooks:
- To avoid fixing legitimate linting errors
- To skip required code quality checks
- To save time during normal development
Note: CI will enforce all checks regardless of local bypass, making this a safe escape hatch for edge cases.
PDF extraction:
- Streaming support for large files
- Progress tracking
- Multiple output formats (plain, JSON, CSV)
- Error handling for corrupted/protected PDFs
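As an illustration of the multiple output formats listed above, extracted pages could be rendered as plain text, JSON, or CSV along these lines (`format_pages` is a hypothetical helper, not the extractor's actual API):

```python
import csv
import io
import json

def format_pages(pages: list[dict], fmt: str = "plain") -> str:
    """Render [{'page': 1, 'text': ...}, ...] as plain, json, or csv (illustrative)."""
    if fmt == "plain":
        # Pages separated by a blank line.
        return "\n\n".join(p["text"] for p in pages)
    if fmt == "json":
        return json.dumps({"pages": pages}, ensure_ascii=False)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=["page", "text"])
        writer.writeheader()
        writer.writerows(pages)
        return buf.getvalue()
    raise ValueError(f"unknown format: {fmt}")

pages = [{"page": 1, "text": "Hello"}, {"page": 2, "text": "World"}]
print(format_pages(pages, "csv"))
```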
Image processing:
- Vision Language Model integration
- Multiple description styles (brief, detailed, technical)
- Batch processing with progress
- MLX optimization for Apple Silicon
Audio transcription:
- MLX-optimized Whisper models
- Multilingual support with auto-detection
- Multiple model sizes (tiny to large-v3)
- Batch processing with progress tracking
- Support for mp3, wav, and m4a formats
Text summarization:
- LLM-powered intelligent summarization
- Automatic language detection (outputs in English)
- Hierarchical chunking for large documents (>10k words)
- Minimum 3 categorization tags per summary
- Pipeline integration with other modules
- JSON and plain text output formats
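The hierarchical chunking for large documents could work roughly like the word-count splitter below; this is only a sketch, and the chunk size and overlap values are made up, not the module's actual parameters:

```python
def chunk_words(text: str, max_words: int = 1000, overlap: int = 50) -> list[str]:
    """Split text into word-count chunks with a small overlap (illustrative)."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap  # consecutive chunks share `overlap` words
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# A >10k-word document would be summarized chunk by chunk, then the
# chunk summaries summarized again (the hierarchical pass).
doc = " ".join(f"w{i}" for i in range(2500))
print(len(chunk_words(doc, max_words=1000, overlap=50)))  # 3
```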
All processing modules support cooperative cancellation for long-running operations:
from anyfile_to_ai.progress_tracker import CancellationToken, OperationCancelledError
# Create token
token = CancellationToken()
# Request cancellation
token.cancel()
# Check status
if token.is_cancelled:
    print("Operation cancelled")
# Reset for reuse
token.reset()

from anyfile_to_ai.pdf_extractor import extract_text_streaming
from anyfile_to_ai.progress_tracker import CancellationToken, OperationCancelledError
token = CancellationToken()
try:
    for page in extract_text_streaming("large.pdf", cancel_token=token):
        print(f"Page {page.page_number}")
        # Cancel after 10 pages
        if page.page_number >= 10:
            token.cancel()
except OperationCancelledError:
    print("Processing cancelled")

from anyfile_to_ai.image_processor import process_images
from anyfile_to_ai.progress_tracker import CancellationToken, OperationCancelledError
token = CancellationToken()
try:
    results = process_images(
        ["img1.jpg", "img2.jpg", "img3.jpg"],
        cancel_token=token,
    )
except OperationCancelledError:
    print("Batch processing cancelled")

- Cooperative cancellation: Check at iteration boundaries
- Partial results: Yield completed results before raising
- Resource cleanup: Clean up resources before raising
- Backward compatible: Optional parameter, existing code works unchanged
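Internally, this design amounts to checking a shared flag at each iteration boundary and surfacing partial results before raising. A minimal stdlib-only sketch (the real CancellationToken lives in anyfile_to_ai.progress_tracker; `Token` and `process_items` here are illustrative stand-ins):

```python
import threading

class OperationCancelledError(Exception):
    """Raised when a cancel request is observed (mirrors the package's error)."""

class Token:
    """Minimal stand-in for CancellationToken (illustrative)."""
    def __init__(self):
        self._event = threading.Event()

    def cancel(self):
        self._event.set()

    @property
    def is_cancelled(self):
        return self._event.is_set()

def process_items(items, token=None):
    """Process items, checking the token at each iteration boundary."""
    results = []
    for item in items:
        if token is not None and token.is_cancelled:
            # Partial results travel with the raised error.
            raise OperationCancelledError(results)
        results.append(item * 2)
    return results

token = Token()
token.cancel()
try:
    process_items([1, 2, 3], token)
except OperationCancelledError as err:
    print("cancelled with partial results:", err.args[0])
```

Because the check happens only between iterations, cancellation never interrupts an item mid-processing, and code that passes no token behaves exactly as before.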
See the module READMEs for detailed cancellation examples.
🚧 Work in Progress - This is an evolving experiment. Modules are functional but the overall vision continues to develop.
Each module is documented independently. Check their individual READMEs for detailed usage instructions.
This is an experimental project exploring modular design patterns. Feel free to explore the code and documentation in the specs/ directory to understand the development process.
The repository now includes anyfile_to_ai/output_formatter/ as the canonical formatter package for plain, markdown, and json output assembly.
- Use profile values: pdf, image, audio, text, document_converter.
- JSON serialization is deterministic and can include normalized metadata when requested.
- Module-local formatter paths remain available with rollback toggles (ANYFILE_OUTPUT_FORMATTER_*_SHARED=0) during migration.
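Deterministic JSON of the kind described is typically achieved with sorted keys and fixed separators, so equal inputs always serialize to byte-identical output; a sketch (the field names are illustrative, not the formatter's actual schema):

```python
import json

def to_deterministic_json(payload: dict) -> str:
    """Serialize with sorted keys and fixed separators so equal payloads
    always produce identical strings (illustrative)."""
    return json.dumps(payload, sort_keys=True, ensure_ascii=False,
                      separators=(",", ":"))

# Key insertion order no longer affects the output.
a = to_deterministic_json({"route": "pdf", "content": "hi", "metadata": {"pages": 1}})
b = to_deterministic_json({"metadata": {"pages": 1}, "content": "hi", "route": "pdf"})
print(a == b)  # True
```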