A curated collection of useful tools organized as git submodules for easy management and deployment.
Each tool is maintained as a git submodule, so it can be pinned to a specific version and updated independently.
Location: tools/mineru | Version: 1.3.10 (magic_pdf-1.3.11-released tag)
A high-quality tool for converting PDF documents to Markdown and JSON. MinerU extracts document content precisely and supports:
- ✅ Multiple output formats (Markdown, JSON)
- ✅ OCR support for 84 languages
- ✅ Layout and span visualization
- ✅ CPU and GPU acceleration support
- ✅ Cross-platform compatibility (Windows, Linux, macOS)
Quick Start:
# Create and activate conda environment
conda create --name pd python=3.13 -y
source /opt/conda/etc/profile.d/conda.sh && conda activate pd
# Navigate to MinerU directory and install dependencies
cd tools/mineru
pip install -e ".[full]"  # quoted so shells such as zsh do not expand the brackets
cd ../..
# Install required dependencies and download models
pip install requests huggingface_hub
python scripts/download_models_hf.py
# Use MinerU (note: command is magic-pdf, not mineru in v1.3.10)
magic-pdf -p <input_path> -o <output_path>
For optimal performance with CUDA GPU acceleration:
1. Verify GPU Support:
nvidia-smi # Check GPU availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
2. Configure GPU Acceleration:
The model download script automatically creates a configuration file at ~/magic-pdf.json. To enable GPU acceleration, ensure the device mode is set to cuda:
{
    "device-mode": "cuda",
    "models-dir": "/path/to/downloaded/models"
}
📋 Configuration Template: See configs/magic-pdf-gpu.template.json for a complete configuration template with all available options.
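
The edit described above can also be scripted. The following is a minimal sketch, assuming the configuration is a flat JSON object with the "device-mode" key shown above; the helper name set_device_mode is hypothetical and is not part of MinerU or this repository.

```python
import json
from pathlib import Path


def set_device_mode(config_path: Path, mode: str = "cuda") -> dict:
    """Set "device-mode" in a magic-pdf JSON config, keeping all other keys.

    Hypothetical helper for illustration; not shipped with MinerU.
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["device-mode"] = mode
    # Rewrite the whole file with the updated value.
    config_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config
```

Usage would look like set_device_mode(Path.home() / "magic-pdf.json"), run after the model download script has created the file.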
3. Performance Comparison:
- CPU Mode: ~16-17 it/s processing speed, language switched to ch_lite
- GPU Mode: 134+ it/s processing speed (about 8x faster), full language support
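
The "8x" figure follows directly from the two throughput numbers. A small sketch of the arithmetic, using only the values quoted above (the function names are illustrative):

```python
# Throughput figures quoted above (iterations per second).
CPU_ITS_RANGE = (16.0, 17.0)
GPU_ITS = 134.0


def speedup_range(gpu_its: float, cpu_range: tuple[float, float]) -> tuple[float, float]:
    """Return (lowest, highest) GPU/CPU speedup for a range of CPU throughputs."""
    slow, fast = cpu_range
    return gpu_its / fast, gpu_its / slow


def seconds_for(iterations: int, its_per_second: float) -> float:
    """Time needed for a given number of iterations at a steady rate."""
    return iterations / its_per_second


low, high = speedup_range(GPU_ITS, CPU_ITS_RANGE)
print(f"speedup: {low:.1f}x to {high:.1f}x")  # speedup: 7.9x to 8.4x
```

For a hypothetical job of 1000 iterations, seconds_for(1000, 134.0) is about 7.5 seconds, against roughly a minute at 16-17 it/s.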
Example Usage:
magic-pdf -p demo/pdfs/small_ocr.pdf -o output/
Here's the complete process to reproduce the magic-pdf setup:
1. Clone Repository with Submodules:
git clone --recursive https://github.com/protagolabs/ProtagoDoc.git
cd ProtagoDoc
2. Set up Conda Environment:
# Ensure conda is available and activate environment
source /opt/conda/etc/profile.d/conda.sh && conda activate pd
# or create new environment: conda create -n pd python=3.13 && conda activate pd
3. Install MinerU:
cd tools/mineru
pip install -e ".[full]"
cd ../..
4. Download Models and Configure GPU:
# Install required dependencies first
pip install requests huggingface_hub
python scripts/download_models_hf.py
5. Verify Setup:
python scripts/test_fresh_setup.py
6. Test Magic-PDF:
mkdir -p output
magic-pdf -p tools/mineru/demo/pdfs/small_ocr.pdf -o output/
Expected result: 134+ it/s GPU processing speed 🔥
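
Once the single-file test works, the same command can be applied to a folder of PDFs. This is a sketch only: it relies on nothing beyond the -p and -o flags shown above, and the helper names build_command and convert_all are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf: Path, output_dir: Path) -> list[str]:
    """Build the magic-pdf invocation for one PDF, using the -p/-o flags above."""
    return ["magic-pdf", "-p", str(pdf), "-o", str(output_dir)]


def convert_all(input_dir: Path, output_dir: Path, dry_run: bool = False) -> list[list[str]]:
    """Run magic-pdf once per *.pdf in input_dir; with dry_run, only build the commands."""
    commands = [build_command(pdf, output_dir) for pdf in sorted(input_dir.glob("*.pdf"))]
    if not dry_run:
        if shutil.which("magic-pdf") is None:
            raise RuntimeError("magic-pdf not found on PATH; activate the pd environment first")
        output_dir.mkdir(parents=True, exist_ok=True)
        for command in commands:
            # check=True stops at the first failing conversion.
            subprocess.run(command, check=True)
    return commands
```

Calling convert_all(Path("tools/mineru/demo/pdfs"), Path("output"), dry_run=True) lists the commands without running them, which is a cheap way to check paths before a long GPU run.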
To clone this repository with all submodules:
git clone --recursive https://github.com/protagolabs/ProtagoDoc.git
If you've already cloned the repository, initialize and update submodules:
git submodule init
git submodule update
To add a new tool as a submodule:
git submodule add <repository-url> tools/<tool-name>
git commit -m "Add <tool-name> submodule"
To update all submodules to their latest versions:
git submodule update --remote
To update a specific submodule:
# For MinerU (uses master branch)
cd tools/mineru
git pull origin master
cd ../..
git add tools/mineru
git commit -m "Update MinerU submodule"
# For other tools that might use main branch
cd tools/<tool-name>
git pull origin main # or master, depending on the repository
cd ../..
git add tools/<tool-name>
git commit -m "Update <tool-name> submodule"
ProtagoDoc/
├── tools/                              # All tool submodules
│   └── mineru/                         # MinerU - PDF to Markdown/JSON converter
├── scripts/                            # Utility scripts
│   ├── download_models_hf.py           # Model download script (local)
│   └── test_fresh_setup.py             # Setup validation script
├── configs/                            # Configuration templates
│   ├── magic-pdf-gpu.template.json     # GPU configuration template
│   └── README.md                       # Configuration documentation
├── .gitmodules                         # Submodule configuration
└── README.md                           # This file
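
The .gitmodules file in the tree above records the path and URL of every submodule. A minimal sketch of listing them from Python, assuming the usual git-config layout of that file; the function name parse_gitmodules is hypothetical, and the URL in the usage note is a placeholder rather than the real remote.

```python
import configparser
import re


def parse_gitmodules(text: str) -> dict[str, dict[str, str]]:
    """Map each submodule name to its settings (path, url, optional branch)."""
    # interpolation=None keeps any '%' characters in URLs literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    submodules = {}
    for section in parser.sections():
        # Section headers look like: submodule "tools/mineru"
        match = re.fullmatch(r'submodule "(.+)"', section)
        if match:
            submodules[match.group(1)] = dict(parser[section])
    return submodules
```

Given a file containing a [submodule "tools/mineru"] section with path = tools/mineru and url = https://example.com/mineru.git, the function returns one entry keyed by tools/mineru holding those two settings. Reading the real file is parse_gitmodules(open(".gitmodules").read()).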
Always run in the correct conda environment:
# ALWAYS activate the environment first
source /opt/conda/etc/profile.d/conda.sh && conda activate pd
# Verify you're in the right environment (should show "pd")
echo $CONDA_DEFAULT_ENV
# If not in pd environment, installations will fail or go to wrong location
Error: fatal: couldn't find remote ref main
- Some repositories use master as the default branch instead of main
- For MinerU: use git pull origin master
- Check the default branch with: git branch -r
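
The output of git branch -r usually contains a line such as origin/HEAD -> origin/master, which names the remote's default branch. A small sketch that extracts it; the function name default_branch is hypothetical, and the line may be absent if the remote HEAD was never recorded locally.

```python
import re
from typing import Optional


def default_branch(branch_r_output: str, remote: str = "origin") -> Optional[str]:
    """Return the branch that <remote>/HEAD points to, or None if not listed."""
    pattern = re.compile(
        rf"^\s*{re.escape(remote)}/HEAD -> {re.escape(remote)}/(\S+)\s*$"
    )
    for line in branch_r_output.splitlines():
        match = pattern.match(line)
        if match:
            return match.group(1)
    return None
```

To use it, pass the text captured from running git branch -r inside the submodule directory, then pull from the branch it reports instead of guessing between main and master.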
Updating to a specific version:
# To update MinerU to a newer version tag
cd tools/mineru
git fetch origin
git checkout magic_pdf-1.3.11-released # or desired version
cd ../..
git add tools/mineru
git commit -m "Update MinerU to version 1.3.10 (magic_pdf-1.3.11-released tag)"
Reset submodule to specific commit:
cd tools/mineru
git checkout ea619281ef43577da91247a9df60f53b12d47cbc # current pinned commit (magic_pdf-1.3.11-released)
cd ../..
git add tools/mineru
git commit -m "Reset MinerU to pinned version 1.3.10 (magic_pdf-1.3.11-released tag)"
Error: magic-pdf: command not found
- CRITICAL: Ensure you're in the correct conda environment: conda activate pd
- Ensure you've run the model download script: python scripts/download_models_hf.py
- Check if MinerU is properly installed: pip show magic-pdf
- Verify environment activation: echo $CONDA_DEFAULT_ENV should show pd
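
The environment, PATH and installation checks above can be gathered into one report. This is a sketch under the assumption that the package is installed under the name magic-pdf (as pip show magic-pdf suggests); the function name diagnose is hypothetical.

```python
import os
import shutil
from importlib import metadata


def diagnose(command="magic-pdf", package="magic-pdf", expected_env="pd", environ=None):
    """Collect the three checks above: conda env name, command on PATH, package installed."""
    environ = os.environ if environ is None else environ
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    return {
        # Same value that `echo $CONDA_DEFAULT_ENV` prints.
        "conda_env_ok": environ.get("CONDA_DEFAULT_ENV") == expected_env,
        "command_on_path": shutil.which(command) is not None,
        "package_version": version,
    }
```

Running print(diagnose()) inside the activated environment should report conda_env_ok as True, command_on_path as True and a version string; any other result points at the bullet to revisit.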
Error: Still using CPU despite CUDA configuration
- Verify the configuration file exists: ls -la ~/magic-pdf.json
- Check device mode setting: python -c "from magic_pdf.libs.config_reader import get_device; print('Device:', get_device())"
- Ensure device-mode is set to "cuda" in ~/magic-pdf.json: { "device-mode": "cuda" }
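
These configuration checks can be run without importing MinerU at all. A minimal sketch, assuming the two keys shown earlier ("device-mode" and "models-dir") are the ones that matter; check_config is a hypothetical helper, not part of magic_pdf.

```python
import json
from pathlib import Path


def check_config(config_path: Path) -> list[str]:
    """Return a list of problems found in a magic-pdf config; empty means it looks fine."""
    if not config_path.is_file():
        return [f"{config_path} does not exist"]
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{config_path} is not valid JSON: {exc}"]
    if not isinstance(config, dict):
        return [f"{config_path} does not contain a JSON object"]
    problems = []
    mode = config.get("device-mode")
    if mode != "cuda":
        problems.append(f'"device-mode" is {mode!r}, expected "cuda"')
    models_dir = config.get("models-dir")
    if not models_dir or not Path(models_dir).is_dir():
        problems.append(f'"models-dir" {models_dir!r} is not an existing directory')
    return problems
```

Calling check_config(Path.home() / "magic-pdf.json") and printing each returned line gives a quick answer to why processing still runs on the CPU.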
Error: Missing model weights
# Re-download models if they're missing
python scripts/download_models_hf.py
GPU Memory Issues
- Reduce batch size by modifying the configuration
- Check available GPU memory: nvidia-smi
- For GPUs with <6GB VRAM, consider using CPU mode
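
The 6 GB rule of thumb can be checked programmatically. The sketch below parses the output of nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits, which prints one value in MiB per GPU. Treating 6 GB as 6 x 1024 MiB is an assumption, and both function names are hypothetical.

```python
def parse_memory_mib(output: str) -> list[int]:
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def suggest_device_mode(total_mib: list[int], min_gib: float = 6.0) -> str:
    """Suggest "cuda" if any GPU has at least min_gib of memory, else "cpu".

    The threshold mirrors the "<6GB VRAM" guidance above; reading 6 GB as
    6 * 1024 MiB is an assumption of this sketch.
    """
    if any(mib >= min_gib * 1024 for mib in total_mib):
        return "cuda"
    return "cpu"
```

The command output can be captured with subprocess.run(..., capture_output=True, text=True) and passed to parse_memory_mib; an empty list (no GPU found) falls back to "cpu".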
Performance Optimization
- Expected GPU Performance: 130+ it/s for OCR processing
- Expected CPU Performance: 16-17 it/s for OCR processing
- If GPU performance is poor, check CUDA installation and drivers
To contribute a new tool:
- Fork the repository
- Add your tool as a submodule in the tools/ directory
- Update this README with tool documentation
- Submit a pull request
This repository serves as a collection hub. Each tool maintains its own license:
- MinerU: AGPL-3.0 License
Last updated: 2025-07-25 - Fresh setup validated with 4x RTX 4090 GPUs achieving 134+ it/s performance