Job management for Yale's HPC cluster - like HuggingFace Jobs for Yale
Yale Jobs provides a simple, HuggingFace-style API for running jobs on Yale's HPC cluster. It handles SSH connections (with 2FA), data preparation from multiple sources (PDFs, IIIF, directories, web, HuggingFace datasets), job submission, and monitoring.
- **Simple API** - HuggingFace-style job submission
- **2FA Support** - Seamless authentication with Yale's cluster
- **Multiple Data Sources** - PDFs, IIIF manifests, directories, web URLs, HF datasets
- **OCR Ready** - Built-in support for DoTS.ocr and other models
- **Job Monitoring** - Track status and download results
- **Python SDK & CLI** - Use programmatically or from the command line
```bash
# Clone the repository
git clone https://github.com/your-username/yale-jobs.git
cd yale-jobs

# Install with pip
pip install -e .

# Or install with OCR dependencies
pip install -e ".[ocr]"
```

Create a `config.yaml` file:
```yaml
alias: yale-cluster                    # Your cluster hostname
login: true
env: qwen                              # Conda environment name
job_dir: proj*/shared/test
result_dir: proj*/shared/test/results
2fa: true
```

```bash
# OCR on PDFs
yale jobs ocr path/to/pdfs output-dataset --source-type pdf --gpus v100:2

# OCR on IIIF manifest
yale jobs ocr https://example.com/manifest.json output --source-type iiif

# OCR on image directory
yale jobs ocr path/to/images output --source-type directory --batch-size 32

# Check status
yale jobs status 12345

# Download results
yale jobs download --job-name my-job --output-dir ./results

# View logs
yale jobs logs --job-name my-job
```

```python
from yale import run_ocr_job

# Run OCR on a directory of PDFs
job = run_ocr_job(
    data_source="manuscripts/",
    output_dataset="manuscripts-ocr",
    source_type="pdf",
    gpus="v100:2",
    batch_size=32,
)
print(f"Job ID: {job.job_id}")

# Check status
status = job.get_status()
print(f"State: {status['state']}")

# Download results
job.download_results("./results")
```

Yale Jobs supports multiple data sources:
```python
from yale import run_ocr_job

job = run_ocr_job(
    data_source="path/to/pdfs/",  # Single PDF or directory
    output_dataset="pdf-ocr",
    source_type="pdf",
    gpus="v100:2",
)
```

Supports both IIIF Presentation API v2 and v3:
```python
job = run_ocr_job(
    data_source="https://example.com/iiif/manifest.json",
    output_dataset="iiif-ocr",
    source_type="iiif",
    gpus="p100:2",
)
```

```python
# Single URL
job = run_ocr_job(
    data_source="https://example.com/image.jpg",
    output_dataset="web-ocr",
    source_type="web",
)

# Multiple URLs from file (one per line)
job = run_ocr_job(
    data_source="urls.txt",
    output_dataset="web-ocr",
    source_type="web",
)
```

```python
job = run_ocr_job(
    data_source="path/to/images/",
    output_dataset="dir-ocr",
    source_type="directory",
    gpus="v100:2",
)
```

```python
job = run_ocr_job(
    data_source="davanstrien/ufo-ColPali",
    output_dataset="ufo-ocr",
    source_type="hf",
    gpus="a100:2",
)
```

Run any Python script on the cluster:
```python
from yale import run_job

script = """
import torch
from datasets import load_from_disk

# Load prepared dataset
dataset = load_from_disk("dataset")
print(f"Processing {len(dataset)} samples...")

# Your custom processing here
"""

job = run_job(
    script=script,
    data_source="path/to/data",
    source_type="auto",
    job_name="custom-job",
    gpus="v100:2",
    cpus_per_task=4,
    time_limit="02:00:00",
    memory="32G",
)
```

```python
from yale import YaleJobs

# Create SDK instance
yale = YaleJobs(config_path="config.yaml")
yale.connect()

# Submit job
job = yale.submit_job(
    script=my_script,
    data_source="path/to/data",
    job_name="my-job",
    gpus="v100:2",
)

# Monitor status
status = yale.get_job_status(job.job_id)
print(status)

# Close connection
yale.close()

# Or use a context manager
with YaleJobs() as yale:
    yale.connect()
    job = yale.submit_job(...)
```

Work with data sources directly:
```python
from yale.data import PDFDataSource, IIIFDataSource

# Convert PDFs to images
pdf_ds = PDFDataSource("document.pdf")
images = pdf_ds.to_images(dpi=300)

# Create HuggingFace dataset
dataset = pdf_ds.to_dataset()
dataset.save_to_disk("pdf_dataset")

# IIIF manifest
iiif_ds = IIIFDataSource("https://example.com/manifest.json")
print(f"IIIF version: {iiif_ds.version}")
image_urls = iiif_ds.get_image_urls()
dataset = iiif_ds.to_dataset(max_size=2000)
```

```bash
# Run custom script
yale jobs run script.py --data-source data/ --gpus v100:2

# Run OCR
yale jobs ocr <source> <output> [options]
  --source-type {auto,pdf,iiif,web,directory,hf}
  --model MODEL          # Default: rednote-hilab/dots.ocr
  --batch-size N         # Default: 16
  --max-samples N        # Limit samples
  --gpus GPU_SPEC        # Default: p100:2
  --partition PARTITION  # SLURM partition (default: gpu)
  --time HH:MM:SS        # Time limit (default: 02:00:00)
  --env ENV_NAME         # Conda environment (overrides config.yaml)
  --prompt-mode {ocr,layout-all,layout-only}  # DoTS.ocr mode (default: layout-all)
  --dataset-path PATH    # Use existing dataset on cluster (skips upload)
  --max-model-len N      # Maximum model context length (default: 32768)
  --max-tokens N         # Maximum output tokens (default: 16384)
  --hpc-process          # Process data on HPC (copy raw data first)
  --wait                 # Wait for completion

# Check status
yale jobs status <job-id>

# Cancel job
yale jobs cancel <job-id>

# Download results
yale jobs download --job-name NAME [--output-dir DIR] [--pattern PATTERN]

# View logs
yale jobs logs --job-name NAME
```

Common GPU specifications:

- `p100:1` - single P100 GPU
- `p100:2` - two P100 GPUs
- `v100:1` - single V100 GPU
- `v100:2` - two V100 GPUs
- `a100:1` - single A100 GPU
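A spec is just a `type:count` pair passed through to SLURM's GPU request. A minimal helper that splits one (hypothetical, not part of the Yale Jobs API) could look like:

```python
def parse_gpu_spec(spec: str) -> tuple[str, int]:
    """Split a GPU spec like "v100:2" into (gpu_type, count).

    Hypothetical helper, not part of the Yale Jobs API; a bare
    type such as "v100" is treated as a single GPU.
    """
    gpu_type, _, count = spec.partition(":")
    return gpu_type, int(count) if count else 1
```

For example, `parse_gpu_spec("v100:2")` returns `('v100', 2)`.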
Yale Jobs handles the complete workflow:
- SSH Connection - Connects to cluster with 2FA support
- Data Preparation - Converts data from various sources to HuggingFace datasets
- Upload - Transfers data to cluster via SFTP
- Job Creation - Generates SLURM batch script
- Submission - Submits job to SLURM queue
- Monitoring - Tracks job status with `sacct`/`squeue`
- Results - Downloads results when complete
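The monitor-then-download tail of this workflow is easy to script against the job objects returned by `run_ocr_job`/`run_job`. A minimal polling sketch, assuming only the `get_status()` and `download_results()` methods shown earlier (the set of terminal SLURM state names is an assumption):

```python
import time

# Terminal SLURM states after which polling should stop (assumption:
# sacct reports one of these names in the status dict's 'state' field).
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}

def wait_and_download(job, output_dir="./results", poll_seconds=30):
    """Poll a submitted job until it reaches a terminal state, then
    download results if it completed successfully. Returns the final state."""
    while True:
        state = job.get_status()["state"]
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    if state == "COMPLETED":
        job.download_results(output_dir)
    return state
```

This is what the `--wait` CLI flag does for you; the sketch is only useful when you want the same behavior from your own scripts.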
```bash
# Simple text extraction (ocr mode)
yale jobs ocr manuscript.pdf text-output \
  --source-type pdf \
  --prompt-mode ocr

# Full layout analysis with bounding boxes (layout-all - default)
# Note: layout-all uses a longer prompt, increase context if needed
yale jobs ocr documents/ layout-output \
  --source-type pdf \
  --prompt-mode layout-all \
  --gpus h200:1 \
  --partition gpu_h200 \
  --max-model-len 32768

# Layout structure only (no text content)
yale jobs ocr scans.pdf layout-only-output \
  --source-type pdf \
  --prompt-mode layout-only
```

Note on context length:

- Simple OCR mode: the default of 32768 is usually enough
- Layout-all mode (the default): may need 32768+ for complex/large images
- Error "decoder prompt too long": increase `--max-model-len` (e.g., 49152 or 65536)
- DoTS.ocr supports up to ~128K tokens depending on available GPU memory
Process multiple IIIF manifests from a text file:
```bash
# Create a text file with manifest URLs (one per line)
cat > manifests.txt <<EOF
https://collections.library.yale.edu/manifests/11781249
https://collections.library.yale.edu/manifests/11781250
https://collections.library.yale.edu/manifests/11781251
EOF

# Default: downloads images locally, then uploads the dataset to the cluster
yale jobs ocr manifests.txt output \
  --batch-size 16 \
  --gpus h200:1 \
  --partition gpu_h200

# With --hpc-process: downloads images ON THE CLUSTER (recommended!)
yale jobs ocr manifests.txt output \
  --source-type iiif-list \
  --hpc-process \
  --batch-size 16 \
  --gpus h200:1 \
  --partition gpu_h200
```

Without `--hpc-process` (default):
- Load IIIF manifests on your local machine
- Download images from IIIF servers to your local machine
- Convert to HuggingFace dataset locally
- Upload entire dataset to cluster
- Run OCR
With --hpc-process (recommended for IIIF):
- Upload manifest list to cluster
- Cluster loads IIIF manifests
- Cluster downloads images directly from IIIF servers (faster, doesn't use your bandwidth!)
- Cluster converts to HuggingFace dataset
- Run OCR
💡 **Tip:** Always use `--hpc-process` with IIIF manifests - the cluster has better bandwidth to IIIF servers!
For large datasets or when bandwidth is limited, process data on the cluster instead of locally:
```bash
# Process PDF on HPC (copies raw PDF, processes there)
yale jobs ocr large-document.pdf output \
  --source-type pdf \
  --hpc-process \
  --gpus h200:1 \
  --partition gpu_h200

# Process directory of images on HPC
yale jobs ocr /local/images/ output \
  --source-type directory \
  --hpc-process
```

How `--hpc-process` works:
- Without flag (default): Data is processed locally, converted to HuggingFace Dataset, then uploaded
- With flag: Raw data is copied to cluster, preprocessing script runs there, then OCR runs
When to use:
- ✅ IIIF manifests (the cluster has better bandwidth to IIIF servers!)
- ✅ Large PDFs or image directories (faster than local processing + upload)
- ✅ Slow local connection to the cluster
- ✅ You want to leverage the cluster's faster processing

When NOT to use:

- ❌ Small datasets (local processing is fine)
- ❌ HuggingFace datasets (already remote)
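Those rules of thumb are simple enough to encode. A hypothetical helper sketching the decision (the 1 GB threshold is an arbitrary stand-in for "large"):

```python
def should_hpc_process(source_type: str, approx_size_mb: float = 0.0,
                       slow_uplink: bool = False) -> bool:
    """Mirror the guidance above: always for IIIF sources, never for HF
    datasets (already remote), otherwise only for large data or a slow
    uplink. Hypothetical helper, not part of the Yale Jobs API."""
    if source_type == "hf":
        return False
    if source_type in ("iiif", "iiif-list"):
        return True
    # "Large" is a judgment call; 1 GB is an arbitrary cutoff here.
    return slow_uplink or approx_size_mb > 1024
```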
Skip data upload when rerunning OCR on an existing dataset:
```bash
# First run - uploads data
yale jobs ocr manuscript.pdf first-output \
  --source-type pdf \
  --prompt-mode ocr \
  --job-name ocr-run-1

# Second run - reuse the uploaded dataset with a different prompt
yale jobs ocr dummy.pdf second-output \
  --dataset-path /path/to/cluster/first-output_data \
  --prompt-mode layout-all \
  --job-name ocr-run-2
```

See the `examples/` directory for more:

- `simple_ocr.py` - basic OCR usage
- `iiif_ocr.py` - IIIF manifest processing
- `custom_job.py` - custom script execution
- `data_sources.py` - working with different data sources
The config.yaml file supports:
```yaml
alias: cluster-hostname       # Required: SSH hostname
login: true                   # Optional: Login node
env: conda-env-name           # Optional: Conda environment
job_dir: path/to/jobs         # Optional: Job directory (supports wildcards)
result_dir: path/to/results   # Optional: Results directory
2fa: true                     # Optional: Enable 2FA (default: true)
```

If you have trouble connecting:

```bash
# Test SSH manually
ssh your-netid@yale-cluster

# Check config
cat config.yaml
```

The system prompts for a 2FA code after the initial password. If this doesn't work, you may need to configure SSH keys.
Check job status:
```bash
yale jobs status <job-id>
```

Check logs:

```bash
yale jobs logs --job-name <job-name>
```

| Feature | HuggingFace Jobs | Yale Jobs |
|---|---|---|
| Remote execution | ✅ HF infrastructure | ✅ Yale HPC |
| GPU support | ✅ | ✅ |
| Data sources | HF datasets | PDFs, IIIF, directories, web, HF |
| Authentication | HF token | SSH + 2FA |
| Job monitoring | ✅ | ✅ |
| Python SDK | ✅ | ✅ |
| CLI | ✅ | ✅ |
MIT License - see LICENSE file for details.
Contributions welcome! Please open an issue or PR.
For issues or questions:
- Open a GitHub issue
- Contact: [your-email]