Skip to content

Latest commit

 

History

History
116 lines (81 loc) · 6.7 KB

File metadata and controls

116 lines (81 loc) · 6.7 KB

IndicPhotoOCR Optimization & Stabilization Report

This report summarizes the comprehensive series of performance optimizations, bug fixes, and architectural improvements applied to the IndicPhotoOCR repository up to the current commit string.

🚀 Performance Metrics & Improvements

Significant architectural changes (specifically Model Caching and Batch Inference) have dramatically accelerated the pipeline's execution times.

Metric Older Version Current Version Improvement
Sequential Time 89.61 seconds 17.80 seconds ~5.0x Faster (80% reduction)
Batched Time 92.93 seconds 14.10 seconds ~6.6x Faster (85% reduction)

Key Takeaways

  • The 5x Sequential Speedup is primarily driven by the Model Caching implementation. Previously, models were re-loaded from disk into memory for every single cropped word loop, which heavily bottlenecked inference. By preserving models in memory, the baseline sequential execution time drastically collapsed from 89.61s to 17.80s.
  • The additional 21% Batched Speedup (17.80s → 14.10s) is achieved via the newly implemented Batch Inference method, which groups multiple bounding box crops into a single combined tensor/pipeline array, maximizing GPU efficiency and lowering loop overheads.

🛠️ Summary of All Codebase Improvements

1. Architectural Optimizations

  • Model In-Memory Caching: Added caching functionality to both VIT_identifier and PARseqrecogniser. Models are now loaded once upon request and persisted in memory for consecutive word detections, completely eliminating redundant I/O disk bottlenecks.
  • Batch Inference Pipeline (feat/batch-inference):
    • ViT Identifier: Integrated HuggingFace pipeline batched processing natively into identify_batch().
    • PARseq Recogniser: Created custom get_model_output_batch() and recognise_batch() methods utilizing torch.stack() and torch.no_grad() to compute parallel logit extraction across sequence predictions.
    • Orchestration: Updated OCR.ocr(image_path, batch_size=X) to natively support batched detection grouping, cleanly reorganizing bounding boxes into language-specific batch tensors before restoring their original associative structure.

2. Stabilization & Safety Hooks

  • Atomic Model Checkpoint Downloads: Prevented corrupted network downloads! Modified the ensure_model() helper utilities (textbpn, parseq, vit, CLIP). Model shards are strictly written into progressive .tmp suffixes and only explicitly moved via os.rename() to the final endpoint upon fully completing a 200 OK fetch.
  • Relative to Absolute Paths: Resolved significant routing fragility by detaching hard-coded path invocations (e.g., IndicPhotoOCR/recognition/). Dynamic resolution relies on os.path.dirname(os.path.abspath(__file__)), meaning end-users can invoke IndicPhotoOCR gracefully from anywhere within their root OS directory.
  • Robust Temporary Crop Handling: Converted hardcoded transient folders (IndicPhotoOCR/script_identification/images/) over to globally compliant temp directories (import tempfile.mkstemp()). Reconciled garbage collection bugs to ensure .jpg crop trails are accurately deleted off the OS without bleeding.

3. Feature Extensions

  • Confidence Scoring (feat/confidence-scores): Implemented logit pooling to calculate mean token probability arrays during extraction. OCR.recognise now has the capability of emitting the word along with its scalar certainty.

4. Backwards Compatibility & Automated Testing

  • 100% Core Functionality Preserved:
    • OCR.recognise() enforces backward API parity using return_confidence=False by default so existing dependent scripts do not suddenly encounter tuple unpacking TypeErrors.
    • OCR.ocr() preserves default batch_size=0, guaranteeing existing loops will continue running identically in sequential mode without disruption unless explicitly elevated.
  • Test Suite Introduction (origin/test):
    • Added ~106 extensive pytest conditions checking mocking validations, dataset layout parsing logic (detect_para), and network retry policies to protect against future regressions.

All implementations verified safe & green via the fix/quick-wins branch.


User Guide: New Features & Performance

Welcome to the latest version of IndicPhotoOCR! We have upgraded the pipeline under the hood so it runs exponentially faster, while adding a few requested features to the public API.

Here is a quick breakdown of what is new for users.

🚀 1. Massive Performance Upgrade

You do not need to change any of your code to experience the new performance upgrades. The underlying models are now intelligently cached in memory, which means sequential scanning will naturally run 5x faster on average.

Mode Previous Version Current Version
Out-of-the-box (Sequential) ~90 seconds ~18 seconds

⚡ 2. Batch Inference (New Feature)

If you are running OCR on an image containing a large paragraph of text (many words), you can explicitly turn on Batch Inference to process them concurrently for even faster extraction.

How to use it: Pass the batch_size parameter to the main .ocr() method:

from IndicPhotoOCR.ocr import OCR

ocr = OCR()

# Old Way (Still works natively!)
results = ocr.ocr("image.jpg")

# New Way (Enable Batching for faster large-document processing)
results = ocr.ocr("image.jpg", batch_size=32) 
Mode Previous Version Current Batched
Batched Processing ~93 seconds ~14 seconds

🎯 3. Confidence Scores (New Feature)

You can now extract the neural net's confidence score representing the certainty of the extracted text. This is highly useful for downstream NLP evaluation, filtering bad predictions, or generating Word Error Rate (WER) metrics.

Through the main ocr() pipeline:

The confidence score is automatically injected into the default dictionary output for each bounding box:

results = ocr.ocr("image.jpg")

# Example Output
# {
#   "img_0": {
#     "txt": "नमस्ते", 
#     "bbox": [531, 237, 725, 290], 
#     "confidence": 0.985  <--- NEW!
#   }
# }

Through the direct recognise() function:

If you are passing your own cropped images directly to the recogniser, you can optionally request the confidence score by passing return_confidence=True:

# Old behavior (returns string text)
text = ocr.recognise("crop.jpg", "hindi")

# New behavior (returns tuple with confidence float)
text, conf_score = ocr.recognise("crop.jpg", "hindi", return_confidence=True)

print(f"I am {conf_score * 100}% confident the text is: {text}")