Pdf/optimize #67
- Add PyMuPDFConverter as default fast PDF converter (~40x faster)
- Keep DoclingPdfConverter as optional high-quality alternative
- Create dedicated DocxConverter (separated from PDF processing)
- Add pdfConverter field to ProcessingProfile for converter selection
- Conditional OCR settings (only shown when Docling is selected)
- Restore conversionTableRows/Cols for XLSX/CSV table rendering
- Update frontend ProfileCard and ProfileFormDialog with new fields
- Update backend routes, upload-route, sync-service

Closes: PDF conversion optimization task
Converter Refactoring:
- Split text_converter.py into md_converter.py, txt_converter.py, json_converter.py
- Each format now has dedicated converter file for easier optimization
- Updated __init__.py and router.py exports

Test Fixes:
- test_text_processor.py: Use new converter imports
- test_profile_config.py: Test pdfConverter instead of deprecated fields
- test_existing_formats.py: Fix broken converter imports
- test_main.py: Mock get_pdf_converter instead of get_converter for PDF
- database.ts: Add conversionTableRows/Cols to type
- profile-model.test.ts: Add table field assertions

All 222 tests passing
PR Type
Enhancement
Description
Optimize PDF conversion with PyMuPDF4LLM as default (~40x faster)
Keep Docling as optional high-quality alternative with conditional OCR
Split text converters into dedicated files (md, txt, json, docx)
Restore table conversion settings for XLSX/CSV files
Update UI to show PDF converter selection with conditional OCR fields
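The selection logic described above could look roughly like the sketch below. It is illustrative only: the `ProfileConfig` fields (`pdf_converter`, `ocr_enabled`) and the values `"pymupdf"` / `"docling"` are assumptions, not the PR's actual schema. Third-party imports are deferred to `convert` so the selection step itself needs neither library installed.

```python
from dataclasses import dataclass
from typing import Protocol


class PdfConverter(Protocol):
    def convert(self, path: str) -> str: ...


@dataclass
class ProfileConfig:
    # Assumed field names and values; the real profile schema may differ.
    pdf_converter: str = "pymupdf"
    ocr_enabled: bool = False  # only meaningful when Docling is selected


class PyMuPDFConverter:
    """Fast default path, delegating to pymupdf4llm."""

    def convert(self, path: str) -> str:
        import pymupdf4llm  # imported lazily; only needed for real conversion

        return pymupdf4llm.to_markdown(path)


class DoclingPdfConverter:
    """Optional higher-quality path, delegating to Docling."""

    def __init__(self, ocr_enabled: bool = False) -> None:
        self.ocr_enabled = ocr_enabled

    def convert(self, path: str) -> str:
        from docling.datamodel.base_models import InputFormat
        from docling.datamodel.pipeline_options import PdfPipelineOptions
        from docling.document_converter import DocumentConverter, PdfFormatOption

        options = PdfPipelineOptions(do_ocr=self.ocr_enabled)
        converter = DocumentConverter(
            format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
        )
        return converter.convert(path).document.export_to_markdown()


def get_pdf_converter(config: ProfileConfig) -> PdfConverter:
    # OCR settings are passed through only for Docling, mirroring the
    # conditional OCR fields in the UI.
    if config.pdf_converter == "pymupdf":
        return PyMuPDFConverter()
    if config.pdf_converter == "docling":
        return DoclingPdfConverter(ocr_enabled=config.ocr_enabled)
    raise ValueError(f"unknown pdf converter: {config.pdf_converter!r}")
```

A profile that leaves the field unset would get the fast converter under these assumptions, and OCR options are ignored unless Docling is chosen.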
File Walkthrough
12 files
- New fast PDF converter using PyMuPDF4LLM
- New dedicated DOCX converter using Docling
- Added dynamic PDF converter selection logic
- Updated to use get_pdf_converter for dynamic selection
- Updated ProfileConfig with pdfConverter field
- Updated schema validation for new pdfConverter field
- Updated profileConfig to include pdfConverter selection
- Updated profile config building with pdfConverter
- Updated ProcessingProfile interface with pdfConverter
- Updated field labels and tooltips for new converter selection
- Added conditional OCR display based on PDF converter choice
- Added PDF converter selector with conditional OCR fields

5 files
- Renamed to DoclingPdfConverter for clarity
- New dedicated Markdown converter module
- New dedicated plain text converter module
- Extracted JSON converter with tabular detection
- Updated exports for new converter modules

1 file
- Added pdfConverter field, reorganized conversion settings

6 files
- Updated test helper with new pdfConverter field
- Updated profile model tests for pdfConverter defaults
- Updated ProfileConfig tests with new defaults
- Refactored tests for split converter modules
- Updated imports for new converter module structure
- Updated mocks to use get_pdf_converter instead of get_converter
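The walkthrough mentions a JSON converter with tabular detection but does not show how detection works. One plausible reading, sketched below with hypothetical function names, is to treat a non-empty list of flat objects with identical keys as a table and fall back to a fenced JSON block otherwise. This is an assumption about the approach, not the code in json_converter.py.

```python
import json
from typing import Any


def is_tabular(data: Any) -> bool:
    """Return True for a non-empty list of flat dicts sharing the same keys."""
    if not isinstance(data, list) or not data:
        return False
    if not all(isinstance(row, dict) for row in data):
        return False
    keys = set(data[0].keys())
    if not keys:
        return False
    for row in data:
        if set(row.keys()) != keys:
            return False
        # Nested values do not fit into a single table cell.
        if any(isinstance(value, (dict, list)) for value in row.values()):
            return False
    return True


def json_to_markdown(text: str) -> str:
    """Render tabular JSON as a Markdown table, anything else as fenced JSON."""
    data = json.loads(text)
    if is_tabular(data):
        headers = list(data[0].keys())
        lines = [
            "| " + " | ".join(headers) + " |",
            "| " + " | ".join("---" for _ in headers) + " |",
        ]
        for row in data:
            lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
        return "\n".join(lines)
    fence = "`" * 3
    return f"{fence}json\n{json.dumps(data, indent=2)}\n{fence}"
```

The sketch does not escape pipe characters inside cell values or cap the number of rows and columns, both of which a real converter would likely need.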