Skip to content

Pdf/optimize#67

Merged
namtroi merged 2 commits into
mainfrom
pdf/optimize
Dec 28, 2025
Merged

Pdf/optimize#67
namtroi merged 2 commits into
mainfrom
pdf/optimize

Conversation

@namtroi

@namtroi namtroi commented Dec 28, 2025

Copy link
Copy Markdown
Owner

PR Type

Enhancement


Description

  • Optimize PDF conversion with PyMuPDF4LLM as default (~40x faster)

  • Keep Docling as optional high-quality alternative with conditional OCR

  • Split text converters into dedicated files (md, txt, json, docx)

  • Restore table conversion settings for XLSX/CSV files

  • Update UI to show PDF converter selection with conditional OCR fields


Diagram Walkthrough

flowchart LR
  A["PDF Processing"] -->|"pdfConverter: pymupdf"| B["PyMuPDFConverter<br/>Fast ~40x"]
  A -->|"pdfConverter: docling"| C["DoclingPdfConverter<br/>High Quality + OCR"]
  D["Text Converters"] -->|"Split into"| E["MarkdownConverter"]
  D -->|"Split into"| F["TxtConverter"]
  D -->|"Split into"| G["JsonConverter"]
  H["DOCX Processing"] -->|"Dedicated"| I["DocxConverter<br/>Docling-based"]
  J["Profile Config"] -->|"New Field"| K["pdfConverter Selection"]
  K -->|"Conditional"| L["OCR Settings<br/>Docling only"]
  K -->|"Restored"| M["Table Settings<br/>XLSX/CSV"]
Loading

File Walkthrough

Relevant files
Enhancement
12 files
pymupdf_converter.py
New fast PDF converter using PyMuPDF4LLM                                 
+97/-0   
docx_converter.py
New dedicated DOCX converter using Docling                             
+103/-0 
router.py
Added dynamic PDF converter selection logic                           
+25/-6   
main.py
Updated to use get_pdf_converter for dynamic selection     
+13/-7   
models.py
Updated ProfileConfig with pdfConverter field                       
+2/-5     
profile-routes.ts
Updated schema validation for new pdfConverter field         
+6/-8     
upload-route.ts
Updated profileConfig to include pdfConverter selection   
+3/-4     
sync-service.ts
Updated profile config building with pdfConverter               
+10/-11 
endpoints.ts
Updated ProcessingProfile interface with pdfConverter       
+6/-8     
profile-field-config.ts
Updated field labels and tooltips for new converter selection
+11/-15 
ProfileCard.tsx
Added conditional OCR display based on PDF converter choice
+19/-20 
ProfileFormDialog.tsx
Added PDF converter selector with conditional OCR fields 
+60/-17 
Refactoring
5 files
pdf_converter.py
Renamed to DoclingPdfConverter for clarity                             
+4/-3     
md_converter.py
New dedicated Markdown converter module                                   
+64/-0   
txt_converter.py
New dedicated plain text converter module                               
+67/-0   
json_converter.py
Extracted JSON converter with tabular detection                   
+24/-64 
__init__.py
Updated exports for new converter modules                               
+12/-5   
Configuration changes
1 files
schema.prisma
Added pdfConverter field, reorganized conversion settings
+4/-5     
Tests
6 files
database.ts
Updated test helper with new pdfConverter field                   
+3/-4     
profile-model.test.ts
Updated profile model tests for pdfConverter defaults       
+3/-4     
test_profile_config.py
Updated ProfileConfig tests with new defaults                       
+1/-4     
test_text_processor.py
Refactored tests for split converter modules                         
+76/-60 
test_existing_formats.py
Updated imports for new converter module structure             
+10/-5   
test_main.py
Updated mocks to use get_pdf_converter instead of get_converter
+4/-2     

- Add PyMuPDFConverter as default fast PDF converter (~40x faster)
- Keep DoclingPdfConverter as optional high-quality alternative
- Create dedicated DocxConverter (separated from PDF processing)
- Add pdfConverter field to ProcessingProfile for converter selection
- Conditional OCR settings (only shown when Docling is selected)
- Restore conversionTableRows/Cols for XLSX/CSV table rendering
- Update frontend ProfileCard and ProfileFormDialog with new fields
- Update backend routes, upload-route, sync-service

Closes: PDF conversion optimization task
Converter Refactoring:
- Split text_converter.py into md_converter.py, txt_converter.py, json_converter.py
- Each format now has dedicated converter file for easier optimization
- Updated __init__.py and router.py exports

Test Fixes:
- test_text_processor.py: Use new converter imports
- test_profile_config.py: Test pdfConverter instead of deprecated fields
- test_existing_formats.py: Fix broken converter imports
- test_main.py: Mock get_pdf_converter instead of get_converter for PDF
- database.ts: Add conversionTableRows/Cols to type
- profile-model.test.ts: Add table field assertions

All 222 tests passing
@namtroi namtroi merged commit c268d08 into main Dec 28, 2025
7 checks passed
@qodo-code-review

Copy link
Copy Markdown

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Exception leaked to user: The converter returns metadata={"error": str(e)} which can expose internal
exception details to user-facing responses instead of restricting details to secure logs.

Referred Code
except Exception as e:
    logger.exception("pymupdf_conversion_error", path=file_path, error=str(e))
    return ProcessorOutput(markdown="", metadata={"error": str(e)})

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Audit coverage unclear: The PR adds new processing behavior selection via profileConfig.pdfConverter but the diff
does not show whether changes to this setting (or its usage during processing) are
recorded in audit logs with user context.

Referred Code
// Build profileConfig for AI worker
const profileConfig = {
  pdfConverter: activeProfile.pdfConverter,
  pdfOcrMode: activeProfile.pdfOcrMode,
  pdfOcrLanguages: activeProfile.pdfOcrLanguages,
  conversionTableRows: activeProfile.conversionTableRows,
  conversionTableCols: activeProfile.conversionTableCols,
  documentChunkSize: activeProfile.documentChunkSize,

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Error log data risk: The PR logs exceptions with logger.exception(..., path=file_path, error=str(e)), and
without knowing what file_path/exception messages may contain in production, it is unclear
whether sensitive information could be logged.

Referred Code
    logger.info(
        "docx_conversion_complete",
        path=file_path,
        pages=page_count,
    )

    return ProcessorOutput(
        markdown=markdown,
        metadata={"converter": "docling"},
        page_count=page_count,
    )

except Exception as e:
    logger.exception("docx_conversion_error", path=file_path, error=str(e))
    return ProcessorOutput(markdown="", metadata={"error": str(e)})

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review

Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
High-level
Re-evaluate removal of Docling configuration

Retain the pdfNumThreads and pdfTableStructure configuration options for the
Docling PDF converter. These were removed, resulting in a functional regression
where performance and table extraction quality are no longer tunable by the
user.

Examples:

apps/backend/prisma/schema.prisma [209-213]
  pdfConverter           String  @default("pymupdf") @map("pdf_converter") // "pymupdf" | "docling"
  pdfOcrMode             String  @default("auto") @map("pdf_ocr_mode") // Only for docling
  pdfOcrLanguages        String  @default("en")   @map("pdf_ocr_languages")
  conversionTableRows    Int     @default(35)   @map("conversion_table_rows") // For XLSX/CSV
  conversionTableCols    Int     @default(20)   @map("conversion_table_cols") // For XLSX/CSV
apps/ai-worker/src/converters/docx_converter.py [43-52]
        accelerator_options = AcceleratorOptions(
            num_threads=4,
            device=AcceleratorDevice.CPU,
        )

        pipeline_options = PdfPipelineOptions(
            accelerator_options=accelerator_options,
            do_table_structure=False,
            do_ocr=False,
        )

Solution Walkthrough:

Before:

# apps/ai-worker/src/converters/docx_converter.py
class DocxConverter(FormatConverter):
    def _get_docling_converter(self):
        # ...
        accelerator_options = AcceleratorOptions(
            num_threads=4, # Hardcoded
            device=AcceleratorDevice.CPU,
        )
        pipeline_options = PdfPipelineOptions(
            # ...
            do_table_structure=False, # Hardcoded
            do_ocr=False,
        )
        # ...

# apps/backend/prisma/schema.prisma
// pdfNumThreads and pdfTableStructure are removed from the model

After:

# apps/ai-worker/src/converters/pdf_converter.py
class DoclingPdfConverter(FormatConverter):
    async def to_markdown(self, file_path: str, ocr_mode: str, num_threads: int, table_structure: bool):
        # ...
        converter = self._get_docling_converter(ocr_mode, num_threads, table_structure)
        # ...

    def _get_docling_converter(self, ocr_mode, num_threads, table_structure):
        # ...
        accelerator_options = AcceleratorOptions(num_threads=num_threads, ...)
        pipeline_options = PdfPipelineOptions(do_table_structure=table_structure, ...)
        # ...

# apps/backend/prisma/schema.prisma
model ProcessingProfile {
  // ...
  pdfConverter      String  @default("pymupdf")
  pdfOcrMode        String  @default("auto")
  pdfNumThreads     Int     @default(4)    // Re-added for Docling
  pdfTableStructure Boolean @default(false) // Re-added for Docling
  // ...
}
Suggestion importance[1-10]: 8

__

Why: This is a significant functional regression, as it removes user control over performance (pdfNumThreads) and table extraction quality (pdfTableStructure) for the high-quality docling converter, undermining its purpose.

Medium
General
Validate PDF converter input

In get_pdf_converter, add an else block to raise a ValueError if an unsupported
converter_type is provided, preventing silent fallback to the default.

apps/ai-worker/src/router.py [73-85]

 def get_pdf_converter(converter_type: str = "pymupdf") -> FormatConverter:
     """
     Get PDF converter based on profile setting.
 
     Args:
         converter_type: "pymupdf" (fast) or "docling" (high quality with OCR)
 
     Returns:
         FormatConverter instance for PDF
     """
     if converter_type == "docling":
         return DoclingPdfConverter()
-    return PyMuPDFConverter()
+    elif converter_type == "pymupdf":
+        return PyMuPDFConverter()
+    else:
+        raise ValueError(f"Unsupported pdfConverter value: {converter_type!r}")
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: This is a good defensive programming practice that makes the system more robust by explicitly handling invalid configuration values instead of silently falling back to a default.

Medium
Consolidate encoding detection

Refactor the code by moving the duplicated _detect_encoding method from
md_converter.py, txt_converter.py, and json_converter.py into the
FormatConverter base class.

apps/ai-worker/src/converters/md_converter.py [58-64]

+# In FormatConverter (base.py), add:
 def _detect_encoding(self, raw_bytes: bytes) -> str:
     """Detect file encoding with BOM check."""
     if raw_bytes.startswith(b"\xef\xbb\xbf"):
         return "utf-8-sig"
     result = chardet.detect(raw_bytes)
     encoding = result.get("encoding", "utf-8")
     return (encoding or "utf-8").lower()
 
+# Then remove the local `_detect_encoding` from md_converter.py
+
  • Apply / Chat
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies duplicated code and proposes moving the _detect_encoding method to the base class, which improves code maintainability by adhering to the DRY principle.

Low
Simplify redundant converter call logic

Remove the redundant elif file_format in ("txt", "md", "json"): block, as the
else block now correctly handles the method call for these and other converters.

apps/ai-worker/src/main.py [149-152]

-elif file_format in ("txt", "md", "json"):
-    output = await converter.to_markdown(request.filePath, file_format)
 else:
     output = await converter.to_markdown(request.filePath)
  • Apply / Chat
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that the elif block is redundant after refactoring the converters, and its removal simplifies the code by treating all remaining converters uniformly.

Low
Possible issue
Return zero pages on PDF error

In _get_page_count, change the return value in the except block from 1 to 0 to
accurately reflect that no pages were read if an error occurs.

apps/ai-worker/src/converters/pymupdf_converter.py [89-97]

 def _get_page_count(self, path: Path) -> int:
 """Get the number of pages in the PDF."""
 try:
     import fitz
 
     with fitz.open(str(path)) as doc:
         return len(doc)
 except Exception:
-    return 1
+    return 0
  • Apply / Chat
Suggestion importance[1-10]: 4

__

Why: The suggestion correctly points out that returning 1 for page count on an exception is misleading; returning 0 more accurately reflects the state of a failed document read.

Low
  • More

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant