Pdf/optimize by namtroi · Pull Request #67 · namtroi/RAGBase

namtroi · 2025-12-28T23:55:58Z

PR Type

Enhancement

Description

Optimize PDF conversion with PyMuPDF4LLM as default (~40x faster)
Keep Docling as optional high-quality alternative with conditional OCR
Split text converters into dedicated files (md, txt, json, docx)
Restore table conversion settings for XLSX/CSV files
Update UI to show PDF converter selection with conditional OCR fields

Diagram Walkthrough

flowchart LR
  A["PDF Processing"] -->|"pdfConverter: pymupdf"| B["PyMuPDFConverter<br/>Fast ~40x"]
  A -->|"pdfConverter: docling"| C["DoclingPdfConverter<br/>High Quality + OCR"]
  D["Text Converters"] -->|"Split into"| E["MarkdownConverter"]
  D -->|"Split into"| F["TxtConverter"]
  D -->|"Split into"| G["JsonConverter"]
  H["DOCX Processing"] -->|"Dedicated"| I["DocxConverter<br/>Docling-based"]
  J["Profile Config"] -->|"New Field"| K["pdfConverter Selection"]
  K -->|"Conditional"| L["OCR Settings<br/>Docling only"]
  K -->|"Restored"| M["Table Settings<br/>XLSX/CSV"]

File Walkthrough

Relevant files

Enhancement

12 files

pymupdf_converter.py `New fast PDF converter using PyMuPDF4LLM`	+97/-0
docx_converter.py `New dedicated DOCX converter using Docling`	+103/-0
router.py `Added dynamic PDF converter selection logic`	+25/-6
main.py `Updated to use get_pdf_converter for dynamic selection`	+13/-7
models.py `Updated ProfileConfig with pdfConverter field`	+2/-5
profile-routes.ts `Updated schema validation for new pdfConverter field`	+6/-8
upload-route.ts `Updated profileConfig to include pdfConverter selection`	+3/-4
sync-service.ts `Updated profile config building with pdfConverter`	+10/-11
endpoints.ts `Updated ProcessingProfile interface with pdfConverter`	+6/-8
profile-field-config.ts `Updated field labels and tooltips for new converter selection`	+11/-15
ProfileCard.tsx `Added conditional OCR display based on PDF converter choice`	+19/-20
ProfileFormDialog.tsx `Added PDF converter selector with conditional OCR fields`	+60/-17

Refactoring

5 files

pdf_converter.py `Renamed to DoclingPdfConverter for clarity`	+4/-3
md_converter.py `New dedicated Markdown converter module`	+64/-0
txt_converter.py `New dedicated plain text converter module`	+67/-0
json_converter.py `Extracted JSON converter with tabular detection`	+24/-64
__init__.py `Updated exports for new converter modules`	+12/-5

Configuration changes

1 files

schema.prisma `Added pdfConverter field, reorganized conversion settings`	+4/-5

Tests

6 files

database.ts `Updated test helper with new pdfConverter field`	+3/-4
profile-model.test.ts `Updated profile model tests for pdfConverter defaults`	+3/-4
test_profile_config.py `Updated ProfileConfig tests with new defaults`	+1/-4
test_text_processor.py `Refactored tests for split converter modules`	+76/-60
test_existing_formats.py `Updated imports for new converter module structure`	+10/-5
test_main.py `Updated mocks to use get_pdf_converter instead of get_converter`	+4/-2

- Add PyMuPDFConverter as default fast PDF converter (~40x faster) - Keep DoclingPdfConverter as optional high-quality alternative - Create dedicated DocxConverter (separated from PDF processing) - Add pdfConverter field to ProcessingProfile for converter selection - Conditional OCR settings (only shown when Docling is selected) - Restore conversionTableRows/Cols for XLSX/CSV table rendering - Update frontend ProfileCard and ProfileFormDialog with new fields - Update backend routes, upload-route, sync-service Closes: PDF conversion optimization task

Converter Refactoring: - Split text_converter.py into md_converter.py, txt_converter.py, json_converter.py - Each format now has dedicated converter file for easier optimization - Updated __init__.py and router.py exports Test Fixes: - test_text_processor.py: Use new converter imports - test_profile_config.py: Test pdfConverter instead of deprecated fields - test_existing_formats.py: Fix broken converter imports - test_main.py: Mock get_pdf_converter instead of get_converter for PDF - database.ts: Add conversionTableRows/Cols to type - profile-model.test.ts: Add table field assertions All 222 tests passing

qodo-code-review · 2025-12-28T23:56:47Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢	No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Robust Error Handling and Edge Case Management Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Security-First Input Validation and Data Handling Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
🔴	Generic: Secure Error Handling Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Exception leaked to user: The converter returns `metadata={"error": str(e)}` which can expose internal exception details to user-facing responses instead of restricting details to secure logs. Referred Code except Exception as e: logger.exception("pymupdf_conversion_error", path=file_path, error=str(e)) return ProcessorOutput(markdown="", metadata={"error": str(e)}) Learn more about managing compliance generic rules or creating your own custom rules
⚪	Generic: Comprehensive Audit Trails Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance. Status: Audit coverage unclear: The PR adds new processing behavior selection via `profileConfig.pdfConverter` but the diff does not show whether changes to this setting (or its usage during processing) are recorded in audit logs with user context. Referred Code // Build profileConfig for AI worker const profileConfig = { pdfConverter: activeProfile.pdfConverter, pdfOcrMode: activeProfile.pdfOcrMode, pdfOcrLanguages: activeProfile.pdfOcrLanguages, conversionTableRows: activeProfile.conversionTableRows, conversionTableCols: activeProfile.conversionTableCols, documentChunkSize: activeProfile.documentChunkSize, Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Logging Practices Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Error log data risk: The PR logs exceptions with `logger.exception(..., path=file_path, error=str(e))`, and without knowing what `file_path`/exception messages may contain in production, it is unclear whether sensitive information could be logged. Referred Code logger.info( "docx_conversion_complete", path=file_path, pages=page_count, ) return ProcessorOutput( markdown=markdown, metadata={"converter": "docling"}, page_count=page_count, ) except Exception as e: logger.exception("docx_conversion_error", path=file_path, error=str(e)) return ProcessorOutput(markdown="", metadata={"error": str(e)}) Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend

🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

qodo-code-review · 2025-12-28T23:58:04Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
High-level	Re-evaluate removal of Docling configuration Retain the `pdfNumThreads` and `pdfTableStructure` configuration options for the Docling PDF converter. These were removed, resulting in a functional regression where performance and table extraction quality are no longer tunable by the user. Examples: apps/backend/prisma/schema.prisma [209-213] pdfConverter String @default("pymupdf") @map("pdf_converter") // "pymupdf" \| "docling" pdfOcrMode String @default("auto") @map("pdf_ocr_mode") // Only for docling pdfOcrLanguages String @default("en") @map("pdf_ocr_languages") conversionTableRows Int @default(35) @map("conversion_table_rows") // For XLSX/CSV conversionTableCols Int @default(20) @map("conversion_table_cols") // For XLSX/CSV apps/ai-worker/src/converters/docx_converter.py [43-52] accelerator_options = AcceleratorOptions( num_threads=4, device=AcceleratorDevice.CPU, ) pipeline_options = PdfPipelineOptions( accelerator_options=accelerator_options, do_table_structure=False, do_ocr=False, ) Solution Walkthrough: Before: # apps/ai-worker/src/converters/docx_converter.py class DocxConverter(FormatConverter): def _get_docling_converter(self): # ... accelerator_options = AcceleratorOptions( num_threads=4, # Hardcoded device=AcceleratorDevice.CPU, ) pipeline_options = PdfPipelineOptions( # ... do_table_structure=False, # Hardcoded do_ocr=False, ) # ... # apps/backend/prisma/schema.prisma // pdfNumThreads and pdfTableStructure are removed from the model After: # apps/ai-worker/src/converters/pdf_converter.py class DoclingPdfConverter(FormatConverter): async def to_markdown(self, file_path: str, ocr_mode: str, num_threads: int, table_structure: bool): # ... converter = self._get_docling_converter(ocr_mode, num_threads, table_structure) # ... def _get_docling_converter(self, ocr_mode, num_threads, table_structure): # ... accelerator_options = AcceleratorOptions(num_threads=num_threads, ...) pipeline_options = PdfPipelineOptions(do_table_structure=table_structure, ...) # ... # apps/backend/prisma/schema.prisma model ProcessingProfile { // ... pdfConverter String @default("pymupdf") pdfOcrMode String @default("auto") pdfNumThreads Int @default(4) // Re-added for Docling pdfTableStructure Boolean @default(false) // Re-added for Docling // ... } Suggestion importance[1-10]: 8 __ Why: This is a significant functional regression, as it removes user control over performance (`pdfNumThreads`) and table extraction quality (`pdfTableStructure`) for the high-quality `docling` converter, undermining its purpose.	Medium
General	Validate PDF converter input In `get_pdf_converter`, add an `else` block to raise a `ValueError` if an unsupported `converter_type` is provided, preventing silent fallback to the default. apps/ai-worker/src/router.py [73-85] def get_pdf_converter(converter_type: str = "pymupdf") -> FormatConverter: """ Get PDF converter based on profile setting. Args: converter_type: "pymupdf" (fast) or "docling" (high quality with OCR) Returns: FormatConverter instance for PDF """ if converter_type == "docling": return DoclingPdfConverter() - return PyMuPDFConverter() + elif converter_type == "pymupdf": + return PyMuPDFConverter() + else: + raise ValueError(f"Unsupported pdfConverter value: {converter_type!r}") Apply / Chat Suggestion importance[1-10]: 7 __ Why: This is a good defensive programming practice that makes the system more robust by explicitly handling invalid configuration values instead of silently falling back to a default.	Medium
	Consolidate encoding detection Refactor the code by moving the duplicated `_detect_encoding` method from `md_converter.py`, `txt_converter.py`, and `json_converter.py` into the `FormatConverter` base class. apps/ai-worker/src/converters/md_converter.py [58-64] +# In FormatConverter (base.py), add: def _detect_encoding(self, raw_bytes: bytes) -> str: """Detect file encoding with BOM check.""" if raw_bytes.startswith(b"\xef\xbb\xbf"): return "utf-8-sig" result = chardet.detect(raw_bytes) encoding = result.get("encoding", "utf-8") return (encoding or "utf-8").lower() +# Then remove the local `_detect_encoding` from md_converter.py + Apply / Chat Suggestion importance[1-10]: 6 __ Why: The suggestion correctly identifies duplicated code and proposes moving the `_detect_encoding` method to the base class, which improves code maintainability by adhering to the DRY principle.	Low
	Simplify redundant converter call logic Remove the redundant `elif file_format in ("txt", "md", "json"):` block, as the `else` block now correctly handles the method call for these and other converters. apps/ai-worker/src/main.py [149-152] -elif file_format in ("txt", "md", "json"): - output = await converter.to_markdown(request.filePath, file_format) else: output = await converter.to_markdown(request.filePath) Apply / Chat Suggestion importance[1-10]: 5 __ Why: The suggestion correctly identifies that the `elif` block is redundant after refactoring the converters, and its removal simplifies the code by treating all remaining converters uniformly.	Low
Possible issue	Return zero pages on PDF error In `_get_page_count`, change the return value in the `except` block from `1` to `0` to accurately reflect that no pages were read if an error occurs. apps/ai-worker/src/converters/pymupdf_converter.py [89-97] def _get_page_count(self, path: Path) -> int: """Get the number of pages in the PDF.""" try: import fitz with fitz.open(str(path)) as doc: return len(doc) except Exception: - return 1 + return 0 Apply / Chat Suggestion importance[1-10]: 4 __ Why: The suggestion correctly points out that returning `1` for page count on an exception is misleading; returning `0` more accurately reflects the state of a failed document read.	Low
More

namtroi added 2 commits December 28, 2025 15:33

namtroi merged commit c268d08 into main Dec 28, 2025
7 checks passed

qodo-code-review Bot added the Review effort 3/5 label Dec 28, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pdf/optimize#67

Pdf/optimize#67
namtroi merged 2 commits into
mainfrom
pdf/optimize

namtroi commented Dec 28, 2025 •

edited by qodo-code-review Bot

Loading

Uh oh!

Uh oh!

qodo-code-review Bot commented Dec 28, 2025

Uh oh!

qodo-code-review Bot commented Dec 28, 2025

Examples:

Solution Walkthrough:

Before:

After:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

namtroi commented Dec 28, 2025 • edited by qodo-code-review Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Type

Description

Diagram Walkthrough

File Walkthrough

Uh oh!

Uh oh!

qodo-code-review Bot commented Dec 28, 2025

PR Compliance Guide 🔍

Uh oh!

qodo-code-review Bot commented Dec 28, 2025

PR Code Suggestions ✨

Examples:

Solution Walkthrough:

Before:

After:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

namtroi commented Dec 28, 2025 •

edited by qodo-code-review Bot

Loading