Phase4/refactor#57
…into FormatConverter base class
- Add _sanitize_raw() and _post_process() methods to FormatConverter
- Update all 8 converters to use inherited methods
- Remove InputSanitizer from ProcessingPipeline (avoid duplication)
- Add test_base_converter.py with 13 unit tests
Processing flow: Converter (Sanitize → Format → Normalize) → Pipeline (Chunk → Quality → Embed)
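The converter-side flow (Sanitize → Format → Normalize) can be sketched as a template method on the base class. This is a minimal illustration, not the PR's actual code: only the names FormatConverter, _sanitize_raw and _post_process come from the commit message. The `_format` hook, the `convert` entry point, the `PlainTextConverter` subclass, the control-character pattern and the specific normalization steps are all assumptions.

```python
import re
import unicodedata
from abc import ABC, abstractmethod

# Assumed pattern: C0 control characters other than tab, LF and CR,
# plus DEL (0x7F). The real pattern in the PR may differ.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


class FormatConverter(ABC):
    """Hypothetical base class: sanitize -> format -> normalize."""

    def _sanitize_raw(self, text: str) -> str:
        # Assumed behavior: unify line endings, then drop control characters.
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        return _CONTROL_CHARS.sub("", text)

    def _post_process(self, text: str) -> str:
        # Assumed behavior: NFC-normalize, collapse 3+ newlines to one
        # blank line, and end the output with a single newline.
        text = unicodedata.normalize("NFC", text)
        text = re.sub(r"\n{3,}", "\n\n", text)
        return text.strip() + "\n"

    @abstractmethod
    def _format(self, text: str) -> str:
        """Format-specific conversion (hypothetical hook name)."""

    def convert(self, raw: str) -> str:
        # Sanitize the earliest available text once, then format, then normalize.
        return self._post_process(self._format(self._sanitize_raw(raw)))


class PlainTextConverter(FormatConverter):
    """Illustrative subclass that passes text through unchanged."""

    def _format(self, text: str) -> str:
        return text
```

With sanitization and normalization living in the base class, each concrete converter only supplies its format-specific step, and the pipeline can assume its input is already clean before chunking.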
- pdf_converter: Sanitize markdown output after Docling extraction
- xlsx_converter: Sanitize markdown output before post-processing
- Consistent with other converters: sanitize earliest available text once
- backend: Add stale job protection and network error retry logic in JobProcessor
- frontend: Add driveWebViewLink to Document model and link icon in DocumentCard
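The backend change combines two ideas: skip jobs that have gone stale, and retry on transient network errors. The sketch below shows one way these could fit together. It is an assumption-heavy illustration: the PR does not say how JobProcessor defines staleness or which errors it retries, so `Job`, `is_stale`, `run_with_retry`, the 600-second age limit and the exponential backoff are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    """Hypothetical job record."""
    job_id: str
    started_at: float  # epoch seconds when the job was claimed


def is_stale(job: Job, now: float, max_age_s: float = 600.0) -> bool:
    # Assumed rule: a job is stale once it is older than max_age_s.
    return now - job.started_at > max_age_s


def run_with_retry(
    job: Job,
    work: Callable[[Job], None],
    *,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> str:
    """Run work(job), retrying network errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        # Re-check staleness before every attempt so a long backoff
        # cannot resurrect a job that has aged out in the meantime.
        if is_stale(job, clock()):
            return "stale"
        try:
            work(job)
            return "done"
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                return "failed"
            sleep(base_delay_s * 2 ** (attempt - 1))
    return "failed"
```

Passing `sleep` and `clock` as parameters keeps the function testable without real delays; non-network exceptions propagate unchanged so genuine bugs are not silently retried.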
PR Type
Enhancement, Bug fix
Description
Refactor converters to integrate sanitization and normalization into base class
Enhance job processing with stale job protection and network error retry logic
Add Google Drive web view links to frontend Document model and UI
Improve chunkers and converters for Phase 4 optimization
File Walkthrough
11 files
- Add stale job protection and network error retry logic
- Add driveWebViewLink field to Document model
- Add Google Drive external link icon to document card
- Remove InputSanitizer, add smart number formatting
- Implement custom HTML-to-Markdown conversion, remove InputSanitizer
- Remove markdownify dependency, add metadata extraction
- Improve slide marker injection and metadata extraction
- Add encoding detection and JSON tabular support
- Add smart formatting and context-aware sentence serialization
- Extract slide titles and add title metadata to chunks
- Improve code block extraction and empty section removal
2 files
- Add _sanitize_raw and _post_process methods to base class
- Remove InputSanitizer from pipeline, delegate to converters
3 files
- Add sanitization and post-processing to PDF output
- Fix delimiter handling and improve sheet name injection
- Add DEL character to control character removal pattern
1 file
- Add 13 unit tests for base converter sanitization and normalization