Phase4/refactor by namtroi · Pull Request #57 · namtroi/RAGBase

namtroi · 2025-12-26T03:23:09Z

PR Type

Enhancement, Bug fix

Description

Refactor converters to integrate sanitization and normalization into base class
- Add _sanitize_raw() and _post_process() methods to FormatConverter
- Remove duplicate InputSanitizer instances from 8 converters
- Add test_base_converter.py with 13 unit tests
Enhance job processing with stale job protection and network error retry logic
- Skip already-completed documents to prevent reprocessing
- Detect and retry network errors separately from permanent failures
- Improve error classification and logging
Add Google Drive web view links to frontend Document model and UI
- Add driveWebViewLink field to Document interface
- Display ExternalLink icon in DocumentCard for Drive documents
Improve chunkers and converters for Phase 4 optimization
- PresentationChunker: Extract slide titles and add metadata
- TabularChunker: Fix delimiter handling and improve sheet name injection
- CsvConverter: Add smart number formatting (currency, percentages, thousands separators)
- EpubConverter: Implement custom HTML-to-Markdown conversion preserving structure
- HtmlConverter: Remove markdownify dependency, add metadata extraction
- TextConverter: Add encoding detection and JSON tabular support
- XlsxConverter: Add smart formatting and context-aware sentence serialization

Diagram Walkthrough

flowchart LR
  A["Raw File"] -->|Converter| B["Sanitize + Format + Normalize"]
  B -->|Pipeline| C["Chunk"]
  C -->|Pipeline| D["Quality Analysis"]
  D -->|Pipeline| E["Embed"]
  F["Job Processor"] -->|Check Status| G["Skip if Complete"]
  G -->|Network Error| H["Retry Logic"]
  H -->|Permanent Error| I["Mark Failed"]
  J["Drive Document"] -->|driveWebViewLink| K["Frontend Display"]

File Walkthrough

Relevant files

Enhancement

11 files

job-processor.ts `Add stale job protection and network error retry logic`	+29/-4
endpoints.ts `Add driveWebViewLink field to Document model`	+2/-0
document-card.tsx `Add Google Drive external link icon to document card`	+13/-1
csv_converter.py `Remove InputSanitizer, add smart number formatting`	+116/-15
epub_converter.py `Implement custom HTML-to-Markdown conversion, remove InputSanitizer`	+134/-50
html_converter.py `Remove markdownify dependency, add metadata extraction`	+162/-47
pptx_converter.py `Improve slide marker injection and metadata extraction`	+61/-11
text_converter.py `Add encoding detection and JSON tabular support`	+91/-26
xlsx_converter.py `Add smart formatting and context-aware sentence serialization`	+106/-20
presentation_chunker.py `Extract slide titles and add title metadata to chunks`	+30/-6
normalizer.py `Improve code block extraction and empty section removal`	+51/-75

Refactoring

2 files

base.py `Add _sanitize_raw and _post_process methods to base class`	+32/-0
pipeline.py `Remove InputSanitizer from pipeline, delegate to converters`	+3/-7

Bug fix

3 files

pdf_converter.py `Add sanitization and post-processing to PDF output`	+2/-0
tabular_chunker.py `Fix delimiter handling and improve sheet name injection`	+32/-45
sanitizer.py `Add DEL character to control character removal pattern`	+2/-4

Tests

1 files

test_base_converter.py `Add 13 unit tests for base converter sanitization and normalization`	+121/-0

…into FormatConverter base class - Add _sanitize_raw() and _post_process() methods to FormatConverter - Update all 8 converters to use inherited methods - Remove InputSanitizer from ProcessingPipeline (avoid duplication) - Add test_base_converter.py with 13 unit tests Processing flow: Converter (Sanitize → Format → Normalize) → Pipeline (Chunk → Quality → Embed)

- pdf_converter: Sanitize markdown output after Docling extraction - xlsx_converter: Sanitize markdown output before post-processing - Consistent with other converters: sanitize earliest available text once

- backend: Add stale job protection and network error retry logic in JobProcessor - frontend: Add driveWebViewLink to Document model and link icon in DocumentCard

qodo-code-review · 2025-12-26T03:24:01Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
⚪	Unsafe URL injection Description: Untrusted `document.driveWebViewLink` is rendered directly into an without scheme/host allowlisting, enabling `javascript:`/`data:` URL injection (client-side XSS) or phishing via attacker-controlled links if this field can be influenced by untrusted input. document-card.tsx [82-97] Referred Code {document.sourceType === 'DRIVE' && ( <span className="flex items-center gap-1 text-blue-600 bg-blue-50 px-1.5 py-0.5 rounded"> <FolderSync className="w-3 h-3" /> Drive {document.driveWebViewLink && ( <a href={document.driveWebViewLink} target="_blank" rel="noopener noreferrer" onClick={(e) => e.stopPropagation()} className="ml-1 hover:text-blue-800 transition-colors" title="Open in Google Drive" > <ExternalLink className="w-3 h-3" /> </a> )}
	Unsafe Markdown links Description: The converter emits Markdown links using raw HTML `href` values (`{content}`) without validating schemes, so malicious `javascript:`/`data:` URLs from input HTML could persist into produced Markdown and become an XSS/phishing vector if the Markdown is later rendered to users. html_converter.py [190-193] Referred Code elif element.name == "a": href = element.get("href", "") return f"[{content}]({href})" if href else content
	Unvalidated href in conversion Description: The EPUB HTML-to-Markdown conversion preserves into Markdown links without scheme validation, allowing malicious `javascript:`/`data:` URLs from EPUB content to propagate into stored Markdown and potentially trigger XSS/phishing if rendered downstream. epub_converter.py [176-179] Referred Code elif element.name == "a": href = element.get("href", "") return f"[{content}]({href})" if href else content
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
🔴	Generic: Robust Error Handling and Edge Case Management Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation Status: Unsafe error inspection: The new network/permanent error classification assumes `err.message` is always present and will throw if `err` is not an `Error`, potentially masking the original failure and breaking retry/failed-state handling. Referred Code const isNetworkError = err.message.includes('fetch failed') \|\| err.message.includes('ECONNREFUSED') \|\| err.message.includes('ETIMEDOUT') \|\| err.message.includes('AI worker failed: 5'); // 5xx server errors const isPermanent = PERMANENT_ERRORS.some(code => err.message.includes(code) ); // Network errors should keep retrying until exhausted, not fail immediately const shouldMarkFailed = (isPermanent && !isNetworkError) \|\| err instanceof UnrecoverableError \|\| job.attemptsMade >= (job.opts.attempts ?? 3); Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Logging Practices Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Unstructured error log: The new CSV error logging uses an interpolated string (`logger.error(f"...")`) rather than structured logging, reducing auditability and increasing the risk of logging sensitive exception content without fields-based controls. Referred Code logger.error(f"Error converting CSV {file_path}: {e}") return ProcessorOutput(markdown="", metadata={"error": str(e)}) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Security-First Input Validation and Data Handling Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities Status: Unvalidated external URL: The new `driveWebViewLink` is rendered directly into an `<a href=...>` without validating scheme/host (e.g., ensuring `https://drive.google.com/...`), enabling potential malicious URLs (including `javascript:`) if the backend value is compromised. Referred Code {document.driveWebViewLink && ( <a href={document.driveWebViewLink} target="_blank" rel="noopener noreferrer" onClick={(e) => e.stopPropagation()} className="ml-1 hover:text-blue-800 transition-colors" title="Open in Google Drive" > <ExternalLink className="w-3 h-3" /> </a> Learn more about managing compliance generic rules or creating your own custom rules
⚪	Generic: Comprehensive Audit Trails Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance. Status: Missing user context: New job lifecycle logs (e.g., `job_skipped_already_complete`) include `documentId` but no user identifier, so it is unclear whether this processing is attributable to a user or is a system-only action. Referred Code // Part 3: Skip already-completed documents (stale job protection) const existing = await prisma.document.findUnique({ where: { id: job.data.documentId }, select: { status: true, processedContent: true } }); if (existing?.status === 'COMPLETED' && existing?.processedContent) { logger.info({ documentId: job.data.documentId }, 'job_skipped_already_complete'); return { skipped: true, documentId: job.data.documentId }; } Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Error Handling Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Sensitive error content: The new `failReason` persisted to the database is derived directly from `err.message` (which may include upstream response bodies), and it is unclear from the diff whether `failReason` is user-visible and therefore could leak internal details. Referred Code // Classify error for better debugging let failReason = err.message.slice(0, 500); if (isNetworkError && job.attemptsMade >= (job.opts.attempts ?? 3)) { failReason = `AI_WORKER_UNREACHABLE: ${failReason}`; } await prisma.document.update({ where: { id: job.data.documentId }, data: { status: 'FAILED', failReason, retryCount: job.attemptsMade, }, }); Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend

🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

qodo-code-review · 2025-12-26T03:25:24Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Restore default slide marker Restore the slide marker in `PresentationChunker` to its original value of `""`. The current empty string value will cause a runtime error during chunking. apps/ai-worker/src/chunkers/presentation_chunker.py [19] -self.marker = "" +self.marker = "<!-- slide -->" Apply / Chat Suggestion importance[1-10]: 9 __ Why: This suggestion correctly identifies a critical bug where setting the slide marker to `""` would cause the `split` operation to raise a `ValueError`, breaking the presentation chunking logic. Restoring the original marker is the correct fix.	High
Possible issue	Prevent applying multiple formatting rules Modify `_format_smart_value` to return immediately after applying a formatting rule. This prevents a value from being formatted multiple times, such as being formatted as currency and then incorrectly re-formatted as a large number. apps/ai-worker/src/converters/csv_converter.py [119-174] def _format_smart_value(self, header: str, value: str) -> str: """ Phase 4: Smart Number Formatting. Applies semantic formatting based on header keywords and value type. """ header_lower = header.lower() # 1. Clean strings value = value.strip() if not value: return "" # 2. Try to parse number try: # Remove existing commas if present to check numeric validity clean_val = value.replace(",", "") float_val = float(clean_val) is_number = True except ValueError: is_number = False if is_number: # Currency Detection if any( k in header_lower for k in ["revenue", "price", "cost", "salary", "usd", "amount"] ): # If not already formatted with currency symbol if not value.startswith("$") and not value.startswith("€"): return f"${float_val:,.2f}" # Percentage Detection if "rate" in header_lower or "percent" in header_lower: - if float_val < 1.0: # Assumption: 0.45 -> 45% + if float_val >= 0 and float_val <= 1: # Assumption: 0.45 -> 45% return f"{float_val:.1%}" return f"{float_val}%" # Large Numbers (Thousands separator) if float_val > 999: # Check if it's an ID (usually doesn't need commas) if ( "id" not in header_lower and "code" not in header_lower and "year" not in header_lower ): - return ( - f"{float_val:,.0f}" - if float_val.is_integer() - else f"{float_val:,.2f}" - ) + if float_val.is_integer(): + return f"{float_val:,.0f}" + else: + return f"{float_val:,.2f}" # Date Detection (basic keywords) # Note: Extensive date parsing is complex; assuming standard ISO or simple formats. # If specific Excel serial date conversion is needed, it would go here. return value Apply / Chat Suggestion importance[1-10]: 8 __ Why: This suggestion correctly identifies a significant bug where multiple, mutually exclusive formatting rules could be applied to the same value, leading to incorrect output. The fix is accurate and critical for the feature's correctness.	Medium
High-level	Extract duplicated HTML parsing logic The custom HTML-to-Markdown parsing logic is duplicated across `epub_converter.py` and `html_converter.py`. This logic should be extracted into a single shared utility to improve maintainability. Examples: apps/ai-worker/src/converters/epub_converter.py [103-192] def _html_to_markdown(self, html_content: str) -> str: """ Parses HTML and converts semantic tags to Markdown. We do this manually to avoid extra dependencies (like markdownify) and to strictly control the output for RAG optimization. """ soup = BeautifulSoup(html_content, "lxml") # Remove script and style elements for script in soup(["script", "style", "meta", "link", "noscript"]): ... (clipped 80 lines) apps/ai-worker/src/converters/html_converter.py [112-204] def _html_to_markdown(self, root_element: Tag) -> str: """ Recursively converts HTML to Markdown. Ensures consistent handling of headers and semantics for RAG. """ def process_element(element): if isinstance(element, NavigableString): text = str(element).strip() if text: ... (clipped 83 lines) Solution Walkthrough: Before: # In epub_converter.py class EpubConverter(FormatConverter): def _html_to_markdown(self, html_content: str) -> str: def process_element(element): # ... complex recursive logic ... if element.name == "h1": return f"\n\n# {content}\n\n" # ... return process_element(root) # In html_converter.py class HtmlConverter(FormatConverter): def _html_to_markdown(self, root_element: Tag) -> str: def process_element(element): # ... nearly identical complex recursive logic ... if element.name == "h1": return f"\n\n# {content}\n\n" # ... return process_element(root_element) After: # In a new file, e.g., converters/utils/html_parser.py def html_to_markdown_util(root_element: Tag) -> str: def process_element(element): # ... complex recursive logic in one place ... if element.name == "h1": return f"\n\n# {content}\n\n" # ... return process_element(root_element) # In epub_converter.py and html_converter.py from .utils.html_parser import html_to_markdown_util class EpubConverter(FormatConverter): def _html_to_markdown(self, html_content: str) -> str: soup = BeautifulSoup(html_content, "lxml") return html_to_markdown_util(soup.body or soup) class HtmlConverter(FormatConverter): def _html_to_markdown(self, root_element: Tag) -> str: return html_to_markdown_util(root_element) Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies significant duplication of complex, recursive HTML parsing logic introduced in `epub_converter.py` and `html_converter.py`, and centralizing it would greatly improve code maintainability.	Medium
More

namtroi added 3 commits December 25, 2025 15:13

fix: Add InputSanitizer to PDF and XLSX converters

30793a4

- pdf_converter: Sanitize markdown output after Docling extraction - xlsx_converter: Sanitize markdown output before post-processing - Consistent with other converters: sanitize earliest available text once

feat: Enhance job processing and add Drive links to UI

6c347c6

- backend: Add stale job protection and network error retry logic in JobProcessor - frontend: Add driveWebViewLink to Document model and link icon in DocumentCard

namtroi merged commit ffc34aa into main Dec 26, 2025
6 of 7 checks passed

qodo-code-review Bot added the Review effort 4/5 label Dec 26, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase4/refactor#57

Phase4/refactor#57
namtroi merged 3 commits into
mainfrom
phase4/refactor

namtroi commented Dec 26, 2025 •

edited by qodo-code-review Bot

Loading

Uh oh!

Uh oh!

qodo-code-review Bot commented Dec 26, 2025

Uh oh!

qodo-code-review Bot commented Dec 26, 2025

Examples:

Solution Walkthrough:

Before:

After:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

namtroi commented Dec 26, 2025 • edited by qodo-code-review Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Type

Description

Diagram Walkthrough

File Walkthrough

Uh oh!

Uh oh!

qodo-code-review Bot commented Dec 26, 2025

PR Compliance Guide 🔍

Uh oh!

qodo-code-review Bot commented Dec 26, 2025

PR Code Suggestions ✨

Examples:

Solution Walkthrough:

Before:

After:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

namtroi commented Dec 26, 2025 •

edited by qodo-code-review Bot

Loading