Skip to content

Phase4/refactor#57

Merged
namtroi merged 3 commits into
mainfrom
phase4/refactor
Dec 26, 2025
Merged

Phase4/refactor#57
namtroi merged 3 commits into
mainfrom
phase4/refactor

Conversation

@namtroi

@namtroi namtroi commented Dec 26, 2025

Copy link
Copy Markdown
Owner

PR Type

Enhancement, Bug fix


Description

  • Refactor converters to integrate sanitization and normalization into base class

    • Add _sanitize_raw() and _post_process() methods to FormatConverter
    • Remove duplicate InputSanitizer instances from 8 converters
    • Add test_base_converter.py with 13 unit tests
  • Enhance job processing with stale job protection and network error retry logic

    • Skip already-completed documents to prevent reprocessing
    • Detect and retry network errors separately from permanent failures
    • Improve error classification and logging
  • Add Google Drive web view links to frontend Document model and UI

    • Add driveWebViewLink field to Document interface
    • Display ExternalLink icon in DocumentCard for Drive documents
  • Improve chunkers and converters for Phase 4 optimization

    • PresentationChunker: Extract slide titles and add metadata
    • TabularChunker: Fix delimiter handling and improve sheet name injection
    • CsvConverter: Add smart number formatting (currency, percentages, thousands separators)
    • EpubConverter: Implement custom HTML-to-Markdown conversion preserving structure
    • HtmlConverter: Remove markdownify dependency, add metadata extraction
    • TextConverter: Add encoding detection and JSON tabular support
    • XlsxConverter: Add smart formatting and context-aware sentence serialization

Diagram Walkthrough

flowchart LR
  A["Raw File"] -->|Converter| B["Sanitize + Format + Normalize"]
  B -->|Pipeline| C["Chunk"]
  C -->|Pipeline| D["Quality Analysis"]
  D -->|Pipeline| E["Embed"]
  F["Job Processor"] -->|Check Status| G["Skip if Complete"]
  G -->|Network Error| H["Retry Logic"]
  H -->|Permanent Error| I["Mark Failed"]
  J["Drive Document"] -->|driveWebViewLink| K["Frontend Display"]
Loading

File Walkthrough

Relevant files
Enhancement
11 files
job-processor.ts
Add stale job protection and network error retry logic     
+29/-4   
endpoints.ts
Add driveWebViewLink field to Document model                         
+2/-0     
document-card.tsx
Add Google Drive external link icon to document card         
+13/-1   
csv_converter.py
Remove InputSanitizer, add smart number formatting             
+116/-15
epub_converter.py
Implement custom HTML-to-Markdown conversion, remove InputSanitizer
+134/-50
html_converter.py
Remove markdownify dependency, add metadata extraction     
+162/-47
pptx_converter.py
Improve slide marker injection and metadata extraction     
+61/-11 
text_converter.py
Add encoding detection and JSON tabular support                   
+91/-26 
xlsx_converter.py
Add smart formatting and context-aware sentence serialization
+106/-20
presentation_chunker.py
Extract slide titles and add title metadata to chunks       
+30/-6   
normalizer.py
Improve code block extraction and empty section removal   
+51/-75 
Refactoring
2 files
base.py
Add _sanitize_raw and _post_process methods to base class
+32/-0   
pipeline.py
Remove InputSanitizer from pipeline, delegate to converters
+3/-7     
Bug fix
3 files
pdf_converter.py
Add sanitization and post-processing to PDF output             
+2/-0     
tabular_chunker.py
Fix delimiter handling and improve sheet name injection   
+32/-45 
sanitizer.py
Add DEL character to control character removal pattern     
+2/-4     
Tests
1 files
test_base_converter.py
Add 13 unit tests for base converter sanitization and normalization
+121/-0 

…into FormatConverter base class

- Add _sanitize_raw() and _post_process() methods to FormatConverter
- Update all 8 converters to use inherited methods
- Remove InputSanitizer from ProcessingPipeline (avoid duplication)
- Add test_base_converter.py with 13 unit tests

Processing flow: Converter (Sanitize → Format → Normalize) → Pipeline (Chunk → Quality → Embed)
- pdf_converter: Sanitize markdown output after Docling extraction
- xlsx_converter: Sanitize markdown output before post-processing
- Consistent with other converters: sanitize earliest available text once
- backend: Add stale job protection and network error retry logic in JobProcessor
- frontend: Add driveWebViewLink to Document model and link icon in DocumentCard
@namtroi namtroi merged commit ffc34aa into main Dec 26, 2025
6 of 7 checks passed
@qodo-code-review

Copy link
Copy Markdown

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Unsafe URL injection

Description: Untrusted document.driveWebViewLink is rendered directly into an without
scheme/host allowlisting, enabling javascript:/data: URL injection (client-side XSS) or
phishing via attacker-controlled links if this field can be influenced by untrusted input.

document-card.tsx [82-97]

Referred Code
{document.sourceType === 'DRIVE' && (
  <span className="flex items-center gap-1 text-blue-600 bg-blue-50 px-1.5 py-0.5 rounded">
    <FolderSync className="w-3 h-3" />
    Drive
    {document.driveWebViewLink && (
      <a
        href={document.driveWebViewLink}
        target="_blank"
        rel="noopener noreferrer"
        onClick={(e) => e.stopPropagation()}
        className="ml-1 hover:text-blue-800 transition-colors"
        title="Open in Google Drive"
      >
        <ExternalLink className="w-3 h-3" />
      </a>
    )}
Unsafe Markdown links

Description: The converter emits Markdown links using raw HTML href values ({content})
without validating schemes, so malicious javascript:/data: URLs from input HTML could
persist into produced Markdown and become an XSS/phishing vector if the Markdown is later
rendered to users.
html_converter.py [190-193]

Referred Code
elif element.name == "a":
    href = element.get("href", "")
    return f"[{content}]({href})" if href else content
Unvalidated href in conversion

Description: The EPUB HTML-to-Markdown conversion preserves into Markdown links without scheme
validation, allowing malicious javascript:/data: URLs from EPUB content to propagate into
stored Markdown and potentially trigger XSS/phishing if rendered downstream.
epub_converter.py [176-179]

Referred Code
elif element.name == "a":
    href = element.get("href", "")
    return f"[{content}]({href})" if href else content
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Unsafe error inspection: The new network/permanent error classification assumes err.message is always present and
will throw if err is not an Error, potentially masking the original failure and breaking
retry/failed-state handling.

Referred Code
const isNetworkError =
  err.message.includes('fetch failed') ||
  err.message.includes('ECONNREFUSED') ||
  err.message.includes('ETIMEDOUT') ||
  err.message.includes('AI worker failed: 5'); // 5xx server errors

const isPermanent = PERMANENT_ERRORS.some(code =>
  err.message.includes(code)
);

// Network errors should keep retrying until exhausted, not fail immediately
const shouldMarkFailed =
  (isPermanent && !isNetworkError) ||
  err instanceof UnrecoverableError ||
  job.attemptsMade >= (job.opts.attempts ?? 3);

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Unstructured error log: The new CSV error logging uses an interpolated string (logger.error(f"..."))
rather than structured logging, reducing auditability and increasing the risk of logging
sensitive exception content without fields-based controls.

Referred Code
logger.error(f"Error converting CSV {file_path}: {e}")
return ProcessorOutput(markdown="", metadata={"error": str(e)})

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unvalidated external URL: The new driveWebViewLink is rendered directly into an <a href=...> without
validating scheme/host (e.g., ensuring https://drive.google.com/...), enabling potential
malicious URLs (including javascript:) if the backend value is compromised.

Referred Code
{document.driveWebViewLink && (
  <a
    href={document.driveWebViewLink}
    target="_blank"
    rel="noopener noreferrer"
    onClick={(e) => e.stopPropagation()}
    className="ml-1 hover:text-blue-800 transition-colors"
    title="Open in Google Drive"
  >
    <ExternalLink className="w-3 h-3" />
  </a>

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing user context: New job lifecycle logs (e.g., job_skipped_already_complete) include documentId but no user
identifier, so it is unclear whether this processing is attributable to a user or is a
system-only action.

Referred Code
// Part 3: Skip already-completed documents (stale job protection)
const existing = await prisma.document.findUnique({
  where: { id: job.data.documentId },
  select: { status: true, processedContent: true }
});

if (existing?.status === 'COMPLETED' && existing?.processedContent) {
  logger.info({ documentId: job.data.documentId }, 'job_skipped_already_complete');
  return { skipped: true, documentId: job.data.documentId };
}

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Sensitive error content: The new failReason persisted to the database is derived directly from err.message (which
may include upstream response bodies), and it is unclear from the diff whether failReason
is user-visible and therefore could leak internal details.

Referred Code
// Classify error for better debugging
let failReason = err.message.slice(0, 500);
if (isNetworkError && job.attemptsMade >= (job.opts.attempts ?? 3)) {
  failReason = `AI_WORKER_UNREACHABLE: ${failReason}`;
}

await prisma.document.update({
  where: { id: job.data.documentId },
  data: {
    status: 'FAILED',
    failReason,
    retryCount: job.attemptsMade,
  },
});

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review

Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Restore default slide marker

Restore the slide marker in PresentationChunker to its original value of "". The current empty string value will cause a runtime error during
chunking.

apps/ai-worker/src/chunkers/presentation_chunker.py [19]

-self.marker = ""
+self.marker = "<!-- slide -->"
  • Apply / Chat
Suggestion importance[1-10]: 9

__

Why: This suggestion correctly identifies a critical bug where setting the slide marker to "" would cause the split operation to raise a ValueError, breaking the presentation chunking logic. Restoring the original marker is the correct fix.

High
Prevent applying multiple formatting rules

Modify _format_smart_value to return immediately after applying a formatting
rule. This prevents a value from being formatted multiple times, such as being
formatted as currency and then incorrectly re-formatted as a large number.

apps/ai-worker/src/converters/csv_converter.py [119-174]

 def _format_smart_value(self, header: str, value: str) -> str:
     """
     Phase 4: Smart Number Formatting.
     Applies semantic formatting based on header keywords and value type.
     """
     header_lower = header.lower()
 
     # 1. Clean strings
     value = value.strip()
     if not value:
         return ""
 
     # 2. Try to parse number
     try:
         # Remove existing commas if present to check numeric validity
         clean_val = value.replace(",", "")
         float_val = float(clean_val)
         is_number = True
     except ValueError:
         is_number = False
 
     if is_number:
         # Currency Detection
         if any(
             k in header_lower
             for k in ["revenue", "price", "cost", "salary", "usd", "amount"]
         ):
             # If not already formatted with currency symbol
             if not value.startswith("$") and not value.startswith("€"):
                 return f"${float_val:,.2f}"
 
         # Percentage Detection
         if "rate" in header_lower or "percent" in header_lower:
-            if float_val < 1.0:  # Assumption: 0.45 -> 45%
+            if float_val >= 0 and float_val <= 1:  # Assumption: 0.45 -> 45%
                 return f"{float_val:.1%}"
             return f"{float_val}%"
 
         # Large Numbers (Thousands separator)
         if float_val > 999:
             # Check if it's an ID (usually doesn't need commas)
             if (
                 "id" not in header_lower
                 and "code" not in header_lower
                 and "year" not in header_lower
             ):
-                return (
-                    f"{float_val:,.0f}"
-                    if float_val.is_integer()
-                    else f"{float_val:,.2f}"
-                )
+                if float_val.is_integer():
+                    return f"{float_val:,.0f}"
+                else:
+                    return f"{float_val:,.2f}"
 
     # Date Detection (basic keywords)
     # Note: Extensive date parsing is complex; assuming standard ISO or simple formats.
     # If specific Excel serial date conversion is needed, it would go here.
 
     return value
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: This suggestion correctly identifies a significant bug where multiple, mutually exclusive formatting rules could be applied to the same value, leading to incorrect output. The fix is accurate and critical for the feature's correctness.

Medium
High-level
Extract duplicated HTML parsing logic

The custom HTML-to-Markdown parsing logic is duplicated across epub_converter.py
and html_converter.py. This logic should be extracted into a single shared
utility to improve maintainability.

Examples:

apps/ai-worker/src/converters/epub_converter.py [103-192]
    def _html_to_markdown(self, html_content: str) -> str:
        """
        Parses HTML and converts semantic tags to Markdown.
        We do this manually to avoid extra dependencies (like markdownify)
        and to strictly control the output for RAG optimization.
        """
        soup = BeautifulSoup(html_content, "lxml")

        # Remove script and style elements
        for script in soup(["script", "style", "meta", "link", "noscript"]):

 ... (clipped 80 lines)
apps/ai-worker/src/converters/html_converter.py [112-204]
    def _html_to_markdown(self, root_element: Tag) -> str:
        """
        Recursively converts HTML to Markdown.
        Ensures consistent handling of headers and semantics for RAG.
        """

        def process_element(element):
            if isinstance(element, NavigableString):
                text = str(element).strip()
                if text:

 ... (clipped 83 lines)

Solution Walkthrough:

Before:

# In epub_converter.py
class EpubConverter(FormatConverter):
    def _html_to_markdown(self, html_content: str) -> str:
        def process_element(element):
            # ... complex recursive logic ...
            if element.name == "h1":
                return f"\n\n# {content}\n\n"
            # ...
        return process_element(root)

# In html_converter.py
class HtmlConverter(FormatConverter):
    def _html_to_markdown(self, root_element: Tag) -> str:
        def process_element(element):
            # ... nearly identical complex recursive logic ...
            if element.name == "h1":
                return f"\n\n# {content}\n\n"
            # ...
        return process_element(root_element)

After:

# In a new file, e.g., converters/utils/html_parser.py
def html_to_markdown_util(root_element: Tag) -> str:
    def process_element(element):
        # ... complex recursive logic in one place ...
        if element.name == "h1":
            return f"\n\n# {content}\n\n"
        # ...
    return process_element(root_element)

# In epub_converter.py and html_converter.py
from .utils.html_parser import html_to_markdown_util

class EpubConverter(FormatConverter):
    def _html_to_markdown(self, html_content: str) -> str:
        soup = BeautifulSoup(html_content, "lxml")
        return html_to_markdown_util(soup.body or soup)

class HtmlConverter(FormatConverter):
    def _html_to_markdown(self, root_element: Tag) -> str:
        return html_to_markdown_util(root_element)
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies significant duplication of complex, recursive HTML parsing logic introduced in epub_converter.py and html_converter.py, and centralizing it would greatly improve code maintainability.

Medium
  • More

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant