Skip to content

Pdf/optimize#68

Merged
namtroi merged 3 commits into
mainfrom
pdf/optimize
Dec 29, 2025
Merged

Pdf/optimize#68
namtroi merged 3 commits into
mainfrom
pdf/optimize

Conversation

@namtroi

@namtroi namtroi commented Dec 29, 2025

Copy link
Copy Markdown
Owner

PR Type

Enhancement, Bug fix


Description

  • Remove charStart/charEnd fields from schema and all API endpoints

  • Implement PDF artifact cleaning: page numbers and junk code blocks

  • Add soft linebreak merging for better PDF text extraction

  • Enhance chunk quality scoring with multi-factor weighted calculation

  • Add chunk sorting and merging capabilities for improved chunking

  • Update default chunking parameters and add Prisma P2025 error handling


Diagram Walkthrough

flowchart LR
  A["Schema Changes"] -->|Remove charStart/charEnd| B["Database & APIs"]
  C["PDF Processing"] -->|Page artifacts + junk code| D["Normalizer"]
  E["Soft Linebreaks"] -->|Merge PDF breaks| D
  F["Quality Scoring"] -->|Multi-factor weights| G["Analyzer"]
  H["Chunk Merging"] -->|Optimize small chunks| I["Pipeline"]
  B --> J["Frontend & Backend"]
  D --> K["Better Extraction"]
  G --> L["Improved Metrics"]
  I --> L
Loading

File Walkthrough

Relevant files
Schema changes
1 files
schema.prisma
Remove charStart and charEnd fields from Chunk model         
+4/-5     
Error handling
1 files
job-processor.ts
Add P2025 error handling for deleted documents                     
+5/-0     
Bug fix
9 files
documents-route.ts
Remove charStart/charEnd from chunk selection and response
+0/-4     
content-route.ts
Remove charStart/charEnd from chunk metadata                         
+0/-4     
callback-route.ts
Remove charStart/charEnd from chunk insertion and error message
+2/-4     
search-route.ts
Remove charStart/charEnd from search results                         
+0/-6     
hybrid-search.ts
Remove charStart/charEnd from search result interfaces     
+0/-10   
callback-validator.ts
Remove charStart/charEnd from chunk metadata schema           
+1/-2     
endpoints.ts
Remove charStart/charEnd from QueryResult and ChunkDetail interfaces
+4/-6     
presentation_chunker.py
Remove charStart/charEnd from chunk metadata                         
+4/-7     
tabular_chunker.py
Remove charStart/charEnd position tracking from chunks     
+1/-12   
Enhancement
11 files
chunks-route.ts
Add sorting by index/tokenCount/qualityScore and remove char fields
+18/-13 
ChunkCard.tsx
Remove character position display from chunk footer           
+0/-3     
ChunkFilters.tsx
Add sorting options for tokenCount and qualityScore           
+18/-0   
ChunksExplorerPage.tsx
Implement sort field and order parsing for chunk queries 
+6/-0     
document_chunker.py
Remove charStart/charEnd tracking and add configurable header levels
+35/-36 
base.py
Add PDF-specific post-processing methods for artifact removal
+20/-0   
pdf_converter.py
Use PDF-specific post-processing with artifact cleaning   
+1/-1     
pymupdf_converter.py
Add hidden link stripping and PyMuPDF-specific post-processing
+15/-1   
normalizer.py
Add page artifact removal, junk code block removal, and soft linebreak
merging
+163/-0 
pipeline.py
Add chunk merging algorithm and multi-factor quality scoring
integration
+201/-0 
analyzer.py
Implement multi-factor weighted quality scoring system     
+38/-7   
Tests
17 files
error-handling.test.ts
Update test payloads to remove charStart/charEnd metadata
+2/-2     
query-flow.test.ts
Remove charStart/charEnd from chunk insertion in tests     
+6/-8     
database.ts
Remove charStart/charEnd from seedChunk helper                     
+2/-8     
analytics-api.test.ts
Remove charStart/charEnd from test chunk inserts and fix whitespace
+44/-44 
analytics-e2e.test.ts
Remove charStart/charEnd from callback test payload           
+1/-1     
callback-route.test.ts
Remove charStart/charEnd from test callback metadata         
+1/-1     
chunks-api.test.ts
Remove charStart/charEnd from chunk seeding and fix formatting
+6/-6     
content-route.test.ts
Remove charStart/charEnd from chunk insertion test             
+2/-2     
search-route.test.ts
Remove charStart/charEnd from all chunk insertions and fix whitespace
+37/-37 
status-route.test.ts
Remove charStart/charEnd from seedChunk calls                       
+2/-2     
python-worker-mock.ts
Remove charStart/charEnd from default callback metadata   
+1/-1     
hybrid-search.test.ts
Remove charStart/charEnd from mock search results               
+0/-2     
callback-validator.test.ts
Remove charStart/charEnd from test callback payload           
+1/-1     
test_callback.py
Remove charStart/charEnd from test callback metadata         
+1/-1     
test_normalizer.py
Add comprehensive tests for page artifacts, code blocks, and linebreak
merging
+231/-0 
test_profile_config.py
Update quality analyzer test for multi-factor scoring calculation
+8/-3     
test_quality_analyzer.py
Update quality score tests for multi-factor weighted system
+24/-6   

…ters

- Implement junk code block removal in PyMuPDF converter
- Remove charStart and charEnd fields from AI worker, backend, and frontend
- Migrate analytics format filters and update UI components
- Update tests to reflect schema and filter changes
- Added merge_soft_linebreaks to normalizer for better PDF extraction
- Updated default chunking and quality parameters in prisma schema
- Added Prisma P2025 error handling in job processor
@namtroi namtroi merged commit c61922d into main Dec 29, 2025
6 of 7 checks passed
@qodo-code-review

Copy link
Copy Markdown

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Regex DoS risk

Description: The newly added PDF-cleanup routines (remove_page_artifacts, remove_junk_code_blocks, and
merge_soft_linebreaks) apply multiple regex-based transformations and paragraph/line
splitting over potentially untrusted, attacker-controlled PDF-derived text, which could
enable resource-exhaustion (CPU/memory) DoS if very large or adversarially structured
inputs are processed without size/time limits.
normalizer.py [122-275]

Referred Code
def remove_page_artifacts(self, markdown: str) -> str:
    """
    Remove standalone page numbers from markdown (PDF-specific).
    Uses paragraph-level analysis with strict patterns.
    """
    if not markdown:
        return ""

    paragraphs = re.split(r"\n\s*\n", markdown)
    result = []

    for para in paragraphs:
        stripped = para.strip()

        # Keep empty paragraphs (preserve spacing)
        if not stripped:
            result.append(para)
            continue

        # Keep multi-line paragraphs (page nums are single-line)
        if "\n" in stripped:


 ... (clipped 133 lines)
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Error detail exposure: The PR stores raw err.message into failReason, which may later be shown externally and
could leak internal processing/system details depending on how failReason is surfaced.

Referred Code
where: { id: documentId },
data: {
  status: 'FAILED',
  failReason: `PROCESSING_ERROR: ${err.message} `,
},

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review

Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
High-level
Simplify the complex chunk merging logic

The merge_small_chunks function in pipeline.py is overly complex and stateful,
making it difficult to maintain. It should be refactored into a simpler, more
robust algorithm, like a two-pass approach, to improve stability.

Examples:

apps/ai-worker/src/pipeline.py [64-233]
    def merge_small_chunks(
        self, chunks: List[Dict[str, Any]], min_chars: int, max_chars: int
    ) -> List[Dict[str, Any]]:
        """
        Optimally merge small chunks using greedy accumulator algorithm.

        Goals:
        - Maximize number of normal-sized chunks (>= min_chars)
        - Never exceed max_chars
        - Preserve document order

 ... (clipped 160 lines)

Solution Walkthrough:

Before:

def merge_small_chunks(chunks, min_chars, max_chars):
    result = []
    accumulator = ""
    
    for chunk in chunks:
        if len(chunk.content) >= min_chars:
            # Big chunk arrives
            if accumulator:
                # Complex logic to flush accumulator, which might...
                # 1. Prepend to current chunk
                # 2. Append to PREVIOUS chunk in result (result[-1])
                # 3. Become its own chunk
                flush_accumulator_and_maybe_modify_previous_results(...)
            result.append(chunk)
            accumulator = ""
        else:
            # Small chunk, add to accumulator with more complex logic
            ...
    
    # Final complex flush of any remaining accumulator content
    ...
    return result

After:

def merge_small_chunks(chunks, min_chars, max_chars):
    # A simpler, more predictable approach
    if not chunks:
        return []

    merged_chunks = []
    current_content = ""
    
    for chunk in chunks:
        # Always try to add the next chunk
        next_content = strip_prefix(chunk.content)
        if not current_content or len(current_content) + len(next_content) <= max_chars:
            current_content += "\n\n" + next_content
        else:
            # Current accumulator is full, finalize it and start a new one
            merged_chunks.append({"content": current_content, ...})
            current_content = next_content
    
    # Add the last accumulated chunk
    if current_content:
        merged_chunks.append({"content": current_content, ...})
        
    # Optional second pass to handle any remaining undersized chunks
    return final_pass(merged_chunks, min_chars)
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a significant design issue in the new merge_small_chunks function, which is overly complex, stateful, and modifies previously processed items, making it hard to maintain and debug.

Medium
Possible issue
Ensure job fails if status update fails

In the job processor's error handler, re-throw the error after logging a failed
job status update. This ensures the job is correctly marked as failed in the
queue, preventing data inconsistencies.

apps/backend/src/queue/job-processor.ts [122-129]

 } catch (updateError) {
   // P2025: Document already deleted, ignore
   if ((updateError as any)?.code === 'P2025') {
     logger.info({ jobId: job.id }, 'job_status_update_skipped_doc_deleted');
     return;
   }
   logger.error({ err: updateError, jobId: job.id }, 'job_status_update_failed');
+  // Re-throw to ensure the job is marked as failed in the queue
+  throw updateError;
 }
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: This suggestion correctly identifies a critical issue where a failed database update for a job's status is only logged, not propagated, causing the job to be considered successful by the queue. Re-throwing the error ensures the job fails correctly, which is crucial for system consistency and reliability.

Medium
General
Support index sorting

In the buildOrderBy function, add a condition to handle sortBy: 'index' to allow
sorting chunks by their index.

apps/backend/src/routes/chunks/chunks-route.ts [20-29]

 function buildOrderBy(sortBy?: SortByField, sortOrder?: SortOrder) {
+  if (sortBy === 'index') {
+    return [{ chunkIndex: sortOrder || 'asc' }];
+  }
   if (sortBy === 'tokenCount') {
     return [{ tokenCount: sortOrder || 'desc' }];
   }
   if (sortBy === 'qualityScore') {
     return [{ qualityScore: sortOrder || 'desc' }];
   }
   // Default: by document then index
   return [{ documentId: 'asc' as const }, { chunkIndex: 'asc' as const }];
 }
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: This suggestion correctly points out that sorting by index was not explicitly handled, causing it to fall back to the default sort. Adding a specific case for index sorting implements the feature as intended by the schema and improves the API's functionality.

Medium
  • More

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant