Pdf/optimize by namtroi · Pull Request #68 · namtroi/RAGBase

namtroi · 2025-12-29T18:43:02Z

PR Type

Enhancement, Bug fix

Description

Remove charStart/charEnd fields from schema and all API endpoints
Implement PDF artifact cleaning: page numbers and junk code blocks
Add soft linebreak merging for better PDF text extraction
Enhance chunk quality scoring with multi-factor weighted calculation
Add chunk sorting and merging capabilities for improved chunking
Update default chunking parameters and add Prisma P2025 error handling

Diagram Walkthrough

flowchart LR
  A["Schema Changes"] -->|Remove charStart/charEnd| B["Database & APIs"]
  C["PDF Processing"] -->|Page artifacts + junk code| D["Normalizer"]
  E["Soft Linebreaks"] -->|Merge PDF breaks| D
  F["Quality Scoring"] -->|Multi-factor weights| G["Analyzer"]
  H["Chunk Merging"] -->|Optimize small chunks| I["Pipeline"]
  B --> J["Frontend & Backend"]
  D --> K["Better Extraction"]
  G --> L["Improved Metrics"]
  I --> L

File Walkthrough

Relevant files

Schema changes

1 files

schema.prisma `Remove charStart and charEnd fields from Chunk model`	+4/-5

Error handling

1 files

job-processor.ts `Add P2025 error handling for deleted documents`	+5/-0

Bug fix

9 files

documents-route.ts `Remove charStart/charEnd from chunk selection and response`	+0/-4
content-route.ts `Remove charStart/charEnd from chunk metadata`	+0/-4
callback-route.ts `Remove charStart/charEnd from chunk insertion and error message`	+2/-4
search-route.ts `Remove charStart/charEnd from search results`	+0/-6
hybrid-search.ts `Remove charStart/charEnd from search result interfaces`	+0/-10
callback-validator.ts `Remove charStart/charEnd from chunk metadata schema`	+1/-2
endpoints.ts `Remove charStart/charEnd from QueryResult and ChunkDetail interfaces`	+4/-6
presentation_chunker.py `Remove charStart/charEnd from chunk metadata`	+4/-7
tabular_chunker.py `Remove charStart/charEnd position tracking from chunks`	+1/-12

Enhancement

11 files

chunks-route.ts `Add sorting by index/tokenCount/qualityScore and remove char fields`	+18/-13
ChunkCard.tsx `Remove character position display from chunk footer`	+0/-3
ChunkFilters.tsx `Add sorting options for tokenCount and qualityScore`	+18/-0
ChunksExplorerPage.tsx `Implement sort field and order parsing for chunk queries`	+6/-0
document_chunker.py `Remove charStart/charEnd tracking and add configurable header levels`	+35/-36
base.py `Add PDF-specific post-processing methods for artifact removal`	+20/-0
pdf_converter.py `Use PDF-specific post-processing with artifact cleaning`	+1/-1
pymupdf_converter.py `Add hidden link stripping and PyMuPDF-specific post-processing`	+15/-1
normalizer.py `Add page artifact removal, junk code block removal, and soft linebreak` `merging`	+163/-0
pipeline.py `Add chunk merging algorithm and multi-factor quality scoring` `integration`	+201/-0
analyzer.py `Implement multi-factor weighted quality scoring system`	+38/-7

Tests

17 files

error-handling.test.ts `Update test payloads to remove charStart/charEnd metadata`	+2/-2
query-flow.test.ts `Remove charStart/charEnd from chunk insertion in tests`	+6/-8
database.ts `Remove charStart/charEnd from seedChunk helper`	+2/-8
analytics-api.test.ts `Remove charStart/charEnd from test chunk inserts and fix whitespace`	+44/-44
analytics-e2e.test.ts `Remove charStart/charEnd from callback test payload`	+1/-1
callback-route.test.ts `Remove charStart/charEnd from test callback metadata`	+1/-1
chunks-api.test.ts `Remove charStart/charEnd from chunk seeding and fix formatting`	+6/-6
content-route.test.ts `Remove charStart/charEnd from chunk insertion test`	+2/-2
search-route.test.ts `Remove charStart/charEnd from all chunk insertions and fix whitespace`	+37/-37
status-route.test.ts `Remove charStart/charEnd from seedChunk calls`	+2/-2
python-worker-mock.ts `Remove charStart/charEnd from default callback metadata`	+1/-1
hybrid-search.test.ts `Remove charStart/charEnd from mock search results`	+0/-2
callback-validator.test.ts `Remove charStart/charEnd from test callback payload`	+1/-1
test_callback.py `Remove charStart/charEnd from test callback metadata`	+1/-1
test_normalizer.py `Add comprehensive tests for page artifacts, code blocks, and linebreak` `merging`	+231/-0
test_profile_config.py `Update quality analyzer test for multi-factor scoring calculation`	+8/-3
test_quality_analyzer.py `Update quality score tests for multi-factor weighted system`	+24/-6

…nk code blocks)

…ters - Implement junk code block removal in PyMuPDF converter - Remove charStart and charEnd fields from AI worker, backend, and frontend - Migrate analytics format filters and update UI components - Update tests to reflect schema and filter changes

- Added merge_soft_linebreaks to normalizer for better PDF extraction - Updated default chunking and quality parameters in prisma schema - Added Prisma P2025 error handling in job processor

qodo-code-review · 2025-12-29T18:43:53Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
⚪	Regex DoS risk Description: The newly added PDF-cleanup routines (`remove_page_artifacts`, `remove_junk_code_blocks`, and `merge_soft_linebreaks`) apply multiple regex-based transformations and paragraph/line splitting over potentially untrusted, attacker-controlled PDF-derived text, which could enable resource-exhaustion (CPU/memory) DoS if very large or adversarially structured inputs are processed without size/time limits. normalizer.py [122-275] Referred Code def remove_page_artifacts(self, markdown: str) -> str: """ Remove standalone page numbers from markdown (PDF-specific). Uses paragraph-level analysis with strict patterns. """ if not markdown: return "" paragraphs = re.split(r"\n\s*\n", markdown) result = [] for para in paragraphs: stripped = para.strip() # Keep empty paragraphs (preserve spacing) if not stripped: result.append(para) continue # Keep multi-line paragraphs (page nums are single-line) if "\n" in stripped: ... (clipped 133 lines)
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Robust Error Handling and Edge Case Management Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Logging Practices Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Security-First Input Validation and Data Handling Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
⚪
	Generic: Secure Error Handling Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Error detail exposure: The PR stores raw `err.message` into `failReason`, which may later be shown externally and could leak internal processing/system details depending on how `failReason` is surfaced. Referred Code where: { id: documentId }, data: { status: 'FAILED', failReason: `PROCESSING_ERROR: ${err.message} `, }, Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend

🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

qodo-code-review · 2025-12-29T18:45:15Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
High-level	Simplify the complex chunk merging logic The `merge_small_chunks` function in `pipeline.py` is overly complex and stateful, making it difficult to maintain. It should be refactored into a simpler, more robust algorithm, like a two-pass approach, to improve stability. Examples: apps/ai-worker/src/pipeline.py [64-233] def merge_small_chunks( self, chunks: List[Dict[str, Any]], min_chars: int, max_chars: int ) -> List[Dict[str, Any]]: """ Optimally merge small chunks using greedy accumulator algorithm. Goals: - Maximize number of normal-sized chunks (>= min_chars) - Never exceed max_chars - Preserve document order ... (clipped 160 lines) Solution Walkthrough: Before: def merge_small_chunks(chunks, min_chars, max_chars): result = [] accumulator = "" for chunk in chunks: if len(chunk.content) >= min_chars: # Big chunk arrives if accumulator: # Complex logic to flush accumulator, which might... # 1. Prepend to current chunk # 2. Append to PREVIOUS chunk in result (result[-1]) # 3. Become its own chunk flush_accumulator_and_maybe_modify_previous_results(...) result.append(chunk) accumulator = "" else: # Small chunk, add to accumulator with more complex logic ... # Final complex flush of any remaining accumulator content ... return result After: def merge_small_chunks(chunks, min_chars, max_chars): # A simpler, more predictable approach if not chunks: return [] merged_chunks = [] current_content = "" for chunk in chunks: # Always try to add the next chunk next_content = strip_prefix(chunk.content) if not current_content or len(current_content) + len(next_content) <= max_chars: current_content += "\n\n" + next_content else: # Current accumulator is full, finalize it and start a new one merged_chunks.append({"content": current_content, ...}) current_content = next_content # Add the last accumulated chunk if current_content: merged_chunks.append({"content": current_content, ...}) # Optional second pass to handle any remaining undersized chunks return final_pass(merged_chunks, min_chars) Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies a significant design issue in the new `merge_small_chunks` function, which is overly complex, stateful, and modifies previously processed items, making it hard to maintain and debug.	Medium
Possible issue	Ensure job fails if status update fails In the job processor's error handler, re-throw the error after logging a failed job status update. This ensures the job is correctly marked as failed in the queue, preventing data inconsistencies. apps/backend/src/queue/job-processor.ts [122-129] } catch (updateError) { // P2025: Document already deleted, ignore if ((updateError as any)?.code === 'P2025') { logger.info({ jobId: job.id }, 'job_status_update_skipped_doc_deleted'); return; } logger.error({ err: updateError, jobId: job.id }, 'job_status_update_failed'); + // Re-throw to ensure the job is marked as failed in the queue + throw updateError; } Apply / Chat Suggestion importance[1-10]: 8 __ Why: This suggestion correctly identifies a critical issue where a failed database update for a job's status is only logged, not propagated, causing the job to be considered successful by the queue. Re-throwing the error ensures the job fails correctly, which is crucial for system consistency and reliability.	Medium
General	Support index sorting In the `buildOrderBy` function, add a condition to handle `sortBy: 'index'` to allow sorting chunks by their index. apps/backend/src/routes/chunks/chunks-route.ts [20-29] function buildOrderBy(sortBy?: SortByField, sortOrder?: SortOrder) { + if (sortBy === 'index') { + return [{ chunkIndex: sortOrder \|\| 'asc' }]; + } if (sortBy === 'tokenCount') { return [{ tokenCount: sortOrder \|\| 'desc' }]; } if (sortBy === 'qualityScore') { return [{ qualityScore: sortOrder \|\| 'desc' }]; } // Default: by document then index return [{ documentId: 'asc' as const }, { chunkIndex: 'asc' as const }]; } Apply / Chat Suggestion importance[1-10]: 7 __ Why: This suggestion correctly points out that sorting by `index` was not explicitly handled, causing it to fall back to the default sort. Adding a specific case for `index` sorting implements the feature as intended by the schema and improves the API's functionality.	Medium
More

namtroi added 3 commits December 28, 2025 18:57

feat(ai-worker): implement PDF artifact cleaning (page numbers and ju…

bceb5ad

…nk code blocks)

feat: refine PDF line break merging and update processing parameters

c470d76

- Added merge_soft_linebreaks to normalizer for better PDF extraction - Updated default chunking and quality parameters in prisma schema - Added Prisma P2025 error handling in job processor

namtroi merged commit c61922d into main Dec 29, 2025
6 of 7 checks passed

qodo-code-review Bot added the Review effort 4/5 label Dec 29, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pdf/optimize#68

Pdf/optimize#68
namtroi merged 3 commits into
mainfrom
pdf/optimize

namtroi commented Dec 29, 2025 •

edited by qodo-code-review Bot

Loading

Uh oh!

Uh oh!

qodo-code-review Bot commented Dec 29, 2025

Uh oh!

qodo-code-review Bot commented Dec 29, 2025

Examples:

Solution Walkthrough:

Before:

After:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

namtroi commented Dec 29, 2025 • edited by qodo-code-review Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Type

Description

Diagram Walkthrough

File Walkthrough

Uh oh!

Uh oh!

qodo-code-review Bot commented Dec 29, 2025

PR Compliance Guide 🔍

Uh oh!

qodo-code-review Bot commented Dec 29, 2025

PR Code Suggestions ✨

Examples:

Solution Walkthrough:

Before:

After:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

namtroi commented Dec 29, 2025 •

edited by qodo-code-review Bot

Loading