Pdf/optimize#68
…ters
- Implement junk code block removal in PyMuPDF converter
- Remove charStart and charEnd fields from AI worker, backend, and frontend
- Migrate analytics format filters and update UI components
- Update tests to reflect schema and filter changes

- Added merge_soft_linebreaks to normalizer for better PDF extraction
- Updated default chunking and quality parameters in prisma schema
- Added Prisma P2025 error handling in job processor
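For reviewers unfamiliar with soft linebreak merging, here is a minimal sketch of the idea. It is not the code from this PR: the function body, the regular expressions, and the rules for when two lines are joined are all assumptions made for illustration. Only the name `merge_soft_linebreaks` comes from the commit message.

```python
import re

# Assumed heuristics; the real normalizer may use different rules.
_SENTENCE_END = re.compile(r"[.!?:;][')\]]*$")                      # line ends a sentence
_BLOCK_START = re.compile(r"^\s*(#{1,6}\s|[-*+]\s|\d+[.)]\s|\||>)")  # heading, list, table, quote
_HEADING = re.compile(r"^\s*#{1,6}\s")


def merge_soft_linebreaks(text: str) -> str:
    """Join lines that look like one sentence wrapped by the PDF layout."""
    out: list[str] = []
    in_fence = False
    for line in text.split("\n"):
        is_fence = line.strip().startswith("```")
        prev = out[-1] if out else ""
        mergeable = (
            bool(out)
            and not in_fence                              # never touch code block bodies
            and not is_fence
            and bool(prev.strip())                        # blank lines separate paragraphs
            and bool(line.strip())
            and not prev.strip().startswith("```")
            and not _HEADING.match(prev)                  # a heading does not absorb text
            and not _SENTENCE_END.search(prev.rstrip())   # previous line is unfinished
            and not _BLOCK_START.match(line)              # next line starts a new block
        )
        if mergeable:
            out[-1] = prev.rstrip() + " " + line.strip()
        else:
            out.append(line)
        if is_fence:
            in_fence = not in_fence
    return "\n".join(out)
```

In this sketch a line is merged into the previous one only when the previous line does not end a sentence and the current line does not open a new Markdown block, so headings, list items, and fenced code are left alone.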
PR Type
Enhancement, Bug fix
Description
Remove charStart/charEnd fields from schema and all API endpoints
Implement PDF artifact cleaning: page numbers and junk code blocks
Add soft linebreak merging for better PDF text extraction
Enhance chunk quality scoring with multi-factor weighted calculation
Add chunk sorting and merging capabilities for improved chunking
Update default chunking parameters and add Prisma P2025 error handling
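As an illustration of what a multi-factor weighted quality score can look like, here is a small sketch. The factor names, the weights, and the `target_tokens` default are hypothetical; the description does not list the factors or weights this PR actually uses.

```python
# Hypothetical factors and weights (they sum to 1.0); not taken from the PR.
WEIGHTS = {"length": 0.4, "alpha_ratio": 0.3, "sentence_end": 0.3}


def quality_score(text: str, token_count: int, target_tokens: int = 512) -> float:
    """Combine several 0..1 factors into one weighted score in the range 0..1."""
    if not text.strip() or token_count <= 0:
        return 0.0
    factors = {
        # Chunks close to the target size score higher, capped at 1.0.
        "length": min(token_count / target_tokens, 1.0),
        # Share of letters and whitespace; symbol-heavy junk scores low.
        "alpha_ratio": sum(c.isalpha() or c.isspace() for c in text) / len(text),
        # Reward chunks that end on a complete sentence.
        "sentence_end": 1.0 if text.rstrip()[-1] in ".!?" else 0.0,
    }
    score = sum(WEIGHTS[name] * value for name, value in factors.items())
    return round(score, 4)
```

Because each factor is normalized to the 0..1 range before weighting, the combined score stays comparable across chunks, which is what makes sorting by qualityScore meaningful.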
File Walkthrough
1 file
- Remove charStart and charEnd fields from Chunk model

1 file
- Add P2025 error handling for deleted documents

9 files
- Remove charStart/charEnd from chunk selection and response
- Remove charStart/charEnd from chunk metadata
- Remove charStart/charEnd from chunk insertion and error message
- Remove charStart/charEnd from search results
- Remove charStart/charEnd from search result interfaces
- Remove charStart/charEnd from chunk metadata schema
- Remove charStart/charEnd from QueryResult and ChunkDetail interfaces
- Remove charStart/charEnd from chunk metadata
- Remove charStart/charEnd position tracking from chunks

11 files
- Add sorting by index/tokenCount/qualityScore and remove char fields
- Remove character position display from chunk footer
- Add sorting options for tokenCount and qualityScore
- Implement sort field and order parsing for chunk queries
- Remove charStart/charEnd tracking and add configurable header levels
- Add PDF-specific post-processing methods for artifact removal
- Use PDF-specific post-processing with artifact cleaning
- Add hidden link stripping and PyMuPDF-specific post-processing
- Add page artifact removal, junk code block removal, and soft linebreak merging
- Add chunk merging algorithm and multi-factor quality scoring integration
- Implement multi-factor weighted quality scoring system

17 files
- Update test payloads to remove charStart/charEnd metadata
- Remove charStart/charEnd from chunk insertion in tests
- Remove charStart/charEnd from seedChunk helper
- Remove charStart/charEnd from test chunk inserts and fix whitespace
- Remove charStart/charEnd from callback test payload
- Remove charStart/charEnd from test callback metadata
- Remove charStart/charEnd from chunk seeding and fix formatting
- Remove charStart/charEnd from chunk insertion test
- Remove charStart/charEnd from all chunk insertions and fix whitespace
- Remove charStart/charEnd from seedChunk calls
- Remove charStart/charEnd from default callback metadata
- Remove charStart/charEnd from mock search results
- Remove charStart/charEnd from test callback payload
- Remove charStart/charEnd from test callback metadata
- Add comprehensive tests for page artifacts, code blocks, and linebreak merging
- Update quality analyzer test for multi-factor scoring calculation
- Update quality score tests for multi-factor weighted system
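
The page artifact removal mentioned in the walkthrough can be pictured with a sketch like the one below. The function name `remove_page_artifacts` and both patterns are hypothetical and only cover standalone page-number lines; the converter in this PR may detect artifacts differently.

```python
import re

# Assumed patterns for lines that contain nothing but a page number.
_PAGE_NUMBER = re.compile(
    r"^\s*(?:page\s+)?\d{1,4}(?:\s*(?:/|of)\s*\d{1,4})?\s*$", re.IGNORECASE
)  # "12", "Page 3", "3/10", "Page 3 of 10"
_DASHED_NUMBER = re.compile(r"^\s*-\s*\d{1,4}\s*-\s*$")  # "- 4 -"


def remove_page_artifacts(text: str) -> str:
    """Drop lines that consist solely of a page number."""
    kept = [
        line
        for line in text.split("\n")
        if not (_PAGE_NUMBER.match(line) or _DASHED_NUMBER.match(line))
    ]
    return "\n".join(kept)
```

A line-level filter like this also drops a legitimate line that holds only a number, so a real implementation would likely add context checks, for example only matching near page boundaries.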