fix: Improve PDF parser robustness and efficiency #138

Dhruv-Sharma01 · 2025-10-10T13:16:36Z

Closes #137

Changes Implemented
This PR introduces a more robust, two-stage parsing architecture. Below are the specific changes made to each file.

In pymupdf_rag.py (PDF-to-Markdown Conversion):

Replaced IdentifyHeaders class with AdvancedHeaderIdentifier: The new class uses a multi-heuristic scoring system (boldness, all-caps, relative font size) to detect section headers, making it more resilient to different resume styles.
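For readers who have not opened the diff, a multi-heuristic score of this kind could look roughly like the sketch below. It does not reproduce AdvancedHeaderIdentifier: the `Span` type, the `header_score` and `is_header` names, the 1.15 size ratio, and the threshold of 2 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for one extracted text span."""
    text: str
    font_size: float
    is_bold: bool


def header_score(span: Span, body_font_size: float) -> int:
    """Add one point per heuristic that fires (bold, all-caps, larger font)."""
    score = 0
    if span.is_bold:
        score += 1
    letters = [c for c in span.text if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        score += 1
    # Assumed ratio: treat text at least 15% larger than body text as "large".
    if span.font_size >= body_font_size * 1.15:
        score += 1
    return score


def is_header(span: Span, body_font_size: float, threshold: int = 2) -> bool:
    """Treat a span as a header when enough heuristics agree."""
    return header_score(span, body_font_size) >= threshold
```

Requiring two signals rather than one means a bold phrase inside body text is not promoted to a header on boldness alone.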

Enforced Linear Reading Order: Added a sort for the text_rects list (text_rects.sort(key=lambda r: (r.y0, r.x0))) immediately after text blocks are identified. This ensures a strict top-to-bottom, left-to-right processing flow, fixing parsing errors on multi-column layouts.
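The effect of that sort key can be seen on a few rectangles. This sketch uses a hypothetical `Rect` tuple instead of the PyMuPDF rectangle type so that it runs without the library; only the sort key is taken from the PR description.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Hypothetical rectangle with the same coordinate names as the sort key."""
    x0: float
    y0: float
    x1: float
    y1: float


def sort_reading_order(rects: list[Rect]) -> list[Rect]:
    """Order rectangles by top edge first, then by left edge."""
    return sorted(rects, key=lambda r: (r.y0, r.x0))


text_rects = [
    Rect(300, 50, 550, 100),  # top right
    Rect(50, 200, 280, 260),  # lower left
    Rect(50, 50, 280, 100),   # top left
]
ordered = sort_reading_order(text_rects)
```

After sorting, the top-left block comes first, then the top-right block, then the lower-left block.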

In pdf_handler.py (Section Extraction):

Added _split_markdown_by_headers method: This new helper function provides a deterministic way to pre-process the Markdown text, splitting it into a dictionary of sections based on ## headers before any LLM calls are made.
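A minimal sketch of this kind of deterministic split is shown below. It is not the code from the PR: the standalone function name, the "preamble" key for text before the first header, and the choice to drop empty sections are assumptions.

```python
import re

# Matches level-2 headers only ("## Title"), not "#" or "###".
HEADER_RE = re.compile(r"^##\s+(.*\S)\s*$")


def split_markdown_by_headers(markdown: str) -> dict[str, str]:
    """Split Markdown into {header title: section body} on '## ' lines."""
    sections: dict[str, str] = {}
    current = "preamble"  # assumed key for text before the first header
    buffer: list[str] = []
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            # Close the section collected so far and start a new one.
            sections[current] = "\n".join(buffer).strip()
            current = match.group(1)
            buffer = []
        else:
            buffer.append(line)
    sections[current] = "\n".join(buffer).strip()
    # Drop sections with no body text.
    return {name: body for name, body in sections.items() if body}
```

In this sketch a repeated header title overwrites the earlier section with the same name; merging the bodies instead would be a reasonable alternative.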

Refactored _extract_all_sections_separately method: The original logic, which made multiple full-document LLM calls, has been replaced. The new implementation first uses _split_markdown_by_headers to get structured data and then sends only the small, relevant text chunk for each section to the LLM for analysis. Each call now carries one section instead of the whole document, which improves efficiency and reliability.
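The per-section flow could be sketched as follows. The `extract_sections` name, the prompt wording, and the injected `call_llm` callable are hypothetical; the actual client and prompts in pdf_handler.py are not shown in this thread.

```python
from typing import Callable


def extract_sections(
    sections: dict[str, str],
    wanted: list[str],
    call_llm: Callable[[str], str],
) -> dict[str, str]:
    """Send each wanted section's text to the LLM on its own."""
    results: dict[str, str] = {}
    for name in wanted:
        chunk = sections.get(name)
        if not chunk:
            # Section missing from the document: skip the LLM call entirely.
            results[name] = ""
            continue
        # Only this section's text goes into the prompt, not the full document.
        prompt = f"Extract structured details from this resume section ({name}):\n\n{chunk}"
        results[name] = call_llm(prompt)
    return results
```

Passing `call_llm` in as a parameter keeps the sketch independent of any particular LLM client and makes it easy to exercise with a stub.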

Mohd-Mursaleen · 2025-10-10T13:41:50Z

You messed up the formatting; use Black.

Dhruv-Sharma01 · 2025-10-10T14:22:16Z

Check it.

Mohd-Mursaleen · 2025-10-10T14:56:46Z

> Check it.

👍🏻

fix: Improve PDF parser robustness and efficiency

bfe2fcb

Dhruv-Sharma01 force-pushed the main branch from 34b501c to bfe2fcb on October 10, 2025 14:19
