@Dhruv-Sharma01

Closes #137

Changes Implemented
This PR introduces a more robust, two-stage parsing architecture. Below are the specific changes made to each file.

  1. In pymupdf_rag.py (PDF-to-Markdown Conversion):

Replaced IdentifyHeaders class with AdvancedHeaderIdentifier: The new class uses a multi-heuristic scoring system (boldness, all-caps, relative font size) to detect section headers, making it more resilient to different resume styles.
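
A multi-heuristic score of this kind could look roughly like the sketch below. It is not the code from this PR: the `Span` type, the `header_score` and `is_header` names, the 1.15 size ratio, and the threshold of 2 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for one extracted text span."""
    text: str
    size: float   # font size in points
    bold: bool


def header_score(span: Span, body_size: float) -> int:
    """Return one point per heuristic that fires (0 to 3)."""
    text = span.text.strip()
    if not text:
        return 0
    score = 0
    # Heuristic 1: bold text.
    if span.bold:
        score += 1
    # Heuristic 2: every alphabetic character is upper-case.
    letters = [c for c in text if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        score += 1
    # Heuristic 3: noticeably larger than the body font (assumed ratio).
    if span.size >= body_size * 1.15:
        score += 1
    return score


def is_header(span: Span, body_size: float, threshold: int = 2) -> bool:
    """Treat a span as a header when enough heuristics agree."""
    return header_score(span, body_size) >= threshold
```

Requiring two of three signals means a bold word inside body text is not promoted to a header on its own, while an all-caps line in a larger font still qualifies even if it is not bold.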

Enforced Linear Reading Order: Added a sort for the text_rects list (text_rects.sort(key=lambda r: (r.y0, r.x0))) immediately after text blocks are identified. This ensures a strict top-to-bottom, left-to-right processing flow, fixing parsing errors on multi-column layouts.
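
To show what that sort key does, here is a small standalone sketch. The `Rect` namedtuple is a hypothetical stand-in that only mimics the `x0`/`y0`/`x1`/`y1` attributes of a PyMuPDF rectangle, and the coordinates are made up.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the same coordinate attributes as a PyMuPDF Rect.
Rect = namedtuple("Rect", "x0 y0 x1 y1")

text_rects = [
    Rect(300, 50, 550, 100),   # right block, top of page
    Rect(50, 120, 280, 200),   # left block, lower on the page
    Rect(50, 50, 280, 100),    # left block, top of page
]

# Same key as in the PR: top edge first, then left edge as the tie-breaker.
text_rects.sort(key=lambda r: (r.y0, r.x0))

# (y0, x0) pairs in processing order.
order = [(r.y0, r.x0) for r in text_rects]
```

After the sort, blocks that share a top edge are visited left to right before any block that starts lower on the page.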

  2. In pdf_handler.py (Section Extraction):

Added _split_markdown_by_headers method: This new helper function provides a deterministic way to pre-process the Markdown text, splitting it into a dictionary of sections based on ## headers before any LLM calls are made.
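
As a rough illustration of this kind of deterministic split (a sketch, not the method added in this PR), the function below maps each `##` header to the text under it. The function name, the regex, and the choice to discard text before the first header are assumptions.

```python
import re

# Matches level-2 headers only; "### Sub" does not match because "##" must be
# followed by whitespace.
_H2 = re.compile(r"^##\s+(.+?)\s*$")


def split_markdown_by_headers(markdown: str) -> dict[str, str]:
    """Split Markdown into {header title: body text} using '## ' headers."""
    sections: dict[str, str] = {}
    current = None          # title of the section being collected
    lines: list[str] = []   # body lines of the current section
    for line in markdown.splitlines():
        match = _H2.match(line)
        if match:
            if current is not None:
                sections[current] = "\n".join(lines).strip()
            current = match.group(1)
            lines = []
        elif current is not None:
            # Lines before the first header are dropped (assumed behaviour).
            lines.append(line)
    if current is not None:
        sections[current] = "\n".join(lines).strip()
    return sections


resume_md = "Jane Doe\n## Education\nBSc\n\n## Skills\nPython\n### Tools\nGit"
sections = split_markdown_by_headers(resume_md)
```

Deeper headers such as `###` stay inside the body of their parent section, and a repeated title would overwrite the earlier entry in this simplified version.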

Refactored _extract_all_sections_separately method: The original logic, which made multiple, full-document LLM calls, has been replaced. The new implementation first uses _split_markdown_by_headers to get structured data and then sends only the small, relevant text chunk for each section to the LLM for analysis. This improves efficiency and reliability.
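
The per-section flow described here might be sketched as follows. The LLM client is replaced by a plain callable, and `extract_sections`, `analyze`, and `fake_analyze` are hypothetical names rather than identifiers from pdf_handler.py.

```python
from typing import Callable


def extract_sections(
    sections: dict[str, str],
    analyze: Callable[[str, str], str],
) -> dict[str, str]:
    """Call `analyze` once per non-empty section, passing only that section's text."""
    results: dict[str, str] = {}
    for name, text in sections.items():
        if not text.strip():
            continue  # nothing to analyze, so no call is made
        results[name] = analyze(name, text)
    return results


# Stand-in for the LLM call: records what it receives and echoes the length.
calls: list[tuple[str, str]] = []


def fake_analyze(name: str, text: str) -> str:
    calls.append((name, text))
    return f"{name}: {len(text)} chars"


results = extract_sections(
    {"Education": "BSc", "Skills": "Python, Git", "Awards": ""},
    fake_analyze,
)
```

Each call sees only its own chunk, so the prompt size scales with the section rather than with the whole document.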

@Mohd-Mursaleen

You messed up the formatting. Use Black.

@Dhruv-Sharma01
Author

Dhruv-Sharma01 commented Oct 10, 2025

Check it.

@Mohd-Mursaleen

> Check it.

👍🏻


Development

Successfully merging this pull request may close these issues.

Improve PDF Parser Robustness for Diverse Resume Layouts and Styles
