Feature Request: Extract clickable URLs from PDF text #152

@nagarajp2004

Problem

Currently, PDFHandler extracts only the visible text from resumes.
If a link appears in the PDF as clickable text (e.g., "GitHub"), the underlying URL is not captured.
As a result, the JSON resume is missing these URLs in the "profiles" section.

Proposed Solution

Enhance to_markdown (or PDFHandler) to:

  1. Extract link annotations from each page (e.g., link['uri'] from PyMuPDF).
  2. Append the extracted URLs to the text passed to the LLM prompt.
  3. Update the LLM prompt so it uses these URLs when filling in the JSON fields.
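
A minimal sketch of steps 1 and 2, assuming PyMuPDF is installed and importable as `fitz`. The function names `format_link_section` and `extract_text_with_links` are hypothetical and are not part of the existing PDFHandler; the "Links found in the PDF" heading is likewise just one possible way to present the URLs to the LLM.

```python
from typing import Iterable, Mapping


def format_link_section(links: Iterable[Mapping[str, str]]) -> str:
    """Render label/uri pairs as a plain-text block, skipping empty and duplicate URIs."""
    seen = set()
    lines = []
    for link in links:
        uri = (link.get("uri") or "").strip()
        if not uri or uri in seen:
            continue
        seen.add(uri)
        # Collapse whitespace in the anchor text (it may span several lines).
        label = " ".join((link.get("label") or "").split())
        lines.append(f"- {label}: {uri}" if label else f"- {uri}")
    if not lines:
        return ""
    return "Links found in the PDF:\n" + "\n".join(lines)


def extract_text_with_links(pdf_path: str) -> str:
    """Return the visible text of a PDF followed by a list of its link URLs."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency on it

    parts = []
    found = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            parts.append(page.get_text())
            for link in page.get_links():
                uri = link.get("uri")  # only external (URI) links carry this key
                if uri:
                    # Text under the link rectangle, e.g. "GitHub".
                    label = page.get_textbox(link["from"])
                    found.append({"label": label, "uri": uri})
    text = "\n".join(parts)
    section = format_link_section(found)
    return f"{text}\n\n{section}" if section else text
```

The string returned by `extract_text_with_links` could then be passed to the LLM prompt in place of the plain extracted text, so that a label such as "GitHub" appears next to its URL.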

Benefits

  • Improves accuracy of profile extraction (GitHub, LinkedIn, portfolio links).
  • Ensures that clickable links in resumes are not lost.
  • Makes the system more robust for real-world resumes.
