Prototype option 1 for PDF to HTML - pdf2htmlEX + Nokogiri #4170

Open
@meganhicks

Description

As a developer, I want to prototype a solution that uses pdf2htmlEX to convert PDFs to HTML and processes the output with Nokogiri, so that I can determine whether this approach consistently produces structured, accessible HTML.

Details on decision narratives here (including a mock letter and an examination of 961 letter content).

Hypothetical implementation here.

Acceptance Criteria

  1. The service accepts a PDF file as input.
  2. pdf2htmlEX successfully converts the PDF into HTML while maintaining layout structure.
  3. Nokogiri processes the HTML to:
     - Convert text to headings (h1, h2) based on font size.
     - Convert list content to list markup (ul, ol, li).
     - Structure tables with semantic table elements (e.g. table, tr, td).
     - Ensure images have alt text.
     - Remove absolute positioning styles for accessibility.
  4. The service returns a well-structured HTML output.
  5. The output is evaluated for consistency across multiple decision narrative PDFs.
  6. This prototype is compared to the others and a recommendation is made.
