Prototype for PDF to HTML option 2 - Ruby Prototype #4172

Open
@meganhicks

Description

As a developer, I want to prototype a solution using Ruby libraries (pdf-reader, pdf-extract, hexapdf, nokogiri) to extract and structure PDF content, so that I can determine whether a native Ruby approach provides more control and consistency than the other prototype options.

Details on decision narratives here (including a mock letter and an examination of 961 letter content).

Hypothetical implementation here.

Acceptance Criteria

  1. The service accepts a PDF file as input.
  2. pdf-reader extracts text while maintaining paragraph structure.
  3. pdf-extract identifies and extracts structured elements:
    Headings (h1, h2), based on font size.
    Lists (ul, ol), distinguishing ordered from unordered lists.
    Tables, with correct structural markup.
  4. hexapdf extracts images and assigns alt text.
  5. Extracted content is processed with Nokogiri to generate structured, accessible HTML.
  6. The output is evaluated for consistency across multiple decision narrative PDFs.
  7. This prototype is compared to the others, and a recommendation is made.
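
As a rough illustration of criteria 3 and 5, here is a minimal sketch of how extracted text runs might be classified into headings, list items, and paragraphs and then emitted as HTML. Everything named here is an assumption for illustration: the `Run` struct, the `classify` and `runs_to_html` helpers, the font-size thresholds, and the list-marker patterns are hypothetical and are not taken from pdf-reader, pdf-extract, or the linked implementation notes. The sketch uses only the Ruby standard library so that it runs without gems; in the prototype itself the runs would come from the PDF libraries and the markup would presumably be built with Nokogiri.

```ruby
# Hypothetical input shape: one text run per line, carrying its font size in points.
Run = Struct.new(:text, :font_size)

HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Assumed thresholds; real letters would need these tuned against sample PDFs.
H1_MIN_SIZE = 18.0
H2_MIN_SIZE = 14.0

# Assumed list markers: "1." or "1)" for ordered, a bullet, "*" or "-" for unordered.
ORDERED_ITEM = /\A\s*\d+[.)]\s+/
UNORDERED_ITEM = /\A\s*[\u2022*-]\s+/

def escape_html(text)
  text.gsub(/[&<>"]/, HTML_ESCAPES)
end

# Returns [tag, cleaned_text], where tag is :h1, :h2, :ol, :ul, or :p.
def classify(run)
  return [:h1, run.text.strip] if run.font_size >= H1_MIN_SIZE
  return [:h2, run.text.strip] if run.font_size >= H2_MIN_SIZE
  return [:ol, run.text.sub(ORDERED_ITEM, "").strip] if run.text.match?(ORDERED_ITEM)
  return [:ul, run.text.sub(UNORDERED_ITEM, "").strip] if run.text.match?(UNORDERED_ITEM)

  [:p, run.text.strip]
end

# Groups consecutive list items of the same kind under a single <ol> or <ul>.
def runs_to_html(runs)
  html = []
  open_list = nil

  runs.each do |run|
    tag, text = classify(run)
    list_tag = %i[ol ul].include?(tag) ? tag : nil

    # Close the current list when the next run is not an item of the same kind.
    if open_list && open_list != list_tag
      html << "</#{open_list}>"
      open_list = nil
    end

    if list_tag
      unless open_list
        html << "<#{list_tag}>"
        open_list = list_tag
      end
      html << "<li>#{escape_html(text)}</li>"
    else
      html << "<#{tag}>#{escape_html(text)}</#{tag}>"
    end
  end

  html << "</#{open_list}>" if open_list
  html.join("\n")
end
```

Calling `runs_to_html` with an array of `Run` values returns a newline-joined HTML fragment. Font size alone is a weak heading signal, so a real evaluation for criterion 6 would likely need to check how stable these thresholds are across several decision narrative PDFs.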
