Closed
Description
As a developer, I want to prototype a solution using Ruby libraries (pdf-reader, pdf-extract, hexapdf, nokogiri) to extract and structure PDF content, so that I can determine whether a native Ruby approach provides more control and consistency.
Details on decision narratives here (including mock letter, and examination of 961 letter content).
Hypothetical implementation here.
Acceptance Criteria
- The service accepts a PDF file as input.
- pdf-reader extracts text while maintaining paragraph structure.
- pdf-extract identifies and extracts structured elements:
  - Headings (h1, h2), based on font size.
  - Lists (ul, ol), distinguishing ordered from unordered lists.
  - Tables with correct table markup.
- hexapdf extracts images and assigns alt text.
- Extracted content is processed with Nokogiri to generate structured, accessible HTML.
- The output is evaluated for consistency across multiple decision narrative PDFs.
- The prototype is compared to the others and a recommendation is made.
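As a rough illustration of the heading and list criteria above, here is a minimal sketch of the classification step. It assumes an upstream extractor has already produced one record per line of text with its font size; the `TextLine` struct, the point-size thresholds, and the marker patterns are all hypothetical and would need tuning against real letters. It uses only the standard library so it runs on its own; in the prototype, Nokogiri would presumably handle the HTML assembly.

```ruby
require "cgi"

# Hypothetical intermediate format: one record per extracted line of text.
# Producing these from a PDF is the extractor's job and is not shown here.
TextLine = Struct.new(:text, :font_size)

# Assumed thresholds in points; not taken from any real decision letter.
H1_MIN_SIZE = 18.0
H2_MIN_SIZE = 14.0

# Assumed list markers: a bullet character, or a number followed by . or )
BULLET_PATTERN   = /\A\s*[•\-*]\s+/
NUMBERED_PATTERN = /\A\s*\d+[.)]\s+/

LIST_KINDS = %i[ul ol].freeze

# Classify a line as :h1, :h2, :ul, :ol, or :p.
def classify(line)
  return :h1 if line.font_size >= H1_MIN_SIZE
  return :h2 if line.font_size >= H2_MIN_SIZE
  return :ul if line.text.match?(BULLET_PATTERN)
  return :ol if line.text.match?(NUMBERED_PATTERN)

  :p
end

# Remove the list marker so it is not repeated inside the <li>.
def strip_marker(text, kind)
  pattern = kind == :ul ? BULLET_PATTERN : NUMBERED_PATTERN
  text.sub(pattern, "").strip
end

# Build an HTML fragment, grouping consecutive items of the same list
# kind under a single <ul> or <ol>. All text is HTML-escaped.
def lines_to_html(lines)
  groups = lines.chunk_while do |a, b|
    kind = classify(a)
    LIST_KINDS.include?(kind) && kind == classify(b)
  end

  groups.map do |group|
    kind = classify(group.first)
    if LIST_KINDS.include?(kind)
      items = group.map { |l| "<li>#{CGI.escapeHTML(strip_marker(l.text, kind))}</li>" }
      "<#{kind}>#{items.join}</#{kind}>"
    else
      "<#{kind}>#{CGI.escapeHTML(group.first.text.strip)}</#{kind}>"
    end
  end.join("\n")
end
```

Calling `lines_to_html` with an array of `TextLine` records returns the fragment as a string. Running the same input through it twice and comparing the output is one cheap way to start on the consistency criterion.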
Notes:
Gabe proposed making this modular and collaborating on it. He discussed the idea in depth on Slack.
New Note
The project scope has been narrowed to VBMS-generated decision letters only. Use these letters to test the prototype.