Open
Bug description
Parsing a publicly available PDF file (https://www.novo-pi.com/ozempic.pdf) throws an exception:
Failed to ingest PDF file ozempic-pi.pdf
java.lang.RuntimeException: Failed to ingest PDF file ozempic-pi.pdf
    at com.vodori.platform.ai.advisor.service.DocumentIngestionService.ingestSupportingDocuments(DocumentIngestionService.java:103)
    at com.vodori.platform.ai.advisor.DocumentIngestionTest.setUp(DocumentIngestionTest.java:56)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
Caused by: java.lang.StringIndexOutOfBoundsException: Index 0 out of bounds for length 0
    at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:55)
    at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:52)
    at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:213)
    at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:210)
    at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:98)
    at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
    at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
    at java.base/java.lang.String.checkIndex(String.java:4832)
    at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:46)
    at java.base/java.lang.String.charAt(String.java:1555)
    at org.springframework.ai.reader.pdf.layout.CharacterFactory.getCharacterFromTextPosition(CharacterFactory.java:97)
    at org.springframework.ai.reader.pdf.layout.CharacterFactory.createCharacterFromTextPosition(CharacterFactory.java:46)
    at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.writeLine(ForkPDFLayoutTextStripper.java:114)
    at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.writeTextPositionList(ForkPDFLayoutTextStripper.java:148)
    at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.iterateThroughTextList(ForkPDFLayoutTextStripper.java:136)
    at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.writePage(ForkPDFLayoutTextStripper.java:85)
    at org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea.writePage(PDFLayoutTextStripperByArea.java:150)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:380)
    at org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper.processPage(ForkPDFLayoutTextStripper.java:68)
    at org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea.extractRegions(PDFLayoutTextStripperByArea.java:123)
    at org.springframework.ai.reader.pdf.PagePdfDocumentReader.get(PagePdfDocumentReader.java:141)
    at org.springframework.ai.reader.pdf.PagePdfDocumentReader.get(PagePdfDocumentReader.java:48)
    at org.springframework.ai.document.DocumentReader.read(DocumentReader.java:25)
    at com.vodori.platform.ai.advisor.service.DocumentIngestionService.ingestSupportingDocuments(DocumentIngestionService.java:79)
    ... 3 more
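The root cause in the trace is `String.charAt` called with index 0 on a zero-length string inside `CharacterFactory.getCharacterFromTextPosition`. I have not confirmed which string is empty for this PDF, so the following is only a minimal, library-independent sketch of the failure mode and of the kind of guard that would avoid it. The class `EmptyGlyphTextSketch` and both helper methods are hypothetical names for illustration and are not part of Spring AI or PDFBox.

```java
class EmptyGlyphTextSketch {

    // Hypothetical helper (not Spring AI code): returns the first character
    // of the text, or the given fallback when the text is null or empty.
    public static char firstCharOrDefault(String text, char fallback) {
        if (text == null || text.isEmpty()) {
            return fallback;
        }
        return text.charAt(0);
    }

    // Demonstrates the failure mode from the trace: an unguarded charAt(0)
    // throws StringIndexOutOfBoundsException when the string has length 0.
    public static boolean unguardedCharAtThrows(String text) {
        try {
            text.charAt(0);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(unguardedCharAtThrows(""));          // prints "true"
        System.out.println(firstCharOrDefault("", ' ') == ' '); // prints "true"
        System.out.println(firstCharOrDefault("Ozempic", ' ')); // prints "O"
    }
}
```

Whether an empty value should map to a space, be skipped, or be handled some other way is a design decision for the library maintainers; the sketch only shows that a length check before `charAt(0)` is enough to avoid this particular exception.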
Environment
- Spring AI 1.0.M8
- Spring Boot 3.4.5
- macOS
- Java 21
Steps to reproduce
Load the PDF linked above as `resource` and call the following (taken from the Spring AI docs):
List<Document> pages = new PagePdfDocumentReader(resource,
        PdfDocumentReaderConfig.builder()
            .withPageTopMargin(0)
            .withPageExtractedTextFormatter(ExtractedTextFormatter.builder().withNumberOfTopTextLinesToDelete(0).build())
            .withPagesPerDocument(1).build())
    .read();
Expected behavior
The call returns the list of pages without throwing.