This repository was archived by the owner on Mar 24, 2026. It is now read-only.

fix: pdf form parsing #117

Open

hugohonda wants to merge 6 commits into main from fix/pdf-form-parsing

Contributor

hugohonda commented Sep 17, 2025 (edited)

Form elements were disappearing from SDK output even though they were present in API responses. The SDK was splitting PDFs unnecessarily with the pypdf library, which strips interactive form elements during PDF manipulation.

This PR replaces pypdf with pymupdf and fixes the minimum-pages split logic, so a PDF is only split when it has more pages than the requested split size.
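For illustration only, the split-size rule described above could be sketched as a pure page-range planner. The name `plan_page_ranges` and its signature are assumptions made for this example and are not taken from the SDK; the inclusive, zero-based indices follow the `start_page_idx` / `end_page_idx` convention visible in the diff under review.

```python
from typing import List, Tuple


def plan_page_ranges(total_pages: int, split_size: int) -> List[Tuple[int, int]]:
    """Return inclusive, zero-based (start, end) page ranges, one per part.

    Hypothetical helper for illustration; not the SDK's actual function.
    """
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    if split_size <= 0:
        raise ValueError("split_size must be positive")

    # A document that fits in one part needs no splitting:
    # a single range covers every page.
    if total_pages <= split_size:
        return [(0, total_pages - 1)]

    ranges: List[Tuple[int, int]] = []
    for start in range(0, total_pages, split_size):
        # The last part may be shorter than split_size.
        end = min(start + split_size, total_pages) - 1
        ranges.append((start, end))
    return ranges
```

Under these assumptions, a 3-page form with a split size of 10 yields one range, so the caller has no reason to rewrite the PDF at all.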

https://landingai.slack.com/archives/C07KNEGHWKA/p1757949372178239
https://app.asana.com/1/504311096896991/project/1206677697418483/task/1211376367674705?focus=true

hugohonda self-assigned this

github-actions Bot commented Sep 17, 2025

❌ Integration tests failed. Please check the logs.



          fix: pdf form parsing

dad5441

fix mypy

update test to pymupdf + split size limit scenario

update test to pymupdf

update to use get file path function

update mypy

update mypy

update tests

hugohonda force-pushed the fix/pdf-form-parsing branch from 3a9789c to dad5441

September 17, 2025 22:06

hugohonda added 2 commits

September 17, 2025 19:33


          remove pypdf

15dc626


          fix mypy

c4cf3b0

hugohonda force-pushed the fix/pdf-form-parsing branch from d3a3f69 to c4cf3b0

September 18, 2025 15:39

github-actions Bot commented Sep 18, 2025

❌ Integration tests failed. Please check the logs.


          revert poetry lock

f31b719

hugohonda force-pushed the fix/pdf-form-parsing branch from 94e100f to f31b719

September 18, 2025 15:59

github-actions Bot commented Sep 18, 2025

❌ Integration tests failed. Please check the logs.

camiloaz approved these changes


camiloaz suggested changes


Member

camiloaz left a comment

Left some comments after thinking more about it.

agentic_doc/parse.py Outdated

Comment on lines +582 to +612

+                      if total_pages <= split_size:
+                          # Process the PDF directly without splitting
+                          file_path = Path(file_path)
+                          return _parse_doc_parts(
+                              Document(
+                                  file_path=file_path,
+                                  start_page_idx=0,
+                                  end_page_idx=total_pages - 1,
+                              ),
+                              include_marginalia=include_marginalia,
+                              include_metadata_in_markdown=include_metadata_in_markdown,
+                              extraction_model=extraction_model,
+                              extraction_schema=extraction_schema,
+                              config=config,
+                          )
+                      # Split PDF using the already opened document
+                      with tempfile.TemporaryDirectory() as temp_dir:
+                          file_path = Path(file_path)
+                          parts = split_pdf(pdf_doc, temp_dir, split_size, file_stem=file_path.stem)
+                          part_results = _parse_doc_in_parallel(
+                              parts,
+                              doc_name=file_path.name,
+                              include_marginalia=include_marginalia,
+                              include_metadata_in_markdown=include_metadata_in_markdown,
+                              extraction_model=extraction_model,
+                              extraction_schema=extraction_schema,
+                              config=config,
+                          )
+                          split_type = (
+                              config.split if config and config.split is not None else SplitType.full

Member

camiloaz Oct 7, 2025

Instead of changing this function, I would change only the split_pdf function so that it checks the page count and, when no split is needed, returns a list with a single part that points at the unmodified document. That seems less risky and easier to maintain to me, because this change repeats some of the existing logic.

Contributor Author

hugohonda Oct 8, 2025

Agreed! Thanks
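A minimal sketch of the approach agreed on in this thread, assuming a simplified `Document` with only the three fields shown in the diff and a hypothetical `write_part` callback that stands in for the real PDF writing. The function name `split_pdf_sketch` and its signature are illustrative and do not reproduce the SDK's `split_pdf`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Document:
    # Mirrors the fields used in the diff above; the real class may have more.
    file_path: Path
    start_page_idx: int
    end_page_idx: int


# Hypothetical callback: writes pages [start, end] of the source PDF to a new
# file and returns that file's path. Real PDF writing would live here.
PartWriter = Callable[[Path, int, int], Path]


def split_pdf_sketch(
    source: Path, total_pages: int, split_size: int, write_part: PartWriter
) -> List[Document]:
    """Split into parts, or return the untouched source when no split is needed."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")

    if total_pages <= split_size:
        # Nothing to split: point at the original file so the PDF is never
        # rewritten and the caller keeps a single code path.
        return [Document(source, 0, total_pages - 1)]

    parts: List[Document] = []
    for start in range(0, total_pages, split_size):
        end = min(start + split_size, total_pages) - 1
        parts.append(Document(write_part(source, start, end), start, end))
    return parts
```

With the check inside the splitting function, the calling code can hand whatever list comes back to its existing parallel-parsing path, which is the duplication the review comment wanted to avoid.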

hugohonda added 2 commits

October 8, 2025 08:19


          skip split if the doc total pages is smaller than split requested

6bd611f


          update tests

0b665e6

