This repository was archived by the owner on Mar 24, 2026. It is now read-only.
|
❌ Integration tests failed. Please check the logs. |
fix mypy
update test to pymupdf + split size limit scenario
update test to pymupdf
update to use get file path function
update mypy
update mypy
update tests
3a9789c to dad5441
d3a3f69 to c4cf3b0
|
❌ Integration tests failed. Please check the logs. |
94e100f to f31b719
|
❌ Integration tests failed. Please check the logs. |
camiloaz approved these changes Oct 7, 2025
camiloaz suggested changes Oct 7, 2025
camiloaz (Member) left a comment:
Left some comments after thinking more about it.
Comment on lines +582 to +612
    if total_pages <= split_size:
        # Process the PDF directly without splitting
        file_path = Path(file_path)
        return _parse_doc_parts(
            Document(
                file_path=file_path,
                start_page_idx=0,
                end_page_idx=total_pages - 1,
            ),
            include_marginalia=include_marginalia,
            include_metadata_in_markdown=include_metadata_in_markdown,
            extraction_model=extraction_model,
            extraction_schema=extraction_schema,
            config=config,
        )

    # Split PDF using the already opened document
    with tempfile.TemporaryDirectory() as temp_dir:
        file_path = Path(file_path)
        parts = split_pdf(pdf_doc, temp_dir, split_size, file_stem=file_path.stem)
        part_results = _parse_doc_in_parallel(
            parts,
            doc_name=file_path.name,
            include_marginalia=include_marginalia,
            include_metadata_in_markdown=include_metadata_in_markdown,
            extraction_model=extraction_model,
            extraction_schema=extraction_schema,
            config=config,
        )
        split_type = (
            config.split if config and config.split is not None else SplitType.full
Instead of changing this function, I would change only the split_pdf function so that it checks the page count and, when the document fits within split_size, returns a list with a single part containing the unmodified document. That seems less risky and easier to maintain to me, because you are repeating some logic here.
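To make the suggestion concrete, here is a minimal sketch of the page-range planning such a split function could do. It is not the repository's split_pdf: `PartRange`, `plan_parts`, and the `needs_rewrite` flag are hypothetical names introduced only for illustration, and the sketch deliberately leaves out any PDF library calls.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PartRange:
    """Hypothetical stand-in for one split part (inclusive, 0-based pages)."""

    start_page_idx: int
    end_page_idx: int
    # Assumption for illustration: False means the caller can hand the
    # original, unmodified file to the parser instead of writing a new PDF.
    needs_rewrite: bool


def plan_parts(total_pages: int, split_size: int) -> List[PartRange]:
    """Return the page ranges to parse for a document of total_pages pages."""
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    if split_size < 1:
        raise ValueError("split_size must be at least 1")

    # Short-circuit inside the splitter: a document that fits in one part
    # is returned as a single part covering every page, untouched.
    if total_pages <= split_size:
        return [PartRange(0, total_pages - 1, needs_rewrite=False)]

    # Otherwise emit consecutive chunks of at most split_size pages.
    return [
        PartRange(start, min(start + split_size, total_pages) - 1, needs_rewrite=True)
        for start in range(0, total_pages, split_size)
    ]
```

With this shape, the calling function keeps a single code path: it always iterates over the returned parts, and a short document simply yields one part that points at the original file, so form elements are never passed through a rewrite.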
Form elements were disappearing from the SDK output despite being present in the API responses. The SDK was unnecessarily splitting PDFs with the pypdf library, which strips interactive form elements during PDF manipulation.
Used pymupdf instead of pypdf and fixed the minimum-pages split logic.
https://landingai.slack.com/archives/C07KNEGHWKA/p1757949372178239
https://app.asana.com/1/504311096896991/project/1206677697418483/task/1211376367674705?focus=true