feat: add ConversionStatus.TIMEOUT to differentiate from page failures #3211
joaquinhuigomez wants to merge 1 commit into docling-project:main
✅ DCO Check Passed: Thanks @joaquinhuigomez, all your commits are properly signed off. 🎉
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Related Documentation: 1 document(s) may need updating based on files changed in this PR.
Docling: What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
Suggested changes:
@@ -1,6 +1,7 @@
### PDF
- **Pipeline/Backend**: `StandardPdfPipeline` + `DoclingParseDocumentBackend` (default: `docling_parse`)
- **Key Options**:
+ - `document_timeout` (default: None): Maximum processing time in seconds before aborting document conversion. When exceeded, the pipeline stops processing and returns partial results with `TIMEOUT` status. If None, no timeout is enforced. The `TIMEOUT` status is a dedicated status value that allows downstream consumers to distinguish between partial results caused by `document_timeout` being exceeded versus individual page conversion failures (which remain `PARTIAL_SUCCESS`). Both `StandardPdfPipeline` and `PaginatedPipeline` emit `TIMEOUT` when the document timeout is exceeded, and `DocumentConverter` treats `TIMEOUT` as non-fatal (similar to `PARTIAL_SUCCESS`).
- `from_formats`: Supported input formats include `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md` (including `txt`, `text`, `qmd`, `rmd`), `csv`, `xlsx`, `xml_uspto`, `xml_jats`, `xml_xbrl`, `mets_gbs`, `json_docling`, `audio`, `vtt`, `latex`
- `to_formats`: Supported output formats include `md`, `json`, `yaml`, `html`, `html_split_page`, `text`, `doctags`, `vtt`
- `pdf_backend`: Allowed values: `pypdfium2`, `docling_parse`, `dlparse_v1`, `dlparse_v2`, `dlparse_v4` (default: `docling_parse`)
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Thanks for the review! Anything else needed to merge?
@joaquinhuigomez Thanks for this contribution, I think it makes sense. Before we merge, however, we must do a thorough scan of several client codebases that have hard-coded expectations about "terminal states" in Docling. So far these all just watch for
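To illustrate the concern about hard-coded terminal states, here is a minimal sketch of how a client-side check could miss a newly added status. Everything in it is hypothetical: the comment does not say which states the clients actually watch for, so the status strings, the `LEGACY_TERMINAL` and `UPDATED_TERMINAL` sets, and the `is_terminal` helper are illustrative assumptions, not code from any real Docling client.

```python
# Hypothetical client-side terminal-state check; the status strings and
# set contents are assumptions made for illustration only.
LEGACY_TERMINAL = frozenset({"success", "partial_success", "failure"})

# The same set extended with the new status proposed in this PR.
UPDATED_TERMINAL = LEGACY_TERMINAL | {"timeout"}


def is_terminal(status: str, terminal: frozenset = LEGACY_TERMINAL) -> bool:
    """Return True if the client would stop waiting on this status."""
    return status in terminal


# A client using the legacy set would keep waiting on a timed-out job.
legacy_sees_timeout = is_terminal("timeout")
updated_sees_timeout = is_terminal("timeout", UPDATED_TERMINAL)
```

A client built around a fixed set like this would treat the new status as "still running" until the set is updated, which is the kind of breakage the scan is meant to catch.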
Add a dedicated TIMEOUT status so downstream consumers can distinguish between partial results caused by document_timeout being reached versus individual page conversion failures (which remain PARTIAL_SUCCESS).

Fixes docling-project#3205

Signed-off-by: Joaquin Hui <joaquinhui1995@gmail.com>
Signed-off-by: Joaquin Hui Gomez <132194176+joaquinhuigomez@users.noreply.github.com>
6a54bd5 to caa8799
Makes sense; take your time on the scan. I've also fixed the DCO signoff.
Add a dedicated `ConversionStatus.TIMEOUT` status so downstream consumers can distinguish between partial results caused by `document_timeout` being reached versus individual page conversion failures (which remain `PARTIAL_SUCCESS`). Both pipelines (threaded `StandardPdfPipeline` and legacy `PaginatedPipeline`) now emit `TIMEOUT` when the document timeout is exceeded. The `DocumentConverter` and `DocumentExtractor` treat `TIMEOUT` as non-fatal, consistent with existing `PARTIAL_SUCCESS` handling.

Fixes #3205
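As a rough sketch of what this enables for downstream consumers, the snippet below branches on the status to tell a timeout apart from page failures. It uses a stand-in enum rather than importing from Docling, since `TIMEOUT` only exists with this change applied; the member string values and the `describe` helper are assumptions made for illustration.

```python
from enum import Enum


class ConversionStatus(str, Enum):
    """Stand-in for Docling's enum with only the members used here.

    The string values are assumed, not copied from the library.
    """

    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    TIMEOUT = "timeout"  # the status proposed in this PR
    FAILURE = "failure"


def describe(status: ConversionStatus) -> str:
    """Hypothetical helper: map a status to a consumer-facing outcome."""
    if status is ConversionStatus.SUCCESS:
        return "complete"
    if status is ConversionStatus.TIMEOUT:
        # Partial results because document_timeout was exceeded.
        return "partial: document_timeout exceeded"
    if status is ConversionStatus.PARTIAL_SUCCESS:
        # Partial results because individual pages failed to convert.
        return "partial: page conversion failures"
    return "failed"


timeout_outcome = describe(ConversionStatus.TIMEOUT)
partial_outcome = describe(ConversionStatus.PARTIAL_SUCCESS)
```

With a single `PARTIAL_SUCCESS` status the two partial cases would be indistinguishable; with a dedicated status a consumer could, for example, retry only the timed-out documents with a larger `document_timeout`.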