feat(latex): add optional Tectonic TikZ rendering #3369
This commit introduces a high-performance, asynchronous pipeline for rendering TikZ diagrams into images during LaTeX document conversion.
Key Changes:
- Tectonic Integration (`TectonicEngine`): Compiles `tikzpicture` environments into PDFs using Tectonic, auto-downloading the binary if missing. Rasterizes the PDF to 300 DPI images.
- Asynchronous Processing: Utilizes a dynamic `ThreadPoolExecutor` (scaled to `os.cpu_count() - 1`) to render multiple diagrams concurrently without blocking the main document conversion pipeline.
- Preamble Extraction: Dynamically parses the main document's preamble and injects it into standalone diagrams to ensure compatibility with complex libraries (e.g., `pgfgantt`, `tikz-cd`, `tkz-euclide`).
- Graceful Fallbacks: If Tectonic compilation fails due to LaTeX syntax errors or incompatible packages, the engine gracefully falls back to preserving the raw TikZ source code as a `CodeMetaField` to prevent data loss.
- CLI Support: Added `--tikz-engine tectonic` option to enable the backend configuration.
Resolves pre-commit hooks (MyPy, Ruff linter/formatter).
Signed-off-by: Aditya Sasidhar <arctic@arctic>
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
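As a rough illustration of the asynchronous processing and graceful fallback described in this commit message, the following minimal sketch renders a batch of diagrams on a thread pool sized to `os.cpu_count() - 1` and keeps the raw source whenever a render fails. The names `render_all`, `render_one`, and the `("image", ...)` / `("code", ...)` result tuples are hypothetical and are not taken from the PR; unlike the real pipeline, this sketch simply waits for all results before returning.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Hypothetical result shape: ("image", png_bytes) on success,
# ("code", tikz_source) when rendering falls back to the raw source.
RenderResult = Tuple[str, object]


def render_all(
    diagrams: List[str],
    render_one: Callable[[str], bytes],
) -> List[RenderResult]:
    """Render every diagram concurrently, keeping the raw source on failure."""
    # Leave one core free for the main conversion; never go below one worker.
    workers = max(1, (os.cpu_count() or 2) - 1)

    def safe_render(source: str) -> RenderResult:
        try:
            return ("image", render_one(source))
        except Exception:
            # Any failure (compile error, timeout, rasterization) keeps the source.
            return ("code", source)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with the diagrams.
        return list(pool.map(safe_render, diagrams))
```

Passing the renderer in as a callable keeps the sketch independent of Tectonic, so the fallback path can be exercised with a stub that raises.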
…ndency staging
Add opt-in TikZ image rendering for the LaTeX backend using Tectonic,
while preserving stable fallback behavior when rendering fails.
What this changes:
- add optional `tikz_engine="tectonic"` backend support for TikZ diagrams
- render `tikzpicture` environments asynchronously during LaTeX parsing
- preserve raw TikZ code as `PictureMeta.code` whenever rendering fails,
times out, or rasterization cannot complete
- add Tectonic engine options for:
  - automatic binary download
  - per-diagram timeout
  - shell escape control
- make shell escape explicit opt-in via CLI/backend config
- sanitize known pdfTeX-only assignment lines in preambles for better
Tectonic/XeTeX compatibility
- restore file-backed relative TikZ compatibility by staging only explicit
local dependencies (`\input`, `\include`, `\includegraphics`) into the
temp render directory
- block dependency path traversal and avoid ambient source-directory search
- rasterize generated PDFs with locking and crop whitespace from output
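The dependency staging and path traversal blocking in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `stage_dependencies` and its parameters are hypothetical names, only literal `\input`, `\include`, and `\includegraphics` arguments are considered, and references without a file extension are simply skipped rather than resolved the way TeX would.

```python
import re
import shutil
from pathlib import Path
from typing import List

# Matches \input{...}, \include{...} and \includegraphics[...]{...}.
_DEP_RE = re.compile(
    r"\\(?:input|include|includegraphics)\s*(?:\[[^\]]*\])?\s*\{([^}]+)\}"
)


def stage_dependencies(tikz_source: str, source_root: Path, work_dir: Path) -> List[Path]:
    """Copy explicitly referenced local files into work_dir.

    Anything that resolves outside source_root, or does not exist, is skipped.
    """
    root = source_root.resolve()
    staged: List[Path] = []
    for name in _DEP_RE.findall(tikz_source):
        candidate = (root / name.strip()).resolve()
        # Block traversal: the resolved path must stay inside the source root.
        if root not in candidate.parents:
            continue
        if not candidate.is_file():
            continue
        target = work_dir / candidate.relative_to(root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(candidate, target)
        staged.append(target)
    return staged
```

Resolving the candidate before the containment check means `../` segments and absolute paths are both rejected, and only files named in the TikZ source ever reach the temp render directory.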
CLI / config updates:
- add `--tikz-engine` / `-T`
- add `--no-tikz-engine-download`
- add `--tikz-engine-timeout`
- add `--tikz-shell-escape`
Tests:
- add focused Tectonic engine tests for download behavior, timeout,
preamble sanitization, shell escape toggling, dependency staging,
and path traversal blocking
- add backend tests for TikZ fallback behavior and file-backed source-root
handling
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
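The preamble sanitization mentioned in this commit message could look roughly like the sketch below. The list of primitives is an assumption chosen for illustration (`\pdfoutput`, `\pdfminorversion`, `\pdfcompresslevel`, `\pdfobjcompresslevel` are pdfTeX parameters that are typically unavailable under XeTeX); the PR's actual list and matching rules may differ, and `sanitize_preamble` is a hypothetical name.

```python
import re

# Assumed examples of pdfTeX-only parameters; the PR's real list may differ.
_PDFTEX_ONLY = ("pdfoutput", "pdfminorversion", "pdfcompresslevel", "pdfobjcompresslevel")

# A whole line consisting of one integer assignment, e.g. "\pdfoutput=1",
# optionally followed by a TeX comment.
_ASSIGNMENT_RE = re.compile(
    r"^\s*\\(?:" + "|".join(_PDFTEX_ONLY) + r")\s*=?\s*\d+\s*(?:%.*)?$"
)


def sanitize_preamble(preamble: str) -> str:
    """Drop lines that only assign a pdfTeX-specific parameter."""
    kept = [line for line in preamble.splitlines() if not _ASSIGNMENT_RE.match(line)]
    return "\n".join(kept)
```

Matching only complete assignment lines keeps the sanitizer conservative: package loads and macro definitions that merely mention one of these names are left untouched.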
✅ DCO Check Passed
Thanks @adityasasidhar, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Documentation Updates
1 document(s) were updated by changes in this PR:
What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
@@ -247,11 +247,15 @@
- **Pipeline/Backend**: `SimplePipeline` + `LatexDocumentBackend`
- **Key Options** (`LatexBackendOptions`):
- `parse_timeout` (default: 30.0 seconds): Maximum time allowed for parsing a LaTeX document. Set to `None` to disable the timeout. This prevents `pylatexenc` from spinning indefinitely when parsing legacy arXiv documents with complex or malformed macro environments. If parsing exceeds this timeout, the conversion will fall back to raw text extraction rather than structured parsing. A warning will be logged when a timeout occurs.
+ - `tikz_engine` (Optional[Literal["tectonic"]]): The engine to use for rendering TikZ diagrams into images. Set to `'tectonic'` to enable asynchronous image generation. Defaults to `None`.
+ - `tikz_engine_timeout` (float, default: 60.0): The timeout in seconds for rendering a single TikZ diagram.
+ - `tikz_engine_allow_shell_escape` (bool, default: False): Allow Tectonic TikZ rendering to enable shell escape during compilation. Disabled by default for safer rendering of untrusted LaTeX.
- **Processing**:
- Parses LaTeX source using `pylatexenc` to extract structured content (sections, equations, tables, etc.)
- Pre-processes custom macros (e.g., `\be`/`\ee` shortcuts for equations)
- Timeout enforcement runs parsing in a daemon thread to allow graceful fallback on timeout
-- **Notes**: The `parse_timeout` option is particularly useful for processing legacy arXiv documents that may contain complex or malformed macro environments. To configure the timeout:
+ - **TikZ Rendering**: When `tikz_engine` is set to `'tectonic'`, the backend detects `tikzpicture` environments and renders them asynchronously into images. When Tectonic compilation succeeds, the TikZ diagram is rasterized and stored as an image. When compilation fails, times out, produces no PDF, or rasterization fails, Docling preserves the original TikZ source as fallback code metadata.
+- **Notes**: The `parse_timeout` option is particularly useful for processing legacy arXiv documents that may contain complex or malformed macro environments. CLI flags are available for TikZ rendering: `--tikz-engine` / `-T`, `--tikz-engine-timeout`, and `--tikz-shell-escape`. To configure the timeout:
```python
from docling.datamodel.backend_options import LatexBackendOptions
```
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Codecov Report❌ Patch coverage is
Merged the main branch; please re-run the CI tests.
@adityasasidhar Seems like there is a failure here. I think adding these very specific TikZ options to the main docling CLI might be a bit overkill. I wonder if it would not be better to have a dedicated
@cau-git @dolfim-ibm feel free to pitch in.
If I get it right, we are talking about 3 arguments. We indeed don't need all options in the CLI; we could have clean examples for the LaTeX options instead.
Hey @PeterStaar-IBM @dolfim-ibm, I agree with you: exceeding the 1000-line limit on the cli/main.py file certainly adds overhead. The same goes for adding very specific LaTeX options; most people using docling won't know what TikZ is. Specifically those are:
I think skipping them in the CLI would probably be the better choice:
However, I'm happy to go in either direction.
Just pushed the latest changes. Apologies for the delay. The latest changes include:
all good, let's let the CI do its thing now!
yessss lets go
@dolfim-ibm I think this looks good to merge now
Resolves #3302
Description
Added an optional TikZ rendering path for the LaTeX backend using Tectonic, with configurable flags and safe fallbacks.
Here's how it works:
- `tikzpicture` environment.
- `--tikz-engine`. (tectonic automatically installs the required packages only)
- `tikzpicture` blocks and captures them atomically.
- `\input`, `\include`, `\includegraphics`
- `PictureItem`.
- `PictureMeta.code`.
Configuration added
- `--tikz-engine` / `-T`: Enables optional TikZ rendering with Tectonic.
- `--no-tikz-engine-download`: Disables automatic Tectonic download if no binary is present.
- `--tikz-engine-timeout`: Sets the timeout for rendering a single TikZ diagram.
- `--tikz-shell-escape`: Explicitly enables shell escape during Tectonic compilation. This is optional and remains disabled by default.
I tested on tens of files, but testing on a larger corpus would be great to capture the edge cases that could creep in.
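To make the defaults of the four flags under "Configuration added" concrete, here is a small self-contained sketch that parses them into a settings object. `TikzRenderConfig` and `parse_tikz_flags` are hypothetical names, and `argparse` is used only to keep the example dependency-free; it is not meant to reflect how docling's CLI is actually built. The 60-second default timeout follows the documentation update in this PR.

```python
import argparse
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TikzRenderConfig:  # hypothetical container, not a docling class
    engine: Optional[str] = None       # None means TikZ rendering stays off
    allow_download: bool = True        # auto-download Tectonic if missing
    timeout: float = 60.0              # seconds per diagram
    allow_shell_escape: bool = False   # explicit opt-in only


def parse_tikz_flags(argv: List[str]) -> TikzRenderConfig:
    """Parse the TikZ-related flags into a config object."""
    parser = argparse.ArgumentParser(prog="tikz-flags-sketch")
    parser.add_argument("--tikz-engine", "-T", choices=["tectonic"], default=None)
    parser.add_argument("--no-tikz-engine-download", action="store_true")
    parser.add_argument("--tikz-engine-timeout", type=float, default=60.0)
    parser.add_argument("--tikz-shell-escape", action="store_true")
    ns = parser.parse_args(argv)
    return TikzRenderConfig(
        engine=ns.tikz_engine,
        allow_download=not ns.no_tikz_engine_download,
        timeout=ns.tikz_engine_timeout,
        allow_shell_escape=ns.tikz_shell_escape,
    )
```

With no flags the result is rendering disabled and shell escape off, which matches the opt-in behavior the description emphasizes.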
Furthermore, installing Tectonic via the default curl script could pose a security risk, so we could instead pick an appropriate version and store it in the docling Hugging Face repo, so that the download is a simple curl request to a safe and known source.
Currently we use this:
It could be an HF curl request instead. I have not yet added installation support for Windows; this should be addressed in the next commit.
Checklist: