fix(utils): prevent path traversal in encode_images via embedded image filenames by sebastiondev · Pull Request #115 · adithya-s-k/omniparse

sebastiondev · 2026-05-29T11:28:23Z

Vulnerability Summary

CWE-22 (Path Traversal) — Arbitrary file write via malicious embedded image filenames in omniparse/utils.py

The encode_images() function in omniparse/utils.py receives a dictionary of {filename: image} pairs extracted from uploaded documents (PDFs, PPTs, DOCs). The filename key comes directly from embedded image names within the parsed document. The original code passes this filename directly to image.save(filename, "PNG") and later to os.remove(filename), allowing an attacker to write (and then delete) files at arbitrary paths on the server.

Severity: High — this enables arbitrary file write with no authentication required.

Affected function: encode_images() in omniparse/utils.py, called from multiple document parsing endpoints (/pdf, /ppt, /doc) via omniparse/documents/router.py and omniparse/documents/__init__.py.

Data Flow

1. A user uploads a crafted PDF to /pdf (or a PPT to /ppt, or a DOC to /doc); no authentication is required.
2. The document parser (e.g., marker) extracts embedded images, preserving their embedded filenames.
3. encode_images() iterates over the extracted {filename: image} dict.
4. image.save(filename, "PNG") writes the image to disk using the attacker-controlled filename directly.
5. A filename like ../../../etc/cron.d/malicious writes outside the working directory.
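The traversal in the last step can be reproduced without touching the filesystem. The sketch below is illustrative only: `resolve_write_path`, `escapes_workdir`, and the `/srv/omniparse/app` working directory are hypothetical names, not part of omniparse. It resolves a filename lexically, the way a relative `image.save(filename)` call would be interpreted, and reports whether the result stays inside the working directory. Symlinks are not considered.

```python
import posixpath


def resolve_write_path(workdir: str, filename: str) -> str:
    """Return the normalized absolute path a relative write would land on."""
    # posixpath.join discards workdir entirely if filename is absolute.
    return posixpath.normpath(posixpath.join(workdir, filename))


def escapes_workdir(workdir: str, filename: str) -> bool:
    """True if writing `filename` relative to `workdir` would leave `workdir`."""
    target = resolve_write_path(workdir, filename)
    root = posixpath.normpath(workdir)
    return posixpath.commonpath([root, target]) != root


# Hypothetical working directory, three levels deep (an assumption).
WORKDIR = "/srv/omniparse/app"

print(resolve_write_path(WORKDIR, "figure_1.png"))
print(resolve_write_path(WORKDIR, "../../../etc/cron.d/malicious"))
print(escapes_workdir(WORKDIR, "../../../etc/cron.d/malicious"))
```

With a working directory three levels deep, three `..` components are enough to reach the filesystem root, which is why the example filename ends up under /etc.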

Fix Description

This PR makes two changes:

1. Temporary file for image saving: Instead of writing to the attacker-controlled path, images are saved to a tempfile.NamedTemporaryFile with a .png suffix. This ensures writes always go to the system temp directory.
2. Basename sanitization for stored name: The filename stored in the response document uses os.path.basename(filename), stripping any directory traversal components. This ensures downstream consumers see only the leaf filename.

A try/finally block ensures the temporary file is always cleaned up, even if an error occurs during processing.
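The two changes and the cleanup guarantee can be sketched as a single helper. This is a minimal illustration, not the code in the PR: `encode_image_safely` is a hypothetical name, and `image` is assumed to be any object with a PIL-style `save(path, format)` method.

```python
import base64
import os
import tempfile


def encode_image_safely(filename, image):
    """Return (safe_name, base64_png) without ever writing to `filename`.

    `image` is assumed to expose a PIL-style `save(path, format)` method.
    """
    # Keep only the leaf name for the response; it is never used as a path.
    safe_name = os.path.basename(filename)

    # delete=False so the file can be reopened by name on every platform.
    tmp = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
    tmp_path = tmp.name
    tmp.close()
    try:
        image.save(tmp_path, "PNG")
        with open(tmp_path, "rb") as fh:
            encoded = base64.b64encode(fh.read()).decode("utf-8")
    finally:
        # Runs on success and on error, so no temp file is left behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return safe_name, encoded
```

The write path comes only from `tempfile`, so the embedded filename influences nothing but the label returned to the caller.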

Proof of Concept

An attacker can craft a PDF with an embedded image whose name contains path traversal sequences. When this PDF is uploaded to the OmniParse server, the image is written to an arbitrary path:

from PIL import Image

# This is what encode_images receives after PDF parsing: the filename
# comes from the embedded image name in the document.
malicious_filename = "../../../tmp/pwned.png"
img = Image.new("RGB", (10, 10), color="red")

# Before fix: image.save("../../../tmp/pwned.png", "PNG")
#   resolves relative to the server's working directory, so it can land on
#   /tmp/pwned.png (or any other attacker-chosen path).
# After fix: image.save("/tmp/tmpXXXXXX.png", "PNG")
#   writes to a safe temporary file.

# To exploit via the HTTP API:
# 1. Craft a PDF with an image named "../../../etc/cron.d/evil"
# 2. Upload: curl -X POST http://target:8000/pdf -F "file=@malicious.pdf"
# 3. The server writes the image content to /etc/cron.d/evil

The endpoints are unauthenticated — document_router = APIRouter() at line 34 of omniparse/documents/router.py has no dependency injection for auth, and no middleware-level auth exists in the application.

Testing

The fix was tested by verifying:

- Normal image encoding still works correctly (images are base64-encoded and stored in the response document)
- Path traversal filenames are sanitized: a filename like ../../etc/passwd.png results in only passwd.png being stored as the image name
- Temporary files are created in the system temp directory, not at attacker-controlled paths
- Temporary files are cleaned up after processing (verified via the finally block)
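The sanitization check above can be written as a small table-driven test. The sketch below is an assumption about how such a check might look, not the test used for this PR; `leaf_name` is a hypothetical helper, and `posixpath` is used so the results match a POSIX server regardless of where the snippet runs.

```python
import posixpath


def leaf_name(filename: str) -> str:
    """Hypothetical helper mirroring os.path.basename() on a POSIX host."""
    return posixpath.basename(filename)


cases = {
    "../../etc/passwd.png": "passwd.png",  # traversal components stripped
    "/etc/cron.d/evil": "evil",            # absolute prefix stripped
    "figure_1.png": "figure_1.png",        # plain names unchanged
    "images/": "",                         # trailing slash yields an empty name
    "..": "..",                            # a bare ".." is returned as-is
}

for raw, expected in cases.items():
    assert leaf_name(raw) == expected, (raw, leaf_name(raw))
```

The last two cases show why the basename is safe as a display label but should not be reused as a filesystem path on its own.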

Adversarial Review

Before submitting, we attempted to disprove this finding: we checked whether any authentication middleware, request validation, or filename sanitization exists upstream of encode_images(). There is none — the FastAPI routers accept file uploads without auth (APIRouter() with no dependencies), and the document parsers (marker, python-pptx, python-docx) pass embedded image names through without sanitization. The image.save() call uses the raw filename from the parsed document, making this directly exploitable by anyone with network access to the server.

Submitted by Sebastion, autonomous open-source security research from Foundation Machines. Free for public repos via the Sebastion AI GitHub App.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Enhanced image validation to prevent unauthorized file system access during image processing.
- Improved image encoding with more robust temporary file management and cleanup procedures.

…ages Sanitize filenames from parsed documents using os.path.basename() and write to temporary files instead of using the raw filename directly. This prevents a crafted PDF/DOCX with traversal sequences in embedded image names (e.g. "../../etc/cron.d/malicious") from writing to or deleting arbitrary files on the server. CWE-22: Path Traversal

coderabbitai · 2026-05-29T11:28:35Z

Walkthrough

The encode_images function in omniparse/utils.py is hardened against path traversal attacks by extracting only the basename from image keys and managing temporary files instead of writing directly to provided paths. A tempfile import enables safe temporary file creation and cleanup.

Changes

Image Encoding Security Hardening

Layer / File(s)	Summary
Image encoding with path sanitization and temp file cleanup `omniparse/utils.py`	`encode_images` now imports `tempfile` and sanitizes image filenames via `os.path.basename` to prevent path traversal. Images are written to temporary PNG files, base64-encoded from those files, and reliably deleted in a `finally` block with existence checks, replacing the previous direct-path write-and-delete approach.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A rabbit hops through paths with care,
No traversal tricks can snare!
Temp files cleaned up, sure and neat,
Security patch, complete! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main security fix: preventing path traversal vulnerabilities in the encode_images function by sanitizing embedded filenames.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

🧹 Nitpick comments (1)

omniparse/utils.py (1)
9-28: ⚡ Quick win

Refactor encode_images to pass the PIL image directly to add_image

The path-traversal hardening is fine (safe_filename = os.path.basename(filename) is only used for image_name, not filesystem paths).

Remove the tempfile + PNG-to-base64 round-trip: responseDocument.encode_image_to_base64 always re-encodes to JPEG (image.save(..., format="JPEG", quality=85)), so you can call inputDocument.add_image(image_name=safe_filename, image_data=image) and drop the disk I/O/cleanup.

Also remove the unused enumerate index (for filename, image in images.items():), and delete tempfile/base64 imports here if they become unused.
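The disk-free alternative suggested in this comment could look roughly like the sketch below. It only illustrates in-memory encoding: `encode_in_memory` is a hypothetical name, `image` is assumed to be a PIL-style object whose `save` accepts a file object plus a `format` keyword, and whether `add_image` accepts a PIL image directly is the reviewer's claim, not something verified here.

```python
import base64
import io


def encode_in_memory(image, fmt="PNG"):
    """Base64-encode `image` through an in-memory buffer (no temp file)."""
    buffer = io.BytesIO()
    # PIL's Image.save accepts a file object when `format` is given.
    image.save(buffer, format=fmt)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
```

Because nothing is written to disk, there is no path to sanitize and no cleanup step; the basename is still needed for the stored image name.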
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@omniparse/utils.py` around lines 9 - 28, In encode_images, stop writing
images to disk and instead pass the PIL Image object directly to
inputDocument.add_image (use safe_filename = os.path.basename(filename) for
image_name); remove the tempfile+/base64 round-trip and the enumerate index
(change loop to for filename, image in images.items()), and delete unused
tempfile and base64 imports; note that responseDocument.encode_image_to_base64
will re-encode to JPEG so passing the PIL Image is sufficient.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6fbb0731-5739-4b30-b36e-70e118cbdb1c

📥 Commits

Reviewing files that changed from the base of the PR and between 9d1ae83 and 06ae84e.

📒 Files selected for processing (1)

omniparse/utils.py

coderabbitai Bot reviewed May 29, 2026

sebastiondev wants to merge 1 commit into adithya-s-k:main from sebastiondev:fix/cwe22-utils-uploaded-7293
