fix(utils): prevent path traversal in encode_images via embedded image filenames #115

Open
sebastiondev wants to merge 1 commit into adithya-s-k:main from sebastiondev:fix/cwe22-utils-uploaded-7293

@sebastiondev sebastiondev commented May 29, 2026

Vulnerability Summary

CWE-22 (Path Traversal) — Arbitrary file write via malicious embedded image filenames in omniparse/utils.py

The encode_images() function in omniparse/utils.py receives a dictionary of {filename: image} pairs extracted from uploaded documents (PDFs, PPTs, DOCs). The filename key comes directly from embedded image names within the parsed document. The original code passes this filename directly to image.save(filename, "PNG") and later to os.remove(filename), allowing an attacker to write (and then delete) files at arbitrary paths on the server.

Severity: High — this enables arbitrary file write with no authentication required.

Affected function: encode_images() in omniparse/utils.py, called from multiple document parsing endpoints (/pdf, /ppt, /doc) via omniparse/documents/router.py and omniparse/documents/__init__.py.

Data Flow

  1. User uploads a crafted PDF to /pdf (or PPT to /ppt, DOC to /doc) — no authentication required
  2. The document parser (e.g., marker) extracts embedded images, preserving their embedded filenames
  3. encode_images() iterates over the extracted {filename: image} dict
  4. image.save(filename, "PNG") writes the image to disk using the attacker-controlled filename directly
  5. A filename like ../../../etc/cron.d/malicious writes outside the working directory
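To make step 5 concrete, here is a minimal sketch of how a relative filename with traversal sequences resolves against a working directory. The directory `/srv/app` and the helper names `resolve` and `escapes` are assumptions made for illustration; the PR does not state the server's actual working directory. The sketch uses `posixpath` so the result is the same on any platform, and it resolves paths lexically only (symlinks are not followed).

```python
import posixpath

# Hypothetical working directory; the real server's cwd is not stated in the PR.
WORKDIR = "/srv/app"


def resolve(workdir: str, filename: str) -> str:
    """Lexically resolve a filename against workdir, collapsing any '..' parts."""
    return posixpath.normpath(posixpath.join(workdir, filename))


def escapes(workdir: str, filename: str) -> bool:
    """Return True when the resolved path lands outside workdir."""
    resolved = resolve(workdir, filename)
    return posixpath.commonpath([workdir, resolved]) != workdir


attack = "../../../etc/cron.d/malicious"
print(resolve(WORKDIR, attack))           # /etc/cron.d/malicious
print(escapes(WORKDIR, attack))           # True
print(escapes(WORKDIR, "figure_1.png"))   # False
```

An absolute filename such as `/etc/passwd` escapes in the same way, because `posixpath.join` discards the working directory when the second component is absolute.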

Fix Description

This PR makes two changes:

  1. Temporary file for image saving: Instead of writing to the attacker-controlled path, images are saved to a tempfile.NamedTemporaryFile with a .png suffix. This ensures writes always go to the system temp directory.

  2. Basename sanitization for stored name: The filename stored in the response document uses os.path.basename(filename), stripping any directory traversal components. This ensures downstream consumers see only the leaf filename.

A try/finally block ensures the temporary file is always cleaned up, even if an error occurs during processing.
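The two changes and the cleanup guarantee can be sketched roughly as follows. This is not the code in the PR's diff, which is not reproduced here. `FakeImage` is a hypothetical stand-in for a PIL image so the sketch runs with the standard library alone, and `encode_images_sketch` returns a plain dict rather than populating the project's response document.

```python
import base64
import os
import tempfile


class FakeImage:
    """Hypothetical stand-in for a PIL image; only save(path, fmt) is modeled."""

    def __init__(self, payload: bytes):
        self.payload = payload

    def save(self, path, fmt):
        with open(path, "wb") as fh:
            fh.write(self.payload)


def encode_images_sketch(images):
    """Return {leaf_name: base64_text}, never writing to a caller-chosen path."""
    encoded = {}
    for filename, image in images.items():
        # Change 2: keep only the leaf name for anything reported downstream.
        safe_name = os.path.basename(filename)
        # Change 1: write to a temp file whose location the caller cannot influence.
        tmp = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
        tmp_path = tmp.name
        tmp.close()  # close first so save() can reopen the path on every platform
        try:
            image.save(tmp_path, "PNG")
            with open(tmp_path, "rb") as fh:
                encoded[safe_name] = base64.b64encode(fh.read()).decode("utf-8")
        finally:
            # Runs on success and on error, so the temp file never lingers.
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
    return encoded


result = encode_images_sketch({"../../../tmp/pwned.png": FakeImage(b"abc")})
print(list(result))  # ['pwned.png']
```

Two caveats apply to this sketch: distinct embedded names that share a leaf (for example `a/x.png` and `b/x.png`) collapse to the same key, and on POSIX systems `os.path.basename` splits only on `/`, so backslash-separated components are left in the name.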

Proof of Concept

An attacker can craft a PDF with an embedded image whose name contains path traversal sequences. When this PDF is uploaded to the OmniParse server, the image is written to an arbitrary path:

from PIL import Image

# This is what encode_images receives after PDF parsing — the filename
# comes from the embedded image name in the document:
malicious_filename = "../../../tmp/pwned.png"
img = Image.new("RGB", (10, 10), color="red")

# Before fix: image.save("../../../tmp/pwned.png", "PNG")
# writes to /tmp/pwned.png (or any attacker-chosen path)
# After fix: image.save("/tmp/tmpXXXXXX.png", "PNG")
# writes to a safe temporary file

# To exploit via the HTTP API:
# 1. Craft a PDF with an image named "../../../etc/cron.d/evil"
# 2. Upload: curl -X POST http://target:8000/pdf -F "file=@malicious.pdf"
# 3. The server writes the image content to /etc/cron.d/evil

The endpoints are unauthenticated — document_router = APIRouter() at line 34 of omniparse/documents/router.py has no dependency injection for auth, and no middleware-level auth exists in the application.

Testing

The fix was tested by verifying:

  • Normal image encoding still works correctly (images are base64-encoded and stored in the response document)
  • Path traversal filenames are sanitized — a filename like ../../etc/passwd.png results in only passwd.png being stored as the image name
  • Temporary files are created in the system temp directory, not at attacker-controlled paths
  • Temporary files are cleaned up after processing (verified via the finally block)
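The PR does not include the test code itself. One way the sanitization, location, and cleanup properties listed above could be checked is sketched below, using only the standard library. The helper name `check_fix_properties` is hypothetical, and the sketch exercises the building blocks (`os.path.basename`, `tempfile.NamedTemporaryFile`) rather than the project's `encode_images()`.

```python
import os
import tempfile


def check_fix_properties() -> dict:
    """Exercise the sanitization and temp-file properties the fix relies on."""
    tmp = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
    path = tmp.name
    tmp.close()
    try:
        # The temp file should live in the system temp directory.
        in_temp_dir = os.path.realpath(os.path.dirname(path)) == os.path.realpath(
            tempfile.gettempdir()
        )
        has_suffix = path.endswith(".png")
    finally:
        # Mirror the fix: always remove the temp file.
        if os.path.exists(path):
            os.remove(path)
    return {
        "leaf": os.path.basename("../../etc/passwd.png"),
        "in_temp_dir": in_temp_dir,
        "has_suffix": has_suffix,
        "cleaned_up": not os.path.exists(path),
    }


print(check_fix_properties())
```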

Adversarial Review

Before submitting, we attempted to disprove this finding: we checked whether any authentication middleware, request validation, or filename sanitization exists upstream of encode_images(). There is none — the FastAPI routers accept file uploads without auth (APIRouter() with no dependencies), and the document parsers (marker, python-pptx, python-docx) pass embedded image names through without sanitization. The image.save() call uses the raw filename from the parsed document, making this directly exploitable by anyone with network access to the server.


Submitted by Sebastion — autonomous open-source security research from Foundation Machines. Free for public repos via the Sebastion AI GitHub App.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Enhanced image validation to prevent unauthorized file system access during image processing.
    • Improved image encoding with more robust temporary file management and cleanup procedures.

…ages

Sanitize filenames from parsed documents using os.path.basename() and
write to temporary files instead of using the raw filename directly.
This prevents a crafted PDF/DOCX with traversal sequences in embedded
image names (e.g. "../../etc/cron.d/malicious") from writing to or
deleting arbitrary files on the server.

CWE-22: Path Traversal

coderabbitai Bot commented May 29, 2026

Walkthrough

The encode_images function in omniparse/utils.py is hardened against path traversal attacks by extracting only the basename from image keys and managing temporary files instead of writing directly to provided paths. A tempfile import enables safe temporary file creation and cleanup.

Changes

Image Encoding Security Hardening

| Layer / File(s) | Summary |
| --- | --- |
| Image encoding with path sanitization and temp file cleanup (omniparse/utils.py) | encode_images now imports tempfile and sanitizes image filenames via os.path.basename to prevent path traversal. Images are written to temporary PNG files, base64-encoded from those files, and reliably deleted in a finally block with existence checks, replacing the previous direct-path write-and-delete approach. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A rabbit hops through paths with care,
No traversal tricks can snare!
Temp files cleaned up, sure and neat,
Security patch, complete! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main security fix: preventing path traversal vulnerabilities in the encode_images function by sanitizing embedded filenames. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
omniparse/utils.py (1)

9-28: ⚡ Quick win

Refactor encode_images to pass the PIL image directly to add_image

  • The path-traversal hardening is fine (safe_filename = os.path.basename(filename) is only used for image_name, not filesystem paths).
  • Remove the tempfile + PNG-to-base64 round-trip: responseDocument.encode_image_to_base64 always re-encodes to JPEG (image.save(..., format="JPEG", quality=85)), so you can call inputDocument.add_image(image_name=safe_filename, image_data=image) and drop the disk I/O/cleanup.
  • Also remove the unused enumerate index (for filename, image in images.items():), and delete tempfile/base64 imports here if they become unused.
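The in-memory alternative the reviewer describes could look roughly like the sketch below. The body of `responseDocument.encode_image_to_base64` is not shown in this PR, so this is an assumption based only on the reviewer's description (`image.save(..., format="JPEG", quality=85)`). `StandInImage` and `encode_in_memory` are hypothetical names, and the stand-in mimics only the `save(fp, format=..., **params)` shape so the sketch runs without Pillow.

```python
import base64
import io


class StandInImage:
    """Hypothetical stand-in for a PIL image; writes fixed bytes to a file object."""

    def __init__(self, payload: bytes):
        self.payload = payload

    def save(self, fp, format=None, **params):
        fp.write(self.payload)


def encode_in_memory(image, fmt: str = "JPEG", **params) -> str:
    """Serialize an image into a BytesIO buffer and base64-encode it, with no disk I/O."""
    buffer = io.BytesIO()
    image.save(buffer, format=fmt, **params)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


print(encode_in_memory(StandInImage(b"abc"), quality=85))  # YWJj
```

With no file written, there is no path for an embedded filename to influence and nothing to clean up. The sanitized leaf name is then needed only as a label.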
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@omniparse/utils.py` around lines 9 - 28, In encode_images, stop writing
images to disk and instead pass the PIL Image object directly to
inputDocument.add_image (use safe_filename = os.path.basename(filename) for
image_name); remove the tempfile+/base64 round-trip and the enumerate index
(change loop to for filename, image in images.items()), and delete unused
tempfile and base64 imports; note that responseDocument.encode_image_to_base64
will re-encode to JPEG so passing the PIL Image is sufficient.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6fbb0731-5739-4b30-b36e-70e118cbdb1c

📥 Commits

Reviewing files that changed from the base of the PR and between 9d1ae83 and 06ae84e.

📒 Files selected for processing (1)
  • omniparse/utils.py
