fix(utils): prevent path traversal in encode_images via embedded image filenames#115
fix(utils): prevent path traversal in encode_images via embedded image filenames#115sebastiondev wants to merge 1 commit into
Conversation
…ages Sanitize filenames from parsed documents using os.path.basename() and write to temporary files instead of using the raw filename directly. This prevents a crafted PDF/DOCX with traversal sequences in embedded image names (e.g. "../../etc/cron.d/malicious") from writing to or deleting arbitrary files on the server. CWE-22: Path Traversal
📝 WalkthroughWalkthroughThe ChangesImage Encoding Security Hardening
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
omniparse/utils.py (1)
9-28: ⚡ Quick winRefactor
encode_imagesto pass the PIL image directly toadd_image
- The path-traversal hardening is fine (
safe_filename = os.path.basename(filename)is only used forimage_name, not filesystem paths).- Remove the
tempfile+ PNG-to-base64 round-trip:responseDocument.encode_image_to_base64always re-encodes to JPEG (image.save(..., format="JPEG", quality=85)), so you can callinputDocument.add_image(image_name=safe_filename, image_data=image)and drop the disk I/O/cleanup.- Also remove the unused
enumerateindex (for filename, image in images.items():), and deletetempfile/base64imports here if they become unused.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@omniparse/utils.py` around lines 9 - 28, In encode_images, stop writing images to disk and instead pass the PIL Image object directly to inputDocument.add_image (use safe_filename = os.path.basename(filename) for image_name); remove the tempfile+/base64 round-trip and the enumerate index (change loop to for filename, image in images.items()), and delete unused tempfile and base64 imports; note that responseDocument.encode_image_to_base64 will re-encode to JPEG so passing the PIL Image is sufficient.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@omniparse/utils.py`:
- Around line 9-28: In encode_images, stop writing images to disk and instead
pass the PIL Image object directly to inputDocument.add_image (use safe_filename
= os.path.basename(filename) for image_name); remove the tempfile+/base64
round-trip and the enumerate index (change loop to for filename, image in
images.items()), and delete unused tempfile and base64 imports; note that
responseDocument.encode_image_to_base64 will re-encode to JPEG so passing the
PIL Image is sufficient.
Vulnerability Summary
CWE-22 (Path Traversal) — Arbitrary file write via malicious embedded image filenames in
omniparse/utils.pyThe
encode_images()function inomniparse/utils.pyreceives a dictionary of{filename: image}pairs extracted from uploaded documents (PDFs, PPTs, DOCs). Thefilenamekey comes directly from embedded image names within the parsed document. The original code passes this filename directly toimage.save(filename, "PNG")and later toos.remove(filename), allowing an attacker to write (and then delete) files at arbitrary paths on the server.Severity: High — this enables arbitrary file write with no authentication required.
Affected function:
encode_images()inomniparse/utils.py, called from multiple document parsing endpoints (/pdf,/ppt,/doc) viaomniparse/documents/router.pyandomniparse/documents/__init__.py.Data Flow
/pdf(or PPT to/ppt, DOC to/doc) — no authentication requiredmarker) extracts embedded images, preserving their embedded filenamesencode_images()iterates over the extracted{filename: image}dictimage.save(filename, "PNG")writes the image to disk using the attacker-controlled filename directly../../../etc/cron.d/maliciouswrites outside the working directoryFix Description
This PR makes two changes:
Temporary file for image saving: Instead of writing to the attacker-controlled path, images are saved to a
tempfile.NamedTemporaryFilewith a.pngsuffix. This ensures writes always go to the system temp directory.Basename sanitization for stored name: The filename stored in the response document uses
os.path.basename(filename), stripping any directory traversal components. This ensures downstream consumers see only the leaf filename.A
try/finallyblock ensures the temporary file is always cleaned up, even if an error occurs during processing.Proof of Concept
An attacker can craft a PDF with an embedded image whose name contains path traversal sequences. When this PDF is uploaded to the OmniParse server, the image is written to an arbitrary path:
The endpoints are unauthenticated —
document_router = APIRouter()at line 34 ofomniparse/documents/router.pyhas no dependency injection for auth, and no middleware-level auth exists in the application.Testing
The fix was tested by verifying:
../../etc/passwd.pngresults in onlypasswd.pngbeing stored as the image namefinallyblock)Adversarial Review
Before submitting, we attempted to disprove this finding: we checked whether any authentication middleware, request validation, or filename sanitization exists upstream of
encode_images(). There is none — the FastAPI routers accept file uploads without auth (APIRouter()with nodependencies), and the document parsers (marker, python-pptx, python-docx) pass embedded image names through without sanitization. Theimage.save()call uses the raw filename from the parsed document, making this directly exploitable by anyone with network access to the server.Submitted by Sebastion — autonomous open-source security research from Foundation Machines. Free for public repos via the Sebastion AI GitHub App.
Summary by CodeRabbit
Release Notes