@qued qued commented Feb 5, 2025

Summary

A recent security review showed that partitioning a document could pull in arbitrary local files when the file type supports an "include" feature that brings in the content of files external to the partitioned file. This affects rst and org files.

Fix

This PR fixes the above issue by passing the parameter sandbox=True in all cases where pypandoc.convert_file is called.

Note that I also added the parameter to a call to this method in the ODT code. I haven't investigated whether ODT files had a security issue, but it seems better to use pandoc in sandbox mode given the security issues we know about.
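
For illustration, here is a minimal sketch of how the sandbox flag could be enforced at a single call site. The helper name `convert_in_sandbox` is hypothetical and is not taken from this PR; it assumes only that the converter passed in accepts a `sandbox` keyword, as `pypandoc.convert_file` does according to the description above.

```python
from typing import Any, Callable


def convert_in_sandbox(
    convert_file: Callable[..., Any],
    source_path: str,
    target_format: str,
    **kwargs: Any,
) -> Any:
    """Call a pypandoc-style ``convert_file`` with ``sandbox=True`` enforced.

    ``convert_file`` is injected so the helper can be exercised without
    pandoc installed; in real use it would be ``pypandoc.convert_file``.
    """
    # Overwrite any caller-supplied value so sandboxing cannot be switched off.
    kwargs["sandbox"] = True
    return convert_file(source_path, target_format, **kwargs)


# Hypothetical usage (requires pypandoc and pandoc to be installed):
#   import pypandoc
#   html = convert_in_sandbox(pypandoc.convert_file, "doc.rst", "html")
```

Routing every conversion through one wrapper like this is one way to avoid a future call site forgetting the flag; the PR itself simply adds the parameter at each existing call.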

Testing

To verify that the tests added in this PR catch the issue:

  • Remove the sandbox=True text from unstructured/file_utils/file_conversion.py line 17.
  • Run the tests test_unstructured.partition.test_rst.test_rst_wont_include_external_files and test_unstructured.partition.test_org.test_org_wont_include_external_files. Both should fail because the partitioned output contains the word "wombat", which only appears in a file external to the partitioned file.
  • Add the parameter back in, and both tests pass.
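
The shape of such a check can be sketched as follows. This is not the test code from the PR: the fixture layout, the file names, and the helpers `write_include_fixture` and `leaks_external_content` are assumptions made for illustration. The text extractor is passed in as a callable so that the sketch does not depend on any particular partitioning API.

```python
import tempfile
from pathlib import Path
from typing import Callable

SECRET_WORD = "wombat"


def write_include_fixture(directory: Path) -> Path:
    """Create an rst file that includes a sibling file holding the secret word."""
    external = directory / "external.txt"
    external.write_text(f"The secret animal is a {SECRET_WORD}.\n")
    main = directory / "main.rst"
    # The secret word appears only in external.txt, never in main.rst itself.
    main.write_text(
        "Title\n=====\n\nSome local text.\n\n.. include:: external.txt\n"
    )
    return main


def leaks_external_content(
    extract_text: Callable[[str], str], rst_path: Path
) -> bool:
    """Return True if the extracted text contains content from the included file."""
    return SECRET_WORD in extract_text(str(rst_path))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rst_file = write_include_fixture(Path(tmp))
        # Stand-in extractor that reads the file without resolving includes.
        print(leaks_external_content(lambda p: Path(p).read_text(), rst_file))
```

In a real test, `extract_text` would wrap the partitioning call and join the text of the returned elements, and the assertion would be that `leaks_external_content` returns False.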

@qued qued requested review from amanda103 and scanny February 5, 2025 21:33
@amanda103 amanda103 left a comment

lgtm! 🥳

@qued qued enabled auto-merge February 6, 2025 03:00
@qued qued added this pull request to the merge queue Feb 6, 2025
Merged via the queue into main with commit b10379c Feb 6, 2025
41 checks passed
@qued qued deleted the fix/partition-system-files-via-include branch February 6, 2025 03:55
temp-adelyn pushed a commit to temp-adelyn/unstructured that referenced this pull request Mar 3, 2025
…ured-IO#3908)