[BUG] Security Proposal: Add integrity & malware checks for local embeddings #2201

@arsbr

Pre-check

  • I have searched the existing issues and none cover this bug.

Description

I identified a supply chain security risk in private_gpt/components/embedding/embedding_component.py.

When embedding_mode is set to huggingface (default), the application initializes HuggingFaceEmbedding directly using the model name provided in settings.yaml.

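The underlying library (sentence-transformers) can load weights through torch.load(), which uses pickle, when a repository ships pickle-based weight files (for example pytorch_model.bin) rather than safetensors. Unpickling untrusted data can run arbitrary code, so this creates a Remote Code Execution (RCE) vector. If a user (or an attacker with access to the config) points embedding_hf_model_name to a compromised or malicious repository on Hugging Face, PrivateGPT will download the model and execute the payload as soon as it loads the model at startup.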

Furthermore, there are no checks for:

  1. Integrity: Verifying that the downloaded file matches the official hash (protection against MITM or corrupted downloads).
  2. License Compliance: Checking if the model has a restrictive license (e.g., CC-BY-NC) that might violate the user's usage policy.
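The integrity check in point 1 could be sketched as follows. This is a minimal illustration using only the standard library, not code from PrivateGPT or Veritensor: the function names `sha256_of_file` and `verify_model_file` are hypothetical, and the expected digest is assumed to come from a trusted source (for instance a value pinned in configuration or published by the upstream registry).

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large model files are not held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file does not match the expected digest.

    `expected_sha256` is assumed to come from a trusted source; this
    sketch does not fetch it from any registry.
    """
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"Hash mismatch for {path.name}: expected {expected_sha256}, got {actual}"
        )
```

A caller would run `verify_model_file(...)` on each downloaded weight file before handing the model directory to the embedding loader, and treat the exception as a refusal to start.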

Proposed Solution:

I maintain an open-source tool called Veritensor (Apache 2.0) designed to mitigate these exact risks. It performs static analysis on Pickle/PyTorch files and verifies hashes against the Hugging Face API.

You could integrate a pre-flight check in EmbeddingComponent before initializing the model.

Suggested Fix (Pseudo-code):

# private_gpt/components/embedding/embedding_component.py
from veritensor.engines.static.pickle_engine import scan_pickle_stream

# Inside __init__, before HuggingFaceEmbedding(...)
if settings.embedding.mode == "huggingface":
    # 1. Download file to cache
    # 2. Scan file
    # if threats: raise SecurityError(...)

I've attached an example of the full code.

[privategpt_scan.py](https://github.com/user-attachments/files/25047068/privategpt_scan.py)
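To illustrate what "static analysis on Pickle files" can mean, here is a simplified sketch built on the standard library's `pickletools`. It is not Veritensor's implementation and does not use its API; the function name `find_suspicious_globals` and the module denylist are assumptions made for this example. It walks the opcode stream without unpickling anything and reports imports of modules commonly abused in payloads. A denylist like this is easy to bypass, so treat it as a demonstration of the idea rather than a real defence. Recent PyTorch checkpoints are zip archives with the pickle stored inside, so the pickle member would need to be extracted before scanning.

```python
import os
import pickle
import pickletools

# Illustrative, non-exhaustive denylist of top-level modules.
SUSPICIOUS_MODULES = {
    "os", "posix", "nt", "subprocess", "builtins", "__builtin__",
    "sys", "socket", "shutil",
}
STRING_OPCODES = {"SHORT_BINUNICODE", "BINUNICODE", "BINUNICODE8", "UNICODE"}


def find_suspicious_globals(data: bytes) -> list[str]:
    """Scan pickle opcodes without executing them; return 'module.name' hits."""
    findings = []
    recent_strings = []  # string arguments seen so far, used for STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in STRING_OPCODES:
            recent_strings.append(arg)
            continue
        if opcode.name in ("GLOBAL", "INST"):
            # pickletools reports these arguments as "module name".
            module, _, name = arg.partition(" ")
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            # Protocol 4+ pushes module and name as strings before this opcode.
            module, name = recent_strings[-2], recent_strings[-1]
        else:
            continue
        if module.split(".")[0] in SUSPICIOUS_MODULES:
            findings.append(f"{module}.{name}")
    return findings


if __name__ == "__main__":
    class Payload:
        """Stand-in for the payload in the reproduction steps; never loaded here."""
        def __reduce__(self):
            return (os.system, ("echo HACKED",))

    # pickle.dumps only serializes; the command would run on pickle.loads.
    print(find_suspicious_globals(pickle.dumps(Payload())))
```

The key design point is that `pickletools.genops` only decodes opcodes, so the scan itself never triggers the payload.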

### Steps to Reproduce

1. Create a malicious PyTorch model (pickle bomb) that executes a command (e.g., `os.system('echo HACKED')`) upon loading.
2. Upload it to a Hugging Face repository.
3. Modify `settings.yaml`: set `huggingface.embedding_hf_model_name` to your malicious repository.
4. Run `python -m private_gpt`.

### Expected Behavior

The application should verify the model before loading it into memory:

1. It should check the file hash against the upstream registry to ensure integrity.
2. It should scan the file structure for malicious bytecode (Pickle RCE).
3. It should ideally warn about restrictive licenses.

If the model is malicious or tampered with, the application should refuse to start and log a security error.
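The license warning could start from the model's metadata tags. The sketch below assumes the tags follow a `license:<id>` convention (as model tags on the Hugging Face Hub generally do) and that the caller has already fetched the tag list; the function names and the substring heuristic for "non-commercial" are hypothetical and only cover the CC-BY-NC style case mentioned in this issue.

```python
from typing import Iterable, Optional


def license_from_tags(tags: Iterable[str]) -> Optional[str]:
    """Return the license id from tags shaped like 'license:<id>', if any."""
    for tag in tags:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1].lower()
    return None


def is_non_commercial(license_id: Optional[str]) -> bool:
    """Heuristic: flag Creative Commons style '-nc' (non-commercial) licenses."""
    return license_id is not None and "-nc" in license_id


def license_warning(tags: Iterable[str]) -> Optional[str]:
    """Return a warning message for missing or non-commercial licenses."""
    license_id = license_from_tags(tags)
    if license_id is None:
        return "Model declares no license; usage rights are unclear."
    if is_non_commercial(license_id):
        return f"Model license '{license_id}' restricts commercial use."
    return None
```

Returning a message instead of raising keeps this check advisory, which matches the "ideally warn" wording rather than the hard failure expected for tampered or malicious files.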

### Actual Behavior

The application downloads the model and passes it directly to `HuggingFaceEmbedding`. The malicious pickle payload is deserialized, and the arbitrary code is executed on the host machine.

### Environment

- OS: Linux/Windows/macOS (issue is platform-independent)
- Installation: Source / Docker

### Additional Information

_No response_

### Version

_No response_

### Setup Checklist

- [x] Confirm that you have followed the installation instructions in the project's documentation.
- [x] Check that you are using the latest version of the project.
- [x] Verify disk space availability for model storage and data processing.
- [x] Ensure that you have the necessary permissions to run the project.

### NVIDIA GPU Setup Checklist

- [x] Check that all the CUDA dependencies are installed and are compatible with your GPU (refer to [CUDA's documentation](https://docs.nvidia.com/deploy/cuda-compatibility/#frequently-asked-questions))
- [x] Ensure an NVIDIA GPU is installed and recognized by the system (run `nvidia-smi` to verify).
- [x] Ensure proper permissions are set for accessing GPU resources.
- [x] Docker users - Verify that the NVIDIA Container Toolkit is configured correctly (e.g. run `sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi`)
