[BUG] Security Proposal: Add integrity & malware checks for local embeddings #2201

### Pre-check

- [x] I have searched the existing issues and none cover this bug.

### Description

I identified a supply chain security risk in `private_gpt/components/embedding/embedding_component.py`.

When `embedding_mode` is set to `huggingface` (default), the application initializes `HuggingFaceEmbedding` directly using the model name provided in `settings.yaml`.

The underlying library (`sentence-transformers`) can load model weights through `torch.load()`, which deserializes them with `pickle`. This creates a Remote Code Execution (RCE) vector: if a user (or an attacker with access to the config) points `embedding_hf_model_name` to a compromised or malicious repository on Hugging Face, PrivateGPT will download the model and execute the payload at startup.
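To illustrate why unpickling is dangerous, here is a minimal, harmless sketch using only the standard library. The `Payload` class is hypothetical, and `os.getpid` stands in for something like `os.system`; the point is only that `pickle.loads` calls whatever callable the byte stream names.

```python
import os
import pickle


class Payload:
    """Hypothetical stand-in for an object hidden inside a model file."""

    def __reduce__(self):
        # pickle records this (callable, args) pair and invokes it on load.
        # A real attack would name os.system; os.getpid is a harmless substitute.
        return (os.getpid, ())


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload; it calls os.getpid() and returns its result.
result = pickle.loads(blob)
print(type(result).__name__, result == os.getpid())
```

The loader never needs the `Payload` class to be importable, so the victim process only has to read the bytes for the call to happen.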

Furthermore, there are no checks for:
1. **Integrity:** Verifying that the downloaded file matches the official hash (protection against MITM or corrupted downloads).
2. **License Compliance:** Checking if the model has a restrictive license (e.g., CC-BY-NC) that might violate the user's usage policy.
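As a rough sketch of the integrity check in item 1, the snippet below hashes a downloaded file and compares it with an expected SHA-256 digest. The function names are illustrative, and where the trusted digest comes from (for example, registry metadata) is left open here as an assumption.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large model files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file's digest matches the expected hex digest."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.lower())
```

A check like this only helps if the expected digest is obtained over a channel the attacker does not control; it does not detect a repository that was malicious from the start.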

**Proposed Solution:**

I maintain an open-source tool called **[Veritensor](https://github.com/ArseniiBrazhnyk/Veritensor)** (Apache 2.0) designed to mitigate these exact risks. It performs static analysis on Pickle/PyTorch files and verifies hashes against the Hugging Face API.
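For readers unfamiliar with static pickle analysis, the following is a simplified sketch built on the standard-library `pickletools` module. It is not Veritensor's implementation. The allow-list and function names are assumptions, and it expects a raw pickle stream (recent PyTorch checkpoints are zip archives, so the embedded `data.pkl` would have to be extracted first).

```python
import os
import pickle
import pickletools
from typing import List, Optional

# Illustrative allow-list of top-level modules; a real policy would be stricter.
ALLOWED_MODULES = frozenset({"collections", "numpy", "torch"})
MEMO_READS = {"GET", "BINGET", "LONG_BINGET"}


def list_imports(data: bytes) -> List[str]:
    """Walk the opcodes without executing them and collect referenced globals."""
    imports: List[str] = []
    recent: List[Optional[str]] = []  # string constants in push order
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("GLOBAL", "INST"):
            imports.append(arg.replace(" ", ".", 1))  # arg is "module name"
        elif opcode.name == "STACK_GLOBAL":
            # Simplification: assume module and name were the last two strings
            # pushed; values read back from the memo are not resolved.
            pair = recent[-2:]
            if len(pair) == 2 and None not in pair:
                imports.append(f"{pair[0]}.{pair[1]}")
            else:
                imports.append("<unresolved>")
        elif opcode.name in MEMO_READS:
            recent.append(None)
        elif isinstance(arg, str):
            recent.append(arg)
    return imports


def find_suspicious(data: bytes) -> List[str]:
    """Return imports whose top-level module is not on the allow-list."""
    return [i for i in list_imports(data) if i.split(".")[0] not in ALLOWED_MODULES]


class Payload:
    def __reduce__(self):
        return (os.getpid, ())  # harmless stand-in for os.system


benign = pickle.dumps({"weights": [0.1, 0.2]})
hostile = pickle.dumps(Payload())
print(find_suspicious(benign))
print(find_suspicious(hostile))
```

Unresolved references are reported as suspicious on purpose, so the sketch errs toward rejecting a file it cannot fully interpret.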

You could integrate a pre-flight check in `EmbeddingComponent` before initializing the model.

**Suggested Fix (Pseudo-code):**

```python
# private_gpt/components/embedding/embedding_component.py
from veritensor.engines.static.pickle_engine import scan_pickle_stream

# Inside __init__, before HuggingFaceEmbedding(...)
if settings.embedding.mode == "huggingface":
    # 1. Download the model files to the local cache.
    # 2. Scan each pickle-based file with scan_pickle_stream.
    # 3. If any threats are reported, raise a security error and abort startup.
    ...
```

I've attached an example of the full code.

[privategpt_scan.py](https://github.com/user-attachments/files/25047068/privategpt_scan.py)

### Steps to Reproduce

1. Create a malicious PyTorch model (pickle bomb) that executes a command (e.g., `os.system('echo HACKED')`) upon loading.
2. Upload it to a Hugging Face repository.
3. Modify `settings.yaml`: set `huggingface.embedding_hf_model_name` to your malicious repository.
4. Run `python -m private_gpt`.

### Expected Behavior

The application should verify the model before loading it into memory:

1. It should check the file hash against the upstream registry to ensure integrity.
2. It should scan the file structure for malicious bytecode (Pickle RCE).
3. Ideally, it should warn about restrictive licenses.

If the model is malicious or tampered with, the application should refuse to start and log a security error.
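One possible shape for the refuse-to-start behavior is sketched below. Everything here is hypothetical (`PreflightReport`, `ModelSecurityError`, `enforce`, the logger name, and the license list); it only shows how results of earlier checks could be turned into a warning or a hard failure.

```python
import logging
from dataclasses import dataclass, field
from typing import List, Optional

logger = logging.getLogger("private_gpt.security")  # hypothetical logger name

# Hypothetical policy: license identifiers that should trigger a warning.
RESTRICTIVE_LICENSES = frozenset({"cc-by-nc-4.0", "cc-by-nc-sa-4.0"})


class ModelSecurityError(RuntimeError):
    """Raised when a model fails a pre-load check and startup must stop."""


@dataclass
class PreflightReport:
    model_name: str
    hash_ok: bool
    threats: List[str] = field(default_factory=list)
    license_id: Optional[str] = None


def enforce(report: PreflightReport) -> None:
    """Warn on restrictive licenses; raise on hash mismatch or detected threats."""
    if report.license_id in RESTRICTIVE_LICENSES:
        logger.warning("Model %s uses restrictive license %s",
                       report.model_name, report.license_id)
    if not report.hash_ok:
        logger.error("Hash mismatch for model %s", report.model_name)
        raise ModelSecurityError(f"hash mismatch for {report.model_name}")
    if report.threats:
        logger.error("Threats found in model %s: %s",
                     report.model_name, ", ".join(report.threats))
        raise ModelSecurityError(f"malicious content in {report.model_name}")
```

Calling `enforce(...)` before the embedding model is constructed would let the exception propagate and stop startup, while license findings stay advisory.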

### Actual Behavior

The application downloads the model and passes it directly to `HuggingFaceEmbedding`. The malicious pickle payload is deserialized, and the arbitrary code is executed on the host machine.

### Environment

- OS: Linux/Windows/macOS (the issue is platform-independent)
- Installation: Source / Docker

### Additional Information

_No response_

### Version

_No response_

### Setup Checklist

- [x] Confirm that you have followed the installation instructions in the project’s documentation.
- [x] Check that you are using the latest version of the project.
- [x] Verify disk space availability for model storage and data processing.
- [x] Ensure that you have the necessary permissions to run the project.

### NVIDIA GPU Setup Checklist

- [x] Check that all the CUDA dependencies are installed and are compatible with your GPU (refer to [CUDA's documentation](https://docs.nvidia.com/deploy/cuda-compatibility/#frequently-asked-questions))
- [x] Ensure an NVIDIA GPU is installed and recognized by the system (run `nvidia-smi` to verify).
- [x] Ensure proper permissions are set for accessing GPU resources.
- [x] Docker users - Verify that the NVIDIA Container Toolkit is configured correctly (e.g. run `sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi`)
