This is the official repository for the paper *Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation*.
FedE4RAG addresses data scarcity and privacy challenges in private RAG systems. It uses federated learning (FL) to collaboratively train client-side RAG retrieval models, keeping raw data localized. The framework employs knowledge distillation for effective server-client communication and homomorphic encryption to enhance parameter privacy. FedE4RAG aims to boost the performance of localized RAG retrievers by leveraging diverse client insights securely, balancing data utility and confidentiality, particularly demonstrated in sensitive domains like finance.
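The core idea of collaboratively training client-side retrievers while keeping raw data local can be illustrated with a minimal FedAvg-style sketch. This is a conceptual illustration only, not the FedE4RAG implementation: the actual framework adds knowledge distillation for server-client communication and homomorphic encryption of the exchanged parameters, which are omitted here.

```python
# Conceptual sketch (NOT the FedE4RAG code): FedAvg-style aggregation of
# client retriever parameters. Only model updates leave each client;
# the raw training data stays local.
import numpy as np

def local_update(weights, grads, lr=0.1):
    """One simulated local training step on a client's private data."""
    return weights - lr * grads

def fedavg(client_weights, client_sizes):
    """Server-side average of client parameters, weighted by data size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two clients with private gradients (placeholder values).
global_w = np.zeros(4)
clients = [(np.array([1.0, 0.0, 0.0, 0.0]), 100),
           (np.array([0.0, 1.0, 0.0, 0.0]), 300)]
updated = [local_update(global_w, g) for g, _ in clients]
sizes = [n for _, n in clients]
new_global = fedavg(updated, sizes)
print(new_global)  # the larger client contributes more to the average
```

In the real framework, the parameters exchanged in `fedavg` would be encrypted homomorphically, so the server can aggregate them without seeing plaintext values.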
Run the commands below to install the required environment:

```shell
cd FedE
pip install -r requirements.txt
```
Create a virtual environment via conda (recommended), then install the dependencies with pip inside it (`conda install` does not accept a pip-style requirements file):

```shell
conda create -n Fedrag-test python=3.11
conda activate Fedrag-test
pip install -r requirements.txt
pip install openai==1.55.3
```
Install via pip:

```shell
pip install -r requirements.txt
pip install openai==1.55.3
pip install jury --no-deps
```
We provide all datasets used in our experiments:
- All datasets are available at DocAILab/FedE4RAG_Dataset on Hugging Face.
- The training data is the `train_data` subset of DocAILab/FedE4RAG_Dataset.
Adjust the model training hyperparameters in `FedE/main.py`.
Select the appropriate training data and copy it to `FedE/select_data.json`.
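As a hypothetical illustration of the data-selection step, the snippet below writes a chosen subset of examples to `FedE/select_data.json`. The field names (`query`, `positive`) are placeholders; consult the repository's training data for the actual JSON schema.

```python
# Hypothetical illustration: the actual schema of select_data.json is
# defined by the repo; the fields below are placeholder assumptions.
import json
import os

selected = [
    {"query": "What is the coupon rate of the bond?",
     "positive": "The bond carries a coupon rate of 5% per annum."},
]

os.makedirs("FedE", exist_ok=True)  # ensure the target directory exists
with open("FedE/select_data.json", "w", encoding="utf-8") as f:
    json.dump(selected, f, ensure_ascii=False, indent=2)
```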
Generate the fine-tuned model by executing the following shell script. (Before running, change the `data_path` argument in the script and code as needed.)

```shell
cd ./FedE/
bash run.sh
```
- The `bash.sh` and `bash1.sh` files provide scripts for directly evaluating your model. You can use them by correctly filling in the path to your model within the scripts. The difference between them is that `bash1.sh` additionally includes tests for the model's generation capabilities.
- The `main_100_test.py`, `main_50_test.py`, and `response.py` files are the specific evaluation files. You can customize the evaluation metrics and output files you need within them.
This project draws inspiration from and incorporates code elements of the FLGo project (https://github.com/WwZzz/easyFL). We are grateful for the contributions and insights provided by the FLGo development team, which have been instrumental in advancing our project's development in the federated learning domain.