Commit 6c5a35e

Merge pull request #283 from Hanyuan9766/proposal/203
OSPP-implemention-Proposal-PIPL-Compliant-Cloud-Edge-Collaborative-Privacy-Preserving-Prompt-Processing-Framework-203
2 parents f49c355 + 8150672 commit 6c5a35e

19 files changed

Lines changed: 6027 additions & 0 deletions

Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
# PIPL-Compliant Cloud-Edge Collaborative Privacy-Preserving Prompt Processing Framework

This example implements a PIPL-compliant cloud-edge collaborative privacy-preserving LLM inference workflow validated with the ChnSentiCorp-Lite dataset, including:

- Edge-first inference with hard sample mining
- Adaptive privacy desensitization (regex, NER masking, differential privacy)
- Privacy and performance metrics visualization
- Zero raw text cross-border transmission
- Real-time PIPL compliance verification and audit logging

## Directory Structure

```
edge-cloud_collaborative_learning_bench/
├── benchmarkingjob.yaml                # Benchmarking job configuration
├── README.md                           # Project documentation
├── requirements.txt                    # Python dependencies
├── test_algorithms/                    # Test algorithms directory
│   ├── algorithm.yaml                  # Algorithm configuration
│   ├── privacy_preserving_llm/         # Privacy-preserving LLM main module
│   │   ├── __init__.py
│   │   └── privacy_preserving_llm.py
│   ├── privacy_detection/              # Privacy detection module
│   │   ├── __init__.py
│   │   ├── pipl_classifier.py
│   │   ├── pii_detector.py
│   │   └── risk_evaluator.py
│   └── privacy_encryption/             # Privacy encryption module
│       ├── __init__.py
│       ├── differential_privacy.py
│       ├── saliency_masking.py
│       ├── dimensionality_reduction.py
│       └── compliance_monitor.py
└── testenv/                            # Test environment directory
    ├── testenv.yaml                    # Test environment configuration
    ├── privacy_metrics.py              # Privacy evaluation metrics
    └── performance_metrics.py          # Performance evaluation metrics
```

## Project Background

Large language models (LLMs) are increasingly served through cloud-edge collaborative inference, in which inference work is split between edge devices and cloud servers to balance performance and resource use. This split creates privacy risks when user prompts contain sensitive personal information. China's Personal Information Protection Law (PIPL) imposes strict requirements on cross-border data transmission, including the "minimal necessity" and "security assurance" principles, so organizations need privacy-preserving solutions that satisfy the law without sacrificing inference quality. Traditional approaches often transmit raw text across borders, which creates both privacy and compliance risks. This project addresses the tension between privacy protection and inference utility in cloud-edge collaborative LLM systems, particularly where data must be processed across borders.

## Problems Solved

This project addresses three problems in cloud-edge collaborative LLM inference:

1. **Privacy leakage from raw text transmission.** A zero raw text transmission architecture converts sensitive prompts into anonymized feature vectors before any cross-border transfer occurs.
2. **PIPL compliance.** The framework follows Articles 38-40 of PIPL through minimal necessity checks, privacy budget management, and real-time compliance verification.
3. **The privacy-utility trade-off.** Adaptive privacy desensitization techniques (differential privacy, saliency-guided masking, and dimensionality reduction) are used to preserve inference quality while providing privacy guarantees.

The framework is designed to prevent reconstruction of the original user data from the transmitted anonymized vectors, so that sensitive personal information such as names, identification numbers, and locations cannot be recovered by cloud-side adversaries or through membership inference attacks.

## Project Results

- **Technical**: A complete PIPL-compliant cloud-edge collaborative privacy-preserving prompt processing framework integrated into KubeEdge-Ianvs. It features edge-first inference with hard sample mining, adaptive privacy desensitization (regex patterns, NER masking, and differential privacy), and real-time compliance monitoring with audit logging. The framework achieves zero raw text cross-border transmission while maintaining inference accuracy comparable to non-privacy-preserving baselines.
- **Academic**: ChnSentiCorp-Lite, described as the first PIPL-compliant cross-border LLM inference benchmark dataset, with 3,000 samples, multi-layer privacy annotations, synthetic PII templates, and dedicated attack evaluation subsets. The accompanying evaluation methodology covers utility metrics (task accuracy, end-to-end latency), privacy metrics (Neighbourhood MIA, LOSS Attack, LiRA), and compliance metrics (minimal necessity validation, budget compliance checks, audit integrity verification).
- **Practical**: A deployment with Llama-3-8B-Instruct on edge devices and GPT-4o-mini on cloud servers, demonstrating privacy-preserving LLM inference in a regulated environment.

## Core Features

### 1. Edge-side Privacy Protection
- Perform irreversible privacy transformation on the user's sensitive input prompts
- Convert raw text into anonymized feature vectors
- Complete PII detection, entity recognition, and privacy classification locally

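
The regex-based part of local PII detection can be illustrated with a small sketch. The patterns below (mainland China mobile numbers, 18-character resident ID numbers, and email addresses) and the `mask_pii` helper are assumptions made for illustration; they are not taken from `pii_detector.py`.

```python
import re

# Hypothetical patterns for illustration only; the project's actual rules may differ.
PII_PATTERNS = {
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),  # 18-character resident ID
    "phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),     # 11-digit mobile number
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def mask_pii(text):
    """Replace each match with a typed placeholder and report which types were found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label.upper()}]", text)
        if count:
            found.append(label)
    return text, found


masked, types = mask_pii("联系我: 13812345678, zhang@example.com")
print(masked)  # 联系我: [PHONE], [EMAIL]
print(types)   # ['phone', 'email']
```

Regex matching only covers structured identifiers. Names, organizations, and locations would still need the NER step listed above.
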
### 2. Cloud-side Inference Processing
- Perform inference based solely on anonymized vectors, never accessing raw text
- Receive minimal necessary tags to execute core inference tasks
- Ensure "Zero Raw Text Cross-Border" and "Minimal Tags Cross-Border"

64+
### 3. PIPL Compliance Assurance
65+
- Strictly adhere to "minimal necessity" and "security assurance" principles
66+
- Real-time privacy budget management and audit logging
67+
- Compliance verification before cross-border transmission
68+
69+
## Model Configuration
70+
71+
### Edge Model
72+
- **Model**: Llama-3-8B-Instruct (4-bit quantized)
73+
- **Function**: Local PIPL entity recognition, semantic classification, and anonymized vector generation
74+
- **Deployment**: Adapted for edge computing environments (e.g., NVIDIA T4)
75+
76+
### Cloud Model
77+
- **Model**: GPT-4o-mini (API access)
78+
- **Function**: Receive anonymized vectors to perform core inference tasks
79+
- **Deployment**: OpenAI API format, ensuring scalable deployment
80+
81+
## Dataset
82+
83+
**ChnSentiCorp-Lite** - First PIPL-compliant cross-border LLM inference benchmark dataset
84+
85+
- **Total Samples**: 3,000 (2,000 train, 500 validation, 500 test)
86+
- **Data Source**: Carefully curated subset of ChnSentiCorp Chinese sentiment analysis dataset
87+
- **Format**: JSONL with comprehensive privacy annotations
88+
- **Size**: ~15MB (lightweight for rapid evaluation)
89+
90+
### Key Dataset Contributions
91+
- **Multi-layer Privacy Annotations**: Each sample tagged with privacy sensitivity levels and PII entity types
92+
- **Synthetic PII Templates**: 50+ built-in templates for dynamic generation of realistic Chinese personal information
93+
- **PIPL Compliance Mapping**: Granular annotations indicating cross-border transfer permissions under PIPL Articles 38-40
94+
- **Attack Evaluation Subsets**: Dedicated samples for Neighbourhood MIA, LOSS, and LiRA attack testing
95+
## Quick Start

### Requirements
- Python 3.8+
- NVIDIA GPU (recommended T4 or higher)
- API access keys (OpenAI or compatible services)

### Installation Steps

1. Install dependencies:
```bash
pip install -r requirements.txt
```

2. Configure API keys:
```bash
export EDGE_API_KEY="your_edge_model_api_key"
export CLOUD_API_KEY="your_cloud_model_api_key"
```

3. Run benchmark:
```bash
ianvs -f benchmarkingjob.yaml
```

## Evaluation Methods

### 1. Utility Evaluation
- **Task Accuracy**: Compare accuracy changes before and after enabling privacy transformations
- **End-to-End Latency**: Measure total time from user prompt input to receiving final response

### 2. Privacy Evaluation
- **Neighbourhood MIA**: Model-agnostic approach using semantically similar neighbor samples
- **LOSS Attack**: Traditional loss-based membership inference baseline
- **LiRA**: Advanced likelihood ratio test with theoretical optimality properties

132+
### 3. Compliance Evaluation
133+
- **Minimal Necessity Check**: Payload structure validation
134+
- **Budget Compliance Check**: ε accumulation validation
135+
- **Audit Integrity Check**: Log coverage verification
136+
137+
## Technical Architecture
138+
139+
The system adopts strict separation of duties between cloud and edge to ensure compliance of data processing workflows:
140+
141+
1. **Privacy Detection Module**: Identifies and classifies privacy-sensitive information in user prompts
142+
2. **Privacy Encryption Module**: Performs irreversible transformation of sensitive prompts into anonymized vectors
143+
3. **Edge Inference**: Local privacy processing and preliminary inference
144+
4. **Cloud Collaboration**: Advanced inference based on anonymized data
145+
5. **Compliance Monitoring**: Real-time monitoring and audit logging
146+
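
The division of work listed above can be sketched as an edge-first routing function. This is an illustration under stated assumptions: the source does not say how hard samples are identified, so the sketch treats low edge-model confidence as the criterion, and `edge_first_inference`, `RoutingResult`, and the 0.8 threshold are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RoutingResult:
    label: str
    source: str  # "edge" or "cloud"


def edge_first_inference(
    prompt: str,
    edge_predict: Callable[[str], Tuple[str, float]],
    anonymize: Callable[[str], List[float]],
    cloud_predict: Callable[[List[float]], str],
    confidence_threshold: float = 0.8,  # assumed value, not from the source
) -> RoutingResult:
    """Answer on the edge when confident; otherwise send only an anonymized vector."""
    label, confidence = edge_predict(prompt)
    if confidence >= confidence_threshold:
        return RoutingResult(label, "edge")
    vector = anonymize(prompt)  # the raw prompt never leaves this function
    return RoutingResult(cloud_predict(vector), "cloud")


# Stub components standing in for the real edge model, anonymizer, and cloud model.
easy = edge_first_inference(
    "很好", lambda p: ("positive", 0.95), lambda p: [0.0] * 64, lambda v: "negative"
)
hard = edge_first_inference(
    "还行吧", lambda p: ("positive", 0.55), lambda p: [0.0] * 64, lambda v: "negative"
)
print(easy)  # RoutingResult(label='positive', source='edge')
print(hard)  # RoutingResult(label='negative', source='cloud')
```

Passing the components in as callables keeps the routing logic testable without loading either model.
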
## Privacy Protection Technologies

- **Differential Privacy**: L2-norm clipping, Gaussian noise injection, budget tracking
- **Saliency-Guided Masking**: Attention-based token importance with configurable suppression
- **Dimensionality Reduction**: Johnson-Lindenstrauss random projection, which approximately preserves pairwise distances between vectors
- **Compliance Verification**: Real-time monitoring and audit logging

154+
## Contributing
155+
156+
We welcome contributions, bug reports, and improvement suggestions. Please follow these steps:
157+
158+
1. Fork this repository
159+
2. Create a feature branch
160+
3. Commit your changes
161+
4. Push to the branch
162+
5. Create a Pull Request
163+
164+
## License
165+
166+
This project uses the same license as KubeEdge-Ianvs.
167+
168+
## Contact
169+
170+
For questions or suggestions, please contact us through GitHub Issues.
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
benchmarkingjob:
  name: "pipl-compliant-privacy-preserving-llm-bench"
  workspace: "./workspace-pipl-llm"

  testenv: "./testenv/testenv.yaml"

  test_object:
    type: "algorithms"
    algorithms:
      - name: "privacy-preserving-llm-collaboration"
        url: "./test_algorithms/algorithm.yaml"

  rank:
    sort_by: [{"Accuracy": "descend"}, {"Privacy_Score": "descend"}]

    visualization:
      mode: "selected_only"
      method: "print_table"
    selected_dataitem:
      paradigms: ["all"]
      modules: ["privacy_preserving_llm", "privacy_detection", "privacy_encryption"]
      hyperparameters: [
        "edge_model-model", "cloud_model-model",
        "privacy_detection-ner_model", "privacy_encryption-epsilon"
      ]
      metrics: [
        "Accuracy", "F1_Score", "Privacy_Score", "PIPL_Compliance_Score",
        "End_to_End_Latency", "Throughput", "Privacy_Budget_Consumption",
        "MIA_Attack_Success_Rate", "Cross_Border_Payload_Size"
      ]

    save_mode: "selected_and_all"
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# Core Dependencies
torch>=1.12.0
transformers>=4.21.0
numpy>=1.21.0
pandas>=1.3.0
scikit-learn>=1.0.0

# Privacy Protection
opacus>=1.3.0  # Differential Privacy
scipy>=1.7.0  # Statistical computations

# NLP and Text Processing
spacy>=3.4.0
regex>=2022.1.18
nltk>=3.7

# Model Quantization
bitsandbytes>=0.37.0

# API and HTTP
openai>=0.27.0
requests>=2.25.0
httpx>=0.23.0

# Configuration and Serialization
PyYAML>=6.0
jsonlines>=3.1.0

# Metrics and Evaluation
seaborn>=0.11.0
matplotlib>=3.5.0
plotly>=5.0.0

# Utils
tqdm>=4.64.0
rich>=12.0.0
loguru>=0.6.0

# Privacy Attack Simulation
membership-inference-attacks>=0.1.0

# Chinese Language Processing
jieba>=0.42.1

# Cryptographic utilities (for secure hash and audit logging)
cryptography>=3.4.8

# Memory optimization
psutil>=5.8.0

# Configuration management
python-dotenv>=0.19.0
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
algorithm:
  name: "privacy-preserving-llm-collaboration"
  type: "privacy_preserving_llm"
  url: "./privacy_preserving_llm/privacy_preserving_llm.py"

  edge_model:
    name: "meta-llama/Llama-3-8B-Instruct"
    quantization: "4bit"
    api_base: "https://api.openai.com/v1"
    api_key: "${EDGE_API_KEY}"
    hidden_layer_index: -2
    pooling_strategy: "mean"
    max_length: 2048
    device: "cuda"

  cloud_model:
    name: "gpt-4o-mini"
    api_base: "https://api.openai.com/v1"
    api_key: "${CLOUD_API_KEY}"
    vector_adapter:
      input_dim: 64
      hidden_dim: 512
      output_dim: 4096
    max_tokens: 1024
    temperature: 0.7

  privacy_detection:
    detection_methods:
      regex_patterns: ["phone", "id_card", "email", "address", "name"]
      ner_model: "hfl/chinese-bert-wwm-ext"
      entity_types: ["PERSON", "ORG", "LOC", "PHONE", "EMAIL", "ID"]
    risk_weights:
      structured_pii: 0.8
      named_entities: 0.6
      semantic_context: 0.4
    pipl_compliance:
      cross_border_threshold: 0.5
      high_sensitivity_threshold: 0.7

  privacy_encryption:
    differential_privacy:
      general:
        epsilon: 1.2
        delta: 0.00001
        clipping_norm: 1.0
        noise_multiplier: 1.1
      high_sensitivity:
        epsilon: 0.8
        delta: 0.00001
        clipping_norm: 0.5
        noise_multiplier: 1.5
    anonymization:
      general_mask_ratio: 0.4
      high_sensitivity_mask_ratio: 0.6
      projection_method: "johnson_lindenstrauss"
      target_dims: 64
      saliency_threshold: 0.3
    budget_management:
      session_limit: 10.0
      rate_limit: 5
      audit_logging: true

  compliance:
    pipl_version: "2021"
    audit_level: "detailed"
    cross_border_policy: "strict"
    minimal_necessity: true
