State vector: [present_mask (11 values), scenario_one_hot (2 values)]
- present_mask: Binary vector indicating which PII types are present in the conversation
- Length: 11 (one for each PII type: NAME, PHONE, EMAIL, DATE/DOB, company, location, IP, SSN, CREDIT_CARD, age, sex)
- Value: 1 if PII is present, 0 otherwise
- scenario_one_hot: One-hot encoding of domain
- Length: 2 (restaurant=0, bank=1)
- Example: [1, 0] for restaurant, [0, 1] for bank
- Total state dimension: 13
Important: The model NEVER sees the "allowed_mask" in the state. It must learn domain-specific patterns from rewards.
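As a minimal sketch of the state layout described above, the 13-dim vector can be assembled as follows. The helper name `build_state` and the constants `PII_TYPES` and `SCENARIOS` are illustrative assumptions, not names taken from the project code; the PII order follows the list above.

```python
from typing import Iterable, List

# PII order as listed in the state description (assumed to be the index order).
PII_TYPES = ["NAME", "PHONE", "EMAIL", "DATE/DOB", "company", "location",
             "IP", "SSN", "CREDIT_CARD", "age", "sex"]
SCENARIOS = ["restaurant", "bank"]  # one-hot index 0 and 1


def build_state(present: Iterable[str], scenario: str) -> List[int]:
    """Return 11 presence bits followed by a 2-dim scenario one-hot."""
    present_set = set(present)
    unknown = present_set - set(PII_TYPES)
    if unknown:
        raise ValueError(f"unknown PII types: {sorted(unknown)}")
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    present_mask = [1 if pii in present_set else 0 for pii in PII_TYPES]
    scenario_one_hot = [1 if s == scenario else 0 for s in SCENARIOS]
    return present_mask + scenario_one_hot


state = build_state(["NAME", "EMAIL", "PHONE"], "restaurant")
print(state)  # [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
```

Note that the allowed mask is deliberately not an input here, matching the statement that the model never sees it.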
- Action space: Binary vector of length 11
- Each element: 0 (don't share) or 1 (share) for each PII type
- Example: [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0] means share PHONE and EMAIL only
- Action space: Binary vector of length 11 (same as GRPO)
- Each element: 0 (don't share) or 1 (share) for each PII type
- Example: [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0] means share PHONE and EMAIL only
- Note: All three algorithms (GRPO, GroupedPPO, VanillaRL) now use the same per-PII binary action space. They differ only in the training update method (PPO-style with KL regularization, clipped PPO, or REINFORCE).
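To make the action encoding concrete, here is a small sketch that maps an action vector back to PII names. `decode_action` is a hypothetical helper, and the index order is assumed to match the PII list given for the state vector.

```python
from typing import List, Sequence

# Assumed index order, taken from the PII list in the state description.
PII_TYPES = ["NAME", "PHONE", "EMAIL", "DATE/DOB", "company", "location",
             "IP", "SSN", "CREDIT_CARD", "age", "sex"]


def decode_action(action: Sequence[int]) -> List[str]:
    """Return the PII types whose action bit is 1 (share)."""
    if len(action) != len(PII_TYPES):
        raise ValueError(f"expected {len(PII_TYPES)} bits, got {len(action)}")
    if any(bit not in (0, 1) for bit in action):
        raise ValueError("action bits must be 0 or 1")
    return [pii for pii, bit in zip(PII_TYPES, action) if bit == 1]


print(decode_action([0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # ['PHONE', 'EMAIL']
```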
Reward function: R(s, a) = α·utility + β·privacy - complexity_penalty
Where:
- Utility: Fraction of allowed PII that was shared
utility = |shared_allowed| / |allowed|
- Privacy: Fraction of disallowed PII that was NOT shared
privacy = 1 - |shared_disallowed| / |disallowed|
- Complexity penalty: Penalty for sharing too many fields
complexity_penalty = λ · (|shared| / |present|)
- Weights (domain-specific):
- Restaurant: α=0.6, β=0.4 (leans more toward privacy than bank)
- Bank: α=0.7, β=0.3 (leans more toward utility than restaurant)
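The formula above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: the value `lam=0.1` for λ is made up, empty denominators are treated as a perfect score (consistent with the worked examples in this document, where a group with nothing allowed gets utility 1.0), and allowed and shared sets are restricted to fields that are actually present.

```python
from typing import Set

# (alpha, beta) per domain, as listed above.
WEIGHTS = {"restaurant": (0.6, 0.4), "bank": (0.7, 0.3)}


def reward(present: Set[str], allowed: Set[str], shared: Set[str],
           scenario: str, lam: float = 0.1) -> float:
    """R = alpha*utility + beta*privacy - lam*|shared|/|present|."""
    alpha, beta = WEIGHTS[scenario]
    allowed = allowed & present          # assumption: only present fields count
    disallowed = present - allowed
    shared = shared & present
    # Assumption: an empty denominator counts as a perfect score.
    utility = len(shared & allowed) / len(allowed) if allowed else 1.0
    privacy = 1.0 - len(shared & disallowed) / len(disallowed) if disallowed else 1.0
    complexity_penalty = lam * len(shared) / len(present) if present else 0.0
    return alpha * utility + beta * privacy - complexity_penalty


present = {"NAME", "EMAIL", "PHONE"}
allowed = {"EMAIL", "PHONE"}
print(round(reward(present, allowed, {"EMAIL", "PHONE"}, "restaurant"), 4))  # 0.9333
print(round(reward(present, allowed, present, "restaurant"), 4))             # 0.5
```

Sharing exactly the allowed fields scores close to 1.0, while oversharing NAME drops privacy to 0 and raises the complexity penalty.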
Group-based reward (for GRPO):
- Reward computed per PII group (identity, contact, financial, etc.)
- Average reward across all groups
- Encourages learning consistent group-level patterns
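A hedged sketch of the group-based variant follows. The group membership in `GROUPS` is an assumption: this document's examples only show NAME under "identity", PHONE and EMAIL under "contact", and SSN under "financial"; placing CREDIT_CARD under "financial" and omitting the remaining PII types are illustrative choices. Skipping groups with no present PII and applying the complexity penalty after averaging are also assumptions.

```python
from typing import Dict, Set

# Hypothetical grouping; only partly confirmed by the examples in this document.
GROUPS: Dict[str, Set[str]] = {
    "identity": {"NAME"},
    "contact": {"PHONE", "EMAIL"},
    "financial": {"SSN", "CREDIT_CARD"},
}


def group_score(present: Set[str], allowed: Set[str], shared: Set[str],
                alpha: float, beta: float) -> float:
    """alpha*utility + beta*privacy for one group; empty denominators score 1.0."""
    allowed = allowed & present
    disallowed = present - allowed
    shared = shared & present
    utility = len(shared & allowed) / len(allowed) if allowed else 1.0
    privacy = 1.0 - len(shared & disallowed) / len(disallowed) if disallowed else 1.0
    return alpha * utility + beta * privacy


def grouped_reward(present: Set[str], allowed: Set[str], shared: Set[str],
                   alpha: float, beta: float, lam: float = 0.0) -> float:
    """Average the per-group scores, then subtract the complexity penalty."""
    scores = []
    for members in GROUPS.values():
        in_group = present & members
        if not in_group:
            continue  # assumption: groups with nothing present are skipped
        scores.append(group_score(in_group, allowed & in_group,
                                  shared & in_group, alpha, beta))
    if not scores:
        return 0.0
    penalty = lam * len(shared & present) / len(present)
    return sum(scores) / len(scores) - penalty


present = {"NAME", "PHONE", "EMAIL", "SSN"}
allowed = {"PHONE", "EMAIL", "SSN"}
print(round(grouped_reward(present, allowed, allowed, 0.7, 0.3), 3))  # 1.0
print(round(grouped_reward(present, allowed, present, 0.7, 0.3), 3))  # 0.9
```

In the second call only the identity group is penalized for leaking NAME, so the average falls from 1.0 to 0.9.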
Input: [present_mask (11), scenario_one_hot (2)] → State (13 dim)
↓
[FC(64) → ReLU → FC(64) → ReLU] → Shared Encoder
↓
┌─────────────────┐
│ │
Policy Head (11) Value Head (1)
│ │
↓ ↓
Bernoulli(11) V(s)
- Shared encoder: 2-layer MLP (13 → 64 → 64)
- Policy head: Linear(64 → 11) outputs Bernoulli logits for each PII
- Value head: Linear(64 → 1) outputs state value V(s)
- For each PII type independently:
  - Compute probability: p = sigmoid(logit)
  - Sample or threshold: action = 1 if p >= threshold else 0
- Result: Binary vector of length 11
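The two selection modes can be sketched in plain Python. The function name `select_actions`, the default threshold of 0.5, and the example logits are assumptions chosen for illustration.

```python
import math
import random
from typing import List, Optional, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def select_actions(logits: Sequence[float], threshold: float = 0.5,
                   rng: Optional[random.Random] = None) -> List[int]:
    """Threshold the probabilities, or sample each bit if an rng is given."""
    probs = [sigmoid(logit) for logit in logits]
    if rng is None:
        return [1 if p >= threshold else 0 for p in probs]
    return [1 if rng.random() < p else 0 for p in probs]


# Illustrative logits: low for NAME, high for PHONE and EMAIL, low elsewhere.
logits = [-4.6, 3.9, 3.5] + [-5.0] * 8
print(select_actions(logits))  # [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
sampled = select_actions(logits, rng=random.Random(0))  # stochastic, as in rollouts
```

Sampling is what the rollout uses during training; thresholding gives a deterministic decision at inference time.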
- Rollout:
  for each batch:
      sample dataset row
      sample scenario (restaurant/bank)
      build state = [present_mask, scenario_one_hot]
      policy(state) → logits (11), value (1)
      sample actions from Bernoulli(logits)
      compute reward based on group-level matching
- Update (PPO-style with KL regularization):
  advantages = rewards - old_values
  ratio = exp(new_log_prob - old_log_prob)
  policy_loss = -mean(ratio * advantages)
  value_loss = MSE(new_values, rewards)
  kl_penalty = KL(new_probs || old_probs)
  loss = policy_loss + value_coef*value_loss + kl_coef*kl_penalty
- Key Features:
- Per-PII binary decisions
- Group-based rewards (encourages learning patterns)
- Value function for advantage estimation
- KL regularization prevents large policy updates
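The update described above can be traced with a dependency-free sketch. The coefficient defaults (`value_coef=0.5`, `kl_coef=0.1`) and the function names are assumptions; the real training code works on tensors, whereas this version uses plain lists to show the arithmetic.

```python
import math
from typing import Sequence

EPS = 1e-8  # guards against log(0)


def joint_log_prob(probs: Sequence[float], actions: Sequence[int]) -> float:
    """Sum of Bernoulli log-probabilities over all PII decisions."""
    return sum(math.log((p if a == 1 else 1.0 - p) + EPS)
               for p, a in zip(probs, actions))


def bernoulli_kl(new: Sequence[float], old: Sequence[float]) -> float:
    """KL(new || old), summed over independent Bernoulli components."""
    return sum(p * math.log((p + EPS) / (q + EPS))
               + (1.0 - p) * math.log((1.0 - p + EPS) / (1.0 - q + EPS))
               for p, q in zip(new, old))


def kl_regularized_loss(new_probs, old_probs, actions, rewards,
                        new_values, old_values,
                        value_coef: float = 0.5, kl_coef: float = 0.1) -> float:
    """Batch mean of -ratio*advantage + value_coef*MSE + kl_coef*KL."""
    n = len(rewards)
    policy_loss = value_loss = kl = 0.0
    for i in range(n):
        advantage = rewards[i] - old_values[i]
        ratio = math.exp(joint_log_prob(new_probs[i], actions[i])
                         - joint_log_prob(old_probs[i], actions[i]))
        policy_loss += -ratio * advantage
        value_loss += (new_values[i] - rewards[i]) ** 2
        kl += bernoulli_kl(new_probs[i], old_probs[i])
    return (policy_loss + value_coef * value_loss + kl_coef * kl) / n


# With identical old and new policies the ratio is 1 and the KL term is 0.
loss = kl_regularized_loss([[0.9, 0.1]], [[0.9, 0.1]], [[1, 0]],
                           rewards=[1.0], new_values=[0.5], old_values=[0.5])
print(round(loss, 3))  # -0.375
```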
1. SAMPLE EXPERIENCE:
Dataset Row: present=[NAME,EMAIL,PHONE], allowed_restaurant=[EMAIL,PHONE]
Scenario: "restaurant"
2. BUILD STATE:
present_mask = [1,1,1,0,0,0,0,0,0,0,0] # NAME,PHONE,EMAIL present
scenario_one_hot = [1,0] # restaurant
state = [1,1,1,0,0,0,0,0,0,0,0, 1,0] # 13 dim
3. POLICY FORWARD:
state → MLP(64) → hidden(64)
hidden → policy_head(11) → logits[11]
logits → sigmoid → probs[11] = [0.01, 0.98, 0.97, ...]
4. SAMPLE ACTIONS:
Sample from Bernoulli(probs) → actions = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
Meaning: Don't share NAME, Share PHONE, Share EMAIL, don't share others...
**Important**: All 11 actions in this vector are used together:
- Reward is computed using all 11 decisions
- Policy loss uses log probability summed across all 11 PII types
- The model learns from the complete action vector, not individual actions
5. COMPUTE REWARD (group-based):
For "contact" group (PHONE, EMAIL):
- present: [PHONE, EMAIL]
- shared: [PHONE, EMAIL] (both shared)
- allowed: [PHONE, EMAIL] (both allowed)
- utility = 2/2 = 1.0
- privacy = 1.0 (no disallowed shared)
- group_reward = 0.6*1.0 + 0.4*1.0 = 1.0
For "identity" group (NAME):
- present: [NAME]
- shared: [] (not shared)
- allowed: [] (not allowed)
- utility = 1.0 (correctly didn't share)
- privacy = 1.0
- group_reward = 1.0
reward = mean([1.0, 1.0, ...]) - complexity_penalty
6. UPDATE (on batch of 64 experiences):
For each experience in batch:
- advantages = reward - value_baseline
- log_prob = sum(log_prob for all 11 PII actions) # All actions used together
- ratio = exp(log_prob - old_log_prob)
- policy_loss = -ratio * advantages
Aggregate across batch:
- policy_loss = mean(policy_loss for all 64 experiences)
- value_loss = MSE(new_values, rewards) # Across all 64
- kl_penalty = KL(new_probs || old_probs) # Across all 64
- total_loss = policy_loss + value_coef*value_loss + kl_coef*kl_penalty
→ backprop → update weights
**Note**: The update uses all 64 experiences in the batch simultaneously, which is cheaper per experience and gives a lower-variance gradient estimate than updating after each single experience.
Input: [present_mask (11), scenario_one_hot (2)] → State (13 dim)
↓
[FC(64) → ReLU → FC(64) → ReLU] → Shared Encoder
↓
┌─────────────────┐
│ │
Policy Head (11) Value Head (1)
│ │
↓ ↓
Bernoulli(11) V(s)
- Shared encoder: 2-layer MLP (13 → 64 → 64)
- Policy head: Linear(64 → 11) outputs Bernoulli logits for each PII
- Value head: Linear(64 → 1) outputs state value V(s)
- Same architecture as GRPO - difference is in the training update method
- For each PII type independently:
  - Compute probability: p = sigmoid(logit)
  - Sample or threshold: action = 1 if p >= threshold else 0
- Result: Binary vector of length 11 (same as GRPO)
- Rollout:
  for each batch:
      sample dataset row
      sample scenario (restaurant/bank)
      build state = [present_mask, scenario_one_hot]
      policy(state) → logits (11), value (1)
      sample actions from Bernoulli(logits)
      compute reward based on group-level matching
- Update (PPO with clipping):
  advantages = rewards - old_values
  ratio = exp(new_log_prob - old_log_prob)
  surr1 = ratio * advantages
  surr2 = clip(ratio, 1-ε, 1+ε) * advantages
  policy_loss = -mean(min(surr1, surr2)) # Clipped PPO
  value_loss = MSE(new_values, rewards)
  entropy_bonus = entropy(probs)
  kl_penalty = KL(new_probs || old_probs)
  loss = policy_loss + value_loss - entropy_bonus + kl_penalty
- Key Features:
- Per-PII binary decisions (same as GRPO)
- Group-based rewards (encourages learning patterns)
- Value function for advantage estimation
- PPO clipping prevents large updates
- Entropy bonus encourages exploration
- KL regularization prevents large policy updates
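The clipping step is easiest to see on a few hand-picked numbers. The sketch below assumes ε = 0.2, which matches the clip(ratio, 0.8, 1.2) range used in the worked example in this document; the function name is illustrative.

```python
from typing import Sequence


def clipped_surrogate_loss(ratios: Sequence[float],
                           advantages: Sequence[float],
                           eps: float = 0.2) -> float:
    """Batch mean of -min(ratio*A, clip(ratio, 1-eps, 1+eps)*A)."""
    losses = []
    for ratio, advantage in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        losses.append(-min(ratio * advantage, clipped * advantage))
    return sum(losses) / len(losses)


# ratio 1.5 with a positive advantage is capped at 1.2, so the gain is limited.
print(round(clipped_surrogate_loss([1.5], [1.0]), 3))   # -1.2
# ratio 1.5 with a negative advantage is NOT clipped: min() keeps the worse term.
print(round(clipped_surrogate_loss([1.5], [-1.0]), 3))  # 1.5
```

Taking the minimum makes the objective pessimistic: clipping only removes the incentive to move the ratio further in a direction that already looks beneficial.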
1. SAMPLE EXPERIENCE:
Dataset Row: present=[NAME,EMAIL,PHONE,SSN], allowed_bank=[EMAIL,PHONE,SSN]
Scenario: "bank"
2. BUILD STATE:
present_mask = [1,1,1,0,0,0,0,1,0,0,0] # NAME,PHONE,EMAIL,SSN
scenario_one_hot = [0,1] # bank
state = [1,1,1,0,0,0,0,1,0,0,0, 0,1] # 13 dim
3. POLICY FORWARD:
state → MLP(64) → hidden(64)
hidden → policy_head(11) → logits[11]
logits → sigmoid → probs[11] = [0.01, 0.98, 0.97, ..., 0.85]
4. SAMPLE ACTIONS:
Sample from Bernoulli(probs) → actions = [0, 1, 1, 0, ..., 1]
Meaning: Don't share NAME, Share PHONE, Share EMAIL, ..., Share SSN
5. COMPUTE REWARD (group-based):
For "contact" group (PHONE, EMAIL):
- present: [PHONE, EMAIL]
- shared: [PHONE, EMAIL] (both shared)
- allowed: [PHONE, EMAIL] (both allowed)
- utility = 2/2 = 1.0
- privacy = 1.0 (no disallowed shared)
- group_reward = 0.7*1.0 + 0.3*1.0 = 1.0
For "financial" group (SSN):
- present: [SSN]
- shared: [SSN] (shared)
- allowed: [SSN] (allowed)
- utility = 1/1 = 1.0, privacy = 1.0
- group_reward = 1.0
For "identity" group (NAME):
- present: [NAME]
- shared: [] (not shared)
- allowed: [] (not allowed)
- utility = 1.0 (correctly didn't share)
- privacy = 1.0
- group_reward = 1.0
reward = mean([1.0, 1.0, 1.0, ...]) - complexity_penalty
6. UPDATE (PPO with clipping):
advantages = reward - value_baseline
ratio = exp(new_log_prob - old_log_prob)
surr1 = ratio * advantages
surr2 = clip(ratio, 0.8, 1.2) * advantages
policy_loss = -min(surr1, surr2) # Clipped!
value_loss = MSE(value, reward)
entropy_bonus = entropy(probs)
kl_penalty = KL(new_probs || old_probs)
loss = policy_loss + value_loss - entropy_bonus + kl_penalty
→ backprop → update weights
Input: [present_mask (11), scenario_one_hot (2)] → State (13 dim)
↓
[FC(64) → Tanh → FC(64) → Tanh] → Shared Encoder
↓
Policy Head (11)
↓
Bernoulli(11)
- Shared encoder: 2-layer MLP (13 → 64 → 64)
- Policy head: Linear(64 → 11) outputs Bernoulli logits for each PII
- No value function (simpler than GRPO/GroupedPPO)
- Same as GRPO/GroupedPPO: Binary vector of length 11 (per-PII decisions)
- Rollout:
  for each batch:
      sample dataset row
      sample scenario (restaurant/bank)
      build state = [present_mask, scenario_one_hot]
      policy(state) → logits (11)
      sample actions from Bernoulli(logits)
      compute reward based on group-level matching
  For each iteration:
  - Collect batch (64 random experiences)
  - Compute rewards for all 64
  - Update policy using all 64 experiences together
  - Repeat until convergence
- Update (Simple REINFORCE):
  advantages = (rewards - mean(rewards)) / std(rewards) # Normalized
  for each transition:
      log_prob = log π(actions | state) # Sum over all PII
      loss = -log_prob * advantage # REINFORCE
- Key Features:
- Simplest algorithm (no value function)
- Per-PII binary decisions (same as GRPO/GroupedPPO)
- Group-based rewards (encourages learning patterns)
- REINFORCE policy gradient
- Normalized rewards as advantages
- No clipping or KL regularization
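A plain-Python sketch of the REINFORCE loss with normalized rewards is shown below. It uses the sample standard deviation (n-1 denominator), which is what `torch.Tensor.std()` computes by default; the function names and the 1e-8 guard inside the logarithm are illustrative assumptions.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def normalized_advantages(rewards: Sequence[float]) -> List[float]:
    """(r - mean) / (std + 1e-8), with std taken over the batch."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def reinforce_loss(probs_batch: Sequence[Sequence[float]],
                   actions_batch: Sequence[Sequence[int]],
                   rewards: Sequence[float]) -> float:
    """Batch mean of -log pi(a|s) * advantage; log-prob summed over PII bits."""
    advantages = normalized_advantages(rewards)
    total = 0.0
    for probs, actions, advantage in zip(probs_batch, actions_batch, advantages):
        log_prob = sum(math.log((p if a == 1 else 1.0 - p) + 1e-8)
                       for p, a in zip(probs, actions))
        total += -log_prob * advantage
    return total / len(rewards)


adv = normalized_advantages([1.0, 0.0])
print([round(a, 3) for a in adv])  # [0.707, -0.707]
```

Because the advantages are centered on the batch mean, an action vector is reinforced only if its reward beats the batch average, which plays the role of the missing value baseline.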
1. SAMPLE EXPERIENCE:
Dataset Row: present=[NAME,EMAIL,PHONE,SSN], allowed_bank=[EMAIL,PHONE,SSN]
Scenario: "bank"
2. BUILD STATE:
present_mask = [1,1,1,0,0,0,0,1,0,0,0] # NAME,PHONE,EMAIL,SSN
scenario_one_hot = [0,1] # bank
state = [1,1,1,0,0,0,0,1,0,0,0, 0,1] # 13 dim
3. POLICY FORWARD:
state → MLP(64) → hidden(64)
hidden → policy_head(11) → logits[11]
logits → sigmoid → probs[11] = [0.01, 0.98, 0.97, ..., 0.85]
4. SAMPLE ACTIONS:
Sample from Bernoulli(probs) → actions = [0, 1, 1, 0, ..., 1]
Meaning: Don't share NAME, Share PHONE, Share EMAIL, ..., Share SSN
5. COMPUTE REWARD (group-based):
Same as GRPO/GroupedPPO - per-group rewards, then averaged
6. UPDATE (Simple REINFORCE):
For all transitions:
advantages = (rewards - mean(rewards)) / std(rewards) # Normalize
For each transition:
log_prob = log π(actions | state) # Sum over all PII
loss = -log_prob * advantage # Simple REINFORCE
No value function, no clipping, no KL penalty - just gradient!
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: INITIALIZATION │
└─────────────────────────────────────────────────────────────┘
├─ Load dataset (CSV/Excel)
├─ Parse: ground_truth → present_mask
├─ Parse: allowed_restaurant → allowed_mask_restaurant
├─ Parse: allowed_bank → allowed_mask_bank
└─ Initialize policy network (random weights)
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: ROLLOUT (Collect Batch of Experiences) │
└─────────────────────────────────────────────────────────────┘
For batch_size (e.g., 64) samples:
├─ 1. Sample random dataset row
├─ 2. Sample random scenario (restaurant or bank)
├─ 3. Build state:
│ ├─ present_mask = [1,1,1,0,0,...] (which PII present)
│ ├─ scenario_one_hot = [1,0] or [0,1] (restaurant/bank)
│ └─ state = concat(present_mask, scenario_one_hot)
│
├─ 4. Policy forward pass:
│ ├─ state → encoder → hidden features
│ └─ hidden → policy_head → logits (11)
│
├─ 5. Sample actions:
│ └─ All algorithms: Sample ONE action vector of length 11 from Bernoulli
│ - Each vector contains 11 binary decisions (one per PII type)
│ - All 11 actions are used together for reward and loss computation
│
├─ 6. Apply actions:
│ └─ All algorithms: actions directly = which PII to share
│ - All 11 decisions in the action vector are used together
│
└─ 7. Compute reward:
├─ Compare shared PII vs allowed_mask (using all 11 actions)
├─ Calculate utility and privacy
└─ reward = α·utility + β·privacy - complexity_penalty
After collecting batch_size experiences (e.g., 64):
└─ Store all experiences in batch for update
┌─────────────────────────────────────────────────────────────┐
│ STEP 3: POLICY UPDATE (on entire batch) │
└─────────────────────────────────────────────────────────────┘
├─ Compute advantages for all batch_size experiences:
│ ├─ GRPO/GroupedPPO: advantages = rewards - value_baseline
│ └─ VanillaRL: advantages = normalized(rewards)
│
├─ Compute policy gradient using all batch experiences:
│ ├─ For each experience: log_prob = sum(log_prob for all 11 PII actions)
│ ├─ GRPO: PPO-style with KL regularization (no clipping)
│ ├─ GroupedPPO: PPO with clipping + entropy + KL
│ └─ VanillaRL: REINFORCE (simple gradient)
│
└─ Update network weights via backpropagation
- Update uses all batch_size experiences simultaneously
- More efficient than updating after each single experience
┌─────────────────────────────────────────────────────────────┐
│ STEP 4: CONVERGENCE CHECK │
└─────────────────────────────────────────────────────────────┘
├─ Evaluate on validation set
├─ Check if reward improved > threshold
├─ If no improvement for 'patience' iterations → STOP
└─ Otherwise continue to next iteration
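The convergence check in Step 4 can be sketched as a small patience counter. The class name `EarlyStopping` and the parameter name `min_delta` (standing in for the improvement threshold) are assumptions, as are the numbers in the usage lines.

```python
class EarlyStopping:
    """Stop when validation reward has not improved by more than
    min_delta for `patience` consecutive evaluations."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_evals = 0

    def should_stop(self, val_reward: float) -> bool:
        if val_reward > self.best + self.min_delta:
            self.best = val_reward   # meaningful improvement: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


stopper = EarlyStopping(patience=2, min_delta=0.01)
decisions = [stopper.should_stop(r) for r in [0.5, 0.6, 0.605, 0.6]]
print(decisions)  # [False, False, False, True]
```

The third value (0.605) does not count as an improvement because it beats the best reward by less than `min_delta`.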
┌─────────────────────────────────────────────────────────────┐
│ TRAINING LOOP │
└─────────────────────────────────────────────────────────────┘
1. LOAD DATASET
├─ Read CSV/Excel
├─ Parse ground_truth → present_mask
├─ Parse allowed_restaurant → allowed_mask_restaurant
└─ Parse allowed_bank → allowed_mask_bank
2. FOR EACH ITERATION:
a) ROLLOUT BATCH (collect experiences)
├─ Sample dataset row
├─ Sample scenario (restaurant or bank)
├─ Build state = [present_mask, scenario_one_hot]
├─ Policy(state) → actions
├─ Apply actions → determine which PII to share
└─ Compute reward based on allowed_mask
b) UPDATE POLICY
├─ Compute advantages (rewards - baseline)
├─ Compute policy gradient
└─ Update network weights
c) EVALUATE (every N iterations)
├─ Run policy on evaluation set
└─ Log average reward
3. CHECK CONVERGENCE
├─ If no improvement > threshold for patience iterations
└─ Stop training
4. SAVE MODEL
└─ Save policy weights
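For the LOAD DATASET step, a minimal parsing sketch is given below. The column names `ground_truth`, `allowed_restaurant`, and `allowed_bank` come from the steps above, but the cell format (comma-separated PII labels) is an assumption, as are the helper names `to_mask` and `load_rows` and the sample row.

```python
import csv
import io
from typing import Dict, List, TextIO

PII_TYPES = ["NAME", "PHONE", "EMAIL", "DATE/DOB", "company", "location",
             "IP", "SSN", "CREDIT_CARD", "age", "sex"]


def to_mask(cell: str) -> List[int]:
    """Turn a comma-separated label cell (assumed format) into an 11-bit mask."""
    labels = {label.strip() for label in cell.split(",") if label.strip()}
    return [1 if pii in labels else 0 for pii in PII_TYPES]


def load_rows(handle: TextIO) -> List[Dict[str, List[int]]]:
    """Parse each CSV record into present/allowed masks."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            "present_mask": to_mask(record["ground_truth"]),
            "allowed_mask_restaurant": to_mask(record["allowed_restaurant"]),
            "allowed_mask_bank": to_mask(record["allowed_bank"]),
        })
    return rows


# Made-up sample row for illustration only.
sample = io.StringIO(
    "ground_truth,allowed_restaurant,allowed_bank\n"
    '"NAME,EMAIL,PHONE,SSN","EMAIL,PHONE","EMAIL,PHONE,SSN"\n'
)
rows = load_rows(sample)
print(rows[0]["allowed_mask_bank"])  # [0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
```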
All algorithms update the neural network weights by gradient descent, using the Adam optimizer. Each update follows these steps:
- Forward Pass: Compute predictions and loss
- Backward Pass: Compute gradients via backpropagation
- Gradient Clipping: Prevent exploding gradients
- Weight Update: Update weights using optimizer
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: FORWARD PASS │
└─────────────────────────────────────────────────────────────┘
├─ Input: batch of states (shape: [batch_size, 13])
├─ Policy network forward:
│ ├─ state → encoder layers → hidden features
│ └─ hidden → policy_head → logits (shape: [batch_size, 11])
│
├─ Convert logits to probabilities:
│ └─ probs = sigmoid(logits) # Each PII type gets a probability
│
├─ Sample actions from Bernoulli distribution:
│ └─ actions ~ Bernoulli(probs) # Binary vector [0,1,1,0,...]
│
└─ Compute log probabilities:
└─ log_probs = sum(log_prob(action_i | prob_i) for all 11 PII types)
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: COMPUTE LOSS │
└─────────────────────────────────────────────────────────────┘
├─ Calculate advantages (how good was this action?):
│ ├─ GRPO/GroupedPPO: advantage = reward - value_baseline
│ └─ VanillaRL: advantage = (reward - mean(rewards)) / std(rewards)
│
├─ Compute policy loss:
│ ├─ GRPO: loss = -mean(ratio * advantage) + KL_penalty + value_loss
│ ├─ GroupedPPO: loss = -mean(min(ratio * advantage, clipped_ratio * advantage)) + value_loss - entropy_bonus + KL_penalty
│ └─ VanillaRL: loss = -mean(log_prob * advantage)
│
└─ Total loss combines:
├─ Policy loss (main objective)
├─ Value loss (for GRPO/GroupedPPO)
├─ KL divergence penalty (for GRPO/GroupedPPO)
└─ Entropy bonus (for GroupedPPO)
┌─────────────────────────────────────────────────────────────┐
│ STEP 3: BACKWARD PASS (Gradient Computation) │
└─────────────────────────────────────────────────────────────┘
├─ optimizer.zero_grad() # Clear previous gradients
│
├─ loss.backward() # Backpropagation:
│ ├─ Computes ∂loss/∂weight for ALL weights in network
│ ├─ Uses chain rule to propagate gradients backward
│ └─ Stores gradients in weight.grad for each parameter
│
└─ torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
└─ Clips gradients to max norm of 1.0 (prevents exploding gradients)
┌─────────────────────────────────────────────────────────────┐
│ STEP 4: WEIGHT UPDATE (Gradient Descent) │
└─────────────────────────────────────────────────────────────┘
├─ optimizer.step() # Adam optimizer update:
│ ├─ For each weight w:
│ │ ├─ Compute first-moment (momentum) estimate: m_t = β₁·m_{t-1} + (1-β₁)·grad
│ │ ├─ Compute second-moment estimate: v_t = β₂·v_{t-1} + (1-β₂)·grad²
│ │ ├─ Bias correction: m̂ = m_t / (1-β₁^t), v̂ = v_t / (1-β₂^t)
│ │ └─ Update: w = w - (learning_rate / (√v̂ + ε)) · m̂
│ │
│ └─ Adam adapts learning rate per parameter
│
└─ Weights are now updated! Network learned from this batch.
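The Adam formulas in Step 4 can be checked on a single scalar weight. This sketch uses the learning rate 3e-4 mentioned in this document; β₁=0.9, β₂=0.999, and ε=1e-8 are the commonly used defaults and are assumed here, since the document does not state them.

```python
import math
from typing import Tuple


def adam_step(w: float, grad: float, m: float, v: float, t: int,
              lr: float = 3e-4, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8) -> Tuple[float, float, float]:
    """One Adam update for a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


w, m, v = adam_step(w=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(round(w, 6))  # 0.9997
```

On the first step the bias-corrected ratio m̂/√v̂ is close to ±1, so the weight moves by roughly one learning rate regardless of the gradient's magnitude.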
import torch
import torch.nn.functional as F

# Actor-critic update: policy returns (logits, values)
def policy_gradient_update(policy, optimizer, batch, epochs=4, ...):
    # The body uses device, value_coef, kl_coef and entropy_coef,
    # so these are among the arguments elided in the signature.
    # Convert batch to tensors
    states, actions, rewards, old_log_probs, old_values, old_probs = batch.to_tensors(device)

    # Compute advantages and normalize them across the batch
    advantages = rewards - old_values
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    for _ in range(epochs):  # Multiple epochs per batch
        # Forward pass
        logits, values = policy(states)
        probs = torch.sigmoid(logits)
        dist = torch.distributions.Bernoulli(probs=probs)
        log_probs = dist.log_prob(actions).sum(dim=1)  # joint log-prob over all 11 PII decisions

        # Compute loss
        ratio = torch.exp(log_probs - old_log_probs)
        policy_loss = -(ratio * advantages).mean()
        value_loss = F.mse_loss(values, rewards)
        kl_loss = compute_kl_divergence(old_probs, probs)
        entropy = dist.entropy().sum(dim=1).mean()
        total_loss = policy_loss + value_coef * value_loss + kl_coef * kl_loss - entropy_coef * entropy

        # GRADIENT DESCENT:
        optimizer.zero_grad()  # 1. Clear gradients
        total_loss.backward()  # 2. Compute gradients (backpropagation)
        torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)  # 3. Clip gradients
        optimizer.step()  # 4. Update weights (Adam optimizer)

# REINFORCE update: policy returns logits only
def policy_gradient_update(policy, optimizer, transitions, epochs=3):
    # Convert the stored transitions to tensors once, outside the epoch loop
    states = torch.tensor(transitions.states, dtype=torch.float32)
    # Assumes transitions.actions is stored alongside transitions.states and transitions.rewards
    actions = torch.tensor(transitions.actions, dtype=torch.float32)
    rewards = torch.tensor(transitions.rewards, dtype=torch.float32)

    # Compute normalized advantages
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    for _ in range(epochs):
        # GRADIENT DESCENT:
        optimizer.zero_grad()  # 1. Clear gradients

        # Forward pass
        logits = policy(states)
        probs = torch.sigmoid(logits)
        dist = torch.distributions.Bernoulli(probs=probs)
        log_probs = dist.log_prob(actions).sum(dim=1)

        # Compute loss (REINFORCE)
        loss = -(log_probs * advantages).mean()
        loss.backward()  # 2. Compute gradients (backpropagation)
        torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)  # 3. Clip gradients
        optimizer.step()  # 4. Update weights (Adam optimizer)

Adam Optimizer (used by all algorithms):
- Learning Rate: Default 3e-4 (0.0003)
- Adaptive: Adjusts learning rate per parameter
- Momentum: Uses exponential moving averages of gradients
- Benefits:
- Often converges faster than plain SGD
- Handles sparse gradients well
- Typically less sensitive to the learning-rate setting than plain SGD
Gradient Clipping:
- Purpose: Prevents exploding gradients that can destabilize training
- Method: Clips gradient norm to max value of 1.0
- Formula:
grad = grad * min(1.0, max_norm / grad_norm)
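The clipping formula above can be illustrated on a flat list of gradient values. This is a sketch of norm-based clipping in general, not a reproduction of the library routine; the small 1e-6 term is an assumption added to avoid division by zero.

```python
import math
from typing import List, Sequence, Tuple


def clip_grad_norm(grads: Sequence[float],
                   max_norm: float = 1.0) -> Tuple[List[float], float]:
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the clipped gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


clipped, norm = clip_grad_norm([3.0, 4.0])
print(norm)                            # 5.0
print([round(g, 3) for g in clipped])  # [0.6, 0.8]
```

Gradients whose norm is already below `max_norm` pass through unchanged, so clipping changes the step size only for unusually large updates and never changes the direction.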
For each iteration:
1. Rollout batch (collect experiences)
2. For each epoch (multiple updates per batch):
a. Forward pass → compute loss
b. Backward pass → compute gradients
c. Clip gradients
d. Optimizer.step() → update weights
3. Evaluate (every N iterations)
4. Check convergence
- Batch Processing: Updates use entire batch (e.g., 64 samples) simultaneously
- Multiple Epochs: Each batch is used for multiple gradient updates (2-4 epochs)
- Gradient Clipping: Prevents gradient explosion
- Adam Optimizer: Adaptive learning rate per parameter
- Backpropagation: Automatically computes gradients for all weights via chain rule
| Aspect | GRPO | GroupedPPO | VanillaRL |
|---|---|---|---|
| Action Space | 11 binary (per-PII) | 11 binary (per-PII) | 11 binary (per-PII) |
| Policy Output | Bernoulli(11) | Bernoulli(11) | Bernoulli(11) |
| Value Function | Yes (1 head) | Yes (1 head) | No |
| Update Method | PPO + KL reg | PPO + clipping + entropy + KL | REINFORCE |
| Complexity | Medium | Medium | Low |
| Advantages | Per-PII control, KL regularization | Per-PII control, PPO clipping | Simple, fast |
| Reward | Group-based | Group-based | Group-based |
- Granularity: Per-PII decisions
- Learning: Learns which individual PII types to share
- Output: Binary vector [0,1,1,0,...] for each PII
- Update: PPO-style with KL regularization
- Granularity: Per-PII decisions (same as GRPO)
- Learning: Learns which individual PII types to share
- Output: Binary vector [0,1,1,0,...] for each PII
- Update: PPO with clipping + entropy bonus + KL regularization
- Granularity: Per-PII decisions (same as GRPO/GroupedPPO)
- Learning: Learns which individual PII types to share
- Output: Binary vector [0,1,1,0,...] for each PII
- Update: Simple REINFORCE (no value function, no clipping, no KL)
# All algorithms use the same group-based reward computation
for each PII group:
    present_in_group = PII types in group that are present
    shared_in_group = PII types in group that were shared (from per-PII actions)
    allowed_in_group = PII types in group that are allowed
    utility = |shared_allowed| / |allowed| # How much allowed was shared
    privacy = 1 - |shared_disallowed| / |disallowed| # How much disallowed was NOT shared
    group_reward = α·utility + β·privacy - complexity_penalty
    group_rewards.append(group_reward)
reward = mean(group_rewards) # Average across groups

The model never sees the "allowed_mask" - it must infer the pattern from rewards.
Purpose: Optimized to show utility-privacy tradeoff across directives
Frequencies (15,805 rows):
- EMAIL: 98.7% → learned prob >0.99 (shared by all directives)
- PHONE: 60.8% → learned prob >0.99 (shared by all directives)
- DATE/DOB: 56.7% → learned prob >0.99 (shared by all directives)
- SSN: 90.3% → learned prob >0.98 (shared by all directives)
- CREDIT_CARD: 90.3% → learned prob >0.98 (shared by all directives)
- 100% coverage: All rows with SSN/CREDIT_CARD in ground_truth also have them in allowed_bank
Expected Results (Bank Domain):
- STRICTLY (≥0.7): Utility = 1.0, Privacy = 1.0 ✓ Perfect match
- BALANCED (≥0.5): Utility = 1.0, Privacy = 1.0 ✓ Perfect match
- ACCURATELY (≤0.3): Utility = 1.0, Privacy = 1.0 ✓ Perfect match
python pipeline/train.py \
--algorithm grpo \
--dataset 690-Project-Dataset-final.csv \
--num_iters 300 \
--batch_size 64 \
--output_dir models

Model Output: models/{algorithm}_model.pt
python pipeline/test.py \
--algorithm grpo \
--model models/grpo_model.pt \
--directive balanced \
--get-regex

Note: Utility and privacy are calculated from the model's derived regex pattern (when all PII is present), NOT from the dataset. The dataset is only used during training for reward computation.