Last updated on March 1, 2026.
A Python package for differentially private linear regression and synthetic data generation using the Binning-Aggregation framework under Gaussian differential privacy (GDP).
This package implements the algorithms from the paper and may be expanded with additional functionality in the near future. Please use the command below to obtain the latest version.
It is based on the paper:
Lin, S., Slavković, A., & Bhoomireddy, D. R. (2026). Differentially private linear regression and synthetic data generation with statistical guarantees. In Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS). Proceedings of Machine Learning Research.
If you use this package, please cite:
@inproceedings{lin2026differentially,
title = {Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees},
author = {Lin, Shurong and Slavkovi{\'c}, Aleksandra and Bhoomireddy, Deekshith Reddy},
booktitle = {Proceedings of the 29th International Conference on Artificial Intelligence and Statistics},
series = {Proceedings of Machine Learning Research},
year = {2026},
publisher = {PMLR}
}
Based on the method from the paper:
- Binning-Aggregation: Differentially private data binning followed by aggregation and privatization
- DP Linear Regression: Bias-corrected weighted least squares with valid confidence intervals
- DP Synthetic Data Generation: Generate privacy-preserving synthetic datasets
- GDP Privacy Accounting: Tight composition using Gaussian Differential Privacy
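# Install the latest version from GitHub
pip install git+https://github.com/shuronglin/binagg.git
# Upgrade an existing installation to the latest version
pip uninstall binagg -y && pip install git+https://github.com/shuronglin/binagg.git
# Clone the repository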
git clone https://github.com/shuronglin/binagg.git
cd binagg
# Install in development mode
pip install -e .
# Or install with development dependencies
pip install -e ".[dev]"
# Or install by package name
pip install binagg
Requirements:
- Python >= 3.9
- NumPy >= 1.20
- SciPy >= 1.7
For detailed tutorials, see the examples/ folder, which includes both real data and simulated data examples.
import numpy as np
from binagg import dp_linear_regression
# Generate sample data
np.random.seed(42)
n, d = 500, 3
X = np.random.uniform(0, 10, (n, d))
true_beta = np.array([1.5, -2.0, 0.5])
y = X @ true_beta + np.random.normal(0, 1, n)
# Define public domain bounds (required for DP, must be specified by analyst)
# These should be known a priori or privately computed from the sensitive data
x_bounds = [(0, 10), (0, 10), (0, 10)] # Known domain for each feature
y_bounds = (-30, 30) # Known range for target variable
# Run DP regression with μ=1.0 privacy budget
result = dp_linear_regression(
X, y, x_bounds, y_bounds,
mu=1.0, # Privacy budget (μ-GDP)
alpha=0.05, # 95% confidence intervals
random_state=42
)
# Results
print("Coefficients:", result.coefficients)
print("Standard Errors:", result.standard_errors)
print("95% CI:", result.confidence_intervals)
print(f"Number of bins: {result.n_bins}")
from binagg import generate_synthetic_data
# Generate synthetic data
syn_result = generate_synthetic_data(
X, y, x_bounds, y_bounds,
mu=1.0,
random_state=42
)
print(f"Generated {syn_result.n_samples} synthetic samples")
print(f"Synthetic X shape: {syn_result.X_synthetic.shape}")
print(f"Synthetic y shape: {syn_result.y_synthetic.shape}")
# Use synthetic data for downstream analysis
X_syn = syn_result.X_synthetic
y_syn = syn_result.y_synthetic
from binagg import (
delta_from_gdp,
eps_from_mu_delta,
mu_from_eps_delta,
compose_gdp
)
# μ-GDP to (ε, δ)-DP: given μ and ε, compute δ
delta = delta_from_gdp(mu=1.0, eps=2.0)
print(f"(μ=1.0, ε=2.0) → δ={delta:.6f}")
# μ-GDP to (ε, δ)-DP: given μ and δ, compute ε
eps = eps_from_mu_delta(mu=1.0, delta=1e-5)
print(f"(μ=1.0, δ=1e-5) → ε={eps:.2f}")
# (ε, δ)-DP to μ-GDP
mu = mu_from_eps_delta(eps=1.0, delta=1e-5)
print(f"(ε=1.0, δ=1e-5) → μ={mu:.2f}")
# Compose multiple mechanisms
total_mu = compose_gdp(0.5, 0.5, 0.5, 0.5) # Four mechanisms
print(f"Composed privacy: μ={total_mu:.2f}")
dp_linear_regression: Performs differentially private linear regression with bias correction.
Parameters:
- X: Feature matrix of shape (n, d)
- y: Label vector of shape (n,)
- x_bounds: Per-feature bounds as [(L_1, U_1), ..., (L_d, U_d)] - must be specified by the analyst, not computed directly from the data
- y_bounds: Bounds on y as (y_min, y_max)
- mu: Total privacy budget in μ-GDP
- theta: PrivTree splitting threshold (default: 0)
- alpha: Significance level for confidence intervals (default: 0.05 for 95% CI)
- budget_ratios: Privacy budget ratios for (binning, count, sum_x, sum_y) (default: (1, 3, 3, 3))
- min_count: Minimum noisy count to keep a bin (default: 2)
- clip: Whether to clip input data to bounds (default: True)
- return_synthetic: If True, also return synthetic data using the same privacy budget (default: False)
- clip_synthetic_output: Whether to clip synthetic output to bounds, only used when return_synthetic=True (default: False)
- preserve_sample_size: If True, rescale noisy counts so the total equals the original sample size n (default: True)
- random_state: Random seed for reproducibility
Returns:
- If return_synthetic=False: DPRegressionResult with coefficients, standard_errors, confidence_intervals, n_bins
- If return_synthetic=True: Tuple of (DPRegressionResult, SyntheticDataResult) - both share the same privacy budget
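The link between alpha and the width of a reported interval can be illustrated with a plain normal-approximation interval. This is only a sketch using the standard library: `normal_ci` is a hypothetical helper, the estimate and standard error are made-up numbers, and the package's own intervals (which account for privacy noise and bias correction) may be constructed differently.

```python
from statistics import NormalDist


def normal_ci(estimate, std_error, alpha=0.05):
    """Two-sided (1 - alpha) normal-approximation interval (illustrative only)."""
    # Critical value z_{1 - alpha/2}; about 1.96 for alpha = 0.05
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical coefficient estimate and standard error
lower, upper = normal_ci(1.5, 0.2, alpha=0.05)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

A smaller alpha gives a larger critical value and therefore a wider interval.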
generate_synthetic_data: Generates differentially private synthetic data that preserves the joint (X, y) distribution.
Parameters:
- X: Feature matrix of shape (n, d)
- y: Label vector of shape (n,)
- x_bounds: Per-feature bounds as [(L_1, U_1), ..., (L_d, U_d)]
- y_bounds: Bounds on y as (y_min, y_max)
- mu: Total privacy budget in μ-GDP
- theta: PrivTree splitting threshold (default: 0)
- budget_ratios: Privacy budget ratios for (binning, count, sum_x, sum_y) (default: (1, 3, 3, 3))
- min_count: Minimum noisy count to generate samples from a bin (default: 2)
- clip: Whether to clip input data to bounds (default: True)
- clip_output: Whether to clip synthetic output data to bounds (default: False)
- preserve_sample_size: If True, rescale noisy counts so total synthetic samples equals original n (default: True)
- random_state: Random seed for reproducibility
Returns: SyntheticDataResult with:
- X_synthetic: Synthetic features
- y_synthetic: Synthetic targets
- n_samples: Number of samples generated
- n_bins_used: Number of bins used for generation
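The effect of min_count and preserve_sample_size on noisy bin counts can be sketched as follows. `postprocess_counts` is a hypothetical helper, and the order of operations (filter first, then rescale) and the lack of rounding are assumptions, not a description of the package's internals.

```python
def postprocess_counts(noisy_counts, n, min_count=2, preserve_sample_size=True):
    """Drop bins with small noisy counts, then optionally rescale to total n.

    Hypothetical helper: filtering before rescaling is an assumption.
    """
    # Keep only bins whose noisy count reaches the threshold
    kept = [c for c in noisy_counts if c >= min_count]
    if preserve_sample_size and kept:
        # Rescale so the kept counts sum to the original sample size
        scale = n / sum(kept)
        kept = [c * scale for c in kept]
    return kept


# Made-up noisy counts for four bins; the second falls below min_count
counts = postprocess_counts([120.4, 0.7, 251.9, 130.0], n=500)
print([round(c, 1) for c in counts], round(sum(counts), 1))
```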
Private binning using the PrivTree algorithm.
Add calibrated noise to bin aggregates.
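As a rough illustration of what "calibrated" means under GDP: releasing a statistic with L2 sensitivity Δ plus Gaussian noise of standard deviation Δ/μ satisfies μ-GDP. The sketch below applies this to a single bin count. `gaussian_mechanism` is a hypothetical helper, and a sensitivity of 1 for the count is an assumption; the package's own sensitivities for counts and sums depend on its bounds and neighboring-dataset definition.

```python
import random


def gaussian_mechanism(value, sensitivity, mu, rng):
    """Release value + N(0, (sensitivity / mu)^2), which satisfies mu-GDP."""
    sigma = sensitivity / mu  # smaller mu means more noise
    return value + rng.gauss(0.0, sigma)


rng = random.Random(0)
true_count = 42  # made-up number of points in one bin
noisy_count = gaussian_mechanism(true_count, sensitivity=1.0, mu=0.5, rng=rng)
print(f"noise scale: {1.0 / 0.5}, noisy count: {noisy_count:.2f}")
```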
- delta_from_gdp(mu, eps): μ-GDP → (ε, δ)-DP, compute δ given μ and ε
- eps_from_mu_delta(mu, delta): μ-GDP → (ε, δ)-DP, compute ε given μ and δ
- mu_from_eps_delta(eps, delta): (ε, δ)-DP → μ-GDP
- compose_gdp(*mus): Compose multiple μ-GDP mechanisms
- allocate_budget(total_mu, ratios): Split budget by ratios
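For reference, the standard GDP relations behind conversions like those listed above can be written with the standard library alone. A μ-GDP mechanism is (ε, δ(ε))-DP with δ(ε) = Φ(−ε/μ + μ/2) − e^ε Φ(−ε/μ − μ/2), and μ-GDP mechanisms compose as the square root of the sum of squared μ values. The function names below are deliberately different from the package's and are illustrative only; this is a sketch of the formulas, not the package's implementation.

```python
import math


def _phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def delta_for(mu, eps):
    """delta(eps) implied by mu-GDP (standard GDP-to-(eps, delta) formula)."""
    return _phi(-eps / mu + mu / 2.0) - math.exp(eps) * _phi(-eps / mu - mu / 2.0)


def compose(*mus):
    """GDP composition: root of the sum of squared mu values."""
    return math.sqrt(sum(m * m for m in mus))


print(f"{delta_for(1.0, 2.0):.4f}")           # 0.0209
print(f"{compose(0.5, 0.5, 0.5, 0.5):.2f}")   # 1.00
```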
This package uses μ-GDP for privacy accounting. Smaller values of μ correspond to stronger privacy guarantees.
- μ ≤ 0.5: Strong privacy protection (higher noise, lower accuracy)
- 0.5 < μ ≤ 1.5: Moderate privacy protection
- μ > 1.5: Weaker privacy protection (lower noise, higher accuracy)
from binagg import delta_from_gdp
# For μ=1.0, what's δ at ε=1?
delta = delta_from_gdp(mu=1.0, eps=1.0)
# δ ≈ 0.13
# For μ=1.0, what's δ at ε=2?
delta = delta_from_gdp(mu=1.0, eps=2.0)
# δ ≈ 0.02
The default budget split (1, 3, 3, 3) allocates:
- 10% to binning (PrivTree)
- 30% to noisy counts
- 30% to noisy sum(X)
- 30% to noisy sum(y)
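The percentages above are simply the ratios divided by their sum. How each share maps to a per-step μ is not stated here; the sketch below assumes the shares apply to μ², so that the per-step values compose back to the total under GDP composition. `split_budget` is a hypothetical helper and may differ from what allocate_budget actually does.

```python
import math


def split_budget(total_mu, ratios):
    """Hypothetical split: step i receives share r_i / sum(r) of total_mu^2."""
    total = sum(ratios)
    shares = [r / total for r in ratios]
    # Assumption: shares apply to mu^2, so mu_i = total_mu * sqrt(share_i)
    mus = [total_mu * math.sqrt(s) for s in shares]
    return shares, mus


shares, mus = split_budget(1.0, (1, 3, 3, 3))
print([round(s, 2) for s in shares])                   # [0.1, 0.3, 0.3, 0.3]
print(round(math.sqrt(sum(m * m for m in mus)), 6))    # 1.0
```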
See the examples/ directory for complete tutorials:
- basic_regression.py: Simple DP regression example
- synthetic_data.py: Generating and using synthetic data
- privacy_accounting.py: Understanding privacy budgets
- real_data_example.py: Working with real datasets
# Run all tests
pytest tests/ -v
# Run specific test module
pytest tests/test_regression.py -v
# Run with coverage
pytest tests/ --cov=binagg
- Shurong Lin - Original algorithm implementation and paper author; package development and testing
MIT License - see LICENSE file for details.
Contributions welcome! Please open an issue or pull request on GitHub.