258 changes: 258 additions & 0 deletions docs/tutorials/differential-privacy.ipynb
@@ -0,0 +1,258 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d1d7a7a3",
"metadata": {},
"source": [
"\n",
"# 🔐 NeMo Safe Synthesizer Tutorial: Differential Privacy\n",
"\n",
"Learn how to apply differential privacy to achieve the maximum level of privacy with mathematical guarantees. This tutorial demonstrates how to configure differential privacy parameters for optimal results. The runtime of this notebook is about 1 hour on an A100.\n",
"\n",
"If you have not yet completed the [Safe Synthesizer 101](safe-synthesizer-101.ipynb) tutorial, consider starting there first.\n",
"\n",
"### 🖥️ Prerequisites\n",
"\n",
"This notebook requires a GPU. We recommend an H100; minimum A100."
]
},
{
"cell_type": "markdown",
"id": "d501f043",
"metadata": {},
"source": [
"### ⚡ Install Safe Synthesizer\n",
"\n",
"Run the cell below to install NeMo Safe Synthesizer (engine and CUDA 12.8) and kagglehub for the example dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bb7b0bdd",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n",
"# SPDX-License-Identifier: Apache-2.0\n",
"\n",
"!uv pip install \\\"nemo-safe-synthesizer[engine,cu128]\\\" --index https://flashinfer.ai/whl/cu128 --index https://download.pytorch.org/whl/cu128 --index-strategy unsafe-best-match\n",
"!uv pip install kagglehub\n"
]
},
{
"cell_type": "markdown",
"id": "3030139c",
"metadata": {},
"source": [
"### 🔑 Set the inference API key for PII column classification\n",
"\n",
"NeMo Safe Synthesizer uses an LLM‑based column classifier to automatically infer PII columns. To enable this feature, set `NSS_INFERENCE_KEY` (the inference endpoint defaults to the NVIDIA integrate URL. You can obtain an API key from [build.nvidia.com](https://build.nvidia.com/settings/api-keys)). Setting this value is optional but strongly recommended."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "693620c8",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import getpass\n",
"\n",
"# Setting NSS_INFERENCE_KEY is optional but strongly recommended for PII replacement.\n",
"if \"NSS_INFERENCE_KEY\" not in os.environ:\n",
" os.environ[\"NSS_INFERENCE_KEY\"] = getpass.getpass(\"Paste inference API key (or press Enter to skip): \")\n",
"if os.environ.get(\"NSS_INFERENCE_KEY\"):\n",
" print(\"NSS_INFERENCE_KEY is set\")\n",
"else:\n",
" print(\n",
" \"NSS_INFERENCE_KEY is not set. Replace PII will run in degraded mode. \"\n",
" \"We strongly recommend setting a key.\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "bdb29834",
"metadata": {},
"source": [
"### 📥 Load and preview sample dataset\n",
"\n",
"Load a tabular dataset—in this example, the [US Accidents dataset](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents) from Kaggle—and preview the first few rows. NeMo Safe Synthesizer will use a subset to keep runtime manageable.\n",
"\n",
"This dataset includes text, categorical, and numeric fields, making it a good demonstration of the model's ability to handle multiple data types.\n",
"\n",
"The code below also computes a recommended `delta` for differential privacy. Delta should reflect the full dataset size, not the subset, because it bounds the probability of a privacy breach across the entire population. See [Differential Privacy](../user-guide/configuration.md#differential-privacy) for parameter guidance.\n",
"\n",
"> Dataset citations:\n",
">\n",
"> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. \"A Countrywide Traffic Accident Dataset.\", 2019.\n",
"> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. \"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.\n",
">\n",
 Each user is responsible">
"> Each user is responsible for checking the content of the dataset and the applicable licenses, and for determining whether the dataset is suitable for the intended use."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d456d9b",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import kagglehub\n",
"\n",
"path = kagglehub.dataset_download(\"sobhanmoosavi/us-accidents\")\n",
"print(\"Path to dataset files:\", path)\n",
"df = pd.read_csv(f\"{path}/US_Accidents_March23.csv\", index_col=0)\n",
"full_data_size = len(df)\n",
"recommended_delta = 1 / (full_data_size ** 2) # delta should reflect the full dataset, even when a subset is used as Safe Synthesizer input\n",
"\n",
"print(f\"Full dataset size: {len(df)} records\")\n",
"print(f\"Recommended delta: {recommended_delta:.2e}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63f5ce95",
"metadata": {},
"outputs": [],
"source": [
"# use a subset as Safe Synthesizer input for faster runtime\n",
"df = df.sample(n=26250, random_state=318)\n",
"print(f\"Input dataset size: {len(df)} records\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "1c394bab",
"metadata": {},
"source": [
"### ⚙️ Create and run Safe Synthesizer job\n",
"\n",
"Create the Safe Synthesizer builder and attach your DataFrame. Enable differential privacy and configure the training and generation stages for optimal performance with DP.\n",
"\n",
"Run the pipeline with `run()`, which performs data processing, PII replacement, training, generation, evaluation and saving of results in a single call. Results are available on `builder.results`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3bbde286",
"metadata": {},
"outputs": [],
"source": [
"from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer\n",
"\n",
"builder = (\n",
" SafeSynthesizer()\n",
" .with_data_source(df) # .with_replace_pii(enable=False) to disable PII replacement\n",
" .with_differential_privacy(dp_enabled=True, delta=recommended_delta)\n",
" .with_train(batch_size=16) # Override the default batch size of 1, which is designed for non-DP training\n",
" .with_generate(use_structured_generation=True) # Improves the percentage of valid records when DP is enabled\n",
")\n",
"builder.run()\n",
"results = builder.results"
]
},
{
"cell_type": "markdown",
"id": "e88f0213",
"metadata": {},
"source": [
"### 📤 Retrieve synthetic data\n",
"\n",
"Inspect the generated synthetic data including row count and preview of the first rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5a7a48d2",
"metadata": {},
"outputs": [],
"source": [
"synth = results.synthetic_data\n",
"print(f\"Number of synthetic rows: {len(synth)}\")\n",
"synth.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2b842ad",
"metadata": {},
"outputs": [],
"source": [
"# Synthetic data and evaluation report are automatically saved to the artifacts directory\n",
"print(f\"Artifacts automatically saved to: {builder._workdir.generate.path}\")"
]
},
{
"cell_type": "markdown",
"id": "75ec6da5",
"metadata": {},
"source": [
"### 🛡️ Review evaluation report\n",
"\n",
"The pipeline computes both quality and privacy metrics. The summary includes timing information and overall scores, while the full evaluation report is rendered as an HTML document."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e121493f",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"print(\"Summary (timing and scores):\")\n",
"print(json.dumps(results.summary.model_dump(), indent=2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a42bef2c",
"metadata": {},
"outputs": [],
"source": [
"# View the evaluation report in a sandboxed iframe\n",
"import base64\n",
"from IPython.display import IFrame, display\n",
"\n",
"report_html = results.evaluation_report_html\n",
"if report_html:\n",
" data_url = \"data:text/html;base64,\" + base64.b64encode(report_html.encode()).decode()\n",
" display(IFrame(src=data_url, width=\"100%\", height=800))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
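A note on the `recommended_delta` computation in the notebook above: the following is a minimal standalone sketch of the same `1 / n**2` heuristic, showing why the full dataset size matters. The helper name `recommended_delta` and the example size `1_000_000` are assumptions made for illustration; they are not part of the Safe Synthesizer API or of the US Accidents dataset.

```python
def recommended_delta(full_data_size: int) -> float:
    """Return 1 / n**2 for a dataset of n records (the heuristic used in the notebook)."""
    if full_data_size <= 0:
        raise ValueError("full_data_size must be a positive integer")
    return 1.0 / (full_data_size ** 2)


# Hypothetical full dataset size, for illustration only
full_size = 1_000_000
# Subset size sampled in the notebook
subset_size = 26_250

delta_full = recommended_delta(full_size)
delta_subset = recommended_delta(subset_size)

print(f"delta from full dataset: {delta_full:.2e}")
print(f"delta from subset:       {delta_subset:.2e}")
```

Under this heuristic, computing delta from the subset gives a larger value, and therefore a weaker guarantee, than computing it from the full dataset. That is consistent with the notebook computing `recommended_delta` before calling `df.sample(...)`.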
3 changes: 2 additions & 1 deletion docs/tutorials/index.md
@@ -7,7 +7,8 @@ Interactive Jupyter notebook tutorials for NeMo Safe Synthesizer.

## Available Tutorials

_No tutorials have been added yet._
- [Safe Synthesizer 101](safe-synthesizer-101.ipynb) -- learn the fundamentals
- [Differential Privacy](differential-privacy.ipynb) -- enable differential privacy guarantees

## Adding a Tutorial

2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -158,6 +158,8 @@ nav:
- Evaluation: product-overview/evaluation.md
- Tutorials:
- tutorials/index.md
- Safe Synthesizer 101: tutorials/safe-synthesizer-101.ipynb
- Differential Privacy: tutorials/differential-privacy.ipynb
- User Guide:
- Getting Started: user-guide/getting-started.md
- Running Safe Synthesizer: user-guide/running.md