docs: add dp notebook #253
Merged
Commits (10)
6e3599d docs: add 101 and differential-privacy tutorial notebooks (nina-xu)
a61b189 add gpu type to runtime (nina-xu)
eb47c5b fix import (nina-xu)
9437935 add citation & disclaimer; remove 101 from my changes (nina-xu)
9a26a0b feedback (nina-xu)
d2d1694 nit format (nina-xu)
2451db4 copilot nit comments (nina-xu)
ebe2cf7 nit grammar (nina-xu)
2c138d1 Merge branch 'main' into nina/docs/102-dp-notebook (nina-xu)
ee84007 update installation command (nina-xu)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,258 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "d1d7a7a3", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "\n", | ||
| "# 🔐 NeMo Safe Synthesizer Tutorial: Differential Privacy\n", | ||
| "\n", | ||
| "Learn how to apply differential privacy (DP) to protect training data with formal, mathematical privacy guarantees. This tutorial demonstrates how to configure the DP parameters and the training and generation settings that work well with DP. On an A100, this notebook takes about 1 hour to run.\n", | ||
| "\n", | ||
| "If you have not yet completed the [Safe Synthesizer 101](safe-synthesizer-101.ipynb) tutorial, consider starting there first.\n", | ||
| "\n", | ||
| "### 🖥️ Prerequisites\n", | ||
| "\n", | ||
| "This notebook requires a GPU: an A100 at minimum, with an H100 recommended." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "d501f043", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### ⚡ Install Safe Synthesizer\n", | ||
| "\n", | ||
| "Run the cell below to install NeMo Safe Synthesizer (with the `engine` and `cu128` extras, for CUDA 12.8) and kagglehub, which is used to download the example dataset." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "bb7b0bdd", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "%%capture\n", | ||
| "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n", | ||
| "# SPDX-License-Identifier: Apache-2.0\n", | ||
| "\n", | ||
| "!uv pip install \"nemo-safe-synthesizer[engine,cu128]\" --index https://flashinfer.ai/whl/cu128 --index https://download.pytorch.org/whl/cu128 --index-strategy unsafe-best-match\n", | ||
| "!uv pip install kagglehub\n" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "3030139c", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### 🔑 Set the inference API key for PII column classification\n", | ||
| "\n", | ||
| "NeMo Safe Synthesizer uses an LLM-based column classifier to automatically infer PII columns. To enable this feature, set `NSS_INFERENCE_KEY`; the inference endpoint defaults to the NVIDIA integrate URL. You can obtain an API key from [build.nvidia.com](https://build.nvidia.com/settings/api-keys). Setting this value is optional but strongly recommended." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "693620c8", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "import os\n", | ||
| "import getpass\n", | ||
| "\n", | ||
| "# Setting NSS_INFERENCE_KEY is optional but strongly recommended for PII replacement.\n", | ||
| "if \"NSS_INFERENCE_KEY\" not in os.environ:\n", | ||
| " os.environ[\"NSS_INFERENCE_KEY\"] = getpass.getpass(\"Paste inference API key (or press Enter to skip): \")\n", | ||
| "if os.environ.get(\"NSS_INFERENCE_KEY\"):\n", | ||
| " print(\"NSS_INFERENCE_KEY is set\")\n", | ||
| "else:\n", | ||
| " print(\n", | ||
| " \"NSS_INFERENCE_KEY is not set. PII replacement will run in degraded mode. \"\n", | ||
| " \"We strongly recommend setting a key.\"\n", | ||
| " )" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "bdb29834", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### 📥 Load and preview sample dataset\n", | ||
| "\n", | ||
| "Load a tabular dataset (in this example, the [US Accidents dataset](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents) from Kaggle) and preview the first few rows. Only a random subset is passed to NeMo Safe Synthesizer to keep the runtime manageable.\n", | ||
| "\n", | ||
| "This dataset includes text, categorical, and numeric fields, making it a good demonstration of the model's ability to handle multiple data types.\n", | ||
| "\n", | ||
| "The code below also computes a recommended `delta` for differential privacy. Delta should reflect the full dataset size, not the subset, because it bounds the probability of a privacy breach across the entire population. See [Differential Privacy](../user-guide/configuration.md#differential-privacy) for parameter guidance.\n", | ||
| "\n", | ||
| "> Dataset citations:\n", | ||
| ">\n", | ||
| "> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. \"A Countrywide Traffic Accident Dataset\", 2019.\n", | ||
| "> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. \"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.\n", | ||
| ">\n", | ||
| "> Each user is responsible for checking the content of the dataset and the applicable licenses, and for determining whether the dataset is suitable for the intended use." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "3d456d9b", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "import pandas as pd\n", | ||
| "import kagglehub\n", | ||
| "\n", | ||
| "path = kagglehub.dataset_download(\"sobhanmoosavi/us-accidents\")\n", | ||
| "print(\"Path to dataset files:\", path)\n", | ||
| "df = pd.read_csv(f\"{path}/US_Accidents_March23.csv\", index_col=0)\n", | ||
| "full_data_size = len(df)\n", | ||
| "recommended_delta = 1 / (full_data_size ** 2) # delta should reflect the full dataset, even when a subset is used as Safe Synthesizer input\n", | ||
| "\n", | ||
| "print(f\"Full dataset size: {len(df)} records\")\n", | ||
| "print(f\"Recommended delta: {recommended_delta:.2e}\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "63f5ce95", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# use a subset as Safe Synthesizer input for faster runtime\n", | ||
| "df = df.sample(n=26250, random_state=318)\n", | ||
| "print(f\"Input dataset size: {len(df)} records\")\n", | ||
| "df.head()" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "1c394bab", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### ⚙️ Create and run Safe Synthesizer job\n", | ||
| "\n", | ||
| "Create the Safe Synthesizer builder and attach your DataFrame. Enable differential privacy and configure the training and generation stages for optimal performance with DP.\n", | ||
| "\n", | ||
| "Run the pipeline with `run()`, which performs data processing, PII replacement, training, generation, evaluation and saving of results in a single call. Results are available on `builder.results`." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "3bbde286", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer\n", | ||
| "\n", | ||
| "builder = (\n", | ||
| " SafeSynthesizer()\n", | ||
| " .with_data_source(df) # Add .with_replace_pii(enable=False) to the chain to disable PII replacement\n", | ||
| " .with_differential_privacy(dp_enabled=True, delta=recommended_delta)\n", | ||
| " .with_train(batch_size=16) # Override the default batch size of 1, which is designed for non-DP training\n", | ||
| " .with_generate(use_structured_generation=True) # Improves the percentage of valid records when DP is enabled\n", | ||
| ")\n", | ||
| "builder.run()\n", | ||
| "results = builder.results" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "e88f0213", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### 📤 Retrieve synthetic data\n", | ||
| "\n", | ||
| "Inspect the generated synthetic data, including the row count and a preview of the first rows." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "5a7a48d2", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "synth = results.synthetic_data\n", | ||
| "print(f\"Number of synthetic rows: {len(synth)}\")\n", | ||
| "synth.head()" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "d2b842ad", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Synthetic data and evaluation report are automatically saved to the artifacts directory\n", | ||
| "print(f\"Artifacts automatically saved to: {builder._workdir.generate.path}\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "75ec6da5", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### 🛡️ Review evaluation report\n", | ||
| "\n", | ||
| "The pipeline computes both quality and privacy metrics. The summary includes timing information and overall scores, while the full evaluation report is rendered as an HTML document." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "e121493f", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "import json\n", | ||
| "\n", | ||
| "print(\"Summary (timing and scores):\")\n", | ||
| "print(json.dumps(results.summary.model_dump(), indent=2))" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "a42bef2c", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# View the evaluation report in a sandboxed iframe\n", | ||
| "import base64\n", | ||
| "from IPython.display import IFrame, display\n", | ||
| "\n", | ||
| "report_html = results.evaluation_report_html\n", | ||
| "if report_html:\n", | ||
| " data_url = \"data:text/html;base64,\" + base64.b64encode(report_html.encode()).decode()\n", | ||
| " display(IFrame(src=data_url, width=\"100%\", height=800))" | ||
| ] | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": ".venv", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "codemirror_mode": { | ||
| "name": "ipython", | ||
| "version": 3 | ||
| }, | ||
| "file_extension": ".py", | ||
| "mimetype": "text/x-python", | ||
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3", | ||
| "version": "3.11.13" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 5 | ||
| } | ||
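
The notebook derives its recommended `delta` as `1 / n**2`, where `n` is the size of the full dataset rather than the sampled subset. The sketch below isolates that arithmetic so the effect of dataset size is easy to see. The helper name `recommended_delta` and the sample sizes are assumptions made for illustration; they are not part of the NeMo Safe Synthesizer API, and the notebook itself uses `len(df)` of the full CSV.

```python
def recommended_delta(full_data_size: int) -> float:
    """Return 1 / n**2, the delta heuristic computed in the notebook.

    Hypothetical helper for illustration; not a NeMo Safe Synthesizer function.
    """
    if full_data_size <= 0:
        raise ValueError("full_data_size must be a positive integer")
    return 1.0 / (full_data_size ** 2)


# Illustrative sizes only: the subset size used in the notebook, and a
# made-up larger "full dataset" size for comparison.
for n in (26_250, 1_000_000):
    print(f"n={n:>9,}  delta={recommended_delta(n):.2e}")
```

Because delta shrinks quadratically with `n`, computing it from the subset instead of the full dataset would yield a larger, weaker value than the one the notebook recommends.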