diff --git a/docs/tutorials/differential-privacy.ipynb b/docs/tutorials/differential-privacy.ipynb new file mode 100644 index 000000000..f3e200e71 --- /dev/null +++ b/docs/tutorials/differential-privacy.ipynb @@ -0,0 +1,258 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d1d7a7a3", + "metadata": {}, + "source": [ + "\n", + "# 🔐 NeMo Safe Synthesizer Tutorial: Differential Privacy\n", + "\n", + "Learn how to apply differential privacy to achieve strong privacy with mathematical guarantees. This tutorial demonstrates how to configure differential privacy parameters for optimal results. The runtime of this notebook is about 1 hour on an A100.\n", + "\n", + "If you have not yet completed the [Safe Synthesizer 101](safe-synthesizer-101.ipynb) tutorial, consider starting there first.\n", + "\n", + "### 🖥️ Prerequisites\n", + "\n", + "This notebook requires a GPU. We recommend an H100; minimum A100." + ] + }, + { + "cell_type": "markdown", + "id": "d501f043", + "metadata": {}, + "source": [ + "### ⚡ Install Safe Synthesizer\n", + "\n", + "Run the cell below to install NeMo Safe Synthesizer (engine and CUDA 12.8) and kagglehub for the example dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bb7b0bdd", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n", + "# SPDX-License-Identifier: Apache-2.0\n", + "\n", + "!uv pip install \"nemo-safe-synthesizer[engine,cu128]\" --index https://flashinfer.ai/whl/cu128 --index https://download.pytorch.org/whl/cu128 --index-strategy unsafe-best-match\n", + "!uv pip install kagglehub\n" + ] + }, + { + "cell_type": "markdown", + "id": "3030139c", + "metadata": {}, + "source": [ + "### 🔑 Set the inference API key for PII column classification\n", + "\n", + "NeMo Safe Synthesizer uses an LLM‑based column classifier to automatically infer PII columns. 
To enable this feature, set `NSS_INFERENCE_KEY`. By default, the inference endpoint is `https://integrate.api.nvidia.com/v1` (the NVIDIA integrate URL). You can obtain an API key from [build.nvidia.com](https://build.nvidia.com/settings/api-keys). Setting this value is optional but strongly recommended." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "693620c8", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import getpass\n", + "\n", + "# Setting NSS_INFERENCE_KEY is optional but strongly recommended for PII replacement.\n", + "if \"NSS_INFERENCE_KEY\" not in os.environ:\n", + " os.environ[\"NSS_INFERENCE_KEY\"] = getpass.getpass(\"Paste inference API key (or press Enter to skip): \")\n", + "if os.environ.get(\"NSS_INFERENCE_KEY\"):\n", + " print(\"NSS_INFERENCE_KEY is set\")\n", + "else:\n", + " print(\n", + " \"NSS_INFERENCE_KEY is not set. Replace PII will run in degraded mode. \"\n", + " \"We strongly recommend setting a key.\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "bdb29834", + "metadata": {}, + "source": [ + "### 📥 Load and preview sample dataset\n", + "\n", + "Load a tabular dataset (in this example, the [US Accidents dataset](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents) from Kaggle) and preview the first few rows. NeMo Safe Synthesizer will use a subset to keep runtime manageable.\n", + "\n", + "This dataset includes text, categorical, and numeric fields, all of which are supported by Safe Synthesizer.\n", + "\n", + "The code below also computes a recommended `delta` for differential privacy. Delta should reflect the full dataset size, not the subset, because it bounds the probability of a privacy breach across the entire population. 
See [Differential Privacy](../user-guide/configuration.md#differential-privacy) for parameter guidance.\n", + "\n", + "> Dataset citations:\n", + ">\n", + "> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. \"A Countrywide Traffic Accident Dataset.\", 2019.\n", + "> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. \"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.\n", + ">\n", + "> Each user is responsible for checking the content of the dataset and the applicable licenses and determining if it is suitable for the intended use." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d456d9b", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import kagglehub\n", + "\n", + "path = kagglehub.dataset_download(\"sobhanmoosavi/us-accidents\")\n", + "print(\"Path to dataset files:\", path)\n", + "df = pd.read_csv(f\"{path}/US_Accidents_March23.csv\", index_col=0)\n", + "full_data_size = len(df)\n", + "recommended_delta = 1 / (full_data_size ** 2) # delta should reflect the full dataset, even when a subset is used as Safe Synthesizer input\n", + "\n", + "print(f\"Full dataset size: {len(df)} records\")\n", + "print(f\"Recommended delta: {recommended_delta:.2e}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "63f5ce95", + "metadata": {}, + "outputs": [], + "source": [ + "# use a subset as Safe Synthesizer input for faster runtime\n", + "df = df.sample(n=26250, random_state=318)\n", + "print(f\"Input dataset size: {len(df)} records\")\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "1c394bab", + "metadata": {}, + "source": [ + "### ⚙️ Create and run Safe Synthesizer job\n", + "\n", + "Create the Safe Synthesizer 
builder and attach your DataFrame. Enable differential privacy and configure the training and generation stages for optimal performance with DP.\n", + "\n", + "Run the pipeline with `run()`, which performs data processing, PII replacement, training, generation, evaluation and saving of results in a single call. Results are available on `builder.results`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3bbde286", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer\n", + "\n", + "builder = (\n", + " SafeSynthesizer()\n", + " .with_data_source(df) # .with_replace_pii(enable=False) to disable PII replacement\n", + " .with_differential_privacy(dp_enabled=True, delta=recommended_delta)\n", + " .with_train(batch_size=16) # Override the default batch size of 1, which is designed for non-DP training\n", + " .with_generate(use_structured_generation=True) # Improves the percentage of valid records when DP is enabled\n", + ")\n", + "builder.run()\n", + "results = builder.results" + ] + }, + { + "cell_type": "markdown", + "id": "e88f0213", + "metadata": {}, + "source": [ + "### 📤 Retrieve synthetic data\n", + "\n", + "Inspect the generated synthetic data including row count and preview of the first rows." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5a7a48d2", + "metadata": {}, + "outputs": [], + "source": [ + "synth = results.synthetic_data\n", + "print(f\"Number of synthetic rows: {len(synth)}\")\n", + "synth.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d2b842ad", + "metadata": {}, + "outputs": [], + "source": [ + "# Synthetic data and evaluation report are automatically saved to the artifacts directory\n", + "print(f\"Artifacts automatically saved to: {builder._workdir.generate.path}\")" + ] + }, + { + "cell_type": "markdown", + "id": "75ec6da5", + "metadata": {}, + "source": [ + "### 🛡️ Review evaluation report\n", + "\n", + "The pipeline computes both quality and privacy metrics. The summary includes timing information and overall scores, while the full evaluation report is rendered as an HTML document." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e121493f", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "print(\"Summary (timing and scores):\")\n", + "print(json.dumps(results.summary.model_dump(), indent=2))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a42bef2c", + "metadata": {}, + "outputs": [], + "source": [ + "# View the evaluation report in a sandboxed iframe\n", + "import base64\n", + "from IPython.display import IFrame, display\n", + "\n", + "report_html = results.evaluation_report_html\n", + "if report_html:\n", + " data_url = \"data:text/html;base64,\" + base64.b64encode(report_html.encode()).decode()\n", + " display(IFrame(src=data_url, width=\"100%\", height=800))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": 
"ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md index 875aa72dc..5201d8d94 100644 --- a/docs/tutorials/index.md +++ b/docs/tutorials/index.md @@ -8,6 +8,7 @@ Interactive Jupyter notebook tutorials for NeMo Safe Synthesizer. ## Available Tutorials - [Safe Synthesizer 101](safe-synthesizer-101.ipynb) -- learn the fundamentals +- [Differential Privacy](differential-privacy.ipynb) -- enable differential privacy guarantees ## Adding a Tutorial diff --git a/mkdocs.yml b/mkdocs.yml index 66cbe204b..25b8a032d 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -158,7 +158,8 @@ nav: - Evaluation: product-overview/evaluation.md - Tutorials: - tutorials/index.md - - Safe Synthesizer 101: tutorials/safe-synthesizer-101.ipynb + - Safe Synthesizer 101: tutorials/safe-synthesizer-101.ipynb + - Differential Privacy: tutorials/differential-privacy.ipynb - User Guide: - Getting Started: user-guide/getting-started.md - Running Safe Synthesizer: user-guide/running.md