NVIDIA-NeMo · nina-xu · Mar 31, 2026 · Mar 24, 2026
@@ -0,0 +1,258 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "id": "d1d7a7a3",
+      "metadata": {},
+      "source": [
+        "\n",
+        "# 🔐 NeMo Safe Synthesizer Tutorial: Differential Privacy\n",
+        "\n",
+        "Learn how to apply differential privacy (DP), which gives a mathematical guarantee that bounds how much any single training record can influence the synthetic output. This tutorial demonstrates how to configure the DP parameters, including `delta` and the training batch size. The runtime of this notebook is about 1 hour on an A100.\n",
+        "\n",
+        "If you have not yet completed the [Safe Synthesizer 101](safe-synthesizer-101.ipynb) tutorial, consider starting there first.\n",
+        "\n",
+        "### 🖥️ Prerequisites\n",
+        "\n",
+        "This notebook requires a GPU. We recommend an H100; an A100 is the minimum."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "d501f043",
+      "metadata": {},
+      "source": [
+        "### ⚡ Install Safe Synthesizer\n",
+        "\n",
+        "Run the cell below to install NeMo Safe Synthesizer with the `engine` and `cu128` (CUDA 12.8) extras, plus `kagglehub` for the example dataset."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "bb7b0bdd",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "%%capture\n",
+        "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n",
+        "# SPDX-License-Identifier: Apache-2.0\n",
+        "\n",
+        "!uv pip install \"nemo-safe-synthesizer[engine,cu128]\" --index https://flashinfer.ai/whl/cu128 --index https://download.pytorch.org/whl/cu128 --index-strategy unsafe-best-match\n",
+        "!uv pip install kagglehub\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "3030139c",
+      "metadata": {},
+      "source": [
+        "### 🔑 Set the inference API key for PII column classification\n",
+        "\n",
+        "NeMo Safe Synthesizer uses an LLM-based column classifier to automatically infer PII columns. To enable this feature, set `NSS_INFERENCE_KEY`. The inference endpoint defaults to the NVIDIA integrate URL, and you can obtain an API key from [build.nvidia.com](https://build.nvidia.com/settings/api-keys). Setting this value is optional but strongly recommended."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "693620c8",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "import getpass\n",
+        "\n",
+        "# Setting NSS_INFERENCE_KEY is optional but strongly recommended for PII replacement.\n",
+        "if \"NSS_INFERENCE_KEY\" not in os.environ:\n",
+        "    os.environ[\"NSS_INFERENCE_KEY\"] = getpass.getpass(\"Paste inference API key (or press Enter to skip): \")\n",
+        "if os.environ.get(\"NSS_INFERENCE_KEY\"):\n",
+        "    print(\"NSS_INFERENCE_KEY is set\")\n",
+        "else:\n",
+        "    print(\n",
+        "        \"NSS_INFERENCE_KEY is not set. PII replacement will run in degraded mode. \"\n",
+        "        \"We strongly recommend setting a key.\"\n",
+        "    )"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "bdb29834",
+      "metadata": {},
+      "source": [
+        "### 📥 Load and preview sample dataset\n",
+        "\n",
+        "Load a tabular dataset (in this example, the [US Accidents dataset](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents) from Kaggle) and preview the first few rows. The notebook passes a random subset of 26,250 records to NeMo Safe Synthesizer to keep runtime manageable.\n",
+        "\n",
+        "This dataset includes text, categorical, and numeric fields, making it a good demonstration of the model's ability to handle multiple data types.\n",
+        "\n",
+        "The code below also computes a recommended `delta` for differential privacy as `1 / n^2`, where `n` is the number of records in the full dataset. `delta` should reflect the full dataset size, not the subset, because it bounds the probability of a privacy breach across the entire population. See [Differential Privacy](../user-guide/configuration.md#differential-privacy) for parameter guidance.\n",
+        "\n",
+        "> Dataset citations:\n",
+        ">\n",
+        "> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. \"A Countrywide Traffic Accident Dataset.\" 2019.\n",
+        "> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. \"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.\n",
+        ">\n",
+        "> Each user is responsible for checking the content of the dataset and the applicable licenses, and for determining whether the dataset is suitable for the intended use."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "3d456d9b",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import pandas as pd\n",
+        "import kagglehub\n",
+        "\n",
+        "path = kagglehub.dataset_download(\"sobhanmoosavi/us-accidents\")\n",
+        "print(\"Path to dataset files:\", path)\n",
+        "df = pd.read_csv(f\"{path}/US_Accidents_March23.csv\", index_col=0)\n",
+        "full_data_size = len(df)\n",
+        "recommended_delta = 1 / (full_data_size ** 2)  # delta should reflect the full dataset, even when a subset is used as Safe Synthesizer input\n",
+        "\n",
+        "print(f\"Full dataset size: {len(df)} records\")\n",
+        "print(f\"Recommended delta: {recommended_delta:.2e}\")"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "63f5ce95",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Use a subset as Safe Synthesizer input for faster runtime\n",
+        "df = df.sample(n=26250, random_state=318)\n",
+        "print(f\"Input dataset size: {len(df)} records\")\n",
+        "df.head()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "1c394bab",
+      "metadata": {},
+      "source": [
+        "### ⚙️ Create and run Safe Synthesizer job\n",
+        "\n",
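+        "Create the Safe Synthesizer builder and attach your DataFrame. Enable differential privacy with the recommended `delta`, raise the training batch size to 16 (the default of 1 is designed for non-DP training), and turn on structured generation, which improves the percentage of valid records when DP is enabled.\n",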
+        "\n",
+        "Run the pipeline with `run()`, which performs data processing, PII replacement, training, generation, evaluation, and saving of results in a single call. Results are available on `builder.results`."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "3bbde286",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer\n",
+        "\n",
+        "builder = (\n",
+        "    SafeSynthesizer()\n",
+        "    .with_data_source(df)  # .with_replace_pii(enable=False) to disable PII replacement\n",
+        "    .with_differential_privacy(dp_enabled=True, delta=recommended_delta)\n",
+        "    .with_train(batch_size=16)  # Override the default batch size of 1, which is designed for non-DP training\n",
+        "    .with_generate(use_structured_generation=True)  # Improves the percentage of valid records when DP is enabled\n",
+        ")\n",
+        "builder.run()\n",
+        "results = builder.results"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "e88f0213",
+      "metadata": {},
+      "source": [
+        "### 📤 Retrieve synthetic data\n",
+        "\n",
+        "Inspect the generated synthetic data, including the row count and a preview of the first rows."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "5a7a48d2",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "synth = results.synthetic_data\n",
+        "print(f\"Number of synthetic rows: {len(synth)}\")\n",
+        "synth.head()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "d2b842ad",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Synthetic data and evaluation report are automatically saved to the artifacts directory\n",
+        "print(f\"Artifacts automatically saved to: {builder._workdir.generate.path}\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "75ec6da5",
+      "metadata": {},
+      "source": [
+        "### 🛡️ Review evaluation report\n",
+        "\n",
+        "The pipeline computes both quality and privacy metrics. The summary includes timing information and overall scores, while the full evaluation report is rendered as an HTML document."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "e121493f",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import json\n",
+        "\n",
+        "print(\"Summary (timing and scores):\")\n",
+        "print(json.dumps(results.summary.model_dump(), indent=2))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "a42bef2c",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# View the evaluation report in a sandboxed iframe\n",
+        "import base64\n",
+        "from IPython.display import IFrame, display\n",
+        "\n",
+        "report_html = results.evaluation_report_html\n",
+        "if report_html:\n",
+        "    data_url = \"data:text/html;base64,\" + base64.b64encode(report_html.encode()).decode()\n",
+        "    display(IFrame(src=data_url, width=\"100%\", height=800))"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": ".venv",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.11.13"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}
@@ -7,7 +7,8 @@ Interactive Jupyter notebook tutorials for NeMo Safe Synthesizer.
 
 ## Available Tutorials
 
-_No tutorials have been added yet._
+- [Safe Synthesizer 101](safe-synthesizer-101.ipynb) -- learn the fundamentals
+- [Differential Privacy](differential-privacy.ipynb) -- enable differential privacy guarantees
 
 ## Adding a Tutorial
 

@@ -158,6 +158,8 @@ nav:
       - Evaluation: product-overview/evaluation.md
   - Tutorials:
       - tutorials/index.md
+      - Safe Synthesizer 101: tutorials/safe-synthesizer-101.ipynb
+      - Differential Privacy: tutorials/differential-privacy.ipynb
   - User Guide:
       - Getting Started: user-guide/getting-started.md
       - Running Safe Synthesizer: user-guide/running.md