From 6e3599d25750bf4729192e7bcca6a89a1bf946ed Mon Sep 17 00:00:00 2001 From: nina-xu <19981858+nina-xu@users.noreply.github.com> Date: Tue, 24 Mar 2026 17:01:34 +0000 Subject: [PATCH 1/9] docs: add 101 and differential-privacy tutorial notebooks Add two interactive Jupyter notebook tutorials: - safe-synthesizer-101.ipynb: fundamentals of PII replacement, training, generation, and evaluation - differential-privacy.ipynb: configuring differential privacy with recommended delta and structured generation Display the HTML evaluation report in a sandboxed IFrame to prevent CSS/JS leakage into the notebook UI. Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com> Made-with: Cursor --- docs/tutorials/differential-privacy.ipynb | 257 ++++++++++++++++++++++ docs/tutorials/index.md | 3 +- docs/tutorials/safe-synthesizer-101.ipynb | 79 ++++--- mkdocs.yml | 2 + 4 files changed, 307 insertions(+), 34 deletions(-) create mode 100644 docs/tutorials/differential-privacy.ipynb diff --git a/docs/tutorials/differential-privacy.ipynb b/docs/tutorials/differential-privacy.ipynb new file mode 100644 index 000000000..ddbf26cc0 --- /dev/null +++ b/docs/tutorials/differential-privacy.ipynb @@ -0,0 +1,257 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "1b6d66e2", + "metadata": {}, + "source": [ + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "d1d7a7a3", + "metadata": {}, + "source": [ + "\n", + "# 🔐 NeMo Safe Synthesizer Tutorial: Differential Privacy\n", + "\n", + "Learn how to apply differential privacy to achieve the maximum level of privacy with mathematical guarantees. This tutorial demonstrates how to configure differential privacy parameters for optimal results. The runtime of this notebook is about 1 hour.\n", + "\n", + "If you have not yet completed the [Safe Synthesizer 101](safe-synthesizer-101.ipynb) tutorial, consider starting there first.\n", + "\n", + "### 🖥️ Prerequisites\n", + "\n", + "This notebook requires a GPU. 
We recommend an H100; minimum A100." + ] + }, + { + "cell_type": "markdown", + "id": "d501f043", + "metadata": {}, + "source": [ + "### ⚡ Install Safe Synthesizer\n", + "\n", + "Run the cell below to install NeMo Safe Synthesizer (engine and CUDA 12.8) and kagglehub for the example dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bb7b0bdd", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!uv pip install nemo-safe-synthesizer[engine,cu128] --extra-index-url \"https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi-local/simple\"\n", + "!uv pip install kagglehub\n" + ] + }, + { + "cell_type": "markdown", + "id": "3030139c", + "metadata": {}, + "source": [ + "### 🔑 Set the inference API key for PII column classification\n", + "\n", + "NeMo Safe Synthesizer uses an LLM‑based column classifier to automatically infer PII columns. To enable this feature, set `NSS_INFERENCE_KEY` (the inference endpoint defaults to the NVIDIA integrate URL. You can obtain an API key from [build.nvidia.com](https://build.nvidia.com/settings/api-keys)). Setting this value is optional but strongly recommended." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "693620c8", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import getpass\n", + "\n", + "# Setting NSS_INFERENCE_KEY is optional but strongly recommended for PII replacement.\n", + "if \"NSS_INFERENCE_KEY\" not in os.environ:\n", + " os.environ[\"NSS_INFERENCE_KEY\"] = getpass.getpass(\"Paste inference API key (or press Enter to skip): \")\n", + "if os.environ.get(\"NSS_INFERENCE_KEY\"):\n", + " print(\"NSS_INFERENCE_KEY is set\")\n", + "else:\n", + " print(\n", + " \"NSS_INFERENCE_KEY is not set. Replace PII will run in degraded mode. 
\"\n", + " \"We strongly recommend setting a key.\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "bdb29834", + "metadata": {}, + "source": [ + "### 📥 Load and preview sample dataset\n", + "\n", + "Load a tabular dataset—in this example, the [US Accidents dataset](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents) from Kaggle—and preview the first few rows. NeMo Safe Synthesizer will use a subset to keep runtime manageable.\n", + "\n", + "This dataset includes text, categorical, and numeric fields, making it a good demonstration of the model's ability to handle multiple data types.\n", + "\n", + "The code below also computes a recommended `delta` for differential privacy. Delta should reflect the full dataset size, not the subset, because it bounds the probability of a privacy breach across the entire population. See [Differential Privacy](../user-guide/configuration.md#differential-privacy) for parameter guidance." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d456d9b", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import kagglehub\n", + "\n", + "path = kagglehub.dataset_download(\"sobhanmoosavi/us-accidents\")\n", + "print(\"Path to dataset files:\", path)\n", + "df = pd.read_csv(f\"{path}/US_Accidents_March23.csv\", index_col=0)\n", + "full_data_size = len(df)\n", + "recommended_delta = 1 / (full_data_size ** 2) # delta should reflect the full dataset, even when a subset is used as Safe Synthesizer input\n", + "\n", + "print(f\"Full dataset size: {len(df)} records\")\n", + "print(f\"Recommended delta: {recommended_delta:.2e}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "63f5ce95", + "metadata": {}, + "outputs": [], + "source": [ + "# use a subset as Safe Synthesizer input for faster runtime\n", + "df = df.sample(n=26250, random_state=318)\n", + "print(f\"Input dataset size: {len(df)} records\")\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", 
+ "id": "1c394bab", + "metadata": {}, + "source": [ + "### ⚙️ Create and run Safe Synthesizer job\n", + "\n", + "Create the Safe Synthesizer builder and attach your DataFrame. Enable differential privacy and configure the training and generation stages for optimal performance with DP.\n", + "\n", + "Run the pipeline with `run()`, which performs data processing, PII replacement, training, generation, evaluation and saving of results in a single call. Results are available on `builder.results`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3bbde286", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer\n", + "\n", + "builder = (\n", + " SafeSynthesizer()\n", + " .with_data_source(df) # .with_replace_pii(enable=False) to disable PII replacement\n", + " .with_differential_privacy(dp_enabled=True, delta=recommended_delta)\n", + " .with_train(batch_size=16) # Override the default batch size of 1, which is designed for non-DP training\n", + " .with_generate(use_structured_generation=True) # Improves the percentage of valid records when DP is enabled\n", + ")\n", + "builder.run()\n", + "results = builder.results" + ] + }, + { + "cell_type": "markdown", + "id": "e88f0213", + "metadata": {}, + "source": [ + "### 📤 Retrieve synthetic data\n", + "\n", + "Inspect the generated synthetic data including row count and preview of the first rows." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5a7a48d2", + "metadata": {}, + "outputs": [], + "source": [ + "synth = results.synthetic_data\n", + "print(f\"Number of synthetic rows: {len(synth)}\")\n", + "synth.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d2b842ad", + "metadata": {}, + "outputs": [], + "source": [ + "# Synthetic data and evaluation report are automatically saved to the artifacts directory\n", + "print(f\"Artifacts automatically saved to: {builder._workdir.generate.path}\")" + ] + }, + { + "cell_type": "markdown", + "id": "75ec6da5", + "metadata": {}, + "source": [ + "### 🛡️ Review evaluation report\n", + "\n", + "The pipeline computes both quality and privacy metrics. The summary includes timing information and overall scores, while the full evaluation report is rendered as an HTML document." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e121493f", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "print(\"Summary (timing and scores):\")\n", + "print(json.dumps(results.summary.model_dump(), indent=2))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a42bef2c", + "metadata": {}, + "outputs": [], + "source": [ + "# View the evaluation report in a sandboxed iframe\n", + "import base64\n", + "from IPython.display import IFrame\n", + "\n", + "report_html = results.evaluation_report_html\n", + "if report_html:\n", + " data_url = \"data:text/html;base64,\" + base64.b64encode(report_html.encode()).decode()\n", + " display(IFrame(src=data_url, width=\"100%\", height=800))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + 
"version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md index 85d10a372..5201d8d94 100644 --- a/docs/tutorials/index.md +++ b/docs/tutorials/index.md @@ -7,7 +7,8 @@ Interactive Jupyter notebook tutorials for NeMo Safe Synthesizer. ## Available Tutorials -_No tutorials have been added yet._ +- [Safe Synthesizer 101](safe-synthesizer-101.ipynb) -- learn the fundamentals +- [Differential Privacy](differential-privacy.ipynb) -- enable differential privacy guarantees ## Adding a Tutorial diff --git a/docs/tutorials/safe-synthesizer-101.ipynb b/docs/tutorials/safe-synthesizer-101.ipynb index 528969400..2533cf8b6 100644 --- a/docs/tutorials/safe-synthesizer-101.ipynb +++ b/docs/tutorials/safe-synthesizer-101.ipynb @@ -1,24 +1,31 @@ { "cells": [ + { + "cell_type": "markdown", + "id": "6808567a", + "metadata": {}, + "source": [ + "\n", + "" + ] + }, { "cell_type": "markdown", "id": "d1d7a7a3", "metadata": {}, "source": [ "\n", - "# 🔐 Nemo Safe Synthesizer Tutorial: The Basics\n", + "# 🔐 NeMo Safe Synthesizer Tutorial: The Basics\n", "\n", "#### What you'll learn\n", "\n", - "In this notebook, we'll explore the fundamentals of the NeMo Safe Synthesizer: PII replacement, training on a sample dataset, generating synthetic data, and evaluating quality and privacy.\n", + "In this notebook, we'll explore the fundamentals of NeMo Safe Synthesizer: PII replacement, training on a sample dataset, generating synthetic data, and evaluating quality and privacy.\n", "\n", - "This library supports numeric, categorical, and text fields within the training data and generates realistic synthetic data that mirrors the structure of your data. A full run takes ~15 minutes on an A100; an H100 is faster.\n", + "This library supports numeric, categorical, and text fields within the training data and generates realistic synthetic data that mirrors the structure of your data. 
A full run takes about 15 minutes.\n", "\n", "### 🖥️ Prerequisites\n", "\n", - "This notebook is intended to run on a **GPU**. We recommend an **H100**; minimum **A100**.\n", - "\n", - "\n" + "This notebook requires a GPU. We recommend an H100; minimum A100." ] }, { @@ -26,9 +33,9 @@ "id": "d501f043", "metadata": {}, "source": [ - "### ⚡ Colab Setup\n", + "### ⚡ Install Safe Synthesizer\n", "\n", - "Run the cell below to install Nemo Safe Synthesizer (engine and CUDA 12.8) and the `datasets` library for the sample dataset." + "Run the cell below to install NeMo Safe Synthesizer (engine and CUDA 12.8) and the `datasets` library for the sample dataset." ] }, { @@ -56,9 +63,9 @@ "metadata": {}, "source": [ "\n", - "### 🔑 Set the inference API key for column classification\n", + "### 🔑 Set the inference API key for PII column classification\n", "\n", - "NeMo Safe Synthesizer uses an LLM‑based column classifier to automatically infer column types and improve PII detection accuracy. To enable this feature, set `NSS_INFERENCE_KEY` (the inference endpoint defaults to the NVIDIA integrate URL. You can obtain an API key from [build.nvidia.com](https://build.nvidia.com/settings/api-keys)). Setting this value is optional but strongly recommended.\n" + "NeMo Safe Synthesizer uses an LLM‑based column classifier to automatically infer PII columns. To enable this feature, set `NSS_INFERENCE_KEY` (the inference endpoint defaults to the NVIDIA integrate URL. You can obtain an API key from [build.nvidia.com](https://build.nvidia.com/settings/api-keys)). Setting this value is optional but strongly recommended.\n" ] }, { @@ -78,7 +85,7 @@ " print(\"NSS_INFERENCE_KEY is set\")\n", "else:\n", " print(\n", - " \"NSS_INFERENCE_KEY is not set. \"\n", + " \"NSS_INFERENCE_KEY is not set. Replace PII will run in degraded mode. 
\"\n", " \"We strongly recommend setting a key.\"\n", " )" ] @@ -90,7 +97,7 @@ "source": [ "### 📥 Load and preview sample dataset\n", "\n", - "Load a tabular dataset—in this example, the [clinc_oos](https://huggingface.co/datasets/clinc/clinc_oos) from Huggingface—and preview the first few rows. NeMo Safe Synthesizer will use this DataFrame as its training data.\n", + "Load a tabular dataset—in this example, the [clinc_oos](https://huggingface.co/datasets/clinc/clinc_oos) dataset from Hugging Face—and preview the first few rows. NeMo Safe Synthesizer will use this DataFrame as its training data.\n", "\n", "This dataset includes a text column and a categorical intent label, making it a good demonstration of multi-type synthesis." ] @@ -105,8 +112,8 @@ "from datasets import load_dataset\n", "\n", "dataset = load_dataset(\"clinc/clinc_oos\", \"small\")\n", - "df = dataset[\"train\"].to_pandas() # type: ignore[union-attr]\n", - "df.head() # type: ignore[union-attr]" + "df = dataset[\"train\"].to_pandas()\n", + "df.head()" ] }, { @@ -115,15 +122,11 @@ "metadata": {}, "source": [ "\n", + "### ⚙️ Create and run Safe Synthesizer job\n", "\n", + "Create the Safe Synthesizer builder and attach your DataFrame. Run the pipeline with `run()`, which performs data processing, PII replacement, training, generation, and evaluation in a single call. Results are available on `builder.results`.\n", "\n", - "### ⚙️ Create and run Safe Synthesizer job\n", - "\n", - "Create the Safe Synthesizer builder and attach your DataFrame. \n", - "Run the pipeline with `run()`, which performs data processing, PII replacement, training, generation, and evaluation in a single call. 
Results are available on `builder.results`.\n", - "\n", - " Please refer to the [configuration docs](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/blob/main/docs/user-guide/configuration.md) for the full list of options.\n", - "\n" + "Refer to the [configuration docs](../user-guide/configuration.md) for the full list of options.\n" ] }, { @@ -135,12 +138,9 @@ "source": [ "from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer\n", "\n", - "\n", - "# To disable PII replacement for the run, chain `.with_replace_pii(enable=False)` on the builder before `run()`.\n", - "builder = SafeSynthesizer().with_data_source(df)\n", - "\n", + "builder = SafeSynthesizer().with_data_source(df) # .with_replace_pii(enable=False) to disable PII replacement\n", "builder.run()\n", - "results = builder.results" + "results = builder.results\n" ] }, { @@ -165,6 +165,17 @@ "synth.head()" ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e8f90d5", + "metadata": {}, + "outputs": [], + "source": [ + "# Synthetic data and evaluation report are automatically saved to the artifacts directory\n", + "print(f\"Artifacts automatically saved to: {builder._workdir.generate.path}\")" + ] + }, { "cell_type": "markdown", "id": "75ec6da5", @@ -195,12 +206,14 @@ "metadata": {}, "outputs": [], "source": [ - "# Download the full HTML evaluation report\n", - "if results.evaluation_report_html:\n", - " report_path = \"evaluation_report.html\"\n", - " with open(report_path, \"w\") as f:\n", - " f.write(results.evaluation_report_html)\n", - " print(f\"The HTML evaluation report is saved in {report_path}.\")" + "# View the evaluation report in a sandboxed iframe\n", + "import base64\n", + "from IPython.display import IFrame\n", + "\n", + "report_html = results.evaluation_report_html\n", + "if report_html:\n", + " data_url = \"data:text/html;base64,\" + base64.b64encode(report_html.encode()).decode()\n", + " display(IFrame(src=data_url, width=\"100%\", height=800))" ] } ], @@ 
-220,7 +233,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.14" + "version": "3.11.13" } }, "nbformat": 4, diff --git a/mkdocs.yml b/mkdocs.yml index 96a7e446c..25b8a032d 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -158,6 +158,8 @@ nav: - Evaluation: product-overview/evaluation.md - Tutorials: - tutorials/index.md + - Safe Synthesizer 101: tutorials/safe-synthesizer-101.ipynb + - Differential Privacy: tutorials/differential-privacy.ipynb - User Guide: - Getting Started: user-guide/getting-started.md - Running Safe Synthesizer: user-guide/running.md From a61b189eb2e1f7e60d0f7ad129334950199e9aa5 Mon Sep 17 00:00:00 2001 From: nina-xu <19981858+nina-xu@users.noreply.github.com> Date: Tue, 24 Mar 2026 17:19:10 +0000 Subject: [PATCH 2/9] add gpu type to runtime Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com> --- docs/tutorials/differential-privacy.ipynb | 2 +- docs/tutorials/safe-synthesizer-101.ipynb | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/tutorials/differential-privacy.ipynb b/docs/tutorials/differential-privacy.ipynb index ddbf26cc0..b9bba8706 100644 --- a/docs/tutorials/differential-privacy.ipynb +++ b/docs/tutorials/differential-privacy.ipynb @@ -17,7 +17,7 @@ "\n", "# 🔐 NeMo Safe Synthesizer Tutorial: Differential Privacy\n", "\n", - "Learn how to apply differential privacy to achieve the maximum level of privacy with mathematical guarantees. This tutorial demonstrates how to configure differential privacy parameters for optimal results. The runtime of this notebook is about 1 hour.\n", + "Learn how to apply differential privacy to achieve the maximum level of privacy with mathematical guarantees. This tutorial demonstrates how to configure differential privacy parameters for optimal results. 
The runtime of this notebook is about 1 hour on an A100.\n", "\n", "If you have not yet completed the [Safe Synthesizer 101](safe-synthesizer-101.ipynb) tutorial, consider starting there first.\n", "\n", diff --git a/docs/tutorials/safe-synthesizer-101.ipynb b/docs/tutorials/safe-synthesizer-101.ipynb index 2533cf8b6..32c8441d5 100644 --- a/docs/tutorials/safe-synthesizer-101.ipynb +++ b/docs/tutorials/safe-synthesizer-101.ipynb @@ -21,7 +21,7 @@ "\n", "In this notebook, we'll explore the fundamentals of NeMo Safe Synthesizer: PII replacement, training on a sample dataset, generating synthetic data, and evaluating quality and privacy.\n", "\n", - "This library supports numeric, categorical, and text fields within the training data and generates realistic synthetic data that mirrors the structure of your data. A full run takes about 15 minutes.\n", + "This library supports numeric, categorical, and text fields within the training data and generates realistic synthetic data that mirrors the structure of your data. 
A full run takes about 15 minutes on an A100.\n", "\n", "### 🖥️ Prerequisites\n", "\n", From eb47c5b6246b26b9276ce5d2749b2dad49e3fd39 Mon Sep 17 00:00:00 2001 From: nina-xu <19981858+nina-xu@users.noreply.github.com> Date: Tue, 24 Mar 2026 18:05:17 +0000 Subject: [PATCH 3/9] fix import Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com> --- docs/tutorials/differential-privacy.ipynb | 2 +- docs/tutorials/safe-synthesizer-101.ipynb | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/tutorials/differential-privacy.ipynb b/docs/tutorials/differential-privacy.ipynb index b9bba8706..2c47c1416 100644 --- a/docs/tutorials/differential-privacy.ipynb +++ b/docs/tutorials/differential-privacy.ipynb @@ -224,7 +224,7 @@ "source": [ "# View the evaluation report in a sandboxed iframe\n", "import base64\n", - "from IPython.display import IFrame\n", + "from IPython.display import IFrame, display\n", "\n", "report_html = results.evaluation_report_html\n", "if report_html:\n", diff --git a/docs/tutorials/safe-synthesizer-101.ipynb b/docs/tutorials/safe-synthesizer-101.ipynb index 32c8441d5..4bac34e5d 100644 --- a/docs/tutorials/safe-synthesizer-101.ipynb +++ b/docs/tutorials/safe-synthesizer-101.ipynb @@ -208,7 +208,7 @@ "source": [ "# View the evaluation report in a sandboxed iframe\n", "import base64\n", - "from IPython.display import IFrame\n", + "from IPython.display import IFrame, display\n", "\n", "report_html = results.evaluation_report_html\n", "if report_html:\n", From 9437935d7436731a9e4df02e1eb50c35ec8ab1ae Mon Sep 17 00:00:00 2001 From: nina-xu <19981858+nina-xu@users.noreply.github.com> Date: Mon, 30 Mar 2026 20:53:26 +0000 Subject: [PATCH 4/9] add citation & disclaimer; remove 101 from my changes Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com> --- docs/tutorials/differential-privacy.ipynb | 17 +++-- docs/tutorials/safe-synthesizer-101.ipynb | 79 ++++++++++------------- 2 files changed, 45 insertions(+), 51 
deletions(-) diff --git a/docs/tutorials/differential-privacy.ipynb b/docs/tutorials/differential-privacy.ipynb index 2c47c1416..8b247a6ad 100644 --- a/docs/tutorials/differential-privacy.ipynb +++ b/docs/tutorials/differential-privacy.ipynb @@ -1,12 +1,12 @@ { "cells": [ { - "cell_type": "markdown", - "id": "1b6d66e2", + "cell_type": "raw", + "id": "ac33fed2", "metadata": {}, "source": [ - "\n", - "" + "SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n", + "SPDX-License-Identifier: Apache-2.0" ] }, { @@ -91,7 +91,14 @@ "\n", "This dataset includes text, categorical, and numeric fields, making it a good demonstration of the model's ability to handle multiple data types.\n", "\n", - "The code below also computes a recommended `delta` for differential privacy. Delta should reflect the full dataset size, not the subset, because it bounds the probability of a privacy breach across the entire population. See [Differential Privacy](../user-guide/configuration.md#differential-privacy) for parameter guidance." + "The code below also computes a recommended `delta` for differential privacy. Delta should reflect the full dataset size, not the subset, because it bounds the probability of a privacy breach across the entire population. See [Differential Privacy](../user-guide/configuration.md#differential-privacy) for parameter guidance.\n", + "\n", + "> Dataset citations:\n", + ">\n", + "> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. \"A Countrywide Traffic Accident Dataset.\", 2019.\n", + "> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. 
\"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.\n", + ">\n", + "> Each user is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use." ] }, { diff --git a/docs/tutorials/safe-synthesizer-101.ipynb b/docs/tutorials/safe-synthesizer-101.ipynb index 4bac34e5d..528969400 100644 --- a/docs/tutorials/safe-synthesizer-101.ipynb +++ b/docs/tutorials/safe-synthesizer-101.ipynb @@ -1,31 +1,24 @@ { "cells": [ - { - "cell_type": "markdown", - "id": "6808567a", - "metadata": {}, - "source": [ - "\n", - "" - ] - }, { "cell_type": "markdown", "id": "d1d7a7a3", "metadata": {}, "source": [ "\n", - "# 🔐 NeMo Safe Synthesizer Tutorial: The Basics\n", + "# 🔐 Nemo Safe Synthesizer Tutorial: The Basics\n", "\n", "#### What you'll learn\n", "\n", - "In this notebook, we'll explore the fundamentals of NeMo Safe Synthesizer: PII replacement, training on a sample dataset, generating synthetic data, and evaluating quality and privacy.\n", + "In this notebook, we'll explore the fundamentals of the NeMo Safe Synthesizer: PII replacement, training on a sample dataset, generating synthetic data, and evaluating quality and privacy.\n", "\n", - "This library supports numeric, categorical, and text fields within the training data and generates realistic synthetic data that mirrors the structure of your data. A full run takes about 15 minutes on an A100.\n", + "This library supports numeric, categorical, and text fields within the training data and generates realistic synthetic data that mirrors the structure of your data. A full run takes ~15 minutes on an A100; an H100 is faster.\n", "\n", "### 🖥️ Prerequisites\n", "\n", - "This notebook requires a GPU. We recommend an H100; minimum A100." + "This notebook is intended to run on a **GPU**. 
We recommend an **H100**; minimum **A100**.\n", + "\n", + "\n" ] }, { @@ -33,9 +26,9 @@ "id": "d501f043", "metadata": {}, "source": [ - "### ⚡ Install Safe Synthesizer\n", + "### ⚡ Colab Setup\n", "\n", - "Run the cell below to install NeMo Safe Synthesizer (engine and CUDA 12.8) and the `datasets` library for the sample dataset." + "Run the cell below to install Nemo Safe Synthesizer (engine and CUDA 12.8) and the `datasets` library for the sample dataset." ] }, { @@ -63,9 +56,9 @@ "metadata": {}, "source": [ "\n", - "### 🔑 Set the inference API key for PII column classification\n", + "### 🔑 Set the inference API key for column classification\n", "\n", - "NeMo Safe Synthesizer uses an LLM‑based column classifier to automatically infer PII columns. To enable this feature, set `NSS_INFERENCE_KEY` (the inference endpoint defaults to the NVIDIA integrate URL. You can obtain an API key from [build.nvidia.com](https://build.nvidia.com/settings/api-keys)). Setting this value is optional but strongly recommended.\n" + "NeMo Safe Synthesizer uses an LLM‑based column classifier to automatically infer column types and improve PII detection accuracy. To enable this feature, set `NSS_INFERENCE_KEY` (the inference endpoint defaults to the NVIDIA integrate URL. You can obtain an API key from [build.nvidia.com](https://build.nvidia.com/settings/api-keys)). Setting this value is optional but strongly recommended.\n" ] }, { @@ -85,7 +78,7 @@ " print(\"NSS_INFERENCE_KEY is set\")\n", "else:\n", " print(\n", - " \"NSS_INFERENCE_KEY is not set. Replace PII will run in degraded mode. \"\n", + " \"NSS_INFERENCE_KEY is not set. \"\n", " \"We strongly recommend setting a key.\"\n", " )" ] @@ -97,7 +90,7 @@ "source": [ "### 📥 Load and preview sample dataset\n", "\n", - "Load a tabular dataset—in this example, the [clinc_oos](https://huggingface.co/datasets/clinc/clinc_oos) dataset from Hugging Face—and preview the first few rows. 
NeMo Safe Synthesizer will use this DataFrame as its training data.\n", + "Load a tabular dataset—in this example, the [clinc_oos](https://huggingface.co/datasets/clinc/clinc_oos) from Huggingface—and preview the first few rows. NeMo Safe Synthesizer will use this DataFrame as its training data.\n", "\n", "This dataset includes a text column and a categorical intent label, making it a good demonstration of multi-type synthesis." ] @@ -112,8 +105,8 @@ "from datasets import load_dataset\n", "\n", "dataset = load_dataset(\"clinc/clinc_oos\", \"small\")\n", - "df = dataset[\"train\"].to_pandas()\n", - "df.head()" + "df = dataset[\"train\"].to_pandas() # type: ignore[union-attr]\n", + "df.head() # type: ignore[union-attr]" ] }, { @@ -122,11 +115,15 @@ "metadata": {}, "source": [ "\n", - "### ⚙️ Create and run Safe Synthesizer job\n", "\n", - "Create the Safe Synthesizer builder and attach your DataFrame. Run the pipeline with `run()`, which performs data processing, PII replacement, training, generation, and evaluation in a single call. Results are available on `builder.results`.\n", "\n", - "Refer to the [configuration docs](../user-guide/configuration.md) for the full list of options.\n" + "### ⚙️ Create and run Safe Synthesizer job\n", + "\n", + "Create the Safe Synthesizer builder and attach your DataFrame. \n", + "Run the pipeline with `run()`, which performs data processing, PII replacement, training, generation, and evaluation in a single call. 
Results are available on `builder.results`.\n", + "\n", + " Please refer to the [configuration docs](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/blob/main/docs/user-guide/configuration.md) for the full list of options.\n", + "\n" ] }, { @@ -138,9 +135,12 @@ "source": [ "from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer\n", "\n", - "builder = SafeSynthesizer().with_data_source(df) # .with_replace_pii(enable=False) to disable PII replacement\n", + "\n", + "# To disable PII replacement for the run, chain `.with_replace_pii(enable=False)` on the builder before `run()`.\n", + "builder = SafeSynthesizer().with_data_source(df)\n", + "\n", "builder.run()\n", - "results = builder.results\n" + "results = builder.results" ] }, { @@ -165,17 +165,6 @@ "synth.head()" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "8e8f90d5", - "metadata": {}, - "outputs": [], - "source": [ - "# Synthetic data and evaluation report are automatically saved to the artifacts directory\n", - "print(f\"Artifacts automatically saved to: {builder._workdir.generate.path}\")" - ] - }, { "cell_type": "markdown", "id": "75ec6da5", @@ -206,14 +195,12 @@ "metadata": {}, "outputs": [], "source": [ - "# View the evaluation report in a sandboxed iframe\n", - "import base64\n", - "from IPython.display import IFrame, display\n", - "\n", - "report_html = results.evaluation_report_html\n", - "if report_html:\n", - " data_url = \"data:text/html;base64,\" + base64.b64encode(report_html.encode()).decode()\n", - " display(IFrame(src=data_url, width=\"100%\", height=800))" + "# Download the full HTML evaluation report\n", + "if results.evaluation_report_html:\n", + " report_path = \"evaluation_report.html\"\n", + " with open(report_path, \"w\") as f:\n", + " f.write(results.evaluation_report_html)\n", + " print(f\"The HTML evaluation report is saved in {report_path}.\")" ] } ], @@ -233,7 +220,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - 
"version": "3.11.13" + "version": "3.11.14" } }, "nbformat": 4, From 9a26a0b2ef995b2b2aa6919a5d0f47668123ea71 Mon Sep 17 00:00:00 2001 From: nina-xu <19981858+nina-xu@users.noreply.github.com> Date: Mon, 30 Mar 2026 21:03:45 +0000 Subject: [PATCH 5/9] feedback Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com> --- docs/tutorials/differential-privacy.ipynb | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/docs/tutorials/differential-privacy.ipynb b/docs/tutorials/differential-privacy.ipynb index 8b247a6ad..7cc211edf 100644 --- a/docs/tutorials/differential-privacy.ipynb +++ b/docs/tutorials/differential-privacy.ipynb @@ -1,14 +1,5 @@ { "cells": [ - { - "cell_type": "raw", - "id": "ac33fed2", - "metadata": {}, - "source": [ - "SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n", - "SPDX-License-Identifier: Apache-2.0" - ] - }, { "cell_type": "markdown", "id": "d1d7a7a3", @@ -44,7 +35,10 @@ "outputs": [], "source": [ "%%capture\n", - "!uv pip install nemo-safe-synthesizer[engine,cu128] --extra-index-url \"https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi-local/simple\"\n", + "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n", + "# SPDX-License-Identifier: Apache-2.0\n", + "\n", + "!uv pip install \\\"nemo-safe-synthesizer[engine,cu128]\\\" --index https://flashinfer.ai/whl/cu128 --index https://download.pytorch.org/whl/cu128 --index-strategy unsafe-best-match\n", "!uv pip install kagglehub\n" ] }, @@ -98,7 +92,7 @@ "> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. \"A Countrywide Traffic Accident Dataset.\", 2019.\n", "> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. 
\"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.\n", ">\n", - "> Each user is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use." + "> Each user is responsible for checking the content of dataset and the applicable licenses and determining if suitable for the intended use." ] }, { From d2d16945fc2075cc6b1dc126a5f0e3bb3ebd5f72 Mon Sep 17 00:00:00 2001 From: nina-xu <19981858+nina-xu@users.noreply.github.com> Date: Mon, 30 Mar 2026 21:11:32 +0000 Subject: [PATCH 6/9] nit format Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com> --- docs/tutorials/differential-privacy.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/tutorials/differential-privacy.ipynb b/docs/tutorials/differential-privacy.ipynb index 7cc211edf..09c131155 100644 --- a/docs/tutorials/differential-privacy.ipynb +++ b/docs/tutorials/differential-privacy.ipynb @@ -38,7 +38,7 @@ "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n", "# SPDX-License-Identifier: Apache-2.0\n", "\n", - "!uv pip install \\\"nemo-safe-synthesizer[engine,cu128]\\\" --index https://flashinfer.ai/whl/cu128 --index https://download.pytorch.org/whl/cu128 --index-strategy unsafe-best-match\n", + "!uv pip install nemo-safe-synthesizer[engine,cu128] --extra-index-url \"https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi-local/simple\"\n", "!uv pip install kagglehub\n" ] }, @@ -83,7 +83,7 @@ "\n", "Load a tabular dataset—in this example, the [US Accidents dataset](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents) from Kaggle—and preview the first few rows. 
NeMo Safe Synthesizer will use a subset to keep runtime manageable.\n", "\n", - "This dataset includes text, categorical, and numeric fields, making it a good demonstration of the model's ability to handle multiple data types.\n", + "This dataset includes text, categorical, and numeric fields, all of which are supported by Safe Synthesizer.\n", "\n", "The code below also computes a recommended `delta` for differential privacy. Delta should reflect the full dataset size, not the subset, because it bounds the probability of a privacy breach across the entire population. See [Differential Privacy](../user-guide/configuration.md#differential-privacy) for parameter guidance.\n", "\n", From 2451db4c9baca12b123080ad1ba6d9e30e478b1d Mon Sep 17 00:00:00 2001 From: nina-xu <19981858+nina-xu@users.noreply.github.com> Date: Mon, 30 Mar 2026 21:17:02 +0000 Subject: [PATCH 7/9] copilot nit comments Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com> --- docs/tutorials/differential-privacy.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/tutorials/differential-privacy.ipynb b/docs/tutorials/differential-privacy.ipynb index 09c131155..db38797ad 100644 --- a/docs/tutorials/differential-privacy.ipynb +++ b/docs/tutorials/differential-privacy.ipynb @@ -8,7 +8,7 @@ "\n", "# 🔐 NeMo Safe Synthesizer Tutorial: Differential Privacy\n", "\n", - "Learn how to apply differential privacy to achieve the maximum level of privacy with mathematical guarantees. This tutorial demonstrates how to configure differential privacy parameters for optimal results. The runtime of this notebook is about 1 hour on an A100.\n", + "Learn how to apply differential privacy to achieve strong privacy with mathematical guarantees. This tutorial demonstrates how to configure differential privacy parameters for optimal results. 
The runtime of this notebook is about 1 hour on an A100.\n", "\n", "If you have not yet completed the [Safe Synthesizer 101](safe-synthesizer-101.ipynb) tutorial, consider starting there first.\n", "\n", @@ -49,7 +49,7 @@ "source": [ "### 🔑 Set the inference API key for PII column classification\n", "\n", - "NeMo Safe Synthesizer uses an LLM‑based column classifier to automatically infer PII columns. To enable this feature, set `NSS_INFERENCE_KEY` (the inference endpoint defaults to the NVIDIA integrate URL. You can obtain an API key from [build.nvidia.com](https://build.nvidia.com/settings/api-keys)). Setting this value is optional but strongly recommended." + "NeMo Safe Synthesizer uses an LLM‑based column classifier to automatically infer PII columns. To enable this feature, set `NSS_INFERENCE_KEY`. By default, the inference endpoint is `https://integrate.api.nvidia.com/v1` (the NVIDIA integrate URL). You can obtain an API key from [build.nvidia.com](https://build.nvidia.com/settings/api-keys). Setting this value is optional but strongly recommended." ] }, { From ebe2cf7ba6afeca3e4c65f5bb9d33c1bcb07674d Mon Sep 17 00:00:00 2001 From: nina-xu <19981858+nina-xu@users.noreply.github.com> Date: Tue, 31 Mar 2026 13:55:48 +0000 Subject: [PATCH 8/9] nit grammar Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com> --- docs/tutorials/differential-privacy.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/differential-privacy.ipynb b/docs/tutorials/differential-privacy.ipynb index db38797ad..c6fdbb9ec 100644 --- a/docs/tutorials/differential-privacy.ipynb +++ b/docs/tutorials/differential-privacy.ipynb @@ -92,7 +92,7 @@ "> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. \"A Countrywide Traffic Accident Dataset.\", 2019.\n", "> - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. 
\"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.\n", ">\n", - "> Each user is responsible for checking the content of dataset and the applicable licenses and determining if suitable for the intended use." + "> Each user is responsible for checking the content of the dataset and the applicable licenses and determining if it is suitable for the intended use." ] }, { From ee84007c1efa1bc05a8ed03e2af90cd32ec8ec2f Mon Sep 17 00:00:00 2001 From: nina-xu <19981858+nina-xu@users.noreply.github.com> Date: Tue, 31 Mar 2026 18:39:51 +0000 Subject: [PATCH 9/9] update installation command Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com> --- docs/tutorials/differential-privacy.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/differential-privacy.ipynb b/docs/tutorials/differential-privacy.ipynb index c6fdbb9ec..f3e200e71 100644 --- a/docs/tutorials/differential-privacy.ipynb +++ b/docs/tutorials/differential-privacy.ipynb @@ -38,7 +38,7 @@ "# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n", "# SPDX-License-Identifier: Apache-2.0\n", "\n", - "!uv pip install nemo-safe-synthesizer[engine,cu128] --extra-index-url \"https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi-local/simple\"\n", + "!uv pip install \"nemo-safe-synthesizer[engine,cu128]\" --index https://flashinfer.ai/whl/cu128 --index https://download.pytorch.org/whl/cu128 --index-strategy unsafe-best-match\n", "!uv pip install kagglehub\n" ] },