chore: rebase and updates

binaryaaron · binaryaaron · commit b4ae90dfd052 · 2026-03-10T00:46:34.000-06:00
Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com>
diff --git a/.github/workflows/ci-checks.yml b/.github/workflows/ci-checks.yml
@@ -169,7 +169,7 @@ jobs:
         run: |
           make bootstrap-nss cpu
           make test-smoke
-      
+
 
   # ---------------------------------------------------------------------------
   # Single required status check for branch protection.
diff --git a/STYLE_GUIDE.md b/STYLE_GUIDE.md
@@ -727,10 +727,10 @@ Testing conventions are substantial enough to warrant their own section. For the
 - Fixture scope: function-scoped by default. Session scope only when empirically justified by test runtime -- not based on assumptions about cost.
 - Assertions: bare `assert` is the primary style; `pytest.raises()` with `match=` for exceptions; `pytest.approx()` for floating-point comparisons
 - Docstrings: optional for simple tests, recommended for complex/e2e tests explaining purpose
-- Markers: auto-assigned by path via `pytest_collection_modifyitems` (`/e2e/` -> `e2e`, `/gpu_integration/` -> `gpu_integration`, default -> `unit`). Explicit markers: `@pytest.mark.slow`, `@pytest.mark.timeout()`.
+- Markers: auto-assigned by path via `pytest_collection_modifyitems` (`/e2e/` -> `e2e`, `/smoke/` -> `smoke`, default -> `unit`). Explicit markers: `@pytest.mark.slow`, `@pytest.mark.requires_gpu`, `@pytest.mark.timeout()`.
 - `conftest.py`: shared fixtures per directory; root conftest has `load_test_dataset()` and `load_test_dataframe()` helpers
 - Use `tmp_path` fixture for file operations, never write to the repo tree
-- Mark CUDA-dependent tests with `@pytest.mark.e2e` or `@pytest.mark.gpu_integration`
+- Mark CUDA-dependent tests with `@pytest.mark.e2e`, `@pytest.mark.smoke`, or `@pytest.mark.requires_gpu`
 - Mock only external boundaries, not internal implementation details
 - Test isolation: no shared mutable state or execution-order dependencies between tests. If something must be run first before executing a test, include it in the test or a fixture.
 - Use `@pytest.mark.parametrize` for testing multiple input combinations rather than copy-pasting similar tests
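
Note on the path-based markers described in this hunk: the mapping could look roughly like the sketch below. This is not the repository's actual `conftest.py`, which this diff does not show. The helper name `marker_for_path` is hypothetical, and the hook assumes pytest 7 or newer, where collected items expose `item.path`.

```python
def marker_for_path(path: str) -> str:
    """Map a test file path to the marker named in the style guide."""
    # Normalize Windows separators so the substring checks work everywhere.
    normalized = path.replace("\\", "/")
    if "/e2e/" in normalized:
        return "e2e"
    if "/smoke/" in normalized:
        return "smoke"
    # Anything outside the special directories is treated as a unit test.
    return "unit"


def pytest_collection_modifyitems(config, items):
    """Pytest hook: attach the path-derived marker to every collected test."""
    for item in items:
        # add_marker accepts a marker name as a plain string.
        item.add_marker(marker_for_path(str(item.path)))
```

Keeping the mapping in a plain function makes it possible to unit-test the path rules without running a pytest collection.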
diff --git a/docs/user-guide/troubleshooting.md b/docs/user-guide/troubleshooting.md
@@ -438,4 +438,3 @@ Checklist:
 
 Library bugs. If you encounter this error through documented interfaces,
 please [file an issue on GitHub](https://github.com/NVIDIA-NeMo/Safe-Synthesizer/issues).
-
diff --git a/pyproject.toml b/pyproject.toml
@@ -37,7 +37,7 @@ dependencies = [
   "colorama>=0.4.6",
   "tqdm>=4.67.1",
   "setuptools>=80.0.0",
-  
+
 ]
 
 [dependency-groups]
diff --git a/pytest.ini b/pytest.ini
@@ -18,6 +18,8 @@ markers =
     smoke: Smoke tests - quick tests exercising training/generation hot paths with tiny models
     e2e: End-to-end tests - test the entire pipeline from data to generation to evaluation
     requires_gpu: Test needs CUDA hardware (orthogonal modifier, stacks on smoke/e2e)
+    smollm2: SmolLM2 Hub download tests (used by Makefile for process isolation)
+    unsloth: Unsloth backend tests (process-isolated from DP tests)
     noautouse: Marker to skip autouse fixtures for specific tests
 
 # Note: Unit tests (testing single classes/functions with no infrastructure dependencies)
diff --git a/tests/smoke/README.md b/tests/smoke/README.md
@@ -19,6 +19,34 @@ end-to-end without throwing. Use the smallest model that exercises the path
 (the local `tiny_llama` stub for most things, SmolLM2-135M when you need
 a real tokenizer/model).
 
+## GPU Test Process Isolation
+
+GPU smoke tests run in three separate single-process (`-n 0`) pytest invocations to avoid CUDA and import-time conflicts:
+
+1. Local tiny-model tests (everything except SmolLM2 and Unsloth)
+2. SmolLM2 Hub download test (downloads ~270 MB from the Hugging Face Hub)
+3. Unsloth backend test (process-isolated from DP tests)
+
+Why: Unsloth monkey-patches transformers at import time, poisoning Opacus/DP if they share a process. CUDA device-side asserts also cascade across xdist workers. The Makefile `test-smoke-gpu` target handles the split automatically via `-k` filters.
+
+GPU test modules set a module-level `pytestmark` list, which applies the marks to every test in the file:
+
+```python
+pytestmark = [
+    pytest.mark.requires_gpu,
+    pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available"),
+    pytest.mark.skipif(sys.platform == "darwin", reason="Not applicable on macOS"),
+]
+```
+
+For SmolLM2 and Unsloth tests, also add the relevant mark to the individual test function, for example:
+
+```python
+@pytest.mark.usefixtures("_register_smollm2")  # for SmolLM2 tests
+def test_full_pipeline_smollm2(...):
+    ...
+```
+
 ## Things that will bite you
 
 - LoRA rank must be 8 (not 4). vLLM silently rejects rank 4. Use `lora_r=8`.
@@ -27,7 +55,6 @@ a real tokenizer/model).
 - Stub tokenizer vocab is 32000. If you change the tiny model config, keep `vocab_size=32000` or you'll get shape mismatches.
 - Always set `use_unsloth=False` unless you're specifically testing Unsloth. The `auto` default can pull it in and it monkey-patches transformers globally.
 - CPU tests need `optim="adamw_torch"`. The production default (`paged_adamw_32bit`) requires bitsandbytes CUDA kernels.
-- Unsloth tests run in a separate process. Unsloth patches transformers at import time, which breaks Opacus/DP if they share a process. The Makefile handles this automatically.
 
 ## What's in `conftest.py`?
 

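Note on the three-way split described in the README hunk: the sketch below builds the three single-process command lines. It is an illustration only. The Makefile's real `test-smoke-gpu` recipe is not part of this diff, so the `-k` expressions, the `-m requires_gpu` selection, the `tests/smoke` default, and the function name `smoke_gpu_commands` are all assumptions. `-n 0` requires pytest-xdist to be installed.

```python
import shlex

# Hypothetical -k keyword expressions; the actual Makefile filters may differ.
SMOLLM2_FILTER = "smollm2"
UNSLOTH_FILTER = "unsloth"


def smoke_gpu_commands(test_dir="tests/smoke"):
    """Return the three pytest command lines, in the order they would run."""
    # -n 0 disables xdist workers so each invocation is a single process.
    base = ["pytest", test_dir, "-m", "requires_gpu", "-n", "0"]
    filters = [
        f"not {SMOLLM2_FILTER} and not {UNSLOTH_FILTER}",  # 1. local tiny-model tests
        SMOLLM2_FILTER,  # 2. SmolLM2 Hub download test
        UNSLOTH_FILTER,  # 3. Unsloth backend test
    ]
    return [base + ["-k", expr] for expr in filters]


if __name__ == "__main__":
    for cmd in smoke_gpu_commands():
        # shlex.join quotes the -k expression so the line can be pasted into a shell.
        print(shlex.join(cmd))
```

Running each command as its own process is what keeps Unsloth's import-time patching away from the Opacus/DP tests; a single invocation with a combined filter would not give that isolation.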