test: add smoke tests for training and generation hot paths#76
Conversation
kendrickb-nvidia
left a comment
There was a problem hiding this comment.
Should we add some guidance to CONTRIBUTING.md about when and why to use smoke tests versus the others?
When are we planning to run these (I don't see changes to CI yet)? Will we gate running other tests on smoke tests passing?
0280ed2 to
8453877
Compare
8453877 to
482412e
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds comprehensive smoke tests to catch dependency breakage and SDK orchestration regressions in training and generation hot paths. The tests use tiny models and run quickly (CPU tests in seconds, GPU tests in ~20 minutes) without checking output quality.
Changes:
- Adds 13 new smoke test files covering CPU training/generation, GPU SDK training/generation, adapter persistence, resume flows, structured generation, timeseries, and Unsloth integration
- Moves SmolLM2 metadata from production code to test-only
tests/smoke/conftest.py(fixes CUDA device-side asserts from out-of-range BOS token) - Restructures pytest markers: 4 categories (
unit,slow,smoke,e2e) + 1 orthogonal modifier (requires_gpu) - Splits GPU CI workflow into parallel required smoke tests (~20 min) and informational e2e tests (~50 min) for faster feedback
- Adds CPU smoke tests as a required CI job
- Updates documentation across 11 files (README, CONTRIBUTING, AGENTS, test docs, Cursor rules, agent skills)
Reviewed changes
Copilot reviewed 31 out of 32 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/smoke/conftest.py | Shared fixtures for smoke tests including SmolLM2 metadata class, tiny model fixtures, session-scoped tokenizer, and helper functions |
| tests/smoke/init.py | Package marker file with license header |
| tests/smoke/README.md | Documentation explaining when to add smoke tests, common gotchas, and fixture descriptions |
| tests/smoke/test_training_cpu.py | CPU training smoke tests (HF Trainer, LoRA, DP, Assembler) |
| tests/smoke/test_generation_cpu.py | CPU generation smoke tests (schema prompt, processor) |
| tests/smoke/test_nss_training_gpu.py | GPU SDK training tests (standard + DP) |
| tests/smoke/test_nss_generation_gpu.py | GPU generation tests (full chain + manual VllmBackend) |
| tests/smoke/test_nss_adapter_persistence_gpu.py | GPU adapter file checks + PEFT load verification |
| tests/smoke/test_nss_resume_gpu.py | GPU resume flow test (train -> save -> load -> generate) |
| tests/smoke/test_nss_structured_gen_gpu.py | GPU structured generation test (outlines + json_schema) |
| tests/smoke/test_nss_timeseries_gpu.py | GPU timeseries backend test |
| tests/smoke/test_nss_unsloth_gpu.py | GPU Unsloth training test (process-isolated) |
| tests/smoke/test_full_pipeline_gpu.py | GPU full pipeline test with SmolLM2-135M from Hub |
| src/nemo_safe_synthesizer/llm/metadata.py | Removes SmolLM2 class (moved to tests) |
| tests/conftest.py | Updates auto-marking logic for new marker structure |
| tests/llm/test_metadata.py | Removes SmolLM2 test scenarios |
| tests/e2e/test_safe_synthesizer.py | Updates markers from gpu_integration to requires_gpu |
| pytest.ini | Redefines markers: removes 5 old markers, adds 3 new ones |
| pyproject.toml | Adds setuptools>=80.0.0 dependency |
| uv.lock | Updates dependency lock with setuptools |
| Makefile | Renames test targets: test-smoke, test-smoke-gpu (replaces test-gpu-integration) |
| .github/workflows/gpu-tests.yml | Splits into gpu-smoke-test (required) and gpu-e2e-test (informational) jobs |
| .github/workflows/ci-checks.yml | Adds CPU smoke-test as required job |
| README.md | Updates test command documentation |
| CONTRIBUTING.md | Updates test command documentation and adds smoke test reference |
| AGENTS.md | Updates test command and marker documentation |
| tests/AGENTS.md | Updates test command reference |
| tests/TESTING.md | Comprehensive update of markers, commands, fixtures, and isolation patterns |
| .cursor/rules/repo-navigation.mdc | Updates marker list |
| .claude/commands/gpu-test.md | Updates GPU test command reference |
| .agent/skills/python-testing-patterns/SKILL.md | Updates marker documentation |
| .agent/skills/diagnose-failures/SKILL.md | Updates marker and command table |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
b4ae90d to
fb8129b
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 31 out of 32 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
fb8129b to
9137c3e
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 32 out of 33 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
tests/llm/test_metadata.py:130
- This removes SmolLM2 coverage from the unit-level metadata tests (no
ModelDetectionScenario/ModelInitScenariofor SmolLM2) even thoughSmolLM2is still a productionModelMetadatasubclass. Consider updating the SmolLM2 scenarios to use tokenizer-derived BOS/EOS (or otherwise avoid hardcoding an out-of-rangebos_token_id) rather than dropping the unit test coverage entirely.
# Model detection scenarios - maps model paths to expected classes
MODEL_DETECTION_SCENARIOS = [
ModelDetectionScenario("tinyllama", "TinyLlama/TinyLlama-1.1B-Chat-v1.0", TinyLlama),
ModelDetectionScenario("qwen", "Qwen/Qwen2-0.5B", Qwen),
ModelDetectionScenario("llama32", "meta-llama/Llama32-1B", Llama32),
ModelDetectionScenario("smollm3", "HuggingFaceTB/SmolLM3-3B", SmolLM3),
ModelDetectionScenario("mistral", "mistralai/Mistral-7B-Instruct-v0.3", Mistral),
ModelDetectionScenario("nemotron", "nvidia/Nemotron-4-340B", Nemotron),
]
# Model initialization scenarios - tests each model's specific configuration
MODEL_INIT_SCENARIOS = [
ModelInitScenario(
id="tinyllama",
model_class=TinyLlama,
model_path="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
expected_template=PROMPT_TEMPLATE,
expected_add_bos=True,
expected_add_eos=True,
expected_bos_token=None, # Uses tokenizer default
expected_bos_token_id=None,
),
ModelInitScenario(
id="qwen",
model_class=Qwen,
model_path="Qwen/Qwen2-0.5B",
expected_template="user\n {instruction} {schema} \n assistant\n{prefill}",
expected_add_bos=True,
expected_add_eos=False,
expected_bos_token=None,
expected_bos_token_id=None,
),
ModelInitScenario(
id="llama32",
model_class=Llama32,
model_path="meta-llama/Llama32-1B",
expected_template="user\n {instruction} {schema} \n assistant\n{prefill}",
expected_add_bos=False,
expected_add_eos=False,
expected_bos_token="<|im_start|>",
expected_bos_token_id=151644,
),
ModelInitScenario(
id="smollm3",
model_class=SmolLM3,
model_path="HuggingFaceTB/SmolLM3-3B",
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
4e53f6b to
bb69311
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 48 out of 49 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
kendrickb-nvidia
left a comment
There was a problem hiding this comment.
Maybe this is a bug with github console. But I'm seeing diffs here that are from already merged PRs. Not just the smoke test changes.
Signed-off-by: aagonzales <aagonzales@nvidia.com> Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com> Signed-off-by: aagonzales <aagonzales@nvidia.com>
Signed-off-by: aagonzales <aagonzales@nvidia.com> Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com> Signed-off-by: aagonzales <aagonzales@nvidia.com>
Signed-off-by: aagonzales <aagonzales@nvidia.com> Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com> Signed-off-by: aagonzales <aagonzales@nvidia.com>
Signed-off-by: aagonzales <aagonzales@nvidia.com> Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com> Signed-off-by: aagonzales <aagonzales@nvidia.com>
- Replace dead `enable_replace_pii=False` with `replace_pii=None` in all `from_params()` calls (the old key was silently ignored by Pydantic) - Remove dead `enable_synthesis=True` (no such field exists in src/) - Add `vllm` pytest marker for GPU memory isolation; refactor Makefile `test-smoke-gpu` into marker-based groups (train-only / vllm / smollm2 / unsloth) - Tighten `except GenerationError` to assert on expected message - Fix SmolLM2 metadata: remove hardcoded out-of-range BOS token IDs - Promote `base_smoke_config` and `_patch_attn_eager` to session scope - Make `tiny_llama_config` depend on `stub_tokenizer` (eliminate redundant load) - Add CPU smoke tests for evaluation (MultimodalReport) and PII replacement (NemoPII) - Update smoke test docs (README.md, TESTING.md) to reflect new markers and scopes Made-with: Cursor Signed-off-by: aagonzales <aagonzales@nvidia.com>
- ci-checks.yml: fix echo alignment in status check job - TESTING.md: use `uv run --frozen pytest` instead of bare `pytest` - test_training_cpu.py: remove unused `fixture_stub_tokenizer_path` param - Makefile: use PYTEST_NO_XDIST_CMD instead of appending -n 0 to PYTEST_CMD; fix test-gpu-integration help text to match its actual recipe (e2e only) Made-with: Cursor Signed-off-by: aagonzales <aagonzales@nvidia.com>
Signed-off-by: aagonzales <aagonzales@nvidia.com>
Signed-off-by: aagonzales <aagonzales@nvidia.com>
bb69311 to
4e3a6a2
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 32 out of 36 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| decoded = stub_tokenizer.decode(output[0], skip_special_tokens=True) | ||
| assert len(decoded) > len(prompt) # model produced something |
There was a problem hiding this comment.
test_generate_with_schema_prompt asserts len(decoded) > len(prompt) after decoding with skip_special_tokens=True. With a randomly initialized model and greedy decoding, it's possible for generation to immediately emit only special tokens (e.g., EOS) so the decoded text matches the prompt, making this test flaky. Prefer asserting on tensor lengths (e.g., output.shape[-1] > input_ids.shape[-1]) or simply that generate() returns without error and the output shape is valid.
| decoded = stub_tokenizer.decode(output[0], skip_special_tokens=True) | |
| assert len(decoded) > len(prompt) # model produced something | |
| # Ensure generate() returns a tensor with valid shape (same batch, no shorter sequence) | |
| assert output is not None | |
| assert output.shape[0] == input_ids.shape[0] | |
| assert output.shape[1] >= input_ids.shape[1] |
| rope_scaling_factor=1, | ||
| ) | ||
| mock_metadata = MagicMock() | ||
| mock_metadata.prompt_config.prompt_template = PROMPT_TEMPLATE |
There was a problem hiding this comment.
ModelMetadata.prompt_config uses the field name template (see LLMPromptConfig.template), but this test sets mock_metadata.prompt_config.prompt_template. Even though the current create_processor() path may not read it, keeping the mock aligned with the real metadata interface will prevent silent breakage if create_processor() starts using the prompt template in more modes.
| mock_metadata.prompt_config.prompt_template = PROMPT_TEMPLATE | |
| mock_metadata.prompt_config.template = PROMPT_TEMPLATE |
mckornfield
left a comment
There was a problem hiding this comment.
marks are a bit copy pasta but probably more readable, mostly talking to myself in reviews, lg
| 2. vLLM generation (`vllm` marker): each file gets its own process because vLLM pre-allocates all GPU memory and never releases it. | ||
| 3. SmolLM2 / Unsloth (`smollm2`, `unsloth` markers): each gets its own process, auto-discovered via markers. | ||
|
|
||
| When adding a new GPU smoke test, add the appropriate markers to `pytestmark`: |
There was a problem hiding this comment.
are all of them appropriate? does it make sense to just add one pytest_add_gpu_markers() fn or nah?
## Summary the smoke tests pr #76 uses smollm2 for testing. during one of many rebases and changes to that branch, somewhere along the way we lost the following change. - `SmolLM2` metadata hardcoded `bos_token_id=151644` (copied from the `Llama32` class), but all SmolLM2 models have a 49152-token vocab where `<|im_start|>`; token id `151644` caused a CUDA `vectorized_gather_kernel` oob during tests. - Fix: look up the actual `<|im_start|>` token ID from the tokenizer via `convert_tokens_to_ids` at init time instead of hardcoding it. - Also corrects stale `generation/AGENTS.md` that said `ignore_eos=True` (now always `False` after #265). ## Test plan - [x] `pytest tests/llm/` -- 52 passed - [x] `pytest tests/smoke/test_full_pipeline_gpu.py::test_full_pipeline_smollm2` -- passes (was CUDA crash before) Signed-off-by: aagonzales <aagonzales@nvidia.com>
Summary
Adds smoke tests exercising training, generation, evaluation, and PII replacement hot paths with tiny models. Catches dependency breakage and SDK orchestration regressions without checking output quality.
Changes beyond the test files themselves:
unit,slow,smoke,e2e) + orthogonal modifiers (requires_gpu,vllm,smollm2,unsloth). Removed 5 unused markers.test-smoke(CPU),test-smoke-gpu(GPU with vLLM/SmolLM2/Unsloth process isolation).ci-checks.yml.Test plan