
[None][chore] Add multinode e2e and accuracy cases on DGX-Spark#12110

Open
JennyLiu-nv wants to merge 1 commit into NVIDIA:main from JennyLiu-nv:spart_multinode_func

Conversation


@JennyLiu-nv JennyLiu-nv commented Mar 11, 2026

Summary by CodeRabbit

  • Tests

    • Added accuracy benchmark reference data for three new models: Llama 3.1 70B, NVIDIA Llama 3.1 405B Instruct (FP4), and DeepSeek R1 Distill Llama 70B.
    • Expanded test coverage for multi-GPU tensor parallelism configurations and multinode deployment scenarios.
    • Added end-to-end multinode test cases.
  • Chores

    • Updated test configuration files to organize and manage multinode test execution across various model configurations.
  • Using mpirun + trtllm-llmapi-launch + pytest to run the 2-node tests on DGX-Spark
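The launch pattern named above can be sketched as a shell command. The hostnames, rank layout, and test selector below are illustrative assumptions, not taken from this PR's CI configuration:

```shell
# Hypothetical 2-node launch: one MPI rank per node, wrapping pytest with
# trtllm-llmapi-launch so the LLM API can coordinate across both nodes.
# node0/node1 and the -k filter are placeholders.
mpirun -np 2 --host node0:1,node1:1 \
    trtllm-llmapi-launch \
    pytest tests/integration/defs/accuracy/test_llm_api_pytorch.py \
        -k "test_auto_dtype_tp2"
```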

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Jenny Liu <JennyLiu-nv+JennyLiu@users.noreply.github.com>
@JennyLiu-nv JennyLiu-nv requested review from a team as code owners March 11, 2026 09:54
@JennyLiu-nv JennyLiu-nv enabled auto-merge (squash) March 11, 2026 09:55
@JennyLiu-nv JennyLiu-nv changed the title [None][Test] Add multinode e2e and accuracy cases on DGX-Spark [None][chore] Add multinode e2e and accuracy cases on DGX-Spark Mar 11, 2026
@JennyLiu-nv
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

Walkthrough

The changes add three new model entries to MMLU accuracy references, introduce a helper function for multi-node accuracy testing with multiple new test classes and methods, create and modify end-to-end multi-node tests, and reorganize test manifests by converting a text file to YAML with GPU count-based conditions.

Changes

Cohort / File(s) Summary
Accuracy References
tests/integration/defs/accuracy/references/mmlu.yaml
Added three new model accuracy entries: meta-llama/Llama-3.1-70B (78.58), nvidia/Llama-3.1-405B-Instruct-NVFP4 with NVFP4 quantization (84.82), and deepseek-ai/DeepSeek-R1-Distill-Llama-70B (78.19).
Accuracy Tests
tests/integration/defs/accuracy/test_llm_api_pytorch.py
Added _run_multinode_accuracy helper function supporting tensor/pipeline/MOE parallelism, speculative decoding with draft models, and benchmark evaluation (MMLU, GSM8K). Introduced three new test classes (TestLlama3_1_70B, TestLlama3_1_405BInstructFp4, TestDeepSeekR1DistillLlama70B) and extended existing test classes with test_auto_dtype_tp2 methods for multi-node TP2 coverage.
End-to-End Tests
tests/integration/defs/test_e2e.py
Added new test_ptp_quickstart_advanced_multinode test with multi-node parameters. Modified test_ptp_quickstart_advanced_deepseek_multi_nodes to capture output and assert presence of "Generated text:" instead of using check_call.
Test Manifests
tests/integration/test_lists/qa/llm_spark_func.yml, tests/integration/test_lists/qa/llm_spark_func.txt
Created new YAML manifest (llm_spark_func.yml) defining single and multi-GPU test matrices with system_gpu_count conditions. Removed 77 lines of test entries from text manifest (llm_spark_func.txt).
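The test_e2e.py change summarized above (capturing output and asserting on "Generated text:" instead of using check_call) can be sketched as follows. The helper name and the stand-in command are illustrative, not the PR's actual code:

```python
import subprocess
import sys

def run_and_check_generated_text(cmd, marker="Generated text:"):
    # Capture stdout/stderr instead of subprocess.check_call so the test
    # can assert that the quickstart actually produced generated text.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    output = result.stdout + result.stderr
    assert marker in output, f"did not find {marker!r} in output"
    return output

# Demo with a stand-in command that prints the expected marker:
out = run_and_check_generated_text(
    [sys.executable, "-c", "print('Generated text: hello')"])
```

Capturing output this way also keeps the full log available for debugging when the assertion fails on a multinode run.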

Sequence Diagram

sequenceDiagram
    participant Test as Test Case
    participant Func as _run_multinode_accuracy
    participant LLM as LLM Engine
    participant Spec as Speculative Decoder
    participant Bench as Benchmark (MMLU/GSM8K)

    Test->>Func: invoke with model_path, benchmarks, tp_size, draft_model_path
    Func->>LLM: construct with tensor/pipeline parallelism
    activate LLM
    LLM-->>Func: LLM initialized
    deactivate LLM

    alt draft_model_path provided
        Func->>Spec: configure with draft model
        activate Spec
        Spec-->>Func: speculative decoding ready
        deactivate Spec
    end

    loop for each benchmark
        Func->>Bench: evaluate benchmark
        activate Bench
        Bench->>LLM: query with kv_cache_config
        LLM-->>Bench: results
        Bench-->>Func: accuracy metrics
        deactivate Bench
    end

    Func-->>Test: accuracy results

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Description check (⚠️ Warning): the PR description is minimal and does not follow the template or give adequate detail. Resolution: describe the objective, approach, and test coverage; use a proper ticket/type title format; complete the PR checklist.
✅ Passed checks (1 passed)
  • Title check (✅ Passed): the title accurately describes the main change, adding multinode end-to-end and accuracy test cases for DGX-Spark.



Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

94-120: Validate benchmarks before opening the multinode LLM.

An unsupported entry currently fails only after the model is fully constructed, which turns a simple typo into a very expensive two-node failure. Precompute the task list before entering the with LLM(...) block.

♻️ Proposed fix
 def _run_multinode_accuracy(model_path,
                             model_name,
                             *,
                             benchmarks,
                             tp_size=2,
                             pp_size=1,
                             ep_size=1,
                             draft_model_path=None,
                             max_draft_len=2,
                             kv_cache_config=None,
                             max_num_tokens=4096,
                             max_batch_size=1,
                             **llm_kwargs):
     benchmark_task_map = {
         "mmlu": MMLU,
         "gsm8k": GSM8K,
     }
+    invalid_benchmarks = [
+        benchmark for benchmark in benchmarks
+        if benchmark not in benchmark_task_map
+    ]
+    if invalid_benchmarks:
+        raise ValueError(f"Unsupported benchmarks: {invalid_benchmarks}")
+    tasks = [benchmark_task_map[benchmark](model_name)
+             for benchmark in benchmarks]
     if kv_cache_config is None:
         kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.5,
                                         enable_block_reuse=draft_model_path
                                         is None)
@@
     with LLM(model_path,
              tensor_parallel_size=tp_size,
              pipeline_parallel_size=pp_size,
              moe_expert_parallel_size=ep_size,
              max_num_tokens=max_num_tokens,
              max_batch_size=max_batch_size,
              kv_cache_config=kv_cache_config,
              speculative_config=spec_config,
              **llm_kwargs) as llm:
-        for benchmark in benchmarks:
-            task_cls = benchmark_task_map.get(benchmark)
-            if task_cls is None:
-                raise ValueError(f"Unsupported benchmark: {benchmark}")
-            task = task_cls(model_name)
+        for task in tasks:
             task.evaluate(llm)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py` around lines 94 -
120, Pre-validate the requested benchmarks before constructing the distributed
LLM: iterate over the incoming benchmarks list and map each entry using
benchmark_task_map to produce a task list (or raise ValueError for unknown
names) prior to entering the with LLM(...) context; update code around
benchmark_task_map, benchmarks, and task_cls so the unsupported-benchmark check
happens before LLM(...) is called, and then use the precomputed task list inside
the with-block.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py`:
- Line 81: SPDX header year is stale after modifying this file; update the
top-of-file copyright year to the latest meaningful modification year. Open the
file and edit the SPDX header on line 1 to replace 2025 with 2026 so it matches
the recent changes (this file contains the function _run_multinode_accuracy
which was modified), ensure only the year is changed and keep the rest of the
header intact.
- Around line 81-123: Add an optional expected_quant_algo parameter to
_run_multinode_accuracy (default None) and after the LLM is constructed (inside
the with LLM(...) as llm: block) if expected_quant_algo is not None assert that
llm.args.quant_config.quant_algo == expected_quant_algo (or raise ValueError
with a clear message) so callers can verify post-load quantization; update any
callers to pass expected_quant_algo=QuantAlgo.NVFP4/FP8 as needed.

---

Nitpick comments:
In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py`:
- Around line 94-120: Pre-validate the requested benchmarks before constructing
the distributed LLM: iterate over the incoming benchmarks list and map each
entry using benchmark_task_map to produce a task list (or raise ValueError for
unknown names) prior to entering the with LLM(...) context; update code around
benchmark_task_map, benchmarks, and task_cls so the unsupported-benchmark check
happens before LLM(...) is called, and then use the precomputed task list inside
the with-block.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f04ce185-8cd4-4b89-96d9-865833df93f9

📥 Commits

Reviewing files that changed from the base of the PR and between b8c96d4 and 17c8017.

📒 Files selected for processing (5)
  • tests/integration/defs/accuracy/references/mmlu.yaml
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/integration/defs/test_e2e.py
  • tests/integration/test_lists/qa/llm_spark_func.txt
  • tests/integration/test_lists/qa/llm_spark_func.yml
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/qa/llm_spark_func.txt
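The shape of the new llm_spark_func.yml manifest, with its system_gpu_count conditions, is roughly as below. The condition keys and test names are an assumption based on the walkthrough, not the file's actual contents:

```yaml
# Hypothetical excerpt: gate multinode cases on system_gpu_count so they
# only run when at least 2 GPUs (2 nodes) are available on DGX-Spark.
- condition:
    ranges:
      system_gpu_count:
        gte: 2
  tests:
  - accuracy/test_llm_api_pytorch.py::TestLlama3_1_70B::test_auto_dtype_tp2
  - test_e2e.py::test_ptp_quickstart_advanced_multinode
```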

@tensorrt-cicd
Collaborator

PR_Github #38581 [ run ] triggered by Bot. Commit: 17c8017 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38581 [ run ] completed with state SUCCESS. Commit: 17c8017
/LLM/main/L0_MergeRequest_PR pipeline #29918 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@JennyLiu-nv
Collaborator Author

/bot run

