
[CI] Add inference performance regression tests #1140

Merged
Eigensystem merged 14 commits into hao-ai-lab:main from AjAnubolu:ci/performance-tests
Mar 7, 2026

Conversation

@AjAnubolu (Collaborator)

Summary

  • Adds automated performance CI measuring generation latency and peak GPU memory for Wan2.1-T2V-1.3B
  • 1 warmup + 3 measurement runs with device-aware thresholds (initial placeholders for L40S); a sketch of the loop follows below
  • Writes JSON results for future trend tracking
  • Triggers on changes to dits, pipelines, attention, layers, worker, entrypoints
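
A minimal sketch of the measurement loop, assuming a `generate_video` callable that stands in for the real Wan2.1-T2V-1.3B pipeline call (the helper name and return shape are illustrative, not the PR's exact code):

```python
import time

import torch

WARMUP_RUNS = 1        # excluded from the reported numbers
MEASUREMENT_RUNS = 3   # averaged for the latency check

def measure(generate_video) -> dict:
    # Warmup: the first run pays one-time costs (model load, cache warming).
    for _ in range(WARMUP_RUNS):
        generate_video()

    times, peaks = [], []
    for _ in range(MEASUREMENT_RUNS):
        torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        generate_video()
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
        peaks.append(torch.cuda.max_memory_allocated() / (1024 ** 2))  # MB

    return {
        "avg_generation_time_s": sum(times) / len(times),
        "max_peak_memory_mb": max(peaks),
        "runs": [{"time_s": t, "peak_mb": m} for t, m in zip(times, peaks)],
    }
```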

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the continuous integration pipeline by integrating automated performance regression tests. The primary goal is to proactively monitor and prevent performance degradations in video generation models, specifically focusing on generation latency and GPU memory consumption. By establishing device-aware thresholds and logging detailed results, the changes ensure that critical performance metrics remain within acceptable bounds, contributing to the overall stability and efficiency of the system.

Highlights

  • Automated Performance CI: Introduced a new automated performance CI step to measure generation latency and peak GPU memory for the Wan2.1-T2V-1.3B model.
  • Device-Aware Thresholds: Implemented device-aware performance thresholds (initially for L40S GPUs) to validate generation time and peak memory, with a fallback for unknown devices.
  • Result Tracking: Configured the performance tests to write detailed JSON results, including individual run metrics and thresholds, for future trend tracking and analysis.
  • Targeted Triggering: The new performance tests are configured to trigger specifically on changes to core components like dits, pipelines, attention, layers, worker, entrypoints, and performance test files themselves.


Changelog
  • .buildkite/pipeline.yml
    • Added a new Buildkite step named 'Performance Tests' to the CI pipeline.
    • Configured the performance test step to run on specific file path changes, including core model components and the performance test directory.
    • Set the TEST_TYPE environment variable to 'performance' for the new step.
  • .buildkite/scripts/pr_test.sh
    • Extended the case statement to handle a new TEST_TYPE value, 'performance'.
    • Mapped the 'performance' test type to execute the run_performance_tests Modal function.
  • fastvideo/tests/modal/pr_test.py
    • Added a new Modal function run_performance_tests to orchestrate the execution of performance tests.
    • Configured run_performance_tests to use two L40S GPUs, a 30-minute timeout, and specific environment variables for Hugging Face cache and PyTorch CUDA memory allocation.
  • fastvideo/tests/performance/test_inference_performance.py
    • Created a new pytest file test_inference_performance.py dedicated to measuring video generation performance.
    • Defined parameters for the Wan2.1-T2V-1.3B model, including num_gpus, model_path, height, width, num_frames, and inference_steps.
    • Implemented logic for device-aware performance thresholds, with initial values for L40S GPUs and a default fallback (sketched after this changelog).
    • Included helper functions for retrieving thresholds, shutting down executors, running single generations, and writing results to JSON files.
    • The main test function performs warmup runs, multiple measurement runs, calculates average generation time and maximum peak GPU memory, and asserts these against the defined thresholds.
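
The changelog items above can be pictured with this hedged sketch of the device-aware threshold lookup and JSON writer; the helper names and limit values are placeholders, not the PR's actual constants:

```python
import json
import os

import torch

# Placeholder limits; the real values are calibrated per device (see the
# baseline discussion later in this thread).
THRESHOLDS = {
    "L40S": {"gen_time_s": 34.0, "peak_memory_mb": 10700.0},
}
DEFAULT_THRESHOLDS = {"gen_time_s": 60.0, "peak_memory_mb": 20000.0}

def get_thresholds() -> dict:
    device_name = torch.cuda.get_device_name(0)
    for key, limits in THRESHOLDS.items():
        if key in device_name:
            return limits
    return DEFAULT_THRESHOLDS  # fallback for unknown devices

def write_results(results: dict, path: str) -> None:
    # Persist one JSON file per run so trends can be tracked over time.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
```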
Activity
  • AjAnubolu created this pull request to introduce automated inference performance regression tests.
  • The pull request body provides a summary of the changes, including the scope of the tests and trigger conditions.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces automated performance regression tests for the Wan2.1-T2V-1.3B model, measuring generation latency and peak GPU memory. The changes include updates to the Buildkite pipeline, the pr_test.sh script, and a new Modal test file. The new test_inference_performance.py file sets up device-aware thresholds and writes JSON results for trend tracking, which is a valuable addition for monitoring performance over time. The implementation is generally robust, with proper resource cleanup and clear assertion messages.

- "pyproject.toml"
- "docker/Dockerfile.python3.12"
config:
command: "timeout 30m .buildkite/scripts/pr_test.sh"
Severity: medium

The command for the "Performance Tests" step includes timeout 30m. A similar timeout (timeout=1800) is also specified in the run_performance_tests function within fastvideo/tests/modal/pr_test.py. It's generally better to have a single source of truth for timeouts to avoid confusion and potential conflicts. Consider removing one of these timeouts or clarifying their intended roles (e.g., Buildkite timeout as a failsafe, Modal timeout as the primary control).


```python
logger = init_logger(__name__)

REQUIRED_GPUS = 2
```
Severity: medium

The REQUIRED_GPUS constant is defined but not used within this file. If this constant is meant to enforce or indicate the number of GPUs required for the test, consider adding a check to ensure the test environment meets this requirement (e.g., assert torch.cuda.device_count() == REQUIRED_GPUS). Otherwise, it can be removed to avoid dead code.
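
One possible shape for that guard, sketched here (skipping rather than hard-failing is a judgment call, and `require_gpus` is a hypothetical helper):

```python
import pytest
import torch

REQUIRED_GPUS = 2

def require_gpus() -> None:
    # Skip the test when the environment lacks the required GPU count,
    # instead of leaving REQUIRED_GPUS as dead code.
    found = torch.cuda.device_count()
    if found < REQUIRED_GPUS:
        pytest.skip(f"requires {REQUIRED_GPUS} GPUs, found {found}")
```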

```python
results_dir = os.path.join(script_dir, "results")
os.makedirs(results_dir, exist_ok=True)

timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
```
Severity: medium

The timestamp for the filename is generated using strftime("%Y%m%dT%H%M%SZ"), while the timestamp within the JSON results (line 210) uses datetime.now(timezone.utc).isoformat(). For consistency, it would be beneficial to use the same format for both, preferably isoformat() if the filename can handle the characters, or explicitly define a common format string for both uses.

Suggested change:

```diff
- timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
+ timestamp = datetime.now(timezone.utc).isoformat().replace(":", "-").replace(".", "-")  # ISO 8601-compatible and filename-safe
```
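
Another way to satisfy the single-source-of-truth suggestion is a shared format constant, sketched here (`TIMESTAMP_FORMAT` and `utc_timestamp` are hypothetical names):

```python
from datetime import datetime, timezone

# One filename-safe format, used both for the results filename and for the
# timestamp field inside the JSON payload.
TIMESTAMP_FORMAT = "%Y%m%dT%H%M%SZ"

def utc_timestamp() -> str:
    return datetime.now(timezone.utc).strftime(TIMESTAMP_FORMAT)
```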

@Eigensystem (Collaborator) left a comment

Overall looks good. Maybe you can also record the current performance data and check in CI whether subsequent pull requests cause a performance drop.
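
A minimal sketch of that idea, assuming a committed baseline JSON; the path, keys, and 1.2x factor are illustrative:

```python
import json

REGRESSION_FACTOR = 1.2  # fail if >20% slower or heavier than the baseline

def check_against_baseline(current: dict, baseline_path: str) -> None:
    with open(baseline_path) as f:
        baseline = json.load(f)
    assert current["avg_generation_time_s"] <= (
        baseline["avg_generation_time_s"] * REGRESSION_FACTOR
    ), "generation latency regressed past the baseline threshold"
    assert current["max_peak_memory_mb"] <= (
        baseline["max_peak_memory_mb"] * REGRESSION_FACTOR
    ), "peak GPU memory regressed past the baseline threshold"
```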

@AjAnubolu force-pushed the ci/performance-tests branch from 901a738 to 094c69f on March 2, 2026 at 04:28
@AjAnubolu (Collaborator, Author)

Calibrated the L40S thresholds from a baseline run (28.3 s avg latency, 8908 MB peak memory) and set them at 1.2x the baseline to catch performance regressions. Results are also written to JSON after each run for tracking, but the baseline needs to be manually updated with the current implementation. Does this look reasonable, or do you have any recommendations for improvement?
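
For concreteness, the calibration arithmetic implied by those numbers (baseline values from the comment above; the constant names are illustrative):

```python
BASELINE_GEN_TIME_S = 28.3        # measured avg latency on L40S
BASELINE_PEAK_MEMORY_MB = 8908    # measured peak GPU memory on L40S
REGRESSION_FACTOR = 1.2

GEN_TIME_THRESHOLD_S = BASELINE_GEN_TIME_S * REGRESSION_FACTOR          # 33.96 s
PEAK_MEMORY_THRESHOLD_MB = BASELINE_PEAK_MEMORY_MB * REGRESSION_FACTOR  # 10689.6 MB
```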

@AjAnubolu added the go (Trigger Buildkite CI) label on Mar 5, 2026
@AjAnubolu requested a review from Eigensystem on March 5, 2026 at 01:25
@Eigensystem (Collaborator) left a comment

Overall looks pretty good. How should we use the dashboard?

@AjAnubolu (Collaborator, Author)

I just saw that vLLM has a dashboard to track and visualize performance over time (perf.vllm.ai); I can change or remove it if unnecessary. I don't think SGLang has one.

@Eigensystem (Collaborator)

> I just saw that vLLM has a dashboard to track and visualize performance over time (perf.vllm.ai); I can change or remove it if unnecessary. I don't think SGLang has one.

Maybe remove it from this PR?

@Eigensystem merged commit 02c1c49 into hao-ai-lab:main on Mar 7, 2026
2 of 3 checks passed