[CI] Add inference performance regression tests #1140
Eigensystem merged 14 commits into hao-ai-lab:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances the continuous integration pipeline by adding automated performance regression tests. The goal is to proactively catch performance degradations in video generation models, specifically in generation latency and GPU memory consumption. By establishing device-aware thresholds and logging detailed results, the changes keep critical performance metrics within acceptable bounds and contribute to the overall stability and efficiency of the system.
Activity
Code Review
This pull request introduces automated performance regression tests for the Wan2.1-T2V-1.3B model, measuring generation latency and peak GPU memory. The changes include updates to the Buildkite pipeline, the pr_test.sh script, and a new Modal test file. The new test_inference_performance.py file sets up device-aware thresholds and writes JSON results for trend tracking, which is a valuable addition for monitoring performance over time. The implementation is generally robust, with proper resource cleanup and clear assertion messages.
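The JSON trend tracking described above might look roughly like the following minimal sketch. The function name, file layout, and field names are assumptions for illustration, not the PR's actual schema:

```python
import json
import os
from datetime import datetime, timezone

def write_perf_result(results_dir: str, latency_s: float, peak_mem_mb: float) -> str:
    """Append one timestamped performance record as a standalone JSON file."""
    os.makedirs(results_dir, exist_ok=True)
    now = datetime.now(timezone.utc)
    record = {
        "timestamp": now.isoformat(),
        "latency_s": latency_s,
        "peak_mem_mb": peak_mem_mb,
    }
    # One file per run keeps the results directory append-only, which makes
    # trend tracking across CI runs straightforward.
    path = os.path.join(results_dir, now.strftime("%Y%m%dT%H%M%SZ") + ".json")
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return path
```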
| - "pyproject.toml" | ||
| - "docker/Dockerfile.python3.12" | ||
| config: | ||
| command: "timeout 30m .buildkite/scripts/pr_test.sh" |
The command for the "Performance Tests" step includes timeout 30m. A similar timeout (timeout=1800) is also specified in the run_performance_tests function within fastvideo/tests/modal/pr_test.py. It's generally better to have a single source of truth for timeouts to avoid confusion and potential conflicts. Consider removing one of these timeouts or clarifying their intended roles (e.g., Buildkite timeout as a failsafe, Modal timeout as the primary control).
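One hypothetical way to get a single source of truth is to let the Buildkite step export the timeout and have the Modal entrypoint read it, keeping the current 1800s as a fallback. The variable name `PERF_TEST_TIMEOUT_S` is an assumption, not something the PR defines:

```python
import os

# Assumed env var exported by the Buildkite step; the Modal entrypoint
# falls back to the current 1800s default when it is unset.
DEFAULT_PERF_TEST_TIMEOUT_S = 1800

def perf_test_timeout() -> int:
    """Resolve the performance-test timeout from one shared setting."""
    return int(os.environ.get("PERF_TEST_TIMEOUT_S", DEFAULT_PERF_TEST_TIMEOUT_S))
```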
logger = init_logger(__name__)

REQUIRED_GPUS = 2
The REQUIRED_GPUS constant is defined but not used within this file. If this constant is meant to enforce or indicate the number of GPUs required for the test, consider adding a check to ensure the test environment meets this requirement (e.g., assert torch.cuda.device_count() == REQUIRED_GPUS). Otherwise, it can be removed to avoid dead code.
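A minimal sketch of enforcing the constant rather than leaving it as dead code. In the real test the count would come from `torch.cuda.device_count()`; it is injected here as a parameter so the check is unit-testable without a GPU:

```python
REQUIRED_GPUS = 2  # constant from the PR's test file

def gpu_requirement_met(device_count: int, required: int = REQUIRED_GPUS) -> bool:
    """Return True when the environment has enough GPUs for the test.

    device_count would be torch.cuda.device_count() in the actual test;
    it is passed in here so the predicate can be tested anywhere.
    """
    return device_count >= required
```

A test could then `pytest.skip` (or assert) when `gpu_requirement_met(...)` is False, making the requirement explicit instead of implicit.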
results_dir = os.path.join(script_dir, "results")
os.makedirs(results_dir, exist_ok=True)

timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
The timestamp for the filename is generated using strftime("%Y%m%dT%H%M%SZ"), while the timestamp within the JSON results (line 210) uses datetime.now(timezone.utc).isoformat(). For consistency, it would be beneficial to use the same format for both, preferably isoformat() if the filename can handle the characters, or explicitly define a common format string for both uses.
Suggested change:
- timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
+ timestamp = datetime.now(timezone.utc).isoformat().replace(":", "-").replace(".", "-")  # ISO 8601 compatible and filename-safe
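The consistency point above can be sketched by deriving both strings from one instant: the ISO form for the JSON payload and a filename-safe variant for the results file. The helper name is illustrative:

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

def make_timestamps(now: Optional[datetime] = None) -> Tuple[str, str]:
    """Derive the JSON timestamp and a filename-safe variant from one instant."""
    now = now or datetime.now(timezone.utc)
    iso = now.isoformat()  # goes into the JSON payload as-is
    # Same instant, with characters some filesystems reject replaced.
    filename_safe = iso.replace(":", "-").replace(".", "-")
    return iso, filename_safe
```

Because both strings come from the same `datetime` object, the filename and the JSON record can never disagree about when the run happened.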
Eigensystem
left a comment
Overall looks good. Maybe you can also record the current performance data and check whether subsequent pull requests will cause a performance drop in CI.
Force-pushed 901a738 to 094c69f
Adds back PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True which was accidentally removed, causing a ~2x performance regression. Thresholds calibrated from baseline (28.3s / 8908MB) at 1.2x.
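The restored setting would be a one-line export in the CI script; exact placement within `.buildkite/scripts/pr_test.sh` is an assumption here. `expandable_segments` lets the CUDA caching allocator grow existing memory segments instead of fragmenting, which is why losing it cost roughly 2x in latency:

```shell
# Re-add the allocator setting that was accidentally removed.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```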
Calibrated L40S thresholds from a baseline run (28.3s avg latency, 8908MB peak memory) and set a 1.2x threshold to check for performance regressions. Results are also written to JSON after each run for tracking, but the baseline needs to be updated manually in the current implementation. Does this look reasonable, or do you have any recommendations for improvement?
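The calibrated check reduces to a simple predicate. The baseline numbers below are the ones quoted in the discussion; the variable and function names are illustrative, not the PR's actual code:

```python
# Baseline figures quoted in the PR discussion (L40S GPU).
BASELINE_LATENCY_S = 28.3
BASELINE_PEAK_MEM_MB = 8908.0
REGRESSION_FACTOR = 1.2  # allow up to 20% drift before failing CI

def within_threshold(measured: float, baseline: float,
                     factor: float = REGRESSION_FACTOR) -> bool:
    """Return True when a measurement stays within factor x baseline."""
    return measured <= baseline * factor
```

With a 1.2x factor, a run at 30.0s passes (limit is 33.96s) while a run at 35.0s fails, which matches the intent of flagging a genuine regression rather than normal run-to-run noise.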
Eigensystem
left a comment
Overall looks pretty good. How should we use the dashboard?
I just saw that vLLM has a dashboard to track and visualize performance over time (perf.vllm.ai). I can change or remove this if it's unnecessary; I don't think SGLang has one.
Maybe remove it from this PR?