test: add asynchronous benchmark script to measure inference concurrency#2185
test: add asynchronous benchmark script to measure inference concurrency#2185GuilhermeGors wants to merge 6 commits intoQwenLM:mainfrom
Conversation
This tool tracks concurrent request handling, temporal overlap, and serialization vs parallelism behavior, specifically targeting the Qwen 3.5 GDN architectural bottlenecks.
There was a problem hiding this comment.
Pull request overview
Adds a standalone async benchmark script to probe whether Ollama serves multiple simultaneous inference requests in parallel or serializes them, with reporting and JSON export to help investigate Qwen 3.5 concurrency behavior (relates to #2155).
Changes:
- Introduces an
aiohttp-based async runner that fires N simultaneous/api/generatestreaming requests and captures TTFT/total time. - Implements a concurrency analysis heuristic (overlap/serialization verdict) plus a textual timeline visualization.
- Exports results and derived metrics to a timestamped JSON file for offline analysis.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ormalization, and safe imports
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Replaced 'if value:' with 'if value is not None:' to prevent valid 0.0 metrics from being dropped.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Applied Copilot review to prevent Path Traversal vulnerabilities by stripping illegal directory characters from --model strings and ensuring output is strictly contained within ./bench_results using os.path.realpath validation.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…_ttft display Added --output-dir CLI argument (defaults to bench_results), replaced PEP 585 list[str] with typing.List[str] + from __future__ import annotations for Python 3.8 support, and fixed misleading 0.000s avg TTFT display when no data exists.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Applied a complete purge of inline comments to avoid linter false alarms and clear up the iteration trail.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
All automated code review feedback has been resolved. The benchmark script has been fully refactored and is now production-ready.
Key fixes implemented:
- Accuracy: Fixed overlap mathematical formulas and token counts (using
eval_count) to ensure strict parallelism detection. - Safety: Added input sanitization against Path Traversal and isolated JSON outputs to a dedicated
./bench_results/directory. - Stability: Added
errors="replace"to prevent UTF-8 streaming crashes and handled silent JSON decode failures. - Compatibility: Enforced Python 3.8 support by deferring type annotations with
__future__.
The tool is safe, exact, and ready to be merged
This tool tracks concurrent request handling, temporal overlap, and serialization vs parallelism behavior, specifically targeting the Qwen 3.5 GDN architectural bottlenecks.
Relates to #2155