other: Script for debugging and metrics #112
Conversation
Pull Request Overview
This PR adds debugging and metrics capabilities to the RBLN backend for performance measurement and validation. It introduces a performance tracking system to measure latency and throughput for both prefill and decode operations, along with validation scripts for comparing model outputs.
Key changes:
- Added performance tracking infrastructure with metrics collection
- Enhanced debug logging for request tracking
- Created validation and performance measurement scripts
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vllm_rbln/worker/model_runner.py | Added performance tracking, timing measurements, and debug logging for prefill/decode steps |
| vllm_rbln/worker/metrics.py | New metrics tracking module with PerformanceTracker class for collecting and reporting performance statistics |
| vllm_rbln/rbln_envs.py | Added RBLN_METRICS environment variable configuration |
| scripts/validation/compare_logprobs_advanced.py | New advanced validation script for comparing model outputs between CPU and RBLN backends |
| scripts/validation/compare_logprobs.py | Refactored existing validation script to improve structure and device-specific configuration |
| scripts/performance/measure_latency_advanced.py | New performance measurement script with configurable parameters for throughput and latency testing |
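The metrics.py module above centers on a PerformanceTracker class. As a rough sketch of what a latency tracker of that kind could look like (the class and method names here are illustrative, not the PR's actual implementation):

```python
import time
from typing import List


class LatencyTracker:
    """Illustrative latency tracker; the PR's PerformanceTracker may differ."""

    def __init__(self) -> None:
        self.latencies: List[float] = []  # per-step latencies in seconds

    def record(self, start_time: float) -> None:
        # Store the elapsed wall-clock time for one prefill or decode step.
        self.latencies.append(time.perf_counter() - start_time)

    def mean_latency_ms(self) -> float:
        # Average step latency in milliseconds (0.0 when nothing was recorded).
        if not self.latencies:
            return 0.0
        return (sum(self.latencies) / len(self.latencies)) * 1000

    def throughput_steps_per_s(self) -> float:
        # Steps completed per second of measured time.
        total = sum(self.latencies)
        return len(self.latencies) / total if total else 0.0
```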
```python
if model_input.attn_metadata is not None:
    model_input.attn_metadata.kv_caches = kv_caches

start_time = time.perf_counter()
```
Copilot AI · Oct 16, 2025
The timing measurement should be conditional on the RBLN_METRICS environment variable to avoid unnecessary overhead when metrics are disabled.
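A possible shape for that gating, assuming the flag surfaces as an RBLN_METRICS environment variable (the parsing below is an illustrative guess, not the PR's actual rbln_envs code):

```python
import os
import time

# Assumed flag name from this PR's rbln_envs.py; parse it however the project exposes it.
RBLN_METRICS = os.environ.get("RBLN_METRICS", "0") == "1"

# Only pay the timing cost when metrics collection is enabled.
start_time = time.perf_counter() if RBLN_METRICS else None

# ... run the prefill or decode step here ...

if start_time is not None:
    elapsed = time.perf_counter() - start_time
    # hand `elapsed` off to the performance tracker
```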
```python
return sum(self.latencies) / len(
    self.latencies) * 1000 if self.latencies else 0.0
```
Copilot AI · Oct 16, 2025
The expression should be parenthesized for clarity: (sum(self.latencies) / len(self.latencies)) * 1000 to make the order of operations explicit.
Suggested change:
```diff
-return sum(self.latencies) / len(
-    self.latencies) * 1000 if self.latencies else 0.0
+return (sum(self.latencies) / len(
+    self.latencies)) * 1000 if self.latencies else 0.0
```
| "max_logprobs": VOCAB_SIZE, | ||
| } | ||
| if device == "cpu": | ||
| llm_args["block_size"] = 128 # 1024 is not working for long prompt |
Copilot AI · Oct 16, 2025
The comment should explain why 1024 doesn't work for long prompts on CPU to help future maintainers understand the limitation.
Suggested change:
```diff
-    llm_args["block_size"] = 128  # 1024 is not working for long prompt
+    llm_args["block_size"] = 128  # On CPU, using a block_size of 1024 can cause excessive memory usage or performance issues with long prompts, leading to failures. Reducing block_size to 128 avoids these issues.
```
```python
# (Runtime) code=203 INIT_ALREADY_CREATED:
# A runtime has already been created for that compiled model
# (Context failed to be created, compile_id=0).
# Try creating a runtime on a different NPU(s), or use an existing runtime.
```
Copilot AI · Oct 16, 2025
This FIXME comment describes a known issue but should include a reference to a tracking issue or ticket number for resolution.
Suggested change:
```diff
 # Try creating a runtime on a different NPU(s), or use an existing runtime.
+# Tracking issue: https://github.com/rebellions-inc/repo/issues/123
```
|
@huijjj FYI
|
Regarding the metrics, vLLM already has a built-in metric logging system in its codebase. Wouldn't it be better to leverage that instead? Since it is implemented in LLMEngine itself rather than in the worker or model runner, we can leverage it regardless of which worker (optimum or torch.compile) we use.
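For reference, that built-in logging can be switched on from the offline entrypoint roughly like this (the model name is a placeholder, and the exact stats printed depend on the vLLM version):

```python
from vllm import LLM, SamplingParams

# disable_log_stats defaults to True for the offline LLM entrypoint; setting it
# to False enables the engine's periodic stats logging (prompt/generation
# throughput, running/pending requests, KV-cache usage).
llm = LLM(model="meta-llama/Llama-3.1-8B", disable_log_stats=False)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=64),
)
```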
|
@huijjj The purpose of this test is a bit different from what you described.
I see, your goal is to measure prefill and decode separately within a single LLMEngine.generate call. I'm not entirely convinced this change is necessary; it might be better to achieve that without modifying the core path. For instance, we could approximate it by running the same workload twice: (a) with the output length fixed to 1, and (b) with the normal output length, then comparing the results. It's not perfect, since scheduler and KV-cache effects could skew the comparison, but it could provide good-enough results while keeping the core untouched.

For benchmarking, our team has been using the scripts in the benchmark-script branch for performance reporting. Most of that code is copy-pasted from vLLM's benchmarking scripts, per our prior agreement, so we can produce comparable and objective results. In this PR I see new benchmarking scripts and metrics; could you share the context for introducing them? Do you intend for us to use these for future measurements and reporting?
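A rough sketch of that two-run approximation, with a placeholder model and prompt set (not a script from this PR or the benchmark-script branch):

```python
import time

from vllm import LLM, SamplingParams

# Placeholder model and prompts; reuse whatever workload the benchmark targets.
llm = LLM(model="meta-llama/Llama-3.1-8B")
prompts = ["Explain KV caching in one paragraph."] * 8


def timed_generate(max_tokens: int) -> float:
    start = time.perf_counter()
    llm.generate(prompts, SamplingParams(max_tokens=max_tokens, temperature=0.0))
    return time.perf_counter() - start


# (a) output length fixed to 1: dominated by prefill
prefill_time = timed_generate(max_tokens=1)
# (b) normal output length: prefill + decode
total_time = timed_generate(max_tokens=128)

# Rough decode estimate; scheduler and KV-cache effects can skew this.
decode_time = total_time - prefill_time
print(f"prefill ~= {prefill_time:.3f}s, decode ~= {decode_time:.3f}s")
```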
Hello, and thank you for sharing your idea and benchmark. vLLM's metric manager is tightly coupled with the vLLM engine, and its measurements include vLLM-side operations. I think it is good for service monitoring, but not for developing and measuring the device operations themselves.
@rebel-eunji
🚀 Summary of Changes
📌 Related Issues / Tickets
✅ Type of Change
- feature
- model
- core
- bug-fix
- perf
- refactor
- docs
- other: please describe
🧪 How to Test
- `python scripts/validation/compare_logprobs_advanced.py`
- `python scripts/performance/measure_latency_advanced.py`
📸 Screenshots / Logs (if applicable)
📋 Checklist
💬 Notes