
Conversation

@rebel-eunji
Contributor

@rebel-eunji rebel-eunji commented Oct 16, 2025

🚀 Summary of Changes

  • Debugging
    • Given the prompt length and the number of generation steps, we can validate the generated results
  • Metrics
    • Given the prompt length and the number of generation steps, we can measure latency and throughput

📌 Related Issues / Tickets

  • Resolves #
  • Related to #

✅ Type of Change

  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (bug-fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

  1. Run the following scripts:
  • python scripts/validation/compare_logprobs_advanced.py
  • python scripts/performance/measure_latency_advanced.py

📸 Screenshots / Logs (if applicable)


📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes

  • TODO
    • Add device time
    • Add TTFT
    • Add warmup option

Copilot AI left a comment

Pull Request Overview

This PR adds debugging and metrics capabilities to the RBLN backend for performance measurement and validation. It introduces a performance tracking system to measure latency and throughput for both prefill and decode operations, along with validation scripts for comparing model outputs.

Key changes:

  • Added performance tracking infrastructure with metrics collection
  • Enhanced debug logging for request tracking
  • Created validation and performance measurement scripts

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • vllm_rbln/worker/model_runner.py: Added performance tracking, timing measurements, and debug logging for prefill/decode steps
  • vllm_rbln/worker/metrics.py: New metrics tracking module with PerformanceTracker class for collecting and reporting performance statistics
  • vllm_rbln/rbln_envs.py: Added RBLN_METRICS environment variable configuration
  • scripts/validation/compare_logprobs_advanced.py: New advanced validation script for comparing model outputs between CPU and RBLN backends
  • scripts/validation/compare_logprobs.py: Refactored existing validation script to improve structure and device-specific configuration
  • scripts/performance/measure_latency_advanced.py: New performance measurement script with configurable parameters for throughput and latency testing
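
For readers, a minimal sketch of a per-phase tracker along these lines; the class and method names here are illustrative assumptions and may not match the actual PerformanceTracker API in vllm_rbln/worker/metrics.py:

```python
# Hypothetical sketch of a prefill/decode latency tracker.
from collections import defaultdict


class PerformanceTracker:
    def __init__(self) -> None:
        # Per-phase latencies in seconds, keyed by e.g. "prefill" / "decode".
        self._latencies: dict[str, list[float]] = defaultdict(list)
        self._tokens: dict[str, int] = defaultdict(int)

    def record(self, phase: str, latency_s: float, num_tokens: int) -> None:
        # Accumulate one measured step for the given phase.
        self._latencies[phase].append(latency_s)
        self._tokens[phase] += num_tokens

    def report(self) -> dict[str, dict[str, float]]:
        # Average latency (ms) and token throughput (tok/s) per phase.
        stats: dict[str, dict[str, float]] = {}
        for phase, lats in self._latencies.items():
            total = sum(lats)
            stats[phase] = {
                "avg_latency_ms": (total / len(lats)) * 1000 if lats else 0.0,
                "throughput_tok_s": self._tokens[phase] / total if total else 0.0,
            }
        return stats
```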


if model_input.attn_metadata is not None:
    model_input.attn_metadata.kv_caches = kv_caches

start_time = time.perf_counter()
Copilot AI Oct 16, 2025

The timing measurement should be conditional on the RBLN_METRICS environment variable to avoid unnecessary overhead when metrics are disabled.
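
A minimal sketch of the kind of gating this comment suggests, assuming RBLN_METRICS is read from the environment (the helper timed_call and the exact flag parsing are illustrative, not the code in this PR):

```python
import os
import time

# Assumed to mirror the RBLN_METRICS flag added in rbln_envs.py;
# the actual parsing there may differ.
RBLN_METRICS = os.environ.get("RBLN_METRICS", "0") == "1"


def timed_call(fn, *args, **kwargs):
    """Run fn, measuring wall-clock time only when metrics are enabled."""
    if not RBLN_METRICS:
        return fn(*args, **kwargs), None
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```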

Comment on lines +36 to +37
return sum(self.latencies) / len(
    self.latencies) * 1000 if self.latencies else 0.0
Copilot AI Oct 16, 2025

The expression should be parenthesized for clarity: (sum(self.latencies) / len(self.latencies)) * 1000 to make the order of operations explicit.

Suggested change
return sum(self.latencies) / len(
    self.latencies) * 1000 if self.latencies else 0.0
return (sum(self.latencies) / len(
    self.latencies)) * 1000 if self.latencies else 0.0

"max_logprobs": VOCAB_SIZE,
}
if device == "cpu":
    llm_args["block_size"] = 128  # 1024 is not working for long prompt
Copilot AI Oct 16, 2025

The comment should explain why 1024 doesn't work for long prompts on CPU to help future maintainers understand the limitation.

Suggested change
llm_args["block_size"] = 128 # 1024 is not working for long prompt
llm_args["block_size"] = 128 # On CPU, using a block_size of 1024 can cause excessive memory usage or performance issues with long prompts, leading to failures. Reducing block_size to 128 avoids these issues.

# (Runtime) code=203 INIT_ALREADY_CREATED:
# A runtime has already been created for that compiled model
# (Context failed to be created, compile_id=0).
# Try creating a runtime on a different NPU(s), or use an existing runtime.
Copilot AI Oct 16, 2025

This FIXME comment describes a known issue but should include a reference to a tracking issue or ticket number for resolution.

Suggested change
# Try creating a runtime on a different NPU(s), or use an existing runtime.
# Try creating a runtime on a different NPU(s), or use an existing runtime.
# Tracking issue: https://github.com/rebellions-inc/repo/issues/123

@rebel-jiwoopark
Collaborator

@huijjj FYI

@huijjj
Collaborator

huijjj commented Oct 16, 2025

Regarding the metrics, vLLM already has a built-in metric logging system in its codebase. Wouldn't it be better to leverage that instead? Since it is implemented in LLMEngine itself rather than in the worker or model runner, we could leverage it regardless of which worker (optimum or torch.compile) we use.

@rebel-eunji
Contributor Author

rebel-eunji commented Oct 16, 2025

@huijjj
Thank you for your review.

The purpose of this test is a bit different from what you described.
It aims to simply measure the average metrics of the prefill and decode phases separately over multiple iterations (rather than logging every step), without any additional services such as Prometheus or Grafana.
Therefore, reusing the built-in metric logging in vLLM may not be suitable for our purpose.

@huijjj
Collaborator

huijjj commented Oct 17, 2025

I see, your goal is to measure prefill and decode separately within a single LLMEngine.generate call. I'm not entirely convinced this change is necessary; it might be better to achieve that without modifying the core path. For instance, we could approximate it by running the same workload twice: (a) with the output length fixed to 1, and (b) with the normal output length, then comparing the results. It's not perfect, as scheduler and KV-cache effects could skew the comparison, but it could provide good enough results while keeping the core untouched.

For benchmarking, our team has been using the scripts in the benchmark-script branch for performance reporting. Most of that code is copy-pasted from vLLM's benchmarking scripts, per our prior agreement, so that we can produce comparable and objective results. In this PR I see new benchmarking scripts and metrics; could you share the context for introducing them? Do you intend for us to use these for future measurements and reporting?
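
For reference, a rough sketch of the two-run approximation described above, using vLLM's offline LLM API; the model name, prompt, and token counts are placeholders:

```python
import time

from vllm import LLM, SamplingParams

# (a) max_tokens=1 roughly isolates the prefill cost (the first token comes
#     from the prefill pass); (b) max_tokens=N adds N-1 decode steps.
llm = LLM(model="facebook/opt-125m")   # placeholder model
prompts = ["Hello, my name is"] * 8    # placeholder workload
n_tokens = 128

t0 = time.perf_counter()
llm.generate(prompts, SamplingParams(max_tokens=1, ignore_eos=True))
prefill_approx = time.perf_counter() - t0

t0 = time.perf_counter()
llm.generate(prompts, SamplingParams(max_tokens=n_tokens, ignore_eos=True))
full_run = time.perf_counter() - t0

# Per-step decode latency, ignoring scheduler and KV-cache effects.
decode_per_step = (full_run - prefill_approx) / (n_tokens - 1)
print(f"prefill ~= {prefill_approx:.3f}s, decode/step ~= {decode_per_step:.4f}s")
```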

@rebel-eunji
Contributor Author

Hello, and thank you for sharing your idea and benchmark.
After discussing it with our team, we've decided to proceed with this approach for our implementation.
For internal development, we need to measure fine-grained latency and throughput directly rather than relying on indirect methods.

vLLM's metric manager is tightly coupled with the vLLM engine, and its measurements include vLLM-side operations. I think it is well suited for service monitoring, but not for developing device-level operation.
In addition, we plan to measure the latency of operations running on the RBLN device at an even lower level.
Therefore, modifying the core components is unavoidable, but I'll do my best to minimize the changes.
This metric is used solely for internal development and is not associated with official benchmarking or reporting.

@huijjj
Collaborator

huijjj commented Oct 17, 2025

@rebel-eunji
Understood. As performance evaluation is an important part of our collaboration, I wanted to clarify to ensure we're on the same page. Thank you for the kind and thorough explanation!
