
Conversation

@rebel-eunji
Contributor

@rebel-eunji rebel-eunji commented Oct 16, 2025

🚀 Summary of Changes

  • Debugging
    • Given the prompt length and the number of generation steps, we can validate the generated results
  • Metrics
    • Given the prompt length and the number of generation steps, we can measure latency and throughput

📌 Related Issues / Tickets

  • Resolves #
  • Related to #

✅ Type of Change

  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (bug-fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

  1. Run the following scripts:
  • python scripts/validation/compare_logprobs_advanced.py
  • python scripts/performance/measure_latency_advanced.py

📸 Screenshots / Logs (if applicable)


📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes

  • TODO
    • Add device time
    • Add TTFT
    • Add warmup option

Copilot AI left a comment

Pull Request Overview

This PR adds debugging and metrics capabilities to the RBLN backend for performance measurement and validation. It introduces a performance tracking system to measure latency and throughput for both prefill and decode operations, along with validation scripts for comparing model outputs.

Key changes:

  • Added performance tracking infrastructure with metrics collection
  • Enhanced debug logging for request tracking
  • Created validation and performance measurement scripts

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • vllm_rbln/worker/model_runner.py: Added performance tracking, timing measurements, and debug logging for prefill/decode steps
  • vllm_rbln/worker/metrics.py: New metrics tracking module with PerformanceTracker class for collecting and reporting performance statistics
  • vllm_rbln/rbln_envs.py: Added RBLN_METRICS environment variable configuration
  • scripts/validation/compare_logprobs_advanced.py: New advanced validation script for comparing model outputs between CPU and RBLN backends
  • scripts/validation/compare_logprobs.py: Refactored existing validation script to improve structure and device-specific configuration
  • scripts/performance/measure_latency_advanced.py: New performance measurement script with configurable parameters for throughput and latency testing
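
For readers, a minimal sketch of a per-phase tracker along these lines; the class and method names here are illustrative assumptions and may not match the actual PerformanceTracker API in vllm_rbln/worker/metrics.py:

```python
# Hypothetical sketch of a prefill/decode latency tracker.
from collections import defaultdict


class PerformanceTracker:
    def __init__(self) -> None:
        # Per-phase latencies in seconds, keyed by e.g. "prefill" / "decode".
        self._latencies: dict[str, list[float]] = defaultdict(list)
        self._tokens: dict[str, int] = defaultdict(int)

    def record(self, phase: str, latency_s: float, num_tokens: int) -> None:
        # Accumulate one measured step for the given phase.
        self._latencies[phase].append(latency_s)
        self._tokens[phase] += num_tokens

    def report(self) -> dict[str, dict[str, float]]:
        # Average latency (ms) and token throughput (tok/s) per phase.
        stats: dict[str, dict[str, float]] = {}
        for phase, lats in self._latencies.items():
            total = sum(lats)
            stats[phase] = {
                "avg_latency_ms": (total / len(lats)) * 1000 if lats else 0.0,
                "throughput_tok_s": self._tokens[phase] / total if total else 0.0,
            }
        return stats
```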


if model_input.attn_metadata is not None:
    model_input.attn_metadata.kv_caches = kv_caches

start_time = time.perf_counter()
Copilot AI Oct 16, 2025

The timing measurement should be conditional on the RBLN_METRICS environment variable to avoid unnecessary overhead when metrics are disabled.
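
A minimal sketch of the kind of gating this comment suggests, assuming RBLN_METRICS is read from the environment (the helper timed_call and the exact flag parsing are illustrative, not the code in this PR):

```python
import os
import time

# Assumed to mirror the RBLN_METRICS flag added in rbln_envs.py;
# the actual parsing there may differ.
RBLN_METRICS = os.environ.get("RBLN_METRICS", "0") == "1"


def timed_call(fn, *args, **kwargs):
    """Run fn, measuring wall-clock time only when metrics are enabled."""
    if not RBLN_METRICS:
        return fn(*args, **kwargs), None
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```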

Comment on lines +36 to +37
return sum(self.latencies) / len(
    self.latencies) * 1000 if self.latencies else 0.0
Copilot AI Oct 16, 2025

The expression should be parenthesized for clarity: (sum(self.latencies) / len(self.latencies)) * 1000 to make the order of operations explicit.

Suggested change
return sum(self.latencies) / len(
    self.latencies) * 1000 if self.latencies else 0.0
return (sum(self.latencies) / len(
    self.latencies)) * 1000 if self.latencies else 0.0

"max_logprobs": VOCAB_SIZE,
}
if device == "cpu":
    llm_args["block_size"] = 128  # 1024 is not working for long prompt
Copilot AI Oct 16, 2025

The comment should explain why 1024 doesn't work for long prompts on CPU to help future maintainers understand the limitation.

Suggested change
llm_args["block_size"] = 128 # 1024 is not working for long prompt
llm_args["block_size"] = 128 # On CPU, using a block_size of 1024 can cause excessive memory usage or performance issues with long prompts, leading to failures. Reducing block_size to 128 avoids these issues.

# (Runtime) code=203 INIT_ALREADY_CREATED:
# A runtime has already been created for that compiled model
# (Context failed to be created, compile_id=0).
# Try creating a runtime on a different NPU(s), or use an existing runtime.
Copilot AI Oct 16, 2025

This FIXME comment describes a known issue but should include a reference to a tracking issue or ticket number for resolution.

Suggested change
# Try creating a runtime on a different NPU(s), or use an existing runtime.
# Try creating a runtime on a different NPU(s), or use an existing runtime.
# Tracking issue: https://github.com/rebellions-inc/repo/issues/123

@rebel-jiwoopark
Collaborator

@huijjj FYI

@huijjj
Collaborator

huijjj commented Oct 16, 2025

Regarding the metrics, vLLM already has a built-in metric logging system in its codebase. Wouldn't it be better to leverage that instead? Since it is implemented in LLMEngine itself rather than in the worker or model runner, we could leverage it regardless of which worker (optimum or torch.compile) we use.

@rebel-eunji
Contributor Author

rebel-eunji commented Oct 16, 2025

@huijjj
Thank you for your review.

The purpose of this test is a bit different from what you described.
It aims to simply measure the average metrics of the prefill and decode phases separately over multiple iterations (rather than logging every step), without any additional services such as Prometheus or Grafana.
Therefore, reusing the built-in metric logging in vLLM may not be suitable for our purpose.

@huijjj
Collaborator

huijjj commented Oct 17, 2025

I see, your goal is to measure prefill and decode separately within a single LLMEngine.generate call. I'm not entirely convinced this change is necessary; it might be better to achieve that without modifying the core path. For instance, we could approximate it by running the same workload twice: (a) with the output length fixed to 1, and (b) with the normal output length, then comparing the results. It's not perfect, as scheduler and KV-cache effects could skew the comparison, but it could provide good enough results while keeping the core untouched.

For benchmarking, our team has been using the scripts in the benchmark-script branch for performance reporting. Most of that code is copy-pasted from vLLM's benchmarking scripts, per our prior agreement, so that we can produce comparable and objective results. In this PR I see new benchmarking scripts and metrics; could you share the context for introducing them? Do you intend for us to use these for future measurements and reporting?
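
For reference, a rough sketch of the two-run approximation described above, using vLLM's offline LLM API; the model name, prompt, and token counts are placeholders:

```python
import time

from vllm import LLM, SamplingParams

# (a) max_tokens=1 roughly isolates the prefill cost (the first token comes
#     from the prefill pass); (b) max_tokens=N adds N-1 decode steps.
llm = LLM(model="facebook/opt-125m")   # placeholder model
prompts = ["Hello, my name is"] * 8    # placeholder workload
n_tokens = 128

t0 = time.perf_counter()
llm.generate(prompts, SamplingParams(max_tokens=1, ignore_eos=True))
prefill_approx = time.perf_counter() - t0

t0 = time.perf_counter()
llm.generate(prompts, SamplingParams(max_tokens=n_tokens, ignore_eos=True))
full_run = time.perf_counter() - t0

# Per-step decode latency, ignoring scheduler and KV-cache effects.
decode_per_step = (full_run - prefill_approx) / (n_tokens - 1)
print(f"prefill ~= {prefill_approx:.3f}s, decode/step ~= {decode_per_step:.4f}s")
```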

@rebel-eunji
Contributor Author

Hello, and thank you for sharing your idea and benchmark.
After discussing it with our team, we've decided to proceed with this approach for our implementation.
For internal development, we need to measure fine-grained latency and throughput directly rather than relying on indirect methods.

vLLM's metric manager is tightly coupled with the vLLM engine, and its measurements include vLLM-side operations. I think it is well suited for service monitoring, but not for developing device-level operation.
In addition, we plan to measure the latency of operations running on the RBLN device at an even lower level.
Therefore, modifying the core components is unavoidable, but I'll do my best to minimize the changes.
This metric is used solely for internal development and is not associated with official benchmarking or reporting.

@huijjj
Collaborator

huijjj commented Oct 17, 2025

@rebel-eunji
Understood. As performance evaluation is an important part of our collaboration, I wanted to clarify to ensure we're on the same page. Thank you for the kind and thorough explanation!
