other: log KV cache layout, warm-up phases, rbln backend invocations#589

Merged
rebel-jaehwang merged 4 commits into dev from logging on May 8, 2026
Conversation

Contributor

@rebel-jaehwang rebel-jaehwang commented May 7, 2026

🚀 Summary of Changes

To improve observability, log:

  • a summary of the KV cache shape and size
  • warm-up phases
  • information about each invocation of the torch.compile rbln backend (call stack, input shapes, and whether the compilation happens outside the warm-up phase)

📌 Related Issues / Tickets

https://github.com/rebellions-sw/vllm-rbln-internal/issues/63

✅ Type of Change

  • ❓ Other: logging and observability improvements

Example:

$ VLLM_RBLN_USE_VLLM_MODEL=1 VLLM_RBLN_DECODE_BATCH_BUCKET_STRATEGY=manual VLLM_RBLN_DECODE_BATCH_BUCKET_MANUAL_BUCKETS=1,8 python examples/experimental/offline_inference_basic.py
...
[rbln_model_runner.py:4417] KV cache: num_blocks=370, num_groups=1, num_tensors=16, total=11.562 GiB
[rbln_model_runner.py:4434] KV cache: 16 layer(s) shape/dtype
...
[rbln_model_runner.py:1924] Warm-up: prefill (seq_len=128)
...
[torch_compile_backend.py:83] rbln_backend [warm-up] rbln_model_runner.py:2924(execute_model) <- rbln_model_runner.py:2105(_execute_dummy_requests) <- rbln_model_runner.py:1925(_warm_up_model_inner): inputs=[(1, 128):torch.int64, (1, 128):torch.int64, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (1, 40):torch.int16, (40,):torch.int32, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (1,):torch.int32]
...
[torch_compile_backend.py:91] rbln_backend done: xx s
[rbln_model_runner.py:1977] Warm-up: decode (batch_bucket=8, query_len=1)
[torch_compile_backend.py:83] rbln_backend [warm-up] rbln_model_runner.py:2924(execute_model) <- rbln_model_runner.py:2105(_execute_dummy_requests) <- rbln_model_runner.py:1990(_warm_up_model_inner): inputs=[(8, 1):torch.int64, (8, 1):torch.int64, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (8, 40):torch.int16, (8, 40):torch.int32, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16]
...
[torch_compile_backend.py:91] rbln_backend done: xx s
[rbln_model_runner.py:1977] Warm-up: decode (batch_bucket=8, query_len=1)
[torch_compile_backend.py:83] rbln_backend [warm-up] rbln_model_runner.py:2924(execute_model) <- rbln_model_runner.py:2105(_execute_dummy_requests) <- rbln_model_runner.py:1990(_warm_up_model_inner): inputs=[(8, 1):torch.int64, (8, 1):torch.int64, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (8, 40):torch.int16, (8, 40):torch.int32, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16, (2, 370, 8, 1, 1024, 64):torch.bfloat16]
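As a sanity check, the `total=11.562 GiB` in the KV cache summary is consistent with the tensor shapes in the backend log: 16 bfloat16 tensors of shape `(2, 370, 8, 1, 1024, 64)`. A quick back-of-the-envelope calculation (the axis meanings are my reading of the shape, not stated in the PR):

```python
import math

# One KV-cache tensor per layer, bfloat16 (2 bytes per element).
shape = (2, 370, 8, 1, 1024, 64)   # shape taken from the log lines above
num_tensors = 16                   # "num_tensors=16" in the KV cache summary
bytes_per_elem = 2                 # bfloat16

per_tensor_bytes = math.prod(shape) * bytes_per_elem
total_gib = per_tensor_bytes * num_tensors / 2**30
print(f"total={total_gib:.3f} GiB")  # → total=11.562 GiB, matching the log
```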

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 65.15152% with 23 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| vllm_rbln/v1/worker/rbln_model_runner.py | 14.28% | 18 Missing ⚠️ |
| vllm_rbln/torch_compile_backend.py | 88.09% | 2 Missing and 3 partials ⚠️ |

📢 Thoughts on this report? Let us know!

@rebel-jaehwang rebel-jaehwang merged commit 698bddd into dev May 8, 2026
17 checks passed
@rebel-jaehwang rebel-jaehwang deleted the logging branch May 8, 2026 01:29