
Uplift for QWen3-8B #1885

Open

sott0n wants to merge 3 commits into dev from uplift-qwen3-8b

Conversation

@sott0n
Contributor

@sott0n sott0n commented Jan 26, 2026

Running Qwen3-8B on N150 with tt-inference-server v0.8.0 can result in an out-of-memory (OOM) error, as reported by the Discord community.

The issue has been resolved by this vLLM commit, which explicitly sets max_tokens_all_users.
This PR uplifts that fix to address the OOM issue and ensure Qwen3-8B runs correctly on N150.

fix #1869
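
For context, the fix boils down to bounding the aggregate token budget explicitly instead of letting it default to a value that overshoots N150's memory. A rough sketch of the idea (illustrative only; the names and numbers below are assumptions, not the actual code in the uplifted commit):

```python
# Illustrative sketch only: names and numbers below are assumptions,
# not the actual code or values from the uplifted vLLM commit.

MAX_CONTEXT_LEN = 40960         # assumed per-user context length for Qwen3-8B
MAX_CONCURRENT_USERS = 32       # batch size used in the sweeps below
N150_KV_BUDGET_TOKENS = 131072  # hypothetical KV-cache token budget that fits on N150

# Without an explicit cap, the worst case assumes every user at full context,
# which is what blows past N150's memory and triggers the OOM:
implicit_budget = MAX_CONTEXT_LEN * MAX_CONCURRENT_USERS

# The fix pins max_tokens_all_users explicitly so allocation is planned
# against a budget the device can actually hold:
max_tokens_all_users = min(implicit_budget, N150_KV_BUDGET_TOKENS)
```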

@sott0n sott0n requested a review from a team as a code owner January 26, 2026 12:15
@github-actions
Contributor

github-actions bot commented Jan 26, 2026

✅ Test Results - PASSED

Summary

| Component | Total | Passed | Skipped | Failed | Status |
|---|---|---|---|---|---|
| tt-inference-server | 388 | 388 | 0 | 0 | ✅ |
| tt-media-server | 467 | 467 | 0 | 0 | ✅ |
| Overall | 855 | 855 | 0 | 0 | ✅ |

Details

  • Python Version: 3.10
  • Workflow: Test Gate
  • Commit: 5375923
  • Run ID: 21928816225

🎉 All tests passed! This PR is ready for review.

@github-actions
Contributor

github-actions bot commented Jan 26, 2026

✅ Test Coverage Report

Coverage of Changed Lines

| Metric | Value |
|---|---|
| Coverage % |  |
| Threshold | 50% |
| Status | ✅ PASSED |

💡 This checks coverage of newly added/modified lines only, not total codebase coverage.

Contributor

@bgoelTT bgoelTT left a comment


Before we uplift the commits, can you please execute a Models CI dispatch run that proves the whole benchmark and accuracy evaluation workflows complete?

@sott0n
Contributor Author

sott0n commented Jan 30, 2026

I ran the release workflow with this Uplift commit, and it worked fine for everything except Galaxy. Galaxy seems to either hang or fail partway through, so I’m currently looking into it.

## Tenstorrent Model Release Summary: Qwen3-8B on n150

Metadata: Qwen3-8B on n150

{
"report_id": "id_tt-transformers_Qwen3-8B_n150_2026-01-30_00-53-40",
"model_name": "Qwen3-8B",
"model_id": "id_tt-transformers_Qwen3-8B_n150",
"model_spec_json": "/home/kyamaguchi/tt-inference-server/workflow_logs/run_specs/tt_model_spec_2026-01-29_23-59-57_id_tt-transformers_Qwen3-8B_n150_release_ELQmmwze.json",
"model_repo": "Qwen/Qwen3-8B",
"model_impl": "tt-transformers",
"inference_engine": "vLLM",
"device": "n150",
"server_mode": "docker",
"tt_metal_commit": "41345ac",
"vllm_commit": "628d4dc",
"run_command": "python run.py --model Qwen3-8B --device n150 --workflow release --docker-server"
}

Performance Benchmark Sweeps for Qwen3-8B on n150

vLLM Text-to-Text Performance Benchmark Sweeps for Qwen3-8B on n150

| Source | ISL | OSL | Concurrency | N Req | TTFT (ms) | TPOT (ms) | Tput User (TPS) | Tput Decode (TPS) | Tput Prefill (TPS) | E2EL (ms) | Req Tput (RPS) | Total Token Throughput (tokens/duration) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vLLM | 128 | 128 | 1 | 8 | 124.7 | 43.7 | 22.9 | 22.9 | 1026.3 | 5669.5 | 0.176 | 45.15 |
| vLLM | 128 | 128 | 32 | 256 | 3465.8 | 46.2 | 21.66 | 693.1 | 1181.8 | 9329.8 | 3.429 | 877.76 |
| vLLM | 128 | 1024 | 1 | 4 | 124.5 | 45.7 | 21.89 | 21.9 | 1028.4 | 46860.4 | 0.021 | 24.58 |
| vLLM | 128 | 1024 | 32 | 128 | 4250.2 | 52.7 | 18.99 | 607.5 | 963.7 | 58132.7 | 0.519 | 597.89 |
| vLLM | 1024 | 128 | 1 | 4 | 333.6 | 48.9 | 20.47 | 20.5 | 3069.3 | 6539.2 | 0.153 | 176.16 |
| vLLM | 2048 | 128 | 1 | 4 | 635.9 | 56.3 | 17.77 | 17.8 | 3220.8 | 7783.1 | 0.128 | 279.56 |
| vLLM | 2048 | 128 | 18 | 72 | 12270.4 | 71.0 | 14.09 | 253.7 | 3004.3 | 21282.5 | 0.781 | 1699.14 |
| vLLM | 2048 | 2048 | 10 | 20 | 6605.7 | 69.5 | 14.39 | 143.9 | 3100.3 | 148876.1 | 0.055 | 224.18 |
| vLLM | 3000 | 64 | 13 | 52 | 16775.0 | 64.9 | 15.4 | 200.2 | 2324.9 | 20866.4 | 0.583 | 1786.84 |
| vLLM | 3072 | 128 | 1 | 4 | 1351.1 | 63.7 | 15.7 | 15.7 | 2273.7 | 9442.1 | 0.106 | 338.89 |
| vLLM | 4000 | 64 | 10 | 40 | 13648.1 | 72.7 | 13.76 | 137.6 | 2930.8 | 18226.3 | 0.527 | 2140.76 |
| vLLM | 4096 | 128 | 1 | 2 | 1376.7 | 71.3 | 14.02 | 14.0 | 2975.3 | 10438.1 | 0.096 | 404.65 |
| vLLM | 8000 | 64 | 5 | 10 | 14472.0 | 99.6 | 10.04 | 50.2 | 2764.0 | 20744.8 | 0.204 | 1648.85 |
| vLLM | 8192 | 128 | 1 | 2 | 3061.5 | 99.9 | 10.01 | 10.0 | 2675.8 | 15754.5 | 0.063 | 528.09 |
| vLLM | 16000 | 64 | 2 | 4 | 14880.4 | 156.7 | 6.38 | 12.8 | 2150.5 | 24755.0 | 0.081 | 1297.8 |
| vLLM | 16384 | 128 | 1 | 2 | 7494.6 | 159.3 | 6.28 | 6.3 | 2186.1 | 27726.2 | 0.036 | 595.53 |
| vLLM | 32000 | 64 | 1 | 2 | 20567.8 | 271.6 | 3.68 | 3.7 | 1555.8 | 37678.0 | 0.027 | 850.99 |
| vLLM | 32000 | 128 | 1 | 2 | 20558.8 | 271.2 | 3.69 | 3.7 | 1556.5 | 55003.0 | 0.018 | 584.11 |

Note: all metrics are means across the benchmark run unless otherwise stated.

ISL: Input Sequence Length (tokens)
OSL: Output Sequence Length (tokens)
Concurrency: number of concurrent requests (batch size)
N Req: total number of requests (sample size, N)
TTFT: Time To First Token (ms)
TPOT: Time Per Output Token (ms)
Tput User: Throughput per user (TPS)
Tput Decode: Throughput for decode tokens, across all users (TPS)
Tput Prefill: Throughput for prefill tokens (TPS)
E2EL: End-to-End Latency (ms)
Req Tput: Request Throughput (RPS)
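
As a quick sanity check on how these columns relate (assuming the usual definitions, and noting the values are means so the numbers only line up approximately), the first n150 row is self-consistent:

```python
# Consistency check for the first n150 row: ISL=128, OSL=128, concurrency=1.
# Assumes the usual metric definitions; reported values are means, so figures
# only match approximately.
isl, osl = 128, 128
ttft_ms, tpot_ms, e2el_ms = 124.7, 43.7, 5669.5

tput_user = 1000 / tpot_ms                # ~22.9 TPS, matches "Tput User"
tput_prefill = isl / (ttft_ms / 1000)     # ~1026 TPS, matches "Tput Prefill"
e2el_est = ttft_ms + (osl - 1) * tpot_ms  # ~5675 ms, close to the reported E2EL
req_tput = 1 / (e2el_ms / 1000)           # ~0.176 RPS at concurrency 1
total_tput = (isl + osl) * req_tput       # ~45 tokens/s, matches the last column
```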

## Tenstorrent Model Release Summary: Qwen3-8B on t3k

Metadata: Qwen3-8B on t3k

{
"report_id": "id_tt-transformers_Qwen3-8B_t3k_2026-01-29_12-56-00",
"model_name": "Qwen3-8B",
"model_id": "id_tt-transformers_Qwen3-8B_t3k",
"model_spec_json": "/home/kyamaguchi/tt-inference-server/workflow_logs/run_specs/tt_model_spec_2026-01-29_12-18-43_id_tt-transformers_Qwen3-8B_t3k_release_dHOFti0Q.json",
"model_repo": "Qwen/Qwen3-8B",
"model_impl": "tt-transformers",
"inference_engine": "vLLM",
"device": "t3k",
"server_mode": "docker",
"tt_metal_commit": "41345ac",
"vllm_commit": "628d4dc",
"run_command": "python run.py --model Qwen3-8B --device t3k --workflow release --docker-server"
}

Performance Benchmark Sweeps for Qwen3-8B on t3k

vLLM Text-to-Text Performance Benchmark Sweeps for Qwen3-8B on t3k

| Source | ISL | OSL | Concurrency | N Req | TTFT (ms) | TPOT (ms) | Tput User (TPS) | Tput Decode (TPS) | Tput Prefill (TPS) | E2EL (ms) | Req Tput (RPS) | Total Token Throughput (tokens/duration) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vLLM | 128 | 128 | 1 | 8 | 58.1 | 24.4 | 41.02 | 41.0 | 2203.6 | 3154.0 | 0.317 | 81.16 |
| vLLM | 128 | 128 | 32 | 256 | 1247.8 | 26.1 | 38.28 | 1224.9 | 3282.5 | 4565.6 | 7.006 | 1793.57 |
| vLLM | 128 | 1024 | 1 | 4 | 58.0 | 25.0 | 40.04 | 40.0 | 2207.0 | 25606.0 | 0.039 | 44.99 |
| vLLM | 128 | 1024 | 32 | 128 | 1262.6 | 26.5 | 37.74 | 1207.8 | 3244.1 | 28366.6 | 1.128 | 1299.46 |
| vLLM | 1024 | 128 | 1 | 4 | 142.3 | 25.4 | 39.34 | 39.3 | 7194.1 | 3370.8 | 0.297 | 341.72 |
| vLLM | 2048 | 128 | 1 | 4 | 270.2 | 26.2 | 38.14 | 38.1 | 7578.7 | 3600.4 | 0.278 | 604.32 |
| vLLM | 2048 | 128 | 18 | 72 | 4396.5 | 28.0 | 35.69 | 642.4 | 8384.9 | 7954.9 | 2.262 | 4922.81 |
| vLLM | 2048 | 2048 | 10 | 20 | 2466.3 | 27.3 | 36.59 | 365.9 | 8303.9 | 58407.9 | 0.171 | 701.26 |
| vLLM | 3000 | 64 | 13 | 52 | 6164.2 | 27.6 | 36.2 | 470.6 | 6326.8 | 7904.4 | 1.644 | 5038.24 |
| vLLM | 3072 | 128 | 1 | 4 | 500.9 | 27.0 | 37.0 | 37.0 | 6133.1 | 3933.1 | 0.254 | 813.53 |
| vLLM | 4000 | 64 | 10 | 40 | 4771.4 | 28.3 | 35.36 | 353.6 | 8383.3 | 6553.0 | 1.526 | 6200.6 |
| vLLM | 4096 | 128 | 1 | 2 | 507.0 | 27.9 | 35.85 | 35.9 | 8078.5 | 4049.3 | 0.247 | 1043.04 |
| vLLM | 8000 | 64 | 5 | 10 | 4939.2 | 30.7 | 32.56 | 162.8 | 8098.5 | 6873.8 | 0.727 | 5864.8 |
| vLLM | 8192 | 128 | 1 | 2 | 1021.2 | 30.4 | 32.9 | 32.9 | 8021.7 | 4882.0 | 0.205 | 1704.12 |
| vLLM | 16000 | 64 | 2 | 4 | 4330.7 | 37.1 | 26.99 | 54.0 | 7389.1 | 6665.2 | 0.3 | 4819.88 |
| vLLM | 16384 | 128 | 1 | 2 | 2195.8 | 36.6 | 27.31 | 27.3 | 7461.6 | 6846.2 | 0.146 | 2411.74 |
| vLLM | 32000 | 64 | 1 | 2 | 5075.7 | 48.8 | 20.5 | 20.5 | 6304.5 | 8148.3 | 0.123 | 3934.88 |
| vLLM | 32000 | 128 | 1 | 2 | 5076.0 | 48.7 | 20.51 | 20.5 | 6304.2 | 11266.6 | 0.089 | 2851.54 |

Note: all metrics are means across the benchmark run unless otherwise stated.

ISL: Input Sequence Length (tokens)
OSL: Output Sequence Length (tokens)
Concurrency: number of concurrent requests (batch size)
N Req: total number of requests (sample size, N)
TTFT: Time To First Token (ms)
TPOT: Time Per Output Token (ms)
Tput User: Throughput per user (TPS)
Tput Decode: Throughput for decode tokens, across all users (TPS)
Tput Prefill: Throughput for prefill tokens (TPS)
E2EL: End-to-End Latency (ms)
Req Tput: Request Throughput (RPS)

## Tenstorrent Model Release Summary: Qwen3-8B on galaxy_t3k

Metadata: Qwen3-8B on galaxy_t3k

{
"report_id": "id_tt-transformers_Qwen3-8B_galaxy_t3k_2026-01-30_02-05-08",
"model_name": "Qwen3-8B",
"model_id": "id_tt-transformers_Qwen3-8B_galaxy_t3k",
"model_spec_json": "/home/ubuntu/works/tt-inference-server/workflow_logs/run_specs/tt_model_spec_2026-01-30_01-41-49_id_tt-transformers_Qwen3-8B_galaxy_t3k_release_ur9k6LIg.json",
"model_repo": "Qwen/Qwen3-8B",
"model_impl": "tt-transformers",
"inference_engine": "vLLM",
"device": "galaxy_t3k",
"server_mode": "docker",
"tt_metal_commit": "41345ac",
"vllm_commit": "628d4dc",
"run_command": "python run.py --model Qwen3-8B --device galaxy_t3k --workflow release --docker-server"
}

Performance Benchmark Sweeps for Qwen3-8B on galaxy_t3k

vLLM Text-to-Text Performance Benchmark Sweeps for Qwen3-8B on galaxy_t3k

| Source | ISL | OSL | Concurrency | N Req | TTFT (ms) | TPOT (ms) | Tput User (TPS) | Tput Decode (TPS) | Tput Prefill (TPS) | E2EL (ms) | Req Tput (RPS) | Total Token Throughput (tokens/duration) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vLLM | 128 | 128 | 1 | 8 | 54.5 | 25.9 | 38.64 | 38.6 | 2349.4 | 3341.5 | 0.299 | 76.6 |
| vLLM | 128 | 128 | 32 | 256 | 1245.7 | 26.5 | 37.67 | 1205.4 | 3288.1 | 4617.3 | 6.927 | 1773.39 |
| vLLM | 128 | 1024 | 1 | 4 | 50.9 | 25.5 | 39.26 | 39.3 | 2516.3 | 26105.9 | 0.038 | 44.13 |
| vLLM | 128 | 1024 | 32 | 128 | 1258.2 | 27.2 | 36.8 | 1177.6 | 3255.5 | 29056.8 | 1.101 | 1268.54 |
| vLLM | 1024 | 128 | 1 | 4 | 150.3 | 25.6 | 38.99 | 39.0 | 6815.1 | 3407.6 | 0.293 | 338.04 |
| vLLM | 2048 | 128 | 1 | 4 | 280.0 | 27.5 | 36.3 | 36.3 | 7313.3 | 3778.5 | 0.265 | 575.82 |
| vLLM | 2048 | 128 | 18 | 72 | 4542.4 | 27.9 | 35.87 | 645.6 | 8115.6 | 8083.3 | 2.226 | 4843.88 |
| vLLM | 2048 | 2048 | 10 | 20 | 2542.8 | 27.8 | 35.92 | 359.2 | 8054.0 | 59537.1 | 0.168 | 687.95 |
| vLLM | 3000 | 64 | 13 | 52 | 6392.2 | 28.1 | 35.61 | 463.0 | 6101.2 | 8161.2 | 1.593 | 4879.48 |
| vLLM | 3072 | 128 | 1 | 4 | 525.2 | 27.1 | 36.97 | 37.0 | 5849.6 | 3960.7 | 0.252 | 807.85 |
| vLLM | 4000 | 64 | 10 | 40 | 4939.0 | 28.4 | 35.18 | 351.8 | 8098.8 | 6730.0 | 1.485 | 6036.71 |
| vLLM | 4096 | 128 | 1 | 2 | 528.5 | 27.9 | 35.83 | 35.8 | 7750.6 | 4072.9 | 0.245 | 1036.96 |
| vLLM | 8000 | 64 | 5 | 10 | 5092.5 | 31.6 | 31.69 | 158.4 | 7854.7 | 7080.5 | 0.706 | 5693.74 |
| vLLM | 8192 | 128 | 1 | 2 | 1052.7 | 31.2 | 32.07 | 32.1 | 7781.6 | 5012.5 | 0.199 | 1659.67 |
| vLLM | 16000 | 64 | 2 | 4 | 4350.5 | 37.1 | 26.95 | 53.9 | 7355.5 | 6688.2 | 0.299 | 4803.11 |
| vLLM | 16384 | 128 | 1 | 2 | 2196.5 | 37.2 | 26.88 | 26.9 | 7459.3 | 6921.8 | 0.144 | 2385.32 |
| vLLM | 32000 | 64 | 1 | 2 | 4877.2 | 49.2 | 20.34 | 20.3 | 6561.1 | 7974.9 | 0.125 | 4020.49 |
| vLLM | 32000 | 128 | 1 | 2 | 4882.5 | 49.4 | 20.24 | 20.2 | 6554.1 | 11156.7 | 0.09 | 2879.6 |

Note: all metrics are means across the benchmark run unless otherwise stated.

ISL: Input Sequence Length (tokens)
OSL: Output Sequence Length (tokens)
Concurrency: number of concurrent requests (batch size)
N Req: total number of requests (sample size, N)
TTFT: Time To First Token (ms)
TPOT: Time Per Output Token (ms)
Tput User: Throughput per user (TPS)
Tput Decode: Throughput for decode tokens, across all users (TPS)
Tput Prefill: Throughput for prefill tokens (TPS)
E2EL: End-to-End Latency (ms)
Req Tput: Request Throughput (RPS)

Also, for the N150 Models CI, the performance benchmark results are being produced, but it still appears to be failing, so I’m investigating that as well.

https://github.com/tenstorrent/tt-shield/actions/runs/21421020337/job/61980149524

@bgoelTT
Contributor

bgoelTT commented Jan 30, 2026

> I ran the release workflow with this Uplift commit, and it worked fine for everything except Galaxy. Galaxy seems to either hang or fail partway through, so I’m currently looking into it.
>
> Qwen3-8B on n150
> Qwen3-8B on t3k
> Qwen3-8B on galaxy_t3k
>
> Also, for the N150 Models CI, the performance benchmark results are being produced, but it still appears to be failing, so I’m investigating that as well.
>
> https://github.com/tenstorrent/tt-shield/actions/runs/21421020337/job/61980149524

@sott0n the benchmark workflow is passing; it returns status code 0. What is failing is the accuracy evaluations, due to our fork of lm-eval-harness being rebased. I noticed you were using this branch of tt-inference-server, which explains why it would fail: you'll need to rebase uplift-qwen3-8b to include the latest changes on dev.

Now for the Galaxy hangs, do you have a Models CI run to examine?

@sott0n
Contributor Author

sott0n commented Jan 30, 2026

@bgoelTT No, I don’t have a Models CI run yet. I tested it on a local Galaxy setup and observed that it hangs. However, I saw the same behavior with the current commit on the dev branch (not this uplift), so I’ll also check it in the Models CI.

@tstescoTT
Collaborator

@sott0n can you post the Models CI run showing this uplift change?

@sott0n
Contributor Author

sott0n commented Feb 12, 2026

@tstescoTT Sorry for the lack of updates.

The Models CI seems to be failing partway through, so I’ve been running the release on a local 6U system. However, it appears to hang during execution, so I’ve been bisecting tt-metal commits to identify which commit introduced the issue. I haven’t been able to pinpoint the root cause yet.
At the very least, I've confirmed that the original commit works correctly. Given that, I think it would be reasonable either to decouple the Galaxy ModelSpec from this change, or to uplift only n150 for now.

From the inference-server policy perspective, would it be problematic for the Qwen3-8B uplift to be split per device?
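
For illustration, what I mean by splitting per device is roughly the following (a hypothetical sketch only, not the real ModelSpec schema; the Galaxy pin is a placeholder, not an actual commit):

```python
# Hypothetical sketch of a per-device split; not the real ModelSpec schema.
# n150/t3k take the uplifted commits from this PR, while galaxy_t3k stays
# pinned until the hang is root-caused. "<previous-known-good>" is a
# placeholder, not a real commit hash.
QWEN3_8B_PINS = {
    "n150":       {"tt_metal_commit": "41345ac", "vllm_commit": "628d4dc"},
    "t3k":        {"tt_metal_commit": "41345ac", "vllm_commit": "628d4dc"},
    "galaxy_t3k": {"tt_metal_commit": "<previous-known-good>",
                   "vllm_commit": "<previous-known-good>"},
}
```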
