
Uplift for QWen3-8B #1885

Open

sott0n wants to merge 3 commits into dev from uplift-qwen3-8b

Conversation

@sott0n
Contributor

@sott0n sott0n commented Jan 26, 2026

Running Qwen3-8B on N150 with tt-inference-server v0.8.0 can result in an out-of-memory (OOM) error, as reported by the Discord community.

The issue has been resolved by this vLLM commit, which explicitly sets max_tokens_all_users.
This PR uplifts that fix to address the OOM issue and ensure Qwen3-8B runs correctly on N150.

fix #1869
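
For context, the fix boils down to bounding the aggregate token budget explicitly instead of letting it default to a value that overshoots N150's memory. A rough sketch of the idea (illustrative only; the names and numbers below are assumptions, not the actual code in the uplifted commit):

```python
# Illustrative sketch only: names and numbers below are assumptions,
# not the actual code or values from the uplifted vLLM commit.

MAX_CONTEXT_LEN = 40960         # assumed per-user context length for Qwen3-8B
MAX_CONCURRENT_USERS = 32       # batch size used in the sweeps below
N150_KV_BUDGET_TOKENS = 131072  # hypothetical KV-cache token budget that fits on N150

# Without an explicit cap, the worst case assumes every user at full context,
# which is what blows past N150's memory and triggers the OOM:
implicit_budget = MAX_CONTEXT_LEN * MAX_CONCURRENT_USERS

# The fix pins max_tokens_all_users explicitly so allocation is planned
# against a budget the device can actually hold:
max_tokens_all_users = min(implicit_budget, N150_KV_BUDGET_TOKENS)
```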

@sott0n sott0n requested a review from a team as a code owner January 26, 2026 12:15
@github-actions
Contributor

github-actions bot commented Jan 26, 2026

✅ Test Results - PASSED

Summary

| Component | Total | Passed | Skipped | Failed | Status |
|---|---|---|---|---|---|
| tt-inference-server | 388 | 388 | 0 | 0 | ✅ |
| tt-media-server | 467 | 467 | 0 | 0 | ✅ |
| Overall | 855 | 855 | 0 | 0 | ✅ |

Details

  • Python Version: 3.10
  • Workflow: Test Gate
  • Commit: 5375923
  • Run ID: 21928816225

🎉 All tests passed! This PR is ready for review.

@github-actions
Contributor

github-actions bot commented Jan 26, 2026

✅ Test Coverage Report

Coverage of Changed Lines

| Metric | Value |
|---|---|
| Coverage % |  |
| Threshold | 50% |
| Status | ✅ PASSED |

💡 This checks coverage of newly added/modified lines only, not total codebase coverage.

Contributor

@bgoelTT bgoelTT left a comment


Before we uplift the commits, can you please execute a Models CI dispatch run that proves the whole benchmark and accuracy evaluation workflows complete?

@sott0n
Contributor Author

sott0n commented Jan 30, 2026

I ran the release workflow with this Uplift commit, and it worked fine for everything except Galaxy. Galaxy seems to either hang or fail partway through, so I’m currently looking into it.

## Tenstorrent Model Release Summary: Qwen3-8B on n150

Metadata: Qwen3-8B on n150

{
"report_id": "id_tt-transformers_Qwen3-8B_n150_2026-01-30_00-53-40",
"model_name": "Qwen3-8B",
"model_id": "id_tt-transformers_Qwen3-8B_n150",
"model_spec_json": "/home/kyamaguchi/tt-inference-server/workflow_logs/run_specs/tt_model_spec_2026-01-29_23-59-57_id_tt-transformers_Qwen3-8B_n150_release_ELQmmwze.json",
"model_repo": "Qwen/Qwen3-8B",
"model_impl": "tt-transformers",
"inference_engine": "vLLM",
"device": "n150",
"server_mode": "docker",
"tt_metal_commit": "41345ac",
"vllm_commit": "628d4dc",
"run_command": "python run.py --model Qwen3-8B --device n150 --workflow release --docker-server"
}

Performance Benchmark Sweeps for Qwen3-8B on n150

vLLM Text-to-Text Performance Benchmark Sweeps for Qwen3-8B on n150

| Source | ISL | OSL | Concurrency | N Req | TTFT (ms) | TPOT (ms) | Tput User (TPS) | Tput Decode (TPS) | Tput Prefill (TPS) | E2EL (ms) | Req Tput (RPS) | Total Token Throughput (tokens/duration) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vLLM | 128 | 128 | 1 | 8 | 124.7 | 43.7 | 22.9 | 22.9 | 1026.3 | 5669.5 | 0.176 | 45.15 |
| vLLM | 128 | 128 | 32 | 256 | 3465.8 | 46.2 | 21.66 | 693.1 | 1181.8 | 9329.8 | 3.429 | 877.76 |
| vLLM | 128 | 1024 | 1 | 4 | 124.5 | 45.7 | 21.89 | 21.9 | 1028.4 | 46860.4 | 0.021 | 24.58 |
| vLLM | 128 | 1024 | 32 | 128 | 4250.2 | 52.7 | 18.99 | 607.5 | 963.7 | 58132.7 | 0.519 | 597.89 |
| vLLM | 1024 | 128 | 1 | 4 | 333.6 | 48.9 | 20.47 | 20.5 | 3069.3 | 6539.2 | 0.153 | 176.16 |
| vLLM | 2048 | 128 | 1 | 4 | 635.9 | 56.3 | 17.77 | 17.8 | 3220.8 | 7783.1 | 0.128 | 279.56 |
| vLLM | 2048 | 128 | 18 | 72 | 12270.4 | 71.0 | 14.09 | 253.7 | 3004.3 | 21282.5 | 0.781 | 1699.14 |
| vLLM | 2048 | 2048 | 10 | 20 | 6605.7 | 69.5 | 14.39 | 143.9 | 3100.3 | 148876.1 | 0.055 | 224.18 |
| vLLM | 3000 | 64 | 13 | 52 | 16775.0 | 64.9 | 15.4 | 200.2 | 2324.9 | 20866.4 | 0.583 | 1786.84 |
| vLLM | 3072 | 128 | 1 | 4 | 1351.1 | 63.7 | 15.7 | 15.7 | 2273.7 | 9442.1 | 0.106 | 338.89 |
| vLLM | 4000 | 64 | 10 | 40 | 13648.1 | 72.7 | 13.76 | 137.6 | 2930.8 | 18226.3 | 0.527 | 2140.76 |
| vLLM | 4096 | 128 | 1 | 2 | 1376.7 | 71.3 | 14.02 | 14.0 | 2975.3 | 10438.1 | 0.096 | 404.65 |
| vLLM | 8000 | 64 | 5 | 10 | 14472.0 | 99.6 | 10.04 | 50.2 | 2764.0 | 20744.8 | 0.204 | 1648.85 |
| vLLM | 8192 | 128 | 1 | 2 | 3061.5 | 99.9 | 10.01 | 10.0 | 2675.8 | 15754.5 | 0.063 | 528.09 |
| vLLM | 16000 | 64 | 2 | 4 | 14880.4 | 156.7 | 6.38 | 12.8 | 2150.5 | 24755.0 | 0.081 | 1297.8 |
| vLLM | 16384 | 128 | 1 | 2 | 7494.6 | 159.3 | 6.28 | 6.3 | 2186.1 | 27726.2 | 0.036 | 595.53 |
| vLLM | 32000 | 64 | 1 | 2 | 20567.8 | 271.6 | 3.68 | 3.7 | 1555.8 | 37678.0 | 0.027 | 850.99 |
| vLLM | 32000 | 128 | 1 | 2 | 20558.8 | 271.2 | 3.69 | 3.7 | 1556.5 | 55003.0 | 0.018 | 584.11 |

Note: all metrics are means across the benchmark run unless otherwise stated.

ISL: Input Sequence Length (tokens)
OSL: Output Sequence Length (tokens)
Concurrency: number of concurrent requests (batch size)
N Req: total number of requests (sample size, N)
TTFT: Time To First Token (ms)
TPOT: Time Per Output Token (ms)
Tput User: Throughput per user (TPS)
Tput Decode: Throughput for decode tokens, across all users (TPS)
Tput Prefill: Throughput for prefill tokens (TPS)
E2EL: End-to-End Latency (ms)
Req Tput: Request Throughput (RPS)
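
As a quick sanity check on how these columns relate (assuming the usual definitions, and noting the values are means so the numbers only line up approximately), the first n150 row is self-consistent:

```python
# Consistency check for the first n150 row: ISL=128, OSL=128, concurrency=1.
# Assumes the usual metric definitions; reported values are means, so figures
# only match approximately.
isl, osl = 128, 128
ttft_ms, tpot_ms, e2el_ms = 124.7, 43.7, 5669.5

tput_user = 1000 / tpot_ms                # ~22.9 TPS, matches "Tput User"
tput_prefill = isl / (ttft_ms / 1000)     # ~1026 TPS, matches "Tput Prefill"
e2el_est = ttft_ms + (osl - 1) * tpot_ms  # ~5675 ms, close to the reported E2EL
req_tput = 1 / (e2el_ms / 1000)           # ~0.176 RPS at concurrency 1
total_tput = (isl + osl) * req_tput       # ~45 tokens/s, matches the last column
```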

## Tenstorrent Model Release Summary: Qwen3-8B on t3k

Metadata: Qwen3-8B on t3k

{
"report_id": "id_tt-transformers_Qwen3-8B_t3k_2026-01-29_12-56-00",
"model_name": "Qwen3-8B",
"model_id": "id_tt-transformers_Qwen3-8B_t3k",
"model_spec_json": "/home/kyamaguchi/tt-inference-server/workflow_logs/run_specs/tt_model_spec_2026-01-29_12-18-43_id_tt-transformers_Qwen3-8B_t3k_release_dHOFti0Q.json",
"model_repo": "Qwen/Qwen3-8B",
"model_impl": "tt-transformers",
"inference_engine": "vLLM",
"device": "t3k",
"server_mode": "docker",
"tt_metal_commit": "41345ac",
"vllm_commit": "628d4dc",
"run_command": "python run.py --model Qwen3-8B --device t3k --workflow release --docker-server"
}

Performance Benchmark Sweeps for Qwen3-8B on t3k

vLLM Text-to-Text Performance Benchmark Sweeps for Qwen3-8B on t3k

| Source | ISL | OSL | Concurrency | N Req | TTFT (ms) | TPOT (ms) | Tput User (TPS) | Tput Decode (TPS) | Tput Prefill (TPS) | E2EL (ms) | Req Tput (RPS) | Total Token Throughput (tokens/duration) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vLLM | 128 | 128 | 1 | 8 | 58.1 | 24.4 | 41.02 | 41.0 | 2203.6 | 3154.0 | 0.317 | 81.16 |
| vLLM | 128 | 128 | 32 | 256 | 1247.8 | 26.1 | 38.28 | 1224.9 | 3282.5 | 4565.6 | 7.006 | 1793.57 |
| vLLM | 128 | 1024 | 1 | 4 | 58.0 | 25.0 | 40.04 | 40.0 | 2207.0 | 25606.0 | 0.039 | 44.99 |
| vLLM | 128 | 1024 | 32 | 128 | 1262.6 | 26.5 | 37.74 | 1207.8 | 3244.1 | 28366.6 | 1.128 | 1299.46 |
| vLLM | 1024 | 128 | 1 | 4 | 142.3 | 25.4 | 39.34 | 39.3 | 7194.1 | 3370.8 | 0.297 | 341.72 |
| vLLM | 2048 | 128 | 1 | 4 | 270.2 | 26.2 | 38.14 | 38.1 | 7578.7 | 3600.4 | 0.278 | 604.32 |
| vLLM | 2048 | 128 | 18 | 72 | 4396.5 | 28.0 | 35.69 | 642.4 | 8384.9 | 7954.9 | 2.262 | 4922.81 |
| vLLM | 2048 | 2048 | 10 | 20 | 2466.3 | 27.3 | 36.59 | 365.9 | 8303.9 | 58407.9 | 0.171 | 701.26 |
| vLLM | 3000 | 64 | 13 | 52 | 6164.2 | 27.6 | 36.2 | 470.6 | 6326.8 | 7904.4 | 1.644 | 5038.24 |
| vLLM | 3072 | 128 | 1 | 4 | 500.9 | 27.0 | 37.0 | 37.0 | 6133.1 | 3933.1 | 0.254 | 813.53 |
| vLLM | 4000 | 64 | 10 | 40 | 4771.4 | 28.3 | 35.36 | 353.6 | 8383.3 | 6553.0 | 1.526 | 6200.6 |
| vLLM | 4096 | 128 | 1 | 2 | 507.0 | 27.9 | 35.85 | 35.9 | 8078.5 | 4049.3 | 0.247 | 1043.04 |
| vLLM | 8000 | 64 | 5 | 10 | 4939.2 | 30.7 | 32.56 | 162.8 | 8098.5 | 6873.8 | 0.727 | 5864.8 |
| vLLM | 8192 | 128 | 1 | 2 | 1021.2 | 30.4 | 32.9 | 32.9 | 8021.7 | 4882.0 | 0.205 | 1704.12 |
| vLLM | 16000 | 64 | 2 | 4 | 4330.7 | 37.1 | 26.99 | 54.0 | 7389.1 | 6665.2 | 0.3 | 4819.88 |
| vLLM | 16384 | 128 | 1 | 2 | 2195.8 | 36.6 | 27.31 | 27.3 | 7461.6 | 6846.2 | 0.146 | 2411.74 |
| vLLM | 32000 | 64 | 1 | 2 | 5075.7 | 48.8 | 20.5 | 20.5 | 6304.5 | 8148.3 | 0.123 | 3934.88 |
| vLLM | 32000 | 128 | 1 | 2 | 5076.0 | 48.7 | 20.51 | 20.5 | 6304.2 | 11266.6 | 0.089 | 2851.54 |

Note: all metrics are means across the benchmark run unless otherwise stated.

ISL: Input Sequence Length (tokens)
OSL: Output Sequence Length (tokens)
Concurrency: number of concurrent requests (batch size)
N Req: total number of requests (sample size, N)
TTFT: Time To First Token (ms)
TPOT: Time Per Output Token (ms)
Tput User: Throughput per user (TPS)
Tput Decode: Throughput for decode tokens, across all users (TPS)
Tput Prefill: Throughput for prefill tokens (TPS)
E2EL: End-to-End Latency (ms)
Req Tput: Request Throughput (RPS)

## Tenstorrent Model Release Summary: Qwen3-8B on galaxy_t3k

Metadata: Qwen3-8B on galaxy_t3k

{
"report_id": "id_tt-transformers_Qwen3-8B_galaxy_t3k_2026-01-30_02-05-08",
"model_name": "Qwen3-8B",
"model_id": "id_tt-transformers_Qwen3-8B_galaxy_t3k",
"model_spec_json": "/home/ubuntu/works/tt-inference-server/workflow_logs/run_specs/tt_model_spec_2026-01-30_01-41-49_id_tt-transformers_Qwen3-8B_galaxy_t3k_release_ur9k6LIg.json",
"model_repo": "Qwen/Qwen3-8B",
"model_impl": "tt-transformers",
"inference_engine": "vLLM",
"device": "galaxy_t3k",
"server_mode": "docker",
"tt_metal_commit": "41345ac",
"vllm_commit": "628d4dc",
"run_command": "python run.py --model Qwen3-8B --device galaxy_t3k --workflow release --docker-server"
}

Performance Benchmark Sweeps for Qwen3-8B on galaxy_t3k

vLLM Text-to-Text Performance Benchmark Sweeps for Qwen3-8B on galaxy_t3k

| Source | ISL | OSL | Concurrency | N Req | TTFT (ms) | TPOT (ms) | Tput User (TPS) | Tput Decode (TPS) | Tput Prefill (TPS) | E2EL (ms) | Req Tput (RPS) | Total Token Throughput (tokens/duration) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vLLM | 128 | 128 | 1 | 8 | 54.5 | 25.9 | 38.64 | 38.6 | 2349.4 | 3341.5 | 0.299 | 76.6 |
| vLLM | 128 | 128 | 32 | 256 | 1245.7 | 26.5 | 37.67 | 1205.4 | 3288.1 | 4617.3 | 6.927 | 1773.39 |
| vLLM | 128 | 1024 | 1 | 4 | 50.9 | 25.5 | 39.26 | 39.3 | 2516.3 | 26105.9 | 0.038 | 44.13 |
| vLLM | 128 | 1024 | 32 | 128 | 1258.2 | 27.2 | 36.8 | 1177.6 | 3255.5 | 29056.8 | 1.101 | 1268.54 |
| vLLM | 1024 | 128 | 1 | 4 | 150.3 | 25.6 | 38.99 | 39.0 | 6815.1 | 3407.6 | 0.293 | 338.04 |
| vLLM | 2048 | 128 | 1 | 4 | 280.0 | 27.5 | 36.3 | 36.3 | 7313.3 | 3778.5 | 0.265 | 575.82 |
| vLLM | 2048 | 128 | 18 | 72 | 4542.4 | 27.9 | 35.87 | 645.6 | 8115.6 | 8083.3 | 2.226 | 4843.88 |
| vLLM | 2048 | 2048 | 10 | 20 | 2542.8 | 27.8 | 35.92 | 359.2 | 8054.0 | 59537.1 | 0.168 | 687.95 |
| vLLM | 3000 | 64 | 13 | 52 | 6392.2 | 28.1 | 35.61 | 463.0 | 6101.2 | 8161.2 | 1.593 | 4879.48 |
| vLLM | 3072 | 128 | 1 | 4 | 525.2 | 27.1 | 36.97 | 37.0 | 5849.6 | 3960.7 | 0.252 | 807.85 |
| vLLM | 4000 | 64 | 10 | 40 | 4939.0 | 28.4 | 35.18 | 351.8 | 8098.8 | 6730.0 | 1.485 | 6036.71 |
| vLLM | 4096 | 128 | 1 | 2 | 528.5 | 27.9 | 35.83 | 35.8 | 7750.6 | 4072.9 | 0.245 | 1036.96 |
| vLLM | 8000 | 64 | 5 | 10 | 5092.5 | 31.6 | 31.69 | 158.4 | 7854.7 | 7080.5 | 0.706 | 5693.74 |
| vLLM | 8192 | 128 | 1 | 2 | 1052.7 | 31.2 | 32.07 | 32.1 | 7781.6 | 5012.5 | 0.199 | 1659.67 |
| vLLM | 16000 | 64 | 2 | 4 | 4350.5 | 37.1 | 26.95 | 53.9 | 7355.5 | 6688.2 | 0.299 | 4803.11 |
| vLLM | 16384 | 128 | 1 | 2 | 2196.5 | 37.2 | 26.88 | 26.9 | 7459.3 | 6921.8 | 0.144 | 2385.32 |
| vLLM | 32000 | 64 | 1 | 2 | 4877.2 | 49.2 | 20.34 | 20.3 | 6561.1 | 7974.9 | 0.125 | 4020.49 |
| vLLM | 32000 | 128 | 1 | 2 | 4882.5 | 49.4 | 20.24 | 20.2 | 6554.1 | 11156.7 | 0.09 | 2879.6 |

Note: all metrics are means across the benchmark run unless otherwise stated.

ISL: Input Sequence Length (tokens)
OSL: Output Sequence Length (tokens)
Concurrency: number of concurrent requests (batch size)
N Req: total number of requests (sample size, N)
TTFT: Time To First Token (ms)
TPOT: Time Per Output Token (ms)
Tput User: Throughput per user (TPS)
Tput Decode: Throughput for decode tokens, across all users (TPS)
Tput Prefill: Throughput for prefill tokens (TPS)
E2EL: End-to-End Latency (ms)
Req Tput: Request Throughput (RPS)

Also, for the N150 Models CI, the performance benchmark results are being produced, but it still appears to be failing, so I’m investigating that as well.

https://github.com/tenstorrent/tt-shield/actions/runs/21421020337/job/61980149524

@bgoelTT
Contributor

bgoelTT commented Jan 30, 2026

> I ran the release workflow with this Uplift commit, and it worked fine for everything except Galaxy. Galaxy seems to either hang or fail partway through, so I’m currently looking into it.
>
> Qwen3-8B on n150
> Qwen3-8B on t3k
> Qwen3-8B on galaxy_t3k
>
> Also, for the N150 Models CI, the performance benchmark results are being produced, but it still appears to be failing, so I’m investigating that as well.
>
> https://github.com/tenstorrent/tt-shield/actions/runs/21421020337/job/61980149524

@sott0n the benchmark workflow is passing; it returns status code 0. What is failing is the accuracy evaluations, due to our fork of lm-eval-harness being rebased. I noticed you were using this branch of tt-inference-server, which explains why it would fail: you'll need to rebase uplift-qwen3-8b to include the latest changes on dev.

Now for the Galaxy hangs, do you have a Models CI run to examine?

@sott0n
Contributor Author

sott0n commented Jan 30, 2026

@bgoelTT No, I don’t have a Models CI run yet. I tested it on a local Galaxy setup and observed that it hangs. However, I saw the same behavior with the current commit on the dev branch (not this uplift), so I’ll also check it in the Models CI.

@tstescoTT
Collaborator

@sott0n can you post the Models CI run showing this uplift change?

@sott0n
Contributor Author

sott0n commented Feb 12, 2026

@tstescoTT Sorry for the lack of updates.

The Models CI seems to be failing partway through, so I’ve been running the release on a local 6U system. However, it appears to hang during execution, so I’ve been bisecting tt-metal commits to identify which commit introduced the issue. I haven’t been able to pinpoint the root cause yet.
At the very least, I've confirmed that the original commit works correctly. Given that, I think it would be reasonable either to decouple the Galaxy ModelSpec from this change, or to uplift only n150 for now.

From the inference-server policy perspective, would it be problematic for the Qwen3-8B uplift to be split per device?
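
For illustration, what I mean by splitting per device is roughly the following (a hypothetical sketch only, not the real ModelSpec schema; the Galaxy pin is a placeholder, not an actual commit):

```python
# Hypothetical sketch of a per-device split; not the real ModelSpec schema.
# n150/t3k take the uplifted commits from this PR, while galaxy_t3k stays
# pinned until the hang is root-caused. "<previous-known-good>" is a
# placeholder, not a real commit hash.
QWEN3_8B_PINS = {
    "n150":       {"tt_metal_commit": "41345ac", "vllm_commit": "628d4dc"},
    "t3k":        {"tt_metal_commit": "41345ac", "vllm_commit": "628d4dc"},
    "galaxy_t3k": {"tt_metal_commit": "<previous-known-good>",
                   "vllm_commit": "<previous-known-good>"},
}
```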
