
Conversation

@cyberofficial
Contributor

This pull request corrects the [limit] configurations for five models available on the Vultr Inference API to reflect their true, empirically tested context windows. The previous configurations used optimistic limits that could trigger runtime errors, and because official documentation is sparse, the numbers had also fallen out of date.

Methodology:

I ran a series of automated tests against the Vultr Inference API to determine the precise, stable context window for each model. The test script first grew the request size to find a rough upper bound, then used a guided search to narrow down and verify the maximum token limit. The new configurations are based on these stable results.
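
For reference, here is a minimal sketch of that probing approach; the actual test script is not part of this PR, and the endpoint URL, token estimation, and model identifier below are illustrative assumptions.

```python
import os
import requests

# Assumed OpenAI-compatible chat endpoint for Vultr Inference (illustrative).
API_URL = "https://api.vultrinference.com/v1/chat/completions"
API_KEY = os.environ.get("VULTR_API_KEY", "")

def fits(model: str, prompt_tokens: int) -> bool:
    """Return True if a prompt of roughly `prompt_tokens` tokens is accepted."""
    # Filler prompt; assumes ~1 token per repeated word. A real script would
    # count tokens with the model's tokenizer instead of approximating.
    prompt = "hello " * prompt_tokens
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1,
        },
        timeout=120,
    )
    return resp.status_code == 200

def find_limit(model: str, start: int = 1_024) -> int:
    """Double the request size until it fails, then binary-search the boundary."""
    lo = hi = start
    while fits(model, hi):        # grow quickly to bracket the upper bound
        lo, hi = hi, hi * 2
    while lo + 1 < hi:            # narrow down the last accepted size
        mid = (lo + hi) // 2
        if fits(model, mid):
            lo = mid
        else:
            hi = mid
    return lo                     # largest prompt size that was accepted

if __name__ == "__main__":
    print(find_limit("deepseek-r1-distill-qwen-32b"))
```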

Test Results Summary:

| Model Name | Verified Avg. Limit | Safe Limit (Floor) | Hard Limit (Ceiling) |
| --- | --- | --- | --- |
| deepseek-r1-distill-qwen-32b | 130,466 tokens | 130,000 tokens | 131,000 tokens |
| qwen2.5-coder-32b-instruct | 15,940 tokens | 15,000 tokens | 16,000 tokens |
| deepseek-r1-distill-llama-70b | 130,466 tokens | 130,000 tokens | 131,000 tokens |
| gpt-oss-120b | 130,530 tokens | 130,000 tokens | 131,000 tokens |
| kimi-k2-instruct | 63,667 tokens | 63,000 tokens | 64,000 tokens |

Proposed Changes:

The [limit] section for each model file has been updated to suit coding tasks (maximizing input context while reserving a generous output buffer) and to stay within the verified total token limits.
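
As a quick sanity check on the numbers in the diffs below, each new context value is simply the safe floor from the table above minus the reserved output buffer:

```python
# context = safe floor (from the results table) - reserved output buffer
models = {
    # model: (safe_total, reserved_output)
    "deepseek-r1-distill-qwen-32b":  (130_000, 8_192),
    "deepseek-r1-distill-llama-70b": (130_000, 8_192),
    "gpt-oss-120b":                  (130_000, 8_192),
    "kimi-k2-instruct":              (63_000,  4_096),
    "qwen2.5-coder-32b-instruct":    (15_000,  2_048),
}
for name, (total, output) in models.items():
    print(f"{name}: context = {total - output:_}")
# deepseek-r1-distill-qwen-32b:  context = 121_808
# deepseek-r1-distill-llama-70b: context = 121_808
# gpt-oss-120b:                  context = 121_808
# kimi-k2-instruct:              context = 58_904
# qwen2.5-coder-32b-instruct:    context = 12_952
```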


deepseek-r1-distill-qwen-32b.toml & deepseek-r1-distill-llama-70b.toml

 [limit]
-context = 128_000
-output = 32_768
+# VERIFIED TOTAL: ~130k. Reserving 8k for output.
+context = 121_808
+output = 8_192

gpt-oss-120b.toml

 [limit]
-context = 128_000
-output = 131_072
+# VERIFIED TOTAL: ~130k. Reserving 8k for output.
+context = 121_808
+output = 8_192

kimi-k2-instruct.toml

 [limit]
-context = 128_000
-output = 16_384
+# VERIFIED TOTAL: ~63k. Reserving 4k for output.
+context = 58_904
+output = 4_096

qwen2.5-coder-32b-instruct.toml

 [limit]
-context = 128_000
-output = 32_768
+# VERIFIED TOTAL: ~15k. Reserving 2k for output.
+context = 12_952
+output = 2_048

Adjusted the 'context' and 'output' token limits in the TOML configs for deepseek-r1-distill-llama-70b, deepseek-r1-distill-qwen-32b, gpt-oss-120b, kimi-k2-instruct, and qwen2.5-coder-32b-instruct to reflect the verified capacity constraints.
@rekram1-node merged commit cbaf338 into sst:dev on Oct 23, 2025
1 check passed