fix: raise default n_ctx from 4096 to 8192 and fix UTF-8 token boundary #187
Merged
Michael-A-Kuykendall merged 1 commit into Michael-A-Kuykendall:main on Mar 25, 2026
Conversation
Fixes two bugs found during multi-round LLM deliberation experiments with qwen3:8b, cogito:8b, and gemma3:1b on the Shimmy v1.9.0 GPU build.

## Fix 1 — n_ctx default 4096 → 8192 (issue Michael-A-Kuykendall#182)

`model_registry.rs` (3 locations) and `main.rs` (5 locations) hardcode `ctx_len=4096`. With thinking models (qwen3, cogito, deepseek-r1) a single deliberation round exhausts the KV cache:

system prompt (~80t) + task (~200t) + prior draft (~1610t) + transcript (~500t) + CoT chain (~1000t) + output (2048t) = 5438t > 4096

This causes `NoKvCacheSlot` errors that surface as HTTP 502 Bad Gateway. Fixed to 8192 in all six locations. A follow-up improvement would be to read `context_length` from the GGUF metadata via `llama_model_meta_val_str` so each model uses its own native default.

Regression test: `tests/regression/issue_182_kvcache_ctx_default.rs`

## Fix 2 — UTF-8 token boundary crash (issue Michael-A-Kuykendall#183)

The `engine/llama.rs` generation loop called `token_to_str(token, Special::Plaintext)?`, and `token_to_str` calls `String::from_utf8(bytes)?`. Byte-level tokenizers (qwen3, qwen2.5, deepseek, and most multilingual models) emit individual bytes as separate tokens — the character 你 (U+4F60) arrives as three consecutive tokens `[0xE4, 0xBD, 0xA0]`. `from_utf8` on a single-byte token fails with `FromUtf8Error`, the `?` propagates it, and the server returns 502.

Fixed to `token_to_bytes(token, Special::Plaintext).map(|b| String::from_utf8_lossy(&b).into_owned()).unwrap_or_default()`.

`from_utf8_lossy` accepts partial sequences; the complete character is reconstructed correctly as bytes accumulate across tokens.

Regression test: `tests/regression/issue_183_utf8_token_boundary.rs`

Signed-off-by: Scott Johnson <m6gmjmjwfw@liamekaens.com>
Signed-off-by: scott <scott@procyon.here>
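The Fix 2 failure mode can be reproduced in plain Rust, independent of llama.cpp: strict UTF-8 conversion rejects a lone continuation byte, while accumulating raw token bytes and converting lossily recovers the character. This is a minimal sketch of the principle, not Shimmy's actual generation loop:

```rust
fn main() {
    // A byte-level tokenizer emits 你 (U+4F60) as three single-byte tokens.
    let tokens: [&[u8]; 3] = [&[0xE4], &[0xBD], &[0xA0]];

    // Strict conversion of the first token alone fails with FromUtf8Error,
    // which is what the old `token_to_str(...)?` path propagated as a 502.
    assert!(String::from_utf8(tokens[0].to_vec()).is_err());

    // Accumulating raw bytes across tokens and converting lossily
    // reconstructs the complete character.
    let mut buf: Vec<u8> = Vec::new();
    for t in &tokens {
        buf.extend_from_slice(t);
    }
    assert_eq!(String::from_utf8_lossy(&buf), "你");
    println!("reconstructed: {}", String::from_utf8_lossy(&buf));
}
```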
Michael-A-Kuykendall
approved these changes
Mar 25, 2026
Owner
Michael-A-Kuykendall
left a comment
Approved and welcome to the Shimmy Team!
0db9b4a into Michael-A-Kuykendall:main
1 of 4 checks passed
This was referenced Apr 10, 2026
Summary
Fixes two bugs found during multi-round LLM deliberation experiments with thinking models (qwen3:8b, cogito:8b) on the Shimmy v1.9.0 GPU build over several hundred experiment runs.
Fix 1 — n_ctx default 4096 → 8192
`model_registry.rs` (3 locations) and `main.rs` (5 locations) hardcode `ctx_len=4096`. With thinking models a single deliberation round exhausts the KV cache: system prompt (~80t) + task (~200t) + prior draft (~1610t) + transcript (~500t) + CoT chain (~1000t) + output (2048t) = 5438t > 4096. This causes
`NoKvCacheSlot` errors that surface as HTTP 502 Bad Gateway. Fixed to 8192 in all six locations. A follow-up improvement would be to read `context_length` from the GGUF metadata via `llama_model_meta_val_str` so each model uses its own native default.

Fix 2 — UTF-8 token boundary crash
The `engine/llama.rs` generation loop called `token_to_str(token, Special::Plaintext)?`, which calls `String::from_utf8(bytes)?`. Byte-level tokenizers (qwen3, qwen2.5, deepseek, and most multilingual models) emit individual bytes as separate tokens — the character 你 (U+4F60) arrives as three consecutive tokens `[0xE4, 0xBD, 0xA0]`. `from_utf8` on a single-byte token fails with `FromUtf8Error`, the `?` propagates it, and the server returns 502. Fixed to `token_to_bytes(token, Special::Plaintext).map(|b| String::from_utf8_lossy(&b).into_owned()).unwrap_or_default()`.
`from_utf8_lossy` accepts partial sequences; the complete character is reconstructed correctly as bytes accumulate across tokens.

Test plan
- `tests/regression/issue_182_kvcache_ctx_default.rs` — 4 tests verifying the 8192 default and that the failing scenario (5438 tokens) fits within 8192 but not 4096
- `tests/regression/issue_183_utf8_token_boundary.rs` — 5 tests verifying multi-byte character reconstruction, partial-sequence tolerance, ASCII passthrough, and empty-token handling
- `cargo test --test regression` passes (all existing + new tests)

Context
These fixes came out of running ACMT (Author-Critic MetaTransformer), a multi-model LLM deliberation pipeline that uses Shimmy as its exclusive inference backend. Shimmy's speed and GGUF-native loading make it ideal for multi-model chained inference workloads.
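The per-round token budget behind Fix 1 can be sanity-checked with simple arithmetic; the component counts below are taken from the issue description (the output reservation is the 2048-token generation limit cited there):

```rust
fn main() {
    // Per-round token budget from issue #182 (counts as reported there).
    let budget: [(&str, u32); 6] = [
        ("system prompt", 80),
        ("task", 200),
        ("prior draft", 1610),
        ("transcript", 500),
        ("CoT chain", 1000),
        ("output reservation", 2048),
    ];
    let total: u32 = budget.iter().map(|(_, t)| t).sum();

    assert_eq!(total, 5438);
    assert!(total > 4096);  // overflows the old default -> NoKvCacheSlot -> HTTP 502
    assert!(total <= 8192); // fits within the new default
    println!("total = {total}");
}
```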
Scott Johnson (@LopezNuance)