fix: raise default n_ctx from 4096 to 8192 and fix UTF-8 token boundary #187
Merged
Michael-A-Kuykendall merged 1 commit into Michael-A-Kuykendall:main on Mar 25, 2026
Conversation
Fixes two bugs found during multi-round LLM deliberation experiments with qwen3:8b, cogito:8b, and gemma3:1b on the Shimmy v1.9.0 GPU build.

## Fix 1 — n_ctx default 4096 → 8192 (issue Michael-A-Kuykendall#182)

`model_registry.rs` (3 locations) and `main.rs` (5 locations) hardcode `ctx_len=4096`. With thinking models (qwen3, cogito, deepseek-r1) a single deliberation round exhausts the KV cache:

system prompt (~80t) + task (~200t) + prior draft (~1610t) + transcript (~500t) + CoT chain (~1000t) + output (2048t) = 5438t > 4096

This causes `NoKvCacheSlot` errors that surface as HTTP 502 Bad Gateway. Fixed to 8192 in all six locations. A follow-up improvement would be to read `context_length` from the GGUF metadata via `llama_model_meta_val_str` so each model uses its own native default.

Regression test: `tests/regression/issue_182_kvcache_ctx_default.rs`

## Fix 2 — UTF-8 token boundary crash (issue Michael-A-Kuykendall#183)

The `engine/llama.rs` generation loop called `token_to_str(token, Special::Plaintext)?`, and `token_to_str` calls `String::from_utf8(bytes)?`. Byte-level tokenizers (qwen3, qwen2.5, deepseek, and most multilingual models) emit individual bytes as separate tokens — the character 你 (U+4F60) arrives as three consecutive tokens `[0xE4, 0xBD, 0xA0]`. `from_utf8` on a single-byte token fails with `FromUtf8Error`, the `?` propagates it, and the server returns 502.

Fixed to `token_to_bytes(token, Special::Plaintext).map(|b| String::from_utf8_lossy(&b).into_owned()).unwrap_or_default()`.

`from_utf8_lossy` accepts partial sequences; the complete character is reconstructed correctly as bytes accumulate across tokens.

Regression test: `tests/regression/issue_183_utf8_token_boundary.rs`

Signed-off-by: Scott Johnson <m6gmjmjwfw@liamekaens.com>
Signed-off-by: scott <scott@procyon.here>
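The Fix 2 failure mode can be reproduced in plain Rust, independent of llama.cpp: strict UTF-8 conversion rejects a lone continuation byte, while accumulating raw token bytes and converting lossily recovers the character. This is a minimal sketch of the principle, not Shimmy's actual generation loop:

```rust
fn main() {
    // A byte-level tokenizer emits 你 (U+4F60) as three single-byte tokens.
    let tokens: [&[u8]; 3] = [&[0xE4], &[0xBD], &[0xA0]];

    // Strict conversion of the first token alone fails with FromUtf8Error,
    // which is what the old `token_to_str(...)?` path propagated as a 502.
    assert!(String::from_utf8(tokens[0].to_vec()).is_err());

    // Accumulating raw bytes across tokens and converting lossily
    // reconstructs the complete character.
    let mut buf: Vec<u8> = Vec::new();
    for t in &tokens {
        buf.extend_from_slice(t);
    }
    assert_eq!(String::from_utf8_lossy(&buf), "你");
    println!("reconstructed: {}", String::from_utf8_lossy(&buf));
}
```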
Michael-A-Kuykendall
approved these changes
Mar 25, 2026
Owner
Michael-A-Kuykendall
left a comment
Approved and welcome to the Shimmy Team!
0db9b4a into Michael-A-Kuykendall:main
1 of 4 checks passed
This was referenced Apr 10, 2026
Summary
Fixes two bugs found during multi-round LLM deliberation experiments with thinking models (qwen3:8b, cogito:8b) on the Shimmy v1.9.0 GPU build over several hundred experiment runs.
Fix 1 — n_ctx default 4096 → 8192
`model_registry.rs` (3 locations) and `main.rs` (5 locations) hardcode `ctx_len=4096`. With thinking models a single deliberation round exhausts the KV cache: system prompt (~80t) + task (~200t) + prior draft (~1610t) + transcript (~500t) + CoT chain (~1000t) + output (2048t) = 5438t > 4096. This causes
`NoKvCacheSlot` errors that surface as HTTP 502 Bad Gateway. Fixed to 8192 in all six locations. A follow-up improvement would be to read `context_length` from the GGUF metadata via `llama_model_meta_val_str` so each model uses its own native default.

Fix 2 — UTF-8 token boundary crash
The `engine/llama.rs` generation loop called `token_to_str(token, Special::Plaintext)?`, which calls `String::from_utf8(bytes)?`. Byte-level tokenizers (qwen3, qwen2.5, deepseek, and most multilingual models) emit individual bytes as separate tokens — the character 你 (U+4F60) arrives as three consecutive tokens `[0xE4, 0xBD, 0xA0]`. `from_utf8` on a single-byte token fails with `FromUtf8Error`, the `?` propagates it, and the server returns 502. Fixed to `token_to_bytes(token, Special::Plaintext).map(|b| String::from_utf8_lossy(&b).into_owned()).unwrap_or_default()`.
`from_utf8_lossy` accepts partial sequences; the complete character is reconstructed correctly as bytes accumulate across tokens.

Test plan
- `tests/regression/issue_182_kvcache_ctx_default.rs` — 4 tests verifying the 8192 default and that the failing scenario (5438 tokens) fits within 8192 but not 4096
- `tests/regression/issue_183_utf8_token_boundary.rs` — 5 tests verifying multi-byte character reconstruction, partial-sequence tolerance, ASCII passthrough, and empty-token handling
- `cargo test --test regression` passes (all existing + new tests)

Context
These fixes came out of running ACMT (Author-Critic MetaTransformer), a multi-model LLM deliberation pipeline that uses Shimmy as its exclusive inference backend. Shimmy's speed and GGUF-native loading make it ideal for multi-model chained inference workloads.
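The per-round token budget behind Fix 1 can be sanity-checked with simple arithmetic; the component counts below are taken from the issue description (the output reservation is the 2048-token generation limit cited there):

```rust
fn main() {
    // Per-round token budget from issue #182 (counts as reported there).
    let budget: [(&str, u32); 6] = [
        ("system prompt", 80),
        ("task", 200),
        ("prior draft", 1610),
        ("transcript", 500),
        ("CoT chain", 1000),
        ("output reservation", 2048),
    ];
    let total: u32 = budget.iter().map(|(_, t)| t).sum();

    assert_eq!(total, 5438);
    assert!(total > 4096);  // overflows the old default -> NoKvCacheSlot -> HTTP 502
    assert!(total <= 8192); // fits within the new default
    println!("total = {total}");
}
```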
Scott Johnson (@LopezNuance)