Description
kolosal-server on Windows does not return messages when queried using the OpenAI library. The model runs, using CPU/GPU, but never outputs a result.
Expected Behavior
llm.chat.completions.create(...) returns non-empty content.
Actual Behavior
llm.chat.completions.create(...) returns a ChatCompletion object with empty content:
ChatCompletion(id='chatcmpl-0', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None))], created=1756437393, model='qwen3-0.6b:UD-Q8_K_XL', object='chat.completion', service_tier=None, system_fingerprint='fp_4d29efe704', usage=CompletionUsage(completion_tokens=0, prompt_tokens=133, total_tokens=133, completion_tokens_details=None, prompt_tokens_details=None))
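The symptom above (empty `content` together with `completion_tokens=0`) can be checked for programmatically. The following is only a minimal sketch: the helper name `is_empty_completion` is hypothetical and not part of the OpenAI library or kolosal-server, and it relies solely on the attribute names visible in the object printed above. A stand-in object is used in place of a real server response.

```python
from types import SimpleNamespace


def is_empty_completion(response) -> bool:
    """Return True when every choice has empty content and no tokens were generated.

    Hypothetical helper: it only reads the attributes visible in the reported
    ChatCompletion object (choices, message.content, usage.completion_tokens).
    """
    contents = [choice.message.content or "" for choice in response.choices]
    no_text = all(text.strip() == "" for text in contents)
    usage = getattr(response, "usage", None)
    no_tokens = usage is None or usage.completion_tokens == 0
    return no_text and no_tokens


# Stand-in object mirroring the relevant fields of the response reported above.
reported = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="stop",
            message=SimpleNamespace(content="", role="assistant"),
        )
    ],
    usage=SimpleNamespace(completion_tokens=0, prompt_tokens=133, total_tokens=133),
)

print(is_empty_completion(reported))  # prints True
```

The same function should accept the object returned by `llm.chat.completions.create(...)`, since it only uses attribute access.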
Steps to Reproduce
llm.chat.completions.create(
    model="qwen3-0.6b:UD-Q8_K_XL",
    stream=False,  # Also doesn't work with streaming
    messages=[{"role": "user", "content": "/no_think create a random english sentence"}],
)
Log
[2025-08-28 13:51:51.570] [DEBUG] [Thread 20384] Processing request from 192.168.18.204
[2025-08-28 13:51:51.570] [DEBUG] [Thread 20384] Processing POST request for /v1/chat/completions from 192.168.18.204
[2025-08-28 13:51:51.570] [DEBUG] [Thread 20384] Calling auth middleware for POST /v1/chat/completions from 192.168.18.204
[2025-08-28 13:51:51.571] [DEBUG] Auth middleware processing request: POST /v1/chat/completions from 192.168.18.204
[2025-08-28 13:51:51.571] [DEBUG] CORS headers - Origin: , Request-Headers: , Request-Method:
[2025-08-28 13:51:51.571] [DEBUG] CORS: Request approved for origin: , method: POST
[2025-08-28 13:51:51.571] [DEBUG] CORS result - IsValid: true, IsPreflight: false
[2025-08-28 13:51:51.571] [DEBUG] Rate limit check passed for client 192.168.18.204 - Requests: 1/100, Remaining: 99
[2025-08-28 13:51:51.571] [DEBUG] Rate limit result - Allowed: true, Used: 1, Remaining: 99
[2025-08-28 13:51:51.571] [DEBUG] Request approved for client 192.168.18.204 - Rate limit: 1/100, CORS origin: none
[2025-08-28 13:51:51.571] [DEBUG] Auth middleware completed - Request allowed: true
[2025-08-28 13:51:51.571] [DEBUG] [Thread 20384] Auth middleware result - Allowed: true, Status: 200, Reason:
[2025-08-28 13:51:51.571] [DEBUG] [Thread 20384] Content-Length: 659
[2025-08-28 13:51:51.572] [DEBUG] [Thread 20384] Read 659 additional bytes for body
[2025-08-28 13:51:51.572] [INFO] [Thread 20384] Received chat completion request
[2025-08-28 13:51:51.573] [INFO] Engine ID 'qwen3-8b:UD-Q8_K_XL' was unloaded due to inactivity. Attempting to reload.
[2025-08-28 13:51:51.573] [INFO] Reloading llama-vulkan inference engine plugin...
[2025-08-28 13:51:51.575] [INFO] Successfully loaded inference engine: llama-vulkan
[2025-08-28 13:51:51.575] [INFO] Creating new inference engine instance for reload...
[2025-08-28 13:51:51.575] [INFO] Reloading model from path: C:\ProgramData\Kolosal\bin\models\Qwen3-8B-128K-UD-Q8_K_XL.gguf
[INFERENCE] Using CUDA or Vulkan
[2025-08-28 13:51:51.674] [DEBUG] [Thread 8628] Processing request from 192.168.18.204
[2025-08-28 13:51:51.674] [DEBUG] [Thread 8628] Processing POST request for /v1/chat/completions from 192.168.18.204
[2025-08-28 13:51:51.674] [DEBUG] [Thread 8628] Calling auth middleware for POST /v1/chat/completions from 192.168.18.204
[2025-08-28 13:51:51.674] [DEBUG] Auth middleware processing request: POST /v1/chat/completions from 192.168.18.204
[2025-08-28 13:51:51.674] [DEBUG] CORS headers - Origin: , Request-Headers: , Request-Method:
[2025-08-28 13:51:51.674] [DEBUG] CORS: Request approved for origin: , method: POST
[2025-08-28 13:51:51.674] [DEBUG] CORS result - IsValid: true, IsPreflight: false
[2025-08-28 13:51:51.674] [DEBUG] Rate limit check passed for client 192.168.18.204 - Requests: 2/100, Remaining: 98
[2025-08-28 13:51:51.675] [DEBUG] Rate limit result - Allowed: true, Used: 2, Remaining: 98
[2025-08-28 13:51:51.675] [DEBUG] Request approved for client 192.168.18.204 - Rate limit: 2/100, CORS origin: none
[2025-08-28 13:51:51.675] [DEBUG] Auth middleware completed - Request allowed: true
[2025-08-28 13:51:51.675] [DEBUG] [Thread 8628] Auth middleware result - Allowed: true, Status: 200, Reason:
[2025-08-28 13:51:51.675] [DEBUG] [Thread 8628] Content-Length: 656
[2025-08-28 13:51:51.675] [DEBUG] [Thread 8628] Read 656 additional bytes for body
[2025-08-28 13:51:51.675] [INFO] [Thread 8628] Received chat completion request
[2025-08-28 13:51:51.675] [DEBUG] Engine ID 'qwen3-8b:UD-Q8_K_XL' is being loaded by another thread. Waiting...
[2025-08-28 13:51:57.134] [DEBUG] Autoscaling check at 15647 (next check interval was: 10 seconds)
[2025-08-28 13:51:57.134] [DEBUG] Next autoscaling check in 60 seconds
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_chat_templates_init: failed to parse chat template (defaulting to chatml): Expected value expression at row 18, column 30:
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
^
{%- set index = (messages|length - 1) - loop.index0 %}
[2025-08-28 13:52:03.363] [INFO] Successfully reloaded model for engine 'qwen3-8b:UD-Q8_K_XL'
[2025-08-28 13:52:03.363] [INFO] Successfully reloaded engine ID 'qwen3-8b:UD-Q8_K_XL'.
[2025-08-28 13:52:03.363] [DEBUG] Engine ID 'qwen3-8b:UD-Q8_K_XL' loaded by another thread.
[2025-08-28 13:52:03.363] [INFO] [Thread 20384] Processing non-streaming chat completion request for model 'qwen3-8b:UD-Q8_K_XL'
[2025-08-28 13:52:03.363] [INFO] [Thread 8628] Processing non-streaming chat completion request for model 'qwen3-8b:UD-Q8_K_XL'
[INFERENCE] [ERROR] [getJobResult] Invalid job ID 0
[2025-08-28 13:52:34.095] [INFO] [Thread 20384] Non-streaming chat completion completed for model 'qwen3-8b:UD-Q8_K_XL'
[2025-08-28 13:52:34.096] [DEBUG] [Thread 20384] Completed request for /v1/chat/completions
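Most of the log above is routine DEBUG/INFO output. As a rough aid for triage, the sketch below filters a log for lines that contain `[ERROR]` or the phrase `failed to`. The function name `find_problem_lines` and the pattern are assumptions chosen for this log, not a kolosal-server feature; the sample lines are copied from the log above.

```python
import re
from typing import List

# Assumed pattern: matches "[ERROR]" tags and messages containing "failed to".
PROBLEM_PATTERN = re.compile(r"\[ERROR\]|failed to", re.IGNORECASE)


def find_problem_lines(log_text: str) -> List[str]:
    """Return the log lines that match PROBLEM_PATTERN, in their original order."""
    return [line for line in log_text.splitlines() if PROBLEM_PATTERN.search(line)]


# Sample lines copied from the log in this report.
sample_log = "\n".join(
    [
        "common_chat_templates_init: failed to parse chat template (defaulting to chatml): Expected value expression at row 18, column 30:",
        "[2025-08-28 13:52:03.363] [INFO] [Thread 20384] Processing non-streaming chat completion request for model 'qwen3-8b:UD-Q8_K_XL'",
        "[INFERENCE] [ERROR] [getJobResult] Invalid job ID 0",
        "[2025-08-28 13:52:34.095] [INFO] [Thread 20384] Non-streaming chat completion completed for model 'qwen3-8b:UD-Q8_K_XL'",
    ]
)

for line in find_problem_lines(sample_log):
    print(line)
```

On the sample, this prints the chat template parse failure and the `Invalid job ID 0` error, which are the two lines in the log that are not routine.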
Environment
OS: Windows 11 26100.4946