API Doesn't Output Chat Results on Windows #22

@Steven4565

Description

kolosal-server on Windows does not return messages when queried using the OpenAI library. The model runs on the CPU/GPU but never outputs a result.

Expected Behavior

llm.chat.completions.create(...) returns non-empty content.

Actual Behavior

llm.chat.completions.create(...) returns a ChatCompletion object with empty content.

ChatCompletion(id='chatcmpl-0', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None))], created=1756437393, model='qwen3-0.6b:UD-Q8_K_XL', object='chat.completion', service_tier=None, system_fingerprint='fp_4d29efe704', usage=CompletionUsage(completion_tokens=0, prompt_tokens=133, total_tokens=133, completion_tokens_details=None, prompt_tokens_details=None))
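
For anyone trying to confirm the same symptom, here is a minimal sketch of a check for the "empty" condition shown above: no message content and completion_tokens equal to 0. The helper name is_empty_completion is hypothetical and not part of kolosal-server or the OpenAI library. It works on a plain dict shaped like a chat.completion payload, using only the fields visible in the output above.

```python
def is_empty_completion(response: dict) -> bool:
    """Return True when a chat.completion payload carries no generated text.

    Hypothetical helper: it only inspects the fields that appear in the
    reported output (choices[].message.content and usage.completion_tokens).
    """
    choices = response.get("choices") or []
    # Any choice with non-empty message content counts as generated text.
    has_text = any((c.get("message") or {}).get("content") for c in choices)
    completion_tokens = (response.get("usage") or {}).get("completion_tokens", 0)
    return not has_text and completion_tokens == 0


# Payload mirroring the values reported above, trimmed to the relevant fields.
reported = {
    "id": "chatcmpl-0",
    "object": "chat.completion",
    "model": "qwen3-0.6b:UD-Q8_K_XL",
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {"role": "assistant", "content": ""},
        }
    ],
    "usage": {"completion_tokens": 0, "prompt_tokens": 133, "total_tokens": 133},
}

print(is_empty_completion(reported))
```

With the OpenAI Python library (version 1.x, where responses are pydantic models), the dict can presumably be obtained with response.model_dump() before passing it to the helper.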

Steps to Reproduce

llm.chat.completions.create(
  model="qwen3-0.6b:UD-Q8_K_XL",
  stream=False,  # Also doesn't work with streaming
  messages=[{"role": "user", "content": "/no_think create a random english sentence"}],
)

Log

[2025-08-28 13:51:51.570] [DEBUG] [Thread 20384] Processing request from 192.168.18.204
[2025-08-28 13:51:51.570] [DEBUG] [Thread 20384] Processing POST request for /v1/chat/completions from 192.168.18.204
[2025-08-28 13:51:51.570] [DEBUG] [Thread 20384] Calling auth middleware for POST /v1/chat/completions from 192.168.18.204
[2025-08-28 13:51:51.571] [DEBUG] Auth middleware processing request: POST /v1/chat/completions from 192.168.18.204
[2025-08-28 13:51:51.571] [DEBUG] CORS headers - Origin: , Request-Headers: , Request-Method:
[2025-08-28 13:51:51.571] [DEBUG] CORS: Request approved for origin: , method: POST
[2025-08-28 13:51:51.571] [DEBUG] CORS result - IsValid: true, IsPreflight: false
[2025-08-28 13:51:51.571] [DEBUG] Rate limit check passed for client 192.168.18.204 - Requests: 1/100, Remaining: 99
[2025-08-28 13:51:51.571] [DEBUG] Rate limit result - Allowed: true, Used: 1, Remaining: 99
[2025-08-28 13:51:51.571] [DEBUG] Request approved for client 192.168.18.204 - Rate limit: 1/100, CORS origin: none
[2025-08-28 13:51:51.571] [DEBUG] Auth middleware completed - Request allowed: true
[2025-08-28 13:51:51.571] [DEBUG] [Thread 20384] Auth middleware result - Allowed: true, Status: 200, Reason:
[2025-08-28 13:51:51.571] [DEBUG] [Thread 20384] Content-Length: 659
[2025-08-28 13:51:51.572] [DEBUG] [Thread 20384] Read 659 additional bytes for body
[2025-08-28 13:51:51.572] [INFO] [Thread 20384] Received chat completion request
[2025-08-28 13:51:51.573] [INFO] Engine ID 'qwen3-8b:UD-Q8_K_XL' was unloaded due to inactivity. Attempting to reload.
[2025-08-28 13:51:51.573] [INFO] Reloading llama-vulkan inference engine plugin...
[2025-08-28 13:51:51.575] [INFO] Successfully loaded inference engine: llama-vulkan
[2025-08-28 13:51:51.575] [INFO] Creating new inference engine instance for reload...
[2025-08-28 13:51:51.575] [INFO] Reloading model from path: C:\ProgramData\Kolosal\bin\models\Qwen3-8B-128K-UD-Q8_K_XL.gguf
[INFERENCE] Using CUDA or Vulkan
[2025-08-28 13:51:51.674] [DEBUG] [Thread 8628] Processing request from 192.168.18.204
[2025-08-28 13:51:51.674] [DEBUG] [Thread 8628] Processing POST request for /v1/chat/completions from 192.168.18.204
[2025-08-28 13:51:51.674] [DEBUG] [Thread 8628] Calling auth middleware for POST /v1/chat/completions from 192.168.18.204
[2025-08-28 13:51:51.674] [DEBUG] Auth middleware processing request: POST /v1/chat/completions from 192.168.18.204
[2025-08-28 13:51:51.674] [DEBUG] CORS headers - Origin: , Request-Headers: , Request-Method:
[2025-08-28 13:51:51.674] [DEBUG] CORS: Request approved for origin: , method: POST
[2025-08-28 13:51:51.674] [DEBUG] CORS result - IsValid: true, IsPreflight: false
[2025-08-28 13:51:51.674] [DEBUG] Rate limit check passed for client 192.168.18.204 - Requests: 2/100, Remaining: 98
[2025-08-28 13:51:51.675] [DEBUG] Rate limit result - Allowed: true, Used: 2, Remaining: 98
[2025-08-28 13:51:51.675] [DEBUG] Request approved for client 192.168.18.204 - Rate limit: 2/100, CORS origin: none
[2025-08-28 13:51:51.675] [DEBUG] Auth middleware completed - Request allowed: true
[2025-08-28 13:51:51.675] [DEBUG] [Thread 8628] Auth middleware result - Allowed: true, Status: 200, Reason:
[2025-08-28 13:51:51.675] [DEBUG] [Thread 8628] Content-Length: 656
[2025-08-28 13:51:51.675] [DEBUG] [Thread 8628] Read 656 additional bytes for body
[2025-08-28 13:51:51.675] [INFO] [Thread 8628] Received chat completion request
[2025-08-28 13:51:51.675] [DEBUG] Engine ID 'qwen3-8b:UD-Q8_K_XL' is being loaded by another thread. Waiting...
[2025-08-28 13:51:57.134] [DEBUG] Autoscaling check at 15647 (next check interval was: 10 seconds)
[2025-08-28 13:51:57.134] [DEBUG] Next autoscaling check in 60 seconds
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_chat_templates_init: failed to parse chat template (defaulting to chatml): Expected value expression at row 18, column 30:
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
                             ^
    {%- set index = (messages|length - 1) - loop.index0 %}

[2025-08-28 13:52:03.363] [INFO] Successfully reloaded model for engine 'qwen3-8b:UD-Q8_K_XL'
[2025-08-28 13:52:03.363] [INFO] Successfully reloaded engine ID 'qwen3-8b:UD-Q8_K_XL'.
[2025-08-28 13:52:03.363] [DEBUG] Engine ID 'qwen3-8b:UD-Q8_K_XL' loaded by another thread.
[2025-08-28 13:52:03.363] [INFO] [Thread 20384] Processing non-streaming chat completion request for model 'qwen3-8b:UD-Q8_K_XL'
[2025-08-28 13:52:03.363] [INFO] [Thread 8628] Processing non-streaming chat completion request for model 'qwen3-8b:UD-Q8_K_XL'
[INFERENCE] [ERROR] [getJobResult] Invalid job ID 0

[2025-08-28 13:52:34.095] [INFO] [Thread 20384] Non-streaming chat completion completed for model 'qwen3-8b:UD-Q8_K_XL'
[2025-08-28 13:52:34.096] [DEBUG] [Thread 20384] Completed request for /v1/chat/completions

Environment

OS: Windows 11 26100.4946
