When a client disconnects while llama-server is still processing the prompt (before any token has been streamed), the server continues the generation to completion. This wastes compute and keeps the model busy even though no client is connected to receive the output.
1. Start llama-server with any model.
2. Send a `/v1/chat/completions` request with a moderately large prompt.
3. Disconnect the client immediately after the request is sent (e.g., terminate curl, close the browser tab, or cancel the HTTP request in the client).
4. Observe that llama-server keeps generating tokens to completion even though no client is connected.
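The steps above can be reproduced from a shell; the model path and port below are assumptions (llama-server listens on port 8080 by default), and `--max-time` is used to force the client-side disconnect:

```shell
# Start the server in another terminal first (model path is an example;
# any GGUF model works):
#   llama-server -m ./models/model.gguf --port 8080

# Send a chat completion request with a sizeable prompt, then abort the
# client after 1 second -- before the first token is streamed back.
curl --max-time 1 http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Write a long essay about networking."}
        ],
        "stream": true
      }'

# curl exits with code 28 (operation timed out) and the client socket is
# closed. If the bug reproduces, the server log continues to show
# generation activity until the response runs to completion.
```

Watching the server log (or GPU utilization) after curl exits is enough to confirm the orphaned generation.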