API Call to Model Never Resolves, Leaving cURL Request Open Indefinitely #1074

## LocalAI version:

v1.25.0 (latest)

## Environment, CPU architecture, OS, and Version:

Linux cocopilot 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

## Describe the bug:

A curl request to the LocalAI server never completes; the server appears to hang indefinitely.

The debug output (`DBG GRPC(coquito-127.0.0.1:38731): stdout`) shows that the response is generated correctly, yet the HTTP request stays open and never returns a result.

## To Reproduce:

1. Run the LocalAI server with the "CUSTOM" model.
2. Execute the following curl command:

```bash
curl http://192.168.0.222:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "mymodel",
     "messages": [{"role": "user", "content": "myJSON"}],
     "temperature": 0.9
}'
```

## Expected behavior:

I expect to receive a response from the server, but the request hangs indefinitely.
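
To make the hang easier to report and reproduce, the same request can be sent with a hard client-side timeout, so that a stalled call ends with an explicit "timeout" result instead of blocking the terminal. The following is a minimal sketch using only the Python standard library; the function name `post_chat_completion` and the timeout value are illustrative assumptions, not part of LocalAI.

```python
import json
import socket
import urllib.error
import urllib.request


def post_chat_completion(url, payload, timeout_s):
    """POST a JSON payload and return (status, body).

    status is "ok", "timeout", or "error". The name and return shape are
    hypothetical helpers for this report, not a LocalAI API.
    """
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return "ok", resp.read().decode("utf-8")
    except socket.timeout:
        # Raised when the server accepts the connection but sends nothing back in time.
        return "timeout", None
    except urllib.error.HTTPError as exc:
        return "error", "HTTP %d" % exc.code
    except urllib.error.URLError as exc:
        # Connection-phase timeouts are wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return "timeout", None
        return "error", str(exc.reason)


# Example call mirroring the curl command above (the 300 s limit is an assumption):
# status, body = post_chat_completion(
#     "http://192.168.0.222:8080/v1/chat/completions",
#     {"model": "mymodel",
#      "messages": [{"role": "user", "content": "myJSON"}],
#      "temperature": 0.9},
#     timeout_s=300,
# )
# print(status, body)
```

Note that the `timeout` argument of `urlopen` bounds each blocking socket operation rather than the whole request, which is sufficient here because the server sends no bytes at all.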

`31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  328:        blk.36.attn_output.weight q8_0     [  5120,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  329:           blk.36.ffn_gate.weight q8_0     [  5120, 13824,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  330:             blk.36.ffn_up.weight q8_0     [  5120, 13824,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  331:           blk.36.ffn_down.weight q8_0     [ 13824,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  332:          blk.36.attn_norm.weight f32      [  5120,     1,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  333:           blk.36.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  334:             blk.37.attn_q.weight q8_0     [  5120,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  335:             blk.37.attn_k.weight q8_0     [  5120,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  336:             blk.37.attn_v.weight q8_0     [  5120,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  337:        blk.37.attn_output.weight q8_0     [  5120,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  338:           blk.37.ffn_gate.weight q8_0     [  5120, 13824,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  339:             blk.37.ffn_up.weight q8_0     [  5120, 13824,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  340:           blk.37.ffn_down.weight q8_0     [ 13824,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  341:          blk.37.attn_norm.weight f32      [  5120,     1,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  342:           blk.37.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  343:             blk.38.attn_q.weight q8_0     [  5120,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  344:             blk.38.attn_k.weight q8_0     [  5120,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  345:             blk.38.attn_v.weight q8_0     [  5120,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  346:        blk.38.attn_output.weight q8_0     [  5120,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  347:           blk.38.ffn_gate.weight q8_0     [  5120, 13824,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  348:             blk.38.ffn_up.weight q8_0     [  5120, 13824,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  349:           blk.38.ffn_down.weight q8_0     [ 13824,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  350:          blk.38.attn_norm.weight f32      [  5120,     1,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  351:           blk.38.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  352:             blk.39.attn_q.weight q8_0     [  5120,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  353:             blk.39.attn_k.weight q8_0     [  5120,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  354:             blk.39.attn_v.weight q8_0     [  5120,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  355:        blk.39.attn_output.weight q8_0     [  5120,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  356:           blk.39.ffn_gate.weight q8_0     [  5120, 13824,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  357:             blk.39.ffn_up.weight q8_0     [  5120, 13824,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  358:           blk.39.ffn_down.weight q8_0     [ 13824,  5120,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  359:          blk.39.attn_norm.weight f32      [  5120,     1,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  360:           blk.39.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  361:               output_norm.weight f32      [  5120,     1,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - tensor  362:                    output.weight q8_0     [  5120, 32000,     1,     1 ]
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv   0:                       general.architecture str     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv   1:                               general.name str     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv   2:                       llama.context_length u32     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv   3:                     llama.embedding_length u32     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv   4:                          llama.block_count u32     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv   7:                 llama.attention.head_count u32     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv  10:                          general.file_type u32     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - kv  18:               general.quantization_version u32     
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - type  f32:   81 tensors
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_model_loader: - type q8_0:  282 tensors
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: format         = GGUF V2 (latest)
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: arch           = llama
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: vocab type     = SPM
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: n_vocab        = 32000
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: n_merges       = 0
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: n_ctx_train    = 4096
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: n_ctx          = 950
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: n_embd         = 5120
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: n_head         = 40
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: n_head_kv      = 40
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: n_layer        = 40
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: n_rot          = 128
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: n_gqa          = 1
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: f_norm_eps     = 1.0e-05
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: n_ff           = 13824
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: freq_base      = 10000.0
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: freq_scale     = 1
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: model type     = 13B
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: model ftype    = mostly Q8_0
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: model size     = 13.02 B
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: general.name   = LLaMA v2
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: BOS token = 1 '<s>'
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: EOS token = 2 '</s>'
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: UNK token = 0 '<unk>'
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_print_meta: LF token  = 13 '<0x0A>'
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_tensors: ggml ctx size = 13189.98 MB
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llm_load_tensors: mem required  = 13189.98 MB (+ 1484.38 MB per state)
[127.0.0.1]:41396  200  -  GET      /readyz
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr ...................................................................................................
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_new_context_with_model: kv self size  = 1484.38 MB
11:31AM DBG GRPC(coquito-127.0.0.1:38731): stderr llama_new_context_with_model: compute buffer total size =  105.69 MB
[127.0.0.1]:35954  200  -  GET      /readyz
[127.0.0.1]:46878  200  -  GET      /readyz
[127.0.0.1]:48512  200  -  GET      /readyz
11:35AM DBG GRPC(coquito-127.0.0.1:38731): stdout  {"program":"TN Deportivo a las 13","tasks":{"title":"Generar un título en función al summary","summary":"Resumen de 50 palabras","short_summary":"Resumen corto maximo 20 palabras","hashtags":"Generar hashtags separados por coma","startTime":"Identificar podID y elegir en que MM:SS habla de lo citado en el resúmen","emotion":"La felicidad que este texto podría causar en un humano (0 tristeza, 1 felicidad)","products":"Lista de productos que se puedan vender relacionados a la temática del programa, se especifico con los productos. Pon false si no hay productos relacionados o si la temática no es alegre"},"query":" PodId:64fdc4c88b49ef1c0d56514c PodTranscription: que se va a hacer la incómitra,\nPatriz Aburo,\nyo iba a intentar mostrar que si\npresento libro, Patriz Aburo,\nse jueves,\nel jueves presento libro,\nla foto de derecho con su\nprograma de gobierno.\nBien, lo cual un gran acto.\nMuy bien.\nGracias, Ventur.\nNos vamos a descansar la voz\nporque tenés un día largo.\nUn día larguísimo.\nVamos a ir a la general paz,\nallí se dio un choque,\nuna mujer que fue a otro peceo,\npues estaba cruzando por la mitad\nde la general paz.\nAl aferrato está allí,\nen vivo, Alan.\nMuy bien.\nEn estos momentos continuan\nlas tareas periciales de la\nunidad criminalística de la\npolicía de la ciudad.\nAl amo de una mujer que cruzó\ncorriendo ambas manos de la\ngeneral paz y fue arrollada por\nun auto que circulaba en la mano\nhacia el riachuelo.\nEl hecho ocurrió pasar la\nsiete de la madrugada.\nEn ese momento,\nel auto finalmente se detiene,\nse presenta el personal.\nPolicía, y se corta toda la\nmano hacia riachuelo.\nEn esta situación estamos,\nahora están retirando el auto\nporque las pericias acaban\ndeterminar y entonces resta\na guardar la llegada del transporte\nen médicos, fórense,\nla morguera, como se le dice,\npopularmente, pero no cesan\nlas maniobras peligroso.\nLa gente no se le va a ver con\nla mano hacia el 
paro.\nLa gente no se le va a ver con\nla mano hacia el paro.\nAhora van a ver que un bolsbag\nen bola de color azul dio marcha\ntras por la colectora de general\npapu, que se equivocó y va a\nbuscar tomar la rampa de acceso.\nDespués también vimos como\notra persona más caminaba\na las inmediaciones.\nRealmente una zona complicada.\nImaginate la siete y veinte y la\nmañana va manejando, se te cruzan\nen infracción, mucha policía,\npero nadie le llamó la atención.\nAhora mismo, cuanto a este\
11:35AM DBG GRPC(coquito-127.0.0.1:38731): stdout ","response":"json structure (title, summary, startTime, short_summary, hashtags, emotion, products)"}
11:35AM DBG GRPC(coquito-127.0.0.1:38731): stdout {"program":"TN Deportivo a las 13","tasks":{"title":"Resumen de TN Deportivo a las 13","summary":"Presentación de libro de Patriz Aburo, accidente en la General Paz, investigaciones policiales y pericias forenses","short_summary":"Presentación de libro, accidente en la General Paz, investigaciones policiales y pericias forenses","hashtags":"#TNDeportivo, #AccidenteGeneralPaz, #InvestigacionesPoliciales, #PericiasForenses","startTime":"2023-01-14T13:00:00Z","short_summary":"Presentación de libro, accidente en la General Paz, investigaciones policiales y pericias forenses","emotion":null,"products":null}
11:35AM DBG GRPC(coquito-127.0.0.1:38731): stdout {"program":"TN Deportivo a las 13","tasks":{"title":"Resumen de TN Deportivo a las 13","summary":"Presentación de libro de Patriz Aburo, accidente en la General Paz, investigaciones policiales y pericias forenses","short_summary":"Presentación de libro, accidente en la General Paz, investigaciones policiales y pericias forenses","hashtags":"#TNDeportivo, #AccidenteGeneralPaz, #InvestigacionesPoliciales, #PericiasForenses","startTime":"2023-01-14T13:00:00Z","short_summary":"Presentación de libro, accidente en la General Paz, investigaciones policiales y pericias forenses","emotion":null,"products":null}
[127.0.0.1]:58172  200  -  GET      /readyz
[127.0.0.1]:49096  200  -  GET      /readyz
[127.0.0.1]:54406  200  -  GET      /readyz
[127.0.0.1]:59174  200  -  GET      /readyz
[127.0.0.1]:44248  200  -  GET      /readyz `


## Additional context:

The model is Llama 2 13B.

Model directory:

```
coco.yaml  coquito  llama2-chat-message.tmpl
```

`coco.yaml`:

```yaml
name: coco
backend: llama
parameters:
  top_k: 80
  temperature: 0.4
  top_p: 0.5
  model: coquito
#context_size: 100
#threads: 13
#debug: true
low_vram: true
#numa: true
# Enable F16 if backend supports it
#f16: true
#gpu_layers: 22
# Enable debugging
debug: true
ngqa: 5
template:
   chat_message: llama2-chat-message.tmpl
context_size: 950
system_prompt:
```
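
Because the configuration caps `context_size` at 950, a rough comparison of prompt length against that limit may be useful when reproducing the issue. The sketch below is only a heuristic: the ratio of 4 characters per token and the 256-token reply reserve are assumptions, not values from the model's tokenizer or from LocalAI, and whether prompt length is related to the hang is not established here.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Crude token estimate. chars_per_token is an assumed ratio, not a real tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(prompt, context_size, reserve_for_reply=256, chars_per_token=4.0):
    """Return True if the estimated prompt size plus a reply reserve fits the context.

    reserve_for_reply is a hypothetical budget for generated tokens.
    """
    needed = estimate_tokens(prompt, chars_per_token) + reserve_for_reply
    return needed <= context_size


# Example with the context_size from coco.yaml:
# print(fits_context(open("prompt.txt").read(), context_size=950))
```

For an exact count, the prompt would need to be tokenized with the model's own tokenizer instead of this approximation.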
