
🐛 [Bug]: New install - response keeps repeating the last line #1182

@DeadEnded

Bug description

I just pulled the image and spun up a container with default settings. I downloaded the Mistral-7B model and left everything at the defaults. I've tried a few short questions, and each answer repeats its last line over and over until I stop the container.
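For context on the symptom: runaway repetition like this is usually mitigated at sampling time by a repeat penalty that dampens the logits of recently generated tokens. A minimal sketch of that mechanism (illustrative only, not this project's code — the function name and defaults here are my own):

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.1):
    """Dampen logits of recently generated tokens (llama.cpp-style).

    Positive logits are divided by the penalty, negative logits are
    multiplied by it, so repeated tokens become less likely either way.
    """
    out = list(logits)
    for t in set(recent_tokens):
        if out[t] > 0:
            out[t] /= penalty
        else:
            out[t] *= penalty
    return out


# Token 0 (positive logit) and token 1 (negative logit) were just emitted:
adjusted = apply_repeat_penalty([2.0, -1.0, 0.5], recent_tokens=[0, 1])
```

If the UI exposes a repeat-penalty setting, raising it above 1.0 may be a workaround while the default-config bug is investigated.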

Steps to reproduce

  1. Spin up a new container with default settings (from the repo)
  2. Download Mistral-7B
  3. Start a new chat and ask "what is the square root of nine"

Environment Information

Docker version: 25.0.3
OS: Ubuntu 22.04.4 LTS on kernel 5.15.0-97
CPU: AMD Ryzen 5 2400G
Browser: Firefox 123.0

Screenshots

(screenshot attached to the original issue)

Relevant log output

llm_load_print_meta: BOS token        = 1 '<s>'

llm_load_print_meta: EOS token        = 2 '</s>'

llm_load_print_meta: UNK token        = 0 '<unk>'

llm_load_print_meta: LF token         = 13 '<0x0A>'

llm_load_tensors: ggml ctx size =    0.11 MiB

llm_load_tensors: offloading 0 repeating layers to GPU

llm_load_tensors: offloaded 0/33 layers to GPU

llm_load_tensors:        CPU buffer size =  4165.37 MiB

...............................................................................................

llama_new_context_with_model: n_ctx      = 2153

llama_new_context_with_model: freq_base  = 10000.0

llama_new_context_with_model: freq_scale = 1

llama_kv_cache_init:        CPU KV buffer size =   269.13 MiB

llama_new_context_with_model: KV self size  =  269.12 MiB, K (f16):  134.56 MiB, V (f16):  134.56 MiB

llama_new_context_with_model:        CPU input buffer size   =    12.22 MiB

llama_new_context_with_model:        CPU compute buffer size =   174.42 MiB

llama_new_context_with_model: graph splits (measure): 1

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 

Model metadata: {'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '32768', 'general.name': 'mistralai_mistral-7b-v0.1', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '15'}

18:signal-handler (1709671894) Received SIGTERM scheduling shutdown...

Received termination signal!

++ _term

++ echo 'Received termination signal!'

++ kill -TERM 18

++ kill -TERM 19

18:signal-handler (1709671894) Received SIGTERM scheduling shutdown...

18:signal-handler (1709671894) Received SIGTERM scheduling shutdown...
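As a sanity check on the log above, the reported KV-cache size follows directly from the model metadata, assuming the f16 K/V layout llama.cpp reports (2 bytes per value, K and V each sized n_ctx × layers × kv_heads × head_dim):

```python
# Reproduce "KV self size = 269.12 MiB" from the logged metadata.
n_ctx = 2153        # llama_new_context_with_model: n_ctx
n_layers = 32       # llama.block_count
n_kv_heads = 8      # llama.attention.head_count_kv
head_dim = 128      # llama.rope.dimension_count
bytes_f16 = 2

# Factor of 2 covers both the K and the V cache.
kv_bytes = 2 * n_ctx * n_layers * n_kv_heads * head_dim * bytes_f16
print(f"{kv_bytes / 2**20:.2f} MiB")  # → 269.12 MiB
```

This confirms the cache is sized for the unusual n_ctx of 2153, not the model's native 32768-token context from the metadata.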

Confirmations

  • I'm running the latest version of the main branch.
  • I checked existing issues to see if this has already been described.
