KV cache seems not to be reset before being applied to a new sequence

**Describe the bug**  

In several situations, we obtain this error:
```
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 810, in _forward_internal
    result = self._forward_internal_no_check(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 917, in _forward_internal_no_check
    y = block.forward(
        ^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 58, in forward
    self._check_kv_cache(cache, block_idx, batch_size, chunk_len)
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 107, in _check_kv_cache
    raise ValueError(
ValueError: KV cache for layer 0: chunk_len = 32768, must be <= max_forward_length() = 2048 (input_pos = 34816)
```

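This is with cache length 32768 and chunk size 2048, so the prefill chunk has size 32k and all other chunks have size <= 2k. What happens here is that the prefill forward is called while `input_pos > 0` in the KV cache. Note that `input_pos = 34816 = 32768 + 2048`, which is consistent with one prefill and one further chunk having already been processed. The KV caches should have been reset before the new sequence, but they are not!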
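The failure mode can be illustrated with a minimal, self-contained sketch. `ToyKVCache`, `process_sequence`, and `reset` are hypothetical names for illustration only and are not the `keys_values` API; only `max_forward_length()` and `input_pos` are taken from the error message. The sketch assumes that a fresh cache accepts a prefill of up to `cache_length` tokens and that later calls accept at most `chunk_size` tokens.

```python
class ToyKVCache:
    """Hypothetical stand-in for a per-layer KV cache (not the keys_values API)."""

    def __init__(self, cache_length: int, chunk_size: int) -> None:
        self.cache_length = cache_length
        self.chunk_size = chunk_size
        self.input_pos = 0  # tokens processed so far for the current sequence

    def max_forward_length(self) -> int:
        # Assumption: prefill (input_pos == 0) may fill the whole cache,
        # every later call is limited to one chunk.
        return self.cache_length if self.input_pos == 0 else self.chunk_size

    def forward(self, chunk_len: int) -> None:
        limit = self.max_forward_length()
        if chunk_len > limit:
            raise ValueError(
                f"chunk_len = {chunk_len}, must be <= max_forward_length() = "
                f"{limit} (input_pos = {self.input_pos})"
            )
        self.input_pos += chunk_len

    def reset(self) -> None:
        # Must be called before the cache is applied to a new sequence.
        self.input_pos = 0


def process_sequence(cache: ToyKVCache, chunk_lens: list, reset_first: bool) -> None:
    if reset_first:
        cache.reset()
    for chunk_len in chunk_lens:
        cache.forward(chunk_len)


cache = ToyKVCache(cache_length=32768, chunk_size=2048)
process_sequence(cache, [32768, 2048], reset_first=True)  # input_pos is now 34816

# Starting a new sequence without a reset triggers the check on the prefill chunk.
try:
    process_sequence(cache, [32768, 2048], reset_first=False)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# With a reset, the same sequence of calls goes through.
process_sequence(cache, [32768, 2048], reset_first=True)
```

In this toy setting, the message has the same shape as the one in the traceback, which supports the reading that the prefill was issued against a cache that still held state from an earlier pass.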


**To reproduce**  

```
CUDA_VISIBLE_DEVICES="0" PYTORCH_ALLOC_CONF=expandable_segments:True KEYSVALS_LOG_DIR="/home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/loradora/full/logs";
python3 keys_values/__main__.py finetune_long_full Qwen/Qwen2.5-0.5B \
  --out_dir /home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/loradora/full \
  --data LongBenchV2 \
  --data.max_seq_length 150000 \
  --data.metadata_dir /home/ubuntu/out/finetune/data \
  --precision bf16-true \
  --kv_cache.name h2o-torch-quantized8 \
  --kv_cache.cache_length 32768 \
  --kv_cache.chunk_size 2048 \
  --verbose some \
  --grad.layers_per_cell 1 \
  --train.save_interval 10 \
  --train.micro_batch_size 2 \
  --train.global_batch_size 2 \
  --eval.interval 10 \
  --eval.micro_batch_size 4 \
  --head_model seq_classification_on_logits \
  --eval.initial_validation False \
  --data.trainloader_longest_first True
```

After 30 iterations, an out-of-memory error is caught, and the run then fails with the same `ValueError`:
```
Caught out of memory error. Original message:
CUDA out of memory. Tried to allocate 4.00 GiB. GPU 0 has a total capacity of 39.49 GiB of which 3.28 GiB is free. Including non-PyTorch memory, this process has 15.07 GiB memory in use. Process 355057 has 21.13 GiB memory in use. Of the allocated memory 14.21 GiB is allocated by PyTorch, and 345.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Reducing 'attention_forward_temp_size_gb' limit:
Old value: 4.000
New value: 3.000
[...]
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 810, in _forward_internal
    result = self._forward_internal_no_check(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 917, in _forward_internal_no_check
    y = block.forward(
        ^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 58, in forward
    self._check_kv_cache(cache, block_idx, batch_size, chunk_len)
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 107, in _check_kv_cache
    raise ValueError(
ValueError: KV cache for layer 0: chunk_len = 32768, must be <= max_forward_length() = 2048 (input_pos = 34816)
```
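In this log, the `ValueError` appears after an out-of-memory error was caught and `attention_forward_temp_size_gb` was reduced. One possible explanation, which the log does not confirm, is that the forward pass is retried while the KV cache still holds state from the interrupted attempt. The following sketch shows a retry loop that resets the cache before every attempt. All names (`ToyCache`, `run_with_retry`, `flaky_forward`) are hypothetical, and the built-in `MemoryError` stands in for a CUDA out-of-memory error so that the example runs without a GPU.

```python
class ToyCache:
    """Hypothetical cache that only tracks its position (not the keys_values API)."""

    def __init__(self) -> None:
        self.input_pos = 0

    def reset(self) -> None:
        self.input_pos = 0


def run_with_retry(cache, forward_fn, max_retries: int = 3):
    """Call forward_fn, retrying on MemoryError with a freshly reset cache."""
    for attempt in range(max_retries):
        # Reset before every attempt, so that a retry never sees state
        # left behind by an attempt that was interrupted midway.
        cache.reset()
        try:
            return forward_fn(cache, attempt)
        except MemoryError:
            continue  # a real loop would also lower its memory limit here
    raise RuntimeError(f"still out of memory after {max_retries} attempts")


def flaky_forward(cache, attempt: int) -> int:
    # The prefill is only valid on an empty cache.
    if cache.input_pos != 0:
        raise ValueError(f"prefill called with input_pos = {cache.input_pos}")
    cache.input_pos += 32768  # prefill chunk
    cache.input_pos += 2048   # first regular chunk
    if attempt == 0:
        raise MemoryError("simulated out of memory on the first attempt")
    return cache.input_pos


result = run_with_retry(ToyCache(), flaky_forward)
print(result)
```

Without the `cache.reset()` call inside the loop, the second attempt in this sketch would raise the `ValueError` in `flaky_forward`, because `input_pos` would still be 34816 from the interrupted first attempt.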

