KV cache seems not reset before applied to new sequence #45

Description

@mseeger

Describe the bug

In several situations, we obtain this error:

  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 810, in _forward_internal
    result = self._forward_internal_no_check(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 917, in _forward_internal_no_check
    y = block.forward(
        ^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 58, in forward
    self._check_kv_cache(cache, block_idx, batch_size, chunk_len)
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 107, in _check_kv_cache
    raise ValueError(
ValueError: KV cache for layer 0: chunk_len = 32768, must be <= max_forward_length() = 2048 (input_pos = 34816)

This is with cache length 32768 and chunk size 2048, so the prefill chunk has size 32k and all other chunks have size <= 2k. What happens here is that the prefill forward is called while input_pos > 0 in the KV cache. The KV caches should have been reset before processing the new sequence, but they were not.
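To make the failure mode concrete, here is a minimal sketch of how a stale `input_pos` leads to this error. `ToyKVCache` and `process_sequence` are hypothetical stand-ins, not the `keys_values` API. The sketch assumes that `max_forward_length()` equals the cache length while `input_pos == 0` (prefill) and the chunk size afterwards, which matches the numbers in the message: `input_pos = 34816 = 32768 + 2048` is what remains after a prefill plus one further chunk of a previous sequence.

```python
class ToyKVCache:
    """Hypothetical stand-in for a per-layer KV cache (not the keys_values API)."""

    def __init__(self, cache_length: int, chunk_size: int):
        self.cache_length = cache_length
        self.chunk_size = chunk_size
        self.input_pos = 0  # tokens processed so far for the current sequence

    def max_forward_length(self) -> int:
        # Assumption: prefill may fill the whole cache, later calls are
        # limited to one chunk.
        return self.cache_length if self.input_pos == 0 else self.chunk_size

    def forward(self, chunk_len: int) -> None:
        limit = self.max_forward_length()
        if chunk_len > limit:
            raise ValueError(
                f"chunk_len = {chunk_len}, must be <= max_forward_length() = "
                f"{limit} (input_pos = {self.input_pos})"
            )
        self.input_pos += chunk_len

    def reset(self) -> None:
        self.input_pos = 0


def process_sequence(cache: ToyKVCache, chunk_lens, reset_first: bool = True) -> None:
    """Feed the chunks of one sequence, optionally resetting the cache first."""
    if reset_first:
        cache.reset()
    for chunk_len in chunk_lens:
        cache.forward(chunk_len)


cache = ToyKVCache(cache_length=32768, chunk_size=2048)
process_sequence(cache, [32768, 2048])  # first sequence: prefill + one chunk

# Next sequence without a reset: prefill is rejected because input_pos > 0.
try:
    process_sequence(cache, [32768], reset_first=False)
    stale_error = None
except ValueError as err:
    stale_error = str(err)

# With a reset, the same prefill goes through.
process_sequence(cache, [32768, 2048], reset_first=True)
```

In this toy model, the only way to reach the reported message is to start a prefill without calling `reset()` first.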

To reproduce

CUDA_VISIBLE_DEVICES="0" PYTORCH_ALLOC_CONF=expandable_segments:True KEYSVALS_LOG_DIR="/home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/loradora/full/logs"; python3 keys_values/__main__.py finetune_long_full Qwen/Qwen2.5-0.5B --out_dir /home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/loradora/full --data LongBenchV2 --data.max_seq_length 150000 --data.metadata_dir /home/ubuntu/out/finetune/data --precision bf16-true --kv_cache.name h2o-torch-quantized8  --kv_cache.cache_length 32768 --kv_cache.chunk_size 2048 --verbose some --grad.layers_per_cell 1 --train.save_interval 10 --train.micro_batch_size 2 --train.global_batch_size 2 --eval.interval 10 --eval.micro_batch_size 4 --head_model seq_classification_on_logits  --eval.initial_validation False --data.trainloader_longest_first True

After 30 iterations:

Caught out of memory error. Original message:
CUDA out of memory. Tried to allocate 4.00 GiB. GPU 0 has a total capacity of 39.49 GiB of which 3.28 GiB is free. Including non-PyTorch memory, this process has 15.07 GiB memory in use. Process 355057 has 21.13 GiB memory in use. Of the allocated memory 14.21 GiB is allocated by PyTorch, and 345.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Reducing 'attention_forward_temp_size_gb' limit:
Old value: 4.000
New value: 3.000
[...]
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 810, in _forward_internal
    result = self._forward_internal_no_check(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 917, in _forward_internal_no_check
    y = block.forward(
        ^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 58, in forward
    self._check_kv_cache(cache, block_idx, batch_size, chunk_len)
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 107, in _check_kv_cache
    raise ValueError(
ValueError: KV cache for layer 0: chunk_len = 32768, must be <= max_forward_length() = 2048 (input_pos = 34816)
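The log above suggests one possible trigger, though this is a guess and not confirmed: the out-of-memory error is caught, `attention_forward_temp_size_gb` is reduced, and the batch is retried while the KV caches still hold state from the aborted attempt. Below is a minimal sketch of a retry loop that resets state before every attempt. All names (`SimulatedOOM`, `run_with_retry`, `reset_state`, `step`) are hypothetical. A real PyTorch training loop would catch `torch.cuda.OutOfMemoryError` instead of the stand-in exception used here.

```python
class SimulatedOOM(RuntimeError):
    """Stand-in for a CUDA out-of-memory error, so the sketch runs without a GPU."""


def run_with_retry(step, reset_state, limits_gb=(4.0, 3.0, 2.0)):
    """Try `step` with decreasing memory limits, resetting state before each try."""
    last_error = None
    for limit_gb in limits_gb:
        reset_state()  # clear partial state left over from a failed attempt
        try:
            return step(limit_gb)
        except SimulatedOOM as err:
            last_error = err
    raise last_error


# Toy state mimicking the KV cache position counter.
state = {"input_pos": 0}


def reset_state():
    state["input_pos"] = 0


def step(limit_gb):
    if state["input_pos"] != 0:
        raise ValueError(f"prefill called with input_pos = {state['input_pos']}")
    state["input_pos"] += 32768  # prefill chunk
    state["input_pos"] += 2048   # one further chunk
    if limit_gb > 3.0:
        # Fails after the cache position has already advanced.
        raise SimulatedOOM("out of memory")
    return limit_gb


used_limit = run_with_retry(step, reset_state)
```

Without the `reset_state()` call inside the loop, the second attempt in this sketch would start its prefill at `input_pos = 34816`, the value shown in the traceback.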
