Describe the bug
In several situations, we obtain this error:
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 810, in _forward_internal
    result = self._forward_internal_no_check(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 917, in _forward_internal_no_check
    y = block.forward(
        ^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 58, in forward
    self._check_kv_cache(cache, block_idx, batch_size, chunk_len)
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 107, in _check_kv_cache
    raise ValueError(
ValueError: KV cache for layer 0: chunk_len = 32768, must be <= max_forward_length() = 2048 (input_pos = 34816)
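This is with cache length 32768 and chunk size 2048, so the prefill chunk has 32768 tokens (32k) and all later chunks have at most 2048 tokens. What happens here is that the prefill forward is called while the KV cache still has input_pos > 0 (here input_pos = 34816 = 32768 + 2048, i.e. one prefill plus one chunk). The KV caches should have been reset before this call, but they are not!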
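To make the failure mode concrete, here is a minimal, hypothetical model of the check that fails. `ToyKVCache` and its methods are invented for illustration and only mimic the error message. The assumption, inferred from the numbers above and not from the keys_values source, is that `max_forward_length()` equals the cache length while `input_pos == 0` and the chunk size afterwards.

```python
class ToyKVCache:
    """Hypothetical, simplified stand-in for one layer's KV cache.

    Assumption: the first (prefill) call may process up to cache_length
    tokens; every later call is limited to chunk_size tokens.
    """

    def __init__(self, cache_length: int, chunk_size: int) -> None:
        self.cache_length = cache_length
        self.chunk_size = chunk_size
        self.input_pos = 0  # number of tokens processed so far

    def max_forward_length(self) -> int:
        # Prefill is only allowed on a fresh (or reset) cache
        return self.cache_length if self.input_pos == 0 else self.chunk_size

    def forward(self, chunk_len: int) -> None:
        if chunk_len > self.max_forward_length():
            raise ValueError(
                f"chunk_len = {chunk_len}, must be <= max_forward_length() = "
                f"{self.max_forward_length()} (input_pos = {self.input_pos})"
            )
        self.input_pos += chunk_len

    def reset(self) -> None:
        self.input_pos = 0


cache = ToyKVCache(cache_length=32768, chunk_size=2048)
cache.forward(32768)  # prefill: allowed because input_pos == 0
cache.forward(2048)   # ordinary chunk; input_pos is now 34816
try:
    cache.forward(32768)  # prefill of the next sequence without reset()
except ValueError as err:
    print(err)
# prints: chunk_len = 32768, must be <= max_forward_length() = 2048 (input_pos = 34816)
cache.reset()
cache.forward(32768)  # succeeds again once the cache has been reset
```

In this model the only way to accept a 32k prefill again is to call `reset()` first, which is exactly the step that seems to be skipped in the failing runs.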
To reproduce
CUDA_VISIBLE_DEVICES="0" PYTORCH_ALLOC_CONF=expandable_segments:True KEYSVALS_LOG_DIR="/home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/loradora/full/logs"; python3 keys_values/__main__.py finetune_long_full Qwen/Qwen2.5-0.5B --out_dir /home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/loradora/full --data LongBenchV2 --data.max_seq_length 150000 --data.metadata_dir /home/ubuntu/out/finetune/data --precision bf16-true --kv_cache.name h2o-torch-quantized8 --kv_cache.cache_length 32768 --kv_cache.chunk_size 2048 --verbose some --grad.layers_per_cell 1 --train.save_interval 10 --train.micro_batch_size 2 --train.global_batch_size 2 --eval.interval 10 --eval.micro_batch_size 4 --head_model seq_classification_on_logits --eval.initial_validation False --data.trainloader_longest_first True
After 30 iterations:
Caught out of memory error. Original message:
CUDA out of memory. Tried to allocate 4.00 GiB. GPU 0 has a total capacity of 39.49 GiB of which 3.28 GiB is free. Including non-PyTorch memory, this process has 15.07 GiB memory in use. Process 355057 has 21.13 GiB memory in use. Of the allocated memory 14.21 GiB is allocated by PyTorch, and 345.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Reducing 'attention_forward_temp_size_gb' limit:
Old value: 4.000
New value: 3.000
[...]
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 810, in _forward_internal
    result = self._forward_internal_no_check(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 917, in _forward_internal_no_check
    y = block.forward(
        ^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 58, in forward
    self._check_kv_cache(cache, block_idx, batch_size, chunk_len)
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 107, in _check_kv_cache
    raise ValueError(
ValueError: KV cache for layer 0: chunk_len = 32768, must be <= max_forward_length() = 2048 (input_pos = 34816)
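In this log the error appears after an out-of-memory error is caught and the 'attention_forward_temp_size_gb' limit is lowered. The numbers are consistent with, but do not prove, a retry that reruns the prefill while the caches still hold the prefill plus one 2048-token chunk from the failed attempt. The sketch below is a hypothetical model of such a retry loop, not the keys_values code: `SimulatedOOM`, `ToyKVCache`, `process_sequence`, `run_with_oom_retry` and their parameters are all invented for illustration, and the OOM is triggered artificially after the second chunk. It shows how resetting every cache before a retry avoids the error.

```python
class SimulatedOOM(RuntimeError):
    """Hypothetical stand-in for a CUDA OOM error, so the sketch runs without a GPU."""


class ToyKVCache:
    """Hypothetical cache: prefill may be cache_length tokens, later chunks chunk_size."""

    def __init__(self, cache_length: int, chunk_size: int) -> None:
        self.cache_length = cache_length
        self.chunk_size = chunk_size
        self.input_pos = 0

    def forward(self, chunk_len: int) -> None:
        limit = self.cache_length if self.input_pos == 0 else self.chunk_size
        if chunk_len > limit:
            raise ValueError(
                f"chunk_len = {chunk_len} > {limit} (input_pos = {self.input_pos})"
            )
        self.input_pos += chunk_len

    def reset(self) -> None:
        self.input_pos = 0


def process_sequence(caches, chunk_lens, limit_gb, max_ok_gb):
    """Hypothetical forward pass over one sequence (prefill, then chunks).

    Raises SimulatedOOM right after the second chunk if limit_gb > max_ok_gb.
    """
    for i, n in enumerate(chunk_lens):
        for cache in caches:
            cache.forward(n)
        if i == 1 and limit_gb > max_ok_gb:
            raise SimulatedOOM("CUDA out of memory")


def run_with_oom_retry(caches, chunk_lens, limit_gb, max_ok_gb, reset_on_retry=True):
    """Retry the sequence with a smaller limit after each simulated OOM."""
    while True:
        try:
            process_sequence(caches, chunk_lens, limit_gb, max_ok_gb)
            return limit_gb
        except SimulatedOOM:
            limit_gb -= 1.0  # mirrors "Old value: 4.000 / New value: 3.000"
            if limit_gb <= 0:
                raise
            if reset_on_retry:
                for cache in caches:
                    cache.reset()  # drop state left behind by the failed attempt


caches = [ToyKVCache(32768, 2048) for _ in range(2)]
final_limit = run_with_oom_retry(caches, [32768, 2048, 2048], limit_gb=4.0, max_ok_gb=3.0)
```

With `reset_on_retry=False`, the second attempt fails on the prefill with `input_pos = 34816`, matching the traceback above; with the reset, it completes at a limit of 3.0.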