Evaluation script fails when started independently >2 times

**Describe the bug**  

Running the following four commands, one per GPU, each started independently:
```
> CUDA_VISIBLE_DEVICES="0" PYTORCH_ALLOC_CONF=expandable_segments:True KEYSVALS_LOG_DIR="/home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2/eval_logs0"; python3 keys_values/__main__.py eval_long /home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2 --model_type lora --devices 1 --batch_size 4 --kv_cache.name h2o-torch-quantized8 --kv_cache.cache_length 32768 --kv_cache.chunk_size 1024 --verbose some --attention_forward_temp_size_gb 8 --lora_dropout 0

> CUDA_VISIBLE_DEVICES="1" PYTORCH_ALLOC_CONF=expandable_segments:True KEYSVALS_LOG_DIR="/home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2/eval_logs1"; python3 keys_values/__main__.py eval_long /home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2 --model_type lora --devices 1 --batch_size 4 --kv_cache.name h2o-torch-quantized8 --kv_cache.cache_length 32768 --kv_cache.chunk_size 1024 --verbose some --attention_forward_temp_size_gb 8 --lora_dropout 0

> CUDA_VISIBLE_DEVICES="2" PYTORCH_ALLOC_CONF=expandable_segments:True KEYSVALS_LOG_DIR="/home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2/eval_logs2"; python3 keys_values/__main__.py eval_long /home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2 --model_type lora --devices 1 --batch_size 4 --kv_cache.name h2o-torch-quantized8 --kv_cache.cache_length 32768 --kv_cache.chunk_size 1024 --verbose some --attention_forward_temp_size_gb 8 --lora_dropout 0

> CUDA_VISIBLE_DEVICES="3" PYTORCH_ALLOC_CONF=expandable_segments:True KEYSVALS_LOG_DIR="/home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2/eval_logs3"; python3 keys_values/__main__.py eval_long /home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2 --model_type lora --devices 1 --batch_size 4 --kv_cache.name h2o-torch-quantized8 --kv_cache.cache_length 32768 --kv_cache.chunk_size 1024 --verbose some --attention_forward_temp_size_gb 8 --lora_dropout 0
```

The runs for devices 0 and 1 work fine. The runs for devices 2 and 3 fail with this traceback:
```
  File "/home/ubuntu/sync/keys_values/keys_values/__main__.py", line 140, in main
    auto_cli(PARSER_DATA)
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/jsonargparse/_cli.py", line 129, in auto_cli
    return _run_component(component, init.get(subcommand))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/jsonargparse/_cli.py", line 227, in _run_component
    return component(**cfg)
           ^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/finetune/longcontext_eval.py", line 274, in setup
    fabric.launch(
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/lightning/fabric/fabric.py", line 1010, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/lightning/fabric/fabric.py", line 1121, in _wrap_and_launch
    return to_run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/lightning/fabric/fabric.py", line 1126, in _wrap_with_setup
    return to_run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/finetune/longcontext_eval.py", line 481, in main
    raise ex
  File "/home/ubuntu/sync/keys_values/keys_values/finetune/longcontext_eval.py", line 448, in main
    loss_values = model(batch[INPUT_IDS_NAME], batch["targets"])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 645, in forward
    return self._forward_only(input_ids, targets, scale_factor)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 1001, in _forward_only
    loss_full = self._forward_internal(input_ids, targets, scale_factor)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 810, in _forward_internal
    result = self._forward_internal_no_check(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 917, in _forward_internal_no_check
    y = block.forward(
        ^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 58, in forward
    self._check_kv_cache(cache, block_idx, batch_size, chunk_len)
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 107, in _check_kv_cache
    raise ValueError(
ValueError: KV cache for layer 0: chunk_len = 32768, must be <= max_forward_length() = 1024 (input_pos = 33792)
```

In addition, no output logs are written for the runs on devices 2 and 3.
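
One thing that may be worth ruling out while debugging: in the commands above, the variable assignments are separated from `python3` by a `;`. In a POSIX shell that makes them plain shell variables, so unless they were exported earlier in the session, the Python process would not see `CUDA_VISIBLE_DEVICES`, `PYTORCH_ALLOC_CONF` or `KEYSVALS_LOG_DIR`. Whether this is related to the failure is not established. The sketch below shows one way to pass a per-device environment explicitly and to confirm what a child process receives. The helpers `child_env` and `probe`, the `PROBE` stand-in command, and the `/tmp/eval_demo` path are hypothetical and not part of `keys_values`.

```python
import os
import subprocess
import sys


def child_env(device: int, log_root: str) -> dict:
    """Build the environment for one independent evaluation process.

    Hypothetical helper: the variable names are taken from the commands in
    this report, everything else is an assumption for illustration.
    """
    env = dict(os.environ)  # start from the parent's environment
    env["CUDA_VISIBLE_DEVICES"] = str(device)
    env["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
    env["KEYSVALS_LOG_DIR"] = f"{log_root}/eval_logs{device}"
    return env


# Stand-in for the real evaluation command: it only prints what the child
# process actually sees, so the check runs without a GPU.
PROBE = [
    sys.executable,
    "-c",
    "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES'), "
    "os.environ.get('KEYSVALS_LOG_DIR'))",
]


def probe(device: int, log_root: str) -> str:
    """Run the stand-in command with an explicit environment and return its output."""
    result = subprocess.run(
        PROBE,
        env=child_env(device, log_root),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for device in range(4):
        print(probe(device, "/tmp/eval_demo"))
```

If each child reports its own device index and log directory here, the same `env=` mapping could be used to start the real evaluation command in place of `PROBE`.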

