
Evaluation script fails when started independently >2 times #42

@mseeger

Describe the bug

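Running the following four commands, each started independently with a different CUDA_VISIBLE_DEVICES and KEYSVALS_LOG_DIR: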

> CUDA_VISIBLE_DEVICES="0" PYTORCH_ALLOC_CONF=expandable_segments:True KEYSVALS_LOG_DIR="/home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2/eval_logs0"; python3 keys_values/__main__.py eval_long /home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2 --model_type lora --devices 1 --batch_size 4 --kv_cache.name h2o-torch-quantized8 --kv_cache.cache_length 32768 --kv_cache.chunk_size 1024 --verbose some --attention_forward_temp_size_gb 8 --lora_dropout 0

> CUDA_VISIBLE_DEVICES="1" PYTORCH_ALLOC_CONF=expandable_segments:True KEYSVALS_LOG_DIR="/home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2/eval_logs1"; python3 keys_values/__main__.py eval_long /home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2 --model_type lora --devices 1 --batch_size 4 --kv_cache.name h2o-torch-quantized8 --kv_cache.cache_length 32768 --kv_cache.chunk_size 1024 --verbose some --attention_forward_temp_size_gb 8 --lora_dropout 0

> CUDA_VISIBLE_DEVICES="2" PYTORCH_ALLOC_CONF=expandable_segments:True KEYSVALS_LOG_DIR="/home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2/eval_logs2"; python3 keys_values/__main__.py eval_long /home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2 --model_type lora --devices 1 --batch_size 4 --kv_cache.name h2o-torch-quantized8 --kv_cache.cache_length 32768 --kv_cache.chunk_size 1024 --verbose some --attention_forward_temp_size_gb 8 --lora_dropout 0

> CUDA_VISIBLE_DEVICES="3" PYTORCH_ALLOC_CONF=expandable_segments:True KEYSVALS_LOG_DIR="/home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2/eval_logs3"; python3 keys_values/__main__.py eval_long /home/ubuntu/out/finetune/ml_ws/lora/qwen2_5_0_5b/variant0_copy2 --model_type lora --devices 1 --batch_size 4 --kv_cache.name h2o-torch-quantized8 --kv_cache.cache_length 32768 --kv_cache.chunk_size 1024 --verbose some --attention_forward_temp_size_gb 8 --lora_dropout 0

The runs for devices 0 and 1 work fine. The runs for devices 2 and 3 fail with the following traceback:

  File "/home/ubuntu/sync/keys_values/keys_values/__main__.py", line 140, in main
    auto_cli(PARSER_DATA)
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/jsonargparse/_cli.py", line 129, in auto_cli
    return _run_component(component, init.get(subcommand))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/jsonargparse/_cli.py", line 227, in _run_component
    return component(**cfg)
           ^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/finetune/longcontext_eval.py", line 274, in setup
    fabric.launch(
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/lightning/fabric/fabric.py", line 1010, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/lightning/fabric/fabric.py", line 1121, in _wrap_and_launch
    return to_run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/lightning/fabric/fabric.py", line 1126, in _wrap_with_setup
    return to_run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/finetune/longcontext_eval.py", line 481, in main
    raise ex
  File "/home/ubuntu/sync/keys_values/keys_values/finetune/longcontext_eval.py", line 448, in main
    loss_values = model(batch[INPUT_IDS_NAME], batch["targets"])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/virtenvs/keysvals/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 645, in forward
    return self._forward_only(input_ids, targets, scale_factor)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 1001, in _forward_only
    loss_full = self._forward_internal(input_ids, targets, scale_factor)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 810, in _forward_internal
    result = self._forward_internal_no_check(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/long_context.py", line 917, in _forward_internal_no_check
    y = block.forward(
        ^^^^^^^^^^^^^^
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 58, in forward
    self._check_kv_cache(cache, block_idx, batch_size, chunk_len)
  File "/home/ubuntu/sync/keys_values/keys_values/kvcache/stack_layers.py", line 107, in _check_kv_cache
    raise ValueError(
ValueError: KV cache for layer 0: chunk_len = 32768, must be <= max_forward_length() = 1024 (input_pos = 33792)
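
The error message suggests a per-call bound on how many tokens a KV cache layer accepts in one forward pass. Below is a minimal sketch of that kind of check, reconstructed only from the message text. The function name `check_chunk_len` and its parameters are hypothetical and are not the actual code in `stack_layers.py`. Note that the reported `input_pos` of 33792 equals 32768 + 1024, the `--kv_cache.cache_length` and `--kv_cache.chunk_size` values from the commands, though the issue does not say why a chunk of 32768 tokens arrives at that position.

```python
def check_chunk_len(
    layer_idx: int, chunk_len: int, max_forward_length: int, input_pos: int
) -> None:
    """Hypothetical stand-in for the check that raised in the traceback.

    Rejects a forward call whose chunk is longer than the cache accepts at once.
    """
    if chunk_len > max_forward_length:
        raise ValueError(
            f"KV cache for layer {layer_idx}: chunk_len = {chunk_len}, "
            f"must be <= max_forward_length() = {max_forward_length} "
            f"(input_pos = {input_pos})"
        )


# A chunk that respects the bound passes silently.
check_chunk_len(layer_idx=0, chunk_len=1024, max_forward_length=1024, input_pos=32768)

# The values from the report trigger the error.
try:
    check_chunk_len(layer_idx=0, chunk_len=32768, max_forward_length=1024, input_pos=33792)
except ValueError as err:
    message = str(err)
    print(message)
```

Read this way, the failing runs pass a full 32768-token chunk to a cache that expects chunks of at most 1024 tokens, whereas the runs for devices 0 and 1 apparently never hit this path.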

Also, no output logs are written for devices 2 and 3.
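
The issue does not identify a root cause. One thing that may be worth ruling out is the `;` between the variable assignments and `python3` in the commands above. In a POSIX shell, `VAR=value; cmd` sets a shell variable that `cmd` inherits only if `VAR` was already exported, whereas `VAR=value cmd` always places it in the environment of `cmd`. The sketch below demonstrates the difference using a placeholder variable `KEYSVALS_DEMO_VAR` (a hypothetical name, not used by the project) and assumes `sh` is available.

```python
import os
import shlex
import subprocess
import sys

VAR = "KEYSVALS_DEMO_VAR"  # hypothetical name, chosen only for this demo

# Child command: print the variable as seen by a separate Python process.
CHILD = (
    f"{shlex.quote(sys.executable)} -c "
    f"\"import os; print(os.environ.get('{VAR}', 'unset'))\""
)


def child_sees(shell_line: str) -> str:
    """Run shell_line with `sh -c` and return what the child process printed."""
    # Start from an environment where the variable is not already exported.
    env = {k: v for k, v in os.environ.items() if k != VAR}
    done = subprocess.run(
        ["sh", "-c", shell_line], env=env, capture_output=True, text=True, check=True
    )
    return done.stdout.strip()


# Assignment followed by a separate command: the value stays in the shell.
with_semicolon = child_sees(f"{VAR}=0; {CHILD}")
# Assignment used as a command prefix: the value reaches the child process.
as_prefix = child_sees(f"{VAR}=0 {CHILD}")

print("with ';':", with_semicolon)
print("as prefix:", as_prefix)
```

If the variables are not exported elsewhere in the session, dropping the `;` or using `export` would make sure each process receives its own `CUDA_VISIBLE_DEVICES` and `KEYSVALS_LOG_DIR`. Whether this explains the failures on devices 2 and 3 is not established by the report.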


Labels: bug
