Quantization for Llama-70b raises CUDA OOM #1128

Hello, 

Using the quantization config provided by torchtune, I am unable to run a quantization of llama-3-70b.

```shell 
tune run quantize --config configs/custom_quantization_untrained_llama.yaml 
```
where `custom_quantization_untrained_llama.yaml` is the default quantization config, changed only to point to the safetensors files of llama-3-70b.

The resolved config is:
```shell
2024-06-27:14:26:29,993 INFO     [_utils.py:33] Running QuantizationRecipe with resolved config:

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /data/checkpoints/llama-3-70b-instruct-hf/
  checkpoint_files:
  - model-00001-of-00030.safetensors
  - model-00002-of-00030.safetensors
  - model-00003-of-00030.safetensors
  - model-00004-of-00030.safetensors
  - model-00005-of-00030.safetensors
  - model-00006-of-00030.safetensors
  - model-00007-of-00030.safetensors
  - model-00008-of-00030.safetensors
  - model-00009-of-00030.safetensors
  - model-00010-of-00030.safetensors
  - model-00011-of-00030.safetensors
  - model-00012-of-00030.safetensors
  - model-00013-of-00030.safetensors
  - model-00014-of-00030.safetensors
  - model-00015-of-00030.safetensors
  - model-00016-of-00030.safetensors
  - model-00017-of-00030.safetensors
  - model-00018-of-00030.safetensors
  - model-00019-of-00030.safetensors
  - model-00020-of-00030.safetensors
  - model-00021-of-00030.safetensors
  - model-00022-of-00030.safetensors
  - model-00023-of-00030.safetensors
  - model-00024-of-00030.safetensors
  - model-00025-of-00030.safetensors
  - model-00026-of-00030.safetensors
  - model-00027-of-00030.safetensors
  - model-00028-of-00030.safetensors
  - model-00029-of-00030.safetensors
  - model-00030-of-00030.safetensors
  model_type: LLAMA3
  output_dir: /workspaces/Meta-Llama-3-70B-Instruct/
  recipe_checkpoint: null
device: cuda
dtype: bf16
model:
  _component_: torchtune.models.llama3.llama3_70b
quantizer:
  _component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
  groupsize: 256
seed: 1234
```

The error is:
`torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU`
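
For context, here is a rough back-of-the-envelope estimate of how much memory the weights alone would need with `device: cuda` and `dtype: bf16`. This is only a sketch: the parameter count is the nominal 70e9 (an assumption, not the exact count), the helper `weight_memory_gib` is hypothetical and not part of torchtune, and the int4 figure ignores the per-group scales and zero points that `groupsize: 256` would add.

```python
GIB = 2**30  # bytes per GiB


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone (no activations, no buffers), in GiB."""
    return num_params * bytes_per_param / GIB


# Assumption: nominal parameter count for a "70b" model, not the exact figure.
NUM_PARAMS = 70e9

# bf16 stores each parameter in 2 bytes.
bf16_gib = weight_memory_gib(NUM_PARAMS, 2)

# int4 stores each parameter in 4 bits (0.5 bytes); group scales/zero points are ignored here.
int4_gib = weight_memory_gib(NUM_PARAMS, 0.5)

print(f"bf16 weights: {bf16_gib:.1f} GiB")
print(f"int4 weights: {int4_gib:.1f} GiB")
```

Under these assumptions the bf16 weights come to roughly 130 GiB, so if the recipe places the full bf16 model on a single GPU before quantizing, that alone could explain the OOM. The GPU's memory size is not stated above, so this is a plausibility check rather than a diagnosis.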
