
int4_weight_only Slows Down torch.nn.Linear for Llama2 7B Shapes #1606

Open
@mostafaelhoushi

Description

I have created a small script to benchmark int4 quantization on A100 GPUs, with inputs that have batch size 1 and seqlen 1.
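Roughly, the benchmark looks like the sketch below (simplified, not my exact script; it assumes torchao's `quantize_` / `int4_weight_only` API, times each linear with CUDA events, and runs eagerly without torch.compile):

```python
import torch
from torchao.quantization import quantize_, int4_weight_only

@torch.no_grad()
def bench(linear, x, n_warmup=10, n_iters=100):
    # Warm up, then time with CUDA events; returns average ms per forward call.
    for _ in range(n_warmup):
        linear(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_iters):
        linear(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters

for input_dim, output_dim in [(4096, 4096), (4096, 11008), (11008, 4096), (11008, 11008)]:
    print(f"# input_dim, output_dim = {input_dim}, {output_dim}")
    # Batch size 1, seqlen 1 input, as in decode-time inference.
    x = torch.randn(1, 1, input_dim, dtype=torch.bfloat16, device="cuda")

    baseline = torch.nn.Linear(input_dim, output_dim, bias=False,
                               dtype=torch.bfloat16, device="cuda")
    print(f"Baseline:\t{bench(baseline, x)} ms")

    quantized = torch.nn.Linear(input_dim, output_dim, bias=False,
                                dtype=torch.bfloat16, device="cuda")
    # Apply torchao int4 weight-only quantization in place.
    quantize_(quantized, int4_weight_only())
    print(f"Quantized:\t{bench(quantized, x)} ms")
```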

When I test weight shapes that exist in Llama2 7B, I actually get a slowdown:

# input_dim, output_dim = 4096, 4096
Baseline:       0.023313920497894287 ms
Quantized:      0.08300095558166504 ms
# input_dim, output_dim = 4096, 11008
Baseline:       0.06082496166229248 ms
Quantized:      0.08460960388183594 ms
# input_dim, output_dim = 11008, 4096
Baseline:       0.059748477935791015 ms
Quantized:      0.09495231628417969 ms

When I use a really large shape that doesn't exist in Llama2 7B, I do get some speedup:

# input_dim, output_dim = 11008, 11008
Baseline:       0.14746272087097168 ms
Quantized:      0.09298111915588379 ms

This is strange because gpt-fast uses a similar int4 weight-only quantization and gets a 2x speedup on Llama2 7B.
