🐛 Describe the bug
```
$ python torchchat.py export stories110M --dtype float16 --output-dso-path stories.so
Using device=cuda
Setting max_seq_length to 300 for DSO export.
Loading model...
Time to load model: 0.44 seconds
-----------------------------------------------------------
Exporting model using AOT Inductor to /content/torchchat-1/stories.so
W1017 20:10:20.554000 7389 torch/_export/__init__.py:225] +============================+
W1017 20:10:20.554000 7389 torch/_export/__init__.py:226] | !!! WARNING !!! |
W1017 20:10:20.555000 7389 torch/_export/__init__.py:227] +============================+
W1017 20:10:20.555000 7389 torch/_export/__init__.py:228] torch._export.aot_compile() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export()) instead.
The generated DSO model can be found at: /content/torchchat-1/stories.so
2024-10-17 20:12:01.733978: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-17 20:12:01.753928: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-17 20:12:01.759899: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-17 20:12:01.774061: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-17 20:12:02.820101: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Using device=cuda
Loading model...
Time to load model: 0.48 seconds
-----------------------------------------------------------
```
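Side note on the deprecation warning in the export log: it points at the newer AOTI packaging API. Below is a rough sketch of that flow on a toy module, just to show the shape of the suggested call; it is not the torchchat export path, and the module and inputs are placeholders I made up.

```python
# Rough sketch of the replacement flow the deprecation warning suggests,
# NOT torchchat's exporter. Toy module and inputs are placeholders.
import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

# Export the module, then AOT-compile and package it in one step;
# aoti_compile_and_package returns the path of the compiled artifact.
ep = torch.export.export(Toy(), (torch.randn(2, 8),))
pkg_path = torch._inductor.aoti_compile_and_package(ep)
print(pkg_path)
```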
```
$ python torchchat.py eval stories110M --dtype float16 --dso-path stories.so --limit 5
2024-10-17:20:12:09,610 INFO [huggingface.py:162] Using device 'cuda'
config.json: 100% 665/665 [00:00<00:00, 3.09MB/s]
model.safetensors: 100% 548M/548M [00:05<00:00, 101MB/s]
generation_config.json: 100% 124/124 [00:00<00:00, 733kB/s]
tokenizer_config.json: 100% 26.0/26.0 [00:00<00:00, 132kB/s]
vocab.json: 100% 1.04M/1.04M [00:00<00:00, 4.67MB/s]
merges.txt: 100% 456k/456k [00:00<00:00, 1.09MB/s]
tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 2.14MB/s]
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
2024-10-17:20:12:27,047 WARNING [task.py:763] [Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-10-17:20:12:27,047 WARNING [task.py:775] [Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-10-17:20:12:27,047 WARNING [task.py:763] [Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-10-17:20:12:27,047 WARNING [task.py:775] [Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-10-17:20:12:27,047 WARNING [task.py:763] [Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
2024-10-17:20:12:27,047 WARNING [task.py:775] [Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False
wikitext_document_level.py: 100% 10.7k/10.7k [00:00<00:00, 39.4MB/s]
README.md: 100% 7.78k/7.78k [00:00<00:00, 32.7MB/s]
Repo card metadata block was not found. Setting CardData to empty.
2024-10-17:20:12:29,949 WARNING [repocard.py:107] Repo card metadata block was not found. Setting CardData to empty.
Downloading data: 100% 4.72M/4.72M [00:00<00:00, 7.37MB/s]
Generating test split: 62 examples [00:00, 656.90 examples/s]
Generating train split: 629 examples [00:00, 1999.28 examples/s]
Generating validation split: 60 examples [00:00, 2830.26 examples/s]
2024-10-17:20:12:32,165 INFO [task.py:395] Building contexts for wikitext on rank 0...
100% 5/5 [00:00<00:00, 420.70it/s]
2024-10-17:20:12:32,178 INFO [evaluator.py:362] Running loglikelihood_rolling requests
0% 0/5 [00:01<?, ?it/s]
Time to run eval: 23.69s.
Traceback (most recent call last):
  File "/content/torchchat-1/torchchat.py", line 92, in <module>
    eval_main(args)
  File "/content/torchchat-1/torchchat/usages/eval.py", line 271, in main
    result = eval(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/content/torchchat-1/torchchat/usages/eval.py", line 217, in eval
    eval_results = evaluate(
  File "/usr/local/lib/python3.10/dist-packages/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lm_eval/evaluator.py", line 373, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/usr/local/lib/python3.10/dist-packages/lm_eval/models/huggingface.py", line 840, in loglikelihood_rolling
    string_nll = self._loglikelihood_tokens(
  File "/usr/local/lib/python3.10/dist-packages/lm_eval/models/huggingface.py", line 1074, in _loglikelihood_tokens
    logits = torch.gather(logits, 2, cont_toks.unsqueeze(-1)).squeeze(
RuntimeError: Size does not match at dimension 1 expected index [1, 1537, 1] to be smaller than self [1, 1, 32000] apart from dimension 2
```
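The shapes in the error look like the DSO path returned logits for a single position ([1, 1, 32000]) while lm_eval built per-token targets for the whole 1537-token window; that interpretation is my guess, but the gather constraint itself is easy to reproduce with the shapes copied from the message:

```python
# Minimal reproduction of the gather failure; shapes are taken from the
# error message above. torch.gather requires index.size(d) <= input.size(d)
# for every dimension d except `dim`.
import torch

logits = torch.randn(1, 1, 32000)               # logits for one position only
cont_toks = torch.randint(0, 32000, (1, 1537))  # targets for 1537 positions

# index is [1, 1537, 1] but logits has size 1 at dim 1, so this raises the
# same "Size does not match at dimension 1 ..." RuntimeError.
torch.gather(logits, 2, cont_toks.unsqueeze(-1)).squeeze(-1)
```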
Versions
```
Collecting environment information...
PyTorch version: 2.6.0.dev20241002+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.4
Libc version: glibc-2.35
Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.85+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU @ 2.00GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
Stepping: 3
BogoMIPS: 4000.28
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32 KiB (1 instance)
L1i cache: 32 KiB (1 instance)
L2 cache: 1 MiB (1 instance)
L3 cache: 38.5 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0,1
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable; SMT Host state unknown
Vulnerability Meltdown: Vulnerable
Vulnerability Mmio stale data: Vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Vulnerable (Syscall hardening enabled)
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Vulnerable
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] optree==0.13.0
[pip3] pytorch-triton==3.1.0+cf34004b8a
[pip3] torch==2.6.0.dev20241002+cu121
[pip3] torchao==0.5.0
[pip3] torchaudio==2.4.1+cu121
[pip3] torchsummary==1.5.1
[pip3] torchtune==0.4.0.dev20241010+cu121
[pip3] torchvision==0.20.0.dev20241002+cu121
[conda] Could not collect
```