
Added support for Multimodal eval #1499

Open · wants to merge 7 commits into main

Conversation

@anirudhs001 commented Feb 23, 2025

PR for #1334

Used VLMEvalWrapper and Llama3VisionTransform from torchtune to support evaluation of multimodal models (Llama 3.2 11B only for now).

Bumped lm_eval to lm_eval==0.4.7 to get HFMultimodalLM, the class that VLMEvalWrapper inherits from.
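
For context, here is a minimal sketch of how such a wrapper can plug into lm_eval 0.4.7; apart from `HFMultimodalLM` itself, the constructor shape and wiring below are illustrative assumptions, not the PR's code:

```python
# A minimal sketch, not the PR's exact code: lm-eval 0.4.7 ships
# lm_eval.models.hf_vlms.HFMultimodalLM; everything else below (constructor
# arguments, attribute names, torchchat wiring) is an illustrative assumption.
from lm_eval import simple_evaluate
from lm_eval.models.hf_vlms import HFMultimodalLM


class VLMEvalWrapper(HFMultimodalLM):
    """Adapts a torchchat multimodal model plus torchtune's
    Llama3VisionTransform so lm_eval's multimodal tasks can drive it,
    mirroring torchtune's _VLMEvalWrapper."""

    def __init__(self, model, transform, device="cpu", max_seq_length=2048):
        # Deliberately not calling HFMultimodalLM.__init__, which expects a
        # HuggingFace checkpoint; torchchat supplies its own model object.
        self._model = model            # e.g. Llama-3.2-11B-Vision-Instruct
        self._transform = transform    # tokenizes text, preprocesses images
        self._device = device
        self._max_seq_length = max_seq_length
        # ...remaining HFMultimodalLM hooks (tok_encode, generate_until,
        # rank/world_size bookkeeping) would be overridden on top of this.


# Usage sketch: hand the wrapper straight to lm_eval's evaluator.
# wrapper = VLMEvalWrapper(model, transform)
# results = simple_evaluate(model=wrapper, tasks=["mmmu_val_art"], limit=1)
```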

A sample run for mmmu_val_art:

(venv) anirudhsingh@Anirudhs-MacBook-Pro-4 torchchat % python torchchat.py eval Llama-3.2-mm --device cpu --dtype bf16 --task mmmu_val_art --modality text-image --max-seq-length 2048 
NumExpr defaulting to 12 threads.
PyTorch version 2.7.0.dev20250124 available.
Looking for libcustom_ops_aot_lib.so in /Users/anirudhsingh/MISC/playground/torchchat/venv/lib/python3.10/site-packages/executorch
Loading custom ops library: /Users/anirudhsingh/MISC/playground/torchchat/venv/lib/python3.10/site-packages/executorch/extension/llm/custom_ops/libcustom_ops_aot_lib.dylib
Unable to import torchao experimental quant_api with error:  [Errno 2] No such file or directory: '/Users/anirudhsingh/MISC/playground/torchchat/torchao-build/src/ao/torchao/experimental/quant_api.py'
Modality of model=text-image
Using device=cpu
Loading model...
Time to load model: 0.25 seconds
-----------------------------------------------------------
Building contexts for mmmu_val_art on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 20148.78it/s]
Running generate_until requests
Running generate_until requests with text+image input: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [7:49:19<00:00, 938.65s/it]
Time to run eval: 28171.31s.
Time in model.forward: 28154.47s, over 30 model evaluations
forward run time stats - Median: 360.38s Min: 355.40s Max: 8932.57s
For model /Users/anirudhsingh/.torchchat/model-cache/meta-llama/Llama-3.2-11B-Vision-Instruct/model.pth
mmmu_val_art:
 alias: Art
 acc,none: 0.2333
 acc_stderr,none: 0.0785

And with a limit of 1 sample:

(venv) anirudhsingh@Anirudhs-MacBook-Pro-4 torchchat % python torchchat.py eval Llama-3.2-mm --device cpu --dtype bf16 --task mmmu_val_art --limit 1 --modality text-image --max-seq-length 720
NumExpr defaulting to 12 threads.
PyTorch version 2.7.0.dev20250124 available.
Looking for libcustom_ops_aot_lib.so in /Users/anirudhsingh/MISC/playground/torchchat/venv/lib/python3.10/site-packages/executorch
Loading custom ops library: /Users/anirudhsingh/MISC/playground/torchchat/venv/lib/python3.10/site-packages/executorch/extension/llm/custom_ops/libcustom_ops_aot_lib.dylib
Unable to import torchao experimental quant_api with error:  [Errno 2] No such file or directory: '/Users/anirudhsingh/MISC/playground/torchchat/torchao-build/src/ao/torchao/experimental/quant_api.py'
Modality of model=text-image
Using device=cpu
Loading model...
Time to load model: 0.25 seconds
-----------------------------------------------------------
Building contexts for mmmu_val_art on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5159.05it/s]
Running generate_until requests
Running generate_until requests with text+image input: 100%|██████████████████████████████████████████████████████████████| 1/1 [08:38<00:00, 518.97s/it]
Time to run eval: 531.16s.
Time in model.forward: 518.80s, over 1 model evaluations
forward run time stats - Median: 518.80s Min: 518.80s Max: 518.80s
For model /Users/anirudhsingh/.torchchat/model-cache/meta-llama/Llama-3.2-11B-Vision-Instruct/model.pth
mmmu_val_art:
 alias: Art
 acc,none: 0.0000


pytorch-bot bot commented Feb 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1499

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 1 Unrelated Failure

As of commit ae66baf with merge base 2766a95:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Feb 23, 2025
@Jack-Khuu added the enhancement and Evaluation/Benchmarking labels Feb 23, 2025
@Jack-Khuu (Contributor) left a comment

Nice work!
Haven't sat down and given it a full test run, but I left some initial thoughts.

@@ -130,5 +130,5 @@ if [[ -x "$(command -v nvidia-smi)" ]]; then
 fi
 (
 set -x
-$PIP_EXECUTABLE install evaluate=="0.4.3" lm-eval=="0.4.2" psutil=="6.0.0"
+$PIP_EXECUTABLE install evaluate=="0.4.3" lm-eval=="0.4.7" psutil=="6.0.0"

Beyond the scope of this PR, but the duplicated requirements in here vs requirements.txt will be collapsed when we introduce packaging

type=str,
default="text",
choices=["text", "text-image"],
# help=argparse.SUPPRESS,

Suggested change: remove the commented-out line `# help=argparse.SUPPRESS,`


Since this arg is only used for evaluation, let's bump it into _add_evaluation_args() below
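
For illustration, the suggested move might look roughly like this; `_add_evaluation_args` is named in the review, while the argument group and help text are assumptions:

```python
import argparse


# Sketch of the suggested relocation: register --modality with the other
# eval-only flags. The group title below is an assumption.
def _add_evaluation_args(parser: argparse.ArgumentParser) -> None:
    eval_parser = parser.add_argument_group("Evaluation Arguments")
    eval_parser.add_argument(
        "--modality",
        type=str,
        default="text",
        choices=["text", "text-image"],
        help="Modality of the model: text or text-image",
    )
```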

@@ -168,6 +183,250 @@ def _model_generate(self, context, max_length, eos_token_id):
         raise Exception("unimplemented")


+class VLMEvalWrapper(HFMultimodalLM):

Let's add a comment/link pointing back to torchtune's implementation as well
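
For example, the pointer could live in the class docstring (the path below is torchtune's eleuther_eval recipe; the `_VLMEvalWrapper` name should be verified against torchtune's current main):

```python
from lm_eval.models.hf_vlms import HFMultimodalLM


class VLMEvalWrapper(HFMultimodalLM):
    """Eval wrapper for multimodal (text + image) models.

    Adapted from torchtune's _VLMEvalWrapper in the eleuther_eval recipe:
    https://github.com/pytorch/torchtune/blob/main/recipes/eleuther_eval.py
    """
```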

@@ -71,6 +71,7 @@ class BuilderArgs:
     dynamic_shapes: bool = False
     max_seq_length: Optional[int] = None
     attention_backend: str = "math"
+    modality: Optional[str] = "text"

Modality isn't super related to BuilderArgs, so let's leave it out. I commented in the ArgParser with details.

@@ -223,6 +482,57 @@ def eval(
     return eval_results


+def multi_model_eval(

Looks like this and eval() are fairly similar. Mind combining them?
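
A rough sketch of the merge, dispatching on modality inside a single eval(); the wrapper names (e.g. `GPTFastEvalWrapper`) and the signatures here are assumptions about the surrounding module rather than the PR's definitions:

```python
from typing import Optional

from lm_eval import simple_evaluate


# Sketch: one eval() covering both paths by picking the wrapper from
# modality. Wrapper construction details are assumed for illustration.
def eval(model, tokenizer, tasks, modality="text", limit=None,
         max_seq_length: Optional[int] = None):
    if modality == "text-image":
        wrapper = VLMEvalWrapper(model, tokenizer,
                                 max_seq_length=max_seq_length)
    else:
        wrapper = GPTFastEvalWrapper(model, tokenizer, max_seq_length)
    return simple_evaluate(model=wrapper, tasks=tasks, limit=limit)
```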
