Added support for Multimodal eval #1499
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1499
Note: Links to docs will display an error until the docs builds have been completed.
❌ 7 New Failures, 1 Unrelated Failure as of commit ae66baf with merge base 2766a95.
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Nice work!
Haven't sat down and given it a full test run, but left some initial thoughts.
@@ -130,5 +130,5 @@ if [[ -x "$(command -v nvidia-smi)" ]]; then
 fi
 (
   set -x
-  $PIP_EXECUTABLE install evaluate=="0.4.3" lm-eval=="0.4.2" psutil=="6.0.0"
+  $PIP_EXECUTABLE install evaluate=="0.4.3" lm-eval=="0.4.7" psutil=="6.0.0"
Beyond the scope of this PR, but the duplicated requirements in here vs requirements.txt will be collapsed when we introduce packaging
    type=str,
    default="text",
    choices=["text", "text-image"],
    # help=argparse.SUPPRESS,
    # help=argparse.SUPPRESS,
Since this arg is only used for evaluation, let's bump it into `_add_evaluation_args()` below.
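A minimal sketch of what the suggested move could look like. `_add_evaluation_args()` is named in the review comment above, but its real body and the rest of torchchat's CLI wiring are not shown here, so everything besides the `--modality` flag itself is an assumption:

```python
import argparse


def _add_evaluation_args(parser: argparse.ArgumentParser) -> None:
    # Hypothetical sketch: the real torchchat helper registers other
    # eval-only flags here as well.
    parser.add_argument(
        "--modality",
        type=str,
        default="text",
        choices=["text", "text-image"],
        help="Modality of the model under evaluation",
    )


parser = argparse.ArgumentParser()
_add_evaluation_args(parser)
args = parser.parse_args(["--modality", "text-image"])
print(args.modality)  # -> text-image
```

Keeping the flag inside the eval-specific helper means commands that never evaluate don't advertise (or validate) a modality option.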
@@ -168,6 +183,250 @@ def _model_generate(self, context, max_length, eos_token_id):
     raise Exception("unimplemented")


class VLMEvalWrapper(HFMultimodalLM):
Let's add a comment/link pointing back to torchtune's implementation as well
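One way the requested attribution could read. The torchtune location given below (its eleuther_eval recipe) is my recollection, not something verified against this PR, and the base class is stubbed out so the snippet stands alone:

```python
class HFMultimodalLM:
    """Stub standing in for lm_eval's HFMultimodalLM base class."""


class VLMEvalWrapper(HFMultimodalLM):
    """Eval wrapper for vision-language models.

    Adapted from torchtune's eleuther_eval recipe
    (recipes/eleuther_eval.py in pytorch/torchtune) -- path is an
    assumption; confirm before merging.
    """


print("torchtune" in VLMEvalWrapper.__doc__)  # -> True
```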
@@ -71,6 +71,7 @@ class BuilderArgs:
 dynamic_shapes: bool = False
 max_seq_length: Optional[int] = None
 attention_backend: str = "math"
+modality: Optional[str] = "text"
`modality` isn't super related to the BuilderArgs, so let's leave it out. I commented in the Argparser with details.
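To illustrate the separation being asked for: the field names below are copied from the diff above, but `eval_main` and its signature are hypothetical, purely to show `modality` traveling alongside `BuilderArgs` rather than inside it:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BuilderArgs:
    # Model-construction options only (names taken from the diff above);
    # no `modality` field, since it is consumed solely by evaluation code.
    dynamic_shapes: bool = False
    max_seq_length: Optional[int] = None
    attention_backend: str = "math"


def eval_main(builder_args: BuilderArgs, modality: str = "text") -> str:
    # Hypothetical entrypoint: eval-only options are plain parameters.
    return modality


print(eval_main(BuilderArgs(), modality="text-image"))  # -> text-image
```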
@@ -223,6 +482,57 @@ def eval(
     return eval_results


def multi_model_eval(
Looks like this and `eval()` are fairly similar. Mind combining them?
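A rough sketch of the merge being suggested: one entrypoint that dispatches on modality to the right wrapper, with the shared harness logic living in one place. The wrapper classes here are empty stand-ins for the PR's real ones, whose constructors and signatures differ:

```python
class GPTFastEvalWrapper:
    """Stand-in for the text eval wrapper in this PR."""
    kind = "text"


class VLMEvalWrapper:
    """Stand-in for the multimodal eval wrapper in this PR."""
    kind = "text-image"


def eval(model, modality: str = "text", **kwargs):
    # Pick the wrapper from the modality instead of having two near-identical
    # top-level functions; everything after this line is shared.
    wrapper_cls = {"text": GPTFastEvalWrapper, "text-image": VLMEvalWrapper}[modality]
    wrapper = wrapper_cls()
    # ... shared lm_eval task setup and evaluation would happen here ...
    return wrapper.kind


print(eval(object(), modality="text-image"))  # -> text-image
```

The dict lookup also gives a clear `KeyError` for an unsupported modality, mirroring the argparse `choices` constraint.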
PR for #1334
Used `VLMEvalWrapper` and `Llama3VisionTransform` from torchtune to support evaluation for multimodal models (Llama 3.2 11B only for now).
Bumped lm_eval to `lm_eval==0.4.7` to use `HFMultimodalLM`, the class that `VLMEvalWrapper` inherits from.
A sample run for mmmu_val_art:
And with a limit of 1 sample: