I loaded the AWQ-quantized Qwen model from https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct-AWQ and started it as follows:
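The original launch command didn't survive, but it was along these lines (a sketch; the port is illustrative, and the flags are assumed from SGLang's `launch_server` CLI, which also supports forcing the method via `--quantization awq`):

```bash
# Launch SGLang with the AWQ checkpoint. For AWQ models the quantization
# method is normally picked up from the checkpoint's config, but it can
# also be set explicitly with --quantization.
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-7B-Instruct-AWQ \
  --quantization awq \
  --port 30000
```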
But the model's memory usage reported by `nvidia-smi` is still 136 GB on a B200, the same as for the unquantized model, even though the quantized weights are definitely much smaller. Performance with `bench_one_batch` doesn't seem to improve either, which suggests the quantized model was somehow not loaded. Why is this the case? Is there another flag I need to set?