Description
I'm using vllm-mlx on macOS (MBP M5 Max, 128GB, Tahoe 26.3.1 (a)) to host mlx-community/Qwen3-VL-30B-A3B-Instruct-bf16. Overall it works great, but I'm running into one annoying issue.
If I make sequential calls via the OpenAI API to describe a series of images one at a time, and several of the images are similar (but not identical), I get back word-for-word identical text descriptions for multiple images in a row. The images differ enough that the repeated description is wrong in at least some respect for the second and subsequent similar images, yet it matches the description of the first similar image in the sequence exactly.
Is there some caching going on for similar-but-not-identical images, which causes a previously generated description to be re-used for later queries (or something along those lines)? If so, is there a way to opt out of it?
For more context, I'm sending requests to http://localhost:8000/v1/chat/completions, with a query that includes a text message and a single image message containing a base64-encoded JPEG (max 2048px in each dimension).
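To make the request shape concrete, here is a minimal sketch of the payload I'm sending for each image. The structure follows the standard OpenAI chat-completions multimodal format (one text part plus one base64 data-URL image part); the model name matches my setup, but the helper function and the fake JPEG bytes are just illustrative.

```python
import base64
import json

def build_payload(image_bytes: bytes, prompt: str) -> dict:
    """Build a /v1/chat/completions request body with one text part
    and one base64-encoded JPEG image part (helper name is illustrative)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "mlx-community/Qwen3-VL-30B-A3B-Instruct-bf16",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
    }

# Fake JPEG bytes stand in for a real <=2048px image.
payload = build_payload(b"\xff\xd8\xff\xe0fake-jpeg-bytes", "Describe this image.")
print(json.dumps(payload, indent=2)[:80])
```

Each image gets its own independent request like this (via POST to http://localhost:8000/v1/chat/completions), so I would not expect any state to carry over between calls.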