Summary
The VideoMediaIO.load_base64() method at vllm/multimodal/media/video.py:51-62 splits video/jpeg data URLs on commas to extract individual JPEG frames, but does not enforce a frame count limit. The num_frames parameter (default: 32), which the load_bytes() code path enforces at lines 47-48, is bypassed entirely in the video/jpeg base64 path. An attacker can send a single API request containing thousands of comma-separated base64-encoded JPEG frames, causing the server to decode every frame into memory and crash with an out-of-memory (OOM) error.
Details
Vulnerable code
# video.py:51-62
def load_base64(self, media_type: str, data: str) -> tuple[npt.NDArray, dict[str, Any]]:
    if media_type.lower() == "video/jpeg":
        load_frame = partial(self.image_io.load_base64, "image/jpeg")
        return np.stack(
            [np.asarray(load_frame(frame_data)) for frame_data in data.split(",")]
            # data.split(",") above is unbounded: no frame count limit
        ), {}
    return self.load_bytes(base64.b64decode(data))
The load_bytes() path (lines 47-48) delegates to a video loader that respects self.num_frames (default 32). The load_base64("video/jpeg", ...) path bypasses this limit entirely: data.split(",") produces an unbounded list, and every frame in it is decoded into a numpy array.
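One way to close this gap is to count the comma-separated segments before decoding anything and reject payloads that exceed the limit. The following is a minimal sketch of that idea, not the actual vLLM patch: `split_frames_bounded` and `max_frames` are hypothetical names, and a real fix might instead truncate or sample frames.

```python
def split_frames_bounded(data: str, max_frames: int = 32) -> list[str]:
    """Split comma-separated base64 frames, refusing oversized inputs.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    # Count separators first, so an oversized payload is rejected
    # before any per-frame list or image decode is attempted.
    frame_count = data.count(",") + 1
    if frame_count > max_frames:
        raise ValueError(
            f"video/jpeg payload has {frame_count} frames; limit is {max_frames}"
        )
    return data.split(",")
```

Rejecting outright, rather than silently truncating, keeps the behavior explicit for API clients; which policy fits best is a design choice for the maintainers.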
video/jpeg is part of vLLM's public API
video/jpeg is a vLLM-specific MIME type, not IANA-registered. However, it is part of the public API surface:
- encode_video_url() at vllm/multimodal/utils.py:96-108 generates data:video/jpeg;base64,... URLs
- Official test suites at tests/entrypoints/openai/test_video.py:62 and tests/entrypoints/test_chat_utils.py:153 both use this format
Memory amplification
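Each JPEG frame decodes to a full numpy array. For 640x480 RGB images, each decoded frame occupies 640 x 480 x 3 = 921,600 bytes (~921 KB), so 5000 frames come to ~4.6 GB. np.stack() then copies every frame into a new combined array, so the per-frame arrays and the stacked result coexist in memory. The compressed JPEG payload is comparatively small (the report cites ~100 KB for 5000 frames) but decompresses to gigabytes.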
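The arithmetic behind these figures can be checked with a short calculation. This sketch assumes 8-bit RGB frames (one byte per channel) and that the list of per-frame arrays and the stacked copy are both alive at peak; `decoded_size_bytes` is a hypothetical helper name.

```python
def decoded_size_bytes(width: int, height: int, channels: int = 3, frames: int = 1) -> int:
    """Bytes needed for `frames` uint8 arrays of shape (height, width, channels)."""
    return width * height * channels * frames


per_frame = decoded_size_bytes(640, 480)
total = decoded_size_bytes(640, 480, frames=5000)

print(f"per frame: {per_frame / 1e3:.1f} KB")                # per frame: 921.6 KB
print(f"5000 frames: {total / 1e9:.2f} GB")                  # 5000 frames: 4.61 GB
# Assumption: individual arrays plus the np.stack() copy coexist.
print(f"peak with stacked copy: {2 * total / 1e9:.2f} GB")   # peak with stacked copy: 9.22 GB
```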
Data flow
POST /v1/chat/completions
  → chat_utils.py:1434 video_url type → mm_parser.parse_video()
  → chat_utils.py:872 parse_video() → self._connector.fetch_video()
  → connector.py:295 fetch_video() → load_from_url(url, self.video_io)
  → connector.py:91 _load_data_url(): url_spec.path.split(",", 1)
      → media_type = "video/jpeg"
      → data = "<frame1>,<frame2>,...,<frame10000>"
  → connector.py:100 media_io.load_base64("video/jpeg", data)
  → video.py:54 data.split(",") ← UNBOUNDED
  → video.py:55-57 all frames decoded into numpy arrays
  → video.py:56 np.stack([...]) ← massive combined array → OOM
connector.py:91 uses split(",", 1), which splits on the first comma only. All remaining commas stay in data and are split later by video.py:54.
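The two-stage split can be reproduced with plain strings. In this sketch the value of `url_path` is an assumed example of the data URL's path portion, and the short placeholder strings stand in for real base64-encoded JPEG frames.

```python
# Hypothetical path portion of a data URL; "AAAA" etc. are placeholders
# for base64-encoded JPEG frames.
url_path = "video/jpeg;base64,AAAA,BBBB,CCCC"

# First split: only the first comma separates the header from the data.
header, data = url_path.split(",", 1)
media_type = header.split(";", 1)[0]

# Second split: every remaining comma becomes a frame boundary.
frames = data.split(",")

print(media_type)   # video/jpeg
print(data)         # AAAA,BBBB,CCCC
print(len(frames))  # 3
```

Because the first split passes maxsplit=1, the number of frames is controlled entirely by how many commas the client places after the header.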
Comparison with existing protections
| Code Path | Frame Limit | File |
|---|---|---|
| load_bytes() (binary video) | Yes: num_frames (default 32) | video.py:46-49 |
| load_base64("video/jpeg", ...) | No: unlimited data.split(",") | video.py:51-62 |