Some multimodal models incorporate nontrivial preprocessing pipelines. Even in relatively simple cases like image-to-text models, the pipeline loads, decodes, converts representations or color spaces, and scales images before tokenizing, and many of these steps are unaccelerated. Even an imprecise, static napkin-math model of performance here (e.g., "Pillow can downscale images at a rate of 30-60 megapixels/second/core on an Intel Xeon") can usefully guide resource limits for deployments.
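As a rough illustration, a micro-benchmark along the following lines is one way to derive such a number for your own hardware. This is a minimal sketch, not a rigorous benchmark: the source dimensions (a typical phone-camera photo), the 224×224 target (a common vision-model input size), and the Lanczos filter are all illustrative assumptions, and it times only the resize step, not decode or color-space conversion.

```python
import time

from PIL import Image


def downscale_throughput(width=4032, height=3024, target=(224, 224), iters=20):
    """Estimate single-core Pillow downscale throughput in megapixels/second.

    Uses a synthetic in-memory RGB image, so decode and color-conversion
    costs are deliberately excluded; only the resize step is measured.
    """
    img = Image.new("RGB", (width, height))  # synthetic image; assumed dimensions
    megapixels = width * height / 1e6

    start = time.perf_counter()
    for _ in range(iters):
        img.resize(target, Image.Resampling.LANCZOS)
    elapsed = time.perf_counter() - start

    return iters * megapixels / elapsed


if __name__ == "__main__":
    print(f"~{downscale_throughput():.1f} MP/s/core (resize only)")
```

A figure measured this way is an upper bound on pipeline throughput: dividing it into your expected inbound megapixels/second gives a floor on the cores needed for preprocessing alone, before decode and conversion costs are added back in.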