When using Emu2, I only see multimodal input with text output. Is there any way to generate text together with multiple images?