Other multimodal models?

Hello, may I ask if there are any other multimodal models that can be loaded besides this model?
An example is: luodian/OTTER-MPT1B-RPJama-Init
I saw that the paper uses LLaMA as the LLM along with other vision encoders, but I don't quite understand how to decouple the luodian/OTTER-MPT1B-RPJama-Init model into its LLM and vision encoder. Is there any way to do this?