Recursive Vision Language Models
There is huge potential in making the RLM framework multi-modal, meaning giving it the ability to append images and other files to its prompts. Agents working with multiple 1000-page documents that contain figures, tables, maps and charts could become enormously powerful if the LLM they are connected with allows for visual processing (which most models nowadays do).
Currently, I am working on this fork to enable RLMs to send images in their sub-calls. It works reasonably well so far, but I think it is still too early to open a PR, for the following reasons:
- It is only supported in a Docker environment together with an OpenAI model
- There are unresolved issues, such as integrating VLM support with the recent commit "feat: add depth>1 recursive subcalls with limits and cost tracking" (#84), which currently allows shallow LLM calls in a local environment only
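To make the discussion concrete, here is a minimal sketch of how a sub-call prompt could carry an image using the OpenAI Chat Completions multimodal message format (a list of text and `image_url` content parts). The function name `build_multimodal_subcall_messages` is hypothetical and the fork's actual API may differ; this only illustrates the message shape an OpenAI model expects.

```python
import base64

def build_multimodal_subcall_messages(prompt: str,
                                      image_bytes: bytes,
                                      mime: str = "image/png") -> list[dict]:
    """Pair a text prompt with an inline base64-encoded image as a single
    OpenAI-style chat message, suitable for passing into a sub-call.
    (Hypothetical helper; the fork's real interface may differ.)"""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                # Text part: the sub-call's textual prompt.
                {"type": "text", "text": prompt},
                # Image part: the figure/table/chart extracted from the document.
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ]

# Example: attach a (dummy) image payload to a sub-call prompt.
messages = build_multimodal_subcall_messages(
    "Describe the chart on this page.", b"\x89PNG-dummy-bytes"
)
```

Since the image is sent as a base64 data URL, no file hosting is needed, which fits the Docker-only setup mentioned above; other providers would need their own content-part encoding.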
@alexzhang13 & Community: Let me know if you want to see this pull request happen, and what needs to be discussed/resolved/implemented beforehand.
