Recursive Vision Language Models
There is huge potential in making the RLM framework multi-modal, meaning giving it the ability to append images and other files to its prompts. Agents working with multiple 1000-page documents that contain figures, tables, maps and charts could become enormously powerful if the LLM they are connected with allows for visual processing (which most models nowadays do).
Currently, I am working on this fork to enable RLMs to send images in their sub-calls. It works reasonably well so far, but I think it is still too early to open a PR, for the following reasons:
- It is only supported in a Docker environment together with an OpenAI model
- There are unresolved issues, such as integrating VLM support with the recent commit "feat: add depth>1 recursive subcalls with limits and cost tracking" (#84), which currently allows shallow LLM calls in a local environment only
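To make the discussion concrete, here is a minimal sketch of how a sub-call prompt could carry an image using the OpenAI Chat Completions multimodal message format (a list of text and `image_url` content parts). The function name `build_multimodal_subcall_messages` is hypothetical and the fork's actual API may differ; this only illustrates the message shape an OpenAI model expects.

```python
import base64

def build_multimodal_subcall_messages(prompt: str,
                                      image_bytes: bytes,
                                      mime: str = "image/png") -> list[dict]:
    """Pair a text prompt with an inline base64-encoded image as a single
    OpenAI-style chat message, suitable for passing into a sub-call.
    (Hypothetical helper; the fork's real interface may differ.)"""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                # Text part: the sub-call's textual prompt.
                {"type": "text", "text": prompt},
                # Image part: the figure/table/chart extracted from the document.
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ]

# Example: attach a (dummy) image payload to a sub-call prompt.
messages = build_multimodal_subcall_messages(
    "Describe the chart on this page.", b"\x89PNG-dummy-bytes"
)
```

Since the image is sent as a base64 data URL, no file hosting is needed, which fits the Docker-only setup mentioned above; other providers would need their own content-part encoding.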
@alexzhang13 & Community: Let me know if you want to see this pull request happen, and what needs to be discussed/resolved/implemented beforehand.
