This repository extends Recursive Language Models (RLMs) with support for vision-language models (VLMs), allowing images and PDFs to be passed in alongside the query. The examples and tests for this implementation use US Congressional documents.
An RLM with VLM support can reason over scanned documents, tables, and figures whose content is not extractable as indexed text.
Disclaimer: for now, this extension supports only the OpenAI client and only works in a Docker environment.
PDFs and images (JPG/JPEG/PNG) can now be placed in the new `prompt/` directory, which holds all the files that accompany the prompt:
```
prompt/
├── pdfs/
│   ├── doc1.pdf
│   └── doc2.pdf
└── images/
    ├── img1.png
    ├── img2.jpeg
    └── img3.jpg
```
These files get loaded during the start-up of the Docker container (see Dockerfile.vlm).
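As a rough sketch of that loading step (the helper name and return shape are my own, not taken from the repository), assuming the directory layout above:

```python
from pathlib import Path

def collect_prompt_files(prompt_dir: str = "prompt") -> dict:
    """Gather PDFs and supported images from the prompt/ directory.

    Returns a dict with sorted lists of file paths under the
    'pdfs' and 'images' keys. Unsupported extensions are skipped.
    """
    root = Path(prompt_dir)
    pdfs = sorted(str(p) for p in (root / "pdfs").glob("*.pdf"))
    images = sorted(
        str(p)
        for p in (root / "images").iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
    )
    return {"pdfs": pdfs, "images": images}
```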
We changed the signatures of the `llm_query` and `llm_query_batched` methods to accept paths and lists of paths to images:

```python
llm_query(prompt: str, image_paths: List[str] = None, **kwargs)
llm_query_batched(prompts: List[str], image_path_lists: List[List[str]] = None, **kwargs)
```

These image paths are encoded as data URIs inside the container and then passed to the OpenAI client on the host machine. The original idea was to lazy-load the images on the host machine, but this is not possible because the host does not have access to the container's filesystem.
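The encoding step can be sketched roughly as follows (function names are illustrative and I'm assuming the Chat Completions `image_url` message format; the repository's actual implementation may differ):

```python
import base64
import mimetypes

def encode_image_to_data_uri(image_path: str) -> str:
    """Read a local image file and encode it as a base64 data URI."""
    mime, _ = mimetypes.guess_type(image_path)  # e.g. "image/png"
    with open(image_path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"

def build_image_message(prompt: str, image_paths: list) -> dict:
    """Assemble one user message mixing prompt text and encoded images."""
    content = [{"type": "text", "text": prompt}]
    for path in image_paths:
        content.append(
            {"type": "image_url",
             "image_url": {"url": encode_image_to_data_uri(path)}}
        )
    return {"role": "user", "content": content}
```

For the batched variant, one such message would be built per prompt/path-list pair before the requests are dispatched to the client.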
- Clone this repository

  ```bash
  git clone https://github.com/famitzsy8/rvlm.git
  ```

- Build the Docker image

  ```bash
  docker build -t rlm-vlm-tryout -f Dockerfile.vlm .
  ```

- Set your OpenAI API key in your terminal

  ```bash
  export OPENAI_API_KEY=your_openai_api_key
  ```

- Run the script (by default with gpt-5.2)

  ```bash
  uv run python -m examples.vlm_example
  ```

We have 4 pytest fixtures inside tests/vlm/fixtures/:
- `test_one_image`: tests a prompt with a single image
- `test_n_images`: tests a prompt with multiple images
- `test_pdf2image`: tests a prompt with a PDF (and whether the model can convert the PDF to images)
- `test_supported_formats`: tests a prompt with 3 images in the different supported formats (JPG, JPEG, PNG)
- Clone this repository

  ```bash
  git clone https://github.com/famitzsy8/rvlm.git
  ```

- Set the OpenAI API key in your terminal

  ```bash
  export OPENAI_API_KEY=your_openai_api_key
  ```

- Run the tests

  ```bash
  make test-vlm
  ```

If you want to see the output of the tests (which can be quite long), wrap the command like this:

```bash
make test-vlm 2>&1 | tee test-vlm.txt
```

This saves all the output while still showing it live in your terminal.
The prime motivation for this extension was my own need to process large and complex documents from the United States Congress without polluting context. Most of the nitty-gritty details surface in hearings and are recorded in committee reports, which can be long and contain multiple figures, tables, and explanatory images. With VLM support, the RLM can now detect and inspect those figures directly.
- Currently, ONLY OpenAI models that support images as input can be used. You can find the input descriptions for the OpenAI models here. Selecting a non-VLM model, or a model not from OpenAI, will raise an error.


