Description
I am using PixelRAG for a research project that ingests highly dynamic, JavaScript-rendered websites and analyzes them through multimodal chat interfaces (e.g., the ChatGPT or Claude web UI).
While pixelshot does an incredible job of executing client-side JS and preserving complex layouts (such as tables and data graphs) via its 1568px tiled slicing strategy, passing these visual tiles directly into an LLM chat interface incurs significant vision token overhead.
The Problem / Question
For text-heavy segments of dynamic sites, having a Vision-Language Model (VLM) read text back out of pixels typically costs more tokens than sending the same content as plain text, and it can quickly exhaust the available context window.
- Are there any existing best practices or hidden flags within the pipeline to mitigate token overhead when using a manual chat-interface workflow?
- Has there been consideration for a hybrid approach (e.g., extracting a lightweight parallel markdown/text layout chunk alongside the screenshot tile) to give users the option of text vs. pixel delivery depending on whether the asset is a chart or a text paragraph?
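To put rough numbers on the overhead, here is a back-of-the-envelope sketch. It assumes the approximation Anthropic documents for Claude (image tokens ≈ width × height / 750) and a common rule of thumb of about 4 characters per text token. Other models count differently, providers may downscale large images before counting, and the tile size and character count below are made-up illustrative values, not measurements from pixelshot.

```python
def estimate_image_tokens(width_px: int, height_px: int) -> int:
    """Approximate vision tokens for one image (assumed formula: w * h / 750)."""
    return round(width_px * height_px / 750)


def estimate_text_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough text token estimate (assumed ~4 characters per token)."""
    return round(len(text) / chars_per_token)


# Hypothetical tile: 1000 x 750 px holding about 2,000 characters of prose.
tile_tokens = estimate_image_tokens(1000, 750)
text_tokens = estimate_text_tokens("x" * 2000)
print(tile_tokens, text_tokens)  # prints: 1000 500
```

Under these assumptions the same paragraph costs about twice as much as pixels, and the gap compounds across the many tiles of a long page.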
Proposed Enhancement (If applicable)
It would be highly valuable to have an option in the CLI or programmatic rendering API (e.g., pixelshot --output-hybrid) that outputs both the .jpg tile for visual assets (charts/infographics) and a stripped-down Markdown representation for blocks that are purely text. This would let researchers choose, block by block, whether to drop text or pixels into their chat prompts, potentially saving thousands of vision tokens.
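To make the request concrete, here is a minimal sketch of the per-block routing I have in mind. Everything in it is hypothetical: the `Block` structure, the `kind` labels, and `choose_delivery` are illustrative names, not part of PixelRAG or pixelshot, and it assumes the renderer could tag each region and attach extracted Markdown where available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    """One region of a rendered page (hypothetical structure)."""
    kind: str                 # e.g. "text", "table", "chart", "image"
    markdown: Optional[str]   # extracted text for the block, if any
    tile_path: str            # screenshot tile that covers the block


# Block kinds that should always be delivered as pixels (assumed labels).
VISUAL_KINDS = {"chart", "image", "infographic"}


def choose_delivery(block: Block) -> dict:
    """Send visual blocks as pixels; send text blocks as Markdown when available."""
    if block.kind in VISUAL_KINDS or not block.markdown:
        return {"mode": "pixels", "payload": block.tile_path}
    return {"mode": "text", "payload": block.markdown}


blocks = [
    Block("text", "Quarterly revenue grew in all regions.", "tile_000.jpg"),
    Block("chart", None, "tile_001.jpg"),
]
manifest = [choose_delivery(b) for b in blocks]
```

Falling back to pixels whenever no Markdown was extracted keeps the current behavior as the safe default, so a hybrid mode would only ever reduce what gets sent as images.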