Context
Add an optional VLM evaluation path in the evaluate module. Use open-source RoboReward 4B models or StepEval's per-subgoal pattern to auto-score checkpoint screenshots.
Priority: Do later. This depends on the Constraint Evaluator shipping first; the VLM judge is an enhancement layer on top of it.
Open questions
- Which VLM? RoboReward 4B is purpose-trained but may not generalize; general-purpose VLMs (Claude, GPT-4o) are more capable but more expensive
- We likely need two paths: a small local model for fast CI checks and a cloud model for deep analysis
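One way the two-path idea could look, as a rough sketch only. Every name here (Judge, SubgoalScore, select_judge, score_checkpoint, the "ci" and "deep" modes) is hypothetical and not part of the evaluate module, and the two judges are stubs standing in for a local model and a cloud model. Scores are assumed to be floats in [0, 1].

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a judge takes a screenshot path and a subgoal
# description and returns a score, assumed here to lie in [0, 1].
Judge = Callable[[str, str], float]


@dataclass(frozen=True)
class SubgoalScore:
    subgoal: str
    score: float


def select_judge(mode: str, judges: Dict[str, Judge]) -> Judge:
    """Pick the judge registered for a mode, e.g. 'ci' or 'deep'."""
    if mode not in judges:
        raise ValueError(f"unknown judge mode: {mode!r}")
    return judges[mode]


def score_checkpoint(
    screenshot: str, subgoals: Sequence[str], judge: Judge
) -> List[SubgoalScore]:
    """Score one checkpoint screenshot against each subgoal in turn."""
    results = []
    for subgoal in subgoals:
        raw = judge(screenshot, subgoal)
        # Clamp so a misbehaving judge cannot push scores outside [0, 1].
        results.append(SubgoalScore(subgoal, min(1.0, max(0.0, raw))))
    return results


# Stub judges; real ones would wrap a local small model and a cloud VLM.
def local_stub(screenshot: str, subgoal: str) -> float:
    return 0.5


def cloud_stub(screenshot: str, subgoal: str) -> float:
    return 0.9


JUDGES: Dict[str, Judge] = {"ci": local_stub, "deep": cloud_stub}
```

Keeping both paths behind one callable signature would let CI and deep-analysis runs share the same scoring loop, with only the mode changing.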
References
- Prerequisite: Constraint Evaluator (end-to-end)
- Roadmap: docs/roadmap-2026-q2.md §D