feat: VLM judge integration in evaluate module #140

@MiaoDX

Context

Add an optional vision-language model (VLM) evaluation path to the evaluate module. It would auto-score checkpoint screenshots using either the open-source RoboReward 4B models or StepEval's per-subgoal pattern.
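
To make the per-subgoal idea concrete, here is a rough sketch of how the scoring loop might look. Everything in it is an assumption for discussion: the `JudgeFn` signature, the one-screenshot-per-subgoal pairing, the `[0, 1]` score range, and the `0.5` pass threshold are hypothetical and not taken from RoboReward or StepEval.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical judge signature: (screenshot_path, subgoal_text) -> score.
# A real backend (local model or cloud API) would be wrapped to fit this shape.
JudgeFn = Callable[[str, str], float]


@dataclass(frozen=True)
class SubgoalResult:
    subgoal: str
    score: float   # clamped to [0, 1]
    passed: bool   # score >= threshold


def score_checkpoints(
    screenshots: Sequence[str],
    subgoals: Sequence[str],
    judge: JudgeFn,
    threshold: float = 0.5,  # assumed default, to be tuned
) -> List[SubgoalResult]:
    """Score each checkpoint screenshot against its paired subgoal."""
    if len(screenshots) != len(subgoals):
        raise ValueError("expected exactly one screenshot per subgoal")
    results = []
    for shot, goal in zip(screenshots, subgoals):
        # Clamp so a misbehaving judge cannot push scores out of range.
        score = min(1.0, max(0.0, judge(shot, goal)))
        results.append(SubgoalResult(goal, score, score >= threshold))
    return results


def success_rate(results: Sequence[SubgoalResult]) -> float:
    """Fraction of subgoals judged as passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)
```

Keeping the judge behind a plain callable means the evaluate module would not need to know which model produced the score.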

Priority: Do later. This depends on the Constraint Evaluator shipping first; the VLM judge is an enhancement layer on top of it.

Open questions

  • Which VLM? RoboReward 4B is purpose-trained but may not generalize; general-purpose VLMs (Claude/GPT-4o) are more capable but expensive
  • We likely need two paths: a small local model for fast CI checks and a cloud model for deep analysis
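
If we do go with two paths, the selection could be as small as the sketch below. The mode names `"ci"` and `"deep"`, the `select_judge` helper, and the `JudgeFn` signature are all hypothetical placeholders, not an existing API in the evaluate module.

```python
from typing import Callable, Optional

# Hypothetical judge signature: (screenshot_path, subgoal_text) -> score.
JudgeFn = Callable[[str, str], float]


def select_judge(
    mode: str,
    local_judge: JudgeFn,
    cloud_judge: Optional[JudgeFn] = None,
) -> JudgeFn:
    """Pick the judge backend for a run.

    "ci"   -> small local model, intended for fast checks
    "deep" -> cloud model, intended for deeper analysis (may be unconfigured)
    """
    if mode == "ci":
        return local_judge
    if mode == "deep":
        if cloud_judge is None:
            # Fail loudly rather than silently falling back to the local model.
            raise RuntimeError("deep mode requested but no cloud judge configured")
        return cloud_judge
    raise ValueError(f"unknown judge mode: {mode!r}")
```

Failing loudly when the cloud judge is missing is a deliberate choice here, so a deep-analysis run never quietly reports scores from the cheaper model.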

References

  • Prerequisite: Constraint Evaluator (end-to-end)
  • Roadmap: docs/roadmap-2026-q2.md §D
