Support other VLMs for object detection

### Description

Right now we use a frontier Gemini model for both object detection and goal translation. The same tasks might be achievable with other VLMs such as Gemma 4, Molmo, etc.
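One possible shape for this, as a rough sketch only: put detection behind a small backend interface so the model can be chosen by name. Everything here (`Detection`, `Detector`, `register_backend`, `make_detector`, `StubDetector`) is a hypothetical name for illustration and is not taken from the current codebase. The stub stands in for a real Gemini or local-model backend and does no actual inference.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Detection:
    label: str
    # Assumed convention: (x_min, y_min, x_max, y_max), normalized to [0, 1].
    box: tuple[float, float, float, float]


class Detector(Protocol):
    """Hypothetical interface every VLM backend would implement."""

    def detect(self, image: bytes, query: str) -> list[Detection]: ...


# Maps a backend name to a zero-argument factory that builds it.
_BACKENDS: dict[str, Callable[[], Detector]] = {}


def register_backend(name: str, factory: Callable[[], Detector]) -> None:
    _BACKENDS[name] = factory


def make_detector(name: str) -> Detector:
    try:
        return _BACKENDS[name]()
    except KeyError:
        known = sorted(_BACKENDS)
        raise ValueError(f"unknown detector backend {name!r}; known: {known}") from None


class StubDetector:
    """Placeholder backend: returns one full-frame box labeled with the query."""

    def detect(self, image: bytes, query: str) -> list[Detection]:
        return [Detection(label=query, box=(0.0, 0.0, 1.0, 1.0))]


register_backend("stub", StubDetector)
```

With something like this, a Gemini backend and a local backend would each register under their own name, and the rest of the pipeline would only call `make_detector(name).detect(image, query)`.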

### Motivation

No need for a Gemini API key, and it would be easier to experiment with local models.

### Alternatives Considered

_No response_