Add llm_pick_keyframe option for LLM-guided keyframe selection by CamSoper · Pull Request #634 · valentinfrlch/ha-llmvision

CamSoper · 2026-04-14T16:50:31Z

First, thanks for building llmvision. It's become a key part of my Home Assistant setup and I really appreciate the work you've put into it.

Why

When analyzing a stream, the current SSIM motion heuristic picks the frame with the most visual change. In practice that often lands on an empty scene right after the subject of interest has left the frame, so the snapshot that ends up exposed to Home Assistant isn't the one a human would have picked. Since the LLM is already looking at every candidate frame, it seemed worth letting it tell us which one best represents the event.

How

This PR adds a new llm_pick_keyframe option for video_analyzer and stream_analyzer. When enabled alongside expose_images, the LLM is asked to identify the best frame and that frame is the one exposed. If the model's response is missing or invalid, it falls back to the existing SSIM behavior, so nothing regresses when the option is off or when the LLM misbehaves.

Implementation notes:

New LLM_PICK_KEYFRAME constant and service parameter.
Keyframe saving is deferred: candidate frames are stashed on MediaProcessor instead of immediately calling _expose_image(), so we can pick after the LLM has responded.
A best_frame instruction is injected into the prompt and the JSON schema.
best_frame is extracted from structured or text responses, with a fallback to SSIM-based selection when the value is missing or out of range.
New expose_keyframe_by_index() and expose_keyframe_ssim_fallback() methods on MediaProcessor.
A follow-up refactor decouples the media source from the keyframe strategy so the two paths are cleanly separated.
Logging was added so users can see which keyframe the LLM chose.
Tests cover extraction logic, prompt injection, schema injection, and the new MediaProcessor methods.
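
The fallback rule in the notes above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `resolve_keyframe_index`, the `labels` list, and the `ssim_fallback` callable are hypothetical names, and the assumption that the model may answer with either a frame label or an integer index is mine.

```python
from typing import Callable, Optional, Sequence, Union


def resolve_keyframe_index(
    best_frame: Optional[Union[str, int]],
    labels: Sequence[str],
    ssim_fallback: Callable[[], int],
) -> int:
    """Return the index of the candidate frame to expose.

    `labels` holds one (hypothetical) label per candidate frame, in order.
    `ssim_fallback` stands in for the existing SSIM-based selection.
    """
    # Case 1: the model named a frame by label (case-insensitive match).
    if isinstance(best_frame, str):
        wanted = best_frame.strip().lower()
        for index, label in enumerate(labels):
            if label.strip().lower() == wanted:
                return index

    # Case 2: the model returned a plain in-range index.
    # bool is a subclass of int in Python, so exclude it explicitly.
    if (
        isinstance(best_frame, int)
        and not isinstance(best_frame, bool)
        and 0 <= best_frame < len(labels)
    ):
        return best_frame

    # Missing, unknown, or out-of-range value: keep the existing behavior.
    return ssim_fallback()
```

The point of routing every failure through the same fallback is that a misbehaving model can only ever degrade to the current SSIM pick, never to no keyframe at all.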

No behavior change when the option is not set. Happy to iterate on naming, defaults, or structure if you'd prefer a different shape. No pressure at all to take this, I'm maintaining it on a fork either way.

Thanks again for the project.

When enabled alongside expose_images, the LLM chooses which analyzed frame best represents the event instead of the SSIM motion heuristic. The SSIM approach often picks empty scenes after the subject has left; the LLM can identify the frame with the most relevant content.

- Add LLM_PICK_KEYFRAME constant and service parameter for video_analyzer and stream_analyzer
- Defer keyframe saving: stash candidate frames on MediaProcessor instead of immediately calling _expose_image()
- Inject best_frame instruction into LLM prompt and JSON schema
- Extract best_frame from structured or text responses with fallback to SSIM-based selection when LLM output is missing/invalid
- Add expose_keyframe_by_index() and expose_keyframe_ssim_fallback() methods to MediaProcessor
- Add comprehensive tests for extraction logic, prompt injection, schema injection, and MediaProcessor expose methods

https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg

Co-authored-by: Claude <noreply@anthropic.com>

* Add llm_pick_keyframe option for LLM-guided keyframe selection

  When enabled alongside expose_images, the LLM chooses which analyzed frame best represents the event instead of the SSIM motion heuristic. The SSIM approach often picks empty scenes after the subject has left; the LLM can identify the frame with the most relevant content.

  - Add LLM_PICK_KEYFRAME constant and service parameter for video_analyzer and stream_analyzer
  - Defer keyframe saving: stash candidate frames on MediaProcessor instead of immediately calling _expose_image()
  - Inject best_frame instruction into LLM prompt and JSON schema
  - Extract best_frame from structured or text responses with fallback to SSIM-based selection when LLM output is missing/invalid
  - Add expose_keyframe_by_index() and expose_keyframe_ssim_fallback() methods to MediaProcessor
  - Add comprehensive tests for extraction logic, prompt injection, schema injection, and MediaProcessor expose methods

  https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg

* Strip best_frame from LLM response to prevent it showing in descriptions

  The LLM includes best_frame in its text response which was leaking into the user-visible timeline description. Now _extract_best_frame() removes best_frame from both structured_response dicts and response_text before returning. Updated tests to verify stripping behavior.

  https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Refactor llm_pick_keyframe: decouple media processing from keyframe strategy

  Design changes:

  - MediaProcessor always stashes candidate_frames when expose_images=True; it no longer decides whether to expose immediately or defer. The llm_pick_keyframe parameter is removed from every media method signature (record, add_video, add_videos, add_streams).
  - Keyframe selection is now entirely the handler's responsibility:
    * llm_pick_keyframe=False → handler calls select_and_expose_keyframe() right after media processing (SSIM on stashed candidates).
    * llm_pick_keyframe=True → handler waits for LLM response, extracts best_frame, falls back to SSIM on failure.
  - Prompt injection uses an unambiguous bracketed tag [BEST_FRAME: <label>] instead of free-form "best_frame:" text, making stripping reliable. Guards against double-injection on fallback retry.
  - _extract_best_frame (hidden mutation) split into _match_best_frame (pure lookup) and _extract_and_strip_best_frame (explicit about mutation).
  - candidate_frames is a public attribute (no underscore) since handlers access it directly.
  - add_videos processes videos sequentially instead of via asyncio.gather to avoid concurrent appends to the shared candidate_frames list.
  - getattr guards removed from providers.py; call is always ServiceCallData.

  https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg

* Remove dead code, deduplicate handlers, drop redundant comments

  - Remove unused SSIM computation in record(): key_idx was computed but never used after the refactor removed the inline expose call
  - Extract _resolve_llm_keyframe() helper to eliminate identical 10-line blocks in video_analyzer and stream_analyzer
  - Delete four "Add processor.key_frame to response if it exists" comments that just restate the code

  https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg

---------

Co-authored-by: Claude <noreply@anthropic.com>
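
For readers following the refactor, here is a rough sketch of how a bracketed tag makes injection idempotent and stripping reliable. The function names echo the commit message, but the bodies, the regular expression, and the instruction wording are all assumptions for illustration and may differ from the actual code in the PR.

```python
import re
from typing import Optional, Tuple

# Matches "[BEST_FRAME: some label]" and captures the label (assumed format).
BEST_FRAME_TAG = re.compile(r"\[BEST_FRAME:\s*([^\]\n]+?)\s*\]")

# Hypothetical instruction text; the real wording is not shown in this thread.
BEST_FRAME_INSTRUCTION = (
    "End your answer with [BEST_FRAME: <label>], where <label> is the "
    "label of the frame that best represents the event."
)


def inject_best_frame_instruction(prompt: str) -> str:
    """Append the instruction once; a second call leaves the prompt unchanged."""
    if "[BEST_FRAME:" in prompt:
        return prompt  # guard against double-injection on a retry
    return f"{prompt.rstrip()}\n\n{BEST_FRAME_INSTRUCTION}"


def match_best_frame(text: str) -> Optional[str]:
    """Pure lookup: return the tagged label, or None if no tag is present."""
    found = BEST_FRAME_TAG.search(text)
    return found.group(1) if found else None


def extract_and_strip_best_frame(text: str) -> Tuple[Optional[str], str]:
    """Return (label, text with every tag removed) without mutating the input."""
    label = match_best_frame(text)
    cleaned = BEST_FRAME_TAG.sub("", text).strip()
    return label, cleaned
```

Keeping the lookup pure and returning the cleaned text as a separate value means the caller decides explicitly when the user-visible description is replaced.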

Adds an INFO log line when the LLM successfully picks a frame, showing the label and index. The failure path already logged a WARNING.

https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg

Co-authored-by: Claude <noreply@anthropic.com>

valentinfrlch · 2026-04-20T11:16:09Z

@CamSoper thank you for the kind words and your work, I appreciate it! What you're proposing sounds interesting. However, I think throwing an LLM at picking the best frame to analyze might be a little overkill (it will also almost double the time to analyze).

I agree, however, that the current SSIM-based implementation might not be optimal, and I would be very interested in improving it. Perhaps we could add a simple object detection model that looks for a person, car, etc. to distinguish between empty frames and the one we actually want to analyze.
I'd love to chat about this on Discord: https://discord.gg/MHWd9ukj

CamSoper · 2026-04-20T17:06:23Z

Thanks, will hop on Discord.

A couple things I think might be a misread of the PR:

It's a per-call service parameter (llm_pick_keyframe) that defaults to false, not a global switch. If you don't set it, you get the existing SSIM path unchanged.
It's not a second LLM call. The candidate frames are already in the prompt of the existing vision request. This just adds one sentence asking the model to name its pick, plus a best_frame field in the JSON schema when structured output is used. Same request, same images. Overhead is a few output tokens, and I haven't seen measurable latency difference between on and off.
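
To make the "same request, one extra field" point concrete, here is a small sketch of adding a best_frame property to an existing structured-output schema and removing the value from the response afterwards. It assumes the schema is a plain JSON Schema object with `properties` and `required`; the helper names and the use of an `enum` of frame labels are hypothetical and not taken from the PR.

```python
import copy
from typing import Any, Dict, Optional, Sequence, Tuple


def add_best_frame_to_schema(
    schema: Dict[str, Any], labels: Sequence[str]
) -> Dict[str, Any]:
    """Return a copy of `schema` with a required best_frame string field."""
    patched = copy.deepcopy(schema)  # leave the caller's schema untouched
    properties = patched.setdefault("properties", {})
    properties["best_frame"] = {
        "type": "string",
        "enum": list(labels),  # assumption: restrict the answer to known labels
        "description": "Label of the frame that best represents the event.",
    }
    required = patched.setdefault("required", [])
    if "best_frame" not in required:
        required.append("best_frame")
    return patched


def pop_best_frame(
    response: Dict[str, Any]
) -> Tuple[Optional[Any], Dict[str, Any]]:
    """Split a structured response into (best_frame, response without it)."""
    remaining = dict(response)
    best_frame = remaining.pop("best_frame", None)
    return best_frame, remaining
```

Because the images are already part of the request, the only added cost in this sketch is the handful of output tokens needed for the extra field.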

On "overkill" -- an object detector is a reasonable direction, but I think it solves a different problem rather than competing with this:

A detector filters empty frames before anything goes to the LLM. Good for the "SSIM picked an empty frame" case, assuming the detector's labels cover what you care about.
LLM pick ranks among the frames the LLM is already looking at. It helps when several frames contain a person/car/package but one is clearly better -- better framed, sharper, the actual moment of the event. A generic detector can't rank that.

Bundling a detection model, weights, and label mapping into a HACS integration is also a bigger ask than a prompt tweak.

I've been running this on my own setup for a while with both structured and non-structured output, and the frames it picks are a real improvement over SSIM for my cameras. One person's sample, sure, but it's why I bothered upstreaming.

No pressure -- I'll keep the fork going either way, and I'm open to reshaping the PR if there's a shape you'd prefer. See you on Discord.

valentinfrlch · 2026-04-28T14:11:26Z

Sorry for the delay! I misunderstood what this is trying to do (I thought it would call the API a second time).
I think the approach for when structured_response is enabled is solid. However, I'm not sure how well it works when structured_response is disabled (which is the default), especially for local models.

I want to finally push 1.7.0, so unfortunately I won't have time to do proper testing of this before that, but I will keep this in mind for the next release!

Thanks for your work!

CamSoper · 2026-05-01T17:39:23Z

@valentinfrlch Makes total sense to me!

It does work in non-structured_response mode, but it's a tiny bit kludgy: it appends an extra line to the prompt and strips the extra output from the response.

If you wanted to make this structured_response-only, I'd be happy to do that. That's how I'm using it.

CamSoper and others added 4 commits April 11, 2026 14:55

Log which keyframe the LLM selected (#4)

394b8d2


CamSoper wants to merge 4 commits into valentinfrlch:main from CamSoper:feature/llm-pick-keyframe
