Add llm_pick_keyframe option for LLM-guided keyframe selection #634

Open
CamSoper wants to merge 4 commits into valentinfrlch:main from CamSoper:feature/llm-pick-keyframe

@CamSoper

Contributor

First, thanks for building llmvision. It's become a key part of my Home Assistant setup and I really appreciate the work you've put into it.

Why

When analyzing a stream, the current SSIM motion heuristic picks the frame with the most visual change. In practice that often lands on an empty scene right after the subject of interest has left the frame, so the snapshot that ends up exposed to Home Assistant isn't the one a human would have picked. Since the LLM is already looking at every candidate frame, it seemed worth letting it tell us which one best represents the event.

How

This PR adds a new llm_pick_keyframe option for video_analyzer and stream_analyzer. When enabled alongside expose_images, the LLM is asked to identify the best frame and that frame is the one exposed. If the model's response is missing or invalid, it falls back to the existing SSIM behavior, so nothing regresses when the option is off or when the LLM misbehaves.

Implementation notes:

  • New LLM_PICK_KEYFRAME constant and service parameter.
  • Keyframe saving is deferred: candidate frames are stashed on MediaProcessor instead of immediately calling _expose_image(), so we can pick after the LLM has responded.
  • A best_frame instruction is injected into the prompt and the JSON schema.
  • best_frame is extracted from structured or text responses, with a fallback to SSIM-based selection when the value is missing or out of range.
  • New expose_keyframe_by_index() and expose_keyframe_ssim_fallback() methods on MediaProcessor.
  • A follow-up refactor decouples the media source from the keyframe strategy so the two paths are cleanly separated.
  • Logging was added so users can see which keyframe the LLM chose.
  • Tests cover extraction logic, prompt injection, schema injection, and the new MediaProcessor methods.
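The selection-with-fallback step in the notes above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `resolve_keyframe_index`, its parameters, and the idea of passing the SSIM selection in as a callable are all assumptions made for the example, as is accepting either an integer index or a frame label for best_frame.

```python
from typing import Callable, Optional, Sequence


def resolve_keyframe_index(
    best_frame: Optional[object],
    labels: Sequence[str],
    ssim_fallback: Callable[[], int],
) -> int:
    """Return the index of the candidate frame to expose (hypothetical helper)."""
    # Accept an integer index if it is in range. bool is a subclass of int,
    # so reject it explicitly to avoid treating True/False as 1/0.
    if isinstance(best_frame, int) and not isinstance(best_frame, bool):
        if 0 <= best_frame < len(labels):
            return best_frame

    # Accept a frame label, compared case-insensitively.
    if isinstance(best_frame, str):
        wanted = best_frame.strip().lower()
        for index, label in enumerate(labels):
            if label.strip().lower() == wanted:
                return index

    # Missing, unknown, or out-of-range value: defer to the SSIM heuristic.
    return ssim_fallback()


labels = ["frame 1", "frame 2", "frame 3"]
print(resolve_keyframe_index("Frame 2", labels, lambda: 0))
print(resolve_keyframe_index(None, labels, lambda: 0))
```

Passing the fallback in as a callable keeps the SSIM computation lazy, so it only runs when the model's answer cannot be used.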

No behavior change when the option is not set. Happy to iterate on naming, defaults, or structure if you'd prefer a different shape. No pressure at all to take this, I'm maintaining it on a fork either way.

Thanks again for the project.

CamSoper and others added 4 commits April 11, 2026 14:55
When enabled alongside expose_images, the LLM chooses which analyzed
frame best represents the event instead of the SSIM motion heuristic.
The SSIM approach often picks empty scenes after the subject has left;
the LLM can identify the frame with the most relevant content.

- Add LLM_PICK_KEYFRAME constant and service parameter for
  video_analyzer and stream_analyzer
- Defer keyframe saving: stash candidate frames on MediaProcessor
  instead of immediately calling _expose_image()
- Inject best_frame instruction into LLM prompt and JSON schema
- Extract best_frame from structured or text responses with fallback
  to SSIM-based selection when LLM output is missing/invalid
- Add expose_keyframe_by_index() and expose_keyframe_ssim_fallback()
  methods to MediaProcessor
- Add comprehensive tests for extraction logic, prompt injection,
  schema injection, and MediaProcessor expose methods

https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg

Co-authored-by: Claude <noreply@anthropic.com>
* Add llm_pick_keyframe option for LLM-guided keyframe selection

* Strip best_frame from LLM response to prevent it showing in descriptions

The LLM includes best_frame in its text response which was leaking into
the user-visible timeline description. Now _extract_best_frame() removes
best_frame from both structured_response dicts and response_text before
returning. Updated tests to verify stripping behavior.

https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg

---------

Co-authored-by: Claude <noreply@anthropic.com>
* Refactor llm_pick_keyframe: decouple media processing from keyframe strategy

Design changes:

- MediaProcessor always stashes candidate_frames when expose_images=True;
  it no longer decides whether to expose immediately or defer.  The
  llm_pick_keyframe parameter is removed from every media method signature
  (record, add_video, add_videos, add_streams).

- Keyframe selection is now entirely the handler's responsibility:
  * llm_pick_keyframe=False → handler calls select_and_expose_keyframe()
    right after media processing (SSIM on stashed candidates).
  * llm_pick_keyframe=True → handler waits for LLM response, extracts
    best_frame, falls back to SSIM on failure.

- Prompt injection uses an unambiguous bracketed tag [BEST_FRAME: <label>]
  instead of free-form "best_frame:" text, making stripping reliable.
  Guards against double-injection on fallback retry.

- _extract_best_frame (hidden mutation) split into _match_best_frame
  (pure lookup) and _extract_and_strip_best_frame (explicit about mutation).

- candidate_frames is a public attribute (no underscore) since handlers
  access it directly.

- add_videos processes videos sequentially instead of via asyncio.gather
  to avoid concurrent appends to the shared candidate_frames list.

- getattr guards removed from providers.py; call is always ServiceCallData.

https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg

* Remove dead code, deduplicate handlers, drop redundant comments

- Remove unused SSIM computation in record() — key_idx was computed
  but never used after the refactor removed the inline expose call
- Extract _resolve_llm_keyframe() helper to eliminate identical
  10-line blocks in video_analyzer and stream_analyzer
- Delete four "Add processor.key_frame to response if it exists"
  comments that just restate the code

https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg

---------

Co-authored-by: Claude <noreply@anthropic.com>
Adds an INFO log line when the LLM successfully picks a frame, showing
the label and index. The failure path already logged a WARNING.

https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg

Co-authored-by: Claude <noreply@anthropic.com>
@valentinfrlch

Owner

@CamSoper thank you for the kind words and your work, I appreciate it! What you're proposing sounds interesting. However, I think throwing an LLM at picking the best frame to analyze might be a little overkill (it will also almost double the time to analyze).

I agree, however, that the current SSIM-based implementation might not be optimal, and I would be very interested in improving it. Perhaps we could add a simple object detection model to look for person, car, etc. to distinguish between empty frames and the one we actually want to analyze.
I'd love to chat about this on Discord: https://discord.gg/MHWd9ukj

@CamSoper

Contributor Author

Thanks, will hop on Discord.

A couple things I think might be a misread of the PR:

  • It's a per-call service parameter (llm_pick_keyframe) that defaults to false, not a global switch. If you don't set it, you get the existing SSIM path unchanged.
  • It's not a second LLM call. The candidate frames are already in the prompt of the existing vision request. This just adds one sentence asking the model to name its pick, plus a best_frame field in the JSON schema when structured output is used. Same request, same images. Overhead is a few output tokens, and I haven't seen measurable latency difference between on and off.
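To make the "same request, one extra field" point concrete, here is a rough sketch of what injecting best_frame into an existing prompt and JSON schema could look like. The helper name `inject_best_frame`, the hint wording, and the use of an `enum` of frame labels are illustrative assumptions, not what the PR actually ships.

```python
import copy

# Hypothetical wording; the PR's actual instruction text may differ.
BEST_FRAME_HINT = (
    "Also report which frame best represents the event in the best_frame field."
)


def inject_best_frame(prompt: str, schema: dict, labels: list[str]) -> tuple[str, dict]:
    """Return a (prompt, schema) pair extended with a best_frame request."""
    # Work on a copy so the caller's schema is left untouched.
    new_schema = copy.deepcopy(schema)
    properties = new_schema.setdefault("properties", {})
    properties["best_frame"] = {
        "type": "string",
        "enum": list(labels),
        "description": "Label of the frame that best represents the event.",
    }
    required = new_schema.setdefault("required", [])
    if "best_frame" not in required:
        required.append("best_frame")

    # Guard against appending the hint twice, e.g. on a retry.
    if BEST_FRAME_HINT in prompt:
        return prompt, new_schema
    return f"{prompt}\n{BEST_FRAME_HINT}", new_schema


schema = {
    "type": "object",
    "properties": {"description": {"type": "string"}},
    "required": ["description"],
}
prompt, extended = inject_best_frame("Describe the event.", schema, ["frame 1", "frame 2"])
print(sorted(extended["properties"]))
```

Nothing here adds a request: the images and the original prompt are unchanged, and the only additional output is the single field value.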

On "overkill" -- an object detector is a reasonable direction, but I think it solves a different problem rather than competing with this:

  • A detector filters empty frames before anything goes to the LLM. Good for the "SSIM picked an empty frame" case, assuming the detector's labels cover what you care about.
  • LLM pick ranks among the frames the LLM is already looking at. It helps when several frames contain a person/car/package but one is clearly better -- better framed, sharper, the actual moment of the event. A generic detector can't rank that.

Bundling a detection model, weights, and label mapping into a HACS integration is also a bigger ask than a prompt tweak.

I've been running this on my own setup for a while with both structured and non-structured output, and the frames it picks are a real improvement over SSIM for my cameras. One person's sample, sure, but it's why I bothered upstreaming.

No pressure -- I'll keep the fork going either way, and I'm open to reshaping the PR if there's a shape you'd prefer. See you on Discord.

@valentinfrlch

Owner

Sorry for the delay! I misunderstood what this is trying to do (I thought it would call the API a second time).
I think the approach for when structured_response is enabled is solid. However, I'm not sure how well it works when structured_response is disabled (which is the default), especially for local models.

I want to finally push 1.7.0, so unfortunately I won't have time to do proper testing of this before that, but will keep this in mind for next release!

Thanks for your work!

@CamSoper

CamSoper commented May 1, 2026

Contributor Author

@valentinfrlch Makes total sense to me!

It does work in non-structured_response mode, but it's a tiny bit kludgy. It injects an extra line into the prompt and strips the extra output from the response.

If you wanted to make this structured_response-only, I'd be happy to do that. That's how I'm using it.
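For reference, the inject-and-strip approach for plain-text responses can be sketched like this. The [BEST_FRAME: <label>] tag format comes from the refactor commit message; the function name `strip_best_frame_tag` and the regex are assumptions for illustration and are not the PR's implementation.

```python
import re
from typing import Optional, Tuple

# Matches a tag such as "[BEST_FRAME: frame 2]", case-insensitively.
TAG_RE = re.compile(r"\[BEST_FRAME:\s*([^\]]+?)\s*\]", re.IGNORECASE)


def strip_best_frame_tag(text: str) -> Tuple[Optional[str], str]:
    """Return (label, cleaned_text); label is None when no tag is present."""
    match = TAG_RE.search(text)
    if match is None:
        # No tag: leave the text alone so the caller can fall back to SSIM.
        return None, text

    label = match.group(1)
    # Remove every occurrence of the tag so it never reaches the description.
    cleaned = TAG_RE.sub("", text)
    # Tidy whitespace left behind at line ends and at the edges of the text.
    cleaned = re.sub(r"[ \t]+\n", "\n", cleaned).strip()
    return label, cleaned


label, description = strip_best_frame_tag(
    "A person drops off a package.\n[BEST_FRAME: frame 2]"
)
print(label)
print(description)
```

A bracketed tag is easier to strip reliably than free-form text, but it still depends on the model following the instruction, which is the concern raised for local models.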
