Add llm_pick_keyframe option for LLM-guided keyframe selection#634
Add llm_pick_keyframe option for LLM-guided keyframe selection#634CamSoper wants to merge 4 commits into
Conversation
When enabled alongside expose_images, the LLM chooses which analyzed frame best represents the event instead of the SSIM motion heuristic. The SSIM approach often picks empty scenes after the subject has left; the LLM can identify the frame with the most relevant content. - Add LLM_PICK_KEYFRAME constant and service parameter for video_analyzer and stream_analyzer - Defer keyframe saving: stash candidate frames on MediaProcessor instead of immediately calling _expose_image() - Inject best_frame instruction into LLM prompt and JSON schema - Extract best_frame from structured or text responses with fallback to SSIM-based selection when LLM output is missing/invalid - Add expose_keyframe_by_index() and expose_keyframe_ssim_fallback() methods to MediaProcessor - Add comprehensive tests for extraction logic, prompt injection, schema injection, and MediaProcessor expose methods https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg Co-authored-by: Claude <noreply@anthropic.com>
* Add llm_pick_keyframe option for LLM-guided keyframe selection When enabled alongside expose_images, the LLM chooses which analyzed frame best represents the event instead of the SSIM motion heuristic. The SSIM approach often picks empty scenes after the subject has left; the LLM can identify the frame with the most relevant content. - Add LLM_PICK_KEYFRAME constant and service parameter for video_analyzer and stream_analyzer - Defer keyframe saving: stash candidate frames on MediaProcessor instead of immediately calling _expose_image() - Inject best_frame instruction into LLM prompt and JSON schema - Extract best_frame from structured or text responses with fallback to SSIM-based selection when LLM output is missing/invalid - Add expose_keyframe_by_index() and expose_keyframe_ssim_fallback() methods to MediaProcessor - Add comprehensive tests for extraction logic, prompt injection, schema injection, and MediaProcessor expose methods https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg * Strip best_frame from LLM response to prevent it showing in descriptions The LLM includes best_frame in its text response which was leaking into the user-visible timeline description. Now _extract_best_frame() removes best_frame from both structured_response dicts and response_text before returning. Updated tests to verify stripping behavior. https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg --------- Co-authored-by: Claude <noreply@anthropic.com>
* Refactor llm_pick_keyframe: decouple media processing from keyframe strategy
Design changes:
- MediaProcessor always stashes candidate_frames when expose_images=True;
it no longer decides whether to expose immediately or defer. The
llm_pick_keyframe parameter is removed from every media method signature
(record, add_video, add_videos, add_streams).
- Keyframe selection is now entirely the handler's responsibility:
* llm_pick_keyframe=False → handler calls select_and_expose_keyframe()
right after media processing (SSIM on stashed candidates).
* llm_pick_keyframe=True → handler waits for LLM response, extracts
best_frame, falls back to SSIM on failure.
- Prompt injection uses an unambiguous bracketed tag [BEST_FRAME: <label>]
instead of free-form "best_frame:" text, making stripping reliable.
Guards against double-injection on fallback retry.
- _extract_best_frame (hidden mutation) split into _match_best_frame
(pure lookup) and _extract_and_strip_best_frame (explicit about mutation).
- candidate_frames is a public attribute (no underscore) since handlers
access it directly.
- add_videos processes videos sequentially instead of via asyncio.gather
to avoid concurrent appends to the shared candidate_frames list.
- getattr guards removed from providers.py; call is always ServiceCallData.
https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg
* Remove dead code, deduplicate handlers, drop redundant comments
- Remove unused SSIM computation in record() — key_idx was computed
but never used after the refactor removed the inline expose call
- Extract _resolve_llm_keyframe() helper to eliminate identical
10-line blocks in video_analyzer and stream_analyzer
- Delete four "Add processor.key_frame to response if it exists"
comments that just restate the code
https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg
---------
Co-authored-by: Claude <noreply@anthropic.com>
Adds an INFO log line when the LLM successfully picks a frame, showing the label and index. The failure path already logged a WARNING. https://claude.ai/code/session_01WfgiiUuVbBGFtvTn7wzpFg Co-authored-by: Claude <noreply@anthropic.com>
|
@CamSoper thank you for the kind words and your work, I appreciate it! What your proposing sounds interesting. However, I think throwing an LLM against picking the best frame to analyze might be a little overkill (it will also almost double the time to analyze). I agree however that the current SSIM based implementation might not be optimal, and would be very interested in improving this. Perhaps we could add a simple object detection model to look for |
|
Thanks, will hop on Discord. A couple things I think might be a misread of the PR:
On "overkill" -- an object detector is a reasonable direction, but I think it solves a different problem rather than competing with this:
Bundling a detection model, weights, and label mapping into a HACS integration is also a bigger ask than a prompt tweak. I've been running this on my own setup for a while with both structured and non-structured output, and the frames it picks are a real improvement over SSIM for my cameras. One person's sample, sure, but it's why I bothered upstreaming. No pressure -- I'll keep the fork going either way, and I'm open to reshaping the PR if there's a shape you'd prefer. See you on Discord. |
|
Sorry for the delay! I misunderstood what this is trying to do (I though it would call the api a second time). I want to finally push 1.7.0, so unfortunately I won't have time to do proper testing of this before that, but will keep this in mind for next release! Thanks for your work! |
|
@valentinfrlch Makes total sense to me! It does work on the non- If you wanted to make this a |
First, thanks for building llmvision. It's become a key part of my Home Assistant setup and I really appreciate the work you've put into it.
Why
When analyzing a stream, the current SSIM motion heuristic picks the frame with the most visual change. In practice that often lands on an empty scene right after the subject of interest has left the frame, so the snapshot that ends up exposed to Home Assistant isn't the one a human would have picked. Since the LLM is already looking at every candidate frame, it seemed worth letting it tell us which one best represents the event.
How
This PR adds a new
llm_pick_keyframeoption forvideo_analyzerandstream_analyzer. When enabled alongsideexpose_images, the LLM is asked to identify the best frame and that frame is the one exposed. If the model's response is missing or invalid, it falls back to the existing SSIM behavior, so nothing regresses when the option is off or when the LLM misbehaves.Implementation notes:
LLM_PICK_KEYFRAMEconstant and service parameter.MediaProcessorinstead of immediately calling_expose_image(), so we can pick after the LLM has responded.best_frameinstruction is injected into the prompt and the JSON schema.best_frameis extracted from structured or text responses, with a fallback to SSIM-based selection when the value is missing or out of range.expose_keyframe_by_index()andexpose_keyframe_ssim_fallback()methods onMediaProcessor.MediaProcessormethods.No behavior change when the option is not set. Happy to iterate on naming, defaults, or structure if you'd prefer a different shape. No pressure at all to take this, I'm maintaining it on a fork either way.
Thanks again for the project.