
Conversation

@dangusev
Contributor

@dangusev dangusev commented Nov 4, 2025

Added needs_tts, needs_stt, handles_audio, and handles_video properties to the LLM and Realtime classes.

This way, the Agent doesn't need isinstance(llm, <LLM | Realtime>) checks to decide whether to forward video, audio, etc.
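For illustration, a minimal sketch of the flags as the summaries below describe them (defaults inferred from the CodeRabbit walkthrough; the real classes carry much more state):

    class LLM:
        """Base LLM: a text model that needs external STT/TTS services."""

        needs_stt: bool = True
        needs_tts: bool = True
        handles_audio: bool = False
        handles_video: bool = False


    class Realtime(LLM):
        """Realtime model: consumes audio and video directly."""

        needs_stt: bool = False
        needs_tts: bool = False
        handles_audio: bool = True
        handles_video: bool = True

With the flags declared on the class, the Agent can branch on llm.handles_audio instead of importing and type-checking Realtime.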

Summary by CodeRabbit

  • Refactor
    • Switched audio/video routing and transcript forwarding to use explicit model capability flags (audio, video, STT, TTS) instead of the legacy realtime mode.
    • Reworked turn-detection and per-user transcript handling to trigger responses more reliably when capabilities indicate direct handling.
    • Simplified logging, state tracking, and configuration validation to reflect capability-based decisions.

Note

Add needs_stt/needs_tts/handles_audio/handles_video flags to LLM/Realtime and refactor Agent logic to route audio/video and turn handling based on these capabilities.

  • LLM/Realtime:
    • Add capability flags: needs_stt, needs_tts, handles_audio, handles_video in LLM; set Realtime to handle audio/video and not need STT/TTS.
  • Agent:
    • Replace realtime_mode/isinstance checks with capability flags across audio/video routing, publishing, and input needs.
    • Gate STT->LLM triggering and turn detection using handles_audio and needs_{stt,tts}; avoid loops when agent is speaking.
    • Update RTC prep and config validation to use flags; use llm.output_track when handles_audio.
    • Video handling: forward/switch tracks based on handles_video; rename handlers to on_video_track_added/removed and clean up track switching logic.

Written by Cursor Bugbot for commit 4c3b8b7. This will update automatically on new commits. Configure here.

@coderabbitai

coderabbitai bot commented Nov 4, 2025

Walkthrough

Replaces the former realtime_mode concept with explicit LLM capability flags (handles_audio, handles_video, needs_stt, needs_tts) and updates agent control flow, event handlers, and media/transcript routing to use those flags.

Changes

  • LLM capability declarations
    Files: agents-core/vision_agents/core/llm/llm.py, agents-core/vision_agents/core/llm/realtime.py
    Removed sts; added needs_stt, needs_tts, handles_audio, handles_video on LLM. The Realtime class now declares handles_audio=True, handles_video=True, needs_stt=False, needs_tts=False.
  • Agent logic & handlers
    Files: agents-core/vision_agents/core/agents/agents.py
    Removed the realtime_mode property. Replaced runtime-mode branches with checks of llm.handles_audio / llm.handles_video. Renamed video handlers (on_track → on_video_track_added, on_track_removed → on_video_track_removed). Updated audio/video forwarding, STT/TTS routing, per-user pending-transcript handling, and turn-detection triggers to use LLM capability flags.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Agent
    participant LLM
    participant STT_TTS

    Note over Client,Agent: incoming media / transcripts
    Client->>Agent: audio / video / transcript
    Agent->>LLM: query capabilities (handles_audio / handles_video)
    alt LLM.handles_audio or LLM.handles_video
        Agent->>LLM: forward media directly
        LLM-->>Agent: response (text/media)
    else
        Agent->>STT_TTS: send audio for STT
        STT_TTS-->>Agent: transcript
        Agent->>LLM: send transcript/text
        LLM-->>Agent: text response
        alt LLM.needs_tts
            Agent->>STT_TTS: request TTS
            STT_TTS-->>Agent: audio response
        end
    end
    Agent->>Client: deliver response (text/audio/video)
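In code terms, the audio branch in the diagram might look like this sketch (simple_audio_response appears on Realtime per a Bugbot comment further down; the method name _route_incoming_audio and stt.process_audio are illustrative):

    async def _route_incoming_audio(self, pcm_frame, participant) -> None:
        """Illustrative capability-based routing for one audio frame."""
        if self.llm.handles_audio:
            # The model consumes audio natively: forward the frame directly.
            await self.llm.simple_audio_response(pcm_frame)
        elif self.llm.needs_stt and self.stt is not None:
            # Traditional flow: transcribe now; the LLM responds on turn end.
            await self.stt.process_audio(pcm_frame, participant)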

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay special attention to agents-core/vision_agents/core/agents/agents.py for correctness of replaced branches and renamed handler registrations.
  • Verify LLM and Realtime attribute semantics and default values in agents-core/vision_agents/core/llm/llm.py and .../realtime.py.
  • Inspect per-user pending-transcript handling and turn-detection for race conditions and correct flush/forward behavior.

Possibly related PRs

Suggested reviewers

  • maxkahan

Poem

The machine learns which mouths to open, which to hold shut;
a ledger of noise, a ledger of silence clenched like teeth.
I name the devices that will drink the water of sound,
instruct the rooms to cough up their voices, one by one —
and watch the small bright bodies of response appear.

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately and concisely summarizes the main change: adding capability and requirement flags (needs_stt, needs_tts, handles_audio, handles_video) to LLM classes.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch chore/agent-llm-refactoring

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c1584e0 and 4c3b8b7.

📒 Files selected for processing (1)
  • agents-core/vision_agents/core/agents/agents.py (12 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • agents-core/vision_agents/core/agents/agents.py
🧬 Code graph analysis (1)
agents-core/vision_agents/core/agents/agents.py (3)
agents-core/vision_agents/core/edge/sfu_events.py (16)
  • participant (1496-1501)
  • participant (1504-1507)
  • participant (1545-1550)
  • participant (1553-1556)
  • participant (1625-1630)
  • participant (1633-1636)
  • participant (2100-2105)
  • participant (2108-2111)
  • participant (2156-2161)
  • participant (2164-2167)
  • user_id (489-493)
  • user_id (856-860)
  • user_id (901-905)
  • user_id (1186-1190)
  • user_id (2093-2097)
  • user_id (2142-2146)
agents-core/vision_agents/core/events/base.py (1)
  • user_id (45-48)
agents-core/vision_agents/core/llm/llm.py (1)
  • simple_response (77-83)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Test "not integration"

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 60f6d83 and 89fd60b.

📒 Files selected for processing (3)
  • agents-core/vision_agents/core/agents/agents.py (13 hunks)
  • agents-core/vision_agents/core/llm/llm.py (1 hunks)
  • agents-core/vision_agents/core/llm/realtime.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • agents-core/vision_agents/core/llm/llm.py
  • agents-core/vision_agents/core/llm/realtime.py
  • agents-core/vision_agents/core/agents/agents.py
🧬 Code graph analysis (1)
agents-core/vision_agents/core/agents/agents.py (4)
agents-core/vision_agents/core/edge/sfu_events.py (16)
  • participant (1496-1501)
  • participant (1504-1507)
  • participant (1545-1550)
  • participant (1553-1556)
  • participant (1625-1630)
  • participant (1633-1636)
  • participant (2100-2105)
  • participant (2108-2111)
  • participant (2156-2161)
  • participant (2164-2167)
  • user_id (489-493)
  • user_id (856-860)
  • user_id (901-905)
  • user_id (1186-1190)
  • user_id (2093-2097)
  • user_id (2142-2146)
agents-core/vision_agents/core/events/base.py (1)
  • user_id (45-48)
agents-core/vision_agents/core/turn_detection/events.py (1)
  • TurnEndedEvent (29-45)
agents-core/vision_agents/core/llm/llm.py (1)
  • simple_response (57-63)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Test "not integration"
🔇 Additional comments (10)
agents-core/vision_agents/core/llm/llm.py (1)

37-42: LGTM! Clean capability flag introduction.

The explicit capability flags provide clear contracts for Agent integration, and the defaults correctly represent a traditional LLM's requirements.

agents-core/vision_agents/core/agents/agents.py (9)

417-419: LGTM! Correct capability-based short-circuit.

The early return appropriately skips the STT-to-LLM flow when the LLM directly consumes audio.


791-818: LGTM! Clearer event handler naming and correct capability check.

The rename from on_track to on_video_track_added improves clarity, and the video forwarding correctly depends on llm.handles_video.


821-844: LGTM! Consistent naming and correct capability gating.

The handler rename mirrors on_video_track_added, and track switching is appropriately gated by llm.handles_video.


865-874: LGTM! Correct audio routing based on capabilities.

The conditional correctly routes audio directly to the LLM when it handles audio, or to STT otherwise.


1117-1147: LGTM! Well-structured turn-end handling.

The refactored logic correctly accumulates per-user transcripts and triggers LLM responses on turn completion, with appropriate early-exit for agent self-speech.


1160-1162: LGTM! Correct audio publishing logic.

The property appropriately returns True when either TTS or the LLM itself produces audio output.
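As a hedged sketch, such a property could read (attribute names beyond the capability flags are assumptions):

    @property
    def publish_audio(self) -> bool:
        # Publish an outbound audio track when a TTS service is configured
        # or the LLM produces audio natively (e.g. a Realtime model).
        return self.tts is not None or self.llm.handles_audio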


1194-1196: LGTM! Correct video input requirement check.

The logic correctly determines video input needs based on processors or LLM video capability.


1245-1257: LGTM! Appropriate configuration validation.

The validation correctly warns when conflicting STT/TTS services are configured alongside a realtime LLM.


1282-1298: LGTM! Correct audio track source selection.

The logic appropriately selects the audio track source based on whether the LLM produces audio directly.

Comment on lines +1088 to 1095
        # Skip the turn event handling if the model doesn't require TTS or SST audio itself.
        if not (self.llm.needs_tts and self.llm.needs_stt):
            return

⚠️ Potential issue | 🟡 Minor

Fix typo and clarify conditional logic.

Line 1088 contains "SST" (should be "STT"), and the phrasing could be clearer about the conjunction.

Apply this diff:

-        # Skip the turn event handling if the model doesn't require TTS or SST audio itself.
+        # Skip turn event handling if the model handles audio directly (doesn't need both STT and TTS).
         if not (self.llm.needs_tts and self.llm.needs_stt):
             return
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        # Skip the turn event handling if the model doesn't require TTS or SST audio itself.
+        # Skip turn event handling if the model handles audio directly (doesn't need both STT and TTS).
         if not (self.llm.needs_tts and self.llm.needs_stt):
             return
🤖 Prompt for AI Agents
In agents-core/vision_agents/core/agents/agents.py around lines 1088 to 1090,
fix the typo "SST" to "STT" and clarify the conditional to match the comment
intent: change the conjunction so we skip handling only when the model requires
neither TTS nor STT (use OR inside the parentheses), i.e. replace the current
condition with one that returns when not (self.llm.needs_tts or
self.llm.needs_stt) and keep the rest of the surrounding logic unchanged.
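Note the committable suggestion keeps the and-form while the AI-agent prompt proposes an or-form; they are not equivalent. A quick sketch enumerating both:

    # Compare the two skip conditions over all flag combinations.
    for needs_tts in (False, True):
        for needs_stt in (False, True):
            skip_and = not (needs_tts and needs_stt)  # committable suggestion
            skip_or = not (needs_tts or needs_stt)    # AI-prompt variant
            print(f"tts={needs_tts} stt={needs_stt} "
                  f"and-form={skip_and} or-form={skip_or}")

For a model that needs STT but not TTS (or vice versa), the and-form skips turn handling while the or-form does not, so the two rewrites change behavior differently.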


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
agents-core/vision_agents/core/llm/realtime.py (1)

45-50: The capability flags are well-implemented.

The type hints that were flagged in the previous review have been added. All four attributes now have explicit bool type annotations, and the defaults correctly reflect that Realtime models handle audio/video directly without requiring separate STT/TTS services.

Consider documenting these public attributes in the class docstring using an "Attributes:" section per Google style guide, though this is not critical for functionality.

Optional: Add attributes documentation

You could enhance the class docstring (after line 40) with an attributes section:

    Attributes:
        handles_audio: Indicates this model can process audio input directly.
        handles_video: Indicates this model can process video input directly.
        needs_stt: Indicates whether speech-to-text service is required.
        needs_tts: Indicates whether text-to-speech service is required.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 89fd60b and 94f7448.

📒 Files selected for processing (1)
  • agents-core/vision_agents/core/llm/realtime.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • agents-core/vision_agents/core/llm/realtime.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy

@dangusev dangusev force-pushed the chore/agent-llm-refactoring branch from 94f7448 to c1584e0 on November 4, 2025 at 22:16

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
agents-core/vision_agents/core/agents/agents.py (1)

1093-1095: Fix typo "SST" → "STT" in comment.

Line 1093 contains "SST" which should be "STT". The comment could also be clearer about the conditional logic.

Apply this diff to fix the typo:

-        # Skip the turn event handling if the model doesn't require TTS or SST audio itself.
+        # Skip turn event handling if the model doesn't require both STT and TTS.
         if not (self.llm.needs_tts and self.llm.needs_stt):
             return
🧹 Nitpick comments (1)
agents-core/vision_agents/core/agents/agents.py (1)

1143-1146: Simplify participant handling – use existing event attribute.

Lines 1143-1146 attempt to extract participant from event.custom, but event.participant is already available and validated at line 1128. The custom metadata extraction appears redundant.

Consider this simplification:

-                # Create participant object if we have metadata
-                participant = None
-                if hasattr(event, "custom") and event.custom:
-                    # Try to extract participant info from custom metadata
-                    participant = event.custom.get("participant")
-
                 # Trigger LLM response with the complete transcript
-                await self.simple_response(transcript, participant)
+                await self.simple_response(transcript, event.participant)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 94f7448 and c1584e0.

📒 Files selected for processing (3)
  • agents-core/vision_agents/core/agents/agents.py (12 hunks)
  • agents-core/vision_agents/core/llm/llm.py (1 hunks)
  • agents-core/vision_agents/core/llm/realtime.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • agents-core/vision_agents/core/llm/llm.py
  • agents-core/vision_agents/core/llm/realtime.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • agents-core/vision_agents/core/agents/agents.py
🧬 Code graph analysis (1)
agents-core/vision_agents/core/agents/agents.py (3)
agents-core/vision_agents/core/edge/sfu_events.py (16)
  • participant (1496-1501)
  • participant (1504-1507)
  • participant (1545-1550)
  • participant (1553-1556)
  • participant (1625-1630)
  • participant (1633-1636)
  • participant (2100-2105)
  • participant (2108-2111)
  • participant (2156-2161)
  • participant (2164-2167)
  • user_id (489-493)
  • user_id (856-860)
  • user_id (901-905)
  • user_id (1186-1190)
  • user_id (2093-2097)
  • user_id (2142-2146)
agents-core/vision_agents/core/events/base.py (1)
  • user_id (45-48)
agents-core/vision_agents/core/llm/llm.py (1)
  • simple_response (77-83)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Test "not integration"
🔇 Additional comments (9)
agents-core/vision_agents/core/agents/agents.py (9)

422-424: LGTM – Clean capability flag usage.

The early return when llm.handles_audio is appropriate here, avoiding redundant LLM invocations when the model processes audio natively.


796-823: Handler renamed appropriately with correct flag usage.

The rename from on_track to on_video_track_added improves clarity, and the llm.handles_video check at line 810 correctly gates video forwarding.


826-849: Handler renamed appropriately with correct flag usage.

The rename to on_video_track_removed is clearer, and the llm.handles_video check at line 844 properly determines whether to switch tracks.


870-879: Correct audio routing based on capability flag.

The llm.handles_audio check appropriately routes audio either directly to the LLM or through STT processing.


971-1002: Appropriate video forwarding based on capability flag.

The llm.handles_video check at line 971 correctly determines whether to forward video frames to the LLM, with proper handling of both processed and raw video tracks.


1165-1167: Correct audio publishing determination.

The llm.handles_audio check properly determines whether to publish audio, accounting for both TTS and native LLM audio handling.


1199-1199: Correct video input determination.

The llm.handles_video check appropriately determines when video input is needed from participants.


1250-1262: Appropriate configuration validation.

The llm.handles_audio check correctly validates that Realtime mode doesn't have conflicting STT/TTS/Turn Detection services configured.


1287-1303: Correct audio track initialization.

The llm.handles_audio check properly determines whether to use the LLM's output track or create a new audio track for TTS.


@cursor cursor bot left a comment


This is the final PR Bugbot will review for you during this billing cycle

Your free Bugbot reviews will reset on November 7

Details

Your team is on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle for each member of your team.

To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

         )

-        if self.realtime_mode and isinstance(self.llm, Realtime):
+        if self.llm.handles_video:

Bug: Video Handling Method Called on Non-Realtime LLMs

The code checks if self.llm.handles_video: and then calls self.llm._watch_video_track(), but _watch_video_track is only defined in the Realtime class, not in the base LLM class. A non-Realtime LLM that sets handles_video = True would cause an AttributeError. The code should check isinstance(self.llm, Realtime) before calling this method, similar to how it's done at lines 911, 992, and 1003.
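A minimal sketch of the guard Bugbot describes (the track variable and surrounding context are assumed):

    if self.llm.handles_video and isinstance(self.llm, Realtime):
        # Narrow to Realtime before calling its private API, so an LLM
        # subclass that sets handles_video = True without implementing
        # _watch_video_track cannot raise AttributeError.
        self.llm._watch_video_track(track)

The same narrowing would apply to the two audio findings below (simple_audio_response and output_track).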



         # when in Realtime mode call the Realtime directly (non-blocking)
-        if self.realtime_mode and isinstance(self.llm, Realtime):
+        if self.llm.handles_audio:

Bug: Audio handling sanity check missing type guard

The code checks if self.llm.handles_audio: and then calls self.llm.simple_audio_response(), but simple_audio_response is an abstract method only defined in the Realtime class, not in the base LLM class. A non-Realtime LLM that sets handles_audio = True would cause an AttributeError. The code should check isinstance(self.llm, Realtime) before calling this method.


         # Set up audio track if TTS is available
         if self.publish_audio:
-            if self.realtime_mode and isinstance(self.llm, Realtime):
+            if self.llm.handles_audio:

Bug: AttributeError Risk from Audio Handling Mismatch

The code checks if self.llm.handles_audio: and then accesses self.llm.output_track, but output_track is only defined in the Realtime class, not in the base LLM class. A non-Realtime LLM that sets handles_audio = True would cause an AttributeError. The code should check isinstance(self.llm, Realtime) before accessing this attribute.

