Add flags to signal capabilities and requirements in LLM #151
Walkthrough
Replaces the former
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Agent
participant LLM
participant STT_TTS
Note over Client,Agent: incoming media / transcripts
Client->>Agent: audio / video / transcript
Agent->>LLM: query capabilities (handles_audio / handles_video)
alt LLM.handles_audio or LLM.handles_video
Agent->>LLM: forward media directly
LLM-->>Agent: response (text/media)
else
Agent->>STT_TTS: send audio for STT
STT_TTS-->>Agent: transcript
Agent->>LLM: send transcript/text
LLM-->>Agent: text response
alt LLM.needs_tts
Agent->>STT_TTS: request TTS
STT_TTS-->>Agent: audio response
end
end
Agent->>Client: deliver response (text/audio/video)
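The branching in the diagram above can be sketched in a few lines of Python. This is only an illustration of the flow as drawn, not the Agent's actual code: the `Capabilities` dataclass, the `plan_route` function, and the step labels are hypothetical names introduced here; only the four flag names come from the PR.

```python
from dataclasses import dataclass


@dataclass
class Capabilities:
    """Hypothetical container for the four flags discussed in this PR."""

    handles_audio: bool = False
    handles_video: bool = False
    needs_stt: bool = True
    needs_tts: bool = True


def plan_route(llm: Capabilities) -> list[str]:
    """Return the ordered steps the diagram describes for one incoming input."""
    # First branch of the diagram: the LLM consumes media directly.
    if llm.handles_audio or llm.handles_video:
        return ["forward media to LLM", "LLM response"]
    # Otherwise audio goes through STT and the LLM only sees text.
    steps = ["STT", "send transcript to LLM", "LLM text response"]
    # Nested branch: synthesize speech only when the LLM needs TTS.
    if llm.needs_tts:
        steps.append("TTS")
    return steps


traditional = Capabilities()
realtime = Capabilities(
    handles_audio=True, handles_video=True, needs_stt=False, needs_tts=False
)
print(plan_route(traditional))
print(plan_route(realtime))
```

The defaults assume a text-only model that needs both STT and TTS, which matches how the review describes a traditional LLM.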
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
agents-core/vision_agents/core/agents/agents.py (13 hunks)
agents-core/vision_agents/core/llm/llm.py (1 hunks)
agents-core/vision_agents/core/llm/realtime.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (.cursor/rules/python.mdc)
**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide
Files:
agents-core/vision_agents/core/llm/llm.py
agents-core/vision_agents/core/llm/realtime.py
agents-core/vision_agents/core/agents/agents.py
🧬 Code graph analysis (1)
agents-core/vision_agents/core/agents/agents.py (4)
agents-core/vision_agents/core/edge/sfu_events.py (16)
participant(1496-1501) participant(1504-1507) participant(1545-1550) participant(1553-1556) participant(1625-1630) participant(1633-1636) participant(2100-2105) participant(2108-2111) participant(2156-2161) participant(2164-2167) user_id(489-493) user_id(856-860) user_id(901-905) user_id(1186-1190) user_id(2093-2097) user_id(2142-2146)
agents-core/vision_agents/core/events/base.py (1)
user_id(45-48)
agents-core/vision_agents/core/turn_detection/events.py (1)
TurnEndedEvent(29-45)
agents-core/vision_agents/core/llm/llm.py (1)
simple_response(57-63)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: unit / Test "not integration"
- GitHub Check: unit / Ruff & mypy
- GitHub Check: unit / Test "not integration"
🔇 Additional comments (10)
agents-core/vision_agents/core/llm/llm.py (1)
37-42: LGTM! Clean capability flag introduction. The explicit capability flags provide clear contracts for Agent integration, and the defaults correctly represent a traditional LLM's requirements.
agents-core/vision_agents/core/agents/agents.py (9)
417-419: LGTM! Correct capability-based short-circuit. The early return appropriately skips the STT-to-LLM flow when the LLM directly consumes audio.
791-818: LGTM! Clearer event handler naming and correct capability check. The rename from on_track to on_video_track_added improves clarity, and the video forwarding correctly depends on llm.handles_video.
821-844: LGTM! Consistent naming and correct capability gating. The handler rename mirrors on_video_track_added, and track switching is appropriately gated by llm.handles_video.
865-874: LGTM! Correct audio routing based on capabilities. The conditional correctly routes audio directly to the LLM when it handles audio, or to STT otherwise.
1117-1147: LGTM! Well-structured turn-end handling. The refactored logic correctly accumulates per-user transcripts and triggers LLM responses on turn completion, with appropriate early-exit for agent self-speech.
1160-1162: LGTM! Correct audio publishing logic. The property appropriately returns True when either TTS or the LLM itself produces audio output.
1194-1196: LGTM! Correct video input requirement check. The logic correctly determines video input needs based on processors or LLM video capability.
1245-1257: LGTM! Appropriate configuration validation. The validation correctly warns when conflicting STT/TTS services are configured alongside a realtime LLM.
1282-1298: LGTM! Correct audio track source selection. The logic appropriately selects the audio track source based on whether the LLM produces audio directly.
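Two of the checks praised above (audio publishing and video input) reduce to simple boolean expressions. The sketch below restates them as standalone functions under the assumption that "TTS is configured" means a non-None TTS object and that "processors" is a plain sequence; the function names and signatures are hypothetical and are not taken from agents.py.

```python
from typing import Optional, Sequence


def publish_audio(tts: Optional[object], llm_handles_audio: bool) -> bool:
    """True when either a TTS service or the LLM itself produces audio output."""
    return tts is not None or llm_handles_audio


def needs_video_input(processors: Sequence[object], llm_handles_video: bool) -> bool:
    """True when processors are present or the LLM consumes video directly."""
    return len(processors) > 0 or llm_handles_video


# A text-only LLM with no TTS configured publishes no audio.
print(publish_audio(None, llm_handles_audio=False))
# A model that handles audio natively publishes audio even without TTS.
print(publish_audio(None, llm_handles_audio=True))
```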
# Skip the turn event handling if the model doesn't require TTS or SST audio itself.
if not (self.llm.needs_tts and self.llm.needs_stt):
    return
Fix typo and clarify conditional logic.
Line 1088 contains "SST" (should be "STT"), and the phrasing could be clearer about the conjunction.
Apply this diff:
- # Skip the turn event handling if the model doesn't require TTS or SST audio itself.
+ # Skip turn event handling if the model handles audio directly (doesn't need both STT and TTS).
  if not (self.llm.needs_tts and self.llm.needs_stt):
      return
📝 Committable suggestion
- # Skip the turn event handling if the model doesn't require TTS or SST audio itself.
- if not (self.llm.needs_tts and self.llm.needs_stt):
-     return
+ # Skip turn event handling if the model handles audio directly (doesn't need both STT and TTS).
+ if not (self.llm.needs_tts and self.llm.needs_stt):
+     return
🤖 Prompt for AI Agents
In agents-core/vision_agents/core/agents/agents.py around lines 1088 to 1090,
fix the typo "SST" to "STT" and clarify the conditional to match the comment
intent: change the conjunction so we skip handling only when the model requires
neither TTS nor STT (use OR inside the parentheses), i.e. replace the current
condition with one that returns when not (self.llm.needs_tts or
self.llm.needs_stt) and keep the rest of the surrounding logic unchanged.
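The diff in the comment keeps `and`, while the agent prompt asks for `or`. The two conditions are not equivalent, and a small truth table makes the difference visible. The helper names below are hypothetical; only the two boolean expressions come from the review text.

```python
from itertools import product


def skip_unless_both(needs_tts: bool, needs_stt: bool) -> bool:
    """Condition as written in the PR: skip unless both services are needed."""
    return not (needs_tts and needs_stt)


def skip_if_neither(needs_tts: bool, needs_stt: bool) -> bool:
    """Condition proposed in the agent prompt: skip only when neither is needed."""
    return not (needs_tts or needs_stt)


rows = [
    (tts, stt, skip_unless_both(tts, stt), skip_if_neither(tts, stt))
    for tts, stt in product([False, True], repeat=2)
]
for tts, stt, with_and, with_or in rows:
    print(f"needs_tts={tts!s:5} needs_stt={stt!s:5} and-form={with_and!s:5} or-form={with_or}")

# The two forms disagree exactly when one service is needed and the other is not.
differing = [(tts, stt) for tts, stt, a, b in rows if a != b]
print(differing)
```

By De Morgan's laws, `not (a and b)` equals `(not a) or (not b)`, so the `and` form also skips turn handling for a model that needs only STT or only TTS. Which behavior is intended is a question for the PR author.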
Actionable comments posted: 0
🧹 Nitpick comments (1)
agents-core/vision_agents/core/llm/realtime.py (1)
45-50: The capability flags are well-implemented. The type hints that were flagged in the previous review have been added. All four attributes now have explicit bool type annotations, and the defaults correctly reflect that Realtime models handle audio/video directly without requiring separate STT/TTS services. Consider documenting these public attributes in the class docstring using an "Attributes:" section per Google style guide, though this is not critical for functionality.
Optional: Add attributes documentation
You could enhance the class docstring (after line 40) with an attributes section:
Attributes:
    handles_audio: Indicates this model can process audio input directly.
    handles_video: Indicates this model can process video input directly.
    needs_stt: Indicates whether speech-to-text service is required.
    needs_tts: Indicates whether text-to-speech service is required.
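For context, this is roughly how such an "Attributes:" section sits inside a Google-style class docstring. `RealtimeSketch` is a hypothetical stand-in, not the project's `Realtime` class; the flag names and defaults follow what the review describes.

```python
class RealtimeSketch:
    """Hypothetical realtime model, shown only to illustrate the docstring layout.

    Attributes:
        handles_audio: Indicates this model can process audio input directly.
        handles_video: Indicates this model can process video input directly.
        needs_stt: Indicates whether speech-to-text service is required.
        needs_tts: Indicates whether text-to-speech service is required.
    """

    # Defaults reflect a model that consumes media directly.
    handles_audio: bool = True
    handles_video: bool = True
    needs_stt: bool = False
    needs_tts: bool = False


print(RealtimeSketch.__doc__)
```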
📜 Review details
📒 Files selected for processing (1)
agents-core/vision_agents/core/llm/realtime.py(1 hunks)
94f7448 to
c1584e0
Actionable comments posted: 0
♻️ Duplicate comments (1)
agents-core/vision_agents/core/agents/agents.py (1)
1093-1095: Fix typo "SST" → "STT" in comment. Line 1093 contains "SST" which should be "STT". The comment could also be clearer about the conditional logic.
Apply this diff to fix the typo:
- # Skip the turn event handling if the model doesn't require TTS or SST audio itself.
+ # Skip turn event handling if the model doesn't require both STT and TTS.
  if not (self.llm.needs_tts and self.llm.needs_stt):
      return
🧹 Nitpick comments (1)
agents-core/vision_agents/core/agents/agents.py (1)
1143-1146: Simplify participant handling – use existing event attribute. Lines 1143-1146 attempt to extract participant from event.custom, but event.participant is already available and validated at line 1128. The custom metadata extraction appears redundant. Consider this simplification:
- # Create participant object if we have metadata
- participant = None
- if hasattr(event, "custom") and event.custom:
-     # Try to extract participant info from custom metadata
-     participant = event.custom.get("participant")
- # Trigger LLM response with the complete transcript
- await self.simple_response(transcript, participant)
+ await self.simple_response(transcript, event.participant)
📜 Review details
📒 Files selected for processing (3)
agents-core/vision_agents/core/agents/agents.py (12 hunks)
agents-core/vision_agents/core/llm/llm.py (1 hunks)
agents-core/vision_agents/core/llm/realtime.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- agents-core/vision_agents/core/llm/llm.py
- agents-core/vision_agents/core/llm/realtime.py
🧬 Code graph analysis (1)
agents-core/vision_agents/core/agents/agents.py (3)
agents-core/vision_agents/core/edge/sfu_events.py (16)
participant(1496-1501) participant(1504-1507) participant(1545-1550) participant(1553-1556) participant(1625-1630) participant(1633-1636) participant(2100-2105) participant(2108-2111) participant(2156-2161) participant(2164-2167) user_id(489-493) user_id(856-860) user_id(901-905) user_id(1186-1190) user_id(2093-2097) user_id(2142-2146)
agents-core/vision_agents/core/events/base.py (1)
user_id(45-48)
agents-core/vision_agents/core/llm/llm.py (1)
simple_response(77-83)
🔇 Additional comments (9)
agents-core/vision_agents/core/agents/agents.py (9)
422-424: LGTM – Clean capability flag usage. The early return when llm.handles_audio is appropriate here, avoiding redundant LLM invocations when the model processes audio natively.
796-823: Handler renamed appropriately with correct flag usage. The rename from on_track to on_video_track_added improves clarity, and the llm.handles_video check at line 810 correctly gates video forwarding.
826-849: Handler renamed appropriately with correct flag usage. The rename to on_video_track_removed is clearer, and the llm.handles_video check at line 844 properly determines whether to switch tracks.
870-879: Correct audio routing based on capability flag. The llm.handles_audio check appropriately routes audio either directly to the LLM or through STT processing.
971-1002: Appropriate video forwarding based on capability flag. The llm.handles_video check at line 971 correctly determines whether to forward video frames to the LLM, with proper handling of both processed and raw video tracks.
1165-1167: Correct audio publishing determination. The llm.handles_audio check properly determines whether to publish audio, accounting for both TTS and native LLM audio handling.
1199-1199: Correct video input determination. The llm.handles_video check appropriately determines when video input is needed from participants.
1250-1262: Appropriate configuration validation. The llm.handles_audio check correctly validates that Realtime mode doesn't have conflicting STT/TTS/Turn Detection services configured.
1287-1303: Correct audio track initialization. The llm.handles_audio check properly determines whether to use the LLM's output track or create a new audio track for TTS.
  )

- if self.realtime_mode and isinstance(self.llm, Realtime):
+ if self.llm.handles_video:
Bug: Video Handling Method Called on Non-Realtime LLMs
The code checks if self.llm.handles_video: and then calls self.llm._watch_video_track(), but _watch_video_track is only defined in the Realtime class, not in the base LLM class. A non-Realtime LLM that sets handles_video = True would cause an AttributeError. The code should check isinstance(self.llm, Realtime) before calling this method, similar to how it's done at lines 911, 992, and 1003.
  # when in Realtime mode call the Realtime directly (non-blocking)
- if self.realtime_mode and isinstance(self.llm, Realtime):
+ if self.llm.handles_audio:
Bug: Audio handling sanity check missing type guard
The code checks if self.llm.handles_audio: and then calls self.llm.simple_audio_response(), but simple_audio_response is an abstract method only defined in the Realtime class, not in the base LLM class. A non-Realtime LLM that sets handles_audio = True would cause an AttributeError. The code should check isinstance(self.llm, Realtime) before calling this method.
  # Set up audio track if TTS is available
  if self.publish_audio:
-     if self.realtime_mode and isinstance(self.llm, Realtime):
+     if self.llm.handles_audio:
Bug: AttributeError Risk from Audio Handling Mismatch
The code checks if self.llm.handles_audio: and then accesses self.llm.output_track, but output_track is only defined in the Realtime class, not in the base LLM class. A non-Realtime LLM that sets handles_audio = True would cause an AttributeError. The code should check isinstance(self.llm, Realtime) before accessing this attribute.
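The failure mode described in these three Bugbot comments can be reproduced with simplified stand-ins. The classes below are hypothetical miniatures that borrow the names `LLM` and `Realtime` for readability; `CustomLLM`, `pick_audio_track`, and the string track values are invented for the example, and the `isinstance` guard is the fix Bugbot suggests rather than a confirmed change in the PR.

```python
class LLM:
    """Simplified stand-in for the base class; it has no output_track."""

    handles_audio: bool = False


class Realtime(LLM):
    """Simplified stand-in for a realtime model that owns an output track."""

    handles_audio = True

    def __init__(self) -> None:
        self.output_track = "realtime-audio-track"


class CustomLLM(LLM):
    """A non-Realtime subclass that sets the flag without defining output_track."""

    handles_audio = True


def pick_audio_track(llm: LLM) -> str:
    """Use the LLM's own track only when the flag is set and the type provides it."""
    if llm.handles_audio and isinstance(llm, Realtime):
        return llm.output_track
    # Fall back to a fresh track, as the agent would for TTS output.
    return "new-tts-track"


# Checking the flag alone is not enough for CustomLLM.
try:
    CustomLLM().output_track
    unguarded_raised = False
except AttributeError:
    unguarded_raised = True

print(unguarded_raised)
print(pick_audio_track(Realtime()))
print(pick_audio_track(CustomLLM()))
```

An alternative that stays closer to the PR's goal of avoiding `isinstance` checks would be to declare the required members on the base class, but that is a design decision the source does not settle.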
Added needs_tts, needs_stt, handles_audio and handles_video properties to the LLM and Realtime classes. This way, Agent doesn't need to do isinstance(llm, <LLM | Realtime>) checks to decide whether to forward video, audio, etc.
Note
Add needs_stt/needs_tts/handles_audio/handles_video flags to LLM/Realtime and refactor Agent logic to route audio/video and turn handling based on these capabilities.
- needs_stt, needs_tts, handles_audio, handles_video in LLM; set Realtime to handle audio/video and not need STT/TTS.
- realtime_mode/isinstance checks with capability flags across audio/video routing, publishing, and input needs.
- handles_audio and needs_{stt,tts}; avoid loops when agent is speaking.
- llm.output_track when handles_audio.
- handles_video; rename handlers to on_video_track_added/removed and clean up track switching logic.
Written by Cursor Bugbot for commit 4c3b8b7. This will update automatically on new commits.