Merged
44 changes: 37 additions & 7 deletions src/praisonai-agents/praisonaiagents/agent/router_agent.py
@@ -10,7 +10,8 @@
from typing import Dict, List, Optional, Any, Union
from .agent import Agent
from ..llm.model_router import ModelRouter
from ..llm import LLM
from ..llm import LLM, TokenUsage
from ..trace.protocol import get_default_emitter

logger = logging.getLogger(__name__)

@@ -213,8 +214,8 @@ def _execute_with_model(
full_prompt = f"{context}\n\n{prompt}"

try:
# Execute with the selected model
response = llm_instance.get_response(
# Execute with the selected model, requesting token usage tracking
result = llm_instance.get_response(
prompt=full_prompt,
system_prompt=self._build_system_prompt(),
tools=tools,
Comment on lines +217 to 221

Action required

2. RouterAgent token_usage not persisted 📎 Requirement gap ✧ Quality

RouterAgent computes token_usage and an estimated_cost but only stores them in in-memory
model_usage_stats and emits them to trace metadata. It does not persist token/cost data into
chat_history or session metadata for later attribution.
Agent Prompt
## Issue description
RouterAgent tracks per-call `token_usage` and `estimated_cost` but does not persist these values into `chat_history` or session metadata, so later attribution/analysis is not possible.

## Issue Context
The project includes a SessionStore that supports per-message `metadata`, and compliance requires storing routing token/cost tracking in chat history or session metadata after routed interactions.

## Fix Focus Areas
- src/praisonai-agents/praisonaiagents/agent/router_agent.py[217-269]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
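
One possible shape for the fix (a minimal sketch, not code from this PR — the session_store attribute and its add_message(...) signature are assumptions drawn from the issue context above):

# Hypothetical sketch: persist per-call routing usage for later attribution.
# `session_store` and `add_message(...)` are assumed names, not verified API.
def _persist_routing_usage(self, response, model_name, token_usage, cost):
    entry_metadata = {
        'selected_model': model_name,
        'routing_strategy': self.routing_strategy,
        'token_usage': token_usage.to_dict(),
        'estimated_cost': cost,  # per-call cost, not the running total
    }
    # Keep the in-memory history attributable...
    self.chat_history.append({
        'role': 'assistant',
        'content': response,
        'metadata': entry_metadata,
    })
    # ...and mirror into the session store when one is attached.
    store = getattr(self, 'session_store', None)
    if store is not None:
        store.add_message(role='assistant', content=response,
                          metadata=entry_metadata)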

@@ -225,16 +226,45 @@ def _execute_with_model(
agent_role=self.role,
agent_tools=[t.__name__ if hasattr(t, '__name__') else str(t) for t in (tools or [])],
execute_tool_fn=self.execute_tool if tools else None,
return_token_usage=True, # Request token usage information
**kwargs
)

# Extract response and token usage
if isinstance(result, tuple):
response, token_usage = result
else:
# Fallback for backward compatibility
response = result
token_usage = TokenUsage()

# Update usage statistics
self.model_usage_stats[model_name]['calls'] += 1
self.model_usage_stats[model_name]['tokens'] += token_usage.total_tokens

# Calculate and store cost estimate
model_info = self.model_router.get_model_info(model_name)
if model_info and token_usage.total_tokens > 0:
cost = self.model_router.estimate_cost(model_name, token_usage.total_tokens)
self.model_usage_stats[model_name]['cost'] += cost
Comment on lines +245 to +249

Contributor

⚠️ Potential issue | 🟠 Major

Emit per-decision cost in the trace event.

estimated_cost is populated with the running model total after accumulation. From the second call onward, each trace event re-includes earlier spend, so any event-level aggregation will overcount. Emit the current call's cost here, or rename the field to cumulative_estimated_cost.

Suggested patch
-            model_info = self.model_router.get_model_info(model_name)
-            if model_info and token_usage.total_tokens > 0:
-                cost = self.model_router.estimate_cost(model_name, token_usage.total_tokens)
+            cost = 0.0
+            model_info = self.model_router.get_model_info(model_name)
+            if model_info and token_usage.total_tokens > 0:
+                cost = self.model_router.estimate_cost(model_name, token_usage.total_tokens)
                 self.model_usage_stats[model_name]['cost'] += cost
-                        'estimated_cost': self.model_usage_stats[model_name]['cost'],
+                        'estimated_cost': cost,
+                        'cumulative_estimated_cost': self.model_usage_stats[model_name]['cost'],

Also applies to: 251-263

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/praisonai-agents/praisonaiagents/agent/router_agent.py` around lines 245-249:
the trace currently emits a cumulative estimated cost
(model_usage_stats[model_name]['cost']) after you add the current call's cost,
causing later events to include prior spend; change the trace payload to include
the per-call cost variable you compute (cost =
self.model_router.estimate_cost(...)) instead of the accumulated
model_usage_stats value, or if you intentionally want cumulative, rename the
emitted field to cumulative_estimated_cost; update references around
model_router.get_model_info, estimate_cost, model_usage_stats and where the
trace event is built so the event-level key holds the single-call cost (or the
renamed cumulative field) accordingly.


# TODO: Implement token tracking when LLM.get_response() is updated to return token usage
# The LLM response currently returns only text, but litellm provides usage info in:
# response.get("usage") with prompt_tokens, completion_tokens, and total_tokens
# This would require modifying the LLM class to return both text and metadata
# Emit token usage via trace system for observability
try:
trace_emitter = get_default_emitter()
trace_emitter.output(
content=f"RouterAgent routing decision completed",
agent_name=self.name,
metadata={
'selected_model': model_name,
'routing_strategy': self.routing_strategy,
'token_usage': token_usage.to_dict(),
'estimated_cost': self.model_usage_stats[model_name]['cost'],
'total_calls': self.model_usage_stats[model_name]['calls'],
}
)
except Exception as trace_error:
# Don't fail the request if tracing fails
logger.debug(f"Failed to emit trace event: {trace_error}")

return response

7 changes: 6 additions & 1 deletion src/praisonai-agents/praisonaiagents/llm/__init__.py
@@ -94,6 +94,10 @@ def __getattr__(name):
from .rate_limiter import RateLimiter
_lazy_cache[name] = RateLimiter
return RateLimiter
elif name == "TokenUsage":
from .llm import TokenUsage
_lazy_cache[name] = TokenUsage
return TokenUsage

raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

@@ -117,5 +121,6 @@ def __getattr__(name):
"ModelProfile",
"TaskComplexity",
"create_routing_agent",
"RateLimiter"
"RateLimiter",
"TokenUsage"
]
107 changes: 99 additions & 8 deletions src/praisonai-agents/praisonaiagents/llm/llm.py
@@ -4,6 +4,7 @@
import re
import inspect
import asyncio
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union, Literal, Callable, TYPE_CHECKING

if TYPE_CHECKING:
@@ -90,6 +91,36 @@ def _is_context_limit_error(self, error_message: str) -> bool:
]
return any(phrase in error_message.lower() for phrase in context_limit_phrases)


@dataclass
class TokenUsage:
"""
Token usage information from LLM response.

This class provides structured access to token consumption data
returned by language models, enabling cost tracking and observability.
"""
prompt_tokens: int = 0
completion_tokens: int = 0
total_tokens: int = 0
cached_tokens: int = 0
reasoning_tokens: int = 0
audio_input_tokens: int = 0
audio_output_tokens: int = 0

def to_dict(self) -> Dict[str, int]:
"""Convert to dictionary format."""
return {
'prompt_tokens': self.prompt_tokens,
'completion_tokens': self.completion_tokens,
'total_tokens': self.total_tokens,
'cached_tokens': self.cached_tokens,
'reasoning_tokens': self.reasoning_tokens,
'audio_input_tokens': self.audio_input_tokens,
'audio_output_tokens': self.audio_output_tokens,
}


class LLM:
"""
Easy to use wrapper for language models. Supports multiple providers like OpenAI,
@@ -1566,10 +1597,24 @@ def get_response(
stream: bool = True,
stream_callback: Optional[Callable] = None,
emit_events: bool = False,
return_token_usage: bool = False,
**kwargs
) -> str:
) -> Union[str, tuple[str, TokenUsage]]:
Comment on lines +1600 to +1602

Contributor

⚠️ Potential issue | 🟠 Major

Mirror return_token_usage in get_response_async().

The sync API now exposes token usage, but the async counterpart still advertises -> str and has no matching flag/tuple contract. That makes cost observability depend on whether the caller used the sync or async path. As per coding guidelines, "All I/O operations must have both sync and async variants; never block the event loop with sync I/O in async context; use asyncio primitives for coordination, not threading".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/praisonai-agents/praisonaiagents/llm/llm.py` around lines 1600-1602:
the async method get_response_async currently lacks the return_token_usage
parameter and still declares a return type of str; update get_response_async to
mirror the sync variant by adding return_token_usage: bool = False to its
signature, change its declared return type to Union[str, tuple[str,
TokenUsage]], and ensure the implementation collects TokenUsage (same
structure/type used by the sync get_response) and returns (response_text,
token_usage) when return_token_usage is True, otherwise just response_text;
locate the implementation inside get_response_async and propagate the flag
through any helper calls so token accounting is computed in the async path.
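
A sketch of the mirrored async signature the comment asks for (the _arequest helper below is a hypothetical stand-in for the existing async request flow; only TokenUsage and _extract_token_usage come from this diff):

async def get_response_async(
    self,
    prompt: str,
    return_token_usage: bool = False,  # same default as the sync variant
    **kwargs,
) -> Union[str, tuple[str, TokenUsage]]:
    """Async twin of get_response with the same flag/tuple contract."""
    # _arequest is a hypothetical placeholder; the real method should capture
    # the raw provider response just as the sync path captures _final_llm_response.
    response_text, raw_response = await self._arequest(prompt, **kwargs)
    if not return_token_usage:
        return response_text
    token_usage = self._extract_token_usage(raw_response) if raw_response is not None else None
    return response_text, token_usage if token_usage is not None else TokenUsage()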

"""Enhanced get_response with all OpenAI-like features"""
logging.debug(f"Getting response from {self.model}")

# Variable to store final response for token usage extraction
_final_llm_response = None
Comment on lines +1606 to +1607

Contributor

⚠️ Potential issue | 🔴 Critical

Most execution paths still drop the raw usage payload.

_final_llm_response is only populated in the non-streaming Chat Completions branches added here. The OpenAI Responses API flow and the successful streaming flow still synthesize final_response objects without any usage, so return_token_usage=True falls back to empty metrics for exactly the models RouterAgent now observes.

Please capture resp in the Responses API branches as well, and thread the terminal raw response or usage object out of the streaming helpers instead of rebuilding a usage-less dict.

Also applies to: 1903-1903, 2134-2134, 2326-2326

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/praisonai-agents/praisonaiagents/llm/llm.py` around lines 1606-1607:
the code only sets _final_llm_response in non-streaming Chat Completions
branches so Responses API and successful streaming flows lose usage data; modify
the Responses API branches to capture the raw resp (the OpenAI Responses API
return) into _final_llm_response, and update the streaming helper(s) that
currently synthesize final_response to return/propagate the terminal raw
response or at least its usage object back to the caller (instead of rebuilding
a usage-less dict); ensure callers that honor return_token_usage read usage from
_final_llm_response (or the value threaded out from the streaming helpers) so
return_token_usage=True yields correct metrics for models observed by
RouterAgent.
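
One way to thread usage out of a streaming flow (an illustrative sketch, assuming the OpenAI-style behavior where stream_options={"include_usage": True} makes the final chunk carry a populated usage field; the helper name is made up):

# Illustrative helper, not code from this PR: collect streamed text AND the
# terminal usage payload instead of rebuilding a usage-less dict afterwards.
def _consume_stream(chunks):
    text_parts = []
    usage = None
    for chunk in chunks:
        # With include_usage, the last chunk typically has `usage` set and
        # an empty `choices` list (OpenAI-style streaming).
        if getattr(chunk, "usage", None):
            usage = chunk.usage
        for choice in getattr(chunk, "choices", None) or []:
            delta = getattr(choice, "delta", None)
            if delta is not None and getattr(delta, "content", None):
                text_parts.append(delta.content)
    # The caller can store `usage` (or the final raw chunk) in _final_llm_response.
    return "".join(text_parts), usage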


# Helper closure to return appropriate format based on return_token_usage
def _prepare_return_value(text: str) -> Union[str, tuple]:
if not return_token_usage:
return text
token_usage = self._extract_token_usage(_final_llm_response) if _final_llm_response else None
if token_usage is None:
token_usage = TokenUsage()
return text, token_usage

# Log all self values when in debug mode
self._log_llm_config(
'LLM instance',
@@ -1864,6 +1909,7 @@ def get_response(
reasoning_content = resp["choices"][0]["message"].get("provider_specific_fields", {}).get("reasoning_content")
response_text = resp["choices"][0]["message"]["content"]
final_response = resp
_final_llm_response = resp # Store for token usage extraction

# Emit StreamEvent for reasoning content if callback provided
if _emit and reasoning_content:
@@ -2094,6 +2140,7 @@ def get_response(
**kwargs
)
)
_final_llm_response = final_response # Store for token usage extraction
# Handle None content from Gemini
response_content = final_response["choices"][0]["message"].get("content")
response_text = response_content if response_content is not None else ""
@@ -2285,6 +2332,7 @@ def get_response(
**kwargs
)
)
_final_llm_response = final_response # Store for token usage extraction
# Handle None content from Gemini
response_content = final_response["choices"][0]["message"].get("content")
response_text = response_content if response_content is not None else ""
@@ -2698,7 +2746,7 @@ def get_response(
task_id=task_id
)
callback_executed = True
return final_response_text
return _prepare_return_value(final_response_text)

# No tool calls were made in this iteration, return the response
generation_time_val = time.time() - start_time
Expand Down Expand Up @@ -2787,7 +2835,7 @@ def get_response(
task_id=task_id
)
callback_executed = True
return response_text
return _prepare_return_value(response_text)

if not self_reflect:
if verbose and not interaction_displayed:
@@ -2816,8 +2864,8 @@

# Return reasoning content if reasoning_steps is True
if reasoning_steps and stored_reasoning_content:
return stored_reasoning_content
return response_text
return _prepare_return_value(stored_reasoning_content)
return _prepare_return_value(response_text)

# Handle self-reflection loop
while reflection_count < max_reflect:
@@ -2999,7 +3047,7 @@ def get_response(
agent_name=agent_name, agent_role=agent_role, agent_tools=agent_tools,
task_name=task_name, task_description=task_description, task_id=task_id)
interaction_displayed = True
return response_text
return _prepare_return_value(response_text)
continue
except Exception as e:
_get_display_functions()['display_error'](f"Error in LLM response: {str(e)}")
@@ -3010,12 +3058,12 @@ def get_response(
_get_display_functions()['display_interaction'](prompt, response_text, markdown=markdown,
generation_time=time.time() - start_time, console=self.console)
interaction_displayed = True
return response_text
return _prepare_return_value(response_text)

except Exception as error:
_get_display_functions()['display_error'](f"Error in get_response: {str(error)}")
raise

# Log completion time if in debug mode
if logging.getLogger().getEffectiveLevel() == logging.DEBUG:
total_time = time.time() - start_time
@@ -4192,6 +4240,49 @@ def _track_token_usage(self, response: Dict[str, Any], model: str) -> Optional[T
logging.warning(f"Failed to track token usage: {e}")
return None

def _extract_token_usage(self, response: Union[Dict[str, Any], Any]) -> Optional[TokenUsage]:
"""Extract token usage from LiteLLM response for public API."""
try:
usage = None

# Handle both dict and ModelResponse object formats
if isinstance(response, dict):
usage = response.get("usage", {})
else:
# ModelResponse object
usage = getattr(response, 'usage', None)

if not usage:
return None

# Extract token counts with support for both dict and object access
if isinstance(usage, dict):
return TokenUsage(
prompt_tokens=usage.get("prompt_tokens", 0),
completion_tokens=usage.get("completion_tokens", 0),
total_tokens=usage.get("total_tokens", 0),
cached_tokens=usage.get("cached_tokens", 0),
reasoning_tokens=usage.get("reasoning_tokens", 0),
audio_input_tokens=usage.get("audio_input_tokens", 0),
audio_output_tokens=usage.get("audio_output_tokens", 0),
)
else:
# Object-style access
return TokenUsage(
prompt_tokens=getattr(usage, 'prompt_tokens', 0) or 0,
completion_tokens=getattr(usage, 'completion_tokens', 0) or 0,
total_tokens=getattr(usage, 'total_tokens', 0) or 0,
cached_tokens=getattr(usage, 'cached_tokens', 0) or 0,
reasoning_tokens=getattr(usage, 'reasoning_tokens', 0) or 0,
audio_input_tokens=getattr(usage, 'audio_input_tokens', 0) or 0,
audio_output_tokens=getattr(usage, 'audio_output_tokens', 0) or 0,
)

except Exception as e:
if self.verbose:
logging.warning(f"Failed to extract token usage: {e}")
return None
Comment on lines +4243 to +4284

Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Scripts executed to verify the finding (outputs omitted):

find . -name "llm.py" -path "*/praisonai-agents/*" | head -5
wc -l ./src/praisonai-agents/praisonaiagents/llm/llm.py
sed -n '4245,4286p' ./src/praisonai-agents/praisonaiagents/llm/llm.py
sed -n '1750,1760p' ./src/praisonai-agents/praisonaiagents/llm/llm.py
rg "RouterAgent" ./src/praisonai-agents/praisonaiagents/ -l
cat -n ./src/praisonai-agents/praisonaiagents/agent/router_agent.py | head -100
rg "cost" ./src/praisonai-agents/praisonaiagents/agent/router_agent.py -A 3 -B 3
rg "class TokenUsage" ./src/praisonai-agents/praisonaiagents/ -A 20
rg "_extract_token_usage" ./src/praisonai-agents/praisonaiagents/ -B 5 -A 5
rg "estimate_cost" ./src/praisonai-agents/praisonaiagents/llm/model_router.py -B 2 -A 10
sed -n '170,185p' ./src/praisonai-agents/praisonaiagents/agent/router_agent.py
sed -n '190,240p' ./src/praisonai-agents/praisonaiagents/agent/router_agent.py
sed -n '240,280p' ./src/praisonai-agents/praisonaiagents/agent/router_agent.py
rg "total_tokens" ./src/praisonai-agents/praisonaiagents/llm/model_router.py -B 2 -A 2
sed -n '1750,1770p' ./src/praisonai-agents/praisonaiagents/llm/llm.py
rg "input_tokens|output_tokens" ./src/praisonai-agents/praisonaiagents/llm/llm.py | head -20



Add fallback to Responses API token field names in _extract_token_usage().

The method at lines 4245-4286 only looks for prompt_tokens and completion_tokens, but the codebase already handles the Responses API format with input_tokens and output_tokens at line 1754. When responses use only the Responses API names, the method returns zero tokens, which causes RouterAgent's cost calculation to report 0.0 cost (since estimate_cost() multiplies by token count).

Add fallback logic to check input_tokens/output_tokens if the standard names are not present, and calculate total_tokens as their sum when not explicitly provided in the response.

Suggested patch
             if isinstance(usage, dict):
+                prompt_tokens = usage.get("prompt_tokens", usage.get("input_tokens", 0))
+                completion_tokens = usage.get("completion_tokens", usage.get("output_tokens", 0))
                 return TokenUsage(
-                    prompt_tokens=usage.get("prompt_tokens", 0),
-                    completion_tokens=usage.get("completion_tokens", 0),
-                    total_tokens=usage.get("total_tokens", 0),
+                    prompt_tokens=prompt_tokens,
+                    completion_tokens=completion_tokens,
+                    total_tokens=usage.get("total_tokens", prompt_tokens + completion_tokens),
                     cached_tokens=usage.get("cached_tokens", 0),
                     reasoning_tokens=usage.get("reasoning_tokens", 0),
                     audio_input_tokens=usage.get("audio_input_tokens", 0),
                     audio_output_tokens=usage.get("audio_output_tokens", 0),
                 )
             else:
+                prompt_tokens = getattr(usage, "prompt_tokens", None)
+                if prompt_tokens is None:
+                    prompt_tokens = getattr(usage, "input_tokens", 0) or 0
+                completion_tokens = getattr(usage, "completion_tokens", None)
+                if completion_tokens is None:
+                    completion_tokens = getattr(usage, "output_tokens", 0) or 0
                 return TokenUsage(
-                    prompt_tokens=getattr(usage, 'prompt_tokens', 0) or 0,
-                    completion_tokens=getattr(usage, 'completion_tokens', 0) or 0,
-                    total_tokens=getattr(usage, 'total_tokens', 0) or 0,
+                    prompt_tokens=prompt_tokens,
+                    completion_tokens=completion_tokens,
+                    total_tokens=getattr(usage, 'total_tokens', prompt_tokens + completion_tokens) or (prompt_tokens + completion_tokens),
                     cached_tokens=getattr(usage, 'cached_tokens', 0) or 0,
                     reasoning_tokens=getattr(usage, 'reasoning_tokens', 0) or 0,
                     audio_input_tokens=getattr(usage, 'audio_input_tokens', 0) or 0,
                     audio_output_tokens=getattr(usage, 'audio_output_tokens', 0) or 0,
                 )
🧰 Tools
🪛 Ruff (0.15.7)

[warning] 4283-4283: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/praisonai-agents/praisonaiagents/llm/llm.py` around lines 4245-4286:
the _extract_token_usage method fails to handle Responses API field names
(input_tokens/output_tokens), so update TokenUsage extraction in both dict and
object branches of _extract_token_usage to: if prompt_tokens/completion_tokens
are zero or missing, fall back to input_tokens and output_tokens respectively;
if total_tokens is missing or zero, compute it as the sum of prompt/completion
(or input/output) tokens; preserve other fields (cached_tokens,
reasoning_tokens, audio_*). Modify the dict branch (where usage.get(...) is
used) and the object branch (where getattr(usage, '...', 0) is used) to
implement these fallbacks so RouterAgent cost calculations (estimate_cost) see
correct token counts.
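
For reference, the two usage shapes the fallback has to reconcile (field values illustrative):

# Chat Completions style — what _extract_token_usage already reads:
chat_usage = {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42}

# Responses API style — no prompt_/completion_ keys, often no total_tokens:
responses_usage = {"input_tokens": 12, "output_tokens": 30}

# The suggested fallback maps input_tokens -> prompt_tokens,
# output_tokens -> completion_tokens, and sums them when the total is absent.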


def set_current_agent(self, agent_name: Optional[str]):
"""Set the current agent name for token tracking."""
self.current_agent_name = agent_name