CLI Cost Tracking Gap: Single-Task Execution Lacks Structured Cost Output #1356

@MervinPraison

Description

Problem Statement

When using the PraisonAI CLI in single-task mode (praisonai "TASK" --model MODEL), cost and token usage data is calculated internally but not output in a machine-readable format. This creates a significant gap for programmatic use cases like Terminal-Bench benchmarking where cost comparison between approaches is critical.

Comparison: Direct Agent vs CLI Wrapper

| Metric | Direct Agent (`Agent` class) | CLI Wrapper (subprocess) |
|---|---|---|
| Success Rate | 5/5 (100%) | 5/5 (100%) |
| Total Time | 262.47s | 105.25s (2.5x faster) |
| Total Cost | $0.017157 | N/A (not trackable) |
| Avg per Task | 52.49s | 21.05s |

Critical Gap: The CLI approach is 2.5x faster but provides zero cost visibility, making it impossible to optimize for both speed AND cost-efficiency in production workloads.


Root Cause Analysis

Architecture Gap

Direct Agent Path (Works):

Agent.start() → LLM calls → TokenUsage dataclass → _total_cost accumulation → cost_summary() method available
  • File: praisonaiagents/agent/agent.py:1898-1914
  • cost_summary() returns: {"tokens_in": int, "tokens_out": int, "cost": float, "llm_calls": int}

CLI Wrapper Path (Broken):

praisonai "TASK" → subprocess.run() → stdout/stderr only → No structured cost data at process exit
  • File: praisonai/cli/main.py - handle_direct_prompt() method prints result but not metrics

Where Cost Data Lives in CLI

The CLI does track cost internally (evidence found):

  1. CostTracker class (praisonai/cli/features/cost_tracker.py:140-201):

    • SessionStats.to_dict() returns complete cost data
    • Fields: total_cost, total_input_tokens, total_output_tokens, avg_cost_per_request
  2. Interactive TUI (praisonai/cli/main.py:6148-6168):

    • _handle_stats_command() shows cost via /stats command
    • Calculates: pricing.calculate_cost(input_tokens, output_tokens)
  3. Metrics Feature (praisonai/cli/main.py:965):

    • --metrics flag exists but only for interactive TUI mode
    • Missing: --metrics-json for single-task structured output

The Missing Bridge

When running praisonai "TASK" --model gpt-4o-mini:

  • ✅ CLI calculates cost internally in session_state
  • ❌ At process exit, only the text response is printed
  • ❌ No JSON blob with {"cost_usd": X, "tokens_in": Y, "tokens_out": Z} is output
  • ❌ Wrapper agent cannot capture cost data via subprocess

Evidence: Code Locations

Core SDK (Works)

# praisonaiagents/agent/agent.py:1898-1914
@property
def total_cost(self) -> float:
    """Cumulative USD cost of all LLM calls in this agent run."""
    return self._total_cost

@property
def cost_summary(self) -> dict:
    """Summary of cost and token usage."""
    return {
        "tokens_in": self._total_tokens_in,
        "tokens_out": self._total_tokens_out,
        "cost": self._total_cost,
        "llm_calls": self._llm_call_count,
    }

CLI (Missing Output)

# praisonai/cli/main.py (handle_direct_prompt method)
# Prints result but no cost metrics at exit
print(result)  # Line ~709
# Missing: print(json.dumps(session_stats.to_dict()))

TokenUsage Dataclass (Core SDK)

# praisonaiagents/llm/llm.py:96-121
@dataclass
class TokenUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    cached_tokens: int = 0
    # ... methods
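
For intuition, here is a self-contained sketch of how such a dataclass turns token counts into a USD figure. The per-million-token prices are illustrative assumptions for this sketch, not PraisonAI's actual pricing tables:

```python
from dataclasses import dataclass

# Illustrative USD prices per 1M tokens (assumed for this sketch only)
PRICE_IN_PER_M = 0.15
PRICE_OUT_PER_M = 0.60

@dataclass
class TokenUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def cost_usd(self) -> float:
        """Price this usage at the assumed per-million-token rates."""
        return (self.prompt_tokens * PRICE_IN_PER_M
                + self.completion_tokens * PRICE_OUT_PER_M) / 1_000_000

usage = TokenUsage(prompt_tokens=1000, completion_tokens=500)
print(f"{usage.cost_usd():.6f}")  # 0.000450
```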

Proposed Solutions (Ranked by Complexity)

Option 1: CLI Metrics Output Flag (Recommended - Low Complexity)

Add --metrics-json flag to CLI that outputs structured cost data at process end:

# In praisonai/cli/main.py at exit point
if args.metrics_json:
    print(json.dumps({
        "cost_usd": session_state['total_cost'],
        "tokens_in": session_state['total_input_tokens'],
        "tokens_out": session_state['total_output_tokens'],
        "model": session_state['current_model'],
        "request_count": session_state['request_count']
    }))
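
Wired into argparse, the flow could look like the following self-contained sketch. The flag and field names are the ones proposed above; the `session_state` dict is simulated here, not the CLI's real object:

```python
import argparse
import json

parser = argparse.ArgumentParser(prog="praisonai")
parser.add_argument("prompt")
parser.add_argument("--model", default="gpt-4o-mini")
parser.add_argument("--metrics-json", action="store_true",
                    help="emit a JSON metrics line at process exit")
args = parser.parse_args(["hello", "--metrics-json"])

# Stand-in for the CLI's internal session_state (simulated values)
session_state = {"total_cost": 0.000123, "total_input_tokens": 42,
                 "total_output_tokens": 17, "current_model": args.model,
                 "request_count": 1}

print("<model response text>")   # normal output first
if args.metrics_json:            # structured metrics as the final line
    print(json.dumps({
        "cost_usd": session_state["total_cost"],
        "tokens_in": session_state["total_input_tokens"],
        "tokens_out": session_state["total_output_tokens"],
        "model": session_state["current_model"],
        "request_count": session_state["request_count"],
    }))
```

Emitting the metrics as the final stdout line lets a subprocess caller print the response normally and `json.loads` only the last line.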

Files to modify:

  • praisonai/cli/main.py - Add flag and output logic in handle_direct_prompt() and main()

Benefits:

  • Minimal code change (~10 lines)
  • Follows existing CLI patterns
  • Benefits all CLI users, not just Terminal-Bench
  • Machine-readable output enables programmatic cost tracking
  • No performance impact (only executes when flag is set)

Option 2: Wrapper Agent Cost Estimation (Medium Complexity)

Calculate cost in the wrapper agent after execution using litellm's cost calculator:

from litellm import cost_per_token

# After the subprocess completes, estimate token counts from the prompt
# and output text (e.g. ~4 characters per token; estimated_tokens_in and
# estimated_tokens_out are placeholders), then price them with litellm's
# model pricing tables:
est_cost_in, est_cost_out = cost_per_token(
    model="gpt-4o-mini",
    prompt_tokens=estimated_tokens_in,
    completion_tokens=estimated_tokens_out,
)

Downside: Estimation only; token counts are inferred from text length rather than reported by the provider.

Option 3: Environment Variable Bridge (Medium Complexity)

CLI writes cost data to temp file via env var path, wrapper reads it:

# CLI side
if os.environ.get('PRAISONAI_COST_FILE'):
    with open(os.environ['PRAISONAI_COST_FILE'], 'w') as f:
        json.dump(cost_data, f)

# Wrapper side reads file after subprocess completes

Downside: More complex, requires file system coordination
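
The full round trip can be sketched with `tempfile`. Note that `PRAISONAI_COST_FILE` is this proposal's suggested variable name, not an existing CLI feature, and `cost_data` here is simulated:

```python
import json
import os
import tempfile

cost_data = {"cost_usd": 0.000123, "tokens_in": 42, "tokens_out": 17}

# Wrapper side: choose a path and export it before launching the CLI
fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
os.environ["PRAISONAI_COST_FILE"] = path

# CLI side (at process exit): dump metrics only if the caller asked
if os.environ.get("PRAISONAI_COST_FILE"):
    with open(os.environ["PRAISONAI_COST_FILE"], "w") as f:
        json.dump(cost_data, f)

# Wrapper side, after subprocess.run() returns: read the metrics back
with open(path) as f:
    captured = json.load(f)
os.remove(path)
print(captured["cost_usd"])  # 0.000123
```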

Option 4: Full Structured Logging (High Complexity)

Add comprehensive structured output mode to CLI with full execution metadata.

Downside: Overkill for this specific use case


Recommendation

Go with Option 1 - Add --metrics-json flag:

  1. Minimal change: ~10 lines of code
  2. No performance impact: Only executes when flag is explicitly set
  3. Follows patterns: --metrics flag already exists for TUI mode
  4. Universal benefit: All CLI users gain programmatic cost visibility
  5. Terminal-Bench unblocked: Wrapper agent can capture and compare costs

Implementation Plan

Phase 1: Core CLI Change

  1. Add --metrics-json argument to argument parser (praisonai/cli/main.py)
  2. In handle_direct_prompt() method, capture cost data from session_state
  3. At process exit, output JSON if flag is set
  4. Test with single task: praisonai "hello" --model gpt-4o-mini --metrics-json

Phase 2: Wrapper Agent Update

  1. Update praisonai_wrapper_agent.py to pass --metrics-json flag
  2. Parse JSON output from subprocess
  3. Populate Harbor AgentContext with cost data
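
Step 2 above could look like this sketch. The CLI stdout is simulated here (in the real wrapper it would come from `subprocess.run(..., capture_output=True, text=True)`), and the Harbor `AgentContext` population is omitted:

```python
import json

def parse_metrics(stdout: str) -> tuple[str, dict]:
    """Split CLI stdout into response text and the trailing metrics JSON."""
    lines = stdout.strip().splitlines()
    metrics = json.loads(lines[-1])  # --metrics-json emits the last line
    return "\n".join(lines[:-1]), metrics

# Simulated stdout from: praisonai "hello" --model gpt-4o-mini --metrics-json
fake_stdout = ('Hello! How can I help?\n'
               '{"cost_usd": 0.000123, "tokens_in": 42, "tokens_out": 17}\n')
response, metrics = parse_metrics(fake_stdout)
print(metrics["cost_usd"])  # 0.000123
```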

Phase 3: Verification

  1. Run comparison test again (5 tasks)
  2. Verify both approaches show cost data
  3. Confirm CLI cost matches Direct Agent cost for same task

Acceptance Criteria

  • praisonai "TASK" --model MODEL --metrics-json outputs valid JSON with cost data
  • JSON includes: cost_usd, tokens_in, tokens_out, model, request_count
  • Wrapper agent captures cost data and populates Harbor context
  • Comparison test shows cost for both approaches
  • No regression in existing CLI functionality
  • Zero performance impact when flag is not used
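
The first two criteria could be checked mechanically with a small validator; the schema below mirrors the field list proposed in this issue:

```python
import json

# Required fields and types, per the proposed --metrics-json schema
REQUIRED = {"cost_usd": float, "tokens_in": int, "tokens_out": int,
            "model": str, "request_count": int}

def validate_metrics(line: str) -> dict:
    """Parse a metrics line and check the proposed schema; raise on mismatch."""
    data = json.loads(line)
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"bad or missing field: {key}")
    return data

ok = validate_metrics('{"cost_usd": 0.000123, "tokens_in": 42, '
                      '"tokens_out": 17, "model": "gpt-4o-mini", '
                      '"request_count": 1}')
print(ok["model"])  # gpt-4o-mini
```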

Related Files

Core SDK (Cost Tracking Works)

  • praisonaiagents/agent/agent.py:1898-1914 - total_cost, cost_summary properties
  • praisonaiagents/llm/llm.py:96-121 - TokenUsage dataclass
  • praisonaiagents/agent/chat_mixin.py:680-699 - Cost accumulation during chat

CLI (Missing Output)

  • praisonai/cli/main.py:965 - --metrics flag (TUI only)
  • praisonai/cli/main.py:6148-6168 - _handle_stats_command() (interactive only)
  • praisonai/cli/features/cost_tracker.py:140-201 - SessionStats class with cost data

Wrapper Agent (Needs Cost)

  • examples/terminal_bench/praisonai_wrapper_agent.py - Currently cannot capture cost
  • examples/terminal_bench/test_agent_comparison.py - Shows cost gap in test results

Priority

High - Blocks production benchmarking and cost optimization workflows. Currently impossible to compare cost-efficiency of CLI vs Direct Agent approaches.

Labels

  • enhancement
  • cli
  • cost-tracking
  • terminal-bench
  • good-first-issue (Option 1 is straightforward)

Additional Context

This issue was discovered during Terminal-Bench 2.0 integration testing. The wrapper agent is 2.5x faster than direct Agent class but lacks cost visibility, making it impossible to optimize for both speed AND cost in production workloads.
