feat(eval): enqueue call trace ids onto evaluation redis queue #831

Open
narsimhaReddyJuspay wants to merge 1 commit into juspay:release from narsimhaReddyJuspay:pr-1-add-redis-quque-to-push-traces-which-needs-to-be-evaluated
Conversation

@narsimhaReddyJuspay narsimhaReddyJuspay commented Jun 15, 2026

Contributor

What

PR 1 of the internal LLM-as-judge evaluation system. On call end, extract the OTEL/Langfuse trace_id and push the bare id onto a Redis list (evaluation:trace_queue) for the evaluation worker (later phase) to drain. Only the trace_id is stored — every other field (tags, call_sid, transcription, payload, …) is fetched from Langfuse by the worker.

Changes

  • get_trace_id(span) in tracing_setup.py — returns the 32-char hex OTEL trace_id (1:1 with Langfuse trace.id), guarded by ENABLE_BREEZE_BUDDY_TRACING; None when tracing is off.
  • enqueue_trace_for_evaluation(trace_id) in services/langfuse/tasks/evaluation/queue.py: SETNX dedup marker (evaluation:enqueued:{trace_id}, 7d TTL) + RPUSH of the bare trace_id; best-effort (swallows Redis errors, never breaks call teardown).
  • Wired into end_conversation right after update_span_with_evaluation_data.
  • tests/conftest.py pre-imports the template package so the test process imports tracing_setup in the same order as the running app — sidesteps the latent template <-> handlers <-> tracing_setup import cycle without touching production imports (no TYPE_CHECKING).
  • Real-Redis unit + e2e tests.
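
The trace-id extraction described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in tracing_setup.py: the name `get_trace_id_sketch`, the `tracing_enabled` parameter (standing in for the ENABLE_BREEZE_BUDDY_TRACING flag), and the stub span are assumptions. It relies only on the fact that an OTEL span context carries the trace id as a 128-bit integer, which renders as 32 lowercase hex characters.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_trace_id_sketch(span: Optional[Any], tracing_enabled: bool) -> Optional[str]:
    """Return the 32-char lowercase hex trace id, or None if unavailable.

    `tracing_enabled` is a hypothetical stand-in for the real feature flag.
    """
    if not tracing_enabled or span is None:
        return None
    try:
        trace_id = span.get_span_context().trace_id  # 128-bit integer in OTEL
        if trace_id == 0:  # an all-zero trace id is invalid in OTEL
            return None
        return format(trace_id, "032x")  # zero-padded to 32 hex characters
    except Exception:
        # Best-effort: extraction problems must not break call teardown.
        return None


# Stub span for illustration; a real span comes from the OTEL SDK.
stub_span = SimpleNamespace(
    get_span_context=lambda: SimpleNamespace(trace_id=0xABC)
)
print(get_trace_id_sketch(stub_span, tracing_enabled=True))
```

With the real OpenTelemetry API, `opentelemetry.trace.format_trace_id` performs the same integer-to-hex formatting.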

Out of scope (later PRs)

Schemas/DB layer, the worker (drain queue → Langfuse fetch → LLM judge), actions/Slack, the REST API. The queue is written but not yet consumed — safe to merge standalone.

Testing

  • Unit (5): get_trace_id (none-span → None; real span → 32-hex), enqueue (empty-noop, push, dedup).
  • E2e (1): drives the real end_conversation handler with a real OTEL span + real Redis (bot/context/task stubbed) → asserts the trace_id lands on the queue.
  • Smoke: import app.main OK; server boots to "Application startup complete."
  • All green: black, isort, autoflake, pyrefly (0 errors on these files); 6/6 tests pass against local Redis (skip if Redis down).
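
One way the "skip if Redis down" behaviour mentioned above could be arranged is a cheap reachability probe evaluated before the tests run. The sketch below is an assumption about the approach, not the fixture in this PR; `redis_reachable`, the host, and the port are hypothetical, and a TCP connect only shows that something is listening on the port, not that it is Redis.

```python
import socket


def redis_reachable(host: str = "127.0.0.1", port: int = 6379, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, etc.
        return False


# Possible pytest usage (shown as a comment so the block has no third-party imports):
#
#   pytestmark = pytest.mark.skipif(
#       not redis_reachable(), reason="local Redis is not running"
#   )
```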

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Introduced automatic internal evaluation system that assesses conversations upon completion, enabling enhanced quality monitoring.
  • Tests

    • Added integration tests for the conversation evaluation enqueue mechanism and deduplication flow.

Copilot AI review requested due to automatic review settings June 15, 2026 11:28
coderabbitai Bot commented Jun 15, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bed8fdd0-a5e2-4022-8e66-4891997b2fb0

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

end_conversation now extracts the OTEL trace ID from the active root span via a new get_trace_id utility and enqueues it to a Redis list for LLM-as-judge evaluation. The evaluation queue uses Redis deduplication markers with a 7-day TTL. Integration tests cover the full path with a real Redis instance.

Changes

LLM-as-judge Evaluation Enqueue Pipeline

| Layer / File(s) | Summary |
| --- | --- |
| Redis evaluation queue and package exports (app/services/langfuse/tasks/evaluation/queue.py, app/services/langfuse/tasks/evaluation/__init__.py) | Introduces enqueue_trace_for_evaluation(trace_id) with Redis SETNX-based deduplication (7-day TTL) and RPUSH onto evaluation:trace_queue. Package __init__.py re-exports the function and constants via __all__. |
| OTEL trace ID extraction utility (app/ai/voice/agents/breeze_buddy/observability/tracing_setup.py) | Adds get_trace_id(span) returning the 32-char hex OTEL trace ID, or None when tracing is disabled, span is absent, or extraction fails (errors are logged and swallowed). |
| end_conversation handler integration (app/ai/voice/agents/breeze_buddy/handlers/internal/end_conversation.py) | Imports get_trace_id and enqueue_trace_for_evaluation; after writing span evaluation data, extracts the trace ID from context.root_span and conditionally enqueues it. |
| Tests: fixtures, unit, and end-to-end (tests/conftest.py, tests/test_evaluation_queue.py) | conftest.py fixes import ordering for test isolation. test_evaluation_queue.py adds a real-Redis async fixture, get_trace_id unit tests, enqueue/dedup Redis tests, and an end-to-end test through end_conversation asserting queue and dedup marker presence. |

Sequence Diagram

sequenceDiagram
  participant end_conversation
  participant get_trace_id
  participant enqueue_trace_for_evaluation
  participant Redis

  end_conversation->>end_conversation: update_span_with_evaluation_data(context)
  end_conversation->>get_trace_id: context.root_span
  get_trace_id-->>end_conversation: evaluation_trace_id (32-hex or None)
  alt trace id present
    end_conversation->>enqueue_trace_for_evaluation: evaluation_trace_id
    enqueue_trace_for_evaluation->>Redis: SET evaluation:enqueued:{id} NX EX 7d
    enqueue_trace_for_evaluation->>Redis: RPUSH evaluation:trace_queue id
    enqueue_trace_for_evaluation-->>end_conversation: True / False
  end
  end_conversation->>end_conversation: end_conversation_callbacks
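
The two Redis calls in the diagram above can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: `enqueue_sketch` and `InMemoryRedis` are hypothetical names, the in-memory class only imitates the `set(..., nx=True, ex=...)` and `rpush` calls of a redis-py style client (it records the TTL without enforcing it), and the test fixture described in this PR is async, so the real function may well be async too.

```python
from typing import Dict, List, Optional

QUEUE_KEY = "evaluation:trace_queue"
MARKER_PREFIX = "evaluation:enqueued:"
MARKER_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days


class InMemoryRedis:
    """Tiny stand-in for the two commands used below; not a real client."""

    def __init__(self) -> None:
        self.strings: Dict[str, str] = {}
        self.ttls: Dict[str, int] = {}
        self.lists: Dict[str, List[str]] = {}

    def set(self, key: str, value: str, nx: bool = False, ex: Optional[int] = None) -> Optional[bool]:
        if nx and key in self.strings:
            return None  # redis-py returns None when the NX condition fails
        self.strings[key] = value
        if ex is not None:
            self.ttls[key] = ex
        return True

    def rpush(self, key: str, *values: str) -> int:
        self.lists.setdefault(key, []).extend(values)
        return len(self.lists[key])


def enqueue_sketch(client: InMemoryRedis, trace_id: Optional[str]) -> bool:
    """Return True only when this call actually pushed the trace id."""
    if not trace_id:
        return False  # empty or missing id is a no-op
    try:
        created = client.set(MARKER_PREFIX + trace_id, "1", nx=True, ex=MARKER_TTL_SECONDS)
        if not created:
            return False  # marker already present: duplicate, skip the push
        client.rpush(QUEUE_KEY, trace_id)
        return True
    except Exception:
        # Best-effort: a Redis failure must never break call teardown.
        return False


redis_stub = InMemoryRedis()
first = enqueue_sketch(redis_stub, "a" * 32)
second = enqueue_sketch(redis_stub, "a" * 32)
print(first, second, redis_stub.lists[QUEUE_KEY])
```

Because the marker is written before the push in this ordering, a failed RPUSH would leave the marker behind for the full TTL; the inline review comment on queue.py in this thread raises exactly that concern.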
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 Hop, hop, a trace goes in the queue,
Redis holds it fresh with a seven-day hue.
Dedup keys glitter, no doubles allowed,
The judge-LLM waits under its cloud.
Each ended chat gets a golden tag,
Bunny seals it neat in the eval bag! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.67%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(eval): enqueue call trace ids onto evaluation redis queue' directly and specifically describes the main change: enqueueing trace IDs to a Redis queue for evaluation purposes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/test_evaluation_queue.py (1)

38-40: ⚡ Quick win

Add type hints to the new helper/fixture/test function signatures in this module.

Several new function signatures are untyped (e.g., fixtures and test functions), which violates the repo’s Python typing rule.

As per coding guidelines, **/*.py must “Add type hints on all function signatures.”

Also applies to: 49-55, 59-86, 93-204

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_evaluation_queue.py` around lines 38 - 40, Add type hints to all
function signatures that are missing them in tests/test_evaluation_queue.py. At
lines 38-40, the _new_trace_id function already has a return type hint (-> str)
but ensure all parameters and return types are typed. At lines 49-55, 59-86, and
93-204, add complete type hints to all function signatures including parameters
and return types for fixtures, test functions, and any helper functions that
currently lack type annotations. This includes adding parameter type hints and
return type hints (use None for functions that don't return a value) to comply
with the repository's Python typing requirements.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/services/langfuse/tasks/evaluation/queue.py`:
- Around line 46-51: The marker is set before the queue operation, creating an
atomicity issue where if the RPUSH fails, the marker remains set for 7 days and
blocks future retry attempts, causing traces to be lost from evaluation. To fix
this, move the marker set operation to occur after the successful RPUSH in the
section where the dedup marker is created with client.set() and the trace is
enqueued with client.rpush(). This ensures that the marker is only persisted
once the trace has been successfully added to the queue, allowing retries if the
enqueue operation fails.

In `@tests/test_evaluation_queue.py`:
- Line 198: The end_conversation function call on line 198 is passing a
SimpleNamespace object as the context parameter, but the function is typed to
accept a TemplateContext object, causing a type checker failure. Replace the
SimpleNamespace context with a properly constructed TemplateContext instance
that satisfies the function's type requirements.

---

Nitpick comments:
In `@tests/test_evaluation_queue.py`:
- Around line 38-40: Add type hints to all function signatures that are missing
them in tests/test_evaluation_queue.py. At lines 38-40, the _new_trace_id
function already has a return type hint (-> str) but ensure all parameters and
return types are typed. At lines 49-55, 59-86, and 93-204, add complete type
hints to all function signatures including parameters and return types for
fixtures, test functions, and any helper functions that currently lack type
annotations. This includes adding parameter type hints and return type hints
(use None for functions that don't return a value) to comply with the
repository's Python typing requirements.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8cb36779-0558-400e-be08-ff527683452b

📥 Commits

Reviewing files that changed from the base of the PR and between 81d67b9 and 5dff0c5.

📒 Files selected for processing (6)
  • app/ai/voice/agents/breeze_buddy/handlers/internal/end_conversation.py
  • app/ai/voice/agents/breeze_buddy/observability/tracing_setup.py
  • app/services/langfuse/tasks/evaluation/__init__.py
  • app/services/langfuse/tasks/evaluation/queue.py
  • tests/conftest.py
  • tests/test_evaluation_queue.py

Comment thread app/services/langfuse/tasks/evaluation/queue.py Outdated
Comment thread tests/test_evaluation_queue.py Outdated
@narsimhaReddyJuspay narsimhaReddyJuspay force-pushed the pr-1-add-redis-quque-to-push-traces-which-needs-to-be-evaluated branch 8 times, most recently from 911ae1d to 82a88d8 Compare June 16, 2026 18:35
Push the bare OTEL trace_id to a Redis list at call end (atomic SETNX+RPUSH via Lua) for the evaluation worker to drain later.

Co-Authored-By: Claude <noreply@anthropic.com>
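
The commit message above mentions making SETNX+RPUSH atomic via Lua. A sketch of what such a script and its Python wrapper might look like is given below; it is an assumption about the approach, not the code in this commit. `ENQUEUE_LUA`, `enqueue_atomic_sketch`, and `FakeEvalClient` are hypothetical names. The wrapper uses the redis-py style `eval(script, numkeys, *keys_and_args)` call; the fake client re-implements the script's effect in Python purely so the wrapper can be exercised without a server.

```python
from typing import Any, List, Optional, Set

QUEUE_KEY = "evaluation:trace_queue"
MARKER_PREFIX = "evaluation:enqueued:"
MARKER_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days

# Runs as a single unit inside Redis: the marker and the push either both
# happen or neither does. SET ... NX yields a false value in Lua when the
# key already exists.
ENQUEUE_LUA = """
if redis.call('SET', KEYS[1], '1', 'NX', 'EX', ARGV[1]) then
  redis.call('RPUSH', KEYS[2], ARGV[2])
  return 1
end
return 0
"""


def enqueue_atomic_sketch(client: Any, trace_id: Optional[str]) -> bool:
    """Return True only when the script set the marker and pushed the id."""
    if not trace_id:
        return False
    try:
        result = client.eval(
            ENQUEUE_LUA,
            2,  # number of KEYS that follow
            MARKER_PREFIX + trace_id,
            QUEUE_KEY,
            MARKER_TTL_SECONDS,
            trace_id,
        )
        return result == 1
    except Exception:
        # Best-effort: never let a Redis failure break call teardown.
        return False


class FakeEvalClient:
    """Offline stand-in that mimics the script's effect; not a Redis client."""

    def __init__(self) -> None:
        self.markers: Set[str] = set()
        self.queue: List[str] = []

    def eval(self, script: str, numkeys: int, *keys_and_args: Any) -> int:
        marker_key, _queue_key = keys_and_args[:numkeys]
        _ttl, trace_id = keys_and_args[numkeys:]
        if marker_key in self.markers:
            return 0
        self.markers.add(marker_key)
        self.queue.append(trace_id)
        return 1


fake = FakeEvalClient()
print(enqueue_atomic_sketch(fake, "c" * 32), enqueue_atomic_sketch(fake, "c" * 32))
```

Running both commands in one script removes the window in which the marker exists but the id was never pushed, which is the failure mode the earlier review comment described.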
@narsimhaReddyJuspay narsimhaReddyJuspay force-pushed the pr-1-add-redis-quque-to-push-traces-which-needs-to-be-evaluated branch from 82a88d8 to f58aa20 Compare June 16, 2026 23:19