feat(eval): enqueue call trace ids onto evaluation redis queue by narsimhaReddyJuspay · Pull Request #831 · juspay/clairvoyance

narsimhaReddyJuspay · 2026-06-15T11:28:14Z

What

PR 1 of the internal LLM-as-judge evaluation system. On call end, extract the OTEL/Langfuse trace_id and push the bare id onto a Redis list (evaluation:trace_queue) for the evaluation worker (later phase) to drain. Only the trace_id is stored — every other field (tags, call_sid, transcription, payload, …) is fetched from Langfuse by the worker.

Changes

- get_trace_id(span) in tracing_setup.py: returns the 32-char hex OTEL trace_id (1:1 with Langfuse trace.id), guarded by ENABLE_BREEZE_BUDDY_TRACING; returns None when tracing is off.
- enqueue_trace_for_evaluation(trace_id) in services/langfuse/tasks/evaluation/queue.py: sets a SETNX dedup marker (evaluation:enqueued:{trace_id}, 7d TTL), then RPUSHes the bare trace_id; best-effort (swallows Redis errors, never breaks call teardown).
- Wired into end_conversation right after update_span_with_evaluation_data.
- tests/conftest.py pre-imports the template package so the test process imports tracing_setup in the same order as the running app, which sidesteps the latent template <-> handlers <-> tracing_setup import cycle without touching production imports (no TYPE_CHECKING).
- Real-Redis unit + e2e tests.
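
As a rough illustration of the get_trace_id behaviour described above, here is a minimal sketch. It is not the PR's code: FakeSpan and FakeSpanContext are hypothetical stand-ins for the OTEL span objects, and the module-level flag only imitates however ENABLE_BREEZE_BUDDY_TRACING is actually read. The one firm detail is that OTEL stores the trace id as a 128-bit integer, which formats to 32 lowercase hex characters.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: stands in for the real ENABLE_BREEZE_BUDDY_TRACING config lookup.
ENABLE_BREEZE_BUDDY_TRACING = True


@dataclass
class FakeSpanContext:
    """Hypothetical stand-in for OTEL's SpanContext."""

    trace_id: int  # OTEL keeps the trace id as a 128-bit integer


@dataclass
class FakeSpan:
    """Hypothetical stand-in for an OTEL span."""

    context: FakeSpanContext

    def get_span_context(self) -> FakeSpanContext:
        return self.context


def get_trace_id(span: Optional[FakeSpan]) -> Optional[str]:
    """Return the trace id as 32 lowercase hex chars, or None."""
    if not ENABLE_BREEZE_BUDDY_TRACING or span is None:
        return None
    try:
        trace_id = span.get_span_context().trace_id
        if trace_id == 0:  # an all-zero trace id is invalid in OTEL
            return None
        return format(trace_id, "032x")  # zero-pad to 32 hex digits
    except Exception:
        # Best-effort: extraction errors must not break call teardown.
        return None
```

With the real OTEL API the span would come from the tracer, and the same zero-padded hex formatting applies.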

Out of scope (later PRs)

Schemas/DB layer, the worker (drain queue → Langfuse fetch → LLM judge), actions/Slack, the REST API. The queue is written but not yet consumed — safe to merge standalone.

Testing

- Unit (5): get_trace_id (none-span → None; real span → 32-hex), enqueue (empty-noop, push, dedup).
- E2e (1): drives the real end_conversation handler with a real OTEL span + real Redis (bot/context/task stubbed) and asserts the trace_id lands on the queue.
- Smoke: import app.main OK; server boots to "Application startup complete."
- All green: black, isort, autoflake, pyrefly (0 errors on these files); 6/6 tests pass against local Redis (skip if Redis down).

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Introduced automatic internal evaluation system that assesses conversations upon completion, enabling enhanced quality monitoring.
Tests
- Added integration tests for the conversation evaluation enqueue mechanism and deduplication flow.

coderabbitai · 2026-06-15T11:28:28Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bed8fdd0-a5e2-4022-8e66-4891997b2fb0


Walkthrough

end_conversation now extracts the OTEL trace ID from the active root span via a new get_trace_id utility and enqueues it to a Redis list for LLM-as-judge evaluation. The evaluation queue uses Redis deduplication markers with a 7-day TTL. Integration tests cover the full path with a real Redis instance.

Changes

LLM-as-judge Evaluation Enqueue Pipeline

| Layer / File(s) | Summary |
| --- | --- |
| Redis evaluation queue and package exports: `app/services/langfuse/tasks/evaluation/queue.py`, `app/services/langfuse/tasks/evaluation/__init__.py` | Introduces `enqueue_trace_for_evaluation(trace_id)` with Redis `SETNX`-based deduplication (7-day TTL) and `RPUSH` onto `evaluation:trace_queue`. Package `__init__.py` re-exports the function and constants via `__all__`. |
| OTEL trace ID extraction utility: `app/ai/voice/agents/breeze_buddy/observability/tracing_setup.py` | Adds `get_trace_id(span)` returning the 32-char hex OTEL trace ID, or `None` when tracing is disabled, span is absent, or extraction fails (errors are logged and swallowed). |
| end_conversation handler integration: `app/ai/voice/agents/breeze_buddy/handlers/internal/end_conversation.py` | Imports `get_trace_id` and `enqueue_trace_for_evaluation`; after writing span evaluation data, extracts the trace ID from `context.root_span` and conditionally enqueues it. |
| Tests (fixtures, unit, and end-to-end): `tests/conftest.py`, `tests/test_evaluation_queue.py` | `conftest.py` fixes import ordering for test isolation. `test_evaluation_queue.py` adds a real-Redis async fixture, `get_trace_id` unit tests, enqueue/dedup Redis tests, and an end-to-end test through `end_conversation` asserting queue and dedup marker presence. |

Sequence Diagram

sequenceDiagram
  participant end_conversation
  participant get_trace_id
  participant enqueue_trace_for_evaluation
  participant Redis

  end_conversation->>end_conversation: update_span_with_evaluation_data(context)
  end_conversation->>get_trace_id: context.root_span
  get_trace_id-->>end_conversation: evaluation_trace_id (32-hex or None)
  alt trace id present
    end_conversation->>enqueue_trace_for_evaluation: evaluation_trace_id
    enqueue_trace_for_evaluation->>Redis: SET evaluation:enqueued:{id} NX EX 7d
    enqueue_trace_for_evaluation->>Redis: RPUSH evaluation:trace_queue id
    enqueue_trace_for_evaluation-->>end_conversation: True / False
  end
  end_conversation->>end_conversation: end_conversation_callbacks
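
The enqueue step in the diagram can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: InMemoryRedis is a hypothetical stand-in that supports only the two commands used, the client is passed in explicitly so the example runs without a server (the real function takes only trace_id), and the example is synchronous even though the PR's tests use an async fixture. The key names and the 7-day TTL come from the PR description.

```python
from typing import Dict, List, Optional

QUEUE_KEY = "evaluation:trace_queue"
MARKER_PREFIX = "evaluation:enqueued:"
MARKER_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days


class InMemoryRedis:
    """Hypothetical stand-in covering only SET NX and RPUSH."""

    def __init__(self) -> None:
        self.strings: Dict[str, str] = {}
        self.lists: Dict[str, List[str]] = {}

    def set(
        self, name: str, value: str, nx: bool = False, ex: Optional[int] = None
    ) -> Optional[bool]:
        # Like redis-py, return None when NX is set and the key exists.
        # The TTL (ex) is accepted but not simulated here.
        if nx and name in self.strings:
            return None
        self.strings[name] = value
        return True

    def rpush(self, name: str, *values: str) -> int:
        self.lists.setdefault(name, []).extend(values)
        return len(self.lists[name])


def enqueue_trace_for_evaluation(
    client: InMemoryRedis, trace_id: Optional[str]
) -> bool:
    """Return True if the id was newly enqueued, False otherwise."""
    if not trace_id:
        return False  # empty id is a no-op
    try:
        marker = MARKER_PREFIX + trace_id
        created = client.set(marker, "1", nx=True, ex=MARKER_TTL_SECONDS)
        if not created:
            return False  # dedup: this trace was already enqueued
        client.rpush(QUEUE_KEY, trace_id)
        return True
    except Exception:
        # Best-effort: Redis errors must never break call teardown.
        return False
```

Note that in this two-step form the marker is written before the push, so a failed RPUSH would leave the marker behind; that is the gap the review comment on queue.py points at.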

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 Hop, hop, a trace goes in the queue,
Redis holds it fresh with a seven-day hue.
Dedup keys glitter, no doubles allowed,
The judge-LLM waits under its cloud.
Each ended chat gets a golden tag,
Bunny seals it neat in the eval bag! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(eval): enqueue call trace ids onto evaluation redis queue' directly and specifically describes the main change: enqueueing trace IDs to a Redis queue for evaluation purposes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

tests/test_evaluation_queue.py (1)
38-40: ⚡ Quick win

Add type hints to the new helper/fixture/test function signatures in this module.

Several new function signatures are untyped (e.g., fixtures and test functions), which violates the repo’s Python typing rule.

As per coding guidelines, **/*.py must “Add type hints on all function signatures.”

Also applies to: 49-55, 59-86, 93-204
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_evaluation_queue.py` around lines 38 - 40, Add type hints to all
function signatures that are missing them in tests/test_evaluation_queue.py. At
lines 38-40, the _new_trace_id function already has a return type hint (-> str)
but ensure all parameters and return types are typed. At lines 49-55, 59-86, and
93-204, add complete type hints to all function signatures including parameters
and return types for fixtures, test functions, and any helper functions that
currently lack type annotations. This includes adding parameter type hints and
return type hints (use None for functions that don't return a value) to comply
with the repository's Python typing requirements.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/services/langfuse/tasks/evaluation/queue.py`:
- Around line 46-51: The marker is set before the queue operation, creating an
atomicity issue where if the RPUSH fails, the marker remains set for 7 days and
blocks future retry attempts, causing traces to be lost from evaluation. To fix
this, move the marker set operation to occur after the successful RPUSH in the
section where the dedup marker is created with client.set() and the trace is
enqueued with client.rpush(). This ensures that the marker is only persisted
once the trace has been successfully added to the queue, allowing retries if the
enqueue operation fails.

In `@tests/test_evaluation_queue.py`:
- Line 198: The end_conversation function call on line 198 is passing a
SimpleNamespace object as the context parameter, but the function is typed to
accept a TemplateContext object, causing a type checker failure. Replace the
SimpleNamespace context with a properly constructed TemplateContext instance
that satisfies the function's type requirements.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8cb36779-0558-400e-be08-ff527683452b

📥 Commits

Reviewing files that changed from the base of the PR and between 81d67b9 and 5dff0c5.

📒 Files selected for processing (6)

app/ai/voice/agents/breeze_buddy/handlers/internal/end_conversation.py
app/ai/voice/agents/breeze_buddy/observability/tracing_setup.py
app/services/langfuse/tasks/evaluation/__init__.py
app/services/langfuse/tasks/evaluation/queue.py
tests/conftest.py
tests/test_evaluation_queue.py

Copilot AI review requested due to automatic review settings June 15, 2026 11:28

Copilot AI reviewed Jun 15, 2026

coderabbitai Bot reviewed Jun 15, 2026

Comment thread app/services/langfuse/tasks/evaluation/queue.py Outdated

Comment thread tests/test_evaluation_queue.py Outdated

narsimhaReddyJuspay force-pushed the pr-1-add-redis-quque-to-push-traces-which-needs-to-be-evaluated branch 8 times, most recently from 911ae1d to 82a88d8 Compare June 16, 2026 18:35

feat(eval): enqueue call trace ids onto evaluation redis queue

f58aa20

Push the bare OTEL trace_id to a Redis list at call end (atomic SETNX+RPUSH via Lua) for the evaluation worker to drain later. Co-Authored-By: Claude <noreply@anthropic.com>
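
One way the "atomic SETNX+RPUSH via Lua" mentioned in the commit message could look is sketched below. The script and the enqueue_atomically wrapper are assumptions for illustration, not the committed code; only the key names and the 7-day TTL are taken from the PR. Redis runs a Lua script as a single unit, so the marker is never set without the push happening in the same step.

```python
from typing import Any, Protocol

# Assumed script: set the dedup marker only if absent, then push the id.
# SET ... NX yields a false value in Lua when the key already exists.
ENQUEUE_LUA = """
if redis.call('SET', KEYS[1], '1', 'NX', 'EX', ARGV[1]) then
    redis.call('RPUSH', KEYS[2], ARGV[2])
    return 1
end
return 0
"""

MARKER_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days


class SupportsEval(Protocol):
    """Any client exposing redis-py's eval(script, numkeys, *keys_and_args)."""

    def eval(self, script: str, numkeys: int, *keys_and_args: Any) -> Any: ...


def enqueue_atomically(
    client: SupportsEval, trace_id: str, ttl_seconds: int = MARKER_TTL_SECONDS
) -> bool:
    """Return True if the trace id was newly enqueued (script returned 1)."""
    result = client.eval(
        ENQUEUE_LUA,
        2,  # number of KEYS that follow
        f"evaluation:enqueued:{trace_id}",  # KEYS[1]: dedup marker
        "evaluation:trace_queue",  # KEYS[2]: queue list
        ttl_seconds,  # ARGV[1]: marker TTL in seconds
        trace_id,  # ARGV[2]: value to push
    )
    return result == 1
```

A production version would likely keep the best-effort error handling from the description, wrapping the call so Redis failures are logged and swallowed.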

narsimhaReddyJuspay force-pushed the pr-1-add-redis-quque-to-push-traces-which-needs-to-be-evaluated branch from 82a88d8 to f58aa20 Compare June 16, 2026 23:19

narsimhaReddyJuspay wants to merge 1 commit into juspay:release from narsimhaReddyJuspay:pr-1-add-redis-quque-to-push-traces-which-needs-to-be-evaluated


2 participants
