
Conversation


@Lincyaw Lincyaw commented Oct 31, 2025

Add support for Phoenix tracer

Author

Lincyaw commented Oct 31, 2025

@microsoft-github-policy-service agree [company="personal"]

Author

Lincyaw commented Oct 31, 2025

@microsoft-github-policy-service agree

pyproject.toml Outdated
apo = [
"poml",
]
apo = ["poml"]
Contributor

What formatting tool are you using to format pyproject.toml? We can add it to pre-commit.

Either way, let's keep the format changes out of this PR to reduce unnecessary conflicts.

Author

I am using Even Better TOML; I will fix it.

pyproject.toml Outdated
"poml",
]
apo = ["poml"]
phoenix = ["arize-phoenix-client>=1.14.1", "arize-phoenix-otel>=0.11.0"]
Contributor

I think you'll need to add --extra phoenix to unit tests in github actions.
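For reference, the kind of workflow change being suggested might look like this (a sketch only; the actual step names and sync flags in this repo's workflows may differ):

```yaml
# Sketch of a GitHub Actions step installing the phoenix extra.
# The step name and any other extras are illustrative.
- name: Install dependencies
  run: uv sync --extra phoenix
```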

Author

fixed

Contributor

/ci


github-actions bot commented Oct 31, 2025

🚀 CI Watcher for correlation id-3471549478-mhehz7bn triggered by comment 3471549478
🏃‍♀️ Tracking 3 workflow run(s):

✅ All runs completed.

Author

Lincyaw commented Nov 1, 2025

I find that this version cannot be used in RL training right now. For example, a reward is returned in my LitAgent, yet the reward is always None: "Warning: Reward is None"

2025-11-01 10:28:40,306 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38772 - "GET /get_next_span_sequence_id/ro-c73d72c5fa4e/at-8dd647b3 HTTP/1.1" 200 in 0.19 ms
2025-11-01 10:28:40,307 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38772 - "POST /add_span HTTP/1.1" 200 in 0.45 ms
2025-11-01 10:28:40,308 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38772 - "GET /get_next_span_sequence_id/ro-c73d72c5fa4e/at-8dd647b3 HTTP/1.1" 200 in 0.19 ms
2025-11-01 10:28:40,309 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38772 - "POST /add_span HTTP/1.1" 200 in 0.47 ms
2025-11-01 10:28:40,310 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38780 - "GET /get_next_span_sequence_id/ro-c73d72c5fa4e/at-8dd647b3 HTTP/1.1" 200 in 0.18 ms
2025-11-01 10:28:40,312 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38780 - "POST /add_span HTTP/1.1" 200 in 0.50 ms
reward span is _Span(name="agentlightning.reward", context=SpanContext(trace_id=0xe044bacfc66f1abd64e32a50fec889ad, span_id=0x66cb1b78abf3689c, trace_flags=0x01, trace_state=[], is_remote=False))
===============trace_spans: [<opentelemetry.sdk.trace.ReadableSpan object at 0x7f1d1bf7e030>, <opentelemetry.sdk.trace.ReadableSpan object at 0x7f1d1bf7d2b0>, <opentelemetry.sdk.trace.ReadableSpan object at 0x7f1d1c134620>, <OpenInferenceSpan at 0x7f1d1c200ac0 for _Span at 0x7f1d1c1346e0>]
getting span reward: _Span(name="agentlightning.reward", context=SpanContext(trace_id=0xe044bacfc66f1abd64e32a50fec889ad, span_id=0x66cb1b78abf3689c, trace_flags=0x01, trace_state=[], is_remote=False))
found reward: 0.0001
2025-11-01 10:28:40,312 [INFO] (Process-2616361 agentlightning.runner.agent)   [Worker 7 | Rollout ro-c73d72c5fa4e] Completed in 10.89s. Collected 4 span(s). Final reward: 0.0001
2025-11-01 10:28:40,314 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38780 - "POST /update_attempt HTTP/1.1" 200 in 1.28 ms
2025-11-01 10:28:40,315 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38780 - "GET /dequeue_rollout HTTP/1.1" 200 in 0.42 ms
2025-11-01 10:28:40,316 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38780 - "POST /update_attempt HTTP/1.1" 200 in 0.61 ms
2025-11-01 10:28:40,317 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38780 - "GET /get_resources_by_id/rs-765f4ad0ebc8 HTTP/1.1" 200 in 0.34 ms
using model: ./Qwen3-1.7B at http://10.0.6.208:51103/rollout/ro-d1515da05819/attempt/at-0491e10d
INFO:2025-11-01 10:28:40,328:> trace_id: ro-d1515da05819
INFO:2025-11-01 10:28:40,328:Loading builtin toolkit `query_parquet_files` with config ToolkitConfig(mode='builtin', name='query_parquet_files', activated_tools=None, config={}, config_llm=None, customized_filepath=None, customized_classname=None, mcp_transport='stdio', mcp_client_session_timeout_seconds=20)
INFO:2025-11-01 10:28:40,333:Loaded 3 tools: ['get_schema', 'list_tables_in_directory', 'query_parquet_files']
2025-11-01 10:28:40,334 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38772 - "GET /get_next_span_sequence_id/ro-d1515da05819/at-0491e10d HTTP/1.1" 200 in 0.19 ms
2025-11-01 10:28:40,335 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38772 - "POST /add_span HTTP/1.1" 200 in 0.49 ms
2025-11-01 10:28:40,340 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:48216 - "GET /get_next_span_sequence_id/ro-d1515da05819/at-0491e10d HTTP/1.1" 200 in 0.22 ms
(TaskRunner pid=2616541) INFO:     10.0.6.208:35248 - "POST /chat/completions HTTP/1.1" 200 OK
2025-11-01 10:28:41,149 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38796 - "POST /wait_for_rollouts HTTP/1.1" 200 in 1.19 ms
2025-11-01 10:28:41,151 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38796 - "GET /query_spans/ro-2b2a0b87c0e2 HTTP/1.1" 200 in 1.09 ms
2025-11-01 10:28:41,155 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38796 - "GET /query_spans/ro-aa81149ecc79 HTTP/1.1" 200 in 0.95 ms
2025-11-01 10:28:41,159 [INFO] (Process-2615891 agentlightning.store.client_server)   127.0.0.1:38796 - "GET /query_spans/ro-c73d72c5fa4e HTTP/1.1" 200 in 0.94 ms
(TaskRunner pid=2616541) Warning: Reward is None for rollout ro-2b2a0b87c0e2, will be auto-set to 0.0.
(TaskRunner pid=2616541) Warning: Length of triplets is 0 for rollout ro-2b2a0b87c0e2.
(TaskRunner pid=2616541) Warning: Reward is None for rollout ro-aa81149ecc79, will be auto-set to 0.0.
(TaskRunner pid=2616541) Warning: Length of triplets is 0 for rollout ro-aa81149ecc79.
(TaskRunner pid=2616541) Warning: Reward is None for rollout ro-c73d72c5fa4e, will be auto-set to 0.0.
(TaskRunner pid=2616541) Warning: Length of triplets is 0 for rollout ro-c73d72c5fa4e.

Contributor

Are you returning rewards from the @rollout method, or using emit_reward?

If you are using emit_reward, it actually creates an OpenTelemetry span, and you need to make sure that your tracer is receiving and handling this span.

If you are returning rewards, that's weird, because the runner writes that span directly to the store. We might need some bug fixes.
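The two reward paths can be sketched with plain Python. Here dicts stand in for OpenTelemetry spans; the span name agentlightning.reward matches the logs in this thread, while the attribute key and the helper names are illustrative, not the real agentlightning API:

```python
# Minimal sketch of the two reward paths: emitting a reward span that
# the tracer must capture, and scanning collected spans for the final
# reward. Plain dicts stand in for OpenTelemetry spans.

def emit_reward_span(spans: list[dict], reward: float) -> None:
    """emit_reward path: append a reward span; the active tracer must
    capture this span and forward it to the store."""
    spans.append({"name": "agentlightning.reward",
                  "attributes": {"reward": reward}})

def find_final_reward(spans: list[dict]):
    """What the runner/adapter side does conceptually: scan the
    collected spans for the last reward value, else None."""
    reward = None
    for span in spans:
        if span["name"] == "agentlightning.reward":
            reward = span["attributes"]["reward"]
    return reward

spans: list[dict] = [{"name": "openai.chat.completion", "attributes": {}}]
emit_reward_span(spans, 0.0001)
print(find_final_reward(spans))  # 0.0001
```

If the tracer never captures the reward span, the second helper sees no reward span at all, which matches the "Reward is None" warnings above.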

Author

Lincyaw commented Nov 1, 2025

I return the float reward in the rollout function; the following is an example:

    async def rollout_async(
        self,
        task: dict,
        resources: agl.NamedResources,
        rollout: agl.Rollout,
    ) -> agl.RolloutRawResult:
        sample = EvaluationSample(**task)
        start_time = time.time()

        assert self.agent_config is not None
        llm: agl.LLM = cast(agl.LLM, resources["main_llm"])

        print(
            "using model:",
            llm.model,
            "at",
            llm.get_base_url(rollout.rollout_id, rollout.attempt.attempt_id),
        )
        self.agent_config.model.model_provider.base_url = llm.get_base_url(
            rollout.rollout_id, rollout.attempt.attempt_id
        )
        self.agent_config.model.model_provider.model = llm.model
        self.agent = cast(SimpleAgent, get_agent(config=self.agent_config))
        result = await self.agent.run(sample.raw_question, trace_id=rollout.rollout_id)
        end_time = time.time()

        # Update sample with results
        sample.update(
            response=result.final_output,
            time_cost=end_time - start_time,
            trajectories=json.dumps(result.trajectories, ensure_ascii=False),
            stage="rollout",
        )

        judged_sample: EvaluationSample = await self.rcabench.judge_one(sample)

        print(judged_sample)
        return 1 if judged_sample.correct else 0.0001

So do I need to use emit_reward?

Author

Lincyaw commented Nov 1, 2025

When I replace return 1 if judged_sample.correct else 0.0001 with agl.emit_reward(0.01), nothing happened:

(TaskRunner pid=2686400) Warning: Reward is None for rollout ro-3896968001ed, will be auto-set to 0.0.
(TaskRunner pid=2686400) Warning: Length of triplets is 0 for rollout ro-3896968001ed.
(TaskRunner pid=2686400) Completed 30/32 tasks...

@ultmaster
Contributor

(comment body not captured in this export)
@Lincyaw
Copy link
Author

Lincyaw commented Nov 1, 2025

Yes, it was generated. The previous logs contain:

reward span is _Span(name="agentlightning.reward", context=SpanContext(trace_id=0xe044bacfc66f1abd64e32a50fec889ad, span_id=0x66cb1b78abf3689c, trace_flags=0x01, trace_state=[], is_remote=False))
===============trace_spans: [<opentelemetry.sdk.trace.ReadableSpan object at 0x7f1d1bf7e030>, <opentelemetry.sdk.trace.ReadableSpan object at 0x7f1d1bf7d2b0>, <opentelemetry.sdk.trace.ReadableSpan object at 0x7f1d1c134620>, <OpenInferenceSpan at 0x7f1d1c200ac0 for _Span at 0x7f1d1c1346e0>]
getting span reward: _Span(name="agentlightning.reward", context=SpanContext(trace_id=0xe044bacfc66f1abd64e32a50fec889ad, span_id=0x66cb1b78abf3689c, trace_flags=0x01, trace_state=[], is_remote=False))

Contributor

If you are seeing a span here, and the span did get sent into the store, but you are not seeing any rewards at VERL, there could be two reasons:

  1. Your runner and algorithm are not accessing the same instance of the store.
  2. The adapter filters out the spans for some reason.

I think the second case is more likely, since the Phoenix tracer might not generate the same format of token IDs as the AgentOps tracer.

Could you show me what your trace attributes look like?
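One quick way to see what the adapter receives is to dump each span's name and attributes. A sketch over dict-shaped spans (real code would iterate OpenTelemetry ReadableSpan objects and read span.name and span.attributes; the attribute keys below are purely illustrative):

```python
# Sketch: render each span's name and attributes so you can check
# whether the fields an adapter expects (token IDs, rewards, ...)
# are present. Plain dicts stand in for ReadableSpan objects.

def dump_spans(spans: list[dict]) -> list[str]:
    lines = []
    for span in spans:
        attrs = ", ".join(f"{key}={value!r}"
                          for key, value in sorted(span["attributes"].items()))
        lines.append(f"{span['name']}: {attrs}")
    return lines

example = [
    {"name": "openai.chat.completion",
     "attributes": {"response_id": "chatcmpl-123"}},
    {"name": "agentlightning.reward", "attributes": {"reward": 0.1}},
]
for line in dump_spans(example):
    print(line)
```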

Author

Lincyaw commented Nov 1, 2025

Actually, when I use the default AgentOps tracer, I also find that the token IDs are missing. Only the rewards are present here.

(TaskRunner pid=2810103) Warning: Rollout ro-9bbacebcded8 contains empty response: [Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=0.0, metadata={'response_id': 'chatcmpl-992d420bce1d4c4aa23a24986be30e7f', 'agent_name': '*'})]
(TaskRunner pid=2810103) Completed 12/20 tasks...

And the final error is shown in an attached screenshot (image not captured in this export).

Author

Lincyaw commented Nov 1, 2025

Is this because the agent I implemented is not instrumented? I am using youtu-agent (https://github.com/TencentCloudADP/youtu-agent), which is a wrapper around openai-agents. I think AgentOps can instrument it by default.

Contributor

Not really. AgentOps instruments the general OpenAI provider, which is framework-agnostic. Missing instrumentation of the agent framework only affects discriminating between multiple agents.

There is, however, a bug fix for the OpenAI provider from yesterday. Would you try pulling the latest main?

Author

Lincyaw commented Nov 1, 2025

I switched to the latest main; the issue remains:

(TaskRunner pid=2885675) Warning: Rollout ro-34788af75c53 contains empty response: [Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=None, metadata={'response_id': 'chatcmpl-5e22b526f0324d0698df20396f5552b2', 'agent_name': '*'}), Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=None, metadata={'response_id': 'chatcmpl-e8f1ccbad02e4e55ad2707ab8337ecbd', 'agent_name': '*'}), Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=None, metadata={'response_id': 'chatcmpl-749e3b6c304f44429ac355e8d3d5a3da', 'agent_name': '*'}), Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=None, metadata={'response_id': 'chatcmpl-321c9f16f3d046a8bc281f3e3742840e', 'agent_name': '*'}), Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=None, metadata={'response_id': 'chatcmpl-7ae79bec5b884e9cb1c51f7675258e37', 'agent_name': '*'}), Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=0.0, metadata={'response_id': 'chatcmpl-513243560c034928acf5e0d679a58068', 'agent_name': '*'})]

The full log of a single rollout is below (it looks quite normal, except for the last line):

2025-11-01 13:08:38,550 [INFO] (Process-2884936 agentlightning.store.client_server)   127.0.0.1:35350 - "POST /agl/v1/rollouts/ro-36214da8af95/attempts/at-59ed3351 HTTP/1.1" 200 in 0.71 ms
2025-11-01 13:08:38,550 [INFO] (Process-2884936 agentlightning.store.client_server)   127.0.0.1:35350 - "POST /agl/v1/rollouts/ro-36214da8af95/attempts/at-59ed3351 HTTP/1.1" 200 in 0.71 ms
🖇 AgentOps: Session Replay for ro-36214da8af95 trace: https://app.agentops.ai/sessions?trace_id=9c2a8ff54a02cc43f4af77e971ee2db6
🖇 AgentOps: Session Replay for ro-36214da8af95.session trace: https://app.agentops.ai/sessions?trace_id=9c2a8ff54a02cc43f4af77e971ee2db6
2025-11-01 13:09:04,650 [INFO] (Process-2885573 agentlightning.runner.agent)   [Worker 4 | Rollout ro-36214da8af95] Completed in 26.10s. Collected 22 span(s). Final reward: 0.1
2025-11-01 13:09:04,652 [INFO] (Process-2884936 agentlightning.store.client_server)   127.0.0.1:51356 - "POST /agl/v1/rollouts/ro-36214da8af95/attempts/at-59ed3351 HTTP/1.1" 200 in 0.73 ms
(TaskRunner pid=2885675) Warning: Rollout ro-36214da8af95 contains empty response: [Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=None, metadata={'response_id': 'chatcmpl-47a1b14f36e24e1b9a5d8575270e7df2', 'agent_name': '*'}), Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=None, metadata={'response_id': 'chatcmpl-212d66a4ca034324aba8eca488576261', 'agent_name': '*'}), Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=0.1, metadata={'response_id': 'chatcmpl-c02e043d081542e38c00037e2bfb693a', 'agent_name': '*'})]

Copilot AI review requested due to automatic review settings November 1, 2025 05:17
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds Phoenix tracer integration to Agent Lightning, enabling observability and tracing through Arize Phoenix. The integration leverages OpenTelemetry and reuses the existing LightningSpanProcessor to capture and store spans.

  • Introduces a new PhoenixTracer class that wraps arize-phoenix-otel registration
  • Adds comprehensive test coverage for the tracer and span processor lifecycle
  • Includes new optional dependencies under phoenix extra

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:

  uv.lock: Adds Phoenix-related dependencies (arize-phoenix-client, arize-phoenix-otel) and their transitive dependencies
  pyproject.toml: Declares new phoenix optional dependency group with minimum version constraints
  agentlightning/tracer/phoenix.py: Implements PhoenixTracer class with worker lifecycle management and span collection
  tests/tracer/test_phoenix.py: Comprehensive test suite covering processor lifecycle, concurrency, multiprocessing, and reward tracing
  .github/workflows/tests.yml: Updates CI to install phoenix extra for testing
  .github/workflows/tests-full.yml: Updates full test suite CI to install phoenix extra


"poml",
]
phoenix = [
"arize-phoenix-client>=1.14.1",
Copilot AI Nov 1, 2025

The minimum version constraint specifies >=1.14.1, but the lock file contains version 1.21.0, which does not appear to exist yet. Consider specifying an upper-bound constraint (e.g., >=1.14.1,<2.0) to prevent unexpected breaking changes while ensuring compatibility with currently available versions.

Suggested change
"arize-phoenix-client>=1.14.1",
"arize-phoenix-client>=1.14.1,<2.0",

except Exception as exc:  # pragma: no cover - defensive path
    try:
        conn.send(("error", repr(exc)))
    except Exception:
Copilot AI Nov 1, 2025

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except Exception:
except Exception:
    # Suppress any exception when sending error to parent process,
    # as the connection may already be closed or broken.

Contributor

Switched to the latest main, the issue remains


Is this an issue with the AgentOps tracer? Please first print out the spans following the debug tutorial here: https://microsoft.github.io/agent-lightning/latest/tutorials/debug/ Make sure the token IDs are included in the span in the right place.

If the token IDs are missing, try logging return_value here and see why they are missing from the spans.
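A small helper like the following can flag triplets with empty token IDs early, mirroring the warnings in this thread (a sketch only; triplets are modeled as plain dicts rather than the real agentlightning Triplet class):

```python
# Sketch: find triplets whose prompt or response token_ids are empty,
# which is the symptom shown in the warnings above. Plain dicts stand
# in for agentlightning Triplet objects.

def empty_token_id_triplets(triplets: list[dict]) -> list[int]:
    """Return the indices of triplets with missing token IDs."""
    bad = []
    for index, triplet in enumerate(triplets):
        if (not triplet["prompt"].get("token_ids")
                or not triplet["response"].get("token_ids")):
            bad.append(index)
    return bad

triplets = [
    {"prompt": {"token_ids": []}, "response": {"token_ids": []}, "reward": 0.0},
    {"prompt": {"token_ids": [1, 2]}, "response": {"token_ids": [3]}, "reward": 0.1},
]
print(empty_token_id_triplets(triplets))  # [0]
```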

Author

Lincyaw commented Nov 2, 2025

I tried, but the token_ids are also missing.

The code snippet is shown below. I think it is simple, but neither AgentOpsTracer nor OtelTracer outputs the Triplet.

from __future__ import annotations

import argparse
import asyncio
import json
import socket
import time
from collections.abc import Iterable
from typing import Any, cast


import agentlightning as agl
from utu.agents import get_agent
from utu.agents.simple_agent import SimpleAgent
from utu.config.loader import ConfigLoader
from utu.db.eval_datapoint import EvaluationSample
from utu.eval.data import DBDataManager
from utu.eval.processer.rcabench import RCABenchProcesser

agl.configure_logger()

logger = agl.configure_logger(name=__name__)


def _find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("127.0.0.1", 0))
        return cast(int, sock.getsockname()[1])


class LitRCAgent(agl.LitAgent[EvaluationSample]):
    def __init__(
        self,
        eval_config_name: str = "rcabench_simple_agent",
    ) -> None:
        self.config = ConfigLoader.load_eval_config(name=eval_config_name)
        self.agent_config = self.config.agent

        assert self.agent_config is not None

        self.rcabench = RCABenchProcesser(config=self.config)

    async def rollout_async(
        self,
        task: dict,
        resources: agl.NamedResources,
        rollout: agl.Rollout,
    ) -> agl.RolloutRawResult:
        sample = EvaluationSample(**task)
        start_time = time.time()

        assert self.agent_config is not None
        llm: agl.LLM = cast(agl.LLM, resources["main_llm"])

        self.agent_config.model.model_provider.base_url = llm.get_base_url(
            rollout.rollout_id, rollout.attempt.attempt_id
        )

        self.agent_config.model.model_provider.model = llm.model
        self.agent = cast(SimpleAgent, get_agent(config=self.agent_config))
        result = await self.agent.run(sample.raw_question, trace_id=rollout.rollout_id)
        end_time = time.time()

        sample.update(
            response=result.final_output,
            time_cost=end_time - start_time,
            trajectories=json.dumps(result.trajectories, ensure_ascii=False),
            stage="rollout",
        )

        judged_sample: EvaluationSample = await self.rcabench.judge_one(sample)

        assert judged_sample.confidence is not None
        agl.emit_reward(judged_sample.confidence)
        # return judged_sample.confidence

    def rollout(
        self,
        task: Any,
        resources: agl.NamedResources,
        rollout: agl.Rollout,
    ) -> float | None:
        return asyncio.run(self.rollout_async(task, resources, rollout))  # type: ignore

    def load_dataset(self, num: int = 5) -> list[EvaluationSample]:
        dataset = DBDataManager(self.config)
        samples = dataset.load()[:num]
        data = [self.rcabench.preprocess_one(i) for i in samples]
        return data


class RunnerDebugHook(agl.Hook):
    async def on_trace_end(self, *, agent, runner, tracer, rollout) -> None:  # type: ignore[override]
        get_last_trace = getattr(tracer, "get_last_trace", None)
        if not callable(get_last_trace):
            logger.info("Tracer %s has no get_last_trace helper", type(tracer).__name__)
            return

        raw_spans = get_last_trace()
        if not raw_spans:
            logger.info(
                "No spans collected during rollout %s",
                getattr(rollout, "rollout_id", "<unknown>"),
            )
            return

        if hasattr(raw_spans, "spans"):
            span_iterable = getattr(raw_spans, "spans")
        elif isinstance(raw_spans, Iterable) and not isinstance(
            raw_spans, (str, bytes)
        ):
            span_iterable = raw_spans
        else:
            span_iterable = (raw_spans,)

        spans = list(span_iterable)
        if not spans:
            logger.info(
                "No spans collected during rollout %s",
                getattr(rollout, "rollout_id", "<unknown>"),
            )
            return

        logger.info(
            "Trace spans collected during rollout %s:",
            getattr(rollout, "rollout_id", "<unknown>"),
        )
        for span in spans:
            logger.info(
                "- %s (status=%s)",
                getattr(span, "name", "<unknown>"),
                getattr(span, "status", None),
            )


async def _debug_runner_once(
    *,
    eval_config_name: str = "rcabench_simple_agent_with_sql",
    llm_endpoint: str = "http://localhost:8000/v1",
    llm_model: str = "Qwen/Qwen3-1.7B",
    sample_index: int = 0,
    hooks: list[agl.Hook] | None = None,
) -> None:
    agent = LitRCAgent(eval_config_name=eval_config_name)
    samples = agent.load_dataset(num=sample_index + 1)
    if not samples:
        raise ValueError(
            "Dataset is empty; ensure the evaluation database is populated"
        )

    try:
        sample = samples[sample_index]
    except IndexError as exc:  # pragma: no cover - defensive guard
        raise ValueError(
            f"Sample index {sample_index} is out of range for dataset of size {len(samples)}"
        ) from exc

    tracer = agl.AgentOpsTracer()
    runner = agl.LitAgentRunner(tracer=tracer)
    store = agl.InMemoryLightningStore()

    llm_proxy = agl.LLMProxy(
        port=8081,
        model_list=[
            {
                "model_name": "Qwen/Qwen3-1.7B",
                "litellm_params": {
                    "model": "hosted_vllm/Qwen/Qwen3-1.7B",
                    "api_base": llm_endpoint,
                },
            }
        ],
        store=store,
    )
    llm_proxy.start()

    resources: agl.NamedResources = {"main_llm": llm_proxy.as_resource()}
    # resources: agl.NamedResources = {
    #     "main_llm": agl.LLM(
    #         endpoint="http://localhost:8001/v1",
    #         model=llm_model,
    #     )
    # }

    active_hooks = hooks or [RunnerDebugHook()]

    try:
        with runner.run_context(agent=agent, store=store, hooks=active_hooks):
            rollout = await runner.step(sample.model_dump(), resources=resources)

        rollouts = await store.query_rollouts()
        spans = await store.query_spans(rollout.rollout_id)
        reward = agl.emitter.find_final_reward(spans)
        logger.info("Rollouts captured: %s", rollouts)
        logger.info("Recorded %d spans for rollout %s", len(spans), rollout.rollout_id)
        logger.info(f"Final response: {spans[-1]}")
        logger.info("Extracted reward: %s", reward)

        adapter = agl.LlmProxyTraceToTriplet()
        triplets = adapter.adapt(spans)
        if not triplets:
            logger.info(
                "No triplets extracted from spans for rollout %s", rollout.rollout_id
            )
        else:
            for index, triplet in enumerate(triplets, start=1):
                prompt_token_ids = (
                    triplet.prompt.get("token_ids")
                    if isinstance(triplet.prompt, dict)
                    else triplet.prompt
                )
                response_payload = triplet.response
                if isinstance(response_payload, dict):
                    response_token_ids = response_payload.get("token_ids")
                else:
                    response_token_ids = response_payload

                logger.info(
                    "Triplet %d: prompt_token_ids=%s output_token_ids=%s reward=%s",
                    index,
                    prompt_token_ids,
                    response_token_ids,
                    triplet.reward,
                )
    finally:
        if llm_proxy.is_running():
            llm_proxy.stop()


def debug_runner_once(
    *,
    eval_config_name: str = "rcabench_simple_agent_with_sql",
    llm_endpoint: str = "http://localhost:8000/v1",
    llm_model: str = "Qwen/Qwen3-1.7B",
    sample_index: int = 0,
    hooks: list[agl.Hook] | None = None,
) -> None:
    """Public entry point that wraps the async runner debug helper."""

    asyncio.run(
        _debug_runner_once(
            eval_config_name=eval_config_name,
            llm_endpoint=llm_endpoint,
            llm_model=llm_model,
            sample_index=sample_index,
            hooks=hooks,
        )
    )


def debug_trainer_dev(
    *,
    eval_config_name: str = "rcabench_simple_agent_with_sql",
    llm_endpoint: str = "http://localhost:8000/v1",
    llm_model: str = "Qwen/Qwen3-1.7B",
    num_samples: int = 10,
    n_workers: int = 1,
) -> None:
    """Dry-run the trainer loop using Trainer.dev with configurable resources."""

    agent = LitRCAgent(eval_config_name=eval_config_name)
    debug_samples = agent.load_dataset(num=num_samples)
    logger.info("Loaded %d samples for trainer dry run", len(debug_samples))

    from utu.tracing import PhoenixTracer

    tracer = PhoenixTracer(auto_instrument=True)
    trainer = agl.Trainer(
        n_workers=n_workers,
        initial_resources={
            "main_llm": agl.LLM(
                endpoint=llm_endpoint,
                model=llm_model,
            )
        },
        tracer=tracer,
    )

    trainer.dev(agent, debug_samples)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Debug RCAgent")
    parser.add_argument(
        "--mode",
        choices=["runner", "trainer"],
        default="runner",
        help="Choose debug mode: 'runner' for debug_runner_once, 'trainer' for debug_trainer_dev",
    )
    args = parser.parse_args()

    if args.mode == "runner":
        debug_runner_once()
    elif args.mode == "trainer":
        debug_trainer_dev()

Contributor

What log output are you getting from running this script?

Author

Lincyaw commented Nov 2, 2025

This is the Otel tracer:

(rcagent-rl) root@iv-ye5ro8ne9s5i3z3p14l4 ~/RCAgent-RL (main)# uv run python ./training_agl/rcagent/rcagent.py --mode runner
2025-11-02 16:01:23,359[utu.eval.data.data_manager] - WARNING - data_manager.py:45 - exp_id rcabench_evaluation already exists in db
2025-11-02 16:01:23,362 [INFO] (Process-4057198 agentlightning.llm_proxy)   Adding a new middleware to the FastAPI app.
2025-11-02 16:01:23,365 [INFO] (Process-4057198 agentlightning.llm_proxy)   Starting LLMProxy server thread...
INFO:     Started server process [4057198]
INFO:     Waiting for application startup.

   ██╗     ██╗████████╗███████╗██╗     ██╗     ███╗   ███╗
   ██║     ██║╚══██╔══╝██╔════╝██║     ██║     ████╗ ████║
   ██║     ██║   ██║   █████╗  ██║     ██║     ██╔████╔██║
   ██║     ██║   ██║   ██╔══╝  ██║     ██║     ██║╚██╔╝██║
   ███████╗██║   ██║   ███████╗███████╗███████╗██║ ╚═╝ ██║
   ╚══════╝╚═╝   ╚═╝   ╚══════╝╚══════╝╚══════╝╚═╝     ╚═╝


#------------------------------------------------------------#
#                                                            #
#       'This feature doesn't meet my needs because...'       #
#        https://github.com/BerriAI/litellm/issues/new        #
#                                                            #
#------------------------------------------------------------#

 Thank you for using LiteLLM! - Krrish & Ishaan



Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new


LiteLLM: Proxy initialized with Config, Set models:
    Qwen/Qwen3-1.7B
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
2025-11-02 16:01:23,375 [INFO] (Process-4057198 agentlightning.tracer.otel)   [Worker 0] Setting up OpenTelemetry tracer...
Overriding of current TracerProvider is not allowed
2025-11-02 16:01:23,529 [WARNING] (Process-4057198 agentlightning.llm_proxy)   Loop and lock are not owned by the current process. Clearing them.
INFO:     10.0.6.208:59116 - "POST /chat/completions HTTP/1.1" 200 OK
2025-11-02 16:01:28,320 [WARNING] (Process-4057198 agentlightning.llm_proxy)   No headers found in 1 subtree spans. Cannot log to store.
2025-11-02 16:01:28,321 [INFO] (Process-4057198 __main__)   No spans collected during rollout ro-e2292df7a4c9
2025-11-02 16:01:28,321 [INFO] (Process-4057198 agentlightning.runner.agent)   [Worker 0 | Rollout ro-e2292df7a4c9] Completed in 4.94s. Collected 0 span(s). Final reward: None
2025-11-02 16:01:28,321 [INFO] (Process-4057198 agentlightning.tracer.otel)   [Worker 0] Tearing down OpenTelemetry tracer...
2025-11-02 16:01:28,321 [INFO] (Process-4057198 __main__)   Rollouts captured: [Rollout(rollout_id='ro-e2292df7a4c9', input={'exp_id': 'rcabench_evaluation', 'dataset': 'RCABench', 'dataset_index': 1, 'source': 'RCABench', 'level': 0, 'raw_question': 'You are a professional microservices system operations expert and fault diagnosis analyst.\n\n**Task Description**:\nBased on the fault description and related data provided below, conduct a comprehensive root cause analysis. You must strictly follow the structured analysis steps to scientifically determine the root cause.\n\n**Core Requirement: Comparative Analysis**\nAll your analysis and conclusions **MUST** be built upon a **comparison** between **normal period data** (normal_*.parquet) and **abnormal period data** (abnormal_*.parquet).\n\n**Fault Problem**: Current user is experiencing the SLO violation.\nPlease provide a detailed Root Cause Analysis report based on the following incident data and logs.\nThe report should include:\n1. A summary of the incident.\n2. The identified root cause(s).\n3. Any contributing factors.\n4. Recommended actions to prevent future occurrences.\n\n\n**Available Data**:\nYou can access the directory: `/root/RCAgent-RL/training_agl/rcagent/data/data42`. All observation data is stored here.\n* `conclusion.parquet`: Contains high-level anomaly data (e.g., duration, success rate), **pre-compared** between abnormal and normal states.\n* `abnormal_traces.parquet` / `normal_traces.parquet`: Contain detailed Span information from abnormal and normal periods. **This is your primary basis for comparative analysis and conclusions.**\n* `abnormal_logs.parquet` / `normal_logs.parquet`: Contain log messages from abnormal and normal periods. **This is your other primary basis for comparative analysis and conclusions.**\n\n---\n\n**Required Analysis Steps (Show Your Work):**\n\nPlease use the following headings and structure to organize your **entire** analysis process.\n\n### 1. 
Problem Definition (Define)\n* **Fault Phenomenon Summary**: Briefly summarize the fault symptom described in the problem.\n* **Role of `conclusion.parquet` (Symptom Data)**: This file provides the **high-level symptom** or **alert** (e.g., "Span X is slow" or "Service Y success rate dropped"). It tells you **"what"** is wrong, but **CANNOT** tell you **"why"**. **You must not draw any root cause conclusions from this file alone.**\n* **Initial Symptom Observation (based on conclusion.parquet)**: Based on the **alert data** in `conclusion.parquet`, which services or operations are **flagged as symptomatic** (e.g., high latency, low success rate)?\n\n### 2. Hypothesis Formulation (Structure)\nBased on the **symptoms** observed in Step 1, list your primary diagnostic hypotheses. These hypotheses must be testable using the **`traces` and `logs` data**.\n* **Hypothesis 1**: (e.g., "The high latency in \'payment-service\' is a symptom, caused by a slow response from its downstream dependency, \'auth-service\'.")\n* **Hypothesis 2**: (e.g., "The \'database-proxy\' service is generating \'connection error\' logs, not present in the normal period, which is causing upstream failures.")\n* **Hypothesis 3**: ...\n\n### 3. Evidence Analysis & Hypothesis Validation (Analyze)\nUsing the `abnormal_*` and `normal_*` files, systematically validate each hypothesis through **comparative analysis**. 
**This is the core of your analysis.**\n\n* **Service Dependency Analysis (Based on Traces)**:\n    * **Data Used**: Analyze `abnormal_traces.parquet` (using `span_id`, `parent_span_id`, `service_name`) to map the upstream/downstream call chain for the **symptomatic Spans** from Step 1.\n    * **Findings**: (e.g., "The \'checkout-span\' (frontend) calls \'payment-service\', which in turn calls \'auth-service\' and \'database-proxy\'.")\n    * **Conclusion**: (e.g., "The problem in \'payment-service\' could be influenced by \'auth-service\' or \'database-proxy\'.")\n\n* **Validating Hypothesis 1: (e.g., \'payment-service\' is impacted by \'auth-service\')**\n    * **Data Used**: **Compare** `abnormal_traces.parquet` and `normal_traces.parquet`. **Filter** for Spans where `service_name == \'payment-service\'` and **inspect** their child Spans where `service_name == \'auth-service\'`.\n    * **Findings**: (e.g., "In `normal_traces.parquet`, the \'auth-service\' Span takes 30ms. However, in `abnormal_traces.parquet`, the \'auth-service\' Span takes 2900ms, while the \'payment-service\' self-time (parent duration - child span duration) remains stable at 50ms.")\n    * **Conclusion**: (e.g., "Hypothesis 1 is **Confirmed**. The high latency of \'payment-service\' is **not** its own fault but is **propagated** from its slow dependency, \'auth-service\'.")\n\n* **Validating Hypothesis 2: (e.g., \'database-proxy\' emitting new errors)**\n    * **Data Used**: **Compare** logs where `service_name == \'database-proxy\'` in `abnormal_logs.parquet` and `normal_logs.parquet`. **Correlate** these logs with the timestamps of slow \'database-proxy\' Spans in `abnormal_traces.parquet`.\n    * **Findings**: (e.g., "No \'Connection timed out\' errors are found in `normal_logs.parquet`. 
In `abnormal_logs.parquet`, these errors appear frequently, and **their timestamps perfectly match the timestamps of the latency spikes for \'database-proxy\' Spans**.")\n    * **Conclusion**: (e.g., "Hypothesis 2 is **Confirmed**. These new log errors are strongly correlated with the latency phenomenon seen in the trace data.")\n\n* *(Continue for all other hypotheses...)*\n\n### 4. Synthesis & Root Cause Identification (Synthesize)\nBased on the hypotheses **confirmed through detailed trace and log comparison** in Step 3, synthesize your findings to form a final conclusion.\n\n* **Fault Propagation Path**: (Based on Step 3 dependency analysis and validation) Describe the complete chain of events. (e.g., "1. The \'database-proxy\' began emitting \'Connection timed out\' errors **(log comparison evidence)** not seen in the normal period. 2. This caused its caller, \'auth-service\', to slow down, with latency jumping **(trace comparison evidence)** from 30ms to 2900ms. 3. The latency from \'auth-service\' **propagated** to its caller, \'payment-service\'. 4. This finally resulted in the high latency alert for the \'checkout\' operation reported in `conclusion.parquet`.")\n* **Root Cause**: State the single, underlying root cause. (e.g., "The root cause was the exhaustion of the \'database-proxy\' connection pool **(evidenced by the comparative log errors)**, which caused \'auth-service\' to block **(evidenced by the comparative high latency of its child span)**, and ultimately manifested as a cascading failure in \'payment-service\'.")\n* **Core Evidence Chain (Based on Comparison)**: Briefly link the key pieces of evidence. (e.g., `database-proxy` log **comparison** (normal: 0 errors vs. abnormal: N errors) -> **correlates with** `auth-service` trace latency **comparison** (normal: 30ms vs. 
abnormal: 2900ms) -> **propagates to** `payment-service` trace -> **triggers** `conclusion.parquet` alert)\n\n---\n\n**Final Conclusion (JSON):**\nAfter you have completed the detailed analysis in all the steps above, provide the final JSON conclusion as required by your system instructions.\n', 'correct_answer': 'ts-inside-payment-service', 'file_name': '', 'stage': 'init'}, start_time=1762070483.3769877, end_time=1762070488.321383, mode=None, resources_id='rs-d3d5da38eb1f', status='succeeded', config=RolloutConfig(timeout_seconds=None, unresponsive_seconds=None, max_attempts=1, retry_condition=[]), metadata={})]
2025-11-02 16:01:28,321 [INFO] (Process-4057198 __main__)   Recorded 7 spans for rollout ro-e2292df7a4c9
2025-11-02 16:01:28,321 [INFO] (Process-4057198 __main__)   Final response: rollout_id='ro-e2292df7a4c9' attempt_id='at-b66f7abe' sequence_id=1 trace_id='3f76b67dc0da06210ff6342c27683e8e' span_id='e23f898afefc8d66' parent_id=None name='Received Proxy Server Request' status=TraceStatus(status_code='UNSET', description=None) attributes={} events=[] links=[] start_time=1762070483.5280588 end_time=1762070488.28927 context=SpanContext(trace_id='3f76b67dc0da06210ff6342c27683e8e', span_id='e23f898afefc8d66', is_remote=False, trace_state={}) parent=None resource=OtelResource(attributes={'telemetry.sdk.language': 'python', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.38.0', 'service.name': 'litellm', 'deployment.environment': 'production', 'model_id': 'litellm'}, schema_url='')
2025-11-02 16:01:28,321 [INFO] (Process-4057198 __main__)   Extracted reward: None
2025-11-02 16:01:28,321 [INFO] (Process-4057198 __main__)   No triplets extracted from spans for rollout ro-e2292df7a4c9
2025-11-02 16:01:28,321 [INFO] (Process-4057198 agentlightning.llm_proxy)   Stopping LLMProxy server thread...
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [4057198]
2025-11-02 16:01:28,522 [INFO] (Process-4057198 agentlightning.llm_proxy)   LLMProxy server thread stopped.

This is the AgentOps tracer:

(rcagent-rl) root@iv-ye5ro8ne9s5i3z3p14l4 ~/RCAgent-RL (main)# uv run python ./training_agl/rcagent/rcagent.py --mode runner
2025-11-02 16:03:12,775[utu.eval.data.data_manager] - WARNING - data_manager.py:45 - exp_id rcabench_evaluation already exists in db
2025-11-02 16:03:12,778 [INFO] (Process-4058921 agentlightning.llm_proxy)   Adding a new middleware to the FastAPI app.
2025-11-02 16:03:12,782 [INFO] (Process-4058921 agentlightning.llm_proxy)   Starting LLMProxy server thread...
INFO:     Started server process [4058921]
INFO:     Waiting for application startup.

   ██╗     ██╗████████╗███████╗██╗     ██╗     ███╗   ███╗
   ██║     ██║╚══██╔══╝██╔════╝██║     ██║     ████╗ ████║
   ██║     ██║   ██║   █████╗  ██║     ██║     ██╔████╔██║
   ██║     ██║   ██║   ██╔══╝  ██║     ██║     ██║╚██╔╝██║
   ███████╗██║   ██║   ███████╗███████╗███████╗██║ ╚═╝ ██║
   ╚══════╝╚═╝   ╚═╝   ╚══════╝╚══════╝╚══════╝╚═╝     ╚═╝


#------------------------------------------------------------#
#                                                            #
#           'It would help me if you could add...'            #
#        https://github.com/BerriAI/litellm/issues/new        #
#                                                            #
#------------------------------------------------------------#

 Thank you for using LiteLLM! - Krrish & Ishaan



Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new


LiteLLM: Proxy initialized with Config, Set models:
    Qwen/Qwen3-1.7B
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
2025-11-02 16:03:12,792 [INFO] (Process-4058921 agentlightning.tracer.agentops)   [Worker 0] Setting up tracer...
2025-11-02 16:03:12,797 [INFO] (Process-4058921 agentlightning.instrumentation.agentops)   Patched newer version of agentops using handle_chat_attributes
2025-11-02 16:03:12,797 [INFO] (Process-4058921 agentlightning.tracer.agentops)   [Worker 0] Instrumentation applied.
🖇 AgentOps: [OPENAI INSTRUMENTOR] Error setting up OpenAI streaming wrappers: No module named 'openai.resources.beta.chat'
🖇 AgentOps: [OPENAI INSTRUMENTOR] Error setting up OpenAI streaming wrappers: No module named 'openai.resources.beta.chat'
2025-11-02 16:03:12,968 [INFO] (Process-4058921 agentlightning.tracer.agentops)   [Worker 0] AgentOps client initialized.
🖇 AgentOps: Session Replay for ro-c0654c496b24 trace: https://app.agentops.ai/sessions?trace_id=2f7d978199fd50de27b52cf3d4d53ff1
2025-11-02 16:03:13,004 [WARNING] (Process-4058921 agentlightning.llm_proxy)   Loop and lock are not owned by the current process. Clearing them.
INFO:     10.0.6.208:53580 - "POST /chat/completions HTTP/1.1" 200 OK
2025-11-02 16:03:16,227 [WARNING] (Process-4058921 agentlightning.llm_proxy)   No headers found in 1 subtree spans. Cannot log to store.
2025-11-02 16:03:16,228 [WARNING] (Process-4058921 agentlightning.llm_proxy)   No headers found in 1 subtree spans. Cannot log to store.
INFO:     10.0.6.208:53580 - "POST /chat/completions HTTP/1.1" 200 OK
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   Trace spans collected during rollout ro-c0654c496b24:
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   - simple_agent (status=<opentelemetry.trace.status.Status object at 0x7f9b1201a510>)
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   - openai.chat.completion (status=<opentelemetry.trace.status.Status object at 0x7f9c2fdcb740>)
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   - openai.chat.completion (status=<opentelemetry.trace.status.Status object at 0x7f9b103ba360>)
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   - openai.chat.completion (status=<opentelemetry.trace.status.Status object at 0x7f9c2fd7cb30>)
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   - openai.chat.completion (status=<opentelemetry.trace.status.Status object at 0x7f9b116a3590>)
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   - generation (status=<opentelemetry.trace.status.Status object at 0x7f9c303be960>)
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   - list_tables_in_directory (status=<opentelemetry.trace.status.Status object at 0x7f9b12141700>)
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   - openai.chat.completion (status=<opentelemetry.trace.status.Status object at 0x7f9b1025fd40>)
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   - openai.chat.completion (status=<opentelemetry.trace.status.Status object at 0x7f9b116a30e0>)
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   - openai.chat.completion (status=<opentelemetry.trace.status.Status object at 0x7f9b101c7f50>)
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   - openai.chat.completion (status=<opentelemetry.trace.status.Status object at 0x7f9b103b9670>)
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   - generation (status=<opentelemetry.trace.status.Status object at 0x7f9b105d2b10>)
2025-11-02 16:03:19,893 [INFO] (Process-4058921 __main__)   - RCA-Agent (status=<opentelemetry.trace.status.Status object at 0x7f9b1201a870>)
2025-11-02 16:03:19,894 [WARNING] (Process-4058921 agentlightning.llm_proxy)   No headers found in 2 subtree spans. Cannot log to store.
🖇 AgentOps: Session Replay for ro-c0654c496b24.session trace: https://app.agentops.ai/sessions?trace_id=2f7d978199fd50de27b52cf3d4d53ff1
2025-11-02 16:03:19,894 [INFO] (Process-4058921 agentlightning.runner.agent)   [Worker 0 | Rollout ro-c0654c496b24] Completed in 6.92s. Collected 13 span(s). Final reward: None
2025-11-02 16:03:19,894 [INFO] (Process-4058921 agentlightning.instrumentation.agentops)   Unpatched newer version of agentops using handle_chat_attributes
2025-11-02 16:03:19,894 [INFO] (Process-4058921 agentlightning.tracer.agentops)   [Worker 0] Instrumentation removed.
2025-11-02 16:03:19,894 [INFO] (Process-4058921 __main__)   Rollouts captured: [Rollout(rollout_id='ro-c0654c496b24', input={'exp_id': 'rcabench_evaluation', 'dataset': 'RCABench', 'dataset_index': 1, 'source': 'RCABench', 'level': 0, 'raw_question': 'You are a professional microservices system operations expert and fault diagnosis analyst.\n\n**Task Description**:\nBased on the fault description and related data provided below, conduct a comprehensive root cause analysis. You must strictly follow the structured analysis steps to scientifically determine the root cause.\n\n**Core Requirement: Comparative Analysis**\nAll your analysis and conclusions **MUST** be built upon a **comparison** between **normal period data** (normal_*.parquet) and **abnormal period data** (abnormal_*.parquet).\n\n**Fault Problem**: Current user is experiencing the SLO violation.\nPlease provide a detailed Root Cause Analysis report based on the following incident data and logs.\nThe report should include:\n1. A summary of the incident.\n2. The identified root cause(s).\n3. Any contributing factors.\n4. Recommended actions to prevent future occurrences.\n\n\n**Available Data**:\nYou can access the directory: `/root/RCAgent-RL/training_agl/rcagent/data/data42`. All observation data is stored here.\n* `conclusion.parquet`: Contains high-level anomaly data (e.g., duration, success rate), **pre-compared** between abnormal and normal states.\n* `abnormal_traces.parquet` / `normal_traces.parquet`: Contain detailed Span information from abnormal and normal periods. **This is your primary basis for comparative analysis and conclusions.**\n* `abnormal_logs.parquet` / `normal_logs.parquet`: Contain log messages from abnormal and normal periods. **This is your other primary basis for comparative analysis and conclusions.**\n\n---\n\n**Required Analysis Steps (Show Your Work):**\n\nPlease use the following headings and structure to organize your **entire** analysis process.\n\n### 1. 
Problem Definition (Define)\n* **Fault Phenomenon Summary**: Briefly summarize the fault symptom described in the problem.\n* **Role of `conclusion.parquet` (Symptom Data)**: This file provides the **high-level symptom** or **alert** (e.g., "Span X is slow" or "Service Y success rate dropped"). It tells you **"what"** is wrong, but **CANNOT** tell you **"why"**. **You must not draw any root cause conclusions from this file alone.**\n* **Initial Symptom Observation (based on conclusion.parquet)**: Based on the **alert data** in `conclusion.parquet`, which services or operations are **flagged as symptomatic** (e.g., high latency, low success rate)?\n\n### 2. Hypothesis Formulation (Structure)\nBased on the **symptoms** observed in Step 1, list your primary diagnostic hypotheses. These hypotheses must be testable using the **`traces` and `logs` data**.\n* **Hypothesis 1**: (e.g., "The high latency in \'payment-service\' is a symptom, caused by a slow response from its downstream dependency, \'auth-service\'.")\n* **Hypothesis 2**: (e.g., "The \'database-proxy\' service is generating \'connection error\' logs, not present in the normal period, which is causing upstream failures.")\n* **Hypothesis 3**: ...\n\n### 3. Evidence Analysis & Hypothesis Validation (Analyze)\nUsing the `abnormal_*` and `normal_*` files, systematically validate each hypothesis through **comparative analysis**. 
**This is the core of your analysis.**\n\n* **Service Dependency Analysis (Based on Traces)**:\n    * **Data Used**: Analyze `abnormal_traces.parquet` (using `span_id`, `parent_span_id`, `service_name`) to map the upstream/downstream call chain for the **symptomatic Spans** from Step 1.\n    * **Findings**: (e.g., "The \'checkout-span\' (frontend) calls \'payment-service\', which in turn calls \'auth-service\' and \'database-proxy\'.")\n    * **Conclusion**: (e.g., "The problem in \'payment-service\' could be influenced by \'auth-service\' or \'database-proxy\'.")\n\n* **Validating Hypothesis 1: (e.g., \'payment-service\' is impacted by \'auth-service\')**\n    * **Data Used**: **Compare** `abnormal_traces.parquet` and `normal_traces.parquet`. **Filter** for Spans where `service_name == \'payment-service\'` and **inspect** their child Spans where `service_name == \'auth-service\'`.\n    * **Findings**: (e.g., "In `normal_traces.parquet`, the \'auth-service\' Span takes 30ms. However, in `abnormal_traces.parquet`, the \'auth-service\' Span takes 2900ms, while the \'payment-service\' self-time (parent duration - child span duration) remains stable at 50ms.")\n    * **Conclusion**: (e.g., "Hypothesis 1 is **Confirmed**. The high latency of \'payment-service\' is **not** its own fault but is **propagated** from its slow dependency, \'auth-service\'.")\n\n* **Validating Hypothesis 2: (e.g., \'database-proxy\' emitting new errors)**\n    * **Data Used**: **Compare** logs where `service_name == \'database-proxy\'` in `abnormal_logs.parquet` and `normal_logs.parquet`. **Correlate** these logs with the timestamps of slow \'database-proxy\' Spans in `abnormal_traces.parquet`.\n    * **Findings**: (e.g., "No \'Connection timed out\' errors are found in `normal_logs.parquet`. 
In `abnormal_logs.parquet`, these errors appear frequently, and **their timestamps perfectly match the timestamps of the latency spikes for \'database-proxy\' Spans**.")\n    * **Conclusion**: (e.g., "Hypothesis 2 is **Confirmed**. These new log errors are strongly correlated with the latency phenomenon seen in the trace data.")\n\n* *(Continue for all other hypotheses...)*\n\n### 4. Synthesis & Root Cause Identification (Synthesize)\nBased on the hypotheses **confirmed through detailed trace and log comparison** in Step 3, synthesize your findings to form a final conclusion.\n\n* **Fault Propagation Path**: (Based on Step 3 dependency analysis and validation) Describe the complete chain of events. (e.g., "1. The \'database-proxy\' began emitting \'Connection timed out\' errors **(log comparison evidence)** not seen in the normal period. 2. This caused its caller, \'auth-service\', to slow down, with latency jumping **(trace comparison evidence)** from 30ms to 2900ms. 3. The latency from \'auth-service\' **propagated** to its caller, \'payment-service\'. 4. This finally resulted in the high latency alert for the \'checkout\' operation reported in `conclusion.parquet`.")\n* **Root Cause**: State the single, underlying root cause. (e.g., "The root cause was the exhaustion of the \'database-proxy\' connection pool **(evidenced by the comparative log errors)**, which caused \'auth-service\' to block **(evidenced by the comparative high latency of its child span)**, and ultimately manifested as a cascading failure in \'payment-service\'.")\n* **Core Evidence Chain (Based on Comparison)**: Briefly link the key pieces of evidence. (e.g., `database-proxy` log **comparison** (normal: 0 errors vs. abnormal: N errors) -> **correlates with** `auth-service` trace latency **comparison** (normal: 30ms vs. 
abnormal: 2900ms) -> **propagates to** `payment-service` trace -> **triggers** `conclusion.parquet` alert)\n\n---\n\n**Final Conclusion (JSON):**\nAfter you have completed the detailed analysis in all the steps above, provide the final JSON conclusion as required by your system instructions.\n', 'correct_answer': 'ts-inside-payment-service', 'file_name': '', 'stage': 'init'}, start_time=1762070592.9693727, end_time=1762070599.8946114, mode=None, resources_id='rs-563fc20c7887', status='succeeded', config=RolloutConfig(timeout_seconds=None, unresponsive_seconds=None, max_attempts=1, retry_condition=[]), metadata={})]
2025-11-02 16:03:19,895 [INFO] (Process-4058921 __main__)   Recorded 27 spans for rollout ro-c0654c496b24
2025-11-02 16:03:19,895 [INFO] (Process-4058921 __main__)   Final response: rollout_id='ro-c0654c496b24' attempt_id='at-e822e8af' sequence_id=15 trace_id='2f7d978199fd50de27b52cf3d4d53ff1' span_id='39997c411ac3b997' parent_id='0679f35d1ee4d844' name='RCA-Agent' status=TraceStatus(status_code='OK', description=None) attributes={'trace.id': 'ro-c0654c496b24', 'span.id': 'span_8289d93ac10944bda7d498ef', 'instrumentation.name': 'agentops', 'instrumentation.version': '0.4.21', 'agentops.span.kind': 'agent', 'agent.name': 'RCA-Agent', 'agent.tools': '["get_schema", "list_tables_in_directory", "query_parquet_files"]'} events=[] links=[] start_time=1762070592.9881177 end_time=1762070599.8871913 context=SpanContext(trace_id='2f7d978199fd50de27b52cf3d4d53ff1', span_id='39997c411ac3b997', is_remote=False, trace_state={}) parent=SpanContext(trace_id='2f7d978199fd50de27b52cf3d4d53ff1', span_id='0679f35d1ee4d844', is_remote=False, trace_state={}) resource=OtelResource(attributes={'service.name': 'agentops', 'agentops.project.id': 'temporary', 'imported_libraries': ('argparse', 'asyncio', 'json', 'socket', 'agentlightning')}, schema_url='')
2025-11-02 16:03:19,895 [INFO] (Process-4058921 __main__)   Extracted reward: None
2025-11-02 16:03:19,895 [INFO] (Process-4058921 __main__)   No triplets extracted from spans for rollout ro-c0654c496b24
2025-11-02 16:03:19,895 [INFO] (Process-4058921 agentlightning.llm_proxy)   Stopping LLMProxy server thread...
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [4058921]
2025-11-02 16:03:20,082 [INFO] (Process-4058921 agentlightning.llm_proxy)   LLMProxy server thread stopped.
(rcagent-rl) root@iv-ye5ro8ne9s5i3z3p14l4 ~/RCAgent-RL (main)# 

@ultmaster
Copy link
Contributor

ultmaster commented Nov 2, 2025

I think the problem is with the LlmProxy. LlmProxy registers a global tracer provider that interferes with other tracer providers registered in the runner. You must run LlmProxy in another process if you want to run them together.

See #255 #206 and the comments in llm_proxy.py. In particular:

    !!! danger

        Do not run LLM proxy in the same process as the main runner. It's easy to cause conflicts in the tracer provider
        with tracers like [`AgentOpsTracer`][agentlightning.AgentOpsTracer].

We might want to make LlmProxy.start() spawn a new process by default, because at least three people have encountered this problem.
