feat: add phoenix tracer #250
base: main
Conversation
pyproject.toml (Outdated)
```diff
-apo = [
-    "poml",
-]
+apo = ["poml"]
```
What formatting tool are you using to format pyproject.toml? We can add it to pre-commit.
Either way, let's keep the format changes out of this PR to reduce unnecessary conflicts.
I am using Even Better TOML; I will fix it.
pyproject.toml (Outdated)
```toml
    "poml",
]
apo = ["poml"]
phoenix = ["arize-phoenix-client>=1.14.1", "arize-phoenix-otel>=0.11.0"]
```
I think you'll need to add `--extra phoenix` to the unit tests in GitHub Actions.
fixed
/ci
🚀 CI Watcher for correlation id-3471549478-mhehz7bn triggered by comment 3471549478
✅ All runs completed.
I find that this version cannot be used in RL training right now. For example, rewards are returned in my LitAgent, but the reward is always None:
Are you returning rewards from …? If you are using …? If you are returning rewards, that's weird, because the runner writes that span directly to the store. We might need some bug fixes.
I return the float reward in the rollout function; the following is an example:

```python
async def rollout_async(
    self,
    task: dict,
    resources: agl.NamedResources,
    rollout: agl.Rollout,
) -> agl.RolloutRawResult:
    sample = EvaluationSample(**task)
    start_time = time.time()
    assert self.agent_config is not None
    llm: agl.LLM = cast(agl.LLM, resources["main_llm"])
    print(
        "using model:",
        llm.model,
        "at",
        llm.get_base_url(rollout.rollout_id, rollout.attempt.attempt_id),
    )
    self.agent_config.model.model_provider.base_url = llm.get_base_url(
        rollout.rollout_id, rollout.attempt.attempt_id
    )
    self.agent_config.model.model_provider.model = llm.model
    self.agent = cast(SimpleAgent, get_agent(config=self.agent_config))
    result = await self.agent.run(sample.raw_question, trace_id=rollout.rollout_id)
    end_time = time.time()
    # Update sample with results
    sample.update(
        response=result.final_output,
        time_cost=end_time - start_time,
        trajectories=json.dumps(result.trajectories, ensure_ascii=False),
        stage="rollout",
    )
    judged_sample: EvaluationSample = await self.rcabench.judge_one(sample)
    print(judged_sample)
    return 1 if judged_sample.correct else 0.0001
```

So I need to use …
When I replace …
Put a print here: do you see a span generated here?
Yes, it is generated. In previous logs, it has:
If you are seeing a span here, and the span did get sent into the store, but you are not seeing any rewards at VERL, there could be two reasons:
I think the second case is more likely, since the phoenix tracer might not generate the same format of token ids as the agentops tracer. Could you show me what your trace attributes look like?
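A minimal, self-contained sketch of how one might scan collected spans for token-id attributes when comparing what two tracers emit; the span shape and attribute keys below are hypothetical illustrations, not the actual schema used by Agent Lightning, AgentOps, or Phoenix:

```python
# Minimal sketch: scan span attribute dicts for token-id-like keys.
# The span structure and attribute names here are invented for
# illustration; inspect your real spans to find the exact keys.

def find_token_id_attributes(spans: list[dict]) -> list[tuple[str, str]]:
    """Return (span_name, attribute_key) pairs whose key mentions token ids."""
    hits = []
    for span in spans:
        for key in span.get("attributes", {}):
            if "token_id" in key or "token.ids" in key:
                hits.append((span.get("name", "<unknown>"), key))
    return hits

spans = [
    {"name": "openai.chat.completion",
     "attributes": {"gen_ai.response.model": "qwen", "prompt_token_ids": [1, 2]}},
    {"name": "reward", "attributes": {"reward": 0.1}},
]
print(find_token_id_attributes(spans))
# → [('openai.chat.completion', 'prompt_token_ids')]
```

Running a scan like this over both tracers' output makes a format mismatch (different key names, or missing keys entirely) immediately visible.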
Actually, when I use the default AgentOps tracer, I also find that the token ids are missing. There are only rewards here.
Is this because the agent I implemented is not instrumented? I am using youtu-agent (https://github.com/TencentCloudADP/youtu-agent), which is a wrapper around openai-agents. I think AgentOps can instrument it by default.
Not really. AgentOps instruments the general OpenAI provider. Missing instrumentation for an agent framework only affects discriminating between multiple agents. There was, however, a bug fix for the OpenAI provider yesterday. Would you try pulling the latest main?
Switched to the latest main; the issue remains. The full log of a single rollout is below (it looks quite normal, except for the last line):

```
2025-11-01 13:08:38,550 [INFO] (Process-2884936 agentlightning.store.client_server) 127.0.0.1:35350 - "POST /agl/v1/rollouts/ro-36214da8af95/attempts/at-59ed3351 HTTP/1.1" 200 in 0.71 ms
🖇 AgentOps: Session Replay for ro-36214da8af95 trace: https://app.agentops.ai/sessions?trace_id=9c2a8ff54a02cc43f4af77e971ee2db6
🖇 AgentOps: Session Replay for ro-36214da8af95.session trace: https://app.agentops.ai/sessions?trace_id=9c2a8ff54a02cc43f4af77e971ee2db6
2025-11-01 13:09:04,650 [INFO] (Process-2885573 agentlightning.runner.agent) [Worker 4 | Rollout ro-36214da8af95] Completed in 26.10s. Collected 22 span(s). Final reward: 0.1
2025-11-01 13:09:04,652 [INFO] (Process-2884936 agentlightning.store.client_server) 127.0.0.1:51356 - "POST /agl/v1/rollouts/ro-36214da8af95/attempts/at-59ed3351 HTTP/1.1" 200 in 0.73 ms
(TaskRunner pid=2885675) Warning: Rollout ro-36214da8af95 contains empty response: [Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=None, metadata={'response_id': 'chatcmpl-47a1b14f36e24e1b9a5d8575270e7df2', 'agent_name': '*'}), Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=None, metadata={'response_id': 'chatcmpl-212d66a4ca034324aba8eca488576261', 'agent_name': '*'}), Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=0.1, metadata={'response_id': 'chatcmpl-c02e043d081542e38c00037e2bfb693a', 'agent_name': '*'})]
```
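The warning in that last log line flags triplets whose prompt and response token_ids are both empty. As a hedged illustration (modeling triplets as plain dicts rather than the real Triplet class), the emptiness condition the runner is warning about amounts to a check like this:

```python
# Sketch of the "empty response" condition seen in the warning above.
# Triplets are modeled as plain dicts for illustration; the real
# Triplet is a structured object, not a dict.

def has_trainable_tokens(triplet: dict) -> bool:
    """True only if both prompt and response carry non-empty token ids."""
    prompt_ids = triplet.get("prompt", {}).get("token_ids", [])
    response_ids = triplet.get("response", {}).get("token_ids", [])
    return bool(prompt_ids) and bool(response_ids)

triplets = [
    {"prompt": {"token_ids": []}, "response": {"token_ids": []}, "reward": None},
    {"prompt": {"token_ids": [1]}, "response": {"token_ids": [2, 3]}, "reward": 0.1},
]
print([has_trainable_tokens(t) for t in triplets])  # [False, True]
```

A rollout where every triplet fails this check gives the RL trainer nothing to optimize, even when a reward span is present.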
Pull Request Overview
This PR adds Phoenix tracer integration to Agent Lightning, enabling observability and tracing through Arize Phoenix. The integration leverages OpenTelemetry and reuses the existing LightningSpanProcessor to capture and store spans.
- Introduces a new `PhoenixTracer` class that wraps `arize-phoenix-otel` registration
- Adds comprehensive test coverage for the tracer and span processor lifecycle
- Includes new optional dependencies under the `phoenix` extra
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| uv.lock | Adds Phoenix-related dependencies (arize-phoenix-client, arize-phoenix-otel) and their transitive dependencies |
| pyproject.toml | Declares new phoenix optional dependency group with minimum version constraints |
| agentlightning/tracer/phoenix.py | Implements PhoenixTracer class with worker lifecycle management and span collection |
| tests/tracer/test_phoenix.py | Comprehensive test suite covering processor lifecycle, concurrency, multiprocessing, and reward tracing |
| .github/workflows/tests.yml | Updates CI to install phoenix extra for testing |
| .github/workflows/tests-full.yml | Updates full test suite CI to install phoenix extra |
```toml
    "poml",
]
phoenix = [
    "arize-phoenix-client>=1.14.1",
```
Copilot AI (Nov 1, 2025)
The minimum version constraint specifies >=1.14.1, but the lock file contains version 1.21.0 which appears to not exist yet. Consider specifying an upper bound constraint (e.g., >=1.14.1,<2.0) to prevent unexpected breaking changes while ensuring compatibility with currently available versions.
```diff
-"arize-phoenix-client>=1.14.1",
+"arize-phoenix-client>=1.14.1,<2.0",
```
```python
except Exception as exc:  # pragma: no cover - defensive path
    try:
        conn.send(("error", repr(exc)))
    except Exception:
```
Copilot AI (Nov 1, 2025)
'except' clause does nothing but pass and there is no explanatory comment.
```python
except Exception:
    # Suppress any exception when sending error to parent process,
    # as the connection may already be closed or broken.
```
Is this issue with the AgentOps tracer? Please first print out the spans following the debug tutorial here: https://microsoft.github.io/agent-lightning/latest/tutorials/debug/ Make sure the token ids are included in the span in the right place. If the token ids are missing, try to log …
I tried, but the token_ids are also missing. The code snippet is shown below. I think it is simple, but both AgentOpsTracer and OtelTracer do not output the Triplet.

from __future__ import annotations
import argparse
import asyncio
import json
import socket
import time
from collections.abc import Iterable
from typing import Any, cast
import agentlightning as agl
from utu.agents import get_agent
from utu.agents.simple_agent import SimpleAgent
from utu.config.loader import ConfigLoader
from utu.db.eval_datapoint import EvaluationSample
from utu.eval.data import DBDataManager
from utu.eval.processer.rcabench import RCABenchProcesser
agl.configure_logger()
logger = agl.configure_logger(name=__name__)
def _find_free_port() -> int:
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("127.0.0.1", 0))
return cast(int, sock.getsockname()[1])
class LitRCAgent(agl.LitAgent[EvaluationSample]):
def __init__(
self,
eval_config_name: str = "rcabench_simple_agent",
) -> None:
self.config = ConfigLoader.load_eval_config(name=eval_config_name)
self.agent_config = self.config.agent
assert self.agent_config is not None
self.rcabench = RCABenchProcesser(config=self.config)
async def rollout_async(
self,
task: dict,
resources: agl.NamedResources,
rollout: agl.Rollout,
) -> agl.RolloutRawResult:
sample = EvaluationSample(**task)
start_time = time.time()
assert self.agent_config is not None
llm: agl.LLM = cast(agl.LLM, resources["main_llm"])
self.agent_config.model.model_provider.base_url = llm.get_base_url(
rollout.rollout_id, rollout.attempt.attempt_id
)
self.agent_config.model.model_provider.model = llm.model
self.agent = cast(SimpleAgent, get_agent(config=self.agent_config))
result = await self.agent.run(sample.raw_question, trace_id=rollout.rollout_id)
end_time = time.time()
sample.update(
response=result.final_output,
time_cost=end_time - start_time,
trajectories=json.dumps(result.trajectories, ensure_ascii=False),
stage="rollout",
)
judged_sample: EvaluationSample = await self.rcabench.judge_one(sample)
assert judged_sample.confidence is not None
agl.emit_reward(judged_sample.confidence)
# return judged_sample.confidence
def rollout(
self,
task: Any,
resources: agl.NamedResources,
rollout: agl.Rollout,
) -> float | None:
return asyncio.run(self.rollout_async(task, resources, rollout)) # type: ignore
def load_dataset(self, num: int = 5) -> list[EvaluationSample]:
dataset = DBDataManager(self.config)
samples = dataset.load()[:num]
data = [self.rcabench.preprocess_one(i) for i in samples]
return data
class RunnerDebugHook(agl.Hook):
async def on_trace_end(self, *, agent, runner, tracer, rollout) -> None: # type: ignore[override]
get_last_trace = getattr(tracer, "get_last_trace", None)
if not callable(get_last_trace):
logger.info("Tracer %s has no get_last_trace helper", type(tracer).__name__)
return
raw_spans = get_last_trace()
if not raw_spans:
logger.info(
"No spans collected during rollout %s",
getattr(rollout, "rollout_id", "<unknown>"),
)
return
if hasattr(raw_spans, "spans"):
span_iterable = getattr(raw_spans, "spans")
elif isinstance(raw_spans, Iterable) and not isinstance(
raw_spans, (str, bytes)
):
span_iterable = raw_spans
else:
span_iterable = (raw_spans,)
spans = list(span_iterable)
if not spans:
logger.info(
"No spans collected during rollout %s",
getattr(rollout, "rollout_id", "<unknown>"),
)
return
logger.info(
"Trace spans collected during rollout %s:",
getattr(rollout, "rollout_id", "<unknown>"),
)
for span in spans:
logger.info(
"- %s (status=%s)",
getattr(span, "name", "<unknown>"),
getattr(span, "status", None),
)
async def _debug_runner_once(
*,
eval_config_name: str = "rcabench_simple_agent_with_sql",
llm_endpoint: str = "http://localhost:8000/v1",
llm_model: str = "Qwen/Qwen3-1.7B",
sample_index: int = 0,
hooks: list[agl.Hook] | None = None,
) -> None:
agent = LitRCAgent(eval_config_name=eval_config_name)
samples = agent.load_dataset(num=sample_index + 1)
if not samples:
raise ValueError(
"Dataset is empty; ensure the evaluation database is populated"
)
try:
sample = samples[sample_index]
except IndexError as exc: # pragma: no cover - defensive guard
raise ValueError(
f"Sample index {sample_index} is out of range for dataset of size {len(samples)}"
) from exc
tracer = agl.AgentOpsTracer()
runner = agl.LitAgentRunner(tracer=tracer)
store = agl.InMemoryLightningStore()
llm_proxy = agl.LLMProxy(
port=8081,
model_list=[
{
"model_name": "Qwen/Qwen3-1.7B",
"litellm_params": {
"model": "hosted_vllm/Qwen/Qwen3-1.7B",
"api_base": llm_endpoint,
},
}
],
store=store,
)
llm_proxy.start()
resources: agl.NamedResources = {"main_llm": llm_proxy.as_resource()}
# resources: agl.NamedResources = {
# "main_llm": agl.LLM(
# endpoint="http://localhost:8001/v1",
# model=llm_model,
# )
# }
active_hooks = hooks or [RunnerDebugHook()]
try:
with runner.run_context(agent=agent, store=store, hooks=active_hooks):
rollout = await runner.step(sample.model_dump(), resources=resources)
rollouts = await store.query_rollouts()
spans = await store.query_spans(rollout.rollout_id)
reward = agl.emitter.find_final_reward(spans)
logger.info("Rollouts captured: %s", rollouts)
logger.info("Recorded %d spans for rollout %s", len(spans), rollout.rollout_id)
logger.info(f"Final response: {spans[-1]}")
logger.info("Extracted reward: %s", reward)
adapter = agl.LlmProxyTraceToTriplet()
triplets = adapter.adapt(spans)
if not triplets:
logger.info(
"No triplets extracted from spans for rollout %s", rollout.rollout_id
)
else:
for index, triplet in enumerate(triplets, start=1):
prompt_token_ids = (
triplet.prompt.get("token_ids")
if isinstance(triplet.prompt, dict)
else triplet.prompt
)
response_payload = triplet.response
if isinstance(response_payload, dict):
response_token_ids = response_payload.get("token_ids")
else:
response_token_ids = response_payload
logger.info(
"Triplet %d: prompt_token_ids=%s output_token_ids=%s reward=%s",
index,
prompt_token_ids,
response_token_ids,
triplet.reward,
)
finally:
if llm_proxy.is_running():
llm_proxy.stop()
def debug_runner_once(
*,
eval_config_name: str = "rcabench_simple_agent_with_sql",
llm_endpoint: str = "http://localhost:8000/v1",
llm_model: str = "Qwen/Qwen3-1.7B",
sample_index: int = 0,
hooks: list[agl.Hook] | None = None,
) -> None:
"""Public entry point that wraps the async runner debug helper."""
asyncio.run(
_debug_runner_once(
eval_config_name=eval_config_name,
llm_endpoint=llm_endpoint,
llm_model=llm_model,
sample_index=sample_index,
hooks=hooks,
)
)
def debug_trainer_dev(
*,
eval_config_name: str = "rcabench_simple_agent_with_sql",
llm_endpoint: str = "http://localhost:8000/v1",
llm_model: str = "Qwen/Qwen3-1.7B",
num_samples: int = 10,
n_workers: int = 1,
) -> None:
"""Dry-run the trainer loop using Trainer.dev with configurable resources."""
agent = LitRCAgent(eval_config_name=eval_config_name)
debug_samples = agent.load_dataset(num=num_samples)
logger.info("Loaded %d samples for trainer dry run", len(debug_samples))
from utu.tracing import PhoenixTracer
tracer = PhoenixTracer(auto_instrument=True)
trainer = agl.Trainer(
n_workers=n_workers,
initial_resources={
"main_llm": agl.LLM(
endpoint=llm_endpoint,
model=llm_model,
)
},
tracer=tracer,
)
trainer.dev(agent, debug_samples)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Debug RCAgent")
parser.add_argument(
"--mode",
choices=["runner", "trainer"],
default="runner",
help="Choose debug mode: 'runner' for debug_runner_once, 'trainer' for debug_trainer_dev",
)
args = parser.parse_args()
if args.mode == "runner":
debug_runner_once()
elif args.mode == "trainer":
debug_trainer_dev()
What log output are you getting from running this script?
This is the Otel Tracer:

(rcagent-rl) root@iv-ye5ro8ne9s5i3z3p14l4 ~/RCAgent-RL (main)# uv run python ./training_agl/rcagent/rcagent.py --mode runner
2025-11-02 16:01:23,359[utu.eval.data.data_manager] - WARNING - data_manager.py:45 - exp_id rcabench_evaluation already exists in db
2025-11-02 16:01:23,362 [INFO] (Process-4057198 agentlightning.llm_proxy) Adding a new middleware to the FastAPI app.
2025-11-02 16:01:23,365 [INFO] (Process-4057198 agentlightning.llm_proxy) Starting LLMProxy server thread...
INFO: Started server process [4057198]
INFO: Waiting for application startup.
██╗ ██╗████████╗███████╗██╗ ██╗ ███╗ ███╗
██║ ██║╚══██╔══╝██╔════╝██║ ██║ ████╗ ████║
██║ ██║ ██║ █████╗ ██║ ██║ ██╔████╔██║
██║ ██║ ██║ ██╔══╝ ██║ ██║ ██║╚██╔╝██║
███████╗██║ ██║ ███████╗███████╗███████╗██║ ╚═╝ ██║
╚══════╝╚═╝ ╚═╝ ╚══════╝╚══════╝╚══════╝╚═╝ ╚═╝
#------------------------------------------------------------#
# #
# 'This feature doesn't meet my needs because...' #
# https://github.com/BerriAI/litellm/issues/new #
# #
#------------------------------------------------------------#
Thank you for using LiteLLM! - Krrish & Ishaan
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM: Proxy initialized with Config, Set models:
Qwen/Qwen3-1.7B
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
2025-11-02 16:01:23,375 [INFO] (Process-4057198 agentlightning.tracer.otel) [Worker 0] Setting up OpenTelemetry tracer...
Overriding of current TracerProvider is not allowed
2025-11-02 16:01:23,529 [WARNING] (Process-4057198 agentlightning.llm_proxy) Loop and lock are not owned by the current process. Clearing them.
INFO: 10.0.6.208:59116 - "POST /chat/completions HTTP/1.1" 200 OK
2025-11-02 16:01:28,320 [WARNING] (Process-4057198 agentlightning.llm_proxy) No headers found in 1 subtree spans. Cannot log to store.
2025-11-02 16:01:28,321 [INFO] (Process-4057198 __main__) No spans collected during rollout ro-e2292df7a4c9
2025-11-02 16:01:28,321 [INFO] (Process-4057198 agentlightning.runner.agent) [Worker 0 | Rollout ro-e2292df7a4c9] Completed in 4.94s. Collected 0 span(s). Final reward: None
2025-11-02 16:01:28,321 [INFO] (Process-4057198 agentlightning.tracer.otel) [Worker 0] Tearing down OpenTelemetry tracer...
2025-11-02 16:01:28,321 [INFO] (Process-4057198 __main__) Rollouts captured: [Rollout(rollout_id='ro-e2292df7a4c9', input={'exp_id': 'rcabench_evaluation', 'dataset': 'RCABench', 'dataset_index': 1, 'source': 'RCABench', 'level': 0, 'raw_question': 'You are a professional microservices system operations expert and fault diagnosis analyst.\n\n**Task Description**:\nBased on the fault description and related data provided below, conduct a comprehensive root cause analysis. You must strictly follow the structured analysis steps to scientifically determine the root cause.\n\n**Core Requirement: Comparative Analysis**\nAll your analysis and conclusions **MUST** be built upon a **comparison** between **normal period data** (normal_*.parquet) and **abnormal period data** (abnormal_*.parquet).\n\n**Fault Problem**: Current user is experiencing the SLO violation.\nPlease provide a detailed Root Cause Analysis report based on the following incident data and logs.\nThe report should include:\n1. A summary of the incident.\n2. The identified root cause(s).\n3. Any contributing factors.\n4. Recommended actions to prevent future occurrences.\n\n\n**Available Data**:\nYou can access the directory: `/root/RCAgent-RL/training_agl/rcagent/data/data42`. All observation data is stored here.\n* `conclusion.parquet`: Contains high-level anomaly data (e.g., duration, success rate), **pre-compared** between abnormal and normal states.\n* `abnormal_traces.parquet` / `normal_traces.parquet`: Contain detailed Span information from abnormal and normal periods. **This is your primary basis for comparative analysis and conclusions.**\n* `abnormal_logs.parquet` / `normal_logs.parquet`: Contain log messages from abnormal and normal periods. **This is your other primary basis for comparative analysis and conclusions.**\n\n---\n\n**Required Analysis Steps (Show Your Work):**\n\nPlease use the following headings and structure to organize your **entire** analysis process.\n\n### 1. 
Problem Definition (Define)\n* **Fault Phenomenon Summary**: Briefly summarize the fault symptom described in the problem.\n* **Role of `conclusion.parquet` (Symptom Data)**: This file provides the **high-level symptom** or **alert** (e.g., "Span X is slow" or "Service Y success rate dropped"). It tells you **"what"** is wrong, but **CANNOT** tell you **"why"**. **You must not draw any root cause conclusions from this file alone.**\n* **Initial Symptom Observation (based on conclusion.parquet)**: Based on the **alert data** in `conclusion.parquet`, which services or operations are **flagged as symptomatic** (e.g., high latency, low success rate)?\n\n### 2. Hypothesis Formulation (Structure)\nBased on the **symptoms** observed in Step 1, list your primary diagnostic hypotheses. These hypotheses must be testable using the **`traces` and `logs` data**.\n* **Hypothesis 1**: (e.g., "The high latency in \'payment-service\' is a symptom, caused by a slow response from its downstream dependency, \'auth-service\'.")\n* **Hypothesis 2**: (e.g., "The \'database-proxy\' service is generating \'connection error\' logs, not present in the normal period, which is causing upstream failures.")\n* **Hypothesis 3**: ...\n\n### 3. Evidence Analysis & Hypothesis Validation (Analyze)\nUsing the `abnormal_*` and `normal_*` files, systematically validate each hypothesis through **comparative analysis**. 
**This is the core of your analysis.**\n\n* **Service Dependency Analysis (Based on Traces)**:\n * **Data Used**: Analyze `abnormal_traces.parquet` (using `span_id`, `parent_span_id`, `service_name`) to map the upstream/downstream call chain for the **symptomatic Spans** from Step 1.\n * **Findings**: (e.g., "The \'checkout-span\' (frontend) calls \'payment-service\', which in turn calls \'auth-service\' and \'database-proxy\'.")\n * **Conclusion**: (e.g., "The problem in \'payment-service\' could be influenced by \'auth-service\' or \'database-proxy\'.")\n\n* **Validating Hypothesis 1: (e.g., \'payment-service\' is impacted by \'auth-service\')**\n * **Data Used**: **Compare** `abnormal_traces.parquet` and `normal_traces.parquet`. **Filter** for Spans where `service_name == \'payment-service\'` and **inspect** their child Spans where `service_name == \'auth-service\'`.\n * **Findings**: (e.g., "In `normal_traces.parquet`, the \'auth-service\' Span takes 30ms. However, in `abnormal_traces.parquet`, the \'auth-service\' Span takes 2900ms, while the \'payment-service\' self-time (parent duration - child span duration) remains stable at 50ms.")\n * **Conclusion**: (e.g., "Hypothesis 1 is **Confirmed**. The high latency of \'payment-service\' is **not** its own fault but is **propagated** from its slow dependency, \'auth-service\'.")\n\n* **Validating Hypothesis 2: (e.g., \'database-proxy\' emitting new errors)**\n * **Data Used**: **Compare** logs where `service_name == \'database-proxy\'` in `abnormal_logs.parquet` and `normal_logs.parquet`. **Correlate** these logs with the timestamps of slow \'database-proxy\' Spans in `abnormal_traces.parquet`.\n * **Findings**: (e.g., "No \'Connection timed out\' errors are found in `normal_logs.parquet`. In `abnormal_logs.parquet`, these errors appear frequently, and **their timestamps perfectly match the timestamps of the latency spikes for \'database-proxy\' Spans**.")\n * **Conclusion**: (e.g., "Hypothesis 2 is **Confirmed**. 
These new log errors are strongly correlated with the latency phenomenon seen in the trace data.")\n\n* *(Continue for all other hypotheses...)*\n\n### 4. Synthesis & Root Cause Identification (Synthesize)\nBased on the hypotheses **confirmed through detailed trace and log comparison** in Step 3, synthesize your findings to form a final conclusion.\n\n* **Fault Propagation Path**: (Based on Step 3 dependency analysis and validation) Describe the complete chain of events. (e.g., "1. The \'database-proxy\' began emitting \'Connection timed out\' errors **(log comparison evidence)** not seen in the normal period. 2. This caused its caller, \'auth-service\', to slow down, with latency jumping **(trace comparison evidence)** from 30ms to 2900ms. 3. The latency from \'auth-service\' **propagated** to its caller, \'payment-service\'. 4. This finally resulted in the high latency alert for the \'checkout\' operation reported in `conclusion.parquet`.")\n* **Root Cause**: State the single, underlying root cause. (e.g., "The root cause was the exhaustion of the \'database-proxy\' connection pool **(evidenced by the comparative log errors)**, which caused \'auth-service\' to block **(evidenced by the comparative high latency of its child span)**, and ultimately manifested as a cascading failure in \'payment-service\'.")\n* **Core Evidence Chain (Based on Comparison)**: Briefly link the key pieces of evidence. (e.g., `database-proxy` log **comparison** (normal: 0 errors vs. abnormal: N errors) -> **correlates with** `auth-service` trace latency **comparison** (normal: 30ms vs. 
abnormal: 2900ms) -> **propagates to** `payment-service` trace -> **triggers** `conclusion.parquet` alert)\n\n---\n\n**Final Conclusion (JSON):**\nAfter you have completed the detailed analysis in all the steps above, provide the final JSON conclusion as required by your system instructions.\n', 'correct_answer': 'ts-inside-payment-service', 'file_name': '', 'stage': 'init'}, start_time=1762070483.3769877, end_time=1762070488.321383, mode=None, resources_id='rs-d3d5da38eb1f', status='succeeded', config=RolloutConfig(timeout_seconds=None, unresponsive_seconds=None, max_attempts=1, retry_condition=[]), metadata={})]
2025-11-02 16:01:28,321 [INFO] (Process-4057198 __main__) Recorded 7 spans for rollout ro-e2292df7a4c9
2025-11-02 16:01:28,321 [INFO] (Process-4057198 __main__) Final response: rollout_id='ro-e2292df7a4c9' attempt_id='at-b66f7abe' sequence_id=1 trace_id='3f76b67dc0da06210ff6342c27683e8e' span_id='e23f898afefc8d66' parent_id=None name='Received Proxy Server Request' status=TraceStatus(status_code='UNSET', description=None) attributes={} events=[] links=[] start_time=1762070483.5280588 end_time=1762070488.28927 context=SpanContext(trace_id='3f76b67dc0da06210ff6342c27683e8e', span_id='e23f898afefc8d66', is_remote=False, trace_state={}) parent=None resource=OtelResource(attributes={'telemetry.sdk.language': 'python', 'telemetry.sdk.name': 'opentelemetry', 'telemetry.sdk.version': '1.38.0', 'service.name': 'litellm', 'deployment.environment': 'production', 'model_id': 'litellm'}, schema_url='')
2025-11-02 16:01:28,321 [INFO] (Process-4057198 __main__) Extracted reward: None
2025-11-02 16:01:28,321 [INFO] (Process-4057198 __main__) No triplets extracted from spans for rollout ro-e2292df7a4c9
2025-11-02 16:01:28,321 [INFO] (Process-4057198 agentlightning.llm_proxy) Stopping LLMProxy server thread...
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [4057198]
2025-11-02 16:01:28,522 [INFO] (Process-4057198 agentlightning.llm_proxy) LLMProxy server thread stopped.

This is the AgentOps Tracer:
I think the problem is with the LLMProxy. LLMProxy registers a global tracer provider that will interfere with other tracer providers registered in the runner. You must run the LLMProxy in another process if you want to run them together. See #255, #206, and the comments in … We might want to make …
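To illustrate the process-isolation idea (the names and structure below are a hypothetical sketch, not Agent Lightning's API): a "global provider" set inside a forked child process does not clobber the parent's global, which is why moving the proxy into its own process avoids the tracer-provider conflict:

```python
# Sketch: global state set in a child process stays in that process.
# "GLOBAL_PROVIDER" stands in for the global OpenTelemetry tracer
# provider; the proxy/runner split is illustrative only.
import multiprocessing as mp

GLOBAL_PROVIDER = "unset"

def run_proxy(q) -> None:
    # In the child, the proxy registers its own global provider.
    global GLOBAL_PROVIDER
    GLOBAL_PROVIDER = "proxy-tracer-provider"
    q.put(GLOBAL_PROVIDER)

# Use an explicit fork context so this sketch stays simple (Unix only;
# spawn-based platforms would need the usual __main__ guard).
ctx = mp.get_context("fork")
q = ctx.Queue()
p = ctx.Process(target=run_proxy, args=(q,))
p.start()
child_provider = q.get()
p.join()

# The parent's global is untouched, so the runner could register its
# own tracer provider here without any conflict.
print(child_provider, GLOBAL_PROVIDER)  # proxy-tracer-provider unset
```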

Add support for Phoenix tracer