Commit f8d9391

add integration and e2e test suites
1 parent 37c6374 commit f8d9391

12 files changed

Lines changed: 1367 additions & 31 deletions

DEVELOPMENT.md

Lines changed: 49 additions & 2 deletions
@@ -37,13 +37,60 @@ make build-ui # build React app only → ui/dist/

 Both `build` and `build-bundle` produce `dist/agentevals-*.whl` with the same package name and version. The difference is that `build-bundle` embeds `ui/dist/` as `agentevals/_static/` inside the wheel. The hatchling `artifacts` config ensures the gitignored `_static/` directory is included.

-### Testing and cleanup
+### Testing
+
+```bash
+make test # run all tests (unit + integration, excludes e2e)
+make test-unit # unit tests only (fast, no server startup)
+make test-integration # integration tests — OTLP pipeline, session grouping, timing (no API keys)
+make test-e2e # E2E tests — real agents as subprocesses (requires OPENAI_API_KEY)
+```
+
+### Cleanup

 ```bash
-make test # run pytest
 make clean # remove dist/, build/, ui/dist/, src/agentevals/_static/
 ```

+## Testing
+
+### Test tiers
+
+Tests are organized into three tiers with different trade-offs:
+
+| Tier | Location | Transport | API keys | What it verifies |
+|------|----------|-----------|----------|------------------|
+| **Unit** | `tests/` (excl. integration) | `TestClient` / mocks | None | Business logic, route handlers, converters |
+| **Integration** | `tests/integration/` | ASGI in-process | None | OTLP session grouping, timing, concurrent batches, eval pipeline |
+| **E2E** | `tests/integration/test_live_agents.py` | Real uvicorn servers | `OPENAI_API_KEY` | Full pipeline — real agent → OTLP export → session creation → invocation extraction → API visibility |
+
+Integration tests use `httpx.ASGITransport` to hit the OTLP and streaming API routes in-process (no ports, no real HTTP). Timers are configured fast (0.1s grace, 0.5s idle) for quick deterministic tests.
+
+E2E tests start real uvicorn servers on ephemeral ports in a background thread, then run example agent scripts as subprocesses that emit real OTLP traces with `BatchSpanProcessor`/`BatchLogRecordProcessor` flush timing.
+
+### Running E2E tests
+
+E2E tests require `OPENAI_API_KEY` (used by LangChain and Strands agents). They are skipped automatically when the key is not set.
+
+```bash
+# Source your .env and run
+set -a && source .env && set +a && make test-e2e
+```
+
+### Adding tests for new examples
+
+When adding a new example agent to `examples/`, add corresponding E2E tests to ensure the full OTLP pipeline works:
+
+1. Add a test class in `tests/integration/test_live_agents.py` following the existing pattern (`TestLangchainZeroCode`, `TestStrandsZeroCode`)
+2. Each agent should have at minimum three tests:
+   - **Session creation** — agent runs successfully, session is created with spans (and logs if applicable)
+   - **Invocation extraction** — invocations are extracted with user/agent content
+   - **API visibility** — session appears in `GET /api/streaming/sessions`
+3. Use `_run_agent()` to run the example as a subprocess with the test OTLP endpoint
+4. Use `wait_for_session_complete_sync()` to poll until the session finalizes
+5. Mark the test class with the appropriate skip condition (e.g., `_skip_no_openai`)
+6. Use unique `session_name` values per test to avoid collisions within the session-scoped server fixture
+
 ## Runtime behavior

 The serve command auto-detects the active mode:
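
The E2E approach described in the added DEVELOPMENT.md text (a real server on an ephemeral port in a background thread, then polling until the session finalizes) can be sketched with the standard library alone. This is not the repository's harness: `start_background_server`, `wait_until`, `find_complete_session`, the `/sessions` route, and the response shape are hypothetical stand-ins for the uvicorn fixture and `wait_for_session_complete_sync()`, whose real signatures are not shown in this diff.

```python
import http.client
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class _Handler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the real API: reports one finished session."""

    def do_GET(self):
        body = json.dumps({"sessions": [{"name": "demo", "is_complete": True}]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging so test output stays readable.
        pass


def start_background_server():
    # Port 0 asks the OS for a free ephemeral port, so parallel runs do not collide.
    server = ThreadingHTTPServer(("127.0.0.1", 0), _Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll `predicate` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")


def find_complete_session(port, name):
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
    try:
        conn.request("GET", "/sessions")
        sessions = json.loads(conn.getresponse().read())["sessions"]
    finally:
        conn.close()
    return next((s for s in sessions if s["name"] == name and s["is_complete"]), None)


server, port = start_background_server()
try:
    session = wait_until(lambda: find_complete_session(port, "demo"))
finally:
    server.shutdown()
    server.server_close()
```

Polling against a deadline, rather than sleeping for a fixed time, keeps such tests fast when the session completes early and still bounded when it never does.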

Makefile

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
VERSION := $(shell grep '^version' pyproject.toml | cut -d'"' -f2)
22
WHEEL := dist/agentevals-$(VERSION)-py3-none-any.whl
33

4-
.PHONY: build build-bundle build-ui release clean dev-backend dev-frontend dev-bundle test
4+
.PHONY: build build-bundle build-ui release clean dev-backend dev-frontend dev-bundle test test-unit test-integration test-e2e
55

66
build:
77
uv build
@@ -47,6 +47,15 @@ dev-bundle: build-ui
4747
test:
4848
uv run pytest
4949

50+
test-unit:
51+
uv run pytest tests/ --ignore=tests/integration
52+
53+
test-integration:
54+
uv run pytest tests/integration/ -m "integration and not e2e" -v
55+
56+
test-e2e:
57+
uv run pytest tests/integration/ -m "e2e" -v
58+
5059
clean:
5160
rm -rf dist/ build/ src/agentevals/_static/ ui/dist/
5261
find . -name '*.egg-info' -type d -exec rm -rf {} + 2>/dev/null || true

pyproject.toml

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -47,8 +47,15 @@ members = []
4747
[tool.pytest.ini_options]
4848
testpaths = ["tests"]
4949
pythonpath = ["src"]
50+
markers = [
51+
"integration: OTLP pipeline tests with ASGI apps (no API keys)",
52+
"e2e: End-to-end tests requiring API keys and real agents",
53+
]
54+
asyncio_mode = "auto"
5055

5156
[dependency-groups]
5257
dev = [
5358
"pytest>=9.0.2",
59+
"pytest-asyncio>=0.24.0",
60+
"httpx>=0.27.0",
5461
]

src/agentevals/api/app.py

Lines changed: 22 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@
44
import json
55
import logging
66
import os
7+
from contextlib import asynccontextmanager
78
from pathlib import Path
89

910
from fastapi import FastAPI
@@ -23,10 +24,31 @@
2324
except ImportError:
2425
pass
2526

27+
@asynccontextmanager
28+
async def lifespan(app: FastAPI):
29+
log_level_str = os.getenv("AGENTEVALS_LOG_LEVEL", "INFO").upper()
30+
log_level = getattr(logging, log_level_str, logging.INFO)
31+
logging.basicConfig(
32+
level=log_level,
33+
format="%(levelname)s:%(name)s:%(message)s",
34+
force=True,
35+
)
36+
ae_logger = logging.getLogger("agentevals")
37+
ae_logger.setLevel(log_level)
38+
log_buffer.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
39+
ae_logger.addHandler(log_buffer)
40+
if _trace_manager:
41+
_trace_manager.start_cleanup_task()
42+
yield
43+
if _trace_manager:
44+
await _trace_manager.shutdown()
45+
46+
2647
app = FastAPI(
2748
title="agentevals API",
2849
version=__version__,
2950
description="REST API for evaluating agent traces using ADK's scoring framework",
51+
lifespan=lifespan,
3052
)
3153

3254
app.add_middleware(
@@ -105,26 +127,3 @@ async def spa_fallback(path: str):
105127
if file_path.is_file():
106128
return FileResponse(file_path)
107129
return FileResponse(_static_dir / "index.html")
108-
109-
110-
@app.on_event("startup")
111-
async def on_startup():
112-
log_level_str = os.getenv("AGENTEVALS_LOG_LEVEL", "INFO").upper()
113-
log_level = getattr(logging, log_level_str, logging.INFO)
114-
logging.basicConfig(
115-
level=log_level,
116-
format="%(levelname)s:%(name)s:%(message)s",
117-
force=True,
118-
)
119-
ae_logger = logging.getLogger("agentevals")
120-
ae_logger.setLevel(log_level)
121-
log_buffer.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
122-
ae_logger.addHandler(log_buffer)
123-
if _trace_manager:
124-
_trace_manager.start_cleanup_task()
125-
126-
127-
@app.on_event("shutdown")
128-
async def on_shutdown():
129-
if _trace_manager:
130-
await _trace_manager.shutdown()
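
The change above folds the two `on_event` handlers into one async context manager: everything before `yield` runs at startup, everything after it runs at shutdown. A minimal sketch of that ordering, using only the standard library, might look like the following. `FakeTraceManager` and the `events` list are hypothetical, and this `lifespan` takes the manager directly rather than a `FastAPI` app, purely to keep the example self-contained.

```python
import asyncio
from contextlib import asynccontextmanager

events = []


class FakeTraceManager:
    """Hypothetical stand-in for the real trace manager."""

    def start_cleanup_task(self):
        events.append("cleanup task started")

    async def shutdown(self):
        events.append("manager shut down")


@asynccontextmanager
async def lifespan(manager):
    # Code before `yield` plays the role of the old startup handler.
    if manager:
        manager.start_cleanup_task()
    yield
    # Code after `yield` plays the role of the old shutdown handler.
    if manager:
        await manager.shutdown()


async def main():
    async with lifespan(FakeTraceManager()):
        events.append("serving requests")


asyncio.run(main())
```

Keeping startup and shutdown in one function makes it harder for the two halves to drift apart, since both share the same local scope.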

src/agentevals/streaming/ws_server.py

Lines changed: 23 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -36,14 +36,27 @@ class StreamingTraceManager:
3636
Args:
3737
session_ttl_hours: How long to keep completed sessions in memory (default: 2 hours)
3838
max_sessions: Maximum number of sessions to keep (default: 100)
39+
completion_grace_seconds: Delay after root span before completing session (default: 3.0)
40+
idle_timeout_seconds: Complete session after this many seconds of inactivity (default: 30.0)
41+
reextraction_delay_seconds: Debounce delay for late-log re-extraction (default: 2.0)
3942
"""
4043

41-
def __init__(self, session_ttl_hours: int = 2, max_sessions: int = 100):
44+
def __init__(
45+
self,
46+
session_ttl_hours: int = 2,
47+
max_sessions: int = 100,
48+
completion_grace_seconds: float = 3.0,
49+
idle_timeout_seconds: float = 30.0,
50+
reextraction_delay_seconds: float = 2.0,
51+
):
4252
self.sessions: dict[str, TraceSession] = {}
4353
self.incremental_extractors: dict[str, IncrementalInvocationExtractor] = {}
4454
self.sse_queues: list[asyncio.Queue] = []
4555
self.session_ttl = timedelta(hours=session_ttl_hours)
4656
self.max_sessions = max_sessions
57+
self.completion_grace_seconds = completion_grace_seconds
58+
self.idle_timeout_seconds = idle_timeout_seconds
59+
self.reextraction_delay_seconds = reextraction_delay_seconds
4760
self._cleanup_task: asyncio.Task | None = None
4861
self._completion_timers: dict[str, asyncio.Task] = {}
4962
self._idle_timers: dict[str, asyncio.Task] = {}
@@ -72,6 +85,12 @@ async def shutdown(self) -> None:
7285
"""Gracefully shut down: close SSE clients and cancel background tasks."""
7386
for queue in self.sse_queues:
7487
queue.put_nowait(None)
88+
for task in self._completion_timers.values():
89+
task.cancel()
90+
self._completion_timers.clear()
91+
for task in self._idle_timers.values():
92+
task.cancel()
93+
self._idle_timers.clear()
7594
if self._cleanup_task:
7695
self._cleanup_task.cancel()
7796
try:
@@ -274,7 +293,7 @@ def schedule_session_completion(self, session_id: str) -> None:
274293
self._completion_timers[session_id].cancel()
275294

276295
self._completion_timers[session_id] = asyncio.create_task(
277-
self._delayed_complete(session_id, 3.0)
296+
self._delayed_complete(session_id, self.completion_grace_seconds)
278297
)
279298

280299
def reset_idle_timer(self, session_id: str) -> None:
@@ -289,7 +308,7 @@ def reset_idle_timer(self, session_id: str) -> None:
289308
self._idle_timers[session_id].cancel()
290309

291310
self._idle_timers[session_id] = asyncio.create_task(
292-
self._delayed_complete(session_id, 30.0)
311+
self._delayed_complete(session_id, self.idle_timeout_seconds)
293312
)
294313

295314
def schedule_log_reextraction(self, session_id: str) -> None:
@@ -304,7 +323,7 @@ def schedule_log_reextraction(self, session_id: str) -> None:
304323
self._completion_timers[key].cancel()
305324

306325
self._completion_timers[key] = asyncio.create_task(
307-
self._delayed_reextract(session_id, 2.0)
326+
self._delayed_reextract(session_id, self.reextraction_delay_seconds)
308327
)
309328

310329
async def _delayed_complete(self, session_id: str, delay: float) -> None:
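
Making the delays constructor parameters is what lets tests run the same cancel-and-reschedule logic with very short timers. The reduced sketch below illustrates the pattern under stated assumptions: `TimerManager`, its `completed` list, and the synchronous `shutdown()` are hypothetical simplifications, not the `StreamingTraceManager` from this commit, and the body of `_delayed_complete` is assumed since the diff shows only its signature.

```python
import asyncio


class TimerManager:
    """Hypothetical, reduced version of the timer handling shown in the diff."""

    def __init__(self, completion_grace_seconds: float = 3.0):
        self.completion_grace_seconds = completion_grace_seconds
        self.completed: list[str] = []
        self._completion_timers: dict[str, asyncio.Task] = {}

    def schedule_session_completion(self, session_id: str) -> None:
        # Rescheduling cancels the pending timer, so only the last call fires.
        pending = self._completion_timers.get(session_id)
        if pending:
            pending.cancel()
        self._completion_timers[session_id] = asyncio.create_task(
            self._delayed_complete(session_id, self.completion_grace_seconds)
        )

    async def _delayed_complete(self, session_id: str, delay: float) -> None:
        # Assumed behavior: wait out the grace period, then mark the session done.
        await asyncio.sleep(delay)
        self.completed.append(session_id)

    def shutdown(self) -> None:
        # Cancel every outstanding timer so none fires after shutdown.
        for task in self._completion_timers.values():
            task.cancel()
        self._completion_timers.clear()


async def demo() -> list[str]:
    manager = TimerManager(completion_grace_seconds=0.05)
    manager.schedule_session_completion("a")
    manager.schedule_session_completion("a")  # replaces the first timer
    await asyncio.sleep(0.2)                  # the remaining timer for "a" fires once
    manager.schedule_session_completion("b")
    manager.shutdown()                        # cancels "b" before it can fire
    await asyncio.sleep(0.1)
    return manager.completed


completed = asyncio.run(demo())
```

With a 0.05 s grace period the whole scenario finishes in well under a second, which is the same reason the integration tests configure fast timers.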

tests/integration/__init__.py

Whitespace-only changes.
