docs: fold AI evals and prompt management into AI observability sidebar (#17507)
## Changes
The **AI evals** (`/docs/ai-evals`) and **prompt management** (`/docs/prompt-management`) overview pages were orphans: inside the AI Observability sidebar they only appeared as `↗` links that threw you into a separate sidebar context. This PR folds them into a single, coherent AI Observability sidebar and tidies up the section groupings.
- Fold **Evaluations** and **Prompt management** in as proper nested groups under **Concepts** (URLs unchanged — no file moves, so no redirects needed for the sub-pages).
- Move **Calculating LLM costs** from Concepts → **Guides**.
- Move **Trace reviews** into the Evaluations group (its own content describes it as the AI Evals manual-review workflow).
- Move **Link session replay**, **Link error tracking**, and **Collect user feedback** out of the "PostHog AI" section (they aren't PostHog AI features) into **Guides**.
- Remove the two `↗` orphan links and the now-redundant standalone Evaluations / Prompt Management products from the docs menu.
- Repoint the overview pages' QuickLinks to the consolidated location.
- Fix a pre-existing broken redirect destination (`/docs/llm-analytics/trace-reviews` now points at the real page).
**Why:** Charles flagged that the two pages felt like orphans and that the sidebar grouping was unintuitive (evals/prompt management filed under Guides, costs filed under Concepts). This makes the AI docs navigation make sense as one product.
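The link updates in the diff below follow one pattern: links to the old sub-page paths are repointed at the overview paths, keeping any `#anchor`. As a rough illustration only (this is not how the PR was produced, and `LINK_REWRITES` and `repoint_links` are hypothetical names), the rewrite could be sketched in Python as:

```python
import re

# Old link targets and the overview paths they are repointed to,
# as seen in the link changes in this PR's diff.
LINK_REWRITES = {
    "/docs/ai-evals/evaluations": "/docs/ai-evals",
    "/docs/prompt-management/prompts": "/docs/prompt-management",
}

# Match a markdown link target "](<old path>" only when the path is
# followed by "#" (an anchor) or ")" (end of link), so longer paths such
# as /docs/prompt-management/prompt-experiments are left untouched.
_PATTERN = re.compile(
    r"\]\((" + "|".join(re.escape(p) for p in LINK_REWRITES) + r")(?=[#)])"
)


def repoint_links(markdown: str) -> str:
    """Return markdown with old link targets replaced by the new ones."""
    return _PATTERN.sub(lambda m: "](" + LINK_REWRITES[m.group(1)], markdown)
```

The lookahead is the main design choice here: a plain string replace on `/docs/prompt-management/prompts` would leave that path alone only by luck, whereas requiring `#` or `)` after the match makes the boundary explicit.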
## Checklist
- [x] I've read the [docs](https://posthog.com/handbook/docs-and-wizard/docs-style-guide) and/or [content](https://posthog.com/handbook/content/posthog-style-guide) style guides.
- [x] Words are spelled using American English
- [x] Use relative URLs for internal links
- [x] I've checked the pages added or changed in the Vercel preview build
- [x] If I moved a page, I added a redirect in `vercel.json`
---
*Created with [PostHog Code](https://posthog.com/code?ref=pr) from a [Slack thread](https://posthog.slack.com/archives/C087XQ7K9K7/p1781192656765859)*
contents/blog/ai-observability-for-mvps.mdx (5 additions & 5 deletions)
@@ -120,7 +120,7 @@ You don't need 15 evaluation criteria across five different dimensions of qualit
 - You don't know what "good" looks like for your product yet, so any rubric you write could be wrong
 - You can do better than evals at this scale by *reading your outputs*. Spend 30 minutes a week looking at random traces. You'll catch more than any rubric will.

-[Add automated evals](/docs/ai-evals/evaluations) once you have enough generations that you can't read them all (usually a few hundred per day).
+[Add automated evals](/docs/ai-evals) once you have enough generations that you can't read them all (usually a few hundred per day).

 When you're ready, our [beginner's guide to testing AI agents](/blog/testing-ai-agents) walks through what a minimal eval suite actually looks like – a small dataset from real user queries and recent bugs, one or two cheap code-based evaluators, one LLM-as-a-judge for a subjective criterion, and a regular trace review ritual.
@@ -138,7 +138,7 @@ Until you have meaningful traffic: just ship the better-feeling prompt and move
 If only one person on the team writes prompts, and you ship prompt changes through your normal code deploys, you don't need a separate prompt management system yet. A `prompts.py` file in your repo is fine.

-[Prompt management](/docs/prompt-management/prompts) becomes valuable once (a) multiple people are editing prompts, (b) you want non-engineers to iterate on prompts without a deploy, or (c) you want to version and roll back prompts independently of code. None of those usually apply on day one.
+[Prompt management](/docs/prompt-management) becomes valuable once (a) multiple people are editing prompts, (b) you want non-engineers to iterate on prompts without a deploy, or (c) you want to version and roll back prompts independently of code. None of those usually apply on day one.

 ### Custom-built observability infra
@@ -180,7 +180,7 @@ As your product matures, the bar for observability rises. Here's the rough order
 **Connect LLM data to product analytics.** "Users who hit our AI feature retain at 2x" is the kind of insight that justifies the whole investment. To do this, your LLM observability needs to share a user model with your [product analytics](/product-analytics) – which is one reason [bundled tools](/blog/best-analytics-stack-vibe-coded-apps) tend to win at this stage.

-**Consider prompt management (if it fits).** Once more than one person edits prompts, or you want non-engineers to iterate without a deploy, [prompt management](/docs/prompt-management/prompts) starts earning its keep. Many teams never need this – a `prompts.py` file in your repo is fine if you're shipping prompt changes through code anyway.
+**Consider prompt management (if it fits).** Once more than one person edits prompts, or you want non-engineers to iterate without a deploy, [prompt management](/docs/prompt-management) starts earning its keep. Many teams never need this – a `prompts.py` file in your repo is fine if you're shipping prompt changes through code anyway.

 </details>
@@ -209,9 +209,9 @@ A quick pitch, since this is our blog after all:
 - **[Tracing](/docs/ai-observability/traces)** with a small SDK wrapper around OpenAI, Anthropic, and other providers
 - **Cost, latency, and token tracking** by model, user, and feature
-- **[Evaluations](/docs/ai-evals/evaluations)** with LLM-as-a-judge and human review
+- **[Evaluations](/docs/ai-evals)** with LLM-as-a-judge and human review
 - **[Error tracking](/error-tracking)** alongside traces, so AI-specific failures (tool calls, parsing errors, loop termination) show up next to the calls that caused them
-- **[Prompt management](/docs/prompt-management/prompts)** with versioning, fetch-at-runtime, and rollback
+- **[Prompt management](/docs/prompt-management)** with versioning, fetch-at-runtime, and rollback
 - **[Connected product analytics](/product-analytics)** – the same user IDs flow through, so you can correlate AI usage with retention, conversion, and churn
 - **[Session replay](/docs/ai-observability/link-session-replay)** alongside traces, so you can see exactly what the user was doing when the model misbehaved
 - **An [MCP server](/docs/model-context-protocol)** that lets you query traces, costs, and evals directly from Claude Code or Cursor while you're iterating
contents/blog/best-statsig-alternatives.mdx (1 addition & 1 deletion)
@@ -59,7 +59,7 @@ Eligible early-stage companies can apply to [PostHog for Startups](/startups) fo
 **PostHog**: Built for engineering teams shipping AI products. [AI observability](/ai-observability) captures traces, spans, prompts, token costs, and latency across providers (Anthropic, OpenAI, Google, LangChain, and more).

-Run [evals with LLM-as-a-Judge](/blog/stop-ai-slop) to catch quality regressions, [manage prompts directly in PostHog](/docs/prompt-management/prompts) with versioning and fetch them at runtime (no redeploys), and use the prompt playground to compare models side-by-side.
+Run [evals with LLM-as-a-Judge](/blog/stop-ai-slop) to catch quality regressions, [manage prompts directly in PostHog](/docs/prompt-management) with versioning and fetch them at runtime (no redeploys), and use the prompt playground to compare models side-by-side.

 Pair it with feature flags to roll out new models gradually, experiments to A/B test prompt variants, and session replay to see exactly what users saw when the model misbehaved.
contents/blog/what-is-ai-observability.mdx (4 additions & 4 deletions)
@@ -108,7 +108,7 @@ The combination of LLM errors + general application errors is how you debug agen
 #### 4. Evaluations (evals)

-[LLM evals](/docs/ai-evals/evaluations) score the quality of outputs, not just whether they succeeded. They come in two flavors:
+[LLM evals](/docs/ai-evals) score the quality of outputs, not just whether they succeeded. They come in two flavors:

 - **Deterministic evals** (code-based): cheap, fast, and reliable. Things like "did the agent call the right tool," "did the output contain a forbidden keyword," "was the response under N tokens," or Levenshtein distance against an expected output.
 - **LLM-as-a-judge evals**: a separate model scores the output against a rubric. Useful for subjective criteria (tone, helpfulness, hallucination) that code can't capture. More expensive, less reliable, sensitive to changes in the judge model.
-Versioning, A/B testing, and runtime control of your prompts without re-deploying code. [Prompt management](/docs/prompt-management/prompts) is useful once multiple people are editing prompts or you want non-engineers to iterate. Often overrated at the MVP stage – a `prompts.py` file in your repo is fine if one person owns prompts and you ship changes through normal deploys.
+Versioning, A/B testing, and runtime control of your prompts without re-deploying code. [Prompt management](/docs/prompt-management) is useful once multiple people are editing prompts or you want non-engineers to iterate. Often overrated at the MVP stage – a `prompts.py` file in your repo is fine if one person owns prompts and you ship changes through normal deploys.

 ### How tools instrument your code
@@ -167,8 +167,8 @@ The free tier covers 100K LLM events/mo. EU hosting available. SDKs for OpenAI,
 **Key features:**

 - Tracing, cost tracking, latency, token usage, model breakdown
-- [Evals](/docs/ai-evals/evaluations) with deterministic checks and LLM-as-a-judge
-- [Prompt management](/docs/prompt-management/prompts) with versioning and fetch-at-runtime
+- [Evals](/docs/ai-evals) with deterministic checks and LLM-as-a-judge
+- [Prompt management](/docs/prompt-management) with versioning and fetch-at-runtime
 - [Prompt experiments](/docs/prompt-management/prompt-experiments) (beta) for A/B testing prompts with built-in cost, latency, and eval pass rate metrics
 - [MCP server](/docs/model-context-protocol) for querying observability data from Claude Code or Cursor
 - Connected to [product analytics](/docs/product-analytics), [session replay](/docs/session-replay), and [error tracking](/docs/error-tracking) so you can debug AI failures with full user context
contents/docs/ai-evals/datasets.mdx (3 additions & 1 deletion)
@@ -4,4 +4,6 @@ title: Datasets
 Datasets let you curate sets of input/output pairs you can replay against prompt or model changes to catch regressions before they reach production.

-> Coming soon.
+## Documentation coming soon
+
+We're still writing the full guide for datasets. In the meantime, see the [AI Evals overview](/docs/ai-evals) to learn how datasets fit alongside evaluations and trace reviews.
contents/docs/ai-evals/taggers.mdx (3 additions & 1 deletion)
@@ -4,4 +4,6 @@ title: Taggers
 Taggers automatically classify and label generations so you can filter, group, and analyze your LLM traffic by the categories that matter to your product.

-> Full documentation coming soon.
+## Documentation coming soon
+
+We're still writing the full guide for taggers. In the meantime, see the [AI Evals overview](/docs/ai-evals) to learn how taggers fit alongside evaluations and trace reviews.
contents/docs/prompt-management/_snippets/prompt-experiments-body.mdx (2 additions & 2 deletions)
@@ -15,7 +15,7 @@ Use this when you have a candidate change to a prompt (a wording tweak, a new in
 You need at least two versions of a prompt before you can create an experiment from it.

-1. In **Prompt management** → **Prompts**, open the prompt you want to test (or [create a new one](/docs/prompt-management/prompts#creating-prompts))
+1. In **Prompt management** → **Prompts**, open the prompt you want to test (or [create a new one](/docs/prompt-management#creating-prompts))
 2. If it only has one version, edit the body and save. Every save creates a new immutable version.
 3. Repeat for as many versions as you want to compare (up to 10)
@@ -159,7 +159,7 @@ Results populate within seconds of the first events landing. Each tile shows the
 - **Cost** — mean LLM cost per user (`$ai_total_cost_usd` on `$ai_generation`). Goal: decrease.
 - **Latency** — mean LLM latency per user (`$ai_latency`). Goal: decrease.
-- **Eval pass rate** — share of `$ai_evaluation` events that returned a pass, scoped to this prompt. Populates only if you have [LLM evaluations](/docs/ai-evals/evaluations) configured.
+- **Eval pass rate** — share of `$ai_evaluation` events that returned a pass, scoped to this prompt. Populates only if you have [LLM evaluations](/docs/ai-evals) configured.