Commit 83c7455

docs: fold AI evals and prompt management into AI observability sidebar (#17507)
## Changes

The **AI evals** (`/docs/ai-evals`) and **prompt management** (`/docs/prompt-management`) overview pages were weird orphans — inside the AI Observability sidebar they only appeared as `↗` links that threw you into a separate sidebar context. This folds them into a single, coherent AI Observability sidebar and tidies up the section groupings.

- Fold **Evaluations** and **Prompt management** in as proper nested groups under **Concepts** (URLs unchanged — no file moves, so no redirects needed for the sub-pages).
- Move **Calculating LLM costs** from Concepts → **Guides**.
- Move **Trace reviews** into the Evaluations group (its own content describes it as the AI Evals manual-review workflow).
- Move **Link session replay**, **Link error tracking**, and **Collect user feedback** out of the "PostHog AI" section (they aren't PostHog AI features) into **Guides**.
- Remove the two `↗` orphan links and the now-redundant standalone Evaluations / Prompt Management products from the docs menu.
- Repoint the overview pages' QuickLinks to the consolidated location.
- Fix a pre-existing broken redirect destination (`/docs/llm-analytics/trace-reviews` now points at the real page).

**Why:** Charles flagged that the two pages felt like orphans and that the sidebar grouping was unintuitive (evals/prompt management filed under Guides, costs filed under Concepts). This makes the AI docs navigation make sense as one product.

## Checklist

- [x] I've read the [docs](https://posthog.com/handbook/docs-and-wizard/docs-style-guide) and/or [content](https://posthog.com/handbook/content/posthog-style-guide) style guides.
- [x] Words are spelled using American English
- [x] Use relative URLs for internal links
- [x] I've checked the pages added or changed in the Vercel preview build
- [x] If I moved a page, I added a redirect in `vercel.json`

---

*Created with [PostHog Code](https://posthog.com/code?ref=pr) from a [Slack thread](https://posthog.slack.com/archives/C087XQ7K9K7/p1781192656765859)*
1 parent da5af6e commit 83c7455

14 files changed

Lines changed: 145 additions & 238 deletions

contents/blog/ai-observability-for-mvps.mdx

Lines changed: 5 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -120,7 +120,7 @@ You don't need 15 evaluation criteria across five different dimensions of qualit
120120
- You don't know what "good" looks like for your product yet, so any rubric you write could be wrong
121121
- You can do better than evals at this scale by *reading your outputs*. Spend 30 minutes a week looking at random traces. You'll catch more than any rubric will.
122122

123-
[Add automated evals](/docs/ai-evals/evaluations) once you have enough generations that you can't read them all (usually a few hundred per day).
123+
[Add automated evals](/docs/ai-evals) once you have enough generations that you can't read them all (usually a few hundred per day).
124124

125125
When you're ready, our [beginner's guide to testing AI agents](/blog/testing-ai-agents) walks through what a minimal eval suite actually looks like – a small dataset from real user queries and recent bugs, one or two cheap code-based evaluators, one LLM-as-a-judge for a subjective criterion, and a regular trace review ritual.
126126

@@ -138,7 +138,7 @@ Until you have meaningful traffic: just ship the better-feeling prompt and move
138138

139139
If only one person on the team writes prompts, and you ship prompt changes through your normal code deploys, you don't need a separate prompt management system yet. A `prompts.py` file in your repo is fine.
140140

141-
[Prompt management](/docs/prompt-management/prompts) becomes valuable once (a) multiple people are editing prompts, (b) you want non-engineers to iterate on prompts without a deploy, or (c) you want to version and roll back prompts independently of code. None of those usually apply on day one.
141+
[Prompt management](/docs/prompt-management) becomes valuable once (a) multiple people are editing prompts, (b) you want non-engineers to iterate on prompts without a deploy, or (c) you want to version and roll back prompts independently of code. None of those usually apply on day one.
142142

143143
### Custom-built observability infra
144144

@@ -180,7 +180,7 @@ As your product matures, the bar for observability rises. Here's the rough order
180180

181181
**Connect LLM data to product analytics.** "Users who hit our AI feature retain at 2x" is the kind of insight that justifies the whole investment. To do this, your LLM observability needs to share a user model with your [product analytics](/product-analytics) – which is one reason [bundled tools](/blog/best-analytics-stack-vibe-coded-apps) tend to win at this stage.
182182

183-
**Consider prompt management (if it fits).** Once more than one person edits prompts, or you want non-engineers to iterate without a deploy, [prompt management](/docs/prompt-management/prompts) starts earning its keep. Many teams never need this – a `prompts.py` file in your repo is fine if you're shipping prompt changes through code anyway.
183+
**Consider prompt management (if it fits).** Once more than one person edits prompts, or you want non-engineers to iterate without a deploy, [prompt management](/docs/prompt-management) starts earning its keep. Many teams never need this – a `prompts.py` file in your repo is fine if you're shipping prompt changes through code anyway.
184184

185185
</details>
186186

@@ -209,9 +209,9 @@ A quick pitch, since this is our blog after all:
209209

210210
- **[Tracing](/docs/ai-observability/traces)** with a small SDK wrapper around OpenAI, Anthropic, and other providers
211211
- **Cost, latency, and token tracking** by model, user, and feature
212-
- **[Evaluations](/docs/ai-evals/evaluations)** with LLM-as-a-judge and human review
212+
- **[Evaluations](/docs/ai-evals)** with LLM-as-a-judge and human review
213213
- **[Error tracking](/error-tracking)** alongside traces, so AI-specific failures (tool calls, parsing errors, loop termination) show up next to the calls that caused them
214-
- **[Prompt management](/docs/prompt-management/prompts)** with versioning, fetch-at-runtime, and rollback
214+
- **[Prompt management](/docs/prompt-management)** with versioning, fetch-at-runtime, and rollback
215215
- **[Connected product analytics](/product-analytics)** – the same user IDs flow through, so you can correlate AI usage with retention, conversion, and churn
216216
- **[Session replay](/docs/ai-observability/link-session-replay)** alongside traces, so you can see exactly what the user was doing when the model misbehaved
217217
- **An [MCP server](/docs/model-context-protocol)** that lets you query traces, costs, and evals directly from Claude Code or Cursor while you're iterating

contents/blog/best-statsig-alternatives.mdx

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -59,7 +59,7 @@ Eligible early-stage companies can apply to [PostHog for Startups](/startups) fo
5959

6060
**PostHog**: Built for engineering teams shipping AI products. [AI observability](/ai-observability) captures traces, spans, prompts, token costs, and latency across providers (Anthropic, OpenAI, Google, LangChain, and more).
6161

62-
Run [evals with LLM-as-a-Judge](/blog/stop-ai-slop) to catch quality regressions, [manage prompts directly in PostHog](/docs/prompt-management/prompts) with versioning and fetch them at runtime (no redeploys), and use the prompt playground to compare models side-by-side.
62+
Run [evals with LLM-as-a-Judge](/blog/stop-ai-slop) to catch quality regressions, [manage prompts directly in PostHog](/docs/prompt-management) with versioning and fetch them at runtime (no redeploys), and use the prompt playground to compare models side-by-side.
6363

6464
Pair it with feature flags to roll out new models gradually, experiments to A/B test prompt variants, and session replay to see exactly what users saw when the model misbehaved.
6565

contents/blog/what-is-ai-observability.mdx

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -108,7 +108,7 @@ The combination of LLM errors + general application errors is how you debug agen
108108

109109
#### 4. Evaluations (evals)
110110

111-
[LLM evals](/docs/ai-evals/evaluations) score the quality of outputs, not just whether they succeeded. They come in two flavors:
111+
[LLM evals](/docs/ai-evals) score the quality of outputs, not just whether they succeeded. They come in two flavors:
112112

113113
- **Deterministic evals** (code-based): cheap, fast, and reliable. Things like "did the agent call the right tool," "did the output contain a forbidden keyword," "was the response under N tokens," or Levenshtein distance against an expected output.
114114
- **LLM-as-a-judge evals**: a separate model scores the output against a rubric. Useful for subjective criteria (tone, helpfulness, hallucination) that code can't capture. More expensive, less reliable, sensitive to changes in the judge model.
@@ -123,7 +123,7 @@ Beyond explicit feedback, implicit signals matter too: retries, edits, abandonme
123123

124124
#### 6. Prompt management
125125

126-
Versioning, A/B testing, and runtime control of your prompts without re-deploying code. [Prompt management](/docs/prompt-management/prompts) is useful once multiple people are editing prompts or you want non-engineers to iterate. Often overrated at the MVP stage – a `prompts.py` file in your repo is fine if one person owns prompts and you ship changes through normal deploys.
126+
Versioning, A/B testing, and runtime control of your prompts without re-deploying code. [Prompt management](/docs/prompt-management) is useful once multiple people are editing prompts or you want non-engineers to iterate. Often overrated at the MVP stage – a `prompts.py` file in your repo is fine if one person owns prompts and you ship changes through normal deploys.
127127

128128
### How tools instrument your code
129129

@@ -167,8 +167,8 @@ The free tier covers 100K LLM events/mo. EU hosting available. SDKs for OpenAI,
167167
**Key features:**
168168

169169
- Tracing, cost tracking, latency, token usage, model breakdown
170-
- [Evals](/docs/ai-evals/evaluations) with deterministic checks and LLM-as-a-judge
171-
- [Prompt management](/docs/prompt-management/prompts) with versioning and fetch-at-runtime
170+
- [Evals](/docs/ai-evals) with deterministic checks and LLM-as-a-judge
171+
- [Prompt management](/docs/prompt-management) with versioning and fetch-at-runtime
172172
- [Prompt experiments](/docs/prompt-management/prompt-experiments) (beta) for A/B testing prompts with built-in cost, latency, and eval pass rate metrics
173173
- [MCP server](/docs/model-context-protocol) for querying observability data from Claude Code or Cursor
174174
- Connected to [product analytics](/docs/product-analytics), [session replay](/docs/session-replay), and [error tracking](/docs/error-tracking) so you can debug AI failures with full user context
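
The deterministic evals named in the hunk above (right tool called, forbidden keyword, response under N tokens, Levenshtein distance) can be sketched in plain Python. This is a minimal illustration and not PostHog's evaluator API: every function name here is hypothetical, and whitespace splitting is only a rough stand-in for a real tokenizer.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def contains_forbidden(output: str, forbidden: list[str]) -> bool:
    """True if any forbidden phrase appears in the output (case-insensitive)."""
    lowered = output.lower()
    return any(phrase.lower() in lowered for phrase in forbidden)


def under_token_limit(output: str, max_tokens: int) -> bool:
    """True if the output has fewer than max_tokens whitespace-separated words.

    Assumption: word count approximates token count; swap in a real tokenizer.
    """
    return len(output.split()) < max_tokens


def called_expected_tool(tool_calls: list[str], expected: str) -> bool:
    """True if the agent invoked the expected tool at least once."""
    return expected in tool_calls
```

Checks like these are cheap enough to run on every generation, which is why the text above treats them as the first tier before adding an LLM-as-a-judge.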

contents/docs/ai-evals/datasets.mdx

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,4 +4,6 @@ title: Datasets
44

55
Datasets let you curate sets of input/output pairs you can replay against prompt or model changes to catch regressions before they reach production.
66

7-
> Coming soon.
7+
## Documentation coming soon
8+
9+
We're still writing the full guide for datasets. In the meantime, see the [AI Evals overview](/docs/ai-evals) to learn how datasets fit alongside evaluations and trace reviews.
File renamed without changes.

contents/docs/ai-evals/taggers.mdx

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,4 +4,6 @@ title: Taggers
44

55
Taggers automatically classify and label generations so you can filter, group, and analyze your LLM traffic by the categories that matter to your product.
66

7-
> Full documentation coming soon.
7+
## Documentation coming soon
8+
9+
We're still writing the full guide for taggers. In the meantime, see the [AI Evals overview](/docs/ai-evals) to learn how taggers fit alongside evaluations and trace reviews.

contents/docs/experiments/llm-prompt-experiments.mdx

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,28 @@ availability:
66
free: none
77
selfServe: full
88
enterprise: full
9+
tableOfContents: [
10+
{
11+
url: 'step-1-create-your-prompt-versions',
12+
value: 'Step 1: Create your prompt versions',
13+
depth: 1,
14+
},
15+
{
16+
url: 'step-2-create-the-experiment',
17+
value: 'Step 2: Create the experiment',
18+
depth: 1,
19+
},
20+
{
21+
url: 'step-3-wire-up-your-code',
22+
value: 'Step 3: Wire up your code',
23+
depth: 1,
24+
},
25+
{
26+
url: 'step-4-launch-and-read-results',
27+
value: 'Step 4: Launch and read results',
28+
depth: 1,
29+
},
30+
]
931
---
1032

1133
import PromptExperimentsBody from '../prompt-management/_snippets/prompt-experiments-body.mdx'
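
The `url` values added to `tableOfContents` above look like slugs derived from the step headings. The site's actual slug function is not shown in this diff, so the helper below is an assumption: a hypothetical `heading_to_anchor` that happens to reproduce the four values in the frontmatter, useful as a quick check that a hand-written `url` matches its `value`.

```python
import re


def heading_to_anchor(heading: str) -> str:
    """Lowercase a heading, drop punctuation, and join words with hyphens.

    Assumption: this mirrors the anchor format seen in the frontmatter;
    the real site generator may handle edge cases differently.
    """
    slug = heading.strip().lower()
    slug = re.sub(r"[^a-z0-9\s-]", "", slug)  # drop punctuation such as ':'
    slug = re.sub(r"[\s-]+", "-", slug)       # collapse runs of spaces/hyphens
    return slug.strip("-")


# The four entries from the frontmatter, as (value, url) pairs.
TOC = [
    ("Step 1: Create your prompt versions", "step-1-create-your-prompt-versions"),
    ("Step 2: Create the experiment", "step-2-create-the-experiment"),
    ("Step 3: Wire up your code", "step-3-wire-up-your-code"),
    ("Step 4: Launch and read results", "step-4-launch-and-read-results"),
]

# Collect any entry whose url does not match the slug of its heading.
mismatches = [(value, url) for value, url in TOC if heading_to_anchor(value) != url]
```

Because the table of contents is hard-coded here rather than generated from the imported snippet, a check like this would flag drift if a heading in the shared body is later renamed.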

contents/docs/prompt-management/_snippets/prompt-experiments-body.mdx

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@ Use this when you have a candidate change to a prompt (a wording tweak, a new in
1515

1616
You need at least two versions of a prompt before you can create an experiment from it.
1717

18-
1. In **Prompt management****Prompts**, open the prompt you want to test (or [create a new one](/docs/prompt-management/prompts#creating-prompts))
18+
1. In **Prompt management****Prompts**, open the prompt you want to test (or [create a new one](/docs/prompt-management#creating-prompts))
1919
2. If it only has one version, edit the body and save. Every save creates a new immutable version.
2020
3. Repeat for as many versions as you want to compare (up to 10)
2121

@@ -159,7 +159,7 @@ Results populate within seconds of the first events landing. Each tile shows the
159159
160160
- **Cost** — mean LLM cost per user (`$ai_total_cost_usd` on `$ai_generation`). Goal: decrease.
161161
- **Latency** — mean LLM latency per user (`$ai_latency`). Goal: decrease.
162-
- **Eval pass rate** — share of `$ai_evaluation` events that returned a pass, scoped to this prompt. Populates only if you have [LLM evaluations](/docs/ai-evals/evaluations) configured.
162+
- **Eval pass rate** — share of `$ai_evaluation` events that returned a pass, scoped to this prompt. Populates only if you have [LLM evaluations](/docs/ai-evals) configured.
163163
164164
export const PromptExperimentResultsLight = "https://res.cloudinary.com/dmukukwp6/image/upload/q_auto,f_auto/Screenshot_2026_05_21_at_17_34_40_5e448cfbdf.png"
165165
export const PromptExperimentResultsDark = "https://res.cloudinary.com/dmukukwp6/image/upload/q_auto,f_auto/Screenshot_2026_05_21_at_17_34_54_68fa203949.png"
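
The three result tiles described in the hunk above (cost, latency, eval pass rate) can be approximated offline from exported events. This is a rough sketch under stated assumptions, not how PostHog computes the tiles: the event dictionary shape, the `distinct_id` key, and the `passed` property on `$ai_evaluation` are all hypothetical, and "mean per user" is read here as an average of per-user values. Only the event names and the `$ai_total_cost_usd` and `$ai_latency` properties come from the text.

```python
from collections import defaultdict
from statistics import mean


def summarize(events: list[dict]) -> dict:
    """Aggregate cost, latency, and eval pass rate from a list of event dicts."""
    cost_by_user = defaultdict(float)
    latency_by_user = defaultdict(list)
    passes = 0
    total_evals = 0

    for event in events:
        props = event.get("properties", {})
        if event["event"] == "$ai_generation":
            user = event["distinct_id"]  # hypothetical key for the user ID
            cost_by_user[user] += props.get("$ai_total_cost_usd", 0.0)
            if "$ai_latency" in props:
                latency_by_user[user].append(props["$ai_latency"])
        elif event["event"] == "$ai_evaluation":
            total_evals += 1
            if props.get("passed"):  # hypothetical property name
                passes += 1

    return {
        # Average of each user's total cost.
        "mean_cost_per_user": mean(list(cost_by_user.values())) if cost_by_user else None,
        # Average of each user's mean latency.
        "mean_latency_per_user": (
            mean([mean(values) for values in latency_by_user.values()])
            if latency_by_user else None
        ),
        # Share of evaluation events that passed; None when no evals exist.
        "eval_pass_rate": passes / total_evals if total_evals else None,
    }
```

Returning `None` when there are no evaluation events mirrors the note above that the eval pass rate tile only populates once evaluations are configured.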
File renamed without changes.

contents/docs/prompt-management/prompt-experiments.mdx

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,28 @@ availability:
66
free: none
77
selfServe: full
88
enterprise: full
9+
tableOfContents: [
10+
{
11+
url: 'step-1-create-your-prompt-versions',
12+
value: 'Step 1: Create your prompt versions',
13+
depth: 1,
14+
},
15+
{
16+
url: 'step-2-create-the-experiment',
17+
value: 'Step 2: Create the experiment',
18+
depth: 1,
19+
},
20+
{
21+
url: 'step-3-wire-up-your-code',
22+
value: 'Step 3: Wire up your code',
23+
depth: 1,
24+
},
25+
{
26+
url: 'step-4-launch-and-read-results',
27+
value: 'Step 4: Launch and read results',
28+
depth: 1,
29+
},
30+
]
931
---
1032

1133
import PromptExperimentsBody from './_snippets/prompt-experiments-body.mdx'
