PostHog
diff --git a/contents/blog/ai-observability-for-mvps.mdx b/contents/blog/ai-observability-for-mvps.mdx
Lines changed: 5 additions & 5 deletions
diff --git a/contents/blog/best-statsig-alternatives.mdx b/contents/blog/best-statsig-alternatives.mdx
Lines changed: 1 addition & 1 deletion
diff --git a/contents/blog/what-is-ai-observability.mdx b/contents/blog/what-is-ai-observability.mdx
Lines changed: 4 additions & 4 deletions
diff --git a/contents/docs/ai-evals/datasets.mdx b/contents/docs/ai-evals/datasets.mdx
Lines changed: 3 additions & 1 deletion
diff --git a/contents/docs/ai-evals/evaluations.mdx b/contents/docs/ai-evals/index.mdx
contents/docs/ai-evals/evaluations.mdx renamed to contents/docs/ai-evals/index.mdx
diff --git a/contents/docs/ai-evals/taggers.mdx b/contents/docs/ai-evals/taggers.mdx
Lines changed: 3 additions & 1 deletion
diff --git a/contents/docs/experiments/llm-prompt-experiments.mdx b/contents/docs/experiments/llm-prompt-experiments.mdx
Lines changed: 22 additions & 0 deletions
diff --git a/contents/docs/prompt-management/_snippets/prompt-experiments-body.mdx b/contents/docs/prompt-management/_snippets/prompt-experiments-body.mdx
Lines changed: 2 additions & 2 deletions
diff --git a/contents/docs/prompt-management/prompts.mdx b/contents/docs/prompt-management/index.mdx
contents/docs/prompt-management/prompts.mdx renamed to contents/docs/prompt-management/index.mdx
diff --git a/contents/docs/prompt-management/prompt-experiments.mdx b/contents/docs/prompt-management/prompt-experiments.mdx
Lines changed: 22 additions & 0 deletions
@@ -120,7 +120,7 @@ You don't need 15 evaluation criteria across five different dimensions of qualit
 - You don't know what "good" looks like for your product yet, so any rubric you write could be wrong
 - You can do better than evals at this scale by *reading your outputs*. Spend 30 minutes a week looking at random traces. You'll catch more than any rubric will.
 
-[Add automated evals](/docs/ai-evals/evaluations) once you have enough generations that you can't read them all (usually a few hundred per day). 
+[Add automated evals](/docs/ai-evals) once you have enough generations that you can't read them all (usually a few hundred per day). 
 
 When you're ready, our [beginner's guide to testing AI agents](/blog/testing-ai-agents) walks through what a minimal eval suite actually looks like – a small dataset from real user queries and recent bugs, one or two cheap code-based evaluators, one LLM-as-a-judge for a subjective criterion, and a regular trace review ritual.
 
@@ -138,7 +138,7 @@ Until you have meaningful traffic: just ship the better-feeling prompt and move
 
 If only one person on the team writes prompts, and you ship prompt changes through your normal code deploys, you don't need a separate prompt management system yet. A `prompts.py` file in your repo is fine.
 
-[Prompt management](/docs/prompt-management/prompts) becomes valuable once (a) multiple people are editing prompts, (b) you want non-engineers to iterate on prompts without a deploy, or (c) you want to version and roll back prompts independently of code. None of those usually apply on day one.
+[Prompt management](/docs/prompt-management) becomes valuable once (a) multiple people are editing prompts, (b) you want non-engineers to iterate on prompts without a deploy, or (c) you want to version and roll back prompts independently of code. None of those usually apply on day one.
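
For illustration only, here is a minimal sketch of what the `prompts.py` file mentioned above might look like. The prompt names, the wording of the templates, and the `render_prompt` helper are all hypothetical; the post only says that a single file in the repo is enough at this stage.

```python
"""prompts.py: a hypothetical single-file prompt registry kept in the repo."""

# Prompt names and template wording are illustrative placeholders.
PROMPTS = {
    "summarize": "Summarize the following text in {max_sentences} sentences:\n\n{text}",
    "classify": "Classify the sentiment of this review as positive, negative, or neutral:\n\n{review}",
}


def render_prompt(name: str, **variables: object) -> str:
    """Look up a prompt template by name and fill in its variables."""
    try:
        template = PROMPTS[name]
    except KeyError:
        raise KeyError(f"unknown prompt: {name!r}") from None
    # str.format raises KeyError if a required variable is missing.
    return template.format(**variables)
```

Because the prompts live in code, every change goes through the normal review and deploy process, which is the trade-off the paragraph above describes.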
 
 ### Custom-built observability infra
 
@@ -180,7 +180,7 @@ As your product matures, the bar for observability rises. Here's the rough order
 
 **Connect LLM data to product analytics.** "Users who hit our AI feature retain at 2x" is the kind of insight that justifies the whole investment. To do this, your LLM observability needs to share a user model with your [product analytics](/product-analytics) – which is one reason [bundled tools](/blog/best-analytics-stack-vibe-coded-apps) tend to win at this stage.
 
-**Consider prompt management (if it fits).** Once more than one person edits prompts, or you want non-engineers to iterate without a deploy, [prompt management](/docs/prompt-management/prompts) starts earning its keep. Many teams never need this – a `prompts.py` file in your repo is fine if you're shipping prompt changes through code anyway.
+**Consider prompt management (if it fits).** Once more than one person edits prompts, or you want non-engineers to iterate without a deploy, [prompt management](/docs/prompt-management) starts earning its keep. Many teams never need this – a `prompts.py` file in your repo is fine if you're shipping prompt changes through code anyway.
 
 </details>
 
@@ -209,9 +209,9 @@ A quick pitch, since this is our blog after all:
 
 - **[Tracing](/docs/ai-observability/traces)** with a small SDK wrapper around OpenAI, Anthropic, and other providers
 - **Cost, latency, and token tracking** by model, user, and feature
-- **[Evaluations](/docs/ai-evals/evaluations)** with LLM-as-a-judge and human review
+- **[Evaluations](/docs/ai-evals)** with LLM-as-a-judge and human review
 - **[Error tracking](/error-tracking)** alongside traces, so AI-specific failures (tool calls, parsing errors, loop termination) show up next to the calls that caused them
-- **[Prompt management](/docs/prompt-management/prompts)** with versioning, fetch-at-runtime, and rollback
+- **[Prompt management](/docs/prompt-management)** with versioning, fetch-at-runtime, and rollback
 - **[Connected product analytics](/product-analytics)** – the same user IDs flow through, so you can correlate AI usage with retention, conversion, and churn
 - **[Session replay](/docs/ai-observability/link-session-replay)** alongside traces, so you can see exactly what the user was doing when the model misbehaved
 - **An [MCP server](/docs/model-context-protocol)** that lets you query traces, costs, and evals directly from Claude Code or Cursor while you're iterating
 
@@ -59,7 +59,7 @@ Eligible early-stage companies can apply to [PostHog for Startups](/startups) fo
 
 **PostHog**: Built for engineering teams shipping AI products. [AI observability](/ai-observability) captures traces, spans, prompts, token costs, and latency across providers (Anthropic, OpenAI, Google, LangChain, and more). 
 
-Run [evals with LLM-as-a-Judge](/blog/stop-ai-slop) to catch quality regressions, [manage prompts directly in PostHog](/docs/prompt-management/prompts) with versioning and fetch them at runtime (no redeploys), and use the prompt playground to compare models side-by-side.
+Run [evals with LLM-as-a-Judge](/blog/stop-ai-slop) to catch quality regressions, [manage prompts directly in PostHog](/docs/prompt-management) with versioning and fetch them at runtime (no redeploys), and use the prompt playground to compare models side-by-side.
 
 Pair it with feature flags to roll out new models gradually, experiments to A/B test prompt variants, and session replay to see exactly what users saw when the model misbehaved.
 
 
@@ -108,7 +108,7 @@ The combination of LLM errors + general application errors is how you debug agen
 
 #### 4. Evaluations (evals)
 
-[LLM evals](/docs/ai-evals/evaluations) score the quality of outputs, not just whether they succeeded. They come in two flavors:
+[LLM evals](/docs/ai-evals) score the quality of outputs, not just whether they succeeded. They come in two flavors:
 
 - **Deterministic evals** (code-based): cheap, fast, and reliable. Things like "did the agent call the right tool," "did the output contain a forbidden keyword," "was the response under N tokens," or Levenshtein distance against an expected output.
 - **LLM-as-a-judge evals**: a separate model scores the output against a rubric. Useful for subjective criteria (tone, helpfulness, hallucination) that code can't capture. More expensive, less reliable, sensitive to changes in the judge model.
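
As a rough sketch of the deterministic checks listed above, the following shows a forbidden-keyword check, a length check, and Levenshtein distance against an expected output. The function names are hypothetical, and the length check approximates tokens by splitting on whitespace, which is an assumption; a real tokenizer would give different counts.

```python
def contains_forbidden_keyword(output: str, forbidden: list[str]) -> bool:
    """Return True if any forbidden keyword appears in the output (case-insensitive)."""
    lowered = output.lower()
    return any(word.lower() in lowered for word in forbidden)


def under_token_limit(output: str, max_tokens: int) -> bool:
    """Return True if the output is under the limit.

    Assumption: whitespace-separated words stand in for tokens.
    """
    return len(output.split()) < max_tokens


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, using a rolling row of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]
```

Checks like these run in microseconds and return the same answer every time, which is why they are a reasonable first layer before adding an LLM judge.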
@@ -123,7 +123,7 @@ Beyond explicit feedback, implicit signals matter too: retries, edits, abandonme
 
 #### 6. Prompt management
 
-Versioning, A/B testing, and runtime control of your prompts without re-deploying code. [Prompt management](/docs/prompt-management/prompts) is useful once multiple people are editing prompts or you want non-engineers to iterate. Often overrated at the MVP stage – a `prompts.py` file in your repo is fine if one person owns prompts and you ship changes through normal deploys.
+Versioning, A/B testing, and runtime control of your prompts without re-deploying code. [Prompt management](/docs/prompt-management) is useful once multiple people are editing prompts or you want non-engineers to iterate. Often overrated at the MVP stage – a `prompts.py` file in your repo is fine if one person owns prompts and you ship changes through normal deploys.
 
 ### How tools instrument your code
 
@@ -167,8 +167,8 @@ The free tier covers 100K LLM events/mo. EU hosting available. SDKs for OpenAI,
 **Key features:**
 
 - Tracing, cost tracking, latency, token usage, model breakdown
-- [Evals](/docs/ai-evals/evaluations) with deterministic checks and LLM-as-a-judge
-- [Prompt management](/docs/prompt-management/prompts) with versioning and fetch-at-runtime
+- [Evals](/docs/ai-evals) with deterministic checks and LLM-as-a-judge
+- [Prompt management](/docs/prompt-management) with versioning and fetch-at-runtime
 - [Prompt experiments](/docs/prompt-management/prompt-experiments) (beta) for A/B testing prompts with built-in cost, latency, and eval pass rate metrics
 - [MCP server](/docs/model-context-protocol) for querying observability data from Claude Code or Cursor
 - Connected to [product analytics](/docs/product-analytics), [session replay](/docs/session-replay), and [error tracking](/docs/error-tracking) so you can debug AI failures with full user context
 
@@ -4,4 +4,6 @@ title: Datasets
 
 Datasets let you curate sets of input/output pairs you can replay against prompt or model changes to catch regressions before they reach production.
 
-> Coming soon.
+## Documentation coming soon
+
+We're still writing the full guide for datasets. In the meantime, see the [AI Evals overview](/docs/ai-evals) to learn how datasets fit alongside evaluations and trace reviews.
@@ -4,4 +4,6 @@ title: Taggers
 
 Taggers automatically classify and label generations so you can filter, group, and analyze your LLM traffic by the categories that matter to your product.
 
-> Full documentation coming soon.
+## Documentation coming soon
+
+We're still writing the full guide for taggers. In the meantime, see the [AI Evals overview](/docs/ai-evals) to learn how taggers fit alongside evaluations and trace reviews.
@@ -6,6 +6,28 @@ availability:
   free: none
   selfServe: full
   enterprise: full
+tableOfContents: [
+    {
+        url: 'step-1-create-your-prompt-versions',
+        value: 'Step 1: Create your prompt versions',
+        depth: 1,
+    },
+    {
+        url: 'step-2-create-the-experiment',
+        value: 'Step 2: Create the experiment',
+        depth: 1,
+    },
+    {
+        url: 'step-3-wire-up-your-code',
+        value: 'Step 3: Wire up your code',
+        depth: 1,
+    },
+    {
+        url: 'step-4-launch-and-read-results',
+        value: 'Step 4: Launch and read results',
+        depth: 1,
+    },
+]
 ---
 
 import PromptExperimentsBody from '../prompt-management/_snippets/prompt-experiments-body.mdx'
 
@@ -15,7 +15,7 @@ Use this when you have a candidate change to a prompt (a wording tweak, a new in
 
 You need at least two versions of a prompt before you can create an experiment from it.
 
-1. In **Prompt management** → **Prompts**, open the prompt you want to test (or [create a new one](/docs/prompt-management/prompts#creating-prompts))
+1. In **Prompt management** → **Prompts**, open the prompt you want to test (or [create a new one](/docs/prompt-management#creating-prompts))
 2. If it only has one version, edit the body and save. Every save creates a new immutable version.
 3. Repeat for as many versions as you want to compare (up to 10)
 
@@ -159,7 +159,7 @@ Results populate within seconds of the first events landing. Each tile shows the
 
 - **Cost** — mean LLM cost per user (`$ai_total_cost_usd` on `$ai_generation`). Goal: decrease.
 - **Latency** — mean LLM latency per user (`$ai_latency`). Goal: decrease.
-- **Eval pass rate** — share of `$ai_evaluation` events that returned a pass, scoped to this prompt. Populates only if you have [LLM evaluations](/docs/ai-evals/evaluations) configured.
+- **Eval pass rate** — share of `$ai_evaluation` events that returned a pass, scoped to this prompt. Populates only if you have [LLM evaluations](/docs/ai-evals) configured.
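
To make the three metrics concrete, here is a minimal sketch of how they could be computed from raw events. It is an illustration under several assumptions, not PostHog's implementation: events are plain dicts with `event`, `distinct_id`, and `properties` keys, "per user" means summing each user's values and then averaging across users, and the `passed` property on `$ai_evaluation` is a hypothetical name for the pass/fail result.

```python
from collections import defaultdict
from statistics import mean


def mean_per_user(events, event_name, prop):
    """Sum `prop` per user for the given event, then average across users.

    Assumption: this is one plausible reading of "mean ... per user".
    """
    totals = defaultdict(float)
    for event in events:
        if event["event"] == event_name and prop in event["properties"]:
            totals[event["distinct_id"]] += event["properties"][prop]
    return mean(totals.values()) if totals else None


def eval_pass_rate(events):
    """Share of `$ai_evaluation` events that passed.

    Assumption: `passed` is a hypothetical boolean property name.
    """
    results = [
        event["properties"]["passed"]
        for event in events
        if event["event"] == "$ai_evaluation"
    ]
    return sum(results) / len(results) if results else None


events = [
    {"event": "$ai_generation", "distinct_id": "u1", "properties": {"$ai_total_cost_usd": 0.5}},
    {"event": "$ai_generation", "distinct_id": "u1", "properties": {"$ai_total_cost_usd": 0.25}},
    {"event": "$ai_generation", "distinct_id": "u2", "properties": {"$ai_total_cost_usd": 0.25}},
    {"event": "$ai_evaluation", "distinct_id": "u1", "properties": {"passed": True}},
    {"event": "$ai_evaluation", "distinct_id": "u2", "properties": {"passed": False}},
]

cost = mean_per_user(events, "$ai_generation", "$ai_total_cost_usd")
pass_rate = eval_pass_rate(events)
```

The same `mean_per_user` helper could be pointed at `$ai_latency`, assuming that property is also recorded on `$ai_generation` events.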
 
 export const PromptExperimentResultsLight = "https://res.cloudinary.com/dmukukwp6/image/upload/q_auto,f_auto/Screenshot_2026_05_21_at_17_34_40_5e448cfbdf.png"
 export const PromptExperimentResultsDark = "https://res.cloudinary.com/dmukukwp6/image/upload/q_auto,f_auto/Screenshot_2026_05_21_at_17_34_54_68fa203949.png"
 
@@ -6,6 +6,28 @@ availability:
   free: none
   selfServe: full
   enterprise: full
+tableOfContents: [
+    {
+        url: 'step-1-create-your-prompt-versions',
+        value: 'Step 1: Create your prompt versions',
+        depth: 1,
+    },
+    {
+        url: 'step-2-create-the-experiment',
+        value: 'Step 2: Create the experiment',
+        depth: 1,
+    },
+    {
+        url: 'step-3-wire-up-your-code',
+        value: 'Step 3: Wire up your code',
+        depth: 1,
+    },
+    {
+        url: 'step-4-launch-and-read-results',
+        value: 'Step 4: Launch and read results',
+        depth: 1,
+    },
+]
 ---
 
 import PromptExperimentsBody from './_snippets/prompt-experiments-body.mdx'