docs: add experiment CI/CD integration guide by wochinge · Pull Request #2888 · langfuse/langfuse-docs

wochinge · 2026-05-04T09:32:11Z

Summary

Added a dedicated CI/CD integration page for Langfuse experiments, covering GitHub Actions setup, action inputs/outputs, secrets, RunnerContext, regression failures, and non-GitHub CI patterns.
Documented the upcoming experiment(context: RunnerContext) contract so GitHub Action examples use context.runExperiment / context.run_experiment with action-injected dataset, SDK client, and metadata defaults.
Moved the existing Pytest/Vitest CI examples out of the SDK page into the new CI/CD page, and replaced the old SDK section with a concise cross-link.
Added explicit docs anchors for evaluator and CI/CD deep links, and registered the new page in the experiments docs navigation.

Linear

LFE-9366

Major Decisions

Made the new CI/CD page the canonical place for experiment gating docs to avoid duplicating long CI examples across the SDK and CI/CD pages.
Kept GitHub Action docs focused on the action-specific RunnerContext flow, while keeping direct SDK-based examples for other CI/CD systems.

Disclaimer: Experimental PR review

Greptile Summary

This PR adds a dedicated CI/CD integration guide for Langfuse experiments, covering the langfuse/experiment-action GitHub Actions workflow, RunnerContext contract, regression gating via RegressionError, and equivalent Pytest/Vitest patterns for other CI systems. The existing CI examples are moved from the SDK page to the new page with clean cross-links.

The TypeScript Vitest examples use eval as an arrow-function parameter name (lines 534 and 557), which is a reserved identifier in TypeScript strict mode and will cause a compilation error for anyone copying the snippet.
The Python Pytest examples also shadow the built-in eval via loop variables, and next() without a default can raise StopIteration silently instead of a clear test failure.

Confidence Score: 3/5

Not safe to merge as-is — the TypeScript code examples contain a strict-mode syntax error that will fail compilation for any reader who copies them.

Two P1 findings (using eval as a TypeScript parameter name — a hard syntax error in strict mode) pull the score below the P1 ceiling of 4. The remaining findings are P2 style/robustness issues.

content/docs/evaluation/experiments/experiments-in-ci-cd.mdx — Vitest TypeScript example (lines 534, 557) and Pytest Python example (lines 418–420, 443–445).

Important Files Changed

| Filename | Overview |
| --- | --- |
| content/docs/evaluation/experiments/experiments-in-ci-cd.mdx | New CI/CD integration guide. Two TypeScript code examples use `eval` as a parameter name, which is a syntax error in strict mode; Python examples shadow the built-in `eval`; `next()` without a default can surface a confusing StopIteration; echo quoting in the action output YAML snippet is fragile. |
| content/docs/evaluation/experiments/experiments-via-sdk.mdx | Removed the duplicated CI examples and replaced them with cross-links to the new CI/CD page; added anchor IDs to the Evaluators and Testing in CI Environments headings. Clean change with no issues. |
| content/docs/evaluation/experiments/meta.json | Registers the new `experiments-in-ci-cd` page in the experiments navigation. Trivial and correct. |

Sequence Diagram

sequenceDiagram
    participant PR as Pull Request
    participant GHA as GitHub Actions
    participant Action as langfuse/experiment-action
    participant Script as experiment(context)
    participant LF as Langfuse API

    PR->>GHA: push / pull_request event
    GHA->>Action: run with inputs (dataset_name, keys, metadata)
    Action->>LF: fetch dataset items (dataset_name + dataset_version)
    Action->>Script: call experiment(RunnerContext)
    Script->>LF: run_experiment / runExperiment (task + evaluators)
    LF-->>Script: ExperimentResult (run_evaluations)
    Script-->>Action: return result (or raise RegressionError)
    Action->>GHA: set outputs (result_json, failed)
    Action->>PR: post/update PR comment (pass/regression/error + scores)
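
The flow in the diagram can be sketched in plain Python. This is only an illustration under stated assumptions: `FakeRunnerContext`, `Evaluation`, `ExperimentResult`, `run_action`, the 0.8 threshold, and the shape of the `result_json` payload are hypothetical stand-ins, and the local `RegressionError` class merely borrows the name used in this thread. None of it is the real action or SDK code.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, List


class RegressionError(Exception):
    """Local stand-in for the error type this thread calls RegressionError."""


@dataclass
class Evaluation:
    name: str
    value: float


@dataclass
class ExperimentResult:
    run_evaluations: List[Evaluation] = field(default_factory=list)


class FakeRunnerContext:
    """Hypothetical stand-in for RunnerContext; the real one is injected by the action."""

    def __init__(self, accuracy: float) -> None:
        self._accuracy = accuracy

    def run_experiment(self) -> ExperimentResult:
        # The real call would run the task and evaluators against the dataset.
        return ExperimentResult([Evaluation("avg_accuracy", self._accuracy)])


def experiment(context: FakeRunnerContext) -> ExperimentResult:
    # User script: run the experiment, then gate on a run-level score.
    result = context.run_experiment()
    score = next((e for e in result.run_evaluations if e.name == "avg_accuracy"), None)
    if score is None or score.value < 0.8:  # assumed threshold
        raise RegressionError("avg_accuracy missing or below 0.8")
    return result


def run_action(script: Callable[[FakeRunnerContext], ExperimentResult],
               context: FakeRunnerContext) -> dict:
    # Mirrors the last diagram steps: translate the outcome into outputs.
    try:
        result = script(context)
        return {"failed": False, "result_json": json.dumps(asdict(result))}
    except RegressionError as exc:
        return {"failed": True, "result_json": json.dumps({"error": str(exc)})}
```

Calling `run_action(experiment, FakeRunnerContext(0.5))` in this sketch yields an outputs dict with `failed` set to `True`, which a later workflow step could use to decide whether to block the pull request.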

Prompt To Fix All With AI

Fix the following 5 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 5
content/docs/evaluation/experiments/experiments-in-ci-cd.mdx:534
**`eval` is a reserved identifier in TypeScript strict mode**

Using `eval` as an arrow-function parameter (`(eval) => ...`) is a syntax error in TypeScript strict mode (and ES5+ strict mode). TypeScript will reject the snippet with *"Parameter name 'eval' is not allowed in strict mode"*, so anyone copying this example will get an immediate compilation failure. The same issue appears again on line 557.

```suggestion
      (e) => e.name === "avg_accuracy",
```

### Issue 2 of 5
content/docs/evaluation/experiments/experiments-in-ci-cd.mdx:557
**Same `eval` reserved-identifier error in the second test case**

Same strict-mode syntax error as on line 534 — `eval` cannot be used as a parameter name.

```suggestion
      (e) => e.name === "avg_accuracy",
```

### Issue 3 of 5
content/docs/evaluation/experiments/experiments-in-ci-cd.mdx:418-420
**`next()` without a default raises `StopIteration` on missing evaluation**

If no evaluation named `"avg_accuracy"` is present (e.g. the run evaluator raised an exception), `next(...)` with no default will raise `StopIteration`, which pytest surfaces as a confusing internal error rather than a clear assertion failure. Adding a default of `None` and checking it explicitly makes the failure obvious. The same pattern on line 443 has the same issue.
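
A minimal sketch of the fix described in this issue, assuming `run_evaluations` is a sequence of objects with `name` and `value` attributes. The `Evaluation` dataclass and the `find_evaluation` and `check_accuracy` helpers are hypothetical names for illustration and are not part of the Langfuse SDK.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Evaluation:
    # Hypothetical stand-in for an SDK evaluation object.
    name: str
    value: float


def find_evaluation(run_evaluations: Sequence[Evaluation], name: str) -> Optional[Evaluation]:
    # next() with a default returns None instead of raising StopIteration.
    return next((e for e in run_evaluations if e.name == name), None)


def check_accuracy(run_evaluations: Sequence[Evaluation], threshold: float) -> None:
    evaluation = find_evaluation(run_evaluations, "avg_accuracy")
    # An explicit assertion turns a missing score into a readable test failure.
    assert evaluation is not None, "run evaluation 'avg_accuracy' is missing"
    assert evaluation.value >= threshold, (
        f"avg_accuracy {evaluation.value:.2f} is below threshold {threshold:.2f}"
    )
```

With this shape, a run whose evaluator crashed fails with the message about the missing evaluation rather than with a bare `StopIteration`.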

### Issue 4 of 5
content/docs/evaluation/experiments/experiments-in-ci-cd.mdx:418-420
**`eval` shadows Python built-in in loop variable**

`eval` is a Python built-in function; using it as a loop variable name (`for eval in result.run_evaluations`) will shadow it and trigger linting warnings (e.g. `W0622` in pylint, `A001` in flake8-builtins). A minor rename like `evaluation` or `ev` avoids the issue. This same pattern also appears on line 444.
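
The shadowing effect can be shown with a small sketch. The function names and the dict-shaped items are made up for illustration; only the behaviour of the built-in `eval` name is the point.

```python
def names_with_shadowing(items):
    names = []
    for eval in items:  # rebinds the name `eval` inside this function
        names.append(eval["name"])
    # After a non-empty loop, `eval` refers to the last item, not the built-in.
    return names, callable(eval)


def names_renamed(items):
    names = []
    for evaluation in items:  # descriptive name, the built-in stays untouched
        names.append(evaluation["name"])
    # `eval` still resolves to the built-in function here.
    return names, callable(eval)
```

Both functions collect the same names, but only the renamed version leaves `eval` usable as the built-in afterwards, which is what the linters cited above are guarding against.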

### Issue 5 of 5
content/docs/evaluation/experiments/experiments-in-ci-cd.mdx:342
**Single-quoted `echo` breaks if JSON contains a single quote**

`echo '${{ steps.experiment.outputs.result_json }}'` wraps the GitHub expression in single quotes. If any string value inside the JSON output contains a single-quote character, the shell command will fail or produce malformed JSON. A safer idiom is to assign to an env var first:

```yaml
- name: Store experiment result
  env:
    RESULT_JSON: ${{ steps.experiment.outputs.result_json }}
  run: echo "$RESULT_JSON" > experiment-result.json
```
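
To see why the single-quoted form is fragile, the following sketch uses Python's `shlex` module as an approximation of POSIX shell tokenization. The payload is a made-up example, not real action output, and `tokenizes_cleanly` is a hypothetical helper.

```python
import json
import shlex

# Hypothetical result payload whose string value contains a single quote.
result_json = json.dumps({"comment": "it's fine"})

# What the shell would see once the expression is substituted into the script.
inlined_command = f"echo '{result_json}'"


def tokenizes_cleanly(command: str) -> bool:
    """Return True if the command splits into tokens without a quoting error."""
    try:
        shlex.split(command)
        return True
    except ValueError:  # raised for unbalanced quotes
        return False


# The embedded quote ends the single-quoted string early and unbalances the rest.
broken = not tokenizes_cleanly(inlined_command)

# shlex.quote shows what correct escaping would look like; the env-var approach
# sidesteps the problem because the value never becomes part of the script text.
safe_command = f"echo {shlex.quote(result_json)}"
round_trip = shlex.split(safe_command)[1]
```

In this sketch the inlined command fails to tokenize, while the properly quoted command round-trips the JSON unchanged.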

Greptile also left 1 inline comment on this PR.

vercel · 2026-05-04T09:32:17Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| langfuse-docs | Ready | Preview, Comment | May 6, 2026 10:12am |

github-actions · 2026-05-04T09:32:20Z

@claude review

claude

Author addressed the substantive feedback from my prior review (echo quoting, TS eval parameter, Python eval loop variable) in commit e29c238 — remaining items are docs-only nits, approving.

Extended reasoning...

Overview

Docs-only PR adding a new CI/CD integration page (experiments-in-ci-cd.mdx) for Langfuse experiments, with the existing Pytest/Vitest CI examples relocated from experiments-via-sdk.mdx and replaced with cross-links there. meta.json registers the new page in the experiments nav. No application code, schemas, or auth surfaces are touched.

Security risks

None. The page is rendered as MDX docs content. The content does include CI/CD recipes — script-injection via single-quoted echo '${{ ... }}' was the one real footgun a reader might copy verbatim, and that has now been fixed in commit e29c238 to use the env: + printf pattern recommended by GitHub, so the remaining sample is safe.

Level of scrutiny

Light. This is documentation only; the worst-case impact of a defect is a confused reader rather than a broken production system. Bug-hunting found a few documentation-quality issues (TS strict-mode eval parameter, Python eval shadowing the built-in, fragile shell quoting) and the author addressed all three in e29c238. The remaining nits — next() without an explicit default in the moved Pytest snippets, and the actions/checkout@v6 / setup-python@v6 / setup-node@v6 pins — are either pre-existing patterns carried over from the source page or potentially valid given the time elapsed since my knowledge cutoff (today is 2026-05-04, cutoff January 2026).

Other factors

The PR has clear intent, is well-organized (canonicalizes CI gating docs in one place rather than duplicating across pages), and the author has been responsive to bot feedback within the same review cycle. No CODEOWNERS-protected files, no test coverage to worry about (it is the test guidance), and the cross-links between the two pages are coherent. Given the docs-only scope and the addressed feedback, this is comfortable to shadow-approve.

wochinge · 2026-05-04T13:23:26Z

@greptileai

claude

Docs-only PR; all my prior substantive concerns (eval rename, shell quoting via env var, if: always() on the storage step, LangTabs label consistency) are reflected in the current diff.

Extended reasoning...

Overview

This PR adds a new content/docs/evaluation/experiments/experiments-in-ci-cd.mdx page documenting CI/CD integration for Langfuse experiments (GitHub Action workflow, RunnerContext contract, RegressionError, secrets, action inputs/outputs), moves the existing Pytest/Vitest examples out of experiments-via-sdk.mdx (replaced with a cross-link), registers the new page in meta.json, adds a changelog entry, and adds an author entry + headshot.

Security risks

None material. Modifications are confined to MDX docs, a JSON authors file, and a JPG headshot — no application code, no auth, crypto, or permissions code is touched. The most security-relevant aspect is the example shell snippet for storing result_json, which now correctly uses the env: indirection pattern recommended by GitHub's own security hardening guide rather than interpolating ${{ ... }} directly into a shell command.

Level of scrutiny

Low-to-medium. This is a documentation-only change (size:XL by line count, but all prose + examples). The page has been through two prior bot review rounds and an explicit author re-tag, with each substantive concern I raised in earlier passes addressed in the current diff. The remaining open thread (@v6 action pins) was resolved by the author and the bug hunting system did not re-flag it in this run — this is the kind of judgment call that does not benefit from another bot reiteration.

Other factors

The bug hunting system in this run reported no bugs. All my prior inline comments are marked resolved on the GitHub side, and the corresponding code in the current diff matches the suggested fixes (eval/evaluation rename in three Python comprehensions, env-var pattern on the storage step, if: always() on the storage step, uniform ["Python SDK", "JS/TS SDK"] LangTabs labels in all three locations on the new page). The remaining @v6 pins are the author's intentional choice given their current-date context, and re-flagging them would be noise per the broken-record rule.

wochinge · 2026-05-05T10:31:37Z

 </Tab>
 </LangTabs>

-#### Testing in CI Environments


moved this to the new page

claude

LGTM — docs-only PR with prior feedback addressed; remaining nit on brand-name capitalization is non-blocking.

Extended reasoning...

Overview

This PR adds a new CI/CD integration guide (content/docs/evaluation/experiments/experiments-in-ci-cd.mdx), a corresponding changelog entry (content/changelog/2026-05-05-experiment-ci-cd-gates.mdx), refactors experiments-via-sdk.mdx to cross-link to the new page rather than duplicate Pytest/Vitest examples, registers the page in meta.json, and adds a new author entry plus avatar. All file changes are documentation, navigation config, an author profile, and image assets.

Security risks

None — pure docs/content. No code, no auth, no infra. The added image (avatar/OG image) is a static asset, and the changelog/MDX are standard Nextra content. The included YAML samples are illustrative only.

Level of scrutiny

Low scrutiny appropriate. This is a docs PR with no executable impact, no CODEOWNER concerns flagged in the diff, and no shared production state. Multiple rounds of substantive review have already happened in this PR's timeline (action versions, shell-quoting, eval shadowing, if: always(), LangTabs labels) and the author has addressed all of them — the current diff already reflects actions/checkout@v6 etc. (the team's call to keep), if: always() on the storage step, printf '%s' "$RESULT_JSON" env-var pattern, evaluation in place of eval, and next(..., None) with isinstance guards.

Other factors

The only remaining finding from this run is a stylistic nit — the new changelog uses lowercase langfuse as a brand name in five prose locations, while every other entry in content/changelog/ capitalizes it. That inline comment is already attached for the author to consider; it is not a release blocker. Given the PR is well-scoped, the substantive feedback has been resolved across prior review cycles, and only a brand-capitalization nit remains, this is safe to approve without further human review.

claude · 2026-05-05T10:37:03Z

+description: Run langfuse experiments in GitHub Actions to catch quality regressions before releasing changes to production.
+author: Tobias Wochinger
+ogImage: /images/changelog/2026-05-05-experiment-ci-cd.png
+canonical: /docs/evaluation/experiments/experiments-in-ci-cd
+---
+
+import { ChangelogHeader } from "@/components/changelog/ChangelogHeader";
+
+<ChangelogHeader />
+
+You can now run langfuse experiments in GitHub Actions and catch quality regressions before they ship. The new [langfuse/experiment-action](https://github.com/langfuse/experiment-action) tests your application against a langfuse dataset, reports the result directly on the pull request, and tracks the experiment run in langfuse.


🟡 The new changelog at content/changelog/2026-05-05-experiment-ci-cd-gates.mdx writes the product name as lowercase langfuse in five places on lines 4 and 14 (frontmatter description: plus three references in the body), e.g. "Run langfuse experiments...", "a langfuse dataset", "the experiment run in langfuse". Every other entry under content/changelog/ capitalizes "Langfuse" in user-facing prose, so this entry is the lone outlier. Suggest capitalizing the five brand-prose occurrences while keeping the GitHub repo path langfuse/experiment-action correctly lowercase.

Extended reasoning...

What the bug is. The new changelog entry uses lowercase langfuse as the product name in user-facing prose. Specifically:

Line 4 (frontmatter): description: Run langfuse experiments in GitHub Actions to catch quality regressions before releasing changes to production.

Line 14 (body): You can now run langfuse experiments in GitHub Actions... — and within the same paragraph, ...tests your application against a langfuse dataset... and ...tracks the experiment run in langfuse.

That is five occurrences of langfuse where the brand name should be Langfuse.

Why this is inconsistent. Grepping content/changelog/*.mdx shows every other changelog entry capitalizes "Langfuse" in description: fields and prose bodies (e.g. "Categorize and manage your traces with tags in Langfuse.", "Langfuse now offers compatibility...", "Deploy Langfuse behind a custom base path...", "Langfuse says hello to Anthropic Claude..."). The new file is the only entry under content/changelog/ with lowercase langfuse as a brand reference in prose.

What is and isn't a bug. The GitHub repo path langfuse/experiment-action (line 14, inside the markdown link target) is correctly lowercase — that is the literal owner/repo slug on GitHub and should not change. Only the five brand-name prose occurrences should be capitalized.

Step-by-step proof.

Open content/changelog/2026-05-05-experiment-ci-cd-gates.mdx line 4 — description: Run langfuse experiments.... Compare to the description field of any neighbouring changelog file (e.g. 2025-11-20-*, 2025-01-21-audit-logs.mdx, 2024-04-22-posthog-integration.mdx); all use capital Langfuse.

Open line 14 — the prose paragraph contains three more lowercase langfuse references plus the legitimate [langfuse/experiment-action] repo link.

The published changelog page (and Open Graph preview) will render the lowercase brand name verbatim, where every other entry on the same index page reads "Langfuse".

Impact. Pure brand/style consistency — no functional break. The page still renders, the link still resolves, and search will still find the entry. But the inconsistency is visible to anyone scanning the changelog index because the description appears next to other entries that consistently capitalize the product name.

Fix. Replace langfuse → Langfuse in the five prose occurrences on lines 4 and 14, while leaving the langfuse/experiment-action repo path unchanged. Five-character diff.

Severity. nit — purely stylistic/branding, fixable in a single targeted edit while the file is being added.

lowercase is the new style that we want to use, no?

Doesn't appear to be — every other 2026 changelog still capitalizes "Langfuse" in prose, including 2026-04-27-langfuse-cloud-japan.mdx (just a week before this entry, e.g. "Langfuse Cloud is now live...", "What is Langfuse Cloud Japan?") and 2026-04-13-experiments-rebuild.mdx ("Experiments are now a primary concept in Langfuse..."). If lowercase langfuse is genuinely the new convention, this is the first changelog to adopt it and the rest of content/changelog/ will read inconsistently on the index page until the others are migrated. If it's not the new convention, capitalizing the five prose occurrences on lines 4 and 14 brings this entry in line with the existing style. Either way, worth a quick check with whoever owns the style guide before merging.

Lotte-Verheyden

Content-wise the page is strong, but I would refrain from adding another docs page for it here (it should not live on the same level as experiments via SDKs). What about:

- making this a guide page
- having a section on the experiments via SDK page that then links to the guide page for details on how to implement it. The section on the experiments via SDK page can be what's now the intro of your CI/CD page

Wdyt?

wochinge · 2026-05-06T10:08:44Z

@Lotte-Verheyden Done. The main difference from your suggestion is that I kept a snippet of the GitHub Action usage in the main docs. This is in line with how e.g. Braintrust does it.

docs: add experiment CI/CD integration guide

dff8d39

dosubot Bot added the size:XL (This PR changes 500-999 lines, ignoring generated files) and docs labels May 4, 2026

vercel Bot deployed to Preview May 4, 2026 09:34 View deployment

greptile-apps Bot reviewed May 4, 2026

Comment thread content/docs/evaluation/experiments/experiments-in-ci-cd.mdx Outdated

docs: clarify experiment action SDK versions

a2ecfaa

vercel Bot deployed to Preview May 4, 2026 09:38 View deployment

claude Bot reviewed May 4, 2026

Comment thread content/guides/experiments-ci-cd.mdx

Comment thread content/guides/experiments-ci-cd.mdx

Comment thread content/guides/experiments-ci-cd.mdx

claude Bot reviewed May 4, 2026

Comment thread content/guides/experiments-ci-cd.mdx

docs: address experiment CI review feedback

e29c238

vercel Bot deployed to Preview May 4, 2026 10:08 View deployment

claude Bot reviewed May 4, 2026

docs: handle missing CI run evaluator scores

9812eca

vercel Bot deployed to Preview May 4, 2026 11:01 View deployment

claude Bot reviewed May 4, 2026

Comment thread content/guides/experiments-ci-cd.mdx

Comment thread content/docs/evaluation/experiments/experiments-in-ci-cd.mdx Outdated

docs: refine experiment CI guide wording

70ab372

vercel Bot deployed to Preview May 4, 2026 12:22 View deployment

docs: update experiment action contract details

ee19772

vercel Bot deployed to Preview May 4, 2026 13:15 View deployment

docs: improve experiment CI action examples

f29c4f2

vercel Bot deployed to Preview May 4, 2026 13:28 View deployment

wochinge requested a review from Lotte-Verheyden May 4, 2026 13:41

wochinge added 2 commits May 5, 2026 11:10

docs: add experiment CI changelog

e6403f1

Merge remote-tracking branch 'origin/main' into tobias/lfe-9366-docum…

e75c4dd

…entation

vercel Bot deployed to Preview May 5, 2026 09:13 View deployment

claude Bot reviewed May 5, 2026

docs: refine experiment CI changelog

47d9052

vercel Bot deployed to Preview May 5, 2026 10:30 View deployment

wochinge commented May 5, 2026

claude Bot reviewed May 5, 2026

wochinge requested a review from annabellscha May 5, 2026 14:06

wochinge added 2 commits May 5, 2026 17:12

Merge branch 'main' into tobias/lfe-9366-documentation

c659b7d

docs: remove duplicate changelog author

1dff964

vercel Bot deployed to Preview May 5, 2026 15:15 View deployment

claude Bot reviewed May 5, 2026

Comment thread data/authors.json Outdated

vercel Bot deployed to Preview May 5, 2026 15:51 View deployment

Lotte-Verheyden requested changes May 6, 2026

docs: move experiment ci guide

f46f939

vercel Bot deployed to Preview May 6, 2026 09:58 View deployment

claude Bot reviewed May 6, 2026

Comment thread content/docs/evaluation/experiments/experiments-via-sdk.mdx

wochinge requested a review from Lotte-Verheyden May 6, 2026 10:08

wochinge added 3 commits May 6, 2026 12:10

docs: fix ci testing section title

e7e03c7

docs: link experiment action in ci callout

39ec823

Revert "docs: link experiment action in ci callout"

cb5b231

This reverts commit 39ec823.
