docs: add experiment CI/CD integration guide#2888

Open
wochinge wants to merge 16 commits into main from tobias/lfe-9366-documentation

wochinge (Contributor) commented May 4, 2026

Summary

  • Added a dedicated CI/CD integration page for Langfuse experiments, covering GitHub Actions setup, action inputs/outputs, secrets, RunnerContext, regression failures, and non-GitHub CI patterns.
  • Documented the upcoming experiment(context: RunnerContext) contract so GitHub Action examples use context.runExperiment / context.run_experiment with action-injected dataset, SDK client, and metadata defaults.
  • Moved the existing Pytest/Vitest CI examples out of the SDK page into the new CI/CD page, and replaced the old SDK section with a concise cross-link.
  • Added explicit docs anchors for evaluator and CI/CD deep links, and registered the new page in the experiments docs navigation.

Linear

Major Decisions

  • Made the new CI/CD page the canonical place for experiment gating docs to avoid duplicating long CI examples across the SDK and CI/CD pages.
  • Kept GitHub Action docs focused on the action-specific RunnerContext flow, while keeping direct SDK-based examples for other CI/CD systems.

Disclaimer: Experimental PR review

Greptile Summary

This PR adds a dedicated CI/CD integration guide for Langfuse experiments, covering the langfuse/experiment-action GitHub Actions workflow, RunnerContext contract, regression gating via RegressionError, and equivalent Pytest/Vitest patterns for other CI systems. The existing CI examples are moved from the SDK page to the new page with clean cross-links.

  • The TypeScript Vitest examples use eval as an arrow-function parameter name (lines 534 and 557), which is a reserved identifier in TypeScript strict mode and will cause a compilation error for anyone copying the snippet.
  • The Python Pytest examples also shadow the built-in eval via loop variables, and next() without a default can raise StopIteration silently instead of a clear test failure.

Confidence Score: 3/5

Not safe to merge as-is — the TypeScript code examples contain a strict-mode syntax error that will fail compilation for any reader who copies them.

Two P1 findings (using eval as a TypeScript parameter name — a hard syntax error in strict mode) pull the score below the P1 ceiling of 4. The remaining findings are P2 style/robustness issues.

content/docs/evaluation/experiments/experiments-in-ci-cd.mdx — Vitest TypeScript example (lines 534, 557) and Pytest Python example (lines 418–420, 443–445).

Important Files Changed

| Filename | Overview |
| --- | --- |
| content/docs/evaluation/experiments/experiments-in-ci-cd.mdx | New CI/CD integration guide. Two TypeScript code examples use eval as a parameter name, which is a syntax error in strict mode; Python examples shadow the built-in eval; next() without a default can surface a confusing StopIteration; echo quoting in the action output YAML snippet is fragile. |
| content/docs/evaluation/experiments/experiments-via-sdk.mdx | Removed the duplicated CI examples and replaced them with cross-links to the new CI/CD page; added anchor IDs to the Evaluators and Testing in CI Environments headings. Clean change with no issues. |
| content/docs/evaluation/experiments/meta.json | Registers the new experiments-in-ci-cd page in the experiments navigation. Trivial and correct. |

Sequence Diagram

sequenceDiagram
    participant PR as Pull Request
    participant GHA as GitHub Actions
    participant Action as langfuse/experiment-action
    participant Script as experiment(context)
    participant LF as Langfuse API

    PR->>GHA: push / pull_request event
    GHA->>Action: run with inputs (dataset_name, keys, metadata)
    Action->>LF: fetch dataset items (dataset_name + dataset_version)
    Action->>Script: call experiment(RunnerContext)
    Script->>LF: run_experiment / runExperiment (task + evaluators)
    LF-->>Script: ExperimentResult (run_evaluations)
    Script-->>Action: return result (or raise RegressionError)
    Action->>GHA: set outputs (result_json, failed)
    Action->>PR: post/update PR comment (pass/regression/error + scores)
Prompt To Fix All With AI
Fix the following 5 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 5
content/docs/evaluation/experiments/experiments-in-ci-cd.mdx:534
**`eval` is a reserved identifier in TypeScript strict mode**

Using `eval` as an arrow-function parameter (`(eval) => ...`) is a syntax error in TypeScript strict mode (and ES5+ strict mode). TypeScript will reject the snippet with *"Parameter name 'eval' is not allowed in strict mode"*, so anyone copying this example will get an immediate compilation failure. The same issue appears again on line 557.

```suggestion
      (e) => e.name === "avg_accuracy",
```

### Issue 2 of 5
content/docs/evaluation/experiments/experiments-in-ci-cd.mdx:557
**Same `eval` reserved-identifier error in the second test case**

Same strict-mode syntax error as on line 534 — `eval` cannot be used as a parameter name.

```suggestion
      (e) => e.name === "avg_accuracy",
```

### Issue 3 of 5
content/docs/evaluation/experiments/experiments-in-ci-cd.mdx:418-420
**`next()` without a default raises `StopIteration` on missing evaluation**

If no evaluation named `"avg_accuracy"` is present (e.g. the run evaluator raised an exception), `next(...)` with no default will raise `StopIteration`, which pytest surfaces as a confusing internal error rather than a clear assertion failure. Adding a default of `None` and checking it explicitly makes the failure obvious. The same pattern on line 443 has the same issue.
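
A minimal sketch of the fix described in Issue 3, assuming each run evaluation exposes `name` and `value` attributes. The `Evaluation` dataclass and the `find_evaluation` and `assert_min_score` helpers are hypothetical stand-ins for illustration, not part of the Langfuse SDK or of the snippet under review:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Evaluation:
    """Hypothetical stand-in for one entry of result.run_evaluations."""
    name: str
    value: float


def find_evaluation(evaluations: Iterable[Evaluation], name: str) -> Optional[Evaluation]:
    # The second argument to next() is returned when the generator is exhausted,
    # so a missing evaluation yields None instead of raising StopIteration.
    return next((e for e in evaluations if e.name == name), None)


def assert_min_score(evaluations: Iterable[Evaluation], name: str, threshold: float) -> None:
    evaluation = find_evaluation(evaluations, name)
    # Fail with an explicit message when the evaluation is missing.
    assert evaluation is not None, f"no run evaluation named {name!r} was produced"
    assert evaluation.value >= threshold, (
        f"{name} = {evaluation.value} is below the threshold {threshold}"
    )
```

In a Pytest test, calling `assert_min_score(result.run_evaluations, "avg_accuracy", 0.8)` would then report a missing evaluation as an ordinary assertion failure with a readable message.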

### Issue 4 of 5
content/docs/evaluation/experiments/experiments-in-ci-cd.mdx:418-420
**`eval` shadows Python built-in in loop variable**

`eval` is a Python built-in function; using it as a loop variable name (`for eval in result.run_evaluations`) will shadow it and trigger linting warnings (e.g. `W0622` in pylint, `A001` in flake8-builtins). A minor rename like `evaluation` or `ev` avoids the issue. This same pattern also appears on line 444.
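
To illustrate the shadowing in Issue 4, here is a minimal sketch using a plain `for` statement, where the rebinding persists after the loop. Inside a generator expression or comprehension the rebinding stays local to that expression, though linters still flag it. The function names and the sample list are made up for the example:

```python
def shadowed_lookup(names):
    # Using `eval` as the loop variable rebinds the name for the whole function.
    for eval in names:
        pass
    try:
        # `eval` now refers to the last string in `names`, not the built-in.
        eval("1 + 1")
    except TypeError as exc:
        return type(exc).__name__
    return "no error"


def renamed_lookup(names):
    # A neutral loop variable leaves the built-in untouched.
    for evaluation_name in names:
        pass
    return eval("1 + 1")


print(shadowed_lookup(["avg_accuracy"]))  # the call fails because a str is not callable
print(renamed_lookup(["avg_accuracy"]))   # the built-in eval still works
```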

### Issue 5 of 5
content/docs/evaluation/experiments/experiments-in-ci-cd.mdx:342
**Single-quoted `echo` breaks if JSON contains a single quote**

`echo '${{ steps.experiment.outputs.result_json }}'` wraps the GitHub expression in single quotes. If any string value inside the JSON output contains a single-quote character, the shell command will fail or produce malformed JSON. A safer idiom is to assign to an env var first:

```yaml
- name: Store experiment result
  env:
    RESULT_JSON: ${{ steps.experiment.outputs.result_json }}
  run: echo "$RESULT_JSON" > experiment-result.json
```

Reviews (1): Last reviewed commit: "docs: add experiment CI/CD integration g..."

Greptile also left 1 inline comment on this PR.

vercel Bot commented May 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| langfuse-docs | Ready | Preview, Comment | May 6, 2026 10:12am |

github-actions Bot commented May 4, 2026

@claude review

dosubot Bot added the size:XL (This PR changes 500-999 lines, ignoring generated files.) and docs labels on May 4, 2026
Comment thread content/docs/evaluation/experiments/experiments-in-ci-cd.mdx Outdated
Comment thread content/guides/experiments-ci-cd.mdx
Comment thread content/guides/experiments-ci-cd.mdx
Comment thread content/guides/experiments-ci-cd.mdx
Comment thread content/guides/experiments-ci-cd.mdx

claude Bot left a comment

Author addressed the substantive feedback from my prior review (echo quoting, TS eval parameter, Python eval loop variable) in commit e29c238 — remaining items are docs-only nits, approving.

Extended reasoning...

Overview

Docs-only PR adding a new CI/CD integration page (experiments-in-ci-cd.mdx) for Langfuse experiments, with the existing Pytest/Vitest CI examples relocated from experiments-via-sdk.mdx and replaced with cross-links there. meta.json registers the new page in the experiments nav. No application code, schemas, or auth surfaces are touched.

Security risks

None. The page is rendered as MDX docs content. The content does include CI/CD recipes — script-injection via single-quoted echo '${{ ... }}' was the one real footgun a reader might copy verbatim, and that has now been fixed in commit e29c238 to use the env: + printf pattern recommended by GitHub, so the remaining sample is safe.

Level of scrutiny

Light. This is documentation only; the worst-case impact of a defect is a confused reader rather than a broken production system. Bug-hunting found a few documentation-quality issues (TS strict-mode eval parameter, Python eval shadowing the built-in, fragile shell quoting) and the author addressed all three in e29c238. The remaining nits — next() without an explicit default in the moved Pytest snippets, and the actions/checkout@v6 / setup-python@v6 / setup-node@v6 pins — are either pre-existing patterns carried over from the source page or potentially valid given the time elapsed since my knowledge cutoff (today is 2026-05-04, cutoff January 2026).

Other factors

The PR has clear intent, is well-organized (canonicalizes CI gating docs in one place rather than duplicating across pages), and the author has been responsive to bot feedback within the same review cycle. No CODEOWNERS-protected files, no test coverage to worry about (it is the test guidance), and the cross-links between the two pages are coherent. Given the docs-only scope and the addressed feedback, this is comfortable to shadow-approve.

Comment thread content/guides/experiments-ci-cd.mdx
Comment thread content/docs/evaluation/experiments/experiments-in-ci-cd.mdx Outdated

wochinge (Contributor, Author) commented May 4, 2026

@greptileai

claude Bot left a comment

Docs-only PR; all my prior substantive concerns (eval rename, shell quoting via env var, if: always() on the storage step, LangTabs label consistency) are reflected in the current diff.

Extended reasoning...

Overview

This PR adds a new content/docs/evaluation/experiments/experiments-in-ci-cd.mdx page documenting CI/CD integration for Langfuse experiments (GitHub Action workflow, RunnerContext contract, RegressionError, secrets, action inputs/outputs), moves the existing Pytest/Vitest examples out of experiments-via-sdk.mdx (replaced with a cross-link), registers the new page in meta.json, adds a changelog entry, and adds an author entry + headshot.

Security risks

None material. Modifications are confined to MDX docs, a JSON authors file, and a JPG headshot — no application code, no auth, crypto, or permissions code is touched. The most security-relevant aspect is the example shell snippet for storing result_json, which now correctly uses the env: indirection pattern recommended by GitHub's own security hardening guide rather than interpolating ${{ ... }} directly into a shell command.

Level of scrutiny

Low-to-medium. This is a documentation-only change (size:XL by line count, but all prose + examples). The page has been through two prior bot review rounds and an explicit author re-tag, with each substantive concern I raised in earlier passes addressed in the current diff. The remaining open thread (@v6 action pins) was resolved by the author and the bug hunting system did not re-flag it in this run — this is the kind of judgment call that does not benefit from another bot reiteration.

Other factors

The bug hunting system in this run reported no bugs. All my prior inline comments are marked resolved on the GitHub side, and the corresponding code in the current diff matches the suggested fixes (eval/evaluation rename in three Python comprehensions, env-var pattern on the storage step, if: always() on the storage step, uniform ["Python SDK", "JS/TS SDK"] LangTabs labels in all three locations on the new page). The remaining @v6 pins are the author's intentional choice given their current-date context, and re-flagging them would be noise per the broken-record rule.

</Tab>
</LangTabs>

#### Testing in CI Environments

wochinge (Contributor, Author):

moved this to the new page

claude Bot left a comment

LGTM — docs-only PR with prior feedback addressed; remaining nit on brand-name capitalization is non-blocking.

Extended reasoning...

Overview

This PR adds a new CI/CD integration guide (content/docs/evaluation/experiments/experiments-in-ci-cd.mdx), a corresponding changelog entry (content/changelog/2026-05-05-experiment-ci-cd-gates.mdx), refactors experiments-via-sdk.mdx to cross-link to the new page rather than duplicate Pytest/Vitest examples, registers the page in meta.json, and adds a new author entry plus avatar. All file changes are documentation, navigation config, an author profile, and image assets.

Security risks

None — pure docs/content. No code, no auth, no infra. The added image (avatar/OG image) is a static asset, and the changelog/MDX are standard Nextra content. The included YAML samples are illustrative only.

Level of scrutiny

Low scrutiny appropriate. This is a docs PR with no executable impact, no CODEOWNER concerns flagged in the diff, and no shared production state. Multiple rounds of substantive review have already happened in this PR's timeline (action versions, shell-quoting, eval shadowing, if: always(), LangTabs labels) and the author has addressed all of them — the current diff already reflects actions/checkout@v6 etc. (the team's call to keep), if: always() on the storage step, printf '%s' "$RESULT_JSON" env-var pattern, evaluation in place of eval, and next(..., None) with isinstance guards.

Other factors

The only remaining finding from this run is a stylistic nit — the new changelog uses lowercase langfuse as a brand name in five prose locations, while every other entry in content/changelog/ capitalizes it. That inline comment is already attached for the author to consider; it is not a release blocker. Given the PR is well-scoped, the substantive feedback has been resolved across prior review cycles, and only a brand-capitalization nit remains, this is safe to approve without further human review.

Comment on lines +4 to +14
description: Run langfuse experiments in GitHub Actions to catch quality regressions before releasing changes to production.
author: Tobias Wochinger
ogImage: /images/changelog/2026-05-05-experiment-ci-cd.png
canonical: /docs/evaluation/experiments/experiments-in-ci-cd
---

import { ChangelogHeader } from "@/components/changelog/ChangelogHeader";

<ChangelogHeader />

You can now run langfuse experiments in GitHub Actions and catch quality regressions before they ship. The new [langfuse/experiment-action](https://github.com/langfuse/experiment-action) tests your application against a langfuse dataset, reports the result directly on the pull request, and tracks the experiment run in langfuse.

🟡 The new changelog at content/changelog/2026-05-05-experiment-ci-cd-gates.mdx writes the product name as lowercase langfuse in five places on lines 4 and 14 (frontmatter description: plus three references in the body), e.g. "Run langfuse experiments...", "a langfuse dataset", "the experiment run in langfuse". Every other entry under content/changelog/ capitalizes "Langfuse" in user-facing prose, so this entry is the lone outlier. Suggest capitalizing the five brand-prose occurrences while keeping the GitHub repo path langfuse/experiment-action correctly lowercase.

Extended reasoning...

What the bug is. The new changelog entry uses lowercase langfuse as the product name in user-facing prose. Specifically:

  • Line 4 (frontmatter): description: Run langfuse experiments in GitHub Actions to catch quality regressions before releasing changes to production.
  • Line 14 (body): You can now run langfuse experiments in GitHub Actions... — and within the same paragraph, ...tests your application against a langfuse dataset... and ...tracks the experiment run in langfuse.

That is five occurrences of langfuse where the brand name should be Langfuse.

Why this is inconsistent. Grepping content/changelog/*.mdx shows every other changelog entry capitalizes "Langfuse" in description: fields and prose bodies (e.g. "Categorize and manage your traces with tags in Langfuse.", "Langfuse now offers compatibility...", "Deploy Langfuse behind a custom base path...", "Langfuse says hello to Anthropic Claude..."). The new file is the only entry under content/changelog/ with lowercase langfuse as a brand reference in prose.

What is and isn't a bug. The GitHub repo path langfuse/experiment-action (line 14, inside the markdown link target) is correctly lowercase — that is the literal owner/repo slug on GitHub and should not change. Only the five brand-name prose occurrences should be capitalized.

Step-by-step proof.

  1. Open content/changelog/2026-05-05-experiment-ci-cd-gates.mdx line 4 — description: Run langfuse experiments.... Compare to the description field of any neighbouring changelog file (e.g. 2025-11-20-*, 2025-01-21-audit-logs.mdx, 2024-04-22-posthog-integration.mdx); all use capital Langfuse.
  2. Open line 14 — the prose paragraph contains three more lowercase langfuse references plus the legitimate [langfuse/experiment-action] repo link.
  3. The published changelog page (and Open Graph preview) will render the lowercase brand name verbatim, where every other entry on the same index page reads "Langfuse".

Impact. Pure brand/style consistency — no functional break. The page still renders, the link still resolves, and search will still find the entry. But the inconsistency is visible to anyone scanning the changelog index because the description appears next to other entries that consistently capitalize the product name.

Fix. Replace langfuse → Langfuse in the five prose occurrences on lines 4 and 14, while leaving the langfuse/experiment-action repo path unchanged. Five-character diff.

Severity. nit — purely stylistic/branding, fixable in a single targeted edit while the file is being added.

wochinge (Contributor, Author):

lowercase is the new style that we want to use, no?

Doesn't appear to be — every other 2026 changelog still capitalizes "Langfuse" in prose, including 2026-04-27-langfuse-cloud-japan.mdx (just a week before this entry, e.g. "Langfuse Cloud is now live...", "What is Langfuse Cloud Japan?") and 2026-04-13-experiments-rebuild.mdx ("Experiments are now a primary concept in Langfuse..."). If lowercase langfuse is genuinely the new convention, this is the first changelog to adopt it and the rest of content/changelog/ will read inconsistently on the index page until the others are migrated. If it's not the new convention, capitalizing the five prose occurrences on lines 4 and 14 brings this entry in line with the existing style. Either way, worth a quick check with whoever owns the style guide before merging.

wochinge requested a review from annabellscha May 5, 2026 14:06
Comment thread data/authors.json Outdated

Lotte-Verheyden (Member) left a comment

Content-wise the page is strong, but I would refrain from adding another docs page for it here (it should not live on the same level as experiments via SDKs). What about:

  • making this a guide page
  • having a section on the experiments via SDK page that links to the guide page for details on how to implement it. That section can be what's now the intro of your CI/CD page.

Wdyt?

Comment thread content/docs/evaluation/experiments/experiments-via-sdk.mdx

wochinge (Contributor, Author) commented May 6, 2026

@Lotte-Verheyden Done. The main difference from your suggestion is that I kept a snippet of the GitHub Action usage in the main docs. This is in line with how e.g. Braintrust does it.

wochinge requested a review from Lotte-Verheyden May 6, 2026 10:08
Labels

docs size:XL This PR changes 500-999 lines, ignoring generated files.
