docs: add experiment CI/CD integration guide#2888
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
@claude review |
There was a problem hiding this comment.
Author addressed the substantive feedback from my prior review (echo quoting, TS eval parameter, Python eval loop variable) in commit e29c238 — remaining items are docs-only nits, approving.
Extended reasoning...
Overview
Docs-only PR adding a new CI/CD integration page (experiments-in-ci-cd.mdx) for Langfuse experiments, with the existing Pytest/Vitest CI examples relocated from experiments-via-sdk.mdx and replaced with cross-links there. meta.json registers the new page in the experiments nav. No application code, schemas, or auth surfaces are touched.
Security risks
None. The page is rendered as MDX docs content. The content does include CI/CD recipes — script-injection via single-quoted echo '${{ ... }}' was the one real footgun a reader might copy verbatim, and that has now been fixed in commit e29c238 to use the env: + printf pattern recommended by GitHub, so the remaining sample is safe.
Level of scrutiny
Light. This is documentation only; the worst-case impact of a defect is a confused reader rather than a broken production system. Bug-hunting found a few documentation-quality issues (TS strict-mode eval parameter, Python eval shadowing the built-in, fragile shell quoting) and the author addressed all three in e29c238. The remaining nits — next() without an explicit default in the moved Pytest snippets, and the actions/checkout@v6 / setup-python@v6 / setup-node@v6 pins — are either pre-existing patterns carried over from the source page or potentially valid given the time elapsed since my knowledge cutoff (today is 2026-05-04, cutoff January 2026).
Other factors
The PR has clear intent, is well-organized (canonicalizes CI gating docs in one place rather than duplicating across pages), and the author has been responsive to bot feedback within the same review cycle. No CODEOWNERS-protected files, no test coverage to worry about (it is the test guidance), and the cross-links between the two pages are coherent. Given the docs-only scope and the addressed feedback, this is comfortable to shadow-approve.
There was a problem hiding this comment.
Docs-only PR; all my prior substantive concerns (eval rename, shell quoting via env var, if: always() on the storage step, LangTabs label consistency) are reflected in the current diff.
Extended reasoning...
Overview
This PR adds a new content/docs/evaluation/experiments/experiments-in-ci-cd.mdx page documenting CI/CD integration for Langfuse experiments (GitHub Action workflow, RunnerContext contract, RegressionError, secrets, action inputs/outputs), moves the existing Pytest/Vitest examples out of experiments-via-sdk.mdx (replaced with a cross-link), registers the new page in meta.json, adds a changelog entry, and adds an author entry + headshot.
Security risks
None material. Modifications are confined to MDX docs, a JSON authors file, and a JPG headshot — no application code, no auth, crypto, or permissions code is touched. The most security-relevant aspect is the example shell snippet for storing result_json, which now correctly uses the env: indirection pattern recommended by GitHub's own security hardening guide rather than interpolating ${{ ... }} directly into a shell command.
Level of scrutiny
Low-to-medium. This is a documentation-only change (size:XL by line count, but all prose + examples). The page has been through two prior bot review rounds and an explicit author re-tag, with each substantive concern I raised in earlier passes addressed in the current diff. The remaining open thread (@v6 action pins) was resolved by the author and the bug hunting system did not re-flag it in this run — this is the kind of judgment call that does not benefit from another bot reiteration.
Other factors
The bug hunting system in this run reported no bugs. All my prior inline comments are marked resolved on the GitHub side, and the corresponding code in the current diff matches the suggested fixes (eval/evaluation rename in three Python comprehensions, env-var pattern on the storage step, if: always() on the storage step, uniform ["Python SDK", "JS/TS SDK"] LangTabs labels in all three locations on the new page). The remaining @v6 pins are the author's intentional choice given their current-date context, and re-flagging them would be noise per the broken-record rule.
| </Tab> | ||
| </LangTabs> | ||
|
|
||
| #### Testing in CI Environments |
There was a problem hiding this comment.
moved this to the new page
There was a problem hiding this comment.
LGTM — docs-only PR with prior feedback addressed; remaining nit on brand-name capitalization is non-blocking.
Extended reasoning...
Overview
This PR adds a new CI/CD integration guide (content/docs/evaluation/experiments/experiments-in-ci-cd.mdx), a corresponding changelog entry (content/changelog/2026-05-05-experiment-ci-cd-gates.mdx), refactors experiments-via-sdk.mdx to cross-link to the new page rather than duplicate Pytest/Vitest examples, registers the page in meta.json, and adds a new author entry plus avatar. All file changes are documentation, navigation config, an author profile, and image assets.
Security risks
None — pure docs/content. No code, no auth, no infra. The added image (avatar/OG image) is a static asset, and the changelog/MDX are standard Nextra content. The included YAML samples are illustrative only.
Level of scrutiny
Low scrutiny appropriate. This is a docs PR with no executable impact, no CODEOWNER concerns flagged in the diff, and no shared production state. Multiple rounds of substantive review have already happened in this PR's timeline (action versions, shell-quoting, eval shadowing, if: always(), LangTabs labels) and the author has addressed all of them — the current diff already reflects actions/checkout@v6 etc. (the team's call to keep), if: always() on the storage step, printf '%s' "$RESULT_JSON" env-var pattern, evaluation in place of eval, and next(..., None) with isinstance guards.
Other factors
The only remaining finding from this run is a stylistic nit — the new changelog uses lowercase langfuse as a brand name in five prose locations, while every other entry in content/changelog/ capitalizes it. That inline comment is already attached for the author to consider; it is not a release blocker. Given the PR is well-scoped, the substantive feedback has been resolved across prior review cycles, and only a brand-capitalization nit remains, this is safe to approve without further human review.
| description: Run langfuse experiments in GitHub Actions to catch quality regressions before releasing changes to production. | ||
| author: Tobias Wochinger | ||
| ogImage: /images/changelog/2026-05-05-experiment-ci-cd.png | ||
| canonical: /docs/evaluation/experiments/experiments-in-ci-cd | ||
| --- | ||
|
|
||
| import { ChangelogHeader } from "@/components/changelog/ChangelogHeader"; | ||
|
|
||
| <ChangelogHeader /> | ||
|
|
||
| You can now run langfuse experiments in GitHub Actions and catch quality regressions before they ship. The new [langfuse/experiment-action](https://github.com/langfuse/experiment-action) tests your application against a langfuse dataset, reports the result directly on the pull request, and tracks the experiment run in langfuse. |
There was a problem hiding this comment.
🟡 The new changelog at content/changelog/2026-05-05-experiment-ci-cd-gates.mdx writes the product name as lowercase langfuse in five places on lines 4 and 14 (frontmatter description: plus three references in the body), e.g. "Run langfuse experiments...", "a langfuse dataset", "the experiment run in langfuse". Every other entry under content/changelog/ capitalizes "Langfuse" in user-facing prose, so this entry is the lone outlier. Suggest capitalizing the five brand-prose occurrences while keeping the GitHub repo path langfuse/experiment-action correctly lowercase.
Extended reasoning...
What the bug is. The new changelog entry uses lowercase langfuse as the product name in user-facing prose. Specifically:
- Line 4 (frontmatter):
description: Run langfuse experiments in GitHub Actions to catch quality regressions before releasing changes to production. - Line 14 (body):
You can now run langfuse experiments in GitHub Actions...— and within the same paragraph,...tests your application against a langfuse dataset...and...tracks the experiment run in langfuse.
That is five occurrences of langfuse where the brand name should be Langfuse.
Why this is inconsistent. Grepping content/changelog/*.mdx shows every other changelog entry capitalizes "Langfuse" in description: fields and prose bodies (e.g. "Categorize and manage your traces with tags in Langfuse.", "Langfuse now offers compatibility...", "Deploy Langfuse behind a custom base path...", "Langfuse says hello to Anthropic Claude..."). The new file is the only entry under content/changelog/ with lowercase langfuse as a brand reference in prose.
What is and isn't a bug. The GitHub repo path langfuse/experiment-action (line 14, inside the markdown link target) is correctly lowercase — that is the literal owner/repo slug on GitHub and should not change. Only the five brand-name prose occurrences should be capitalized.
Step-by-step proof.
- Open
content/changelog/2026-05-05-experiment-ci-cd-gates.mdxline 4 —description: Run langfuse experiments.... Compare to the description field of any neighbouring changelog file (e.g.2025-11-20-*,2025-01-21-audit-logs.mdx,2024-04-22-posthog-integration.mdx); all use capitalLangfuse. - Open line 14 — the prose paragraph contains three more lowercase
langfusereferences plus the legitimate[langfuse/experiment-action]repo link. - The published changelog page (and Open Graph preview) will render the lowercase brand name verbatim, where every other entry on the same index page reads "Langfuse".
Impact. Pure brand/style consistency — no functional break. The page still renders, the link still resolves, and search will still find the entry. But the inconsistency is visible to anyone scanning the changelog index because the description appears next to other entries that consistently capitalize the product name.
Fix. Replace langfuse → Langfuse in the five prose occurrences on lines 4 and 14, while leaving the langfuse/experiment-action repo path unchanged. Five-character diff.
Severity. nit — purely stylistic/branding, fixable in a single targeted edit while the file is being added.
There was a problem hiding this comment.
lowercase is the new style that we want to use, no?
There was a problem hiding this comment.
Doesn't appear to be — every other 2026 changelog still capitalizes "Langfuse" in prose, including 2026-04-27-langfuse-cloud-japan.mdx (just a week before this entry, e.g. "Langfuse Cloud is now live...", "What is Langfuse Cloud Japan?") and 2026-04-13-experiments-rebuild.mdx ("Experiments are now a primary concept in Langfuse..."). If lowercase langfuse is genuinely the new convention, this is the first changelog to adopt it and the rest of content/changelog/ will read inconsistently on the index page until the others are migrated. If it's not the new convention, capitalizing the five prose occurrences on lines 4 and 14 brings this entry in line with the existing style. Either way, worth a quick check with whoever owns the style guide before merging.
Lotte-Verheyden
left a comment
There was a problem hiding this comment.
Content-wise the page is strong, but I would try to refrain from adding another docs page for it here (it should not live on the same level as experiments via SDKs). What about
- making this a guide page
- having a section on the experiments via SDK page that then links to the guide page for details on how to implement it. The section on the experiments via SDK page can be what's now the intro of your CI/CD page
Wdyt?
|
@Lotte-Verheyden Done - main difference to your suggestion is that I kept a snippet of GH action usage in the main docs. This is in line how e.g. braintrust does it |
Summary
RunnerContext, regression failures, and non-GitHub CI patterns.experiment(context: RunnerContext)contract so GitHub Action examples usecontext.runExperiment/context.run_experimentwith action-injected dataset, SDK client, and metadata defaults.Linear
Major Decisions
RunnerContextflow, while keeping direct SDK-based examples for other CI/CD systems.Disclaimer: Experimental PR review
Greptile Summary
This PR adds a dedicated CI/CD integration guide for Langfuse experiments, covering the
langfuse/experiment-actionGitHub Actions workflow,RunnerContextcontract, regression gating viaRegressionError, and equivalent Pytest/Vitest patterns for other CI systems. The existing CI examples are moved from the SDK page to the new page with clean cross-links.evalas an arrow-function parameter name (lines 534 and 557), which is a reserved identifier in TypeScript strict mode and will cause a compilation error for anyone copying the snippet.evalvia loop variables, andnext()without a default can raiseStopIterationsilently instead of a clear test failure.Confidence Score: 3/5
Not safe to merge as-is — the TypeScript code examples contain a strict-mode syntax error that will fail compilation for any reader who copies them.
Two P1 findings (using
evalas a TypeScript parameter name — a hard syntax error in strict mode) pull the score below the P1 ceiling of 4. The remaining findings are P2 style/robustness issues.content/docs/evaluation/experiments/experiments-in-ci-cd.mdx — Vitest TypeScript example (lines 534, 557) and Pytest Python example (lines 418–420, 443–445).
Important Files Changed
evalas a parameter name, which is a syntax error in strict mode; Python examples shadow the built-ineval;next()without a default can surface a confusing StopIteration; echo quoting in the action output YAML snippet is fragile.experiments-in-ci-cdpage in the experiments navigation — trivial and correct.Sequence Diagram
sequenceDiagram participant PR as Pull Request participant GHA as GitHub Actions participant Action as langfuse/experiment-action participant Script as experiment(context) participant LF as Langfuse API PR->>GHA: push / pull_request event GHA->>Action: run with inputs (dataset_name, keys, metadata) Action->>LF: fetch dataset items (dataset_name + dataset_version) Action->>Script: call experiment(RunnerContext) Script->>LF: run_experiment / runExperiment (task + evaluators) LF-->>Script: ExperimentResult (run_evaluations) Script-->>Action: return result (or raise RegressionError) Action->>GHA: set outputs (result_json, failed) Action->>PR: post/update PR comment (pass/regression/error + scores)Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "docs: add experiment CI/CD integration g..." | Re-trigger Greptile