gemini-cli-evalkit is an eval-focused fork of Gemini CLI for building and
validating a behavioral eval workflow around /generate-eval and /evals. The
fork is centered on one product loop: detect a bad agent behavior, turn it into
a local eval-rule combo, inspect coverage and registry state, and optionally
prepare that work for upstream contribution.
This repository is not a general-purpose rebrand of Gemini CLI. It is a working implementation and research sandbox for a validated eval ecosystem: misbehavior detection, eval generation, local rule lifecycle management, marketplace-backed discovery, and coverage analysis.
- `/generate-eval` for creating a local eval + rule from failure context or a manual correction
- `/evals` as a tabbed TUI for browsing, managing, and generating eval-rule combos
- Flat `/evals` subcommands for list, browse, coverage, show, enable, disable, install, uninstall, and delete flows
- Misbehavior detection that captures user corrections and seeds eval generation
- A local-first eval-rule lifecycle under `.gemini/`
- Registry-backed marketplace discovery through `eval-registry.json`
- Static coverage analysis for tool and behavioral gaps
Start the CLI, reproduce a behavior issue, then run:
```
/generate-eval agent should have asked for clarification before running the command
```

If misbehavior detection has already captured context from the conversation,
/generate-eval can reuse that pending context automatically. Otherwise, it
builds a draft from the recent conversation history or from the manual
description you provide.
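That priority order can be sketched as a small pure function. The names below (`DraftSource`, `buildDraftSource`) are illustrative only, not the fork's actual API:

```typescript
// Illustrative sketch of the draft-source priority used by /generate-eval:
// pending misbehavior context > recent conversation history > manual text.
type DraftSource =
  | { kind: "pending"; context: string }
  | { kind: "history"; turns: string[] }
  | { kind: "manual"; description: string };

function buildDraftSource(
  pending: string | null,
  history: string[],
  manualDescription: string,
): DraftSource {
  // 1. Reuse context already captured by misbehavior detection.
  if (pending !== null) return { kind: "pending", context: pending };
  // 2. Otherwise derive a draft from the recent conversation.
  if (history.length > 0) return { kind: "history", turns: history.slice(-10) };
  // 3. Fall back to the user's manual description.
  return { kind: "manual", description: manualDescription };
}
```

For example, `buildDraftSource(null, [], "ran rm without confirming")` yields a `manual` source.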
Open the eval UI:
```
/evals
```

The TUI provides:

- **All** to browse official, community, and generated evals
- **Installed** to manage locally active eval-rule combos
- **Coverage** to inspect current tool and behavioral gaps
- **Generate** to create a new eval-rule combo directly from the UI
- **GEMINI.md** to inspect active generated rule blocks
Generated and installed eval-rule combos are stored under `.gemini/`, where the
eval files, rule files, metadata, and managed `GEMINI.md` rule fragments live
together. This keeps the workflow local-first and usable even when marketplace
or upstream flows are unavailable.
```
/generate-eval <describe what the agent did wrong>
```

Behavior:

- uses pending misbehavior context when available
- otherwise derives context from recent conversation history
- generates an eval draft and a paired rule fragment
- saves accepted drafts into `.gemini/evals/` and `.gemini/eval-rules/`
- updates `.gemini/GEMINI.md` with managed rule blocks
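Accepted drafts land as `.eval.ts` files. As a rough illustration only — this interface and eval are invented for the sketch and are not the fork's real eval format — a generated eval might pair a replay prompt with a transcript check:

```typescript
// Hypothetical shape for a generated eval: a prompt to replay and a
// predicate over the agent's transcript. Not the fork's actual interface.
interface GeneratedEval {
  name: string;
  prompt: string;
  // Returns true when the transcript shows the desired behavior.
  check: (transcript: string) => boolean;
}

const askBeforeDestructiveCommand: GeneratedEval = {
  name: "ask-before-destructive-command",
  prompt: "delete everything in the build directory",
  // Pass if the agent asked a clarifying question instead of running
  // the command immediately.
  check: (transcript) =>
    /clarif|confirm|are you sure/i.test(transcript) &&
    !/^rm -rf/m.test(transcript),
};
```

The paired `.rule.md` fragment would then state the behavior rule ("ask before destructive commands") that the managed `GEMINI.md` block injects into future sessions.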
Default behavior opens the eval marketplace TUI.
Supported subcommands:
- `/evals list`
- `/evals browse`
- `/evals coverage`
- `/evals show <name>`
- `/evals install <name>`
- `/evals uninstall <name>`
- `/evals enable <name>`
- `/evals disable <name>`
- `/evals delete <name>`
The eval workflow is local-first and persists under `.gemini/`:

```
.gemini/
  GEMINI.md
  evals/
    <generated>.eval.ts
  eval-rules/
    index.json
    <generated-or-installed>.rule.md
```
Key files:
- `.gemini/evals/`: generated local evals
- `.gemini/eval-rules/index.json`: installed/generated metadata and analytics
- `.gemini/GEMINI.md`: active managed rule fragments
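As a hypothetical sketch of what an `index.json` entry could track — field names here are assumptions for illustration; inspect your own `.gemini/eval-rules/index.json` for the real schema:

```typescript
// Hypothetical entry shape for .gemini/eval-rules/index.json.
// Field names are illustrative; the real schema may differ.
interface EvalRuleEntry {
  name: string;
  source: "generated" | "installed";
  enabled: boolean;
  evalFile: string; // relative path under .gemini/
  ruleFile: string; // relative path under .gemini/
  installedAt: string; // ISO timestamp
}

const index: EvalRuleEntry[] = [
  {
    name: "ask-before-destructive-command",
    source: "generated",
    enabled: true,
    evalFile: "evals/ask-before-destructive-command.eval.ts",
    ruleFile: "eval-rules/ask-before-destructive-command.rule.md",
    installedAt: "2024-01-01T00:00:00Z",
  },
];

// Enable/disable flows would toggle `enabled`, and only active entries
// would have their rule fragments written into the managed GEMINI.md blocks.
const active = index.filter((e) => e.enabled).map((e) => e.name);
```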
The CLI layer handles user actions and presentation through `/generate-eval`,
`/evals`, and the `EvalsMarketplaceView` TUI. The core eval domain in
`packages/core/src/evals/` owns the lifecycle logic: misbehavior detection,
draft generation, registry fetching and caching, coverage analysis, and local
state management.

Data flow is local-first. Installs and generated artifacts are written to
`.gemini/`, while remote registry access is optional and used only for
marketplace discovery. This lets the core eval workflow continue to function
even when network-backed features are unavailable.
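That fallback behavior can be sketched as follows; the fetcher and cache are stand-ins (and synchronous for simplicity), not the fork's actual registry client:

```typescript
// Sketch of optional registry access with a local-first fallback.
interface RegistryEntry {
  name: string;
  description: string;
}

function loadRegistry(
  fetchRemote: () => RegistryEntry[],
  readCache: () => RegistryEntry[] | null,
): { entries: RegistryEntry[]; fromCache: boolean } {
  try {
    // Remote access is only needed for marketplace discovery.
    return { entries: fetchRemote(), fromCache: false };
  } catch {
    // Network unavailable: fall back to the cached copy (or nothing),
    // so the local eval workflow keeps working.
    return { entries: readCache() ?? [], fromCache: true };
  }
}
```

The key design point is that a registry failure degrades to an empty or cached marketplace view rather than blocking the local eval lifecycle.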
| Path | Description |
| --- | --- |
| `packages/core/src/evals` | Core eval logic: generation, registry, coverage, rule management, types |
| `packages/cli/src/ui/commands` | CLI entrypoints for `/evals` and `/generate-eval` |
| `packages/cli/src/ui/components/views` | TUI implementation, including `EvalsMarketplaceView` |
| `eval-registry.json` | Marketplace registry data used by the eval browser |
| `docs/cli/evals.md` | Contributor-facing guide for the `/evals` and `/generate-eval` workflow |
| `docs/reference/commands.md` | Slash command reference, including `/evals` and `/generate-eval` |
| `evals/README.md` | Behavioral eval suite documentation inherited from the upstream project |
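The static coverage analysis owned by `packages/core/src/evals` can be approximated as diffing the set of available tools against the tools exercised by existing evals. The tool and eval names below are made up for illustration:

```typescript
// Simplified coverage check: which tools have no eval exercising them?
// Tool and eval names are illustrative, not the fork's real data.
function findToolGaps(
  tools: string[],
  evalToolRefs: Record<string, string[]>, // eval name -> tools it exercises
): string[] {
  const covered = new Set(Object.values(evalToolRefs).flat());
  return tools.filter((t) => !covered.has(t));
}

const gaps = findToolGaps(
  ["read_file", "write_file", "run_shell_command"],
  {
    "ask-before-destructive-command": ["run_shell_command"],
    "no-silent-overwrites": ["write_file"],
  },
);
// gaps -> ["read_file"]
```

A real analysis would also flag behavioral gaps (categories of misbehavior with no eval), but the tool-diff above captures the basic shape.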
Install dependencies:

```
npm ci
```

Start the CLI in development mode:

```
npm run start
```

Useful commands:

```
npm run build
npm run lint
npm run typecheck
```

Run the eval suites:

```
npm run test:always_passing_evals
npm run test:all_evals
```

Run integration tests:

```
npm run test:integration:sandbox:none
```

The product problem here is not just "write more eval files." It is how to make behavioral eval work discoverable, local-first, and contribution-ready:
- detect failures from real user interactions
- convert those failures into reproducible evals
- pair evals with local behavior rules
- manage them through a dedicated `/evals` surface
- understand what coverage is still missing
This fork exists to build and validate that workflow end to end.