Commit ab24b48

Authored by patrykkopycinski, claude, kibanamachine, and spong
feat(evals): create @kbn/evals-extensions foundation package (elastic#258775)
## Summary

Creates the foundation package `@kbn/evals-extensions` for advanced evaluation capabilities. This package will house features ported from cursor-plugin-evals and serve as the home for Phases 3-5 of the evals roadmap.

## Architecture

**One-way dependency:**
- ✅ kbn-evals-extensions depends on kbn-evals
- ❌ kbn-evals has NO dependency on kbn-evals-extensions

Evaluation suites opt-in by importing from extensions directly.

## What's Included

- ✅ Package structure and build configuration
- ✅ Comprehensive documentation
- ✅ 5 passing unit tests
- ✅ CODEOWNERS entry
- ✅ No functional changes

## Validation

- ✅ Bootstrap, type check, tests, eslint, check_changes.ts all passed
- ✅ No circular dependencies

## Roadmap

This enables PRs elastic#2-10 for cost tracking, dataset management, safety evaluators, UI components, DX enhancements, analytics, A/B testing, human-in-the-loop, and IDE integration.

## Related

- Part of elastic#257821
- Enables elastic#257823, elastic#257824, elastic#257825, elastic#257826
- Addresses elastic#255820

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Garrett Spong <garrett.spong@elastic.co>
1 parent ebf83e0 commit ab24b48

24 files changed

Lines changed: 685 additions & 1 deletion

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -997,6 +997,7 @@ x-pack/platform/packages/shared/kbn-entities-schema @elastic/core-analysis
 x-pack/platform/packages/shared/kbn-es-snapshot-loader @elastic/obs-ai-team
 x-pack/platform/packages/shared/kbn-evals @elastic/obs-ai-team @elastic/security-generative-ai
 x-pack/platform/packages/shared/kbn-evals-common @elastic/obs-ai-team @elastic/security-generative-ai
+x-pack/platform/packages/shared/kbn-evals-extensions @elastic/obs-ai-team @elastic/security-generative-ai
 x-pack/platform/packages/shared/kbn-evals-phoenix-executor @elastic/obs-ai-team
 x-pack/platform/packages/shared/kbn-evals-suite-streams @elastic/obs-onboarding-team @elastic/obs-sig-events-team
 x-pack/platform/packages/shared/kbn-event-stacktrace @elastic/obs-presentation-team @elastic/obs-exploration-team

package.json

Lines changed: 1 addition & 0 deletions
@@ -1690,6 +1690,7 @@
     "@kbn/eslint-plugin-telemetry": "link:packages/kbn-eslint-plugin-telemetry",
     "@kbn/esql-resource-browser-storybook-config": "link:src/platform/packages/shared/kbn-esql-resource-browser/.storybook",
     "@kbn/evals": "link:x-pack/platform/packages/shared/kbn-evals",
+    "@kbn/evals-extensions": "link:x-pack/platform/packages/shared/kbn-evals-extensions",
     "@kbn/evals-phoenix-executor": "link:x-pack/platform/packages/shared/kbn-evals-phoenix-executor",
     "@kbn/evals-suite-agent-builder": "link:x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder",
     "@kbn/evals-suite-endpoint": "link:x-pack/solutions/security/packages/kbn-evals-suite-endpoint",

tsconfig.base.json

Lines changed: 2 additions & 0 deletions
@@ -1144,6 +1144,8 @@
       "@kbn/evals/*": ["x-pack/platform/packages/shared/kbn-evals/*"],
       "@kbn/evals-common": ["x-pack/platform/packages/shared/kbn-evals-common"],
       "@kbn/evals-common/*": ["x-pack/platform/packages/shared/kbn-evals-common/*"],
+      "@kbn/evals-extensions": ["x-pack/platform/packages/shared/kbn-evals-extensions"],
+      "@kbn/evals-extensions/*": ["x-pack/platform/packages/shared/kbn-evals-extensions/*"],
       "@kbn/evals-phoenix-executor": ["x-pack/platform/packages/shared/kbn-evals-phoenix-executor"],
       "@kbn/evals-phoenix-executor/*": ["x-pack/platform/packages/shared/kbn-evals-phoenix-executor/*"],
       "@kbn/evals-plugin": ["x-pack/platform/plugins/shared/evals"],
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
# Build output
target/
*.js
!jest.config.js
*.d.ts
tsconfig.tsbuildinfo

# Dependencies
node_modules/

# IDE
.vscode/
.idea/

# OS
.DS_Store
Thumbs.db
Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
# @kbn/evals-extensions

Advanced evaluation capabilities for `@kbn/evals` - **standalone extensions package**.

## Purpose

This package extends `@kbn/evals` with advanced features ported from [cursor-plugin-evals](https://github.com/patrykkopycinski/cursor-plugin-evals) and serves as the home for Phases 3-5 of the evals roadmap.

## Architecture: Independent Package Design

**Critical principle:** `@kbn/evals` must stay **completely independent** of this package. The dependency is one-way: `@kbn/evals-extensions` may import from `@kbn/evals`, never the reverse.

```
┌─────────────────────────────────────────────────────┐
│                  Evaluation Suites                  │
│     (agent-builder, obs-ai-assistant, security)     │
└──────────────────┬──────────────────────────────────┘

        ┌──────────┴──────────┐
        │                     │
        ▼                     ▼
┌──────────────────┐   ┌─────────────────────────────┐
│  @kbn/evals      │   │  @kbn/evals-extensions      │
│  (core)          │   │  (advanced features)        │
│                  │   │                             │
│  ✅ Evaluators   │   │  ✅ Safety evaluators       │
│  ✅ Scout/PW     │   │  ✅ Cost tracking           │
│  ✅ ES export    │   │  ✅ Dataset management      │
│  ✅ Stats        │   │  ✅ UI components           │
│  ✅ CLI basics   │   │  ✅ Watch mode              │
│                  │   │  ✅ A/B testing             │
│  ❌ NO imports   │   │  ✅ Human-in-the-loop       │
│    from ext ─────┼───┼──X                          │
│                  │   │                             │
└──────────────────┘   └──────────┬──────────────────┘

                                  │ depends on

                         ┌──────────────────┐
                         │  @kbn/evals      │
                         │  (types, utils)  │
                         └──────────────────┘
```

**Dependency Rules:**
- ✅ `kbn-evals-extensions` CAN import from `kbn-evals`
- ❌ `kbn-evals` MUST NOT import from `kbn-evals-extensions`
- ✅ Evaluation suites can use both packages independently

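The dependency rules above could also be checked mechanically. The following is a minimal sketch, not part of this commit: `findForbiddenImports` and the `SourceFile` shape are hypothetical names, and the check is a plain text scan of source contents rather than a real module-graph analysis.

```typescript
// Hypothetical helper (not in this commit): reports files that import a forbidden package.
interface SourceFile {
  path: string;
  contents: string;
}

// Escape characters that have special meaning inside a regular expression.
const escapeRegExp = (value: string): string => value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

function findForbiddenImports(files: SourceFile[], forbiddenPackage: string): string[] {
  const name = escapeRegExp(forbiddenPackage);
  // Matches `from '<pkg>'`, `import '<pkg>'`, `require('<pkg>')`, `import('<pkg>')`,
  // including deep imports such as '<pkg>/src/foo'.
  const pattern = new RegExp(
    `(?:from\\s+|import\\s+|require\\(\\s*|import\\(\\s*)['"]${name}(?:/[^'"]*)?['"]`
  );
  return files.filter((file) => pattern.test(file.contents)).map((file) => file.path);
}
```

Run against the sources of `kbn-evals` with `'@kbn/evals-extensions'` as the forbidden package, an empty result would indicate that the one-way rule holds for static imports. A text scan can miss dynamically constructed import paths, so it complements rather than replaces the TypeScript project references.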
## Features

### Current Status: Foundation (PR #1)
- ✅ Package structure established
- ✅ Build configuration
- ✅ Test infrastructure
- ❌ No functional features yet (placeholder exports only)

### Roadmap

#### **PR #2: Cost Tracking & Metadata** (Weeks 2-3)
- Token-based cost calculation
- Hyperparameter tracking (temperature, top_p, etc.)
- Environment snapshots (Kibana/ES versions, plugins)
- Run tagging and annotations

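To make "token-based cost calculation" concrete, here is a small sketch of what such a calculation might look like. Everything in it is an assumption for illustration: the `TokenUsage` and `ModelPricing` types, the `calculateRunCostUsd` name, and the per-1k-token pricing model are hypothetical and are not the API planned for PR #2.

```typescript
// Hypothetical types: the field names and pricing model are illustrative assumptions.
interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
}

interface ModelPricing {
  promptUsdPer1kTokens: number;
  completionUsdPer1kTokens: number;
}

// Sums the cost of every model call in a run, pricing prompt and completion tokens separately.
function calculateRunCostUsd(usages: TokenUsage[], pricing: ModelPricing): number {
  return usages.reduce(
    (total, usage) =>
      total +
      (usage.promptTokens / 1000) * pricing.promptUsdPer1kTokens +
      (usage.completionTokens / 1000) * pricing.completionUsdPer1kTokens,
    0
  );
}
```

Pricing prompt and completion tokens separately reflects the common case where the two are billed at different rates; the actual rates would have to come from configuration.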
#### **PR #3: Dataset Management** (Weeks 4-6)
- Dataset versioning (semantic versioning)
- Schema validation (Zod-based)
- Deduplication (similarity-based)
- Merging and splitting utilities
- Filtering and statistics

#### **PR #4: Safety Evaluators** (Weeks 7-10)
- Toxicity detection
- PII detection
- Bias detection
- Hallucination detection
- Refusal testing
- Content moderation

#### **PR #5: UI Components** (Weeks 11-16)
- Run comparison viewer (side-by-side diff)
- Example explorer (worst-case analysis)
- Score distribution charts
- Integration with evals Kibana plugin

#### **PR #6: DX Enhancements** (Weeks 17-21)
- Watch mode (auto-rerun on changes)
- Parallel execution (multi-suite concurrency)
- Result caching (skip unchanged examples)
- Incremental evaluation (delta-only runs)
- Interactive mode (step-through debugging)
- Dry-run mode (validation without execution)

#### **PR #7: Advanced Analytics** (Weeks 22-24)
- Confidence intervals (bootstrapping)
- Outlier detection (Z-score, IQR, Isolation Forest)
- Failure clustering (K-means, hierarchical)
- Error taxonomy
- Ensemble evaluation
- Calibration analysis

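As an illustration of the "confidence intervals (bootstrapping)" item, the sketch below computes a percentile bootstrap interval for the mean of a list of evaluation scores. It is an assumption-laden example, not the planned implementation: `bootstrapMeanCI` and its options are hypothetical names, and the random source is injectable only so that runs can be made reproducible.

```typescript
// Hypothetical helper: percentile bootstrap confidence interval for the mean score.
interface BootstrapOptions {
  resamples?: number; // number of bootstrap resamples to draw
  confidence?: number; // e.g. 0.95 for a 95% interval
  random?: () => number; // returns a value in [0, 1); defaults to Math.random
}

interface MeanInterval {
  mean: number;
  lower: number;
  upper: number;
}

function bootstrapMeanCI(scores: number[], options: BootstrapOptions = {}): MeanInterval {
  const { resamples = 1000, confidence = 0.95, random = Math.random } = options;
  if (scores.length === 0) throw new Error('scores must not be empty');
  if (resamples < 1) throw new Error('resamples must be at least 1');

  // Draw `resamples` samples with replacement and record each sample's mean.
  const means: number[] = [];
  for (let i = 0; i < resamples; i++) {
    let sum = 0;
    for (let j = 0; j < scores.length; j++) {
      sum += scores[Math.floor(random() * scores.length)];
    }
    means.push(sum / scores.length);
  }
  means.sort((a, b) => a - b);

  // Take the empirical alpha/2 and 1 - alpha/2 percentiles of the resampled means.
  const alpha = 1 - confidence;
  const lowerIndex = Math.floor((alpha / 2) * (resamples - 1));
  const upperIndex = Math.ceil((1 - alpha / 2) * (resamples - 1));
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  return { mean, lower: means[lowerIndex], upper: means[upperIndex] };
}
```

The percentile method is the simplest bootstrap variant; other variants (for example bias-corrected intervals) may be preferable for small or skewed score sets.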
#### **PR #8: A/B Testing & Active Learning** (Weeks 25-29)
- A/B testing framework with statistical tests
- Bandit algorithms (epsilon-greedy, UCB, Thompson sampling)
- Active learning (uncertainty and diversity sampling)

#### **PR #9: Human-in-the-Loop** (Weeks 30-35)
- Review queue UI
- Annotation interface
- Assignment workflow
- Inter-rater reliability
- Conflict resolution

#### **PR #10: IDE Integration** (Weeks 36-39)
- VS Code extension
- Cursor skills for eval authoring
- AI-assisted dataset creation

## Usage

### Opting In to Extensions

Evaluation suites import extensions explicitly. The extension exports shown here are planned; this foundation PR ships placeholder exports only.

```typescript
// Example: agent-builder evaluation suite (illustrative; extension exports are not available yet)
// `dataset`, `task`, and `createCorrectnessEvaluators` are assumed to be defined elsewhere in the suite
import { evaluate } from '@kbn/evals';
import {
  createToxicityEvaluator,
  createPiiDetector,
  createBiasEvaluator,
  costTracker,
  watchMode
} from '@kbn/evals-extensions';

evaluate('security test', async ({ executorClient }) => {
  // Mix core and extension evaluators
  await executorClient.runExperiment(
    { dataset, task },
    [
      ...createCorrectnessEvaluators(), // core kbn/evals
      createToxicityEvaluator(), // extension
      createPiiDetector(), // extension
    ]
  );

  // Use extension features
  await costTracker.logRunCost(executorClient.getRunId());
});
```

### Feature Flags

Extension features are opt-in through environment variables. The flags below belong to the DX enhancements planned for PR #6 and have no effect in this PR:

```bash
# Enable watch mode
KBN_EVALS_EXT_WATCH_MODE=true node scripts/evals run --suite <id>

# Enable parallel execution
KBN_EVALS_EXT_PARALLEL=true node scripts/evals run --suite <id>

# Enable result caching
KBN_EVALS_EXT_CACHE=true node scripts/evals run --suite <id>
```

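On the consuming side, the flags above might be read roughly as follows. This is a sketch under stated assumptions: the three variable names come from this README, but `readExtensionFlags`, the `ExtensionFeatureFlags` shape, and the rule that only the string `true` (case-insensitive) enables a feature are hypothetical.

```typescript
// Hypothetical flag reader: only the environment variable names come from the README.
interface ExtensionFeatureFlags {
  watchMode: boolean;
  parallel: boolean;
  cache: boolean;
}

type Env = Record<string, string | undefined>;

// Assumption: a flag is on only when its value is "true", ignoring case and surrounding whitespace.
const isEnabled = (value: string | undefined): boolean =>
  value !== undefined && value.trim().toLowerCase() === 'true';

function readExtensionFlags(env: Env): ExtensionFeatureFlags {
  return {
    watchMode: isEnabled(env.KBN_EVALS_EXT_WATCH_MODE),
    parallel: isEnabled(env.KBN_EVALS_EXT_PARALLEL),
    cache: isEnabled(env.KBN_EVALS_EXT_CACHE),
  };
}
```

In a Node.js process this would be called as `readExtensionFlags(process.env)`. Taking the environment as a parameter keeps the function easy to unit test without mutating global state.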
## Why a Separate Package?

1. **Clear boundaries** - Extensions don't pollute core framework
2. **Independent evolution** - Iterate without affecting core
3. **Optional adoption** - Suites choose which features to use
4. **Parallel development** - Teams work without conflicts
5. **Easier testing** - Integration tests isolated
6. **Future migration** - Can promote mature features to core later

## Vision Alignment

All features follow principles from "Future of @kbn/evals":
- **Trace-first**: Leverage OTel traces when applicable
- **Elastic-native**: No external dependencies
- **Shared layer**: Provide composable primitives
- **Code-defined**: Datasets versioned in code

## Development

### Running Tests

```bash
yarn test:jest --testPathPattern=kbn-evals-extensions
```

### Type Checking

```bash
yarn test:type_check --project x-pack/platform/packages/shared/kbn-evals-extensions/tsconfig.json
```

### Linting

```bash
node scripts/eslint --fix x-pack/platform/packages/shared/kbn-evals-extensions
```

## Contributing

See individual feature directories for contribution guidelines. All PRs should:
- Follow Kibana code standards
- Include unit tests
- Update this README with new exports
- Keep `@kbn/evals` core independent of this package (no imports from `@kbn/evals-extensions`)
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

/**
 * Basic package health checks for @kbn/evals-extensions
 */

import { EVALS_EXTENSIONS_VERSION } from '..';

describe('@kbn/evals-extensions', () => {
  describe('package structure', () => {
    it('should export EVALS_EXTENSIONS_VERSION', () => {
      expect(EVALS_EXTENSIONS_VERSION).toBe('1.0.0');
    });

    it('should be importable without errors', async () => {
      const mod = await import('..');
      expect(mod).toBeDefined();
    });
  });

  describe('dependency isolation', () => {
    it('should not create circular dependencies with @kbn/evals', async () => {
      // This test ensures we maintain one-way dependency:
      // kbn-evals-extensions → depends on → kbn-evals
      // kbn-evals → MUST NOT depend on → kbn-evals-extensions

      // Both packages should be importable
      const evalsExtensions = await import('..');
      const kbnEvals = await import('@kbn/evals');

      expect(evalsExtensions).toBeDefined();
      expect(kbnEvals).toBeDefined();

      // kbn-evals-extensions can use kbn-evals types (verified by compilation)
      // kbn-evals should have no knowledge of kbn-evals-extensions
      // This is enforced by TypeScript references in tsconfig.json
    });
  });

  describe('exports', () => {
    it('should re-export core types from @kbn/evals', async () => {
      // Type exports are verified at compile time
      // Runtime check just ensures module loads
      const exports = await import('..');
      expect(exports).toBeDefined();
    });
  });
});
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

/**
 * @kbn/evals-extensions - Advanced evaluation capabilities
 *
 * This package provides standalone extensions for @kbn/evals.
 * It does NOT modify the core @kbn/evals package.
 *
 * ## Architecture
 *
 * Dependency flow:
 * - ✅ kbn-evals-extensions → imports from → kbn-evals
 * - ❌ kbn-evals → MUST NOT import from → kbn-evals-extensions
 *
 * Evaluation suites can opt-in to extensions by importing directly:
 *
 * @example
 * ```typescript
 * import { evaluate } from '@kbn/evals';
 * import { createToxicityEvaluator, costTracker } from '@kbn/evals-extensions';
 *
 * evaluate('test', async ({ executorClient }) => {
 *   await executorClient.runExperiment(
 *     { dataset, task },
 *     [createToxicityEvaluator()] // Extension evaluator
 *   );
 *   await costTracker.logRunCost(runId); // Extension feature
 * });
 * ```
 *
 * ## Roadmap
 *
 * Features are being added incrementally:
 * - **PR #1**: Foundation (current) - Package setup, no functional changes
 * - **PR #2**: Cost tracking & metadata
 * - **PR #3**: Dataset management utilities
 * - **PR #4**: Safety evaluators (toxicity, PII, bias, etc.)
 * - **PR #5**: UI components (run comparison, example explorer)
 * - **PR #6**: DX enhancements (watch mode, caching, parallel execution)
 * - **PR #7**: Advanced analytics (confidence intervals, outlier detection)
 * - **PR #8**: A/B testing & active learning
 * - **PR #9**: Human-in-the-loop workflows
 * - **PR #10**: IDE integration (VS Code extension, Cursor skills)
 *
 * @packageDocumentation
 */

// Re-export core types from kbn-evals for convenience
// This allows users to import from one place, but doesn't create reverse dependency
export type { Evaluator, Example, EvaluationDataset, TaskOutput } from '@kbn/evals';

export type { EvaluationScoreDocument } from '@kbn/evals';

/**
 * Extension-specific types (to be populated in future PRs)
 */
export interface ExtensionConfig {
  /**
   * Configuration for extension features
   * Will be expanded as features are added
   */
  placeholder?: string;
}

/**
 * Feature exports (to be populated in future PRs)
 *
 * Examples of what will be exported:
 * - export { createToxicityEvaluator } from './src/evaluators/safety/toxicity';
 * - export { costTracker } from './src/tracking/cost_calculator';
 * - export { watchMode } from './src/execution/watch_mode';
 * - export { createABTest } from './src/experimentation/ab_testing/framework';
 * - export { reviewQueue } from './src/human_review/workflow/review_workflow';
 */

// Placeholder export to ensure package builds
export const EVALS_EXTENSIONS_VERSION = '1.0.0';
