163 changes: 147 additions & 16 deletions typescript/README.md
@@ -1,30 +1,161 @@
# DeepEval for TypeScript

> **Status:** Initial version shipping **June 5th**.
DeepEval for TypeScript brings the full DeepEval workflow into the JavaScript and TypeScript ecosystem, including local LLM evaluation, 40+ metrics, synthetic data generation, prompt optimization, and full Confident AI platform integration.

DeepEval for TypeScript brings the DeepEval workflow into the JavaScript and TypeScript ecosystem, starting with the Confident AI platform features teams already use to manage datasets, prompts, and evaluation reporting.
## Feature Parity (June 2026)

This package is designed for TypeScript teams that want first-class access to DeepEval workflows that integrate with Confident AI from the same language they use to build their applications.
The TypeScript SDK now provides near-complete parity with the Python package:

## What TypeScript Supports
### Local Evaluation Models
- 11+ model providers: OpenAI, Azure, Anthropic, Gemini, Bedrock, DeepSeek, Grok, Kimi, Local, Ollama, AISDK
- `ModelFactory` for auto-detecting providers from model name prefixes
- `DeepEvalBaseEmbeddingModel` with OpenAI embedding support
- All models backed by the `DeepEvalBaseLLM` abstract class with `generate<T>(prompt, schema?)`
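
Prefix-based provider detection can be pictured roughly as follows. This is a minimal sketch, not the SDK's `ModelFactory`: the `Provider` type, the prefix table, and `detectProvider` are hypothetical names chosen for illustration, and the real mapping may differ.

```typescript
// Hypothetical sketch: none of these names come from the DeepEval SDK.
type Provider = "openai" | "anthropic" | "gemini" | "deepseek" | "local";

// Assumed prefix table; the actual prefixes and providers may differ.
const PREFIX_TO_PROVIDER: ReadonlyArray<readonly [string, Provider]> = [
  ["gpt-", "openai"],
  ["claude-", "anthropic"],
  ["gemini-", "gemini"],
  ["deepseek-", "deepseek"],
];

function detectProvider(model: string, explicit?: Provider): Provider {
  // An explicitly supplied provider always wins over prefix matching.
  if (explicit !== undefined) return explicit;
  const lowered = model.toLowerCase();
  for (const [prefix, provider] of PREFIX_TO_PROVIDER) {
    if (lowered.startsWith(prefix)) return provider;
  }
  // Unknown names fail loudly instead of silently picking a default.
  throw new Error(`Cannot infer a provider for "${model}"; pass one explicitly.`);
}
```

Failing on unknown names, rather than falling back to a default provider, keeps a typo in a model name from being routed to the wrong backend.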

The initial TypeScript SDK focuses on the Confident AI API surface, including:
### 40+ Metrics (Complete Parity)
- **RAG**: Faithfulness, Hallucination, AnswerRelevancy, ContextualPrecision/Recall/Relevancy
- **Safety**: Bias, Toxicity, PII Leakage, NonAdvice, Misuse, RoleViolation
- **Agent**: TaskCompletion, ToolUse, ToolCorrectness, PlanAdherence, PlanQuality, StepEfficiency, GoalAccuracy, ArgumentCorrectness
- **Quality**: Summarization, PromptAlignment, JsonCorrectness, ExactMatch, PatternMatch
- **Conversational**: TurnRelevancy, TurnFaithfulness, ConversationCompleteness, KnowledgeRetention, RoleAdherence, TopicAdherence, ConversationalGEval
- **Arena**: ArenaGEval with multi-contestant comparison
- **MCP**: MCPUseMetric, MCPTaskCompletion, MultiTurnMCPUse
- **Multimodal**: ImageCoherence, ImageHelpfulness, ImageReference, TextToImage, ImageEditing
- **General**: GEval with custom criteria + rubrics

- Pushing and pulling datasets
- Running and reporting evaluations through Confident AI
- Reading/writing prompts and prompt versions
- Other Confident AI platform interactions
All metrics share template definitions with Python (Jinja2 → Nunjucks, using the same `templates.json`).
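
To illustrate why one template file can serve both languages: Jinja2 and Nunjucks share the `{{ variable }}` placeholder syntax. The sketch below is a stand-in renderer, not Nunjucks and not the SDK's loader; `renderTemplate` and the sample template text are hypothetical, and it handles only plain variable substitution.

```typescript
// Illustrative template text, not copied from DeepEval's templates.json.
const VERDICT_TEMPLATE =
  "Claim: {{ claim }}\nContext: {{ context }}\nAnswer yes or no.";

// Replaces `{{ name }}` placeholders with values from `vars`.
// Filters, loops, and conditionals are deliberately out of scope.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(vars, key)) {
      throw new Error(`Missing template variable: ${key}`);
    }
    return vars[key] as string;
  });
}

const renderedVerdictPrompt = renderTemplate(VERDICT_TEMPLATE, {
  claim: "Paris is the capital of France.",
  context: "France is a country in Europe.",
});
```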

Local execution features, such as LLM-as-a-judge metrics, NLP models, and fully local evaluation, currently remain in the Python package while we expand TypeScript support.
### Unit-Test Workflow
- `evaluate()` — run metrics over test cases with progress bars, reporting, and caching
- `assertTest()` — call from Jest/Vitest tests; throws detailed `AssertionError` on failure
- `deepeval test run` CLI command — runs Jest test files and posts results to Confident AI
- `compare()` — arena-style comparison of contestant outputs
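
The assert-style behaviour described above can be sketched as follows, assuming a metric passes when its score is at or above its threshold. `ScoredMetric`, `EvalAssertionError`, and `assertAllPass` are hypothetical names for illustration and are not the SDK's `assertTest` implementation.

```typescript
// Hypothetical minimal metric shape, for illustration only.
interface ScoredMetric {
  name: string;
  threshold: number;
  score: number;
}

// Carries the failing metrics so a test runner can report details.
class EvalAssertionError extends Error {
  readonly failures: readonly ScoredMetric[];

  constructor(failures: readonly ScoredMetric[]) {
    super(
      "Metrics below threshold: " +
        failures
          .map((m) => `${m.name} (score ${m.score} < threshold ${m.threshold})`)
          .join(", "),
    );
    this.name = "EvalAssertionError";
    this.failures = failures;
  }
}

// Throws if any metric scored below its threshold; returns normally otherwise.
function assertAllPass(metrics: readonly ScoredMetric[]): void {
  const failures = metrics.filter((m) => m.score < m.threshold);
  if (failures.length > 0) throw new EvalAssertionError(failures);
}
```

Because the helper throws, Jest and Vitest mark the enclosing test as failed without extra matcher code.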

## Roadmap
### Synthetic Data Generation
- `Synthesizer` class — generate goldens from documents, contexts, scratch, or existing goldens
- Evolution types: Reasoning, MultiContext, Concretizing, Constrained, Comparative, Hypothetical, InBreadth
- Configurable filtration and evolution distribution
- Supports both single-turn and conversational goldens
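
One plausible reading of a configurable evolution distribution is a set of relative weights, one per evolution type, from which each evolution step is sampled. The sketch below assumes that interpretation; `sampleEvolution` and the weight format are hypothetical, not the `Synthesizer` API.

```typescript
type Evolution =
  | "Reasoning"
  | "MultiContext"
  | "Concretizing"
  | "Constrained"
  | "Comparative"
  | "Hypothetical"
  | "InBreadth";

// Picks an evolution type with probability proportional to its weight.
// `rand` is injectable so the choice can be made deterministic in tests.
function sampleEvolution(
  distribution: Partial<Record<Evolution, number>>,
  rand: () => number = Math.random,
): Evolution {
  const entries = (Object.entries(distribution) as [Evolution, number][]).filter(
    ([, weight]) => weight > 0,
  );
  const total = entries.reduce((sum, [, weight]) => sum + weight, 0);
  let cursor = rand() * total;
  let chosen: Evolution | undefined;
  for (const [evolution, weight] of entries) {
    chosen = evolution; // the last entry also absorbs floating-point drift
    cursor -= weight;
    if (cursor < 0) break;
  }
  if (chosen === undefined) {
    throw new Error("Distribution needs at least one positive weight.");
  }
  return chosen;
}
```

With weights `{ Reasoning: 1, Constrained: 3 }`, roughly three out of four samples would be `Constrained`.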

Our next milestone is to reach **80% feature parity** across the Confident AI integration surface by the **end of July**. This includes:
### Prompt Optimization
- `PromptOptimizer` — evolutionary prompt improvement using evaluation metrics
- Configurable iterations, minibatch size, Pareto set, and patience-based early stopping
- Automatic feedback generation and prompt rewriting
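
Patience-based early stopping generally means: stop once a fixed number of consecutive iterations fail to beat the best score seen so far. The sketch below shows that loop in isolation; `optimizeWithPatience` and its parameters are hypothetical and do not reflect `PromptOptimizer` internals.

```typescript
interface PatienceResult {
  bestScore: number;
  bestIteration: number;
  iterationsRun: number;
}

// Runs up to `maxIterations`, stopping early after `patience`
// consecutive iterations without a strict improvement.
function optimizeWithPatience(
  scoreIteration: (iteration: number) => number,
  maxIterations: number,
  patience: number,
): PatienceResult {
  let bestScore = -Infinity;
  let bestIteration = -1;
  let stale = 0;
  let iterationsRun = 0;

  for (let i = 0; i < maxIterations; i++) {
    const score = scoreIteration(i);
    iterationsRun++;
    if (score > bestScore) {
      bestScore = score;
      bestIteration = i;
      stale = 0; // any improvement resets the patience counter
    } else {
      stale++;
      if (stale >= patience) break;
    }
  }
  return { bestScore, bestIteration, iterationsRun };
}
```

A larger `patience` tolerates noisy scores at the cost of more evaluation calls.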

- **Shared prompt templates** — one source of truth for prompt templates, consumed by both Python and TypeScript so the implementations stay aligned.
- **TypeScript-native APIs** — equivalents for the relevant Python functions and classes, shaped to feel natural in TypeScript while staying familiar to DeepEval users.
- **Dedicated TypeScript docs** — TypeScript examples and guides alongside the existing Python documentation.
### Confident AI Integration
- Full API client with multi-region support (US/EU/AU), retry logic
- Dataset CRUD, test run posting, experiment management
- Prompt management with versioning, branching, and labels
- Tracing with OpenTelemetry
- Governance assessment

## Quick Start

### Installation

```bash
npm install deepeval
```

### Set up your model

```typescript
import { OpenAIModel } from "deepeval";

const model = new OpenAIModel({ model: "gpt-4o" });
```

Or use the factory to auto-detect providers:

```typescript
import { ModelFactory } from "deepeval/models";

const model = ModelFactory.createLLM({ model: "gpt-4o" });
const local = ModelFactory.createLLM({
  model: "my-model",
  provider: "local",
  baseURL: "http://localhost:8000/v1",
});
```

### Run a metric

```typescript
import { FaithfulnessMetric, LLMTestCase } from "deepeval";

const metric = new FaithfulnessMetric({ model: "gpt-4o" });
const testCase = new LLMTestCase({
  input: "What is the capital of France?",
  actualOutput: "Paris is the capital of France.",
  retrievalContext: ["France is a country in Europe."],
});

await metric.measure(testCase);
console.log(metric.score, metric.reason);
```

### Write eval tests (Jest/Vitest)

```typescript
import { assertTest, LLMTestCase, ExactMatchMetric } from "deepeval";

test("response should exactly match expected", async () => {
  await assertTest({
    testCase: new LLMTestCase({
      input: "What is 2+2?",
      actualOutput: "4",
      expectedOutput: "4",
    }),
    metrics: [new ExactMatchMetric({ threshold: 1 })],
  });
});
```

Run with: `npx deepeval test run`

### Generate synthetic data

```typescript
import { Synthesizer, OpenAIModel } from "deepeval";

const synth = new Synthesizer(new OpenAIModel());
const goldens = await synth.generateGoldensFromContexts([
  ["Paris is the capital of France."],
]);
```

### Optimize a prompt

```typescript
import { PromptOptimizer, FaithfulnessMetric } from "deepeval";

// Assumes `model` and `goldens` from the snippets above,
// plus a `prompt` object defined elsewhere.
const optimizer = new PromptOptimizer({
  modelCallback: async (prompt, golden) => {
    const rendered = prompt.interpolate({ input: golden.input }) as string;
    const { output } = await model.generate(rendered);
    return output;
  },
  metrics: [new FaithfulnessMetric()],
});

const report = await optimizer.optimize(prompt, goldens);
console.log("Best score:", report.logs[0]?.before, "→", report.logs[0]?.after);
```

## Submodule Imports

```typescript
import { ... } from "deepeval/metrics"; // All metric classes
import { ... } from "deepeval/models"; // Model classes + factory
import { ... } from "deepeval/evaluate"; // evaluate, assertTest
import { ... } from "deepeval/dataset"; // Dataset management
import { ... } from "deepeval/prompt"; // Prompt management
import { ... } from "deepeval/synthesizer"; // Synthetic data generation
import { ... } from "deepeval/optimizer"; // Prompt optimization
import { ... } from "deepeval/tracing"; // OpenTelemetry tracing
import { ... } from "deepeval/confident"; // Confident AI client
```

## Python vs TypeScript

Python remains DeepEval's most complete implementation and the first place new local evaluation capabilities will land. TypeScript complements that foundation by making DeepEval workflows that integrate with Confident AI available to JavaScript and TypeScript teams, with a clear path toward broader feature coverage.
The TypeScript SDK aims for full API parity with the Python package while feeling natural in TypeScript (strong typing, interfaces, generics, discriminated unions). Shared resources like metric templates are compiled from a single source of truth.
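
As an illustration of the discriminated-union style mentioned above, a result type can encode each outcome as its own variant so the compiler checks that every case is handled. The `MetricOutcome` type and `summarizeOutcome` function are hypothetical and do not describe the SDK's actual result types.

```typescript
// Hypothetical result type; field names are illustrative, not the SDK's.
type MetricOutcome =
  | { status: "passed"; score: number }
  | { status: "failed"; score: number; reason: string }
  | { status: "errored"; error: string };

// The `status` field narrows the union, so `reason` and `error`
// are only accessible in the branches where they exist.
function summarizeOutcome(outcome: MetricOutcome): string {
  switch (outcome.status) {
    case "passed":
      return `passed (${outcome.score.toFixed(2)})`;
    case "failed":
      return `failed (${outcome.score.toFixed(2)}): ${outcome.reason}`;
    case "errored":
      return `errored: ${outcome.error}`;
  }
}
```

Adding a fourth variant to `MetricOutcome` would make the `switch` non-exhaustive, which the compiler reports under strict settings.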
3 changes: 2 additions & 1 deletion typescript/eslint.config.mts
@@ -6,8 +6,9 @@ import globals from "globals";
import tseslint from "typescript-eslint";

export default defineConfig([
{ ignores: ["dist/**"] },
{
ignores: ["dist/**", "test/**"],
ignores: ["test/**"],
files: ["**/*.{js,mjs,cjs,ts,mts,cts}"],
languageOptions: {
globals: globals.node,
25 changes: 10 additions & 15 deletions typescript/examples/evaluate/example-evaluate.ts
@@ -1,17 +1,16 @@
import { LLMTestCase, ToolCall } from "../../src/test-case";
import { evaluate } from "../../src/confident/evaluate";
import { LLMTestCase, ToolCall, evaluate, ExactMatchMetric } from "../../src";

async function main() {
const testCase1 = new LLMTestCase({
input: "What is the capital of Germany?",
actualOutput: "Berlin is the capital of Germany.",
actualOutput: "Berlin",
expectedOutput: "Berlin",
context: ["Geography", "Europe"],
retrievalContext: ["Germany is a country in Central Europe."],
});
const testCase2 = new LLMTestCase({
input: "What is the formula for water?",
actualOutput: "The chemical formula for water is H2O.",
actualOutput: "H2O",
expectedOutput: "H2O",
context: ["Chemistry", "Molecules"],
retrievalContext: [
@@ -27,25 +26,21 @@ async function main() {
});
const testCase3 = new LLMTestCase({
input: "What is the chemical formula for water?",
actualOutput: "The chemical formula for water is H2O.",
actualOutput: "H2O",
expectedOutput: "H2O",
context: ["Chemistry"],
retrievalContext: ["Water is composed of hydrogen and oxygen"],
additionalMetadata: { source: "chemistry textbook" },
comments: "Example with tool calls",
toolsCalled: [toolCall],
});
const testCases = [testCase1, testCase2, testCase3];

try {
const metricCollection = "New Collection";
await evaluate({
metricCollection,
llmTestCases: testCases,
});
} catch (error: any) {
console.error("Error evaluating test cases:", error);
}
const metric = new ExactMatchMetric({ threshold: 1 });
const result = await evaluate([testCase1, testCase2, testCase3], [metric], {
displayConfig: { showIndicator: true, printResults: true },
});

console.log(`Passed: ${result.testResults.filter((r) => r.success).length}/${result.testResults.length}`);
}

main().catch((error) => {
3 changes: 3 additions & 0 deletions typescript/package-lock.json

48 changes: 48 additions & 0 deletions typescript/package.json
@@ -43,6 +43,36 @@
"require": "./dist/openai/index.js",
"types": "./dist/openai/index.d.ts"
},
"./models": {
"import": "./dist/models/index.js",
"require": "./dist/models/index.js",
"types": "./dist/models/index.d.ts"
},
"./metrics": {
"import": "./dist/metrics/index.js",
"require": "./dist/metrics/index.js",
"types": "./dist/metrics/index.d.ts"
},
"./evaluate": {
"import": "./dist/evaluate/index.js",
"require": "./dist/evaluate/index.js",
"types": "./dist/evaluate/index.d.ts"
},
"./prompt": {
"import": "./dist/prompt/index.js",
"require": "./dist/prompt/index.js",
"types": "./dist/prompt/index.d.ts"
},
"./synthesizer": {
"import": "./dist/synthesizer/index.js",
"require": "./dist/synthesizer/index.js",
"types": "./dist/synthesizer/index.d.ts"
},
"./optimizer": {
"import": "./dist/optimizer/index.js",
"require": "./dist/optimizer/index.js",
"types": "./dist/optimizer/index.d.ts"
},
"./integrations/ai-sdk": {
"import": "./dist/integrations/ai-sdk/index.js",
"require": "./dist/integrations/ai-sdk/index.js",
@@ -84,6 +114,24 @@
"openai": [
"dist/openai/index.d.ts"
],
"models": [
"dist/models/index.d.ts"
],
"metrics": [
"dist/metrics/index.d.ts"
],
"evaluate": [
"dist/evaluate/index.d.ts"
],
"prompt": [
"dist/prompt/index.d.ts"
],
"synthesizer": [
"dist/synthesizer/index.d.ts"
],
"optimizer": [
"dist/optimizer/index.d.ts"
],
"integrations/ai-sdk": [
"dist/integrations/ai-sdk/index.d.ts"
],