Commit 6bd11f0

committed
.
1 parent ec8cc15 commit 6bd11f0

2 files changed

Lines changed: 193 additions & 0 deletions

docs/content/blog/meta.json

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@
88
"deepeval-got-a-new-look",
99

1010
"---[Users]Community---",
11+
"typescript-in-deepeval-monorepo",
1112
"medical-chatbot-deepeval-guide",
1213
"rag-contract-assistant-deepeval-guide",
1314
"use-case-cognee-ai-memory",
Lines changed: 192 additions & 0 deletions
@@ -0,0 +1,192 @@
1+
---
2+
title: "We're releasing TypeScript in DeepEval's Python monorepo"
3+
description: DeepEval is going TypeScript. Here's why we put it in the same repo as Python, and how we keep the two implementations from drifting apart.
4+
date: 2026-06-08
5+
authors: [penguine]
6+
category: community
7+
---
8+
9+
:::info
10+
DeepEval for TypeScript is releasing on July 1st. Follow [our GitHub](https://github.com/confident-ai/deepeval) for updates.
11+
:::
12+
13+
DeepEval started Python-only. For most of its life, TypeScript existed in our world as exactly one thing: a thin client for shipping eval results up to Confident AI. It couldn't _run_ a metric — it had no `GEval`, no `AnswerRelevancyMetric`, no judge logic.
14+
15+
In fact, that's why we didn't even bother to promote it. If you're a user of Confident AI, you'll find it in Confident's docs. Otherwise, it didn't matter to you.
16+
17+
But with the new [DeepEval as an evaluation harness](/blog/introducing-deepeval-4) direction we're heading in, I thought it was absolutely necessary to support TypeScript.
18+
19+
Don't get me wrong: we're not at the "TypeScript-native, feature parity with Python" stage yet. But now we're building a real TypeScript SDK, and the first decision wasn't what the API should look like; it was where the code should live. One repo alongside Python, or a separate `deepeval-ts`?
20+
21+
## Actually, we did start with a separate TypeScript repo
22+
23+
If you're a Confident AI user, you'll know I'd be lying if I said a monorepo with TypeScript in it was the plan from day one.
24+
25+
Internally, we had a private `deepeval-ts` repo, and we even released a `deepeval-ts` npm package. But all of it existed to act as an SDK for Confident AI.
26+
27+
But over time, we decided it wasn't good enough and, frankly, pointless as its own TypeScript repo when everyone was asking for it to be open-sourced. So it really came down to two options:
28+
29+
- Do we open-source the TypeScript repo, or
30+
- Do we include it within the existing DeepEval repo?
31+
32+
We chose the latter.
33+
34+
## How we weighed our options
35+
36+
Our objective for DeepEval in the TypeScript ecosystem was to let TypeScript users use DeepEval's non-experimental features: local evals, tracing, synthetic data generation, and simulation.
37+
38+
Hence, the goal was never parity. The goal is to narrow the gap between the reference and the follower as much as we possibly can. That distinction matters, because a second language that's allowed to drift is worse than no second language at all — it produces different scores from the "same" metric, quietly invalidates cross-language comparisons, and erodes trust in both. If we can't keep the gap small, we might as well not ship TypeScript at all.
39+
40+
That single belief, _narrow the gap or don't bother_, is what drove the repo decision. We judged every structure on the only axis we cared about: how hard does the structure fight drift?
41+
42+
We looked at four options, each with its pros and cons:
43+
44+
- **Two repos, nothing shared** — maximum freedom for each ecosystem to move on its own.
45+
- **Two repos, generated from one source** — zero drift, for free.
46+
- **One repo, shared contract** — drift caught at the PR boundary.
47+
- **One repo, AI-synced** — docs carry over by construction, code changes ported across languages with AI.
48+
49+
## Camp 1: Two repos, nothing shared (the LLM-framework default)
50+
51+
This is what most of our space does. LangChain keeps [`langchain`](https://github.com/langchain-ai/langchain) (Python) and [`langchainjs`](https://github.com/langchain-ai/langchainjs) as fully separate repositories; [LlamaIndex](https://github.com/run-llama/llama_index) does the same. Each gets its own release cadence, issue tracker, and contributors. It's a reasonable default — the Python and JS ecosystems disagree about package managers, test runners, and versioning, and two repos let each move freely.
52+
53+
But nothing holds the two surfaces together. A new feature means two independent PRs in two places, and skipping the second one is the path of least resistance. The commonly reported result is that LangChain's JS surface trails its Python surface on features and integrations. Drift isn't an accident here; it's the default state, because no one is ever forced to look at both at once:
54+
55+
<Tabs items={["Python", "TypeScript"]}>
56+
<Tab value="Python">
57+
58+
```python title="repo: deepeval"
59+
AnswerRelevancyMetric(
60+
threshold=0.7,
61+
include_reason=True,
62+
)
63+
```
64+
65+
</Tab>
66+
<Tab value="TypeScript">
67+
68+
```typescript title="repo: deepeval-ts"
69+
new AnswerRelevancyMetric({
70+
threshold: 0.5, // ← drifted default
71+
// includeReason: not implemented yet
72+
});
73+
```
74+
75+
</Tab>
76+
</Tabs>
77+
78+
We decided this wasn't where we want to take DeepEval.
79+
80+
## Camp 2: Two repos, generated from one source (Stripe)
81+
82+
There's a smarter version of the split that kills drift entirely. Stripe ships [`stripe-python`](https://github.com/stripe/stripe-python), [`stripe-node`](https://github.com/stripe/stripe-node), [`stripe-go`](https://github.com/stripe/stripe-go), and a dozen others as separate repos — but every one is generated from a single [`stripe/openapi`](https://github.com/stripe/openapi) spec repo. The repos are split; the source of truth is not. Parity is mechanical, because no human hand-writes the per-language surface:
83+
84+
```yaml
85+
# stripe/openapi — the single source of truth
86+
PaymentIntent:
87+
  properties:
88+
    amount: { type: integer }
89+
    currency: { type: string }
90+
```
91+
92+
```python
93+
# stripe-python — File generated from our OpenAPI spec
94+
class PaymentIntent:
95+
    amount: int
96+
    currency: str
97+
```
98+
99+
```typescript
100+
// stripe-node — File generated from our OpenAPI spec
101+
interface PaymentIntent {
102+
amount: number;
103+
currency: string;
104+
}
105+
```
106+
107+
The gap here is zero, and it stays zero for free. We'd love that. But it doesn't transfer, for a structural reason: Stripe's SDKs are API clients — thin wrappers over HTTP endpoints, fully describable by a schema.
108+
109+
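DeepEval is a framework. Codegen would have worked if DeepEval were a mere wrapper around Confident AI's APIs, but it isn't; that's the whole point of open-sourcing the TypeScript SDK. A metric's behavior lives in its judge prompts and scoring logic, and no schema describes those.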
110+
111+
So pure codegen is out: there's no spec to generate a metric from. Keep that failure in mind, though — it comes back in a different form once you add AI to the picture.
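
To make the contrast concrete, here is a minimal sketch of what spec-driven generation looks like for a purely declarative surface. The `SPEC` dictionary and both renderers are hypothetical stand-ins (this is not Stripe's actual generator or schema format); the point is only that a data shape can be rendered into each language mechanically, while a judge prompt or scoring routine has no equivalent entry to render from.

```python
# Hypothetical mini-spec in the spirit of an OpenAPI schema (not Stripe's real format).
SPEC = {
    "PaymentIntent": {
        "amount": "integer",
        "currency": "string",
    }
}

# Per-language spelling of each schema type.
PY_TYPES = {"integer": "int", "string": "str"}
TS_TYPES = {"integer": "number", "string": "string"}


def render_python(spec: dict) -> str:
    """Emit one annotated Python class per schema entry."""
    blocks = []
    for name, props in spec.items():
        lines = [f"class {name}:"]
        lines += [f"    {field}: {PY_TYPES[kind]}" for field, kind in props.items()]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


def render_typescript(spec: dict) -> str:
    """Emit one TypeScript interface per schema entry."""
    blocks = []
    for name, props in spec.items():
        lines = [f"interface {name} {{"]
        lines += [f"  {field}: {TS_TYPES[kind]};" for field, kind in props.items()]
        lines.append("}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


print(render_python(SPEC))
print(render_typescript(SPEC))
```

Both outputs come from the same dictionary, so they cannot disagree about field names or types. Nothing in that dictionary could express how a metric arrives at its score.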
112+
113+
## Camp 3: One repo, shared contract (Apache Arrow)
114+
115+
So we can't generate our way to a small gap, and we don't want the split that lets the gap widen. That leaves the structure that actively fights drift by hand: a single repo.
116+
117+
[Apache Arrow](https://github.com/apache/arrow) is the model. It keeps C++, Python, JavaScript, and more in one repository — clean per-language directories around a shared format spec, with per-language CI, plus integration tests that check the languages _against each other_. The shared contract is what makes "did these two implementations stay in sync" a single, atomic question instead of a cross-repo coordination problem.
118+
119+
The contract is a set of shared, language-neutral fixtures — golden cases that both implementations must agree on:
120+
121+
```json
122+
// shared/fixtures/answer_relevancy/basic.json
123+
{
124+
"input": "What is the capital of France?",
125+
"actual_output": "Paris is the capital of France.",
126+
"expected_score_min": 0.8
127+
}
128+
```
129+
130+
Both test suites consume the _same_ file in the _same_ CI run:
131+
132+
```python
133+
# python/tests/test_answer_relevancy.py
134+
case = load_fixture("answer_relevancy/basic.json")
135+
metric.measure(LLMTestCase(input=case["input"], actual_output=case["actual_output"]))
136+
assert metric.score >= case["expected_score_min"]
137+
```
138+
139+
```typescript
140+
// typescript/tests/answerRelevancy.test.ts
141+
const c = loadFixture("answer_relevancy/basic.json");
142+
await metric.measure(
143+
  new LLMTestCase({ input: c.input, actualOutput: c.actual_output })
144+
);
145+
expect(metric.score).toBeGreaterThanOrEqual(c.expected_score_min);
146+
```
147+
148+
The TypeScript surface stays idiomatic — `camelCase`, an options object instead of keyword arguments, `new`, `await` — while being held to the same behavioral contract as Python, with the gap checked at the PR boundary rather than discovered later by a confused user.
149+
150+
This is genuinely strong: a drift regression fails the build immediately. But the contract isn't free — every behavior needs a hand-written, language-neutral fixture, and someone has to keep that corpus in lockstep with both implementations forever. For a metric surface that changes often, maintaining the fixtures can become the bottleneck. We wanted Arrow's one-repo backbone without committing to that much hand-maintained machinery up front.
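
For a sense of how small the shared plumbing is compared with the fixture corpus itself, here is a rough sketch of the Python side of such a contract. `load_fixture` takes an explicit root directory here, and `meets_contract` is a hypothetical helper; neither is DeepEval's actual test code. The demo writes the fixture shown earlier into a temporary directory so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_fixture(root: Path, relative_path: str) -> dict:
    """Read one language-neutral golden case from the shared fixture directory."""
    with open(root / relative_path, encoding="utf-8") as handle:
        return json.load(handle)


def meets_contract(score: float, case: dict) -> bool:
    """A score satisfies the contract when it reaches the fixture's floor."""
    return score >= case["expected_score_min"]


# Demo only: recreate the fixture from the post in a temp directory, then load it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    fixture_path = root / "answer_relevancy" / "basic.json"
    fixture_path.parent.mkdir(parents=True)
    fixture_path.write_text(
        json.dumps(
            {
                "input": "What is the capital of France?",
                "actual_output": "Paris is the capital of France.",
                "expected_score_min": 0.8,
            }
        ),
        encoding="utf-8",
    )
    case = load_fixture(root, "answer_relevancy/basic.json")

# Scores are hard-coded here; a real suite would take them from the metric under test.
print(meets_contract(0.9, case))
print(meets_contract(0.5, case))
```

The helpers are a few lines. The ongoing cost is writing and maintaining one JSON case per behavior, for every metric, in a form both languages can read.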
151+
152+
## Camp 4: One repo, AI-synced (what we actually do)
153+
154+
This is where we landed, and it's an option that didn't really exist a couple of years ago. It's Arrow's structure — one repo, clean per-language directories — minus the hand-maintained fixture contract. Two things hold the languages together instead.
155+
156+
The first is docs. Prose is written once and rendered per-language: a shared [term-map](https://github.com/confident-ai/deepeval/blob/main/docs/lib/lang/terms.ts) pairs the Python and TypeScript spelling of every inline identifier (`test_case` → `testCase`, `actual_output` → `actualOutput`) so the documentation never silently describes one language while showing the other. An unknown term fails the build instead of being dropped silently.
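
The idea can be illustrated with a short Python sketch. This is not the contents of `terms.ts`; the `TERMS` table, the `render` function, and the `UnknownTermError` name are all assumptions made for illustration. It rewrites every inline-code identifier for the requested language and raises on anything the map does not cover, which is the "fail the build" behavior described above.

```python
import re

# Hypothetical subset of a term map: Python spelling -> TypeScript spelling.
TERMS = {
    "test_case": "testCase",
    "actual_output": "actualOutput",
}

# Matches a single inline-code span such as `actual_output`.
INLINE_CODE = re.compile(r"`([^`]+)`")


class UnknownTermError(KeyError):
    """Raised when prose mentions an identifier the term map does not cover."""


def render(prose: str, language: str) -> str:
    """Rewrite each inline identifier into the spelling for `language`."""

    def swap(match: re.Match) -> str:
        term = match.group(1)
        if term not in TERMS:
            # Fail loudly rather than leaving a Python name in TypeScript docs.
            raise UnknownTermError(term)
        spelled = term if language == "python" else TERMS[term]
        return f"`{spelled}`"

    return INLINE_CODE.sub(swap, prose)


sentence = "Pass `actual_output` to the `test_case`."
print(render(sentence, "python"))
print(render(sentence, "typescript"))
```

Raising on an unknown term is the important design choice: a silent pass-through would let a new Python identifier leak into the TypeScript rendering unnoticed.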
157+
158+
The second is code, and this is the part that wasn't possible before. Camp 2 failed because a framework's logic isn't describable by a spec — there was nothing to generate from. But you no longer need a rigid spec: an LLM can read the Python implementation of a metric — its judge prompts, scoring math, thresholds — and port that exact change into the TypeScript implementation.
159+
160+
```python
161+
# python/metrics/answer_relevancy.py — the reference change
162+
- threshold: float = 0.5
163+
+ threshold: float = 0.7 # bumped default after eval study
164+
```
165+
166+
```typescript
167+
// typescript/metrics/answerRelevancy.ts — ported by AI from the Python diff
168+
- threshold: 0.5,
169+
+ threshold: 0.7, // bumped default after eval study
170+
```
171+
172+
Python stays the reference where behavior is decided; AI is what carries the diff across the gap. So a change to `AnswerRelevancyMetric` is still one PR in one repo, but the TypeScript side isn't transliterated from scratch by hand, nor gated behind a fixture corpus we have to grow forever — it's ported from the Python reference with AI and reviewed by a human who knows both. It doesn't make TypeScript first-class — Python still decides behavior — but it's the lightest structure that keeps the follower honest, and it only works now because the tooling to translate logic, not just schemas, finally exists.
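
As a rough illustration of the input side of that workflow, the sketch below computes the Python reference diff with the standard library's `difflib` and assembles the text a model would be given. The prompt wording and both function names are hypothetical, and no model is called; this is not a description of the tooling we actually run.

```python
import difflib


def reference_diff(before: str, after: str, path: str) -> str:
    """Unified diff of the Python reference file, before vs. after the change."""
    return "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )


def build_port_request(diff: str, typescript_source: str) -> str:
    """Assemble the text handed to a model; the wording is illustrative only."""
    return (
        "Port this Python change to the TypeScript file below.\n"
        "Keep TypeScript idioms (camelCase, options objects).\n\n"
        f"Python diff:\n{diff}\n"
        f"TypeScript file:\n{typescript_source}"
    )


before = "threshold: float = 0.5\n"
after = "threshold: float = 0.7\n"
diff = reference_diff(before, after, "python/metrics/answer_relevancy.py")
request = build_port_request(diff, "threshold: 0.5,\n")
print(request)
```

Whatever the model returns still goes through the human review described above; the sketch covers only how the reference change is packaged.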
173+
174+
## The costs we're signing up for
175+
176+
One repo isn't free; the reasons everyone else splits are real and we inherit them:
177+
178+
- **Mixed toolchains in one CI** — pip _and_ npm, pytest _and_ vitest, two lint stacks, two release pipelines. Arrow's CI is heavy for exactly this reason.
179+
- **Release-coupling pressure** — npm and PyPI users upgrade on different schedules, so one repo must _not_ mean one version number. We have to deliberately decouple release tags per package.
180+
- **Contributor friction** — a TypeScript contributor clones a repo full of Python they don't care about, and vice versa.
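
One way to picture decoupled release tags is a per-package prefix, so each ecosystem versions independently inside the same repo. The tag format below (`python-v3.1.0`, `typescript-v0.2.0`) is purely an assumption for illustration, not a scheme we have committed to.

```python
import re

# Hypothetical tag scheme: "<package>-v<major>.<minor>.<patch>".
TAG = re.compile(r"^(?P<package>python|typescript)-v(?P<version>\d+\.\d+\.\d+)$")


def latest_per_package(tags: list[str]) -> dict[str, tuple[int, ...]]:
    """Return the highest version seen for each package, ignoring other tags."""
    latest: dict[str, tuple[int, ...]] = {}
    for tag in tags:
        match = TAG.match(tag)
        if match is None:
            continue  # not a per-package release tag
        package = match["package"]
        version = tuple(int(part) for part in match["version"].split("."))
        if package not in latest or version > latest[package]:
            latest[package] = version
    return latest


tags = ["python-v3.1.0", "typescript-v0.2.0", "python-v3.0.9", "v1.0.0"]
print(latest_per_package(tags))
```

With a scheme like this, each release pipeline filters on its own prefix, so a TypeScript release never forces a Python version bump.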
181+
182+
We're accepting these because the alternative — a quietly drifting TypeScript SDK — costs more than a heavier build.
183+
184+
## What's next from here
185+
186+
So, when can we actually see TypeScript in DeepEval? In fact, it's already out as of today: https://github.com/confident-ai/deepeval/tree/main/typescript.
187+
188+
But so far it's still a client wrapper around Confident AI. The actual local evals, simulation, etc. will be released on **July 1st.**
189+
190+
[Star and watch the DeepEval repo](https://github.com/confident-ai/deepeval) if you're interested in what this will look like. And for Python users: don't worry, you won't notice a single change in your day-to-day experience.
191+
192+
In conclusion: Python leads, TypeScript follows close behind, and one repo is what keeps "close behind" true. We're not pretending TypeScript is first-class. We're making sure that the day it isn't first-class is never the day it silently stops agreeing with Python.
