|
| 1 | +--- |
| 2 | +title: "We're releasing TypeScript in DeepEval's Python monorepo" |
| 3 | +description: DeepEval is going TypeScript. Here's why we put it in the same repo as Python, and how we keep the two implementations from drifting apart. |
| 4 | +date: 2026-06-08 |
| 5 | +authors: [penguine] |
| 6 | +category: community |
| 7 | +--- |
| 8 | + |
| 9 | +:::info |
| 10 | +DeepEval for TypeScript is releasing on July 1st. Follow progress on [our GitHub](https://github.com/confident-ai/deepeval). |
| 11 | +::: |
| 12 | + |
| 13 | +DeepEval started Python-only. For most of its life, TypeScript existed in our world as exactly one thing: a thin client for shipping eval results up to Confident AI. It couldn't _run_ a metric — it had no `GEval`, no `AnswerRelevancyMetric`, no judge logic. |
| 14 | + |
| 15 | +In fact, that's why we didn't even bother to promote it. If you're a user of Confident AI, you'll find it in Confident's docs. Otherwise, it didn't matter to you. |
| 16 | + |
| 17 | +But with the new [DeepEval as an evaluation harness](/blog/introducing-deepeval-4) direction we're heading in, I thought it was absolutely necessary to support TypeScript. |
| 18 | + |
| 19 | +Don't get me wrong: we're not at the "TypeScript-native, feature parity with Python" stage yet. But now we're building a real TypeScript SDK, and the first decision wasn't what the API should look like. It was where the code should live: one repo alongside Python, or a separate `deepeval-ts`? |
| 20 | + |
| 21 | +## Actually, we did start with a separate TypeScript repo |
| 22 | + |
| 23 | +For those of you who use Confident AI, you'll know I'd be lying if I said a monorepo with TypeScript in it was the plan from day one. |
| 24 | + |
| 25 | +Internally, we did have a private `deepeval-ts` repo, and we even released a `deepeval-ts` package on `npm`. But it only ever acted as an SDK for Confident AI. |
| 26 | + |
| 27 | +But over time, we decided it wasn't good enough and, frankly, pointless as its own TypeScript repo when everyone was asking for it to be open-sourced. So it really came down to two options: |
| 28 | + |
| 29 | +- Do we open-source the TypeScript repo, or |
| 30 | +- Do we include it within the existing DeepEval repo? |
| 31 | + |
| 32 | +We chose the latter. |
| 33 | + |
| 34 | +## How we weighed our options |
| 35 | + |
| 36 | +So here was our objective for DeepEval in the TypeScript ecosystem: let TypeScript users use DeepEval for its non-experimental features. This meant local evals, tracing, synthetic data generation, and simulation. |
| 37 | + |
| 38 | +Hence, the goal was never parity. The goal is to narrow the gap between the reference (Python) and the follower (TypeScript) as much as we possibly can. That distinction matters, because a second language that's allowed to drift is worse than no second language at all: it produces different scores from the "same" metric, quietly invalidates cross-language comparisons, and erodes trust in both. If we can't keep the gap small, we might as well not ship TypeScript at all. |
| 39 | + |
| 40 | +That single belief, _narrow the gap or don't bother_, is what drove the repo decision. We judged every structure on the only axis we cared about: how hard does it fight drift? |
| 41 | + |
| 42 | +We looked at four options, each with its main selling point: |
| 43 | + |
| 44 | +- **Two repos, nothing shared** — maximum freedom for each ecosystem to move on its own. |
| 45 | +- **Two repos, generated from one source** — zero drift, for free. |
| 46 | +- **One repo, shared contract** — drift caught at the PR boundary. |
| 47 | +- **One repo, AI-synced** — docs carry over by construction, code changes ported across languages with AI. |
| 48 | + |
| 49 | +## Camp 1: Two repos, nothing shared (the LLM-framework default) |
| 50 | + |
| 51 | +This is what most of our space does. LangChain keeps [`langchain`](https://github.com/langchain-ai/langchain) (Python) and [`langchainjs`](https://github.com/langchain-ai/langchainjs) as fully separate repositories; [LlamaIndex](https://github.com/run-llama/llama_index) does the same. Each gets its own release cadence, issue tracker, and contributors. It's a reasonable default — the Python and JS ecosystems disagree about package managers, test runners, and versioning, and two repos let each move freely. |
| 52 | + |
| 53 | +But nothing holds the two surfaces together. A new feature is two independent PRs in two places, and the second one is the path of least resistance to skip. The result is the well-known reality that LangChain's JS surface trails its Python surface on features and integrations. Drift isn't an accident here — it's the default state, because no one is ever forced to look at both at once: |
| 54 | + |
| 55 | +<Tabs items={["Python", "TypeScript"]}> |
| 56 | +<Tab value="Python"> |
| 57 | + |
| 58 | +```python title="repo: deepeval" |
| 59 | +AnswerRelevancyMetric( |
| 60 | + threshold=0.7, |
| 61 | + include_reason=True, |
| 62 | +) |
| 63 | +``` |
| 64 | + |
| 65 | +</Tab> |
| 66 | +<Tab value="TypeScript"> |
| 67 | + |
| 68 | +```typescript title="repo: deepeval-ts" |
| 69 | +new AnswerRelevancyMetric({ |
| 70 | + threshold: 0.5, // ← drifted default |
| 71 | + // includeReason: not implemented yet |
| 72 | +}); |
| 73 | +``` |
| 74 | + |
| 75 | +</Tab> |
| 76 | +</Tabs> |
| 77 | + |
| 78 | +We decided this wasn't where we wanted to take DeepEval. |
| 79 | + |
| 80 | +## Camp 2: Two repos, generated from one source (Stripe) |
| 81 | + |
| 82 | +There's a smarter version of the split that kills drift entirely. Stripe ships [`stripe-python`](https://github.com/stripe/stripe-python), [`stripe-node`](https://github.com/stripe/stripe-node), [`stripe-go`](https://github.com/stripe/stripe-go), and several others as separate repos, but every one is generated from a single [`stripe/openapi`](https://github.com/stripe/openapi) spec repo. The repos are split; the source of truth is not. Parity is mechanical, because no human hand-writes the per-language surface: |
| 83 | + |
| 84 | +```yaml |
| 85 | +# stripe/openapi — the single source of truth |
| 86 | +PaymentIntent: |
| 87 | + properties: |
| 88 | + amount: { type: integer } |
| 89 | + currency: { type: string } |
| 90 | +``` |
| 91 | + |
| 92 | +```python |
| 93 | +# stripe-python — File generated from our OpenAPI spec |
| 94 | +class PaymentIntent: |
| 95 | + amount: int |
| 96 | + currency: str |
| 97 | +``` |
| 98 | + |
| 99 | +```typescript |
| 100 | +// stripe-node — File generated from our OpenAPI spec |
| 101 | +interface PaymentIntent { |
| 102 | + amount: number; |
| 103 | + currency: string; |
| 104 | +} |
| 105 | +``` |
| 106 | + |
| 107 | +The gap here is zero, and it stays zero for free. We'd love that. But it doesn't transfer, for a structural reason: Stripe's SDKs are API clients — thin wrappers over HTTP endpoints, fully describable by a schema. |
| 108 | + |
| 109 | +DeepEval is a framework. Codegen would have worked if DeepEval were a mere wrapper around Confident AI's APIs, but it isn't, and that's the whole point of making the TypeScript SDK open source. |
| 110 | + |
| 111 | +So pure codegen is out: there's no spec to generate a metric from. Keep that failure in mind, though — it comes back in a different form once you add AI to the picture. |
| 112 | + |
| 113 | +## Camp 3: One repo, shared contract (Apache Arrow) |
| 114 | + |
| 115 | +So we can't generate our way to a small gap, and we don't want the split that lets the gap widen. That leaves the structure that actively fights drift by hand: a single repo. |
| 116 | + |
| 117 | +[Apache Arrow](https://github.com/apache/arrow) is the model. It keeps C++, Python, JavaScript, and more in one repository — clean per-language directories around a shared format spec, with per-language CI, plus integration tests that check the languages _against each other_. The shared contract is what makes "did these two implementations stay in sync" a single, atomic question instead of a cross-repo coordination problem. |
| 118 | + |
| 119 | +The contract is a set of shared, language-neutral fixtures — golden cases that both implementations must agree on: |
| 120 | + |
| 121 | +```json |
| 122 | +// shared/fixtures/answer_relevancy/basic.json |
| 123 | +{ |
| 124 | + "input": "What is the capital of France?", |
| 125 | + "actual_output": "Paris is the capital of France.", |
| 126 | + "expected_score_min": 0.8 |
| 127 | +} |
| 128 | +``` |
| 129 | + |
| 130 | +Both test suites consume the _same_ file in the _same_ CI run: |
| 131 | + |
| 132 | +```python |
| 133 | +# python/tests/test_answer_relevancy.py |
| 134 | +case = load_fixture("answer_relevancy/basic.json") |
| 135 | +metric.measure(LLMTestCase(input=case["input"], actual_output=case["actual_output"])) |
| 136 | +assert metric.score >= case["expected_score_min"] |
| 137 | +``` |
| 138 | + |
| 139 | +```typescript |
| 140 | +// typescript/tests/answerRelevancy.test.ts |
| 141 | +const c = loadFixture("answer_relevancy/basic.json"); // keys stay snake_case, as in the shared file |
| 142 | +await metric.measure( |
| 143 | + new LLMTestCase({ input: c.input, actualOutput: c.actual_output }) |
| 144 | +); |
| 145 | +expect(metric.score).toBeGreaterThanOrEqual(c.expected_score_min); |
| 146 | +``` |
| 147 | + |
| 148 | +The TypeScript surface stays idiomatic — `camelCase`, an options object instead of keyword arguments, `new`, `await` — while being held to the same behavioral contract as Python, with the gap checked at the PR boundary rather than discovered later by a confused user. |
| 149 | + |
| 150 | +This is genuinely strong: a drift regression fails the build immediately. But the contract isn't free — every behavior needs a hand-written, language-neutral fixture, and someone has to keep that corpus in lockstep with both implementations forever. For a metric surface that changes often, maintaining the fixtures can become the bottleneck. We wanted Arrow's one-repo backbone without committing to that much hand-maintained machinery up front. |
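The `load_fixture` helper in the Python test above isn't shown anywhere in this post, so here is a minimal sketch of what it could look like. Everything in it is an assumption for illustration: the explicit `root` argument, the directory layout, and the temporary directory used for the demo. It also assumes the real fixture file omits the `//` path comment shown in the listing, since Python's `json` module rejects comments.

```python
import json
import tempfile
from pathlib import Path


def load_fixture(relative_path: str, root: Path) -> dict:
    """Read one shared golden case from the fixture directory (hypothetical helper)."""
    with open(root / relative_path, encoding="utf-8") as handle:
        return json.load(handle)


# Demo only: write the fixture from the post into a temporary directory,
# then load it back the way a test suite would.
fixture_root = Path(tempfile.mkdtemp())
(fixture_root / "answer_relevancy").mkdir()
(fixture_root / "answer_relevancy" / "basic.json").write_text(
    json.dumps(
        {
            "input": "What is the capital of France?",
            "actual_output": "Paris is the capital of France.",
            "expected_score_min": 0.8,
        }
    ),
    encoding="utf-8",
)

case = load_fixture("answer_relevancy/basic.json", fixture_root)
```

Because both languages read the same file, the keys come back exactly as written in the JSON (`actual_output`, `expected_score_min`); any renaming to `camelCase` would have to happen inside the TypeScript loader.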
| 151 | + |
| 152 | +## Camp 4: One repo, AI-synced (what we actually do) |
| 153 | + |
| 154 | +This is where we landed, and it's an option that didn't really exist a couple of years ago. It's Arrow's structure — one repo, clean per-language directories — minus the hand-maintained fixture contract. Two things hold the languages together instead. |
| 155 | + |
| 156 | +The first is docs. Prose is written once and rendered per-language: a shared [term-map](https://github.com/confident-ai/deepeval/blob/main/docs/lib/lang/terms.ts) pairs the Python and TypeScript spelling of every inline identifier — `test_case` ↔ `testCase`, `actual_output` ↔ `actualOutput` — so the documentation never silently describes one language while showing the other. An unknown term fails the build instead of being dropped silently. |
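To make the "unknown term fails the build" behavior concrete, here is a small Python sketch of the idea. It is not the actual `terms.ts` implementation: the `TERMS` table, the `render_term` function, and the `Lang` type are hypothetical names, and only the two identifier pairs come from the paragraph above.

```python
from typing import Literal

Lang = Literal["python", "typescript"]

# Hypothetical subset of a term map; the two pairs are the ones named in the post.
TERMS: dict[str, dict[str, str]] = {
    "test_case": {"python": "test_case", "typescript": "testCase"},
    "actual_output": {"python": "actual_output", "typescript": "actualOutput"},
}


def render_term(term: str, lang: Lang) -> str:
    """Return the spelling of `term` for `lang`; raise instead of guessing."""
    try:
        return TERMS[term][lang]
    except KeyError as exc:
        # Failing loudly is the point: a missing entry should stop the docs build.
        raise KeyError(f"no {lang} spelling registered for {term!r}") from exc
```

Calling `render_term("actual_output", "typescript")` returns `"actualOutput"`, while an identifier missing from the table raises `KeyError` rather than falling back to the Python spelling.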
| 157 | + |
| 158 | +The second is code, and this is the part that wasn't possible before. Camp 2 failed because a framework's logic isn't describable by a spec — there was nothing to generate from. But you no longer need a rigid spec: an LLM can read the Python implementation of a metric — its judge prompts, scoring math, thresholds — and port that exact change into the TypeScript implementation. |
| 159 | + |
| 160 | +```python |
| 161 | +# python/metrics/answer_relevancy.py — the reference change |
| 162 | +- threshold: float = 0.5 |
| 163 | ++ threshold: float = 0.7 # bumped default after eval study |
| 164 | +``` |
| 165 | + |
| 166 | +```typescript |
| 167 | +// typescript/metrics/answerRelevancy.ts — ported by AI from the Python diff |
| 168 | +- threshold: 0.5, |
| 169 | ++ threshold: 0.7, // bumped default after eval study |
| 170 | +``` |
| 171 | + |
| 172 | +Python stays the reference where behavior is decided; AI is what carries the diff across the gap. So a change to `AnswerRelevancyMetric` is still one PR in one repo, but the TypeScript side isn't transliterated from scratch by hand, nor gated behind a fixture corpus we have to grow forever — it's ported from the Python reference with AI and reviewed by a human who knows both. It doesn't make TypeScript first-class — Python still decides behavior — but it's the lightest structure that keeps the follower honest, and it only works now because the tooling to translate logic, not just schemas, finally exists. |
| 173 | + |
| 174 | +## The costs we're signing up for |
| 175 | + |
| 176 | +One repo isn't free; the reasons everyone else splits are real and we inherit them: |
| 177 | + |
| 178 | +- **Mixed toolchains in one CI** — pip _and_ npm, pytest _and_ vitest, two lint stacks, two release pipelines. Arrow's CI is heavy for exactly this reason. |
| 179 | +- **Release-coupling pressure** — npm and PyPI users upgrade on different schedules, so one repo must _not_ mean one version number. We have to deliberately decouple release tags per package. |
| 180 | +- **Contributor friction** — a TypeScript contributor clones a repo full of Python they don't care about, and vice versa. |
| 181 | + |
| 182 | +We're accepting these because the alternative — a quietly drifting TypeScript SDK — costs more than a heavier build. |
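One way to keep version numbers decoupled inside a single repo is to prefix each release tag with its package name. The sketch below assumes a hypothetical `<package>-v<major>.<minor>.<patch>` convention; the tag format, the function names, and the version numbers in the usage note are illustrative, not DeepEval's actual release scheme.

```python
import re

# Hypothetical tag convention: "<package>-v<major>.<minor>.<patch>".
TAG_PATTERN = re.compile(r"^(python|typescript)-v(\d+)\.(\d+)\.(\d+)$")

Version = tuple[int, int, int]


def parse_release_tag(tag: str) -> tuple[str, Version]:
    """Split a per-package tag into its package name and numeric version."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a per-package release tag: {tag!r}")
    package, major, minor, patch = match.groups()
    return package, (int(major), int(minor), int(patch))


def latest_versions(tags: list[str]) -> dict[str, Version]:
    """Track the highest version per package, independently of the other package."""
    latest: dict[str, Version] = {}
    for tag in tags:
        package, version = parse_release_tag(tag)
        if package not in latest or version > latest[package]:
            latest[package] = version
    return latest
```

With tags such as `python-v3.1.0`, `typescript-v0.2.0`, and `python-v3.2.1`, `latest_versions` reports a separate latest version for each package, so a PyPI release never forces an npm version bump.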
| 183 | + |
| 184 | +## What's next from here |
| 185 | + |
| 186 | +So, when can you actually see TypeScript in DeepEval? As of today, it's already out: https://github.com/confident-ai/deepeval/tree/main/typescript. |
| 187 | + |
| 188 | +But so far it's still a client wrapper around Confident AI. The actual local evals, simulation, etc. will be released on **July 1st.** |
| 189 | + |
| 190 | +[Star and watch the DeepEval repo](https://github.com/confident-ai/deepeval) if you're interested in what this will look like. And for Python users: don't worry, you won't notice a single change in your day-to-day experience. |
| 191 | + |
| 192 | +In conclusion: Python leads, TypeScript follows close behind, and one repo is what keeps "close behind" true. We're not pretending TypeScript is first-class. We're making sure that the day it isn't first-class is never the day it silently stops agreeing with Python. |