This project is a prototype that extends vitest to support basic Evals functionality. Evals are a type of testing most commonly used to evaluate the results of calls to language models. This lets you write them using a testing pattern you're already familiar with, and it works well with your existing continuous integration toolchain.
This is heavily inspired by Evalite, but opts for a vitest-native approach to maximize compatibility with the existing ecosystem. That means you can use it with your existing toolchain, including reporting such as code coverage and xunit.
npm install -D vitest-evals
You've likely already got a mechanism for passing the user input into your model, for example:
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

async function answerQuestion(prompt: string) {
  const model = openai("gpt-4o");
  const { text } = await generateText({
    model,
    prompt,
  });
  return text;
}
You'll use this as the `task` within your evals, and then you simply need to define a set of scenarios and a way to validate if the LLM is responding as you desire:
import { describeEval } from "vitest-evals";
import { Factuality } from "autoevals";
describeEval("my evals", {
  data: async () => {
    // The scenarios you wish to evaluate
    return [
      {
        input: "What is the capital of France?",
        expected: "Paris",
      },
    ];
  },
  task: answerQuestion,
  // Scorers determine if the response was acceptable - in this case we're using
  // a secondary LLM prompt to judge the response of the first.
  scorers: [Factuality],
  // The threshold required for the average score for this eval to pass. This will be
  // based on the scorers you've provided, and in the case of Factuality, we might be
  // ok with a 60% score (see the implementation for why).
  threshold: 0.6,
  // The timeout for each test. Defaults to 10s. You may need to increase this if your model
  // provider has high latency or you're using a large number of scorers.
  // timeout: 10000,
  // A check to determine if these tests should run. This is helpful to control tests so they
  // only run in certain situations, for example if a model provider's API key is defined.
  // skipIf: () => !process.env.OPENAI_API_KEY
});
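As your suite grows, the async data function can also pull scenarios from elsewhere instead of inlining them. A minimal sketch, assuming a fixtures/questions.json file containing an array of { input, expected } objects (the path and shape here are illustrative):

import { readFile } from "node:fs/promises";
import { describeEval } from "vitest-evals";
import { Factuality } from "autoevals";

describeEval("questions from a fixture", {
  data: async () => {
    // Hypothetical fixture file; each entry provides `input` and `expected`.
    const raw = await readFile("fixtures/questions.json", "utf-8");
    return JSON.parse(raw) as { input: string; expected: string }[];
  },
  // Reuses the answerQuestion task defined above.
  task: answerQuestion,
  scorers: [Factuality],
  threshold: 0.6,
});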
// import `vitest-evals` to expose `expect().toEval()`
// This can also be done via `setupFiles` pattern in `vitest`.
import "vitest-evals";
import { Factuality } from "autoevals";
describe("my test suite", () => {
it("kind of works", () => {
expect("What is the capital of France?").toEval("Paris", answerQuestion, Factuality, 0.8)
});
});
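If you'd rather not repeat the `vitest-evals` import in every test file, the `setupFiles` route mentioned above looks roughly like this (file names are illustrative):

// vitest.setup.ts
import "vitest-evals";

// vitest.config.ts
/// <reference types="vitest" />
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // runs before each test file, so `expect().toEval()` is always registered
    setupFiles: ["./vitest.setup.ts"],
  },
});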
Scorers are compatible with the autoevals interface, but are also simple to implement on your own:
export const Contains = async (opts: {
  input: string;
  expected: string;
  output: string;
}) => {
  return {
    score: opts.output.includes(opts.expected) ? 1.0 : 0.0,
  };
};
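You can then pass it to scorers like any other scorer. A quick sketch, reusing the answerQuestion task from earlier:

import { describeEval } from "vitest-evals";

describeEval("mentions the capital", {
  data: async () => [
    { input: "What is the capital of France?", expected: "Paris" },
  ],
  task: answerQuestion,
  // Scores 1.0 when the response contains the expected string, otherwise 0.0.
  scorers: [Contains],
  threshold: 1.0,
});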
For something more realistic, here's a reimplementation of the Factuality scorer from autoevals, with some flexibility on the model, enabling you to evaluate against multiple models:
import { generateObject, type LanguageModel } from "ai";
import { z } from "zod";

/**
 * A Factuality checker utilizing the `ai` SDK based on the implementation in `autoevals`.
 *
 * @param model - The language model to utilize (via `ai`).
 *
 * @example
 * ```javascript
 * import { openai } from "@ai-sdk/openai";
 *
 * scorers: [Factuality(openai("gpt-4o"))]
 * ```
 */
export function Factuality(model: LanguageModel) {
  return async (opts: {
    input: string;
    output: string;
    expected?: string;
  }) => {
    const { object } = await generateObject({
      model,
      /**
       * Prompt implementation from `autoevals`:
       *
       * {@link https://github.com/braintrustdata/autoevals/blob/5aa20a0a9eb8fc9e07e9e5722ebf71c68d082f32/templates/factuality.yaml}
       */
      prompt: `
        You are comparing a submitted answer to an expert answer on a given question. Here is the data:
        [BEGIN DATA]
        ************
        [Question]: ${opts.input}
        ************
        [Expert]: ${opts.expected}
        ************
        [Submission]: ${opts.output}
        ************
        [END DATA]
        Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
        The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
        (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
        (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
        (C) The submitted answer contains all the same details as the expert answer.
        (D) There is a disagreement between the submitted answer and the expert answer.
        (E) The answers differ, but these differences don't matter from the perspective of factuality.
      `,
      schema: z.object({
        answer: z.enum(["A", "B", "C", "D", "E"]).describe("Your selection."),
        rationale: z
          .string()
          .describe("Why you chose this answer. Be very detailed."),
      }),
    });

    const scores = {
      A: 0.4,
      B: 0.6,
      C: 1,
      D: 0,
      E: 1,
    };

    return {
      score: scores[object.answer],
      metadata: {
        rationale: object.rationale,
      },
    };
  };
}
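Because the model is injected, the same suite can be scored by several judge models. A sketch, assuming the scorer above is exported from a local ./factuality module and that the model IDs below are available to you:

import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";
import { describeEval } from "vitest-evals";

// Hypothetical local module exporting the Factuality scorer defined above.
import { Factuality } from "./factuality";

// Illustrative model choices; swap in whichever providers you test against.
const models = {
  "gpt-4o": openai("gpt-4o"),
  "claude-3-5-sonnet-latest": anthropic("claude-3-5-sonnet-latest"),
};

for (const [name, model] of Object.entries(models)) {
  describeEval(`capitals (${name})`, {
    data: async () => [
      { input: "What is the capital of France?", expected: "Paris" },
    ],
    // Reuses the answerQuestion task from earlier; only the judge model varies.
    task: answerQuestion,
    scorers: [Factuality(model)],
    threshold: 0.6,
  });
}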
An alternative to skipIf for controlling whether evals run is creating a separate vitest configuration for them. This has several advantages, particularly allowing you to maintain two completely separate test suites. A good pattern this enables is a filename-based test selector:
// vitest.evals.config.ts
/// <reference types="vitest" />
import { defineConfig } from "vitest/config";

import defaultConfig from "./vitest.config";

export default defineConfig({
  ...defaultConfig,
  test: {
    ...defaultConfig.test,
    // run `eval` files rather than typical `test` files
    include: ["src/**/*.eval.{js,mjs,cjs,ts,mts,cts,jsx,tsx}"],
  },
});
In the above, we're telling it to match only `*.eval.*` files (vs the typical `*.test.*` or `*.spec.*`). We're also inheriting from our default `vitest.config.ts`. This gives us a clean way to run only tests, or run only evals:
vitest --config=vitest.evals.config.ts
It's recommended to add this to your `package.json`, such as under an `eval` helper:
// package.json
{
  // ...
  "scripts": {
    // ...
    "eval": "vitest --config=vitest.evals.config.ts"
  }
}
You can then run your evals using `npm run eval`.
Nothing fancy here.
pnpm install
pnpm test