Skip to content

ApelegHQ/ts-benchmark

Repository files navigation

@apeleghq/benchmark

NPM Downloads License

A statistically rigorous benchmarking library with paired t-tests, baseline correction, and confidence intervals.


Features

  • No runtime dependencies (except picocolor with some built-in reporters)
  • Runtime and framework agnostic (ESM and CJS dual-package).
  • Automatically injects a no-op baseline function to subtract loop and call overhead from every measurement on a per-trial basis.
  • Runs all functions within the same trial to enable paired statistical tests — shared noise sources (GC pauses, thermal throttling, etc.) cancel out in per-trial differences $d_i = a_i - b_i$.
  • Computes descriptive statistics per function: mean, median, standard deviation, SEM, min, max, and percentiles ($p_5$, $p_{25}$, $p_{75}$, $p_{95}$).
  • Reports 95% confidence intervals for all means.
  • Performs all $\binom{k}{2}$ pairwise paired t-tests with two-tailed p-values and significance flags.
  • Randomises execution order within each trial to reduce systematic bias.
  • Per-benchmark and suite-level setup/teardown hooks with shared typed context.
  • Configurable warmup iterations, iterations per trial, and number of trials.
  • Includes three built-in reporters: simple, advanced, and xUnit.
  • Fully JSON-serialisable output for custom reporting or CI integration.

Screenshot: Simple report


Screenshot: Advanced report


Installation

npm install @apeleghq/benchmark

or

yarn add @apeleghq/benchmark

Usage

Basic suite

import { runSuite } from '@apeleghq/benchmark';
import simpleReport from '@apeleghq/benchmark/reporters/simple';

type Ctx = {
  array: unknown[];
};

const result = await runSuite<Ctx>({
  name: 'Array copy comparison',
  setup() {
    this.array = [1, 2, 3];
  },
  functions: [
    {
      name: 'Array.from',
      fn() {
        Array.from(this.array);
      },
    },
    {
      name: 'Spread operator',
      fn() {
        [...this.array];
      },
    },
  ],
});

simpleReport(result);

Advanced reporter

import advancedReport from '@apeleghq/benchmark/reporters/advanced';

advancedReport(result);

xUnit reporter (for CI integration)

import xunitReport from '@apeleghq/benchmark/reporters/xunit';

console.log(xunitReport(result));

Configuration options

All fields in ISuiteConfig beyond name and functions are optional:

const result = await runSuite({
  name: 'My suite',
  warmupIterations: 10,    // default: 10
  iterationsPerTrial: 1000, // default: 1000
  trials: 30,               // default: 30
  setup() { /* suite-level context setup */ },
  teardown() { /* suite-level context teardown */ },
  functions: [
    {
      name: 'my-fn',
      fn() { /* ... */ },
      setup() { /* function-level setup, runs before warmup+measurement */ },
      teardown() { /* function-level teardown, runs after measurement */ },
    },
  ],
});

Using the statistics module directly

The stats subpackage exposes the underlying statistical primitives:

import { mean, median, stdDev, tDistPValue } from '@apeleghq/benchmark/stats';

How it works

Statistical methodology

Baseline correction

A no-op function (internally named @@null) is automatically added to every suite. Its per-trial measurement captures pure loop and call overhead. This baseline value is subtracted from every other function's measurement on a per-trial basis before any statistics are computed:

$$\hat{x}{i,\text{fn}} = x{i,\text{fn}} - x_{i,\text{null}}$$

All reported aggregate statistics (mean, median, standard deviation, confidence intervals) are computed from these baseline-corrected samples.

Paired t-test

Because all functions are measured within the same trial, comparisons use a paired (dependent) t-test rather than an independent-samples test. Per-trial differences are:

$$d_i = a_i - b_i$$

The t-statistic is:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

where $\bar{d}$ is the mean of the differences and $s_d$ is their sample standard deviation. This is more powerful than an unpaired test because shared noise sources cancel out.

Confidence intervals

The 95% confidence interval for the true mean difference is:

$$\bar{d} \pm t_{\alpha/2,, n-1} \cdot \frac{s_d}{\sqrt{n}}$$

where $t_{\alpha/2,, n-1}$ is the critical value from the Student's t-distribution with $n-1$ degrees of freedom.


Output format

runSuite returns a fully JSON-serialisable ISuiteReport object:

Field Description
name Suite name
config Effective configuration (iterations, trials, warmup)
trials Every trial with raw timings and execution order
functions Per-function aggregate statistics (corrected and raw samples)
comparisons All pairwise paired t-test results
baselineName Name of the injected null baseline (@@null)

Each entry in comparisons includes:

Field Description
meanDifference $\bar{d}$ — mean of paired differences (ms)
relativeDifference $\bar{d} / \bar{b}$ — relative difference as a ratio
tStatistic Computed t-statistic
pValue Two-tailed p-value
significant true if pValue < 0.05
confidenceInterval 95% CI for the true mean difference (ms)

Exports

Export path Description
@apeleghq/benchmark runSuite — main entry point
@apeleghq/benchmark/stats Statistical primitives (mean, stdDev, t-test, etc.)
@apeleghq/benchmark/reporters/simple Simple console reporter
@apeleghq/benchmark/reporters/advanced Advanced console reporter with full statistics
@apeleghq/benchmark/reporters/xunit xUnit XML reporter for CI systems

Contributing

Contributions welcome. Please open issues or pull requests on the repository. Consider adding unit tests for edge cases and additional reporters if extending the library.


License

This project is licensed under the Apache 2.0 License with the LLVM exception. You are free to use this package in compliance with the terms of the license. For more information, see the LICENSE file.

About

A statistically rigorous benchmarking library with paired t-tests, baseline correction, and confidence intervals

Resources

License

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors