[benchmark] Add custom ScalarMetric and DiscreteMetric primitives #1520

Open
Janpot wants to merge 19 commits into master from custom-benchmark-metrics

Conversation

@Janpot

@Janpot Janpot commented Jun 5, 2026

Member

Adds user-defined custom metrics to @mui/internal-benchmark, recordable from a plain it() loop, and migrates the harness's built-in paint timings onto the same primitive. The dashboard and PR comment gain per-metric alarm flagging.

API

import { it } from 'vitest';
import { ScalarMetric, DiscreteMetric } from '@mui/internal-benchmark';

const duration = new ScalarMetric({
  name: 'work_duration',
  format: { style: 'unit', unit: 'millisecond' }, // Intl.NumberFormatOptions
  alarm: { direction: 'lowerIsBetter', warn: 0.1, error: 0.25 }, // omit -> informational
});
const clicks = new DiscreteMetric({ name: 'button_clicks' });

it('measures', () => {
  for (let i = 0; i < 100; i += 1) {
    duration.time();
    runWork();
    duration.timeEnd(); // records elapsed ms (console.time-style)
    clicks.record(countClicks());
  }
});
  • ScalarMetric — continuous values; mean ± σ with IQR outlier removal; time()/timeEnd() timing helper. Alarm bands are relative fractions.
  • DiscreteMetric — counts; exact-integer comparison, integer formatting. Alarm bands are absolute count deltas.
  • record(value, { id }) opens a labeled name#id sub-series; base + sub-series can mix on one metric. Metrics attach to the running test via TestRunner.getCurrentTest(), so one instance works across tests/iterations, inside or outside React.
  • alarm (optional) flags regressions vs the baseline. Its presence replaces a mode flag; absence = informational. warn/error are two severity bands (amber/red); direction (default lowerIsBetter) picks which way is a regression. Reusing a metric name with conflicting config across benchmarks is rejected.
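
The "mean ± σ with IQR outlier removal" aggregation mentioned for ScalarMetric can be pictured with a minimal sketch. This is not the package's aggregateSamples(): the function name `aggregateSketch`, the interpolated quantile method, the 1.5 x IQR fences, and the use of a population standard deviation are all assumptions made for illustration.

```typescript
// Hypothetical sketch, not the actual aggregateSamples() from @mui/internal-benchmark.
interface SampleStats {
  mean: number;
  stdDev: number;
  count: number; // samples kept after outlier removal
}

// Linear-interpolated quantile over an ascending-sorted array (assumed method).
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (pos - lower);
}

function aggregateSketch(samples: number[]): SampleStats {
  if (samples.length === 0) {
    return { mean: 0, stdDev: 0, count: 0 };
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  // Keep only samples inside the fences at 1.5 x IQR (assumed multiplier).
  const kept = sorted.filter((v) => v >= q1 - 1.5 * iqr && v <= q3 + 1.5 * iqr);
  const mean = kept.reduce((sum, v) => sum + v, 0) / kept.length;
  // Population variance (assumption; a sample variance would divide by n - 1).
  const variance = kept.reduce((sum, v) => sum + (v - mean) ** 2, 0) / kept.length;
  return { mean, stdDev: Math.sqrt(variance), count: kept.length };
}
```

With samples such as `[10, 11, 12, 13, 1000]`, the single large value falls outside the upper fence and is dropped before the mean is taken, which is the behavior that keeps one slow iteration from dominating a timing metric.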

Scoping renders & warmup

  • pauseReactRecording() / resumeReactRecording() on the interaction context scope which renders and bench:paint are measured (e.g. only an interaction's re-render). Strict pair — wrong-state calls throw. The reactRecordingPaused option starts paused to exclude the mount.
  • Custom metrics recorded inside benchmark() now exclude warmup iterations, like renders and paint do.
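
The "strict pair" behavior of pauseReactRecording() / resumeReactRecording() can be sketched as a two-state switch that throws on wrong-state calls. The class name `RecordingSwitch` and the `startPaused` option are hypothetical stand-ins for illustration (the PR's real option is reactRecordingPaused on the harness); the error messages are invented.

```typescript
// Hypothetical sketch of a strict pause/resume pair; not the harness implementation.
class RecordingSwitch {
  private paused: boolean;

  constructor(options: { startPaused?: boolean } = {}) {
    // Starting paused mirrors the idea of excluding the initial mount.
    this.paused = options.startPaused ?? false;
  }

  pause(): void {
    if (this.paused) {
      throw new Error('pause() called while recording is already paused');
    }
    this.paused = true;
  }

  resume(): void {
    if (!this.paused) {
      throw new Error('resume() called while recording is already running');
    }
    this.paused = false;
  }

  get isRecording(): boolean {
    return !this.paused;
  }
}
```

Throwing instead of silently ignoring a redundant call makes an unbalanced pause/resume show up as a test failure rather than as a quietly mis-scoped measurement.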

@code-infra-dashboard

code-infra-dashboard Bot commented Jun 5, 2026

Deploy preview

https://deploy-preview-1520--mui-internal.netlify.app/

Bundle size

Total Size Change: 0B(0.00%) - Total Gzip Change: 0B(0.00%)
Files: 63 total (0 added, 0 removed, 0 changed)

Show details for 63 more bundles

@mui/internal-docs-infra/abstractCreateDemo parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/abstractCreateDemoClient parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/abstractCreateStream parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/abstractCreateTypes parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/ChunkProvider parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/cli parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/CodeControllerContext parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/CodeExternalsContext parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/CodeHighlighter parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/CodeHighlighter/errors parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/CodeHighlighter/types parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/CodeProvider parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/CoordinatedLazy parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/CoordinatedLazy/types parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/createDemoData parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/createDemoData/types parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/createSitemap parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/createSitemap/types parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useCode parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useCodeWindow parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useCoordinated parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useCopier parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useDemo parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useErrors parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useLocalStorageState parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/usePreference parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useScrollAnchor parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useSearch parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useSearch/types parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useStream parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useStream/types parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useType parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useTypes parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/useUrlHashState parsed: 0B(0.00%) gzip: 0B(0.00%)
@mui/internal-docs-infra/withDocsInfra parsed: 0B(0.00%) gzip: 0B(0.00%)
addLineGutters parsed: 0B(0.00%) gzip: 0B(0.00%)
CodeHighlighterChunk parsed: 0B(0.00%) gzip: 0B(0.00%)
CodeHighlighterClient parsed: 0B(0.00%) gzip: 0B(0.00%)
CodeInitialSourceLoader parsed: 0B(0.00%) gzip: 0B(0.00%)
CodeSourceLoader parsed: 0B(0.00%) gzip: 0B(0.00%)
createFrame parsed: 0B(0.00%) gzip: 0B(0.00%)
createParseSourceWorkerClient parsed: 0B(0.00%) gzip: 0B(0.00%)
EditingEngine parsed: 0B(0.00%) gzip: 0B(0.00%)
embedTransforms parsed: 0B(0.00%) gzip: 0B(0.00%)
enhanceCodeEmphasis parsed: 0B(0.00%) gzip: 0B(0.00%)
findExpandingRanges parsed: 0B(0.00%) gzip: 0B(0.00%)
getHastTextContent parsed: 0B(0.00%) gzip: 0B(0.00%)
grammarLoaders parsed: 0B(0.00%) gzip: 0B(0.00%)
grammars parsed: 0B(0.00%) gzip: 0B(0.00%)
isFrameSpan parsed: 0B(0.00%) gzip: 0B(0.00%)
loadIsomorphicCodeVariant parsed: 0B(0.00%) gzip: 0B(0.00%)
parseSource parsed: 0B(0.00%) gzip: 0B(0.00%)
source.css parsed: 0B(0.00%) gzip: 0B(0.00%)
source.js parsed: 0B(0.00%) gzip: 0B(0.00%)
source.json parsed: 0B(0.00%) gzip: 0B(0.00%)
source.mdx parsed: 0B(0.00%) gzip: 0B(0.00%)
source.shell parsed: 0B(0.00%) gzip: 0B(0.00%)
source.ts parsed: 0B(0.00%) gzip: 0B(0.00%)
source.tsx parsed: 0B(0.00%) gzip: 0B(0.00%)
source.yaml parsed: 0B(0.00%) gzip: 0B(0.00%)
text.html.basic parsed: 0B(0.00%) gzip: 0B(0.00%)
text.md parsed: 0B(0.00%) gzip: 0B(0.00%)
TransformEngine parsed: 0B(0.00%) gzip: 0B(0.00%)

Details of bundle changes

Performance

Total duration: 19.63 ms +2.67 ms(+15.8%) | Renders: 5 (🔺+1) | Paint: 0.00 ms ▼-71.71 ms(-100.0%)

Test | Duration | Renders
Counter click (interaction only) | 2.10 ms | 1
custom scalar + discrete metrics | 0.00 ms | 0
sub-series via labels | 0.00 ms | 0
Widget imperative update (metric only) | 0.00 ms | 0

3 tests within noise — details


Check out the code infra dashboard for more information about this PR.

…PR comment

Record Element-Timing values into a single 'paint' ScalarMetric with a sub-series
per identifier (paint#default, …) and a default alarm; drop IterationData.metrics,
BenchmarkMetric, and the reporter's aggregateMetrics. On the dashboard, drop the
paintDefault totals special-case, add an 'N metrics' accordion indicator, and surface
error-level metric alarms in the PR comment (warnings stay dashboard-only).

Copilot AI left a comment

Contributor

Pull request overview

Adds first-class custom metric primitives to @mui/internal-benchmark (scalar + discrete), aggregates them efficiently in the browser, and extends the reporter + dashboard to format and flag metric regressions based on per-metric definitions (with optional alarm thresholds). Paint timings are migrated to the same custom-metric pipeline as a paint scalar metric with sub-series (paint#...).

Changes:

  • Introduces Metric base + ScalarMetric/DiscreteMetric primitives and shared IQR aggregation (aggregateSamples), with per-test collection flushed into task.meta.benchmarkMetrics.
  • Updates benchmark harness + reporter to emit custom metric stats into existing metrics while hoisting per-metric config into optional top-level metricDefinitions.
  • Extends dashboard comparison/reporting to use metricDefinitions for formatting and alarm-based severities (error/warning), and surfaces error-level metric alarms in PR markdown.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.

Show a summary per file
File | Description
test/performance/tests/metric.bench.tsx | Adds benchmark test demonstrating scalar/discrete custom metrics and sub-series labels.
packages/benchmark/src/types.ts | Removes iteration-level metric samples and adds custom metric + definition types.
packages/benchmark/src/taskMetaAugmentation.ts | Extends Vitest TaskMeta with benchmarkMetrics for custom metric stats.
packages/benchmark/src/stats.ts | Adds aggregateSamples() shared aggregation helper (IQR filtering + mean/stdDev).
packages/benchmark/src/stats.test.ts | Adds unit tests for aggregateSamples().
packages/benchmark/src/ScalarMetric.ts | Adds ScalarMetric with time()/timeEnd() helper and scalar kind.
packages/benchmark/src/DiscreteMetric.ts | Adds DiscreteMetric with integer formatting defaults and discrete kind.
packages/benchmark/src/Metric.ts | Implements per-test metric accumulation, uniqueness checks, and flush to task.meta.
packages/benchmark/src/Metric.test.ts | Adds unit tests for name collisions and ScalarMetric.timeEnd() error cases.
packages/benchmark/src/index.tsx | Exports new metric APIs and migrates paint timing to paint scalar metric sub-series.
packages/benchmark/src/reporter.ts | Uses aggregateSamples(), merges benchmarkMetrics into report, and emits metricDefinitions.
packages/benchmark/src/reporter.test.ts | Updates iteration fixture and adds tests for custom metric merge + definition hoisting.
packages/benchmark/src/ciReport.ts | Extends Zod upload schema to include optional metricDefinitions.
packages/benchmark/README.md | Documents custom metrics API and updates paint metric naming to paint#....
apps/code-infra-dashboard/src/views/BenchmarkDetails.tsx | Passes metricDefinitions into the comparison view.
apps/code-infra-dashboard/src/utils/formatters.ts | Adds cached Intl.NumberFormat helpers for metric formatting/diffs.
apps/code-infra-dashboard/src/lib/ciReports/benchmarkReport.ts | Passes definitions into compareBenchmarkReports() for markdown generation.
apps/code-infra-dashboard/src/lib/benchmark/types.ts | Adds dashboard-side MetricDefinition types and wires into upload shape.
apps/code-infra-dashboard/src/lib/benchmark/compareBenchmarkReports.ts | Adds alarm-aware metric diffing and introduces warning severity.
apps/code-infra-dashboard/src/lib/benchmark/compareBenchmarkReports.test.ts | Adds test coverage for custom metric comparisons and severities.
apps/code-infra-dashboard/src/lib/benchmark/buildMarkdownReport.ts | Adds “Metric alarms” section (error-level only) and warning severity support.
apps/code-infra-dashboard/src/lib/benchmark/buildMarkdownReport.test.ts | Adds tests for metric alarm surfacing rules.
apps/code-infra-dashboard/src/components/DailyBenchmarkChart.tsx | Updates paint series key used for charting and propagates metric definitions.
apps/code-infra-dashboard/src/components/BenchmarkComparisonReportView.tsx | Formats metric values/diffs using per-metric formats and shows warning severity.

Comment thread apps/code-infra-dashboard/src/components/DailyBenchmarkChart.tsx
Comment thread apps/code-infra-dashboard/src/utils/formatters.ts
Comment thread apps/code-infra-dashboard/src/lib/benchmark/compareBenchmarkReports.ts Outdated
Comment thread packages/benchmark/src/reporter.ts Outdated
… diff formatting

Address PR review: centralize report migrations in a migrateBenchmarkReport()
applied at the fetch boundary (renames legacy paint:* keys to bench:paint), make
scalar diff hints respect the metric's format instead of hard-coding ms, and
enforce the sign in formatMetricDiff.
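
A rough sketch of what "respect the metric's format" and "enforce the sign" could look like with cached Intl.NumberFormat instances. The helper names `getFormatter` and `formatDiffSketch`, the fixed `en-US` locale, the JSON-based cache key, and the choice of `signDisplay: 'exceptZero'` are assumptions for illustration, not the dashboard's actual formatMetricDiff.

```typescript
// Hypothetical sketch; the real helpers live in the dashboard's formatters.ts.
const formatterCache = new Map<string, Intl.NumberFormat>();

// Reuse one Intl.NumberFormat per distinct options object.
// Note: JSON.stringify is key-order sensitive, which only costs an extra cache entry.
function getFormatter(options: Intl.NumberFormatOptions): Intl.NumberFormat {
  const key = JSON.stringify(options);
  let formatter = formatterCache.get(key);
  if (!formatter) {
    formatter = new Intl.NumberFormat('en-US', options);
    formatterCache.set(key, formatter);
  }
  return formatter;
}

// Format a diff with the metric's own format, forcing an explicit sign on
// non-zero values regardless of what the metric's format requested.
function formatDiffSketch(diff: number, format: Intl.NumberFormatOptions = {}): string {
  return getFormatter({ ...format, signDisplay: 'exceptZero' }).format(diff);
}
```

Spreading `format` first and setting `signDisplay` last is what "enforces" the sign here: a metric that configured its own signDisplay cannot suppress the plus or minus on a diff.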

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 8 comments.

Comment thread packages/benchmark/README.md Outdated
Comment thread apps/code-infra-dashboard/src/lib/benchmark/types.ts Outdated
Comment thread apps/code-infra-dashboard/src/views/BenchmarkDetails.tsx Outdated
Comment thread apps/code-infra-dashboard/src/lib/ciReports/benchmarkReport.ts Outdated
Comment thread packages/benchmark/src/Metric.ts
Comment thread packages/benchmark/src/reporter.ts
Comment thread packages/benchmark/src/types.ts Outdated
Comment thread apps/code-infra-dashboard/src/components/DailyBenchmarkChart.tsx Outdated
…heck, reject '#' names, merge base+head definitions

- Reporter resets accumulated state in onTestRunStart so watch re-runs start clean
  (an edited metric config no longer conflicts with its previous-run definition).
- Conflicting-definition check uses an order-insensitive comparison.
- Metric names containing '#' (the sub-series separator) are rejected at construction.
- Dashboard + PR markdown merge base and head metricDefinitions (head wins) so a
  base-only/removed metric keeps its formatting and alarm metadata.
- Clarify docs: error defaults to the global noise band only when both bands are omitted.
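
The '#' rule follows from the name#id sub-series convention: if a metric name could itself contain '#', a base series and a labeled sub-series would be indistinguishable in the report keys. A minimal sketch, with hypothetical helper names `assertValidMetricName` and `seriesKey`:

```typescript
// Hypothetical sketch of the name#id key scheme; helper names are illustrative.
const SUB_SERIES_SEPARATOR = '#';

// Reject names that contain the separator so keys stay unambiguous.
function assertValidMetricName(name: string): void {
  if (name.includes(SUB_SERIES_SEPARATOR)) {
    throw new Error(
      `Metric name "${name}" must not contain "${SUB_SERIES_SEPARATOR}" (reserved for sub-series)`,
    );
  }
}

// Build the report key: the bare name for the base series, name#id for a sub-series.
function seriesKey(name: string, id?: string): string {
  assertValidMetricName(name);
  return id === undefined ? name : `${name}${SUB_SERIES_SEPARATOR}${id}`;
}
```

For example, `seriesKey('paint', 'default')` yields the `paint#default` key shape mentioned earlier in this conversation, while `seriesKey('paint#default')` throws.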
…reBenchmarkReports

compareBenchmarkReports now takes both sides' definitions and reconciles them internally
(head wins), so callers pass each side's raw definitions and can't forget to merge.
mergeMetricDefinitions is now private; the View gained a baseDefinitions prop.
…input

compareBenchmarkReports now takes report-objects that carry their own
metricDefinitions (BenchmarkComparisonInput) instead of separate report +
definitions args. Each metric uses the definition from the side it appears on,
so the mergeMetricDefinitions helper is gone. Test fixtures build the input
bundle directly.
Two orthogonal, per-test mechanisms in the benchmark harness:

- Custom metrics recorded inside benchmark() now honor warmup exclusion via
  an internal per-test gate (metricsGate), matching how renders and bench:paint
  are already excluded. Standalone it() loops are unaffected.
- A React recording switch lets interactions scope which renders/paint are
  measured: pauseReactRecording()/resumeReactRecording() (strict pair, throw on
  wrong state) plus a reactRecordingPaused option to start paused and exclude
  the mount. Paint is attributed by renderTime so async observation can't
  misattribute a paused-frame paint.
@zannager zannager added the scope: code-infra Involves the code-infra product (https://www.notion.so/mui-org/5562c14178aa42af97bc1fa5114000cd). label Jun 8, 2026
The blanket "renders > 0" assertion false-failed benchmarks that pause
recording and measure imperatively (no React renders) or via custom metrics
only. Replace it with a per-window check: every active recording window must
capture at least one render, but windows where recording was never running are
not checked — so a fully-paused, metric-only benchmark passes.

createReactRecordingControls tracks per-window render presence (markRendered /
finalizeWindow / hadEmptyActiveWindow); the harness aggregates across iterations
and fails only on an active window that measured nothing.
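
The per-window check can be sketched as follows. The method names markRendered, finalizeWindow, and hadEmptyActiveWindow come from the commit message above, but their signatures, the factory name `createWindowTracker`, and the internal state are assumptions; the strict pause/resume validation is left out to keep the sketch short.

```typescript
// Hypothetical sketch of per-window render tracking; not createReactRecordingControls itself.
function createWindowTracker(startPaused = false) {
  let recording = !startPaused;
  let renderedInWindow = false;
  let emptyActiveWindow = false;

  // Close the current window; flag it only if it was active and captured nothing.
  function finalizeWindow(): void {
    if (recording && !renderedInWindow) {
      emptyActiveWindow = true;
    }
    renderedInWindow = false;
  }

  return {
    // Called for each React render; renders while paused are ignored.
    markRendered(): void {
      if (recording) {
        renderedInWindow = true;
      }
    },
    // Pausing ends the active window.
    pause(): void {
      finalizeWindow();
      recording = false;
    },
    // Resuming opens a fresh window.
    resume(): void {
      recording = true;
      renderedInWindow = false;
    },
    finalizeWindow,
    hadEmptyActiveWindow(): boolean {
      return emptyActiveWindow;
    },
  };
}
```

Under these assumptions a benchmark that starts paused and never resumes has no active window, so finalizing it flags nothing, whereas a benchmark that records from the start and sees no render is flagged.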

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 34 out of 34 changed files in this pull request and generated 3 comments.

Comment thread packages/benchmark/src/Metric.ts
Comment thread packages/benchmark/src/ScalarMetric.ts
Comment thread apps/code-infra-dashboard/src/utils/formatters.ts
Janpot and others added 3 commits June 9, 2026 09:59
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Jan Potoms <2109932+Janpot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Jan Potoms <2109932+Janpot@users.noreply.github.com>
The Metric.ts autofix dropped the `const test = getCurrentTest()` declaration
(leaving every use of `test` undefined) and a stray brace — restore it, keeping
the improved "running Vitest test" wording. The ScalarMetric.ts autofix escaped
quotes inside a template literal (no-useless-escape); unescape them. Add tests
for the new time()-already-running guard.
@Janpot Janpot added scope: code-infra Involves the code-infra product (https://www.notion.so/mui-org/5562c14178aa42af97bc1fa5114000cd). and removed scope: code-infra Involves the code-infra product (https://www.notion.so/mui-org/5562c14178aa42af97bc1fa5114000cd). labels Jun 9, 2026
Comment thread packages/benchmark/README.md Outdated
Comment thread packages/benchmark/README.md Outdated
Comment on lines +170 to +171
- `warn` — softer band; a regression past it is flagged as a warning.
- `error` — harder band; a regression past it is flagged as an error. Defaults to the dashboard's global noise band only when both `warn` and `error` are omitted; with only `warn` set there is no error band (warning-only).

Member

I suppose error makes the test suite fail, but warn doesn't?

If warn is 0.1 and error is 0.2, I get a warning for like 15% regression, but only get an error for a regression greater than 20%, right?

@Janpot Janpot Jun 12, 2026

Member Author

Neither makes the test suite fail. These alarms are measured against the baseline value. This is calculated when the PR comment is generated.

If warn is 0.1 and error is 0.2, I get a warning for like 15% regression, but only get an error for a regression greater than 20%, right?

Yes, only errors fail the PR comment
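
To make the warn/error arithmetic from this thread concrete, here is a minimal sketch of classifying a head value against its baseline. It assumes strict "greater than" comparisons, a `higherIsBetter` counterpart to the documented `lowerIsBetter`, and it does not model the fallback to the dashboard's global noise band; `classifyAlarm` is a hypothetical name, not the dashboard's comparison code.

```typescript
// Hypothetical sketch of alarm classification; not compareBenchmarkReports itself.
type Direction = 'lowerIsBetter' | 'higherIsBetter'; // 'higherIsBetter' is an assumed name
type Severity = 'error' | 'warning' | 'none';

interface AlarmSketch {
  direction?: Direction;
  warn?: number;
  error?: number;
}

function classifyAlarm(
  kind: 'scalar' | 'discrete',
  base: number,
  head: number,
  alarm: AlarmSketch,
): Severity {
  const delta = head - base;
  // Positive "regression" always means "got worse" for the chosen direction.
  const regression = (alarm.direction ?? 'lowerIsBetter') === 'lowerIsBetter' ? delta : -delta;
  if (kind === 'scalar' && base === 0) {
    return 'none'; // assumed guard: a relative change is undefined against a zero baseline
  }
  // Scalar bands are relative fractions; discrete bands are absolute count deltas.
  const magnitude = kind === 'scalar' ? regression / Math.abs(base) : regression;
  if (alarm.error !== undefined && magnitude > alarm.error) {
    return 'error';
  }
  if (alarm.warn !== undefined && magnitude > alarm.warn) {
    return 'warning';
  }
  return 'none';
}
```

With `warn: 0.1` and `error: 0.2`, a scalar going from 100 to 115 is a 15% regression and lands in the warning band, while 100 to 125 crosses the error band. Neither outcome fails the local test run, since the comparison happens when the PR comment is generated.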

Comment thread packages/benchmark/src/stats.ts Outdated
Comment thread packages/benchmark/src/ScalarMetric.ts Outdated
);
}
this.pending.delete(key);
this.record(performance.now() - start, label !== undefined ? { id: label } : undefined);

Member

Probably a micro-optimization, but I wonder if performance.now() should be at the start to reduce the overhead of the metric gathering.

Otherwise, the map access and key deletion is measured as part of the metric.

Comment thread packages/benchmark/src/ScalarMetric.ts Outdated
Comment thread packages/benchmark/src/reporter.ts Outdated
// matches (e.g. the harness `bench:paint`), but conflicting config would silently apply
// last-write-wins to every entry — reject it instead.
const existing = definitions[metricName];
if (existing && stableStringify(existing) !== stableStringify(definition)) {

Member

If we're using stringify for a deep equals, might as well implement deep equals directly.
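
For reference, a direct structural comparison over plain JSON-like values (which is what a metric definition appears to be) is short. This is a generic sketch under that assumption, with a hypothetical name `deepEqualSketch`; it is key-order insensitive and does not handle Maps, Sets, Dates, or cyclic references.

```typescript
// Generic sketch of a deep-equality check for plain objects, arrays, and primitives.
function deepEqualSketch(a: unknown, b: unknown): boolean {
  if (Object.is(a, b)) {
    return true;
  }
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) {
    return false;
  }
  const aRec = a as Record<string, unknown>;
  const bRec = b as Record<string, unknown>;
  const aKeys = Object.keys(aRec);
  const bKeys = Object.keys(bRec);
  if (aKeys.length !== bKeys.length) {
    return false;
  }
  // Every key of a must exist on b with a deeply equal value; key order is irrelevant.
  return aKeys.every(
    (key) =>
      Object.prototype.hasOwnProperty.call(bRec, key) && deepEqualSketch(aRec[key], bRec[key]),
  );
}
```

Compared with stringifying both sides, this avoids allocating two strings per check and makes the order-insensitivity explicit rather than a property of the serializer.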

Comment thread packages/benchmark/src/reactRecording.ts Outdated
Comment thread packages/benchmark/src/ciReport.ts Outdated
…rted

- stats: use Array.prototype.toSorted instead of [...].sort()
- ScalarMetric.timeEnd / reactRecording pause-resume: capture performance.now()
  before bookkeeping so the surrounding work isn't part of the measurement
- ScalarMetric: drop the superclass-method reference from the class JSDoc
- reporter: replace stableStringify-for-equality with a direct deepEqual
- ciReport: alarm warn/error thresholds must be >= 0
- README: record() or time() inside benchmark(); clarify alarms are evaluated
  for the PR comment, not the local run
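
The timing change in this commit (read the clock before any bookkeeping) can be illustrated with a small console.time-style timer. `TimerSketch`, its injectable clock, and the error messages are assumptions for illustration and not the ScalarMetric implementation; the two guards mirror the "already running" and "timeEnd without time" cases mentioned in this conversation.

```typescript
// Hypothetical sketch of a time()/timeEnd() pair; not the actual ScalarMetric.
class TimerSketch {
  private readonly pending = new Map<string, number>();
  private readonly now: () => number;
  readonly samples: number[] = [];

  // The clock is injectable so the behavior can be checked deterministically.
  constructor(now: () => number = () => performance.now()) {
    this.now = now;
  }

  time(label = ''): void {
    if (this.pending.has(label)) {
      throw new Error(`time("${label}") is already running`);
    }
    // Read the clock as the last step so the guard above is not measured.
    this.pending.set(label, this.now());
  }

  timeEnd(label = ''): void {
    // Read the clock first so the map lookup and delete are not measured.
    const end = this.now();
    const start = this.pending.get(label);
    if (start === undefined) {
      throw new Error(`timeEnd("${label}") called without a matching time()`);
    }
    this.pending.delete(label);
    this.samples.push(end - start);
  }
}
```

The overhead removed this way is tiny per call, but it is systematic, so it would otherwise bias every recorded sample in the same direction.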

Labels

scope: code-infra Involves the code-infra product (https://www.notion.so/mui-org/5562c14178aa42af97bc1fa5114000cd).

5 participants