[benchmark] Add custom ScalarMetric and DiscreteMetric primitives#1520
[benchmark] Add custom ScalarMetric and DiscreteMetric primitives#1520Janpot wants to merge 19 commits into
Conversation
Deploy previewhttps://deploy-preview-1520--mui-internal.netlify.app/ Bundle sizeTotal Size Change: 0B(0.00%) - Total Gzip Change: 0B(0.00%) Show details for 63 more bundles@mui/internal-docs-infra/abstractCreateDemo parsed: 0B(0.00%) gzip: 0B(0.00%) PerformanceTotal duration: 19.63 ms +2.67 ms(+15.8%) | Renders: 5 (🔺+1) | Paint: 0.00 ms ▼-71.71 ms(-100.0%)
3 tests within noise — details Check out the code infra dashboard for more information about this PR. |
…PR comment Record Element-Timing values into a single 'paint' ScalarMetric with a sub-series per identifier (paint#default, …) and a default alarm; drop IterationData.metrics, BenchmarkMetric, and the reporter's aggregateMetrics. On the dashboard, drop the paintDefault totals special-case, add an 'N metrics' accordion indicator, and surface error-level metric alarms in the PR comment (warnings stay dashboard-only).
There was a problem hiding this comment.
Pull request overview
Adds first-class custom metric primitives to @mui/internal-benchmark (scalar + discrete), aggregates them efficiently in the browser, and extends the reporter + dashboard to format and flag metric regressions based on per-metric definitions (with optional alarm thresholds). Paint timings are migrated to the same custom-metric pipeline as a paint scalar metric with sub-series (paint#...).
Changes:
- Introduces
Metricbase +ScalarMetric/DiscreteMetricprimitives and shared IQR aggregation (aggregateSamples), with per-test collection flushed intotask.meta.benchmarkMetrics. - Updates benchmark harness + reporter to emit custom metric stats into existing
metricswhile hoisting per-metric config into optional top-levelmetricDefinitions. - Extends dashboard comparison/reporting to use
metricDefinitionsfor formatting and alarm-based severities (error/warning), and surfaces error-level metric alarms in PR markdown.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/performance/tests/metric.bench.tsx | Adds benchmark test demonstrating scalar/discrete custom metrics and sub-series labels. |
| packages/benchmark/src/types.ts | Removes iteration-level metric samples and adds custom metric + definition types. |
| packages/benchmark/src/taskMetaAugmentation.ts | Extends Vitest TaskMeta with benchmarkMetrics for custom metric stats. |
| packages/benchmark/src/stats.ts | Adds aggregateSamples() shared aggregation helper (IQR filtering + mean/stdDev). |
| packages/benchmark/src/stats.test.ts | Adds unit tests for aggregateSamples(). |
| packages/benchmark/src/ScalarMetric.ts | Adds ScalarMetric with time()/timeEnd() helper and scalar kind. |
| packages/benchmark/src/DiscreteMetric.ts | Adds DiscreteMetric with integer formatting defaults and discrete kind. |
| packages/benchmark/src/Metric.ts | Implements per-test metric accumulation, uniqueness checks, and flush to task.meta. |
| packages/benchmark/src/Metric.test.ts | Adds unit tests for name collisions and ScalarMetric.timeEnd() error cases. |
| packages/benchmark/src/index.tsx | Exports new metric APIs and migrates paint timing to paint scalar metric sub-series. |
| packages/benchmark/src/reporter.ts | Uses aggregateSamples(), merges benchmarkMetrics into report, and emits metricDefinitions. |
| packages/benchmark/src/reporter.test.ts | Updates iteration fixture and adds tests for custom metric merge + definition hoisting. |
| packages/benchmark/src/ciReport.ts | Extends Zod upload schema to include optional metricDefinitions. |
| packages/benchmark/README.md | Documents custom metrics API and updates paint metric naming to paint#.... |
| apps/code-infra-dashboard/src/views/BenchmarkDetails.tsx | Passes metricDefinitions into the comparison view. |
| apps/code-infra-dashboard/src/utils/formatters.ts | Adds cached Intl.NumberFormat helpers for metric formatting/diffs. |
| apps/code-infra-dashboard/src/lib/ciReports/benchmarkReport.ts | Passes definitions into compareBenchmarkReports() for markdown generation. |
| apps/code-infra-dashboard/src/lib/benchmark/types.ts | Adds dashboard-side MetricDefinition types and wires into upload shape. |
| apps/code-infra-dashboard/src/lib/benchmark/compareBenchmarkReports.ts | Adds alarm-aware metric diffing and introduces warning severity. |
| apps/code-infra-dashboard/src/lib/benchmark/compareBenchmarkReports.test.ts | Adds test coverage for custom metric comparisons and severities. |
| apps/code-infra-dashboard/src/lib/benchmark/buildMarkdownReport.ts | Adds “Metric alarms” section (error-level only) and warning severity support. |
| apps/code-infra-dashboard/src/lib/benchmark/buildMarkdownReport.test.ts | Adds tests for metric alarm surfacing rules. |
| apps/code-infra-dashboard/src/components/DailyBenchmarkChart.tsx | Updates paint series key used for charting and propagates metric definitions. |
| apps/code-infra-dashboard/src/components/BenchmarkComparisonReportView.tsx | Formats metric values/diffs using per-metric formats and shows warning severity. |
… diff formatting Address PR review: centralize report migrations in a migrateBenchmarkReport() applied at the fetch boundary (renames legacy paint:* keys to bench:paint), make scalar diff hints respect the metric's format instead of hard-coding ms, and enforce the sign in formatMetricDiff.
…heck, reject '#' names, merge base+head definitions - Reporter resets accumulated state in onTestRunStart so watch re-runs start clean (an edited metric config no longer conflicts with its previous-run definition). - Conflicting-definition check uses an order-insensitive comparison. - Metric names containing '#' (the sub-series separator) are rejected at construction. - Dashboard + PR markdown merge base and head metricDefinitions (head wins) so a base-only/removed metric keeps its formatting and alarm metadata. - Clarify docs: error defaults to the global noise band only when both bands are omitted.
…reBenchmarkReports compareBenchmarkReports now takes both sides' definitions and reconciles them internally (head wins), so callers pass each side's raw definitions and can't forget to merge. mergeMetricDefinitions is now private; the View gained a baseDefinitions prop.
…input compareBenchmarkReports now takes report-objects that carry their own metricDefinitions (BenchmarkComparisonInput) instead of separate report + definitions args. Each metric uses the definition from the side it appears on, so the mergeMetricDefinitions helper is gone. Test fixtures build the input bundle directly.
Two orthogonal, per-test mechanisms in the benchmark harness: - Custom metrics recorded inside benchmark() now honor warmup exclusion via an internal per-test gate (metricsGate), matching how renders and bench:paint are already excluded. Standalone it() loops are unaffected. - A React recording switch lets interactions scope which renders/paint are measured: pauseReactRecording()/resumeReactRecording() (strict pair, throw on wrong state) plus a reactRecordingPaused option to start paused and exclude the mount. Paint is attributed by renderTime so async observation can't misattribute a paused-frame paint.
The blanket "renders > 0" assertion false-failed benchmarks that pause recording and measure imperatively (no React renders) or via custom metrics only. Replace it with a per-window check: every active recording window must capture at least one render, but windows where recording was never running are not checked — so a fully-paused, metric-only benchmark passes. createReactRecordingControls tracks per-window render presence (markRendered / finalizeWindow / hadEmptyActiveWindow); the harness aggregates across iterations and fails only on an active window that measured nothing.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Jan Potoms <2109932+Janpot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Jan Potoms <2109932+Janpot@users.noreply.github.com>
The Metric.ts autofix dropped the `const test = getCurrentTest()` declaration (leaving every use of `test` undefined) and a stray brace — restore it, keeping the improved "running Vitest test" wording. The ScalarMetric.ts autofix escaped quotes inside a template literal (no-useless-escape); unescape them. Add tests for the new time()-already-running guard.
| - `warn` — softer band; a regression past it is flagged as a warning. | ||
| - `error` — harder band; a regression past it is flagged as an error. Defaults to the dashboard's global noise band only when both `warn` and `error` are omitted; with only `warn` set there is no error band (warning-only). |
There was a problem hiding this comment.
I suppose error makes the test suite fail, but warn doesn't?
If warn is 0.1 and error is 0.2, I get an warning for like 15% regression, but only get an error for a regression greater than 20%, right?
There was a problem hiding this comment.
Neither makes the test suite fail. These alarms are measured against the baseline value. This is calculated when the PR comment is generated.
If warn is 0.1 and error is 0.2, I get an warning for like 15% regression, but only get an error for a regression greater than 20%, right?
yes, only errors fail the PR conment
| ); | ||
| } | ||
| this.pending.delete(key); | ||
| this.record(performance.now() - start, label !== undefined ? { id: label } : undefined); |
There was a problem hiding this comment.
Probably a micro-optimization, but I wonder if performance.now() should be at the start to reduce the overhead of the metric gathering.
Otherwise, the map access and key deletion is measured as part of the metric.
| // matches (e.g. the harness `bench:paint`), but conflicting config would silently apply | ||
| // last-write-wins to every entry — reject it instead. | ||
| const existing = definitions[metricName]; | ||
| if (existing && stableStringify(existing) !== stableStringify(definition)) { |
There was a problem hiding this comment.
If we're using stringify for a deep equals, might as well implement deep equals directly.
…rted - stats: use Array.prototype.toSorted instead of [...].sort() - ScalarMetric.timeEnd / reactRecording pause-resume: capture performance.now() before bookkeeping so the surrounding work isn't part of the measurement - ScalarMetric: drop the superclass-method reference from the class JSDoc - reporter: replace stableStringify-for-equality with a direct deepEqual - ciReport: alarm warn/error thresholds must be >= 0 - README: record() or time() inside benchmark(); clarify alarms are evaluated for the PR comment, not the local run
Adds user-defined custom metrics to
@mui/internal-benchmark, recordable from a plainit()loop, and migrates the harness's built-in paint timings onto the same primitive. The dashboard and PR comment gain per-metric alarm flagging.API
ScalarMetric— continuous values; mean ± σ with IQR outlier removal;time()/timeEnd()timing helper. Alarm bands are relative fractions.DiscreteMetric— counts; exact-integer comparison, integer formatting. Alarm bands are absolute count deltas.record(value, { id })opens a labeledname#idsub-series; base + sub-series can mix on one metric. Metrics attach to the running test viaTestRunner.getCurrentTest(), so one instance works across tests/iterations, inside or outside React.alarm(optional) flags regressions vs the baseline. Its presence replaces a mode flag; absence = informational.warn/errorare two severity bands (amber/red);direction(defaultlowerIsBetter) picks which way is a regression. Reusing a metric name with conflicting config across benchmarks is rejected.Scoping renders & warmup
pauseReactRecording()/resumeReactRecording()on the interaction context scope which renders andbench:paintare measured (e.g. only an interaction's re-render). Strict pair — wrong-state calls throw. ThereactRecordingPausedoption starts paused to exclude the mount.benchmark()now exclude warmup iterations, like renders and paint do.