Context
The benchmarks/ directory was originally built to compare Keras 3 against tf.keras, which made sense at the time. But now that JAX, PyTorch, and TF backends are all properly supported, the more useful question is how does the same Keras code actually perform across backends?
The development roadmap (#19519) lists "Official performance benchmarks" as a goal, so I wanted to gauge what that could look like and offer to help.
The main gap is that the benchmarks only compare Keras 3 vs tf.keras. `LayerBenchmark` hardcodes both a `keras.layers.X` and a `tf.keras.layers.X` instance and runs them side by side. There is no way to benchmark the same layer across the JAX, Torch, and TF backends, which is probably what most users actually want to know.
A few other things I noticed while looking through it:
- Results just go to `print()`, with no JSON or CSV output, so there is no easy way to compare runs or track trends over time
- Benchmarks are not run in `actions.yml` or `nightly.yml`, so performance regressions only get caught when someone files a bug
- Only throughput is measured, with no memory tracking, which matters a lot for RNN layers and loss functions (I ran into this myself while working on "Reduce memory usage in sparse_categorical_crossentropy", #22169)
- The warmup is just skipping batch 0, which is not enough to account for JIT compilation noise on JAX or PyTorch
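To make the last point concrete, a measurement loop with an explicit warmup phase could look something like this. This is just a sketch; `benchmark_fn` and its defaults are hypothetical, not existing Keras benchmark code:

```python
import statistics
import time


def benchmark_fn(fn, num_warmup=10, num_runs=30):
    """Time `fn` with an explicit warmup to absorb JIT compilation cost.

    On JAX and PyTorch the first few calls can include tracing and
    compilation, so they are run and discarded before timing starts.
    Returns (mean, std) of per-call wall time in seconds.
    """
    for _ in range(num_warmup):
        fn()
    times = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)
```

For async backends, `fn` would also need to block on the result (e.g. `jax.block_until_ready` or a CUDA sync) so that wall time actually reflects the computation.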
What I'd propose
Rather than one big PR, I'd tackle this incrementally:
- Multi-backend `LayerBenchmark` - refactor the base class to accept a `--backend jax,torch,tensorflow` flag and produce a comparison table
- Structured output - add an `--output_format json` flag so results include metadata (backend, hardware, shapes, timestamps)
- Memory profiling - track peak memory alongside throughput using `torch.cuda.max_memory_allocated`, JAX device memory stats, etc.
- Better measurement - configurable warmup, multiple runs, mean/std reporting
- CI integration - a nightly workflow that runs benchmarks and stores results as artifacts; stretch goal would be a perf diff comment on PRs
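As a rough idea of what the structured output could contain, here is a hypothetical result record; the field names are placeholders I made up to illustrate the metadata, not a settled schema:

```python
import json
import platform
import time


def make_result_record(layer_name, backend, throughput, peak_memory_bytes, input_shape):
    # Hypothetical schema: bundle the measurement with enough metadata
    # to compare runs across backends, hardware, and time.
    return {
        "layer": layer_name,
        "backend": backend,
        "input_shape": list(input_shape),
        "throughput_samples_per_sec": throughput,
        "peak_memory_bytes": peak_memory_bytes,
        "hardware": platform.machine(),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }


record = make_result_record("Conv2D", "jax", 1234.5, 2**20, (32, 224, 224, 3))
print(json.dumps(record, indent=2))
```

Records like this could be appended to a JSONL artifact per run, which makes trend tracking and perf diffs straightforward.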
Where I'd start
The first PR would just be Phase 1: refactor `LayerBenchmark` to support multi-backend runs, update `conv_benchmark.py` as a reference implementation, and keep the existing Keras-vs-tf.keras path working so nothing breaks.
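Since Keras reads `KERAS_BACKEND` once at import time, one way to drive the multi-backend runs is a fresh subprocess per backend. A sketch of what I have in mind (the helper name and CLI shape are my own, not existing benchmark code):

```python
import os
import subprocess
import sys


def run_for_backend(backend, args):
    """Run a benchmark script under `backend` in a clean subprocess.

    Keras picks its backend from KERAS_BACKEND at import time, so an
    in-process switch is not possible; each backend gets its own
    process. `args` is the argv passed to the Python interpreter.
    """
    env = dict(os.environ, KERAS_BACKEND=backend)
    return subprocess.run(
        [sys.executable, *args], env=env, capture_output=True, text=True
    )
```

The comparison table would then be built by collecting each subprocess's JSON output, e.g. `run_for_backend("jax", ["conv_benchmark.py", "--output_format", "json"])` for each backend in turn.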
I've been contributing for a while (#22257, #22115, #22169, #22013), and I'm reasonably familiar with how the backends are structured, so this feels like a natural next thing to work on. Happy to adjust based on what's actually useful here, and if anyone would like to collaborate on this I'd be more than happy to!