fix(indexer): use a static divisor when possible #217

Merged: yoshuawuyts merged 1 commit into yoshuawuyts:main from Ddystopia:const_indexer on Jan 18, 2026

@Ddystopia Ddystopia commented Jan 8, 2026

Contributor

futures_concurrency is a no_std-friendly crate, and on many architectures, such as cortex-m4, integer division by a runtime value is a very expensive operation. It is so expensive that any preemption discards the in-flight instruction, which is then restarted from scratch. Using a generic constant allows the compiler to optimize the division and use bit operations instead.
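
To illustrate the difference, here is a minimal sketch. It is not the crate's actual indexer code: the names `next_runtime`, `next_const`, and `next_mask` are hypothetical, and it assumes the index is advanced with a wrapping `%`. When the divisor is a const generic, the compiler knows its value at compile time and, for a power of two, can typically lower the remainder to a bit mask.

```rust
/// Hypothetical helper: the divisor is only known at run time,
/// so the compiler generally has to emit a real division instruction.
pub fn next_runtime(index: usize, len: usize) -> usize {
    (index + 1) % len
}

/// Hypothetical helper: the divisor is a const generic, so every
/// instantiation sees a compile-time constant it can optimize around.
pub fn next_const<const N: usize>(index: usize) -> usize {
    (index + 1) % N
}

/// The bit-mask form that is equivalent to `next_const` when `N`
/// is a power of two. Written out by hand here only for comparison.
pub fn next_mask<const N: usize>(index: usize) -> usize {
    debug_assert!(N.is_power_of_two());
    (index + 1) & (N - 1)
}
```

All three functions return the same values for a power-of-two length; only the instructions the compiler is able to emit differ.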

Ddystopia force-pushed the const_indexer branch 2 times, most recently from 9a7d640 to ad4d97f, January 8, 2026 20:35
@yoshuawuyts

Owner

Thanks for filing this! I have some trouble evaluating this PR, because it introduces a fair bit of complexity in what otherwise is rather simple. But on the other hand: it sounds like this may yield significant performance benefits for embedded systems, and that seems worth it.

So I think I'd like to see two things before accepting this:

  1. A benchmark comparison showing the before/after numbers with this change. I believe it should be possible to run the test suite without the std feature to get numbers out.
  2. More docs explaining why this code exists. The PR description here does a good job imo, but the code checked in should also mention the rationale. I know myself, and about a year from now I'll be wondering what this was and why it exists.

I hope those are reasonable asks! Thanks again for filing this!

@Ddystopia

Ddystopia commented Jan 10, 2026

Contributor Author

Okay, I made benchmarks. First of all, I will note that of course no one runs select/race in a tight synchronous loop. But throughout an application there are a lot of calls to poll, especially if you use futures-concurrency and have big trees: your distant siblings can cause you to be polled. The cost of select won't show up in a flamegraph, as it is minuscule and spread throughout the whole application. But the cost is still there.

I used criterion for x86_64 and crude manual measurements for a cortex-m4 ra6m3 chip.

In conclusion, for this specific benchmark: on x86_64 it gave a 2x boost in speed and a 4-20x boost in throughput (which reflects, for example, how much it stalls other unrelated operations). On cortex-m4 at Os the improvement in speed is 5%; at O3 it is 40% faster for 8x1 and 40% slower for 4x2 and 2x2x2. I am not really sure why it is like this. Maybe when the chain is longer and Indexer does more work it is better to use div, and in the other cases the compiler is being smart, I don't know. Anyway, O3 is rarely used on microcontrollers, as there just isn't enough FLASH or icache for that much code (the whole codebase, I mean).

Div = use % by runtime value
Con = use % by generic constant

X86_64

Con Os

poll-race/1x8/          time:   [1.4576 µs 1.4598 µs 1.4621 µs]
                        thrpt:  [683.97 Melem/s 685.01 Melem/s 686.04 Melem/s]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
poll-race/2x4/          time:   [11.578 µs 11.609 µs 11.641 µs]
                        thrpt:  [85.905 Melem/s 86.139 Melem/s 86.369 Melem/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
poll-race/2x2x2/        time:   [18.204 µs 18.222 µs 18.242 µs]
                        thrpt:  [54.819 Melem/s 54.878 Melem/s 54.934 Melem/s]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

Div Os (performance regressed with respect to Con Os)


poll-race/1x8/          time:   [43.034 µs 43.044 µs 43.051 µs]
                        thrpt:  [23.228 Melem/s 23.232 Melem/s 23.237 Melem/s]
                 change:
                        time:   [+2846.7% +2852.3% +2857.7%] (p = 0.00 < 0.05)
                        thrpt:  [-96.619% -96.613% -96.606%]
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) low severe
  5 (5.00%) high mild
  7 (7.00%) high severe
poll-race/2x4/          time:   [81.415 µs 81.427 µs 81.440 µs]
                        thrpt:  [12.279 Melem/s 12.281 Melem/s 12.283 Melem/s]
                 change:
                        time:   [+596.36% +599.09% +601.50%] (p = 0.00 < 0.05)
                        thrpt:  [-85.745% -85.696% -85.640%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
poll-race/2x2x2/        time:   [100.73 µs 100.75 µs 100.78 µs]
                        thrpt:  [9.9223 Melem/s 9.9251 Melem/s 9.9275 Melem/s]
                 change:
                        time:   [+452.61% +453.41% +454.23%] (p = 0.00 < 0.05)
                        thrpt:  [-81.957% -81.930% -81.904%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Div O3

poll-race/1x8/          time:   [32.265 µs 32.377 µs 32.478 µs]
                        thrpt:  [30.790 Melem/s 30.886 Melem/s 30.994 Melem/s]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
poll-race/2x4/          time:   [78.291 µs 78.409 µs 78.534 µs]
                        thrpt:  [12.733 Melem/s 12.754 Melem/s 12.773 Melem/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
poll-race/2x2x2/        time:   [97.911 µs 97.984 µs 98.060 µs]
                        thrpt:  [10.198 Melem/s 10.206 Melem/s 10.213 Melem/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

Con O3 (performance improved with respect to Div O3)

poll-race/1x8/          time:   [1.5476 µs 1.5574 µs 1.5666 µs]
                        thrpt:  [638.32 Melem/s 642.10 Melem/s 646.15 Melem/s]
                 change:
                        time:   [-95.233% -95.198% -95.165%] (p = 0.00 < 0.05)
                        thrpt:  [+1968.3% +1982.7% +1997.9%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
poll-race/2x4/          time:   [8.3998 µs 8.4158 µs 8.4327 µs]
                        thrpt:  [118.59 Melem/s 118.82 Melem/s 119.05 Melem/s]
                 change:
                        time:   [-89.271% -89.250% -89.227%] (p = 0.00 < 0.05)
                        thrpt:  [+828.26% +830.19% +832.08%]
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  9 (9.00%) low mild
  7 (7.00%) high mild
  2 (2.00%) high severe
poll-race/2x2x2/        time:   [4.5731 µs 4.5778 µs 4.5868 µs]
                        thrpt:  [218.02 Melem/s 218.44 Melem/s 218.67 Melem/s]
                 change:
                        time:   [-95.349% -95.340% -95.331%] (p = 0.00 < 0.05)
                        thrpt:  [+2041.9% +2045.9% +2050.0%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

Cortex-m4, ra6m3

Div, Os, 1000 samples, 8 x 1, preemption: 
INFO  [project::app] Std: 36us, Min: 12416us, Avg: 12437us, Max: 12500us

Div, Os, 1000 samples, 4 x 2, preemption: 
INFO  [project::app] Std: 41us, Min: 14916us, Avg: 14956us, Max: 15000us

Div, Os, 1000 samples, 2 x 2 x 2, preemption: 
INFO  [project::app] Std: 39us, Min: 16083us, Avg: 16137us, Max: 16166us

Div, O3, 1000 samples, 8 x 1, preemption: 
INFO  [project::app] Std: 21us, Min: 11583us, Avg: 11659us, Max: 11666us

Div, O3, 1000 samples, 4 x 2, preemption: 
INFO  [project::app] Std: 41us, Min: 8333us, Avg: 8375us, Max: 8416us

Div, O3, 1000 samples, 2 x 2 x 2, preemption: 
INFO  [project::app] Std: 39us, Min: 9000us, Avg: 9055us, Max: 9083us

Con, Os, 1000 samples, 8 x 1, preemption: 
INFO  [project::app] Std: 40us, Min: 11666us, Avg: 11698us, Max: 11750us

Con, Os, 1000 samples, 4 x 2, preemption: 
INFO  [project::app] Std: 41us, Min: 13166us, Avg: 13213us, Max: 13250us

Con, Os, 1000 samples, 2 x 2 x 2, preemption: 
INFO  [project::app] Std: 31us, Min: 14083us, Avg: 14151us, Max: 14166us

Con, O3, 1000 samples, 8 x 1, preemption: 
INFO  [project::app] Std: 40us, Min: 6416us, Avg: 6447us, Max: 6500us

Con, O3, 1000 samples, 4 x 2, preemption: 
INFO  [project::app] Std: 38us, Min: 12083us, Avg: 12139us, Max: 12166us

Con, O3, 1000 samples, 2 x 2 x 2, preemption: 
INFO  [project::app] Std: 31us, Min: 11500us, Avg: 11568us, Max: 11583us
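
As a quick way to read the cortex-m4 averages above, the sketch below computes the relative change of Con against Div for each configuration. The averages are copied from the log lines above; the names `relative_change`, `CORTEX_M4_AVG_US`, and `report` are hypothetical and exist only for this illustration.

```rust
/// Relative change of `con_us` against `div_us`, as a fraction of the
/// Div time. Positive means Con is faster, negative means Con is slower.
pub fn relative_change(div_us: f64, con_us: f64) -> f64 {
    (div_us - con_us) / div_us
}

/// (label, Div average, Con average) in microseconds, copied from the
/// cortex-m4 log lines above.
pub const CORTEX_M4_AVG_US: [(&str, f64, f64); 6] = [
    ("Os 8x1", 12437.0, 11698.0),
    ("Os 4x2", 14956.0, 13213.0),
    ("Os 2x2x2", 16137.0, 14151.0),
    ("O3 8x1", 11659.0, 6447.0),
    ("O3 4x2", 8375.0, 12139.0),
    ("O3 2x2x2", 9055.0, 11568.0),
];

/// Formats one line per configuration, for example for logging.
pub fn report() -> Vec<String> {
    CORTEX_M4_AVG_US
        .iter()
        .map(|&(label, div, con)| {
            format!("{label}: {:+.1}%", 100.0 * relative_change(div, con))
        })
        .collect()
}
```

Printing each entry of `report()` gives one signed percentage per configuration, which makes the Os versus O3 split easier to see than the raw microsecond values.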

The code for the cortex-m4 benchmarks (run in the presence of other higher-priority tasks) is listed below. I edited Cargo.toml to switch between Os/O3 and Div/Con.

    #[task(priority = 1)]
    async fn bench_futures_concurrency(_ctx: bench_futures_concurrency::Context<'_>) {
        use core::pin::Pin;
        use futures_concurrency::prelude::*;

        #[inline(never)]
        fn test(f: fn(usize)) {
            struct Wrapper<F>(F, usize, fn(usize));
            impl<F: Future> Future for Wrapper<F> {
                type Output = F::Output;
                fn poll(
                    self: Pin<&mut Self>,
                    cx: &mut core::task::Context<'_>,
                ) -> core::task::Poll<F::Output> {
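                    // Report this future's index before delegating to the inner future.
                    (self.2)(self.1);
                    // SAFETY: the inner future is never moved out of the pinned wrapper.
                    unsafe { Pin::new_unchecked(&mut self.get_unchecked_mut().0) }.poll(cx)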
                }
            }

            let fut = async {
                let f0 = Wrapper(Mono::delay(Duration::hours(1)), 0, f);
                let f1 = Wrapper(Mono::delay(Duration::hours(1)), 1, f);
                let f2 = Wrapper(Mono::delay(Duration::hours(1)), 2, f);
                let f3 = Wrapper(Mono::delay(Duration::hours(1)), 3, f);
                let f4 = Wrapper(Mono::delay(Duration::hours(1)), 4, f);
                let f5 = Wrapper(Mono::delay(Duration::hours(1)), 5, f);
                let f6 = Wrapper(Mono::delay(Duration::hours(1)), 6, f);
                let f7 = Wrapper(Mono::delay(Duration::hours(1)), 7, f);

                // 8 x 1 variant:
                // (f0, f1, f2, f3, f4, f5, f6, f7).race().await
                // 4 x 2 variant:
                // ((f0, f1).race(), (f2, f3).race(), (f4, f5).race(), (f6, f7).race()).race().await
                // 2 x 2 x 2 variant (active):
                (
                    ((f0, f1).race(), (f2, f3).race()).race(),
                    ((f4, f5).race(), (f6, f7).race()).race(),
                )
                    .race()
                    .await
            };

            let mut fut = core::pin::pin!(fut);
            let waker = core::task::Waker::noop();
            let mut context = core::task::Context::from_waker(waker);

            for _ in 0..1000 {
                _ = core::hint::black_box(fut.as_mut().poll(&mut context));
            }
        }

        info!("Starting benchmark");
        let mut points = [0; 1000];
        let mut min = u32::MAX;
        let mut max = 0;
        let mut avg = 0;
        for i in 0..points.len() {
            let start = Mono::now();

            test(|i| _ = core::hint::black_box(i));

            let duration = Mono::now()
                .checked_duration_since(start)
                .unwrap()
                .to_micros();
            min = min.min(duration);
            max = max.max(duration);
            avg += duration;
            points[i] = duration;
        }
        avg /= points.len() as u32;
        let std = points
            .iter()
            .map(|p| avg.abs_diff(*p))
            .map(|p| p * p)
            .sum::<u32>()
            .isqrt()
            / (points.len() as u32).isqrt();
        info!("Std: {std}us, Min: {min}us, Avg: {avg}us, Max: {max}us");
    }

For x86_64 it is basically the same, but I used pending as I was too lazy to set up tokio to get a timer.
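
A minimal std-only sketch of that polling harness is shown here. It is not the actual x86_64 benchmark: the `race` combinators from futures-concurrency are left out so the example depends only on the standard library, and `CountPolls` and `poll_n_times` are hypothetical names. It only demonstrates driving a never-ready `pending` future by hand with a no-op waker.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, Waker};

/// Hypothetical stand-in for the `Wrapper` above: it counts how many
/// times it is polled, then delegates to the inner future.
pub struct CountPolls<F> {
    inner: F,
    polls: usize,
}

impl<F: Future + Unpin> Future for CountPolls<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        self.polls += 1;
        // `F: Unpin` lets us re-pin the inner future without unsafe code.
        Pin::new(&mut self.inner).poll(cx)
    }
}

/// Polls a never-ready future `n` times and returns the observed poll count.
pub fn poll_n_times(n: usize) -> usize {
    let mut fut = CountPolls {
        inner: std::future::pending::<()>(),
        polls: 0,
    };
    // A waker that does nothing, since no executor is involved here.
    let mut cx = Context::from_waker(Waker::noop());
    for _ in 0..n {
        // black_box keeps the optimizer from discarding the poll result.
        let _ = std::hint::black_box(Pin::new(&mut fut).poll(&mut cx));
    }
    fut.polls
}
```

Because `pending` never completes, every iteration exercises the full poll path, which is what a timer that has not yet expired does in the cortex-m4 version.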


P.S.: I didn't expect x86_64 (x2 improvement) to be more sensitive to this issue than cortex-m4 (4% on Os), but it seems good nonetheless.

@Ddystopia

Contributor Author

If you're fine with the benchmarks, I'll write more docs and force-push.

@yoshuawuyts

Owner

Thank you for running these benchmarks; these seem promising and support your thesis. Happy to merge this with the added docs and merge conflicts resolved.

`futures_concurrency` is a `no_std`-friendly crate, and on many cpus,
such as cortex-m4, integer division by runtime number is a very
expensive operation. It is so expensive that any preemption will discard
the command and then it will be restarted from scratch. Using generic
constant allows compiler to optimize it and use bit operations instead.
@Ddystopia

Contributor Author

@yoshuawuyts PR is ready

@Ddystopia

Contributor Author

@yoshuawuyts hi, did you accidentally forget about this one?

@yoshuawuyts yoshuawuyts left a comment

Owner

This is great, thank you so much!

@yoshuawuyts yoshuawuyts merged commit 7336a08 into yoshuawuyts:main Jan 18, 2026
7 checks passed