fix(indexer): use a static divisor when possible#217
Conversation
9a7d640 to
ad4d97f
Compare
|
Thanks for filing this! I have some trouble evaluating this PR, because it introduces a fair bit of complexity in what otherwise is rather simple. But on the other hand: it sounds like this may yield significant performance benefits for embedded systems, and that seems worth it. So I think I'd like to see two things before accepting this:
I hope those are reasonable asks! Thanks again for filing this! |
|
Okay, I made benchmarks. First of all I will note that of course no one is running I used criterion for x86_64 and crude manual measurements for cortex-m4 ra6m3 chip. A conclusion is, for this specific benchmark, on x86_64 it gave 2x boost in speed and 4-20x boost in throughput (how much it stalls other unrelated operations for example), for cortex-m4 on Os the improvement in speed is 5%, for O3 is 40% faster for 8x1 and 40% slower for 4x2 and 2x2x2. Not really sure why it is like this, maybe when the chain is longer and Indexer works more then it is better to use Div = use % by runtime value X86_64Con Os Div Os (performance regressed in respect to Con Os) Div O3 Con O3 (performance improved in respect to Div O3) Cortex-m4, ra6m3Code for benchmarks on cortex-m4 (in the presence of other higher priority tasks) listed bellow. I was changing the Cargo.toml to switch between Os/O3 and Div/Con. #[task(priority = 1)]
async fn bench_futures_concurrency(_ctx: bench_futures_concurrency::Context<'_>) {
use core::pin::Pin;
use futures_concurrency::prelude::*;
#[inline(never)]
fn test(f: fn(usize)) {
struct Wrapper<F>(F, usize, fn(usize));
impl<F: Future> Future for Wrapper<F> {
type Output = F::Output;
fn poll(
self: Pin<&mut Self>,
cx: &mut core::task::Context<'_>,
) -> core::task::Poll<F::Output> {
(self.2)(self.1);
unsafe { Pin::new_unchecked(&mut self.get_unchecked_mut().0) }.poll(cx)
}
}
let fut = async {
let f0 = Wrapper(Mono::delay(Duration::hours(1)), 0, f);
let f1 = Wrapper(Mono::delay(Duration::hours(1)), 1, f);
let f2 = Wrapper(Mono::delay(Duration::hours(1)), 2, f);
let f3 = Wrapper(Mono::delay(Duration::hours(1)), 3, f);
let f4 = Wrapper(Mono::delay(Duration::hours(1)), 4, f);
let f5 = Wrapper(Mono::delay(Duration::hours(1)), 5, f);
let f6 = Wrapper(Mono::delay(Duration::hours(1)), 6, f);
let f7 = Wrapper(Mono::delay(Duration::hours(1)), 7, f);
// (f0, f1, f2, f3, f4, f5, f6, f7).race().await
// ((f0, f1).race(), (f2, f3).race(), (f4, f5).race(), (f6, f7).race(),) .race() .await
(
((f0, f1).race(), (f2, f3).race()).race(),
((f4, f5).race(), (f6, f7).race()).race(),
)
.race()
.await
};
let mut fut = core::pin::pin!(fut);
let waker = core::task::Waker::noop();
let mut context = core::task::Context::from_waker(waker);
for _ in 0..1000 {
_ = core::hint::black_box(fut.as_mut().poll(&mut context));
}
}
info!("Starting benchmark");
let mut points = [0; 1000];
let mut min = u32::MAX;
let mut max = 0;
let mut avg = 0;
for i in 0..points.len() {
let start = Mono::now();
test(|i| _ = core::hint::black_box(i));
let duration = Mono::now()
.checked_duration_since(start)
.unwrap()
.to_micros();
min = min.min(duration);
max = max.max(duration);
avg += duration;
points[i] = duration;
}
avg /= points.len() as u32;
let std = points
.iter()
.map(|p| avg.abs_diff(*p))
.map(|p| p * p)
.sum::<u32>()
.isqrt()
/ (points.len() as u32).isqrt();
info!("Std: {std}us, Min: {min}us, Avg: {avg}us, Max: {max}us");
}For x86-64 it is basically the same but I used P.S.: I didn't expect that x86_64 (x2 improvement) is more sensible to this issue than cortex-m4 (4% on Os), but it seems good nontheless. |
|
If you're fine with benchmarks I'll write the more docs and force push. |
|
Thank you for running these benchmarks; these seem promising and support your thesis. Happy to merge this with the added docs and merge conflicts resolved. |
ad4d97f to
5f10932
Compare
`futures_concurrency` is a `no_std`-friendly crate, and on many cpus, such as cortex-m4, integer division by runtime number is a very expensive operation. It is so expensive that any preemption will discard the command and then it will be restarted from scratch. Using generic constant allows compiler to optimize it and use bit operations instead.
5f10932 to
a78d5e9
Compare
|
@yoshuawuyts PR is ready |
|
@yoshuawuyts hi, didn't you accidentally forgot about that one? |
yoshuawuyts
left a comment
There was a problem hiding this comment.
This is great, thank you so much!
futures_concurrencyis ano_std-friendly crate, and on many architectures, such as cortex-m4, integer division by runtime number is a very expensive operation. It is so expensive that any preemption will discard the command and then it will be restarted from scratch. Using generic constant allows compiler to optimize it and use bit operations instead.