Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delegated the implementations of Lock and Semaphore to the async backend class #761

Merged
merged 13 commits into from
Sep 5, 2024

Conversation

agronholm
Copy link
Owner

@agronholm agronholm commented Jul 18, 2024

Changes

This improves the performance of anyio.Lock on asyncio by ~50% by delegating the implementation of the class to the async back-end.

Checklist

If this is a user-facing code change, like a bugfix or a new feature, please ensure that
you've fulfilled the following conditions (where applicable):

  • You've added tests (in tests/) added which would fail without your patch
  • You've updated the documentation (in docs/, in case of behavior changes or new
    features)
  • You've added a new changelog entry (in docs/versionhistory.rst).

If this is a trivial change, like a typo fix or a code reformatting, then you can ignore
these instructions.

Updating the changelog

If there are no entries after the last release, use **UNRELEASED** as the version.
If, say, your patch fixes issue #123, the entry should look like this:

* Fix big bad boo-boo in task groups (#123 <https://github.com/agronholm/anyio/issues/123>_; PR by @yourgithubaccount)

If there's no issue linked, just link to your pull request instead by updating the
changelog after you've created the PR.

@agronholm
Copy link
Owner Author

This is what I used to benchmark Lock:

import asyncio
import time

import trio

import anyio


async def acquire_release(lock):
    async with lock:
        await anyio.sleep(0)


async def main(lock_class: type):
    lock = lock_class()
    async with anyio.create_task_group() as tg:
        for _ in range(10000):
            tg.start_soon(acquire_release, lock)


def benchmark(lock_class: type, name: str, backend: str) -> None:
    start = time.monotonic()
    anyio.run(main, lock_class, backend=backend)
    print(f"{name}: {time.monotonic() - start}")


benchmark(anyio.Lock, "anyio+asyncio", backend="asyncio")
benchmark(asyncio.Lock, "asyncio", backend="asyncio")
benchmark(anyio.Lock, "anyio+trio", backend="trio")
benchmark(trio.Lock, "trio", backend="trio")

@MarkusSintonen
Copy link

This is very nice! Could you also benchmark anyio.Semaphore? It might also suffer from some performance issues as it has a similar implementation as the one removed in this PR.

@agronholm
Copy link
Owner Author

I was going to tackle Semaphore next, but wanted to hear some thoughts on the Lock change first.

@agronholm
Copy link
Owner Author

anyio+asyncio: 0.4290109691210091
asyncio: 0.28042099880985916

As expected, there's a significant performance gap between anyio.Semaphore and asyncio.Semaphore.

@agronholm
Copy link
Owner Author

After my revised implementation of Semaphore:

anyio+asyncio: 0.2880397120025009
asyncio: 0.24730895110405982
anyio+trio: 0.4473942369222641
trio: 0.475036040879786

@agronholm agronholm changed the title Delegated the implementation of Lock to the async backend class Delegated the implementations of Lock and Semaphore to the async backend class Jul 20, 2024
@MarkusSintonen
Copy link

MarkusSintonen commented Jul 23, 2024

I run the above benchmark with following main to test the low contention scenario:

async def main(lock_class: type):
    lock = lock_class()
    for _ in range(10000):
        await acquire_release(lock)

Then unfortunately AnyIO still has quite a great overhead with asyncio (as we found also here encode/httpcore#922):

anyio+asyncio: 0.5959584189999987
asyncio: 0.26639485199984847
anyio+trio: 0.6655837559997053
trio: 0.6824864679997518

Caused by this line https://github.com/agronholm/anyio/pull/761/files#diff-202c3fc4551e5a287df488a9bd96a25dcc1405ec135d5c7aceb6ef4c02755341R1694

Explanation on why above is required here encode/httpcore#922 (comment)

@agronholm
Copy link
Owner Author

I can see why this would be a problem...but removing that call violates the Trio design principles where each coroutine function in the API should always contain a checkpoint. I'm not sure I can do much about this before v5.0.

@agronholm
Copy link
Owner Author

Perhaps it could be made opt-out via a parameter? That wouldn't break backwards compat, and would be in line with Trio's principles, in the sense that by default it does the right thing.

@MarkusSintonen
Copy link

Perhaps it could be made opt-out via a parameter? That wouldn't break backwards compat, and would be in line with Trio's principles, in the sense that by default it does the right thing.

Yeah that sounds pretty reasonable to me!

@agronholm
Copy link
Owner Author

agronholm commented Sep 1, 2024

I misremembered the implementation. I'm already foregoing the extra scheduling cycle, but I am checking for cancellation, which the asyncio also doesn't do. So there isn't really anything that can be done on that front.

EDIT: I must be tired; disregard the above. I did have a cancel_shielded_checkpoint() call there after all.

I run the above benchmark with following main to test the low contention scenario:

async def main(lock_class: type):
    lock = lock_class()
    for _ in range(10000):
        await acquire_release(lock)

Then unfortunately AnyIO still has quite a great overhead with asyncio (as we found also here encode/httpcore#922):

anyio+asyncio: 0.5959584189999987
asyncio: 0.26639485199984847
anyio+trio: 0.6655837559997053
trio: 0.6824864679997518

Caused by this line https://github.com/agronholm/anyio/pull/761/files#diff-202c3fc4551e5a287df488a9bd96a25dcc1405ec135d5c7aceb6ef4c02755341R1694

Explanation on why above is required here encode/httpcore#922 (comment)

Are you absolutely sure that you ran the benchmark against this PR? On what Python version? Because I'm seeing very different results. Here's an updated benchmark that adds a warm-up round:

import asyncio
import time

import trio

import anyio


async def acquire_release(lock):
    async with lock:
        await anyio.sleep(0)


async def main(lock_class: type):
    lock = lock_class()
    async with anyio.create_task_group() as tg:
        for _ in range(10000):
            tg.start_soon(acquire_release, lock)


def benchmark(lock_class: type, name: str, backend: str, warmup: bool) -> None:
    start = time.monotonic()
    anyio.run(main, lock_class, backend=backend)
    if not warmup:
        print(f"{name}: {time.monotonic() - start}")


for warmup in (True, False):
    benchmark(anyio.Lock, "anyio+asyncio", backend="asyncio", warmup=warmup)
    benchmark(asyncio.Lock, "asyncio", backend="asyncio", warmup=warmup)
    benchmark(anyio.Lock, "anyio+trio", backend="trio", warmup=warmup)
    benchmark(trio.Lock, "trio", backend="trio", warmup=warmup)

I'm seeing anyio.Lock consistently outperform asyncio.Lock:

anyio+asyncio: 0.2851747270033229
asyncio: 0.29133382299914956
anyio+trio: 0.45034723199205473
trio: 0.44515540200518444
$ python lockbench.py 
anyio+asyncio: 0.30581956400419585
asyncio: 0.31490542700339574
anyio+trio: 0.46516700099164154
trio: 0.4660604179953225
$ python lockbench.py 
anyio+asyncio: 0.3131622729997616
asyncio: 0.3260173320013564
anyio+trio: 0.457542422998813
trio: 0.4611267040017992

Would you mind re-running the benchmark at your end?

@agronholm
Copy link
Owner Author

Oh, by the way, maybe you'd like to join the AnyIO room on Gitter to discuss this in (almost) real time?

@MarkusSintonen
Copy link

Would you mind re-running the benchmark at your end?

Sure, here are the rerun results with the commit 787a454

High contention:

anyio+asyncio: 0.6917983900000308
asyncio: 0.65059882200012
anyio+trio: 0.9897013449999577
trio: 1.071194450000121

Low contention:

anyio+asyncio: 0.5575289480000265
asyncio: 0.24671458100010568
anyio+trio: 0.6501416300000074
trio: 0.6244343480000225

I used the same benchmark as above here (that has the warmup) for the high contention case. The low contention case I again used this in the place of "main":

async def main(lock_class: type):
    lock = lock_class()
    for _ in range(10000):
        await acquire_release(lock)

So in the high-contention case the results are similar as before but in the low-contention case the asyncio is still 2x faster.

In Python 3.12.3

@MarkusSintonen
Copy link

I'm seeing anyio.Lock consistently outperform asyncio.Lock:

For me its about similar. But the low-contention case is much worse. Did you try the low-contention where the lock is not contested?

@agronholm
Copy link
Owner Author

That I did not. I will try it your way and see if I can make more sense of the results.

@MarkusSintonen
Copy link

I added the low contention case as part of full benchmark if its useful:

import asyncio
import time
from typing import Awaitable, Callable

import trio

import anyio


async def acquire_release(lock):
    async with lock:
        await anyio.sleep(0)


async def high_contention(lock_class: type):
    lock = lock_class()
    async with anyio.create_task_group() as tg:
        for _ in range(10000):
            tg.start_soon(acquire_release, lock)


async def no_contention(lock_class: type):
    lock = lock_class()
    for _ in range(10000):
        await acquire_release(lock)


def benchmark(main: Callable[[type], Awaitable[None]], lock_class: type, name: str, backend: str, warmup: bool) -> None:
    start = time.monotonic()
    anyio.run(main, lock_class, backend=backend)
    if not warmup:
        print(f"{name}: {time.monotonic() - start}")


for warmup in (True, False):
    for lock_contention in (True, False):
        if not warmup:
            case_msg = "high contention" if lock_contention else "no contention"
            print(case_msg)

        case = high_contention if lock_contention else no_contention

        benchmark(case, anyio.Lock, "anyio+asyncio", backend="asyncio", warmup=warmup)
        benchmark(case, asyncio.Lock, "asyncio", backend="asyncio", warmup=warmup)
        benchmark(case, anyio.Lock, "anyio+trio", backend="trio", warmup=warmup)
        benchmark(case, trio.Lock, "trio", backend="trio", warmup=warmup)

@agronholm
Copy link
Owner Author

Yeah, so the cancel_shielded_checkpoint() part seems to be the expensive one, as expected. The margin gets a lot smaller if you comment that out. Would you say that the difference becomes acceptable if we skip the rescheduling by the checkpoint?

@MarkusSintonen
Copy link

Yeah, so the cancel_shielded_checkpoint() part seems to be the expensive one, as expected. The margin gets a lot smaller if you comment that out. Would you say that the difference becomes acceptable if we skip the rescheduling by the checkpoint?

Yep, so it seems. The difference to asyncio is then pretty negligible when removing cancel_shielded_checkpoint.

@agronholm
Copy link
Owner Author

Alright, then I would say that I need to again look at adding the fast_acquire option. Let's hope that the trio backend won't give much trouble in that respect.

@agronholm
Copy link
Owner Author

Alright, please test against the primitive-fast-acquire branch, and pass fast_acquire=True to the Lock and Semaphore constructors as needed.

@agronholm
Copy link
Owner Author

If you find that this new option works as expected, I will polish the branch and merge it to this one.

@MarkusSintonen
Copy link

Alright, please test against the primitive-fast-acquire branch, and pass fast_acquire=True to the Lock and Semaphore constructors as needed.

It seems to work! I run the benchmark here and this is what I got:

high contention
anyio+asyncio: 0.7079124029987724
asyncio: 0.7218510080056149
no contention
anyio+asyncio: 0.26606414399429923
asyncio: 0.2457228010025574

(Unable to run Trio benchmark, it fails with WouldBlock exception in that branch with fast_acquire)

Also run httpcore benchmarks and it also looks very good:

First results with all optimization PRs applied (to not count other httpcore issues in).

With fast_acquire=False:
Näyttökuva 2024-09-02 kello 20 15 55

With fast_acquire=True:
Näyttökuva 2024-09-02 kello 20 16 46

Lastly results without all optimization PRs applied (how it would look against httpcore master, with its issues).

With fast_acquire=False:
Näyttökuva 2024-09-02 kello 20 22 11

With fast_acquire=True:
Näyttökuva 2024-09-02 kello 20 20 50

@agronholm
Copy link
Owner Author

I've fixed the WouldBlock on trio now.

@MarkusSintonen
Copy link

I've fixed the WouldBlock on trio now.

I run with Trio included with fast_acquire=True (looking good):

high contention
anyio+asyncio: 0.6999061149981571
asyncio: 0.7299156189983478
anyio+trio: 1.008005741998204
trio: 1.0237657109973952
no contention
anyio+asyncio: 0.27656007900077384
asyncio: 0.24542512399784755
anyio+trio: 0.36521937700308627
trio: 0.6426814509977703

@MarkusSintonen
Copy link

Also posted the above results to the related httpcore PR and asked Tom Christie whether to keep AnyIO there for synchronization. Maintainer(s) had some other ideas about removing the hard dependency to that, but not sure what was the idea with that. Hence the plain asyncio network backend PR is also there waiting. But unfortunately httpcore author has disappeared somewhere :/

Copy link
Collaborator

@Zac-HD Zac-HD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation here looks good to me, but I'm somewhat concerned about the design: as you note in encode/httpcore#922 (comment) fast_acquire=True can turn code which is otherwise correct (i.e. is cancellable and subject to timeouts as expected) into an infinite loop via spooky action at a distance. I've had nasty production bugs of this form, which is why https://github.com/python-trio/flake8-async/ has lint rules to warn against such functions.

Performance improvements absolutely are important, but maybe we can support this on asyncio but make fast_acquire=True a no-op on Trio where cancellation scopes actually do exist?

@agronholm
Copy link
Owner Author

The implementation here looks good to me, but I'm somewhat concerned about the design: as you note in encode/httpcore#922 (comment) fast_acquire=True can turn code which is otherwise correct (i.e. is cancellable and subject to timeouts as expected) into an infinite loop via spooky action at a distance. I've had nasty production bugs of this form, which is why https://github.com/python-trio/flake8-async/ has lint rules to warn against such functions.

Performance improvements absolutely are important, but maybe we can support this on asyncio but make fast_acquire=True a no-op on Trio where cancellation scopes actually do exist?

I'm not sure what the existence of cancellation scopes has to do with this. The point here was to skip the extra event loop cycle in the no-contention scenario. As for the dangers, they're documented and this is strictly opt-in.

@Zac-HD
Copy link
Collaborator

Zac-HD commented Sep 4, 2024

It's locally opt-in, but has non-local effects - e.g. if I'm using some library that uses this, I can no longer rely on the await / async with / async for keywords indicating that there's a checkpoint, as Trio ensures and flake8-async can (locally) check.

Maybe that's just a lost cause with third-party libraries anyway, but it does make me sad to give up that property.

@agronholm
Copy link
Owner Author

It's locally opt-in, but has non-local effects - e.g. if I'm using some library that uses this, I can no longer rely on the await / async with / async for keywords indicating that there's a checkpoint, as Trio ensures and flake8-async can (locally) check.

Maybe that's just a lost cause with third-party libraries anyway, but it does make me sad to give up that property.

I'd say it's always been a lost cause. Until we have some way of enforcing this (and I'm not seeing this in the foreseeable future), it will remain that way.

@agronholm
Copy link
Owner Author

Aside from that, is there anything else amiss here?

Copy link
Collaborator

@Zac-HD Zac-HD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's fair enough. I can't see any other issues, so let's ship it!

@agronholm agronholm merged commit 0c8ad51 into master Sep 5, 2024
17 checks passed
@agronholm agronholm deleted the optimize-sync-primitives branch September 5, 2024 06:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants