Delegated the implementations of Lock and Semaphore to the async backend class #761

agronholm · 2024-07-18T20:44:09Z

Changes

This improves the performance of anyio.Lock on asyncio by ~50% by delegating the implementation of the class to the async back-end.

Checklist

If this is a user-facing code change, like a bugfix or a new feature, please ensure that
you've fulfilled the following conditions (where applicable):

You've added tests (in tests/) added which would fail without your patch
You've updated the documentation (in docs/, in case of behavior changes or new
features)
You've added a new changelog entry (in docs/versionhistory.rst).

If this is a trivial change, like a typo fix or a code reformatting, then you can ignore
these instructions.

Updating the changelog

If there are no entries after the last release, use **UNRELEASED** as the version.
If, say, your patch fixes issue #123, the entry should look like this:

* Fix big bad boo-boo in task groups (#123 <https://github.com/agronholm/anyio/issues/123>_; PR by @yourgithubaccount)

If there's no issue linked, just link to your pull request instead by updating the
changelog after you've created the PR.

agronholm · 2024-07-18T21:14:57Z

This is what I used to benchmark Lock:

import asyncio
import time

import trio

import anyio


async def acquire_release(lock):
    async with lock:
        await anyio.sleep(0)


async def main(lock_class: type):
    lock = lock_class()
    async with anyio.create_task_group() as tg:
        for _ in range(10000):
            tg.start_soon(acquire_release, lock)


def benchmark(lock_class: type, name: str, backend: str) -> None:
    start = time.monotonic()
    anyio.run(main, lock_class, backend=backend)
    print(f"{name}: {time.monotonic() - start}")


benchmark(anyio.Lock, "anyio+asyncio", backend="asyncio")
benchmark(asyncio.Lock, "asyncio", backend="asyncio")
benchmark(anyio.Lock, "anyio+trio", backend="trio")
benchmark(trio.Lock, "trio", backend="trio")

MarkusSintonen · 2024-07-19T09:38:27Z

This is very nice! Could you also benchmark anyio.Semaphore? It might also suffer from some performance issues as it has a similar implementation as the one removed in this PR.

agronholm · 2024-07-19T09:39:53Z

I was going to tackle Semaphore next, but wanted to hear some thoughts on the Lock change first.

agronholm · 2024-07-19T10:42:04Z

anyio+asyncio: 0.4290109691210091
asyncio: 0.28042099880985916

As expected, there's a significant performance gap between anyio.Semaphore and asyncio.Semaphore.

agronholm · 2024-07-20T15:17:36Z

After my revised implementation of Semaphore:

anyio+asyncio: 0.2880397120025009
asyncio: 0.24730895110405982
anyio+trio: 0.4473942369222641
trio: 0.475036040879786

MarkusSintonen · 2024-07-23T17:35:25Z

I run the above benchmark with following main to test the low contention scenario:

async def main(lock_class: type):
    lock = lock_class()
    for _ in range(10000):
        await acquire_release(lock)

Then unfortunately AnyIO still has quite a great overhead with asyncio (as we found also here encode/httpcore#922):

anyio+asyncio: 0.5959584189999987
asyncio: 0.26639485199984847
anyio+trio: 0.6655837559997053
trio: 0.6824864679997518

Caused by this line https://github.com/agronholm/anyio/pull/761/files#diff-202c3fc4551e5a287df488a9bd96a25dcc1405ec135d5c7aceb6ef4c02755341R1694

Explanation on why above is required here encode/httpcore#922 (comment)

agronholm · 2024-08-30T08:46:47Z

I can see why this would be a problem...but removing that call violates the Trio design principles where each coroutine function in the API should always contain a checkpoint. I'm not sure I can do much about this before v5.0.

agronholm · 2024-08-30T08:48:17Z

Perhaps it could be made opt-out via a parameter? That wouldn't break backwards compat, and would be in line with Trio's principles, in the sense that by default it does the right thing.

MarkusSintonen · 2024-08-31T10:19:14Z

Perhaps it could be made opt-out via a parameter? That wouldn't break backwards compat, and would be in line with Trio's principles, in the sense that by default it does the right thing.

Yeah that sounds pretty reasonable to me!

agronholm · 2024-09-01T11:25:51Z

I misremembered the implementation. I'm already foregoing the extra scheduling cycle, but I am checking for cancellation, which the asyncio also doesn't do. So there isn't really anything that can be done on that front.

EDIT: I must be tired; disregard the above. I did have a cancel_shielded_checkpoint() call there after all.

I run the above benchmark with following main to test the low contention scenario:
async def main(lock_class: type):
    lock = lock_class()
    for _ in range(10000):
        await acquire_release(lock)
Then unfortunately AnyIO still has quite a great overhead with asyncio (as we found also here encode/httpcore#922):
anyio+asyncio: 0.5959584189999987
asyncio: 0.26639485199984847
anyio+trio: 0.6655837559997053
trio: 0.6824864679997518
Caused by this line https://github.com/agronholm/anyio/pull/761/files#diff-202c3fc4551e5a287df488a9bd96a25dcc1405ec135d5c7aceb6ef4c02755341R1694

Explanation on why above is required here encode/httpcore#922 (comment)

Are you absolutely sure that you ran the benchmark against this PR? On what Python version? Because I'm seeing very different results. Here's an updated benchmark that adds a warm-up round:

import asyncio
import time

import trio

import anyio


async def acquire_release(lock):
    async with lock:
        await anyio.sleep(0)


async def main(lock_class: type):
    lock = lock_class()
    async with anyio.create_task_group() as tg:
        for _ in range(10000):
            tg.start_soon(acquire_release, lock)


def benchmark(lock_class: type, name: str, backend: str, warmup: bool) -> None:
    start = time.monotonic()
    anyio.run(main, lock_class, backend=backend)
    if not warmup:
        print(f"{name}: {time.monotonic() - start}")


for warmup in (True, False):
    benchmark(anyio.Lock, "anyio+asyncio", backend="asyncio", warmup=warmup)
    benchmark(asyncio.Lock, "asyncio", backend="asyncio", warmup=warmup)
    benchmark(anyio.Lock, "anyio+trio", backend="trio", warmup=warmup)
    benchmark(trio.Lock, "trio", backend="trio", warmup=warmup)

I'm seeing anyio.Lock consistently outperform asyncio.Lock:

anyio+asyncio: 0.2851747270033229
asyncio: 0.29133382299914956
anyio+trio: 0.45034723199205473
trio: 0.44515540200518444
$ python lockbench.py 
anyio+asyncio: 0.30581956400419585
asyncio: 0.31490542700339574
anyio+trio: 0.46516700099164154
trio: 0.4660604179953225
$ python lockbench.py 
anyio+asyncio: 0.3131622729997616
asyncio: 0.3260173320013564
anyio+trio: 0.457542422998813
trio: 0.4611267040017992

Would you mind re-running the benchmark at your end?

agronholm · 2024-09-01T11:50:30Z

Oh, by the way, maybe you'd like to join the AnyIO room on Gitter to discuss this in (almost) real time?

MarkusSintonen · 2024-09-01T18:24:02Z

Would you mind re-running the benchmark at your end?

Sure, here are the rerun results with the commit 787a454

High contention:

anyio+asyncio: 0.6917983900000308
asyncio: 0.65059882200012
anyio+trio: 0.9897013449999577
trio: 1.071194450000121

Low contention:

anyio+asyncio: 0.5575289480000265
asyncio: 0.24671458100010568
anyio+trio: 0.6501416300000074
trio: 0.6244343480000225

I used the same benchmark as above here (that has the warmup) for the high contention case. The low contention case I again used this in the place of "main":

async def main(lock_class: type):
    lock = lock_class()
    for _ in range(10000):
        await acquire_release(lock)

So in the high-contention case the results are similar as before but in the low-contention case the asyncio is still 2x faster.

In Python 3.12.3

MarkusSintonen · 2024-09-01T18:25:35Z

I'm seeing anyio.Lock consistently outperform asyncio.Lock:

For me its about similar. But the low-contention case is much worse. Did you try the low-contention where the lock is not contested?

agronholm · 2024-09-01T18:26:13Z

That I did not. I will try it your way and see if I can make more sense of the results.

MarkusSintonen · 2024-09-01T18:36:25Z

I added the low contention case as part of full benchmark if its useful:

import asyncio
import time
from typing import Awaitable, Callable

import trio

import anyio


async def acquire_release(lock):
    async with lock:
        await anyio.sleep(0)


async def high_contention(lock_class: type):
    lock = lock_class()
    async with anyio.create_task_group() as tg:
        for _ in range(10000):
            tg.start_soon(acquire_release, lock)


async def no_contention(lock_class: type):
    lock = lock_class()
    for _ in range(10000):
        await acquire_release(lock)


def benchmark(main: Callable[[type], Awaitable[None]], lock_class: type, name: str, backend: str, warmup: bool) -> None:
    start = time.monotonic()
    anyio.run(main, lock_class, backend=backend)
    if not warmup:
        print(f"{name}: {time.monotonic() - start}")


for warmup in (True, False):
    for lock_contention in (True, False):
        if not warmup:
            case_msg = "high contention" if lock_contention else "no contention"
            print(case_msg)

        case = high_contention if lock_contention else no_contention

        benchmark(case, anyio.Lock, "anyio+asyncio", backend="asyncio", warmup=warmup)
        benchmark(case, asyncio.Lock, "asyncio", backend="asyncio", warmup=warmup)
        benchmark(case, anyio.Lock, "anyio+trio", backend="trio", warmup=warmup)
        benchmark(case, trio.Lock, "trio", backend="trio", warmup=warmup)

agronholm · 2024-09-01T18:36:56Z

Yeah, so the cancel_shielded_checkpoint() part seems to be the expensive one, as expected. The margin gets a lot smaller if you comment that out. Would you say that the difference becomes acceptable if we skip the rescheduling by the checkpoint?

MarkusSintonen · 2024-09-01T18:42:17Z

Yeah, so the cancel_shielded_checkpoint() part seems to be the expensive one, as expected. The margin gets a lot smaller if you comment that out. Would you say that the difference becomes acceptable if we skip the rescheduling by the checkpoint?

Yep, so it seems. The difference to asyncio is then pretty negligible when removing cancel_shielded_checkpoint.

agronholm · 2024-09-01T18:43:30Z

Alright, then I would say that I need to again look at adding the fast_acquire option. Let's hope that the trio backend won't give much trouble in that respect.

agronholm · 2024-09-01T20:55:47Z

Alright, please test against the primitive-fast-acquire branch, and pass fast_acquire=True to the Lock and Semaphore constructors as needed.

agronholm · 2024-09-02T17:11:19Z

If you find that this new option works as expected, I will polish the branch and merge it to this one.

MarkusSintonen · 2024-09-02T17:23:51Z

Alright, please test against the primitive-fast-acquire branch, and pass fast_acquire=True to the Lock and Semaphore constructors as needed.

It seems to work! I run the benchmark here and this is what I got:

high contention
anyio+asyncio: 0.7079124029987724
asyncio: 0.7218510080056149
no contention
anyio+asyncio: 0.26606414399429923
asyncio: 0.2457228010025574

(Unable to run Trio benchmark, it fails with WouldBlock exception in that branch with fast_acquire)

Also run httpcore benchmarks and it also looks very good:

First results with all optimization PRs applied (to not count other httpcore issues in).

With fast_acquire=False:

With fast_acquire=True:

Lastly results without all optimization PRs applied (how it would look against httpcore master, with its issues).

With fast_acquire=False:

With fast_acquire=True:

agronholm · 2024-09-02T17:29:28Z

I've fixed the WouldBlock on trio now.

MarkusSintonen · 2024-09-02T17:32:36Z

I've fixed the WouldBlock on trio now.

I run with Trio included with fast_acquire=True (looking good):

high contention
anyio+asyncio: 0.6999061149981571
asyncio: 0.7299156189983478
anyio+trio: 1.008005741998204
trio: 1.0237657109973952
no contention
anyio+asyncio: 0.27656007900077384
asyncio: 0.24542512399784755
anyio+trio: 0.36521937700308627
trio: 0.6426814509977703

MarkusSintonen · 2024-09-02T17:41:53Z

Also posted the above results to the related httpcore PR and asked Tom Christie whether to keep AnyIO there for synchronization. Maintainer(s) had some other ideas about removing the hard dependency to that, but not sure what was the idea with that. Hence the plain asyncio network backend PR is also there waiting. But unfortunately httpcore author has disappeared somewhere :/

Zac-HD

The implementation here looks good to me, but I'm somewhat concerned about the design: as you note in encode/httpcore#922 (comment) fast_acquire=True can turn code which is otherwise correct (i.e. is cancellable and subject to timeouts as expected) into an infinite loop via spooky action at a distance. I've had nasty production bugs of this form, which is why https://github.com/python-trio/flake8-async/ has lint rules to warn against such functions.

Performance improvements absolutely are important, but maybe we can support this on asyncio but make fast_acquire=True a no-op on Trio where cancellation scopes actually do exist?

agronholm · 2024-09-04T19:02:08Z

The implementation here looks good to me, but I'm somewhat concerned about the design: as you note in encode/httpcore#922 (comment) fast_acquire=True can turn code which is otherwise correct (i.e. is cancellable and subject to timeouts as expected) into an infinite loop via spooky action at a distance. I've had nasty production bugs of this form, which is why https://github.com/python-trio/flake8-async/ has lint rules to warn against such functions.

Performance improvements absolutely are important, but maybe we can support this on asyncio but make fast_acquire=True a no-op on Trio where cancellation scopes actually do exist?

I'm not sure what the existence of cancellation scopes has to do with this. The point here was to skip the extra event loop cycle in the no-contention scenario. As for the dangers, they're documented and this is strictly opt-in.

Zac-HD · 2024-09-04T19:07:31Z

It's locally opt-in, but has non-local effects - e.g. if I'm using some library that uses this, I can no longer rely on the await / async with / async for keywords indicating that there's a checkpoint, as Trio ensures and flake8-async can (locally) check.

Maybe that's just a lost cause with third-party libraries anyway, but it does make me sad to give up that property.

agronholm · 2024-09-04T19:10:00Z

It's locally opt-in, but has non-local effects - e.g. if I'm using some library that uses this, I can no longer rely on the await / async with / async for keywords indicating that there's a checkpoint, as Trio ensures and flake8-async can (locally) check.

Maybe that's just a lost cause with third-party libraries anyway, but it does make me sad to give up that property.

I'd say it's always been a lost cause. Until we have some way of enforcing this (and I'm not seeing this in the foreseeable future), it will remain that way.

agronholm · 2024-09-04T19:15:59Z

Aside from that, is there anything else amiss here?

Zac-HD

Yeah, that's fair enough. I can't see any other issues, so let's ship it!

agronholm added 3 commits July 18, 2024 23:34

Delegated the implementation of Lock to the async backend class

906dbe4

Fixed test failure

cb8b52e

Set the owner already on release

233d15f

agronholm mentioned this pull request Jul 18, 2024

Use asyncio synchronization instead of anyio encode/httpcore#922

Open

3 tasks

Added backend-specific Semaphore implementations, and some Lock fixes

93cd16f

agronholm changed the title ~~Delegated the implementation of Lock to the async backend class~~ Delegated the implementations of Lock and Semaphore to the async backend class Jul 20, 2024

Updated the changelog to mention semaphore changes

b35b188

agronholm requested a review from Zac-HD August 29, 2024 14:41

davidbrochart mentioned this pull request Aug 29, 2024

Error when using Lock from thread in asyncio #773

Open

2 tasks

Merge branch 'master' into optimize-sync-primitives

6e5e0c1

Merge branch 'master' into optimize-sync-primitives

f883390

agronholm added 2 commits September 1, 2024 14:25

Merge branch 'master' into optimize-sync-primitives

6d0f355

Merge branch 'master' into optimize-sync-primitives

787a454

agronholm added 3 commits September 2, 2024 20:44

Added the fast_acquire parameter for Lock and Semaphore

ed70e31

Merge branch 'master' into optimize-sync-primitives

2e55a31

Added tests and amended the docs

bab241c

Zac-HD reviewed Sep 4, 2024

View reviewed changes

Merge branch 'master' into optimize-sync-primitives

fa137ec

Zac-HD approved these changes Sep 4, 2024

View reviewed changes

agronholm merged commit 0c8ad51 into master Sep 5, 2024
17 checks passed

agronholm deleted the optimize-sync-primitives branch September 5, 2024 06:34

MarkusSintonen mentioned this pull request Sep 19, 2024

Use AnyIO with fast_acquire synchronization encode/httpcore#953

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Delegated the implementations of Lock and Semaphore to the async backend class #761

Delegated the implementations of Lock and Semaphore to the async backend class #761

agronholm commented Jul 18, 2024 •

edited

Loading

agronholm commented Jul 18, 2024

MarkusSintonen commented Jul 19, 2024

agronholm commented Jul 19, 2024

agronholm commented Jul 19, 2024

agronholm commented Jul 20, 2024

MarkusSintonen commented Jul 23, 2024 •

edited

Loading

agronholm commented Aug 30, 2024

agronholm commented Aug 30, 2024

MarkusSintonen commented Aug 31, 2024

agronholm commented Sep 1, 2024 •

edited

Loading

agronholm commented Sep 1, 2024

MarkusSintonen commented Sep 1, 2024

MarkusSintonen commented Sep 1, 2024

agronholm commented Sep 1, 2024

MarkusSintonen commented Sep 1, 2024

agronholm commented Sep 1, 2024

MarkusSintonen commented Sep 1, 2024

agronholm commented Sep 1, 2024

agronholm commented Sep 1, 2024

agronholm commented Sep 2, 2024

MarkusSintonen commented Sep 2, 2024

agronholm commented Sep 2, 2024

MarkusSintonen commented Sep 2, 2024

MarkusSintonen commented Sep 2, 2024

Zac-HD left a comment •

edited

Loading

agronholm commented Sep 4, 2024

Zac-HD commented Sep 4, 2024

agronholm commented Sep 4, 2024

agronholm commented Sep 4, 2024

Zac-HD left a comment

Delegated the implementations of Lock and Semaphore to the async backend class #761

Delegated the implementations of Lock and Semaphore to the async backend class #761

Conversation

agronholm commented Jul 18, 2024 • edited Loading

Changes

Checklist

Updating the changelog

agronholm commented Jul 18, 2024

MarkusSintonen commented Jul 19, 2024

agronholm commented Jul 19, 2024

agronholm commented Jul 19, 2024

agronholm commented Jul 20, 2024

MarkusSintonen commented Jul 23, 2024 • edited Loading

agronholm commented Aug 30, 2024

agronholm commented Aug 30, 2024

MarkusSintonen commented Aug 31, 2024

agronholm commented Sep 1, 2024 • edited Loading

agronholm commented Sep 1, 2024

MarkusSintonen commented Sep 1, 2024

MarkusSintonen commented Sep 1, 2024

agronholm commented Sep 1, 2024

MarkusSintonen commented Sep 1, 2024

agronholm commented Sep 1, 2024

MarkusSintonen commented Sep 1, 2024

agronholm commented Sep 1, 2024

agronholm commented Sep 1, 2024

agronholm commented Sep 2, 2024

MarkusSintonen commented Sep 2, 2024

agronholm commented Sep 2, 2024

MarkusSintonen commented Sep 2, 2024

MarkusSintonen commented Sep 2, 2024

Zac-HD left a comment • edited Loading

Choose a reason for hiding this comment

agronholm commented Sep 4, 2024

Zac-HD commented Sep 4, 2024

agronholm commented Sep 4, 2024

agronholm commented Sep 4, 2024

Zac-HD left a comment

Choose a reason for hiding this comment

agronholm commented Jul 18, 2024 •

edited

Loading

MarkusSintonen commented Jul 23, 2024 •

edited

Loading

agronholm commented Sep 1, 2024 •

edited

Loading

Zac-HD left a comment •

edited

Loading