Clarification in terms of performance in Thread- and Processpools #4

dodofarm · 2025-08-28T22:20:40Z

dodofarm
Aug 28, 2025

Hi,
first of all - great project! Really appreciate this.

We are using culsans in an ETL pipeline and currently discussing what's the best way going forward in terms of using the sync or async interface.

Essentially, the question is: what is the performance difference between (a) using culsans' async interface and having an async wrapper function pass the data to a sync function (in either a ThreadPoolExecutor or a ProcessPoolExecutor), and (b) getting the data from the sync interface directly within the ThreadPoolExecutor or ProcessPoolExecutor?

I understand there's a bit more going on behind the scenes when passing data to the ProcessPoolExecutor from the main process (e.g. IPC + serialization), but wouldn't the same apply to culsans?

I can provide more information on the use case if needed, but essentially we have an abstract base class where users should be able to define a process function, which is called from an internal wrapper function. Currently we are deciding whether the wrapper function should always be a simple async function that calls the process function, leaving the user responsible for passing the execution to a ThreadPoolExecutor or a ProcessPoolExecutor themselves.

The other option is, we let the user decide if they want a thread or process pool to be spawned in the background (handled in the base class) via a config option they pass. In that case their synchronous process function is always called from within either the thread or process pool.

The former solution would use culsans' async interface and pass the data to the user's async function, which can then forward the data to a thread/process pool. In the latter solution, the sync interface would be called from within the thread/process pool owned by the base class, and the data would be passed directly to the user's function.

Hope my explanation makes sense.

P.S.: I guess this is a general design question and not purely focused on culsans. We are still weighing the various options for designing this system most effectively. Feel free to provide insight if you have any, since it seems like you work a lot with asynchronous Python. If it helps: items passing through our pipeline will have sizes anywhere between a few bytes and gigabytes, so passing data efficiently for both small and big objects is important.


x42005e1f · 2025-08-28T23:19:49Z

x42005e1f
Aug 28, 2025
Maintainer

Could you please clarify the execution model for each solution? In particular, I would like to know the following:

1. What is ThreadPoolExecutor/ProcessPoolExecutor used for?
2. Who executes the wrapper function / user function in the first solution?
3. How is asynchronous code supported in the second solution?

2 replies

dodofarm Aug 28, 2025
Author

Thanks for the reply,

The user should have 3 options. Depending on their use case, they might want to run their process code in a thread, in a process, or simply as a coroutine. For the first two use cases we planned to use a ThreadPoolExecutor/ProcessPoolExecutor. Who specifically manages the executor (us, or the user in the process function) is the main question here, I think. Those are the two solutions I tried to describe above. And, of course, which of those scenarios is more performant with culsans. The third use case is quite simple to facilitate by just calling the user's async function.
This is executed internally in the ETLPipeline class alongside other worker functions, monitoring functions, etc. The user starts the pipeline with a run method. Communication between the various parts of the pipeline is facilitated through culsans, specifically between the various wrapper functions, which then call the user's own code. E.g. the process wrapper gets data from another worker through culsans and passes it to the user's process function.
The two different options we are debating, if the user decides to use threads/multiprocessing over async, are:

culsans (async interface) -> async process wrapper -> user's process function -> Thread/ProcessPoolExecutor
vs
culsans (sync interface) -> sync process wrapper in Thread/ProcessPoolExecutor -> user's process function
If it's a normal async function, the flow would simply be:
culsans (async interface) -> async process wrapper -> user's process function

I see I only mentioned the thread or multiprocessing usecase in my initial post. What would happen is:

The user creates a process function (it doesn't matter if it's async or not).
They then pass something to the base class, e.g. a string or config option like "async", "thread" or "multi".
The base class then runs the process function either as a normal coroutine, in a thread, or in a process from within the respective executor.

Side note: if we decide to manage the executor ourselves rather than letting the user do it, it might be simpler to create a sync_process and an async_process function. Depending on what the user picks (async/thread/multi), the respective function will be called either as an async coroutine or from within our executor.

Hope the answers clear it up a bit!

x42005e1f Aug 29, 2025
Maintainer

There is also one nuance here, so I will ask another question. Is data transfer between workers a single object (one put()) or several?

x42005e1f · 2025-08-29T00:41:41Z

x42005e1f
Aug 29, 2025
Maintainer

Thank you for the details.

I think the most appropriate solution would be as follows:

1. The user is given three options: run locally ("local"), run in a thread pool ("thread"), or run in a process pool ("process").
2. If "local" is chosen, the async process wrapper checks the type of the process function via inspect.iscoroutinefunction(), and:
   2.1. If True, it is called via await.
   2.2. If False, it is called as a regular function.
3. If "thread" or "process" is chosen, the async process wrapper first gets the data, and then:
   3.1. Creates a pool if it has not been created yet.
   3.2. Runs the process function within the pool.
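
The steps above can be sketched roughly as follows. This is only an illustration, not code from Culsans or from the pipeline in question: `ProcessStage`, `run_stage`, `double` and `double_async` are hypothetical names, and a plain `asyncio.Queue` stands in for the async side of a Culsans queue so that the snippet has no third-party dependencies.

```python
import asyncio
import inspect
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


class ProcessStage:
    """Hypothetical async process wrapper; names are illustrative only."""

    def __init__(self, func, mode="local", max_workers=None):
        self.func = func
        self.mode = mode  # "local", "thread", or "process"
        self.max_workers = max_workers
        self._pool = None

    def _get_pool(self):
        # Step 3.1: create the pool lazily, on first use.
        if self._pool is None:
            cls = ThreadPoolExecutor if self.mode == "thread" else ProcessPoolExecutor
            self._pool = cls(max_workers=self.max_workers)
        return self._pool

    async def handle(self, item):
        if self.mode == "local":
            if inspect.iscoroutinefunction(self.func):
                return await self.func(item)  # step 2.1
            return self.func(item)  # step 2.2 (runs on the event loop thread)
        loop = asyncio.get_running_loop()
        # Step 3.2: run the process function within the pool.
        return await loop.run_in_executor(self._get_pool(), self.func, item)

    def close(self):
        if self._pool is not None:
            self._pool.shutdown()


def double(x):
    return x * 2


async def double_async(x):
    await asyncio.sleep(0)
    return x * 2


async def run_stage(func, mode, items):
    stage = ProcessStage(func, mode=mode, max_workers=2)
    queue = asyncio.Queue()  # assumption: stand-in for the async side of a Culsans queue
    for item in items:
        queue.put_nowait(item)
    results = []
    try:
        while not queue.empty():
            item = await queue.get()  # the wrapper gets the data first
            results.append(await stage.handle(item))
    finally:
        stage.close()
    return results


if __name__ == "__main__":
    print(asyncio.run(run_stage(double, "thread", [1, 2, 3])))
```

Note that in "process" mode the function and its arguments have to be picklable, which is where the serialization cost for large items comes from.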

In fact, running the sync process wrapper in the thread pool could be faster for three reasons. First, under the hood, the wakeup is performed via a very lightweight lock.release(), which usually works without context switching (the async interface requires interaction with the internal socket of the asynchronous library if the interaction occurs between different threads, which is slightly slower). Second, the worker thread would already be running while the data has not yet arrived, which reduces the delay between receiving data and processing it. Third, this eliminates the intermediary thread, which can itself be slow, so end-to-end data transfer is much faster.

However, I rejected this option for a very simple reason. What will the worker thread do while the data has not yet arrived? Right, it will reduce the concurrency set by the max_workers parameter (which is finite even when None is passed), since it will not allow the same thread to perform other tasks.

And if the user wants to use their own pools, they will choose "local" and, if necessary, will asynchronously wait for the future via asyncio.wrap_future() or something similar. I think this approach is more convenient, as it does not force the user to take care of the pools themselves.
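
As a minimal sketch of that last case, assuming the user picks "local" and owns the pool themselves (`user_pool`, `heavy` and `process` are hypothetical names), the user's async process function could look like this:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Owned by the user, not by the base class.
user_pool = ThreadPoolExecutor(max_workers=4)


def heavy(item):
    # Placeholder for the user's blocking work.
    return sum(item)


async def process(item):
    # Submit to the user's own pool, then wait for the
    # concurrent.futures.Future without blocking the event loop.
    future = user_pool.submit(heavy, item)
    return await asyncio.wrap_future(future)


if __name__ == "__main__":
    print(asyncio.run(process([1, 2, 3])))
```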

15 replies

x42005e1f Aug 29, 2025
Maintainer

Then, I think you can assign each worker group its own pools, where max_workers will correspond to the size of the group. This way, you can get around the problem of reduced concurrency, which will allow you to efficiently receive items directly in worker threads.
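
A rough sketch of this idea, assuming long-lived workers that block on the sync side of the queue: here `queue.Queue` stands in for the sync interface of a Culsans queue, and `STOP`, `worker` and `run_group` are hypothetical names. Because max_workers equals the group size, every blocked worker has its own thread and none of them starves the others.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

STOP = object()  # hypothetical sentinel that tells a worker to exit


def worker(inbox, outbox, func):
    # Long-lived worker: blocks in get() while idle, occupying one pool thread.
    while (item := inbox.get()) is not STOP:
        outbox.put(func(item))


def run_group(func, items, group_size):
    # Assumption: queue.Queue stands in for the sync side of Culsans queues.
    inbox, outbox = queue.Queue(), queue.Queue()
    # One pool per worker group, sized to the group.
    with ThreadPoolExecutor(max_workers=group_size) as pool:
        futures = [pool.submit(worker, inbox, outbox, func) for _ in range(group_size)]
        for item in items:
            inbox.put(item)
        for _ in range(group_size):
            inbox.put(STOP)  # one sentinel per worker
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return sorted(outbox.get_nowait() for _ in range(len(items)))


if __name__ == "__main__":
    print(run_group(lambda x: x * x, [1, 2, 3, 4], group_size=3))
```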

x42005e1f Aug 29, 2025
Maintainer

However, I think that in the case of the pool, you do not need a Culsans queue at all. pool.submit() is already associated with its own queue, and communication there is exactly the same.

dodofarm Aug 29, 2025
Author

Got it, makes sense.

I would only do that if the user specifies that this worker group should be multithreaded, right?
e.g. something like this:

        loop = asyncio.get_event_loop()
        process_futures: list[asyncio.Future[None]] = []
        for _ in range(self._process_workers_count):
            future = loop.run_in_executor(self.executor, self._process_wrapper) # multithreaded executor
            process_futures.append(future)

If they specify "local" the wrapper should just be a normal async function.

        fetch_tasks: list[asyncio.Task[None]] = []
        for _ in range(self._fetch_workers_count):
            task = asyncio.create_task(self._fetch_wrapper())
            fetch_tasks.append(task)

The above is essentially what's implemented right now, good to hear it was a good start. Your advice is to go with the above if we know that the ETL pipeline is constantly going to be "full", and to go with async wrappers instead if there are a bunch of idle workers for whatever reason?

Not sure about multiprocessing: would culsans support cross-process communication, or would I need to place the wrapper in an async function instead of in the multiprocessing pool?

Regarding the newest comment you just added - we really like the features culsans offers like dynamic queue resizing, so unless there's a good reason we'd stick with culsans.

x42005e1f Aug 29, 2025
Maintainer

Yes, so in this way, all you need to do is:

Create a pool for each group of workers that the user wants to run in threads/processes.
Use pool.submit() / loop.run_in_executor(pool, ...) to transfer items to these pools.

When using the pool, idle workers will not be a problem if you pass the appropriate max_workers, so it is suitable for any scenario. Moreover, it dynamically starts threads/processes, so the scheduler will not suffer unnecessarily.
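
For illustration, handing items to a per-group pool from async code might look like the sketch below. `process` and `feed_pool` are hypothetical names and a thread pool is assumed. Each item becomes one task on the pool's own internal queue, so no worker sits blocked waiting for data.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def process(item):
    # Placeholder for the user's sync process function.
    return item.upper()


async def feed_pool(items, max_workers):
    loop = asyncio.get_running_loop()
    # One pool per worker group; threads are started on demand, up to max_workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [loop.run_in_executor(pool, process, item) for item in items]
        # gather() returns results in submission order.
        return await asyncio.gather(*futures)


if __name__ == "__main__":
    print(asyncio.run(feed_pool(["a", "b", "c"], max_workers=2)))
```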

At the moment, neither Culsans nor aiologic supports inter-process communication. You can find out why aiologic does not support it at https://aiologic.readthedocs.io/latest/advanced-topics/libraries.html#why-is-multiprocessing-not-supported. However, if necessary, I can think about implementing this feature in Culsans specifically for multiprocessing.

I am glad you like Culsans.

Answer selected by dodofarm

dodofarm Aug 29, 2025
Author

Thanks again for investing the time to answer my questions. This really helped! Culsans is indeed great!

I'll go with the approach you recommended.

Yes, indeed, max_workers should always be present; the user will need to decide how many workers each group will have, and that will spawn either asyncio Tasks, threads or processes. For threads, the wrapper will run inside the thread; for the other two, it will be an async function.

Finding a good way to pass data with multiprocessing indeed seems very painful. In the beginning I spent a lot of time researching the serialization overhead when passing data, but that was too much work for now, so I pivoted away. It's not high priority enough to pour resources into right now, but I appreciate your readiness to add multiprocessing support!

As of now we want to try to stick to data processing tools that can release the GIL or are async, e.g. DuckDB or numpy (at least some operations release the GIL). Although at some point there will surely be a high-CPU-load workflow that doesn't release the GIL, and multiprocessing will be the only way.

I might open a feature request when something like that comes around, we'll see how far we can get with the existing tools.

x42005e1f Aug 29, 2025
Maintainer

Since Python 3.14, there is concurrent.interpreters and the related InterpreterPoolExecutor. It is much more lightweight than processes, and at the same time allows you to run code truly in parallel within the same process (the subinterpreters do not share the same GIL!). But you need to be careful with data transfer when using it.

dodofarm Aug 29, 2025
Author

Thanks for sharing. This is the first time I'm hearing of InterpreterPoolExecutor, and it does sound very interesting. It still has the big problem of sharing data (essentially the same as ProcessPoolExecutor), which I think might be the biggest concern for this use case, since the data passed around could be huge. Nonetheless, it seems much better suited than multiprocessing and is a very good feature to keep in mind.

Btw, we do plan to open-source the ETLPipeline at some point (hopefully a small alpha not too far into the future). We mainly focus on financial data, so I don't think there are going to be a lot of eyes on it, but it should work for any workload imaginable. Is there any way to shout out culsans in a meaningful manner besides in the README, to help your project?

Currently I'm working on it alone and don't have too much experience with asyncio or concurrent programming in general. You seem very knowledgeable on this topic however, I'll try my best to get the design right. Feel free to take a peek and point out issues if you see any once it's out :))

x42005e1f Aug 29, 2025
Maintainer

> Still has the big problem of sharing data

Not that big, actually. Some types are efficiently shared between subinterpreters, which in the case of your bytes would be zero-cost.

> Is there any way to shout out culsans in a meaningful manner besides in the README, to help your project?

Promoting my packages is indeed a challenge. My solutions are not well known, the subject matter is rarely understood, and for most AIs today, they do not exist at all. Last fall, I attempted to clean up the mess on Stack Overflow by merging questions on this topic into a single network of links (now, from each question, the searcher can go to other similar ones; before, they might not find a suitable answer due to duplicates and misleading titles), and left some of my answers. This had some effect, but it was very insignificant.

There are probably some effective ways, but most likely I am culturally outdated for them.

> Feel free to take a peek and point out issues if you see any once it's out :))

Good. Feel free to share the link here as soon as it happens.
