Description
Is your feature request related to a problem or challenge?
DataFusion currently performs CPU-bound work on the tokio runtime, which can lead to IO tasks being starved and unable to adequately service the IO. Crucially, this occurs long before the actual resources of the threadpool are exhausted: it is expected that IO will be starved when no CPU is available; what we want to avoid is IO getting throttled while resources are still available.
The accepted wisdom has been to run DataFusion in its own threadpool and spawn the IO off onto a separate runtime, as described here. #12393 seeks to document this, and by extension #13424 and #13690 seek to provide examples of how to achieve it.
Unfortunately this approach comes with a lot of explicit and implicit complexity, making it hard to integrate DataFusion into existing codebases that use tokio.
Describe the solution you'd like
The basic idea would be, rather than moving the IO off to a separate runtime away from the CPU-bound tasks, to instead wrap the CPU-bound tasks so that they can't starve the runtime.
Ultimately DF has three broad classes of task:
1. Fully synchronous code - e.g. optimisation, function evaluation, planning, most physical operators
2. Mixed IO and CPU-bound - e.g. file IO, schema inference, etc.
3. Pure IO - e.g. catalog
The vast majority of code in DF falls into the fully synchronous bucket (1.), and so it is a natural response to want to special case the less common IO (2. / 3.) instead of the more common CPU-bound code. However, I'd like to posit that conceptually and practically it is much simpler to dispatch the synchronous code than the asynchronous code, and that the items falling into the second bucket are the most complex operators in DataFusion.
This would yield some quite compelling advantages:
- DataFusion could easily be integrated with existing tokio-based workloads
- Avoids the many footguns of CPU-bound async tasks
This could be achieved using tokio's native primitives. One way would be to use block_in_place, which, unlike spawn_blocking, does not require a 'static lifetime and can therefore borrow data from the invoking context.
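As a rough illustration (assuming tokio's multi-threaded runtime; decode_batch is a hypothetical stand-in for any synchronous kernel, not an existing DataFusion API), the wrapping might look something like:

```rust
use tokio::task::block_in_place;

// Hypothetical stand-in for an expensive, fully synchronous kernel.
fn decode_batch(bytes: &[u8]) -> usize {
    bytes.iter().map(|&b| b as usize).sum()
}

// block_in_place moves any other tasks off this worker thread before
// running the closure, so the CPU-bound work cannot starve unrelated IO
// tasks. Unlike spawn_blocking, the closure may borrow `bytes` from the
// enclosing scope. (Note: block_in_place panics on the current-thread
// runtime flavour.)
async fn process(bytes: Vec<u8>) -> usize {
    block_in_place(|| decode_batch(&bytes))
}
```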
That being said, this does come with some risks:
- It might become hard to determine what is sufficiently CPU-bound to warrant spawning
- There would potentially be a lot of callsites that would need wrapping
- It would be important to use block_in_place only at the granularity of a data batch, to ensure overheads are amortized
- Applications relying on thread count to limit concurrency might need to put in place some additional concurrency controls (e.g. a semaphore, as sketched below)
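A minimal sketch of such a control, assuming tokio's Semaphore and a hypothetical bounded_cpu_work wrapper:

```rust
use std::sync::Arc;

use tokio::sync::Semaphore;
use tokio::task::block_in_place;

// Cap how many CPU-bound sections run at once; previously the worker
// thread count provided this limit implicitly.
async fn bounded_cpu_work(sem: Arc<Semaphore>, bytes: Vec<u8>) -> usize {
    // Waits asynchronously (without blocking a worker thread) when the
    // configured number of CPU-bound sections are already running.
    let _permit = sem.acquire().await.expect("semaphore closed");
    block_in_place(|| bytes.iter().map(|&b| b as usize).sum())
}
```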
An arguably better solution would be to dispatch the work to a dedicated CPU threadpool such as rayon.
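A rough sketch of that approach, assuming the rayon crate and handing the result back over a tokio oneshot channel (heavy_work is again a hypothetical stand-in):

```rust
use tokio::sync::oneshot;

// Hypothetical stand-in for a CPU-bound kernel.
fn heavy_work(bytes: &[u8]) -> usize {
    bytes.iter().map(|&b| b as usize).sum()
}

// Dispatch the work to rayon's CPU pool and yield the tokio task until the
// result arrives, so no tokio worker thread is ever blocked.
async fn dispatch_to_rayon(input: Vec<u8>) -> usize {
    let (tx, rx) = oneshot::channel();
    rayon::spawn(move || {
        // Runs on a rayon worker thread, not a tokio worker thread. Note
        // rayon::spawn requires 'static, so `input` is moved in.
        let _ = tx.send(heavy_work(&input));
    });
    rx.await.expect("rayon task panicked or sender was dropped")
}
```

This keeps the tokio workers free to service IO while rayon's work-stealing pool handles the CPU-bound batches.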
Describe alternatives you've considered
We could instead move the IO off to a separate runtime, as described above.
Additional context
No response