Document DataFusion Threading / tokio runtimes (how to separate IO and CPU bound work) #12393

Open
@tustvold

Is your feature request related to a problem or challenge?

DataFusion performs CPU-bound work within async closures. This causes problems when IO runs on the same async runtime: such schedulers are cooperative, so CPU-bound work that does not yield can starve the servicing of IO. This leads to errors such as apache/arrow-rs-object-store#272.
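
To make the separation concrete, here is a minimal sketch of the general idea: CPU-bound work is handed to a dedicated thread, and the thread that services IO only polls for the result. It uses only `std` threads and channels so that it stays self-contained. It is not DataFusion or tokio code, and `CpuPool` and `sum_of_squares` are hypothetical names invented for this illustration. In a tokio-based application the analogous arrangement would presumably be a second runtime reserved for CPU-bound tasks.

```rust
use std::sync::mpsc;
use std::thread;

/// A boxed unit of CPU-bound work.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical helper: one dedicated worker thread for CPU-bound jobs.
struct CpuPool {
    sender: Option<mpsc::Sender<Job>>,
    handle: Option<thread::JoinHandle<()>>,
}

impl CpuPool {
    fn new() -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let handle = thread::Builder::new()
            .name("cpu-worker".to_string())
            .spawn(move || {
                // Run jobs until every sender has been dropped.
                for job in receiver {
                    job();
                }
            })
            .expect("failed to spawn cpu worker");
        CpuPool { sender: Some(sender), handle: Some(handle) }
    }

    /// Submit CPU-bound work. The caller gets a receiver for the result
    /// and is not blocked while the work runs.
    fn spawn<T, F>(&self, work: F) -> mpsc::Receiver<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // Ignore the error if the caller dropped the receiver.
            let _ = tx.send(work());
        });
        self.sender
            .as_ref()
            .expect("pool already shut down")
            .send(job)
            .expect("cpu worker exited");
        rx
    }
}

impl Drop for CpuPool {
    fn drop(&mut self) {
        // Closing the channel ends the worker loop, then we join the thread.
        drop(self.sender.take());
        if let Some(handle) = self.handle.take() {
            let _ = handle.join();
        }
    }
}

/// Stand-in for CPU-bound work (hypothetical).
fn sum_of_squares(n: u64) -> u64 {
    (1..=n).map(|x| x * x).sum()
}

fn main() {
    let pool = CpuPool::new();
    let result = pool.spawn(|| sum_of_squares(1_000));

    // The "IO" side stays free to do other things while the job runs.
    let mut io_ticks = 0;
    let value = loop {
        match result.try_recv() {
            Ok(v) => break v,
            Err(mpsc::TryRecvError::Empty) => {
                io_ticks += 1; // stand-in for servicing IO
                thread::yield_now();
            }
            Err(mpsc::TryRecvError::Disconnected) => panic!("worker dropped the result"),
        }
    };
    println!("sum = {value}, io ticks while waiting = {io_ticks}");
}
```

The point of the design is that the thread responsible for IO never executes the expensive closure itself. It only checks a channel, so a long-running computation cannot delay it.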

Describe the solution you'd like

I think at the very least this needs to be better documented. I couldn't find any mention of it in the DataFusion documentation after a cursory search.

I also think a more holistic approach would be valuable here. As it stands, the use of async within DataFusion acts as a massive footgun that encourages users to intermix IO and CPU work in a way that is at best inefficient. That can be tracked as a separate follow-on task.

Describe alternatives you've considered

No response

Additional context

No response

Labels

enhancement: New feature or request
