collect_async is blocking #18718

Open
@winstxnhdw

Description

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from typing import Awaitable, Iterable, TypedDict

from polars import DataFrame, LazyFrame, max_horizontal, min_horizontal


class Word(TypedDict):
    id: int
    text: str
    start_time: float
    end_time: float

class Speaker(TypedDict):
    speaker: str
    start_time: float
    end_time: float


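def get_intersection(transcription: Iterable[Word], diarisation: Iterable[Speaker]) -> Awaitable[DataFrame]:
    # Overlap between each word's and each speaker's time span.
    # The cross join suffixes the speaker's columns with '_right'.
    intersection_expression = min_horizontal(
        'end_time',
        'end_time_right',
    ) - max_horizontal(
        'start_time',
        'start_time_right',
    )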

    return (
        LazyFrame(transcription)
        .join(LazyFrame(diarisation), how='cross')
        .with_columns(intersection_expression.alias('intersection'))
        .collect_async()
    )

Log output

No response

Issue description

In Polars, most of the CPU-bound work happens in Rust, where the Python GIL is released. Ideally, collect_async would take advantage of this so that Polars can maximise CPU usage. Currently, collect_async blocks the main event loop, so a single-worker server cannot handle any other requests until the DataFrame has been created.
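
One way to confirm that something is stalling the event loop is to run a heartbeat task alongside the suspect call and record the longest gap between ticks. This is only a rough sketch using the standard library: max_loop_stall, blocking_work and cooperative_work are hypothetical names, and time.sleep merely stands in for whatever synchronous work runs before the first await.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def max_loop_stall(work: Callable[[], Awaitable[None]], interval: float = 0.01) -> float:
    """Return the longest gap (in seconds) between heartbeat ticks while `work()` runs."""
    gaps: list[float] = []

    async def heartbeat() -> None:
        last = time.monotonic()
        while True:
            await asyncio.sleep(interval)
            now = time.monotonic()
            gaps.append(now - last)
            last = now

    beat = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # let the heartbeat start before the work begins
    await work()
    await asyncio.sleep(interval * 2)  # give the heartbeat a chance to record the stall
    beat.cancel()
    return max(gaps)


async def blocking_work() -> None:
    # Hypothetical stand-in: synchronous work that never yields to the loop.
    time.sleep(0.2)


async def cooperative_work() -> None:
    # Same duration, but the loop stays free to run other tasks.
    await asyncio.sleep(0.2)
```

Swapping blocking_work for a coroutine that awaits the function from the reproducible example should show whether the stall comes from that call; a gap close to the total runtime would suggest the loop was blocked for the whole duration.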

EDIT:
DataFrame creation is blocking the event loop. We can fix this by running it in a separate thread.
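
A minimal sketch of that workaround, assuming Python 3.9+ for asyncio.to_thread. The names build_frame and handle_request are hypothetical, and time.sleep stands in for the blocking construction step so that the example runs without Polars installed.

```python
import asyncio
import time


def build_frame(rows: list[dict]) -> list[dict]:
    # Hypothetical stand-in for the blocking frame construction.
    time.sleep(0.2)
    return rows


async def handle_request(rows: list[dict]) -> list[dict]:
    # Run the blocking step in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(build_frame, rows)


async def main() -> float:
    start = time.monotonic()
    # Two "requests" make progress at the same time instead of queueing behind each other.
    await asyncio.gather(
        handle_request([{"id": 1}]),
        handle_request([{"id": 2}]),
    )
    return time.monotonic() - start
```

In the reproduction above, the wrapped function would presumably build both LazyFrames and call collect() itself. Note that a worker thread keeps the loop responsive but does not make pure-Python work run in parallel, since that work still holds the GIL.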

Expected behavior

collect_async should not block the main event loop. It should behave as a genuinely asynchronous function, allowing Python to switch context and run other tasks concurrently while the GIL is released.

Installed versions

Polars:              1.7.0
Index type:          UInt32
Platform:            Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.40
Python:              3.12.5 (main, Aug  9 2024, 08:20:41) [GCC 14.2.1 20240805]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.9.0
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         <not installed>
numpy                1.26.4
openpyxl             <not installed>
pandas               2.2.2
pyarrow              16.1.0
pydantic             2.9.1
pyiceberg            <not installed>
sqlalchemy           2.0.34
torch                2.3.1+cu121
xlsx2csv             <not installed>
xlsxwriter           <not installed>

Metadata

Labels

  • bug: Something isn't working
  • needs triage: Awaiting prioritization by a maintainer
  • python: Related to Python Polars
