Description
Checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of Polars.
Reproducible example
from typing import Awaitable, Iterable, TypedDict

from polars import DataFrame, LazyFrame, max_horizontal, min_horizontal


class Word(TypedDict):
    id: int
    text: str
    start_time: float
    end_time: float


class Speaker(TypedDict):
    speaker: str
    start_time: float
    end_time: float


def get_intersection(transcription: Iterable[Word], diarisation: Iterable[Speaker]) -> Awaitable[DataFrame]:
    # Overlap between each word and each speaker segment
    # (negative when the two intervals do not overlap).
    intersection_expression = min_horizontal(
        'end_time',
        'end_time_right',
    ) - max_horizontal(
        'start_time',
        'start_time_right',
    )

    return (
        # Both frames are built synchronously from Python objects here.
        LazyFrame(transcription)
        .join(LazyFrame(diarisation), how='cross')
        .with_columns(intersection_expression.alias('intersection'))
        .collect_async()
    )
Log output
No response
Issue description
In polars, most of the CPU-bound work happens in Rust, where the Python GIL is dropped. Ideally, collect_async should take advantage of this to maximise CPU usage. As of right now, collect_async blocks the main event loop and stops a single-worker server from handling any more requests until the DataFrame is created.
EDIT:
DataFrame creation is blocking the event loop. We can fix this by running it in a separate thread.
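A minimal sketch of that workaround, using only the standard library. The function build_frame_blocking is a hypothetical stand-in for the blocking construction step (for example LazyFrame(rows)), and the sleep merely simulates slow synchronous work. The heartbeat task is only there to show that the event loop keeps running while the work is offloaded with asyncio.to_thread.

```python
import asyncio
import time


def build_frame_blocking(rows):
    """Hypothetical stand-in for a blocking constructor such as LazyFrame(rows)."""
    time.sleep(0.2)  # simulate slow, synchronous work
    return list(rows)


async def heartbeat(ticks, stop):
    # Records a timestamp each time the event loop gets a chance to run.
    while not stop.is_set():
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.01)


async def main():
    ticks = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    await asyncio.sleep(0)  # give the heartbeat task a chance to start

    # Run the blocking step in a worker thread so the loop stays responsive.
    result = await asyncio.to_thread(build_frame_blocking, [{"id": 1}])

    stop.set()
    await beat
    return result, len(ticks)


result, tick_count = asyncio.run(main())
print(f"event loop ticked {tick_count} times while the frame was being built")
```

Under the same assumption, the reproducible example could build its inputs with something like `await asyncio.to_thread(LazyFrame, transcription)` before joining and calling collect_async. Calling build_frame_blocking directly inside main instead would leave the heartbeat unable to tick until the call returns.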
Expected behavior
collect_async should not block the main event loop. It should act as an actual async function that lets Python switch context and process other tasks concurrently while the GIL is dropped.
Installed versions
Polars: 1.7.0
Index type: UInt32
Platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.40
Python: 3.12.5 (main, Aug 9 2024, 08:20:41) [GCC 14.2.1 20240805]
----Optional dependencies----
adbc_driver_manager <not installed>
altair <not installed>
cloudpickle <not installed>
connectorx <not installed>
deltalake <not installed>
fastexcel <not installed>
fsspec 2024.9.0
gevent <not installed>
great_tables <not installed>
matplotlib 3.9.2
nest_asyncio <not installed>
numpy 1.26.4
openpyxl <not installed>
pandas 2.2.2
pyarrow 16.1.0
pydantic 2.9.1
pyiceberg <not installed>
sqlalchemy 2.0.34
torch 2.3.1+cu121
xlsx2csv <not installed>
xlsxwriter <not installed>