Feature request
The idea is to offer an alternative way of computing the result, giving users a trade-off between performance (`streaming=False`) and low memory usage with a lower chance of worker timeouts (`streaming=True`). `.compute(streaming=True)` would do something like:
```python
def compute(self, streaming: bool = False, **kwargs):
    if not streaming:
        # Default path: compute all partitions at once (faster, more memory).
        return npd.NestedFrame(super().compute(**kwargs))
    # Streaming path: compute a chunk of partitions at a time.
    try:
        client = dask.distributed.get_client()
        # One partition per worker per chunk.
        partitions_per_chunk = len(client.scheduler_info()["workers"])
    except ValueError:
        # get_client() raises ValueError when no client is running.
        client = None
        partitions_per_chunk = 1
    batches = []
    for batch in CatalogStream(
        self, client=client, partitions_per_chunk=partitions_per_chunk, shuffle=False
    ):
        if len(batch) == 0:
            continue
        batches.append(batch)
    result = pd.concat(batches)
    return npd.NestedFrame(result)
```
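The chunking pattern above can be sketched in isolation. This is a minimal, self-contained illustration using plain Python lists as hypothetical stand-ins for Dask partitions and pandas frames (the real proposal would operate on a Catalog via `CatalogStream`); `stream_compute` and `n_workers` are invented names for illustration only:

```python
def stream_compute(partitions, n_workers=2):
    """Compute a result chunk-by-chunk instead of all at once."""
    batches = []
    # Process n_workers partitions per iteration, mirroring
    # partitions_per_chunk = number of workers in the proposal.
    for start in range(0, len(partitions), n_workers):
        chunk = partitions[start:start + n_workers]
        # "Compute" the chunk (here: just flatten it); skip empty results,
        # as the proposal does with `if len(batch) == 0: continue`.
        batch = [row for part in chunk for row in part]
        if not batch:
            continue
        batches.append(batch)
    # Concatenate all batches into the final result (pd.concat analogue).
    return [row for batch in batches for row in batch]

parts = [[1, 2], [], [3], [4, 5]]
print(stream_compute(parts))  # [1, 2, 3, 4, 5]
```

Only one chunk of partitions needs to be materialized at a time, which is where the memory savings come from.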
Before submitting
Please check the following: