
Commit 7cf937d

fix batch parquet read (#102)
1 parent f6cca76 commit 7cf937d

File tree

2 files changed: +5 −5 lines changed


.gitignore

Lines changed: 1 addition & 1 deletion

@@ -2,7 +2,7 @@
 __pycache__/
 *.py[cod]
 *$py.class
-
+.idea
 # C extensions
 *.so

src/yandex_cloud_ml_sdk/_utils/pyarrow.py

Lines changed: 4 additions & 4 deletions

@@ -25,12 +25,12 @@ def get_next() -> RecordType | None:


 def read_dataset_records_sync(path: str, batch_size: int | None) -> Iterator[RecordType]:
-    import pyarrow.dataset as pd  # pylint: disable=import-outside-toplevel
+    import pyarrow.parquet as pq  # pylint: disable=import-outside-toplevel

     # we need use kwargs method to preserve original default value
     kwargs = {}
     if batch_size is not None:
         kwargs['batch_size'] = batch_size
-    dataset = pd.dataset(source=path, format='parquet')
-    for batch in dataset.to_batches(**kwargs):  # type: ignore[arg-type]
-        yield from batch.to_pylist()
+    with pq.ParquetFile(path) as reader:
+        for batch in reader.iter_batches(**kwargs):  # type: ignore[arg-type]
+            yield from batch.to_pylist()

0 commit comments
