Hello, and thank you for developing such a cool tool!
Summary
I have one feature request: support Polars for loading and dumping data.
Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as the memory model.
If gokart supported it, the machine learning cycle could become even faster.
Implementation idea
I have tried a very simple implementation for parquet files here.
The changes are as follows.
- Add a config module as gokart/config, with an __init__.py in this module.
# gokart/config/__init__.py
from gokart.config import config
from gokart.config.config import (
    get_option,
    set_option,
)
- Create config.py in gokart/config. This file contains the "_global_config" variable and the "register_option", "get_option", and "set_option" functions. "_global_config" holds the global settings as a dictionary and is read and written only through these functions. (Currently, "use_polars" is the only option registered in "_global_config", by config_init.py.)
# gokart/config/config.py
from typing import Any, Dict

_global_config: Dict[str, Any] = {}

def register_option(
    key: str,
    val: object,
    doc: str = "",
) -> None:
    _global_config.update({key: val})

def get_option(
    key: str,
) -> object:
    assert key in _global_config, f"No such keys: {key}"
    return _global_config[key]

def set_option(
    key: str,
    val: object,
    doc: str = "",
) -> None:
    assert key in _global_config, f"No such keys: {key}"
    _global_config.update({key: val})
- Create config_init.py in gokart/config. This file is used for "_global_config" initialization.
# gokart/config/config_init.py
import gokart.config.config as cf

use_polars = """
: boolean
    Whether to use polars instead of pandas
"""

cf.register_option(
    "use_polars",
    False,
    use_polars,
)
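To show how the option would behave once it is registered, here is a minimal, self-contained sketch. It re-declares a stand-in for the proposed registry instead of importing gokart, since gokart.config is part of this proposal and not an existing gokart API. One deliberate difference from the code above: unknown keys raise KeyError instead of failing an assert, because assert statements are skipped when Python runs with -O.

```python
from typing import Any, Dict

# Stand-in for the proposed gokart/config/config.py (not an existing gokart API).
_global_config: Dict[str, Any] = {}


def register_option(key: str, val: Any, doc: str = "") -> None:
    """Add an option with its default value; doc is accepted but not stored."""
    _global_config[key] = val


def get_option(key: str) -> Any:
    """Return the current value of a registered option."""
    if key not in _global_config:
        raise KeyError(f"No such key: {key}")
    return _global_config[key]


def set_option(key: str, val: Any) -> None:
    """Overwrite a registered option; unknown keys are rejected."""
    if key not in _global_config:
        raise KeyError(f"No such key: {key}")
    _global_config[key] = val


# Mirrors what config_init.py does at import time.
register_option("use_polars", False, "Whether to use polars instead of pandas")

default_value = get_option("use_polars")   # default registered above
set_option("use_polars", True)             # opt in to polars
enabled_value = get_option("use_polars")
```

With the proposed gokart/__init__.py, user code would call gokart.set_option("use_polars", True) once before building tasks.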
- Modify gokart/__init__.py to include gokart.config.
# gokart/__init__.py
from gokart.config import config_init, get_option, set_option
from gokart.build import build
...
- Modify the ParquetFileProcessor class in gokart/file_processor.py to load and dump data with Polars when the "use_polars" option is True.
class ParquetFileProcessor(FileProcessor):
    ...
    def load(self, file):
        # MEMO: read_parquet only supports a filepath as string (not a file handle)
        if get_option("use_polars"):
            return pl.read_parquet(file.name)
        else:
            return pd.read_parquet(file.name)

    def dump(self, obj, file):
        # pl.DataFrame is the public name of the polars DataFrame class
        assert isinstance(obj, (pd.DataFrame, pl.DataFrame)), \
            f'requires pd.DataFrame or pl.DataFrame, but {type(obj)} is passed.'
        # MEMO: to_parquet only supports a filepath as string (not a file handle)
        if isinstance(obj, pd.DataFrame):
            obj.to_parquet(file.name, index=False, compression=self._compression)
        else:
            # fall back to 'zstd' when no compression is configured
            obj.write_parquet(file.name, compression=self._compression if self._compression is not None else 'zstd')
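One open point with the change above is that file_processor.py would need polars importable even for users who never enable the option. A minimal sketch of one way to keep it optional, using only the standard library, is below. The helper names polars_available and resolve_backend are hypothetical and do not exist in gokart; they only illustrate the idea.

```python
import importlib.util


def polars_available() -> bool:
    """Return True if the polars package can be found in this environment."""
    return importlib.util.find_spec("polars") is not None


def resolve_backend(use_polars: bool) -> str:
    """Hypothetical helper: name the DataFrame backend a processor should use."""
    if not use_polars:
        return "pandas"
    if not polars_available():
        # Fail early with a clear message instead of a NameError inside load().
        raise ImportError('"use_polars" is enabled but polars is not installed.')
    return "polars"
```

A processor could call resolve_backend(get_option("use_polars")) in load and import polars only on the "polars" branch, so pandas-only installations would be unaffected.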
I am not very familiar with best practices for this kind of option, but if you tell me what needs to be fixed, I can work on it and open a pull request.