[Feature Request] Using Polars for loading and dumping data

Hello, thank you for developing such a cool tool!

### Summary
I would like to request support for Polars when loading and dumping data.
[Polars](https://github.com/pola-rs/polars) is a blazingly fast DataFrame library implemented in Rust that uses the [Apache Arrow Columnar Format](https://arrow.apache.org/docs/format/Columnar.html) as its memory model.
If gokart supported it, the machine learning cycle could become even faster.

### Implementation idea
I have tried a very simple implementation for parquet files [here](https://github.com/m3dev/gokart/compare/master...takeyama0:gokart:master). 
The changes are as follows.

1.  Add a config module at gokart/config, with an `__init__.py` in it.
```python
# gokart/config/__init__.py
from gokart.config import config
from gokart.config.config import (
    get_option,
    set_option,
)
```
2.  Create config.py in gokart/config. This file contains the "_global_config" variable and the "register_option", "get_option", and "set_option" functions. "_global_config" holds the global settings as a dictionary and is read and updated through these functions. (Currently, config_init.py registers only the "use_polars" option in "_global_config".)


```python
# gokart/config/config.py
from typing import Any, Dict

_global_config: Dict[str, Any] = {}


def register_option(
    key: str,
    val: object,
    doc: str = "",
) -> None:
    _global_config.update({key: val})


def get_option(
    key: str,
) -> object:
    assert key in _global_config, f"No such keys: {key}"
    return _global_config[key]


def set_option(
    key: str,
    val: object,
    doc: str = "",
) -> None:
    assert key in _global_config, f"No such keys: {key}"
    _global_config.update({key: val})
```

3.  Create config_init.py in gokart/config. This file initializes "_global_config".
```python
# gokart/config/config_init.py
import gokart.config.config as cf

use_polars = """
: boolean
    Whether to use polars instead of pandas
"""

cf.register_option(
    "use_polars",
    False,
    use_polars,
)
```
4.  Modify `gokart/__init__.py` to import gokart.config.
```python
# gokart/__init__.py
from gokart.config import config_init, get_option, set_option
from gokart.build import build
...
```
5.  Modify the ParquetFileProcessor class in gokart/file_processor.py to load and dump data with Polars when the "use_polars" option is True.
```python
# gokart/file_processor.py
import pandas as pd
import polars as pl

from gokart.config import get_option


class ParquetFileProcessor(FileProcessor):
    ...

    def load(self, file):
        # MEMO: read_parquet only supports a filepath as string (not a file handle)
        if get_option("use_polars"):
            return pl.read_parquet(file.name)
        else:
            return pd.read_parquet(file.name)

    def dump(self, obj, file):
        # pl.DataFrame is the public name of the Polars DataFrame class;
        # the pl.internals path is private and not stable across Polars versions.
        assert isinstance(obj, (pd.DataFrame, pl.DataFrame)), \
            f'requires pd.DataFrame or pl.DataFrame, but {type(obj)} is passed.'
        # MEMO: to_parquet only supports a filepath as string (not a file handle)
        if isinstance(obj, pd.DataFrame):
            obj.to_parquet(file.name, index=False, compression=self._compression)
        else:
            obj.write_parquet(file.name, compression=self._compression if self._compression is not None else 'zstd')
```

I am not very familiar with best practices for this kind of option, but if you let me know what needs to be fixed, I can work on it and open a pull request.

[Feature Request] Using Polars for loading and dumping data #304
