[python-package] Adding support for polars for input data · #6204

@detrin

Summary

I think the polars library is on track to replace pandas for the majority of use cases. The community is already adopting it. At my company, we use it internally for new projects and try not to use pandas at all.

Motivation

Polars is very fast, and its memory footprint is several times lower than that of pandas. With native polars support, there would be no need to spend extra memory converting data to numpy or pandas before training in LightGBM.

Description

I would like the following to work, where data_train and data_test are instances of pl.DataFrame:

y_train = data_train[col_target]
y_test = data_test[col_target]
X_train = data_train.select(cols_pred)
X_test = data_test.select(cols_pred)

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    "boosting_type": "gbdt",
    "objective": "regression",
    "metric": {"l2", "l1"},
    "learning_rate": 0.1,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "verbose": 0,
    "num_leaves": 42,
    "max_depth": 5,
    "num_iterations": 5000,
    "min_data_in_leaf": 500,
    "reg_alpha": 2, 
    "reg_lambda": 5,
}

gbm = lgb.train(
    params,
    lgb_train,
    valid_sets=lgb_eval,
    callbacks=[lgb.early_stopping(stopping_rounds=500)],
)

As of now, I have to convert the data to numpy arrays first:

y_train = data_train[col_target].to_numpy()
y_test = data_test[col_target].to_numpy()
X_train = data_train.select(cols_pred).to_numpy()
X_test = data_test.select(cols_pred).to_numpy()
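
One drawback of this workaround is that the numpy arrays no longer carry the column names. Until polars input is supported, a small helper along these lines could bundle the conversion and keep the names. This is only a sketch: `polars_to_dataset_kwargs` is a hypothetical name, not part of LightGBM or polars, and it assumes the frame exposes `select`, `columns`, `__getitem__` and `to_numpy` the way `pl.DataFrame` does.

```python
def polars_to_dataset_kwargs(frame, col_target, cols_pred):
    """Build keyword arguments for lgb.Dataset from a polars-like frame.

    Hypothetical helper (not a LightGBM or polars API). It still copies
    the data into numpy arrays, so it does not save any memory; it only
    keeps the feature names that the plain to_numpy() call drops.
    """
    features = frame.select(cols_pred)
    return {
        "data": features.to_numpy(),              # 2D feature matrix
        "label": frame[col_target].to_numpy(),    # 1D target vector
        "feature_name": list(features.columns),   # names in column order
    }


# Intended usage with the names from the snippets above:
#   lgb_train = lgb.Dataset(**polars_to_dataset_kwargs(data_train, col_target, cols_pred))
#   lgb_eval = lgb.Dataset(
#       **polars_to_dataset_kwargs(data_test, col_target, cols_pred),
#       reference=lgb_train,
#   )
```

The rest of the training code stays unchanged; only the construction of the two `lgb.Dataset` objects differs.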
