This repository was archived by the owner on Jan 12, 2026. It is now read-only.

add multi-label support #298

Open
louis-huang wants to merge 3 commits into ray-project:master from louis-huang:add_multi_label_support

Conversation

@louis-huang
Contributor

Hi, I added support for passing the label as a list, so that data with multiple labels can be read. This should resolve #286.
I verified that the new unit tests pass, and all tests in test_matrix.py also pass with my local setup.
I also verified the change locally by training an XGBoost model on Parquet data, and it works well. So far it should work for the Parquet data format. Thank you!
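As a rough illustration of what "label as a list" means for the data split, here is a minimal pandas-only sketch. The helper name `split_features_and_labels` is hypothetical and is not part of xgboost_ray; it only shows the idea of accepting either one column name or a list of names and returning a 2-D label block.

```python
from typing import List, Sequence, Tuple, Union

import pandas as pd


def split_features_and_labels(
    data: pd.DataFrame, label: Union[str, Sequence[str]]
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical helper: split a frame into feature and label columns."""
    # Normalise a single column name to a one-element list.
    label_cols: List[str] = [label] if isinstance(label, str) else list(label)
    missing = [c for c in label_cols if c not in data.columns]
    if missing:
        raise KeyError(f"label columns not found: {missing}")
    features = data.drop(columns=label_cols)
    labels = data[label_cols]  # always 2-D: (n_rows, n_labels)
    return features, labels


frame = pd.DataFrame(
    {"f0": [0.1, 0.2], "f1": [1.0, 2.0], "label_0": [0, 1], "label_1": [1, 1]}
)
features, labels = split_features_and_labels(frame, ["label_0", "label_1"])
single_features, single_labels = split_features_and_labels(frame, "label_0")
```

With a list of two label names the label block has two columns; with a single name it has one column and the other label column stays among the features.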

@louis-huang
Contributor Author

I verified that the change works with the code example below:

from sklearn.datasets import make_multilabel_classification
import pandas as pd
import numpy as np
n_classes = 5
random_state = 0
# Generate a small multi-label dataset: 32 samples, n_classes binary labels per sample.
X, y = make_multilabel_classification(n_samples=32, n_classes=n_classes, n_labels=3, random_state=random_state)
features = [f"f{i}" for i in range(len(X[0]))]
labels = [f"label_{i}" for i in range(n_classes)]

X_df = pd.DataFrame(X, columns = features)
y_df = pd.DataFrame(y, columns = labels)
data = pd.concat([X_df, y_df], axis = 1)

data.to_parquet("~/Desktop/sample_data/data.parquet")

from xgboost_ray import RayDMatrix, RayParams, train, RayFileType
n_classes = 5
features = [f"f{i}" for i in range(20)]

labels = [f"label_{i}" for i in range(n_classes)]

training_data = "~/Desktop/sample_data"
train_set = RayDMatrix(training_data, labels, columns = features + labels, filetype=RayFileType.PARQUET)

evals_result = {}
bst = train(
    {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "random_state": random_state,
    },
    train_set,
    num_boost_round = 1,
    evals_result=evals_result,
    evals=[(train_set, "train")],
    verbose_eval=False,
    ray_params=RayParams(
        num_actors=1,  # Number of remote actors
        cpus_per_actor=1))

#bst.save_model("model.xgb")
#print("Final training error: {:.4f}".format(
#    evals_result["train"]["error"][-1]))

from xgboost_ray import predict
pred_ray = predict(bst, train_set, ray_params=RayParams(num_actors=1))
print(pred_ray)


import xgboost as xgb

clf = xgb.XGBClassifier(tree_method="hist", n_estimators = 1, random_state=0)
clf.fit(X, y)
expected = clf.predict_proba(X)

np.testing.assert_allclose(expected, pred_ray)

@heyitsmui

@Yard1 can you help take a look when you get a chance? thanks!

  def get_column(
      cls, data: pd.DataFrame, column: Any
- ) -> Tuple[pd.Series, Optional[str]]:
+ ) -> Tuple[pd.Series, Optional[Union[str, List]]]:

Should we add a separate get_columns(...) instead of overloading this method?
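To make the suggestion concrete, here is a minimal sketch of the two-method alternative, written as free functions for brevity (in the diff above, get_column takes cls as its first argument). The bodies and the get_columns signature are assumptions for illustration only, not the code in this PR or in xgboost_ray.

```python
from typing import List, Tuple

import pandas as pd


def get_column(data: pd.DataFrame, column: str) -> Tuple[pd.Series, str]:
    """Single label: return a 1-D Series and the column name to exclude."""
    return data[column], column


def get_columns(
    data: pd.DataFrame, columns: List[str]
) -> Tuple[pd.DataFrame, List[str]]:
    """Multiple labels (hypothetical): return a 2-D frame and the names to exclude."""
    names = list(columns)
    return data[names], names


frame = pd.DataFrame({"f0": [0.1, 0.2], "label_0": [0, 1], "label_1": [1, 1]})
one_label, one_name = get_column(frame, "label_0")
many_labels, many_names = get_columns(frame, ["label_0", "label_1"])
```

Keeping the two paths separate gives each function one narrow return type (Series plus str, or DataFrame plus list), whereas the overloaded version needs the wider Optional[Union[str, List]] annotation shown in the diff.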

Member

@Yard1 Yard1 left a comment

LGTM, thanks! cc @krfricke

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
@louis-huang
Contributor Author

Hi @Yard1, may I ask how to fix the lint test? It seems to still be blocking the merge. Thank you!

@Yard1
Member

Yard1 commented Nov 9, 2023

Can you run the ./format.sh script in the root of the repo?

@yc2984

yc2984 commented Mar 21, 2024

@louis-huang can you please run the above test?
