This repository was archived by the owner on Jan 12, 2026. It is now read-only.

Add multi label support v2 #306

Merged
Yard1 merged 5 commits into ray-project:master from louis-huang:add_multi_label_support_v2 on Mar 2, 2024

Conversation

@louis-huang
Contributor

This is a new pull request to replace #298 because I'm unable to work on that branch.

I copied the content from that pull request:
Hi, I added support for passing the label as a list, so we can read data with multiple labels. This solves #286.
I verified that the new unit tests pass, and that test_matrix.py passes fully with my local setup.
I also verified locally by training an XGBoost model on data in Parquet format, and it works well. So far it should work well for the Parquet data format. Thank you!
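
To illustrate the new behavior, here is a minimal sketch (not from this PR; the DataFrame and column names are hypothetical) of passing a list of label columns to RayDMatrix:

import pandas as pd
from xgboost_ray import RayDMatrix

# Hypothetical in-memory data with two feature columns and two label columns.
df = pd.DataFrame({
    "f0": [0.1, 0.2, 0.3, 0.4],
    "f1": [1.0, 0.0, 1.0, 0.0],
    "label_0": [0, 1, 0, 1],
    "label_1": [1, 1, 0, 0],
})

# With this change, the label argument may be a list of column names;
# the listed columns are split off together as the multi-label target.
dtrain = RayDMatrix(df, label=["label_0", "label_1"])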

I verified the change works with the code example below:

from sklearn.datasets import make_multilabel_classification
import numpy as np
import pandas as pd

# Generate a small multi-label dataset and write it out as Parquet.
n_classes = 5
random_state = 0
X, y = make_multilabel_classification(
    n_samples=32, n_classes=n_classes, n_labels=3, random_state=random_state
)
features = [f"f{i}" for i in range(len(X[0]))]
labels = [f"label_{i}" for i in range(n_classes)]

X_df = pd.DataFrame(X, columns=features)
y_df = pd.DataFrame(y, columns=labels)
data = pd.concat([X_df, y_df], axis=1)

data.to_parquet("~/Desktop/sample_data/data.parquet")

# Build a RayDMatrix from the Parquet directory, passing the label columns as a list.
from xgboost_ray import RayDMatrix, RayParams, RayFileType, train, predict

training_data = "~/Desktop/sample_data"
train_set = RayDMatrix(
    training_data, labels, columns=features + labels, filetype=RayFileType.PARQUET
)

evals_result = {}
bst = train(
    {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "random_state": random_state,
    },
    train_set,
    num_boost_round=1,
    evals_result=evals_result,
    evals=[(train_set, "train")],
    verbose_eval=False,
    ray_params=RayParams(
        num_actors=1,  # Number of remote actors
        cpus_per_actor=1,
    ),
)

# bst.save_model("model.xgb")
# print("Final training error: {:.4f}".format(
#     evals_result["train"]["error"][-1]))

# Predict with xgboost_ray.
pred_ray = predict(bst, train_set, ray_params=RayParams(num_actors=1))
print(pred_ray)

# Compare against a plain XGBoost classifier trained on the same data.
import xgboost as xgb

clf = xgb.XGBClassifier(tree_method="hist", n_estimators=1, random_state=random_state)
clf.fit(X, y)
expected = clf.predict_proba(X)

np.testing.assert_allclose(expected, pred_ray)

@louis-huang
Contributor Author

@Yard1 Hi, could you please review this again? Thank you so much!!

Member

@Yard1 left a comment


Thanks, LGTM!

@Yard1
Member

Yard1 commented Feb 27, 2024

@matthewdeng could you help get this merged and released?

@louis-huang
Contributor Author

louis-huang commented Feb 28, 2024

Hi @Yard1, thanks for your review! I had missed a change and just added it. I need approval to run the workflow. Thank you!

@Yard1
Member

Yard1 commented Mar 1, 2024

@louis-huang Looks like just the lint needs to be fixed.

@louis-huang
Contributor Author

@Yard1 Got it, fixed now. I'm not yet used to running format.sh whenever I make a change; I'll remember it next time. Thank you for your help!

@Yard1 merged commit e904925 into ray-project:master on Mar 2, 2024