65 changes: 30 additions & 35 deletions flaml/automl/task/generic_task.py
@@ -514,45 +514,40 @@ def prepare_data(
last = first[i] + 1
rest.extend(range(last, len(y_train_all)))
X_first = X_train_all.iloc[first] if data_is_df else X_train_all[first]
if len(first) < len(y_train_all) / 2:
# Get X_rest and y_rest with drop, sparse matrix can't apply np.delete
X_rest = (
np.delete(X_train_all, first, axis=0)
if isinstance(X_train_all, np.ndarray)
else X_train_all.drop(first.tolist())
if data_is_df
else X_train_all[rest]
)
y_rest = (
np.delete(y_train_all, first, axis=0)
if isinstance(y_train_all, np.ndarray)
else y_train_all.drop(first.tolist())
if data_is_df
else y_train_all[rest]
)
else:
X_rest = (
iloc_pandas_on_spark(X_train_all, rest)
if is_spark_dataframe
else X_train_all.iloc[rest]
if data_is_df
else X_train_all[rest]
)
y_rest = (
iloc_pandas_on_spark(y_train_all, rest)
if is_spark_dataframe
else y_train_all.iloc[rest]
if data_is_df
else y_train_all[rest]
)
Comment on lines -517 to -547

Collaborator:
Why remove the second way of getting X_rest?

Author:
Oh, I made a mistake about it. The second part should be kept here:

X_rest = (
np.delete(X_train_all, first, axis=0)
if isinstance(X_train_all, np.ndarray)
else X_train_all.drop(first.tolist())
if data_is_df
else X_train_all[rest]
)
y_rest = (
np.delete(y_train_all, first, axis=0)
if isinstance(y_train_all, np.ndarray)
else y_train_all.drop(first.tolist())
if data_is_df
else y_train_all[rest]
)
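As a standalone sanity check of the restored branch, the same three-way dispatch (NumPy `np.delete`, DataFrame `drop`, plain fancy indexing for other containers such as sparse matrices) can be sketched as follows; the function and variable names here are illustrative, not from the PR:

```python
import numpy as np
import pandas as pd

def take_rest(data, first_idx, rest_idx, data_is_df):
    # ndarray: np.delete drops the rows at first_idx
    if isinstance(data, np.ndarray):
        return np.delete(data, first_idx, axis=0)
    # DataFrame: drop by label (a default RangeIndex is assumed here)
    if data_is_df:
        return data.drop(list(first_idx))
    # other containers (e.g. sparse matrices): plain fancy indexing
    return data[rest_idx]

X = np.arange(10).reshape(5, 2)
first = np.array([0, 2])
rest = [1, 3, 4]
print(take_rest(X, first, rest, False).tolist())                     # [[2, 3], [6, 7], [8, 9]]
print(take_rest(pd.DataFrame(X), first, rest, True).index.tolist())  # [1, 3, 4]
```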
stratify = y_rest if split_type == "stratified" else None
X_train, X_val, y_train, y_val = self._train_test_split(
state, X_rest, y_rest, first, rest, split_ratio, stratify
)
X_train = concat(X_first, X_train)
y_train = concat(label_set, y_train) if data_is_df else np.concatenate([label_set, y_train])
X_val = concat(X_first, X_val)
y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
# Check whether the training set and validation set cover all categories.
train_labels = np.unique(y_train)
val_labels = np.unique(y_val)
Comment on lines +536 to +537

Collaborator:
np.unique doesn't work for psSeries or psDataFrame. Check an example here.

Author:
It can be modified like this:
if isinstance(y_train, (ps.Series, ps.DataFrame)):
train_labels = y_train.unique() if isinstance(y_train, ps.Series) else y_train.iloc[:, 0].unique()
else:
train_labels = np.unique(y_train)

if isinstance(y_val, (ps.Series, ps.DataFrame)):
val_labels = y_val.unique() if isinstance(y_val, ps.Series) else y_val.iloc[:, 0].unique()
else:
val_labels = np.unique(y_val)
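A plain-pandas sketch of the same dispatch: pandas-on-Spark Series/DataFrames expose the same `.unique()` API as pandas, so the `ps` branches look identical; the `ps` import is omitted here so the sketch runs without Spark, and a single label column is assumed for the DataFrame case:

```python
import numpy as np
import pandas as pd

def unique_labels(y):
    # DataFrame (single label column assumed): take the first column's uniques
    if isinstance(y, pd.DataFrame):
        return y.iloc[:, 0].unique()
    # Series: use the vectorized .unique() directly
    if isinstance(y, pd.Series):
        return y.unique()
    # ndarray or other array-like: fall back to np.unique
    return np.unique(y)

print(sorted(unique_labels(pd.Series([1, 2, 2, 3]))))  # [1, 2, 3]
```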

Collaborator:
Try reusing existing functions.

Author:
This is a pretty big project and I'm not very familiar with it yet, so I'm not sure whether there are already existing functions I can reuse.

Collaborator (@thinkall, Apr 9, 2025):
Check an example here.

You can reuse the function len_labels.
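For reference, here is a minimal reimplementation of what `len_labels` is assumed to do (this is a sketch of the NumPy path only; the real helper in FLAML's data utilities also handles pandas-on-Spark inputs):

```python
import numpy as np

def len_labels(y, return_labels=False):
    # Count distinct labels; optionally hand the labels themselves back too
    labels = np.unique(y)
    if return_labels:
        return len(labels), labels
    return len(labels)

n, labels = len_labels(np.array([0, 1, 1, 2]), return_labels=True)
print(n, labels.tolist())  # 3 [0, 1, 2]
```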

Author:
Yes, I noticed this function after seeing your correction; I hadn't paid attention to it before. Thank you!

Author:
This is the modified code:
_, train_labels = len_labels(y_train, return_labels=True)
_, val_labels = len_labels(y_val, return_labels=True)

missing_in_train = set(label_set) - set(train_labels)
missing_in_val = set(label_set) - set(val_labels)

# Add X_first only to the validation set (remove the merge into the training set).
if missing_in_val:
X_val = concat(X_first, X_val)
y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
# Supplement the training set's missing categories from the remaining data.
if missing_in_train:
Comment on lines +542 to +546

Collaborator:
What if missing_in_val has only one value missing?

Author:
Indeed, the code can be optimized.

if missing_in_val:
if len(label_set) == 1:
X_val = concat(X_first.iloc[[0]], X_val)
y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
else:
X_val = concat(X_first, X_val)
y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])

Collaborator:
A better way is to fill the missing labels in both train and val with first. For those missing in neither train nor val, put them in train.
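One reading of this suggestion, sketched in NumPy: `X_first`/`label_set` hold the first-occurrence row of each label (as in the surrounding diff), and each split gets only the rows for labels it is actually missing. The helper name is illustrative, not from the PR:

```python
import numpy as np

def patch_missing(X_train, y_train, X_val, y_val, X_first, label_set):
    # For every label absent from a split, append that label's
    # first-occurrence row to the split; rows already covered stay put.
    for i, label in enumerate(label_set):
        if label not in y_train:
            X_train = np.concatenate([X_train, X_first[i:i + 1]], axis=0)
            y_train = np.concatenate([y_train, [label]])
        if label not in y_val:
            X_val = np.concatenate([X_val, X_first[i:i + 1]], axis=0)
            y_val = np.concatenate([y_val, [label]])
    return X_train, y_train, X_val, y_val

label_set = np.array([0, 1, 2])
X_first = np.array([[0.0], [1.0], [2.0]])  # one representative row per label
X_tr, y_tr, X_va, y_va = patch_missing(
    np.array([[0.0], [1.0]]), np.array([0, 1]),  # train misses label 2
    np.array([[0.0]]), np.array([0]),            # val misses labels 1 and 2
    X_first, label_set,
)
print(sorted(y_tr.tolist()), sorted(y_va.tolist()))  # [0, 1, 2] [0, 1, 2]
```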

Author:
Indeed, I've learned from it.

Author (@commint-tian, Apr 14, 2025):

if missing_in_val:
if data_is_df:
X_val = concat([X_first, X_val])
y_val = concat([label_set, y_val])
else:
X_val = np.concatenate([X_first, X_val], axis=0)
y_val = np.concatenate([label_set, y_val])

if missing_in_train:
if data_is_df:
X_train = concat([X_first, X_train])
y_train = concat([label_set, y_train])
else:
X_train = np.concatenate([X_first, X_train], axis=0)
y_train = np.concatenate([label_set, y_train])

common_labels = set(train_labels) & set(val_labels)
only_train_labels = set(train_labels) - set(val_labels)
only_val_labels = set(val_labels) - set(train_labels)

# Move labels that appear only in val into train
for label in only_val_labels:
# Remove these labels from val
mask = y_val != label
X_val = X_val[mask]
y_val = y_val[mask]
# Add these labels to train
if data_is_df:
X_train = concat([X_train, X_first.loc[X_first[label_col] == label]])
y_train = concat([y_train, label_set.loc[label_set[label_col] == label]])
else:
X_train = np.concatenate([X_train, X_first[y_first == label]], axis=0)
y_train = np.concatenate([y_train, label_set[y_first == label]])

for label in missing_in_train:
mask = (y_rest == label)
X_train = concat(X_rest[mask], X_train)
y_train = concat(y_rest[mask], y_train)
Comment on lines +548 to +550

Collaborator:
X_rest[mask] may not work for dataframe.

Author:
Perhaps this could work:
mask = (y_rest == label)
filtered_X_rest = X_rest.filter(mask)
filtered_y_rest = y_rest.filter(mask)

X_train = X_train.union(filtered_X_rest)
y_train = y_train.union(filtered_y_rest)
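For plain pandas, boolean indexing on a DataFrame does work when the mask is an index-aligned boolean Series, and `.loc` is the explicit spelling; whether the same holds for pandas-on-Spark depends on the version, so treat this as the pandas-only case:

```python
import pandas as pd

X_rest = pd.DataFrame({"f": [10, 20, 30]}, index=[5, 6, 7])
y_rest = pd.Series([0, 1, 0], index=[5, 6, 7])

mask = y_rest == 0            # boolean Series aligned on the same index
picked_X = X_rest.loc[mask]   # rows whose label matches
picked_y = y_rest.loc[mask]
print(picked_X["f"].tolist(), picked_y.tolist())  # [10, 30] [0, 0]
```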


if isinstance(y_train, (psDataFrame, pd.DataFrame)) and y_train.shape[1] == 1:
y_train = y_train[y_train.columns[0]]