fix(automl): fix data leakage in classification holdout validation #1418
base: main
```diff
@@ -514,45 +514,40 @@ def prepare_data(
             last = first[i] + 1
             rest.extend(range(last, len(y_train_all)))
             X_first = X_train_all.iloc[first] if data_is_df else X_train_all[first]
-            if len(first) < len(y_train_all) / 2:
-                # Get X_rest and y_rest with drop, sparse matrix can't apply np.delete
-                X_rest = (
-                    np.delete(X_train_all, first, axis=0)
-                    if isinstance(X_train_all, np.ndarray)
-                    else X_train_all.drop(first.tolist())
-                    if data_is_df
-                    else X_train_all[rest]
-                )
-                y_rest = (
-                    np.delete(y_train_all, first, axis=0)
-                    if isinstance(y_train_all, np.ndarray)
-                    else y_train_all.drop(first.tolist())
-                    if data_is_df
-                    else y_train_all[rest]
-                )
-            else:
-                X_rest = (
-                    iloc_pandas_on_spark(X_train_all, rest)
-                    if is_spark_dataframe
-                    else X_train_all.iloc[rest]
-                    if data_is_df
-                    else X_train_all[rest]
-                )
-                y_rest = (
-                    iloc_pandas_on_spark(y_train_all, rest)
-                    if is_spark_dataframe
-                    else y_train_all.iloc[rest]
-                    if data_is_df
-                    else y_train_all[rest]
-                )
+            X_rest = (
+                np.delete(X_train_all, first, axis=0)
+                if isinstance(X_train_all, np.ndarray)
+                else X_train_all.drop(first.tolist())
+                if data_is_df
+                else X_train_all[rest]
+            )
+            y_rest = (
+                np.delete(y_train_all, first, axis=0)
+                if isinstance(y_train_all, np.ndarray)
+                else y_train_all.drop(first.tolist())
+                if data_is_df
+                else y_train_all[rest]
+            )
             stratify = y_rest if split_type == "stratified" else None
             X_train, X_val, y_train, y_val = self._train_test_split(
                 state, X_rest, y_rest, first, rest, split_ratio, stratify
             )
-            X_train = concat(X_first, X_train)
-            y_train = concat(label_set, y_train) if data_is_df else np.concatenate([label_set, y_train])
-            X_val = concat(X_first, X_val)
-            y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
+            # Check whether the training set and validation set cover all categories.
+            train_labels = np.unique(y_train)
+            val_labels = np.unique(y_val)
```
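The core of the fix: previously the `X_first` rows (the first occurrence of each class) were concatenated into both the training and the validation split, so identical rows sat on both sides of the holdout. A toy index-level sketch of the difference (all names here are illustrative, not the project's):

```python
import numpy as np

# Toy labels; `first` holds the first occurrence of each class, as in prepare_data.
y = np.array([0, 1, 2, 0, 1, 2, 0, 1])
label_set, first = np.unique(y, return_index=True)
rest = np.setdiff1d(np.arange(len(y)), first)

# Pretend the train/val split over `rest` produced these positions.
train_idx, val_idx = rest[:3], rest[3:]

# Old behaviour: the `first` rows went into BOTH splits -> leakage.
old_train = np.concatenate([first, train_idx])
old_val = np.concatenate([first, val_idx])
assert set(old_train) & set(old_val)  # identical rows appear on both sides

# Fixed behaviour: a first-occurrence row is added to a split only when that
# split would otherwise miss the class, so no row is shared across splits.
new_train, new_val = set(train_idx), set(val_idx)
for label, i in zip(label_set, first):
    if label in y[train_idx] and label in y[val_idx]:
        new_train.add(i)  # present in both: the extra row goes to train only
    else:
        if label not in y[train_idx]:
            new_train.add(i)
        if label not in y[val_idx]:
            new_val.add(i)
assert not (new_train & new_val)  # no overlap, every class still covered
```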
Comment on lines +536 to +537

Reviewer: `np.unique` doesn't work for psSeries or psDataFrame. Check an example here.

Author: It can be modified like this: `if isinstance(y_val, (ps.Series, ps.DataFrame)): …`

Reviewer: Try reusing existing functions.

Author: This is a pretty big project and I'm not very familiar with it yet. I'm not sure whether there are existing functions I can use.

Reviewer: You can reuse the function …

Author: Yes, I noticed that function after seeing your correction; I hadn't paid attention to it before. Thank you! This is the modified code: …
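As the thread notes, `np.unique` rejects pandas-on-Spark objects. A hedged sketch of a backend-dispatching helper (the name `unique_labels` and the collect-to-pandas strategy are mine, not the project's; FLAML's own utility may differ):

```python
import numpy as np
import pandas as pd


def unique_labels(y):
    """Sorted unique labels of y across backends.

    np.unique handles numpy arrays and plain pandas objects, but raises on
    pandas-on-Spark (ps.Series / ps.DataFrame). For those we collect the
    label column to pandas first -- labels are few, so this is cheap.
    """
    try:  # pyspark is optional; its absence simply means y is not a ps object
        import pyspark.pandas as ps

        if isinstance(y, (ps.Series, ps.DataFrame)):
            y = y.to_pandas()
    except ImportError:
        pass
    return np.unique(np.asarray(y).ravel())
```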
```diff
+            missing_in_train = set(label_set) - set(train_labels)
+            missing_in_val = set(label_set) - set(val_labels)
+
+            # Add X_first only to the validation set (remove the merge into the training set).
+            if missing_in_val:
+                X_val = concat(X_first, X_val)
+                y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
+            # The training set supplements the missing categories with the remaining data.
+            if missing_in_train:
```
Comment on lines +542 to +546

Reviewer: What if `missing_in_val` only has one value missing?

Author: Indeed, the code can be optimized. `if missing_in_val: …`

Reviewer: A better way is to fill the missing labels in both train and val with `first`. For those not missing in either train or val, put them in train.

Author: Indeed, I've learned it.

Author: `if missing_in_val: … if missing_in_train: … common_labels = set(train_labels) & set(val_labels)  # put labels that are not missing in train or val into train`
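The reviewer's allocation rule can be sketched as a small pure function (a sketch under assumed semantics: `first[i]` is the first row of `label_set[i]`; the helper name is mine):

```python
def allocate_first_rows(label_set, train_labels, val_labels):
    """Decide where each first-occurrence row goes, per the review.

    A label's first row is sent to every split that is missing the label;
    labels already present in both splits send theirs to train only, so no
    row is duplicated across train and val (avoiding the original leakage).
    Returns two lists of indices into label_set / first.
    """
    to_train, to_val = [], []
    for i, label in enumerate(label_set):
        missing_train = label not in train_labels
        missing_val = label not in val_labels
        if missing_train:
            to_train.append(i)
        if missing_val:
            to_val.append(i)
        if not missing_train and not missing_val:
            to_train.append(i)  # present everywhere: keep the spare row in train
    return to_train, to_val
```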
```diff
+                for label in missing_in_train:
+                    mask = (y_rest == label)
+                    X_train = concat(X_rest[mask], X_train)
+                    y_train = concat(y_rest[mask], y_train)
```
Comment on lines +548 to +550

Reviewer: `X_rest[mask]` may not work for a dataframe.

Author: Perhaps this could work: `X_train = X_train.union(filtered_X_rest)`
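For the dataframe concern, one portable spelling is to dispatch on the container type. A minimal sketch (the helper name is mine; `.loc` with an aligned boolean Series is the idiomatic pandas form, while numpy takes the mask directly):

```python
import numpy as np
import pandas as pd


def rows_with_label(X_rest, y_rest, label):
    """Select the rows of X_rest whose label equals `label`.

    numpy arrays take a boolean mask directly; pandas objects are indexed
    with .loc and a boolean Series aligned on the index, which also works
    when X_rest carries a non-default index.
    """
    if isinstance(X_rest, np.ndarray):
        return X_rest[np.asarray(y_rest) == label]
    return X_rest.loc[y_rest == label]
```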
```diff
             if isinstance(y_train, (psDataFrame, pd.DataFrame)) and y_train.shape[1] == 1:
                 y_train = y_train[y_train.columns[0]]
```
Reviewer: Why remove the second way of getting `X_rest`?

Author: Oh, I made a mistake about it. The second part should be kept here.
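The two-branch logic the review concluded should be kept can be sketched like this (non-Spark paths only; `iloc_pandas_on_spark` and the exact surrounding signature belong to the project, so this is an assumed simplification):

```python
import numpy as np


def split_rest(X_train_all, y_train_all, first, rest, data_is_df):
    """Both ways of getting the "rest" rows, as in the original code.

    When `first` is small relative to the data, dropping those few rows is
    cheap; otherwise taking the many `rest` rows positionally is cheaper.
    Sparse matrices can't use np.delete, hence the drop/indexing fallbacks.
    """
    if len(first) < len(y_train_all) / 2:
        # Few rows to remove: delete/drop the `first` rows.
        X_rest = (
            np.delete(X_train_all, first, axis=0)
            if isinstance(X_train_all, np.ndarray)
            else X_train_all.drop(list(first)) if data_is_df else X_train_all[rest]
        )
        y_rest = (
            np.delete(y_train_all, first, axis=0)
            if isinstance(y_train_all, np.ndarray)
            else y_train_all.drop(list(first)) if data_is_df else y_train_all[rest]
        )
    else:
        # Many rows to remove: take the `rest` rows positionally instead.
        X_rest = X_train_all.iloc[rest] if data_is_df else X_train_all[rest]
        y_rest = y_train_all.iloc[rest] if data_is_df else y_train_all[rest]
    return X_rest, y_rest
```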