Param tuning code integration: pca chosen #209
@@ -142,8 +142,14 @@ def pca(dataframe: pd.DataFrame, output_png: str, output_var: str, output_coord:
    if not isinstance(labels, bool):
        raise ValueError(f"labels={labels} must be True or False")

    scaler = StandardScaler()
    #TODO: MinMaxScaler changes nothing about the data
    # scaler = MinMaxScaler()
How to do PCA on Binary Data?
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html (the data are not Gaussian-distributed, so it does not make sense to use StandardScaler)
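To illustrate why MinMaxScaler "changes nothing" on this data, here is a minimal pure-Python sketch of the per-column formulas the two scalers apply: min-max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / std using the population standard deviation. The helper names `min_max_scale` and `standardize` and the sample column are hypothetical, made up for this illustration; they are not code from this PR.

```python
from statistics import fmean, pstdev


def min_max_scale(column):
    """Min-max scale one column to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: there is no range to rescale (choice made for this sketch).
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def standardize(column):
    """Standardize one column: (x - mean) / population standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # Constant column: avoid division by zero (choice made for this sketch).
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical one-hot (0/1) feature column containing both values.
binary_column = [0, 1, 1, 0, 1]

# min = 0 and max = 1, so (x - 0) / (1 - 0) returns every value unchanged.
min_max_scaled = min_max_scale(binary_column)

# Standardization does change the values: the column still has only two
# distinct values, but they are shifted and stretched by the column mean and spread.
standardized = standardize(binary_column)
```

Any 0/1 column that contains both values has min 0 and max 1, so min-max scaling is the identity on it, which matches the TODO note in the diff.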
https://stackoverflow.com/questions/40795141/pca-for-categorical-features
https://stats.stackexchange.com/questions/159705/would-pca-work-for-boolean-binary-data-types
Across several forum threads, people suggest not using PCA on binary data.
It might be best to keep these features as one-hot encoded values (which they already are).
@agitter Do this PR last. It will need to be merged with the updated master after #193, #207, and #208 are merged (hopefully this will remove the repeated files throughout the PRs).
Included in this PR:
For the create palette function, update the code to use a sorted list of unique column headers:
unique_column_names = sorted(set(column_names))
custom_palette = sns.color_palette(palette="tab20c", n_colors=len(unique_column_names))
label_color_map = dict(zip(unique_column_names, custom_palette, strict=True))
return label_color_map