Param tuning code integration: pca chosen #209
base: main
Conversation
spras/analysis/ml.py
Outdated
```
@@ -142,8 +142,14 @@ def pca(dataframe: pd.DataFrame, output_png: str, output_var: str, output_coord:
    if not isinstance(labels, bool):
        raise ValueError(f"labels={labels} must be True or False")

    scaler = StandardScaler()
    # TODO: MinMaxScaler changes nothing about the data
    # scaler = MinMaxScaler()
```
How to do PCA on Binary Data?
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html (the data is not Gaussian-distributed, so it does not make sense to use StandardScaler)
https://stackoverflow.com/questions/40795141/pca-for-categorical-features
https://stats.stackexchange.com/questions/159705/would-pca-work-for-boolean-binary-data-types
Based on several different forum discussions, people suggest not using PCA on binary data.
It might be best to keep these features as one-hot encoded values (which they already are).
StandardScaler (with_std=True) is the wrong choice for this use case.
- The data is binary (0/1) and not Gaussian-distributed, so standardizing based on mean and variance doesn't make sense here.
- Applying standard scaling can distort sparse binary features: columns with mostly 0s and very few 1s (e.g., 99% zeros) will have a very small standard deviation.
- When divided by this small std, the rare 1s become disproportionately large, which causes PCA to overemphasize those columns.
- Additionally, standardization transforms values outside the binary range [0, 1], making interpretation of the PCA results less meaningful. (A quick numeric sketch follows this list.)
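A minimal sketch of the sparse-column effect, using a made-up 10x2 binary runs-by-edges matrix; the data and variable names are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy runs-by-edges matrix: edge 0 is rare (1 of 10 runs), edge 1 is common (9 of 10)
X = np.zeros((10, 2))
X[0, 0] = 1
X[1:, 1] = 1

print(X.std(axis=0))  # both columns have std 0.3, small because they are skewed

scaled = StandardScaler().fit_transform(X)
print(scaled[0, 0])   # the lone 1 in the rare column is inflated to 3.0
```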
Two options seem better (a sketch of the second follows this list):
- Pass the binary matrix as is:
  - Each column is a 0/1 vector representing whether each output includes an edge.
  - No preprocessing is done (no centering or standardization).
  - Edges that are frequently included across runs (common edges) will contribute more to the total variance.
  - As a result, PCA will favor these high-frequency edges, and the principal components will be aligned with global edge popularity.
  - Runs that include many common edges will appear more similar, even if they don't share rare or distinctive edges.
  - This leads to PCA grouping runs based on how many total edges they selected, rather than on which specific edges they selected.
  - The result is that PCA reflects output volume more than meaningful edge inclusion patterns.
- StandardScaler with with_std=False:
  - I want PCA to answer: which algorithm runs have similar edge inclusion patterns?
  - This transforms each column (edge) by subtracting its mean inclusion rate across all runs, centering values around 0.
  - This removes the effect of certain edges being globally common or rare; for example, edges that are always selected or never selected won't drive similarity.
  - Runs are "similar" if they include or avoid the same edges more than expected, compared to other runs.
  - This helps PCA focus on meaningful differences in edge preference behavior across runs, not just raw counts.
  - Without centering, common edges dominate the total variance simply because they are included frequently.
  - PCA is variance-driven, so it will align the first components with these common edges, even though they contribute no meaningful information.
  - Meanwhile, rare but distinctive edges are underweighted.
  - Centering fixes this by neutralizing the influence of globally common edges and allowing rare or selectively included edges to define the principal components.
  - Example:
    - Suppose edge E1 is included in 90% of runs. If two runs both include E1, that's not informative.
    - But if they both include rare edge E5 (included in only 10% of runs), they'll share a strong positive deviation, and PCA will pick up on this.
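A minimal sketch of option 2, assuming `dataframe` is the runs-by-edges binary matrix from the function in the diff above; note that scikit-learn's PCA also centers its input internally, so the explicit scaler mainly makes the centering choice visible in the code:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Center each edge column on its mean inclusion rate, but do not rescale it
scaler = StandardScaler(with_std=False)
X_centered = scaler.fit_transform(dataframe)

# Project the centered runs onto the top two principal components
pca_instance = PCA(n_components=2)
X_pca = pca_instance.fit_transform(X_centered)
```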
The background section of https://arxiv.org/abs/1510.06112 gives some good explanation and earlier work. I'm doubtful sklearn supports any of those options directly. A previous graduate student worked on general matrix factorization with multiple types of data and wrote custom code to support different data types, e.g. binary.
If we don't find correct PCA implementations, we could switch to a different visualization, like multidimensional scaling on the Jaccard similarities (sketched below). Or use option 2 above and document in the code and manuscript why we chose it even though it is not what we really want.
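A hedged sketch of the multidimensional scaling fallback, assuming `dataframe` is the runs-by-edges 0/1 matrix; this is not existing SPRAS code:

```python
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Pairwise Jaccard dissimilarities between runs, as a square matrix
jaccard_dist = squareform(pdist(dataframe.to_numpy().astype(bool), metric='jaccard'))

# Embed the runs in 2D so that distances approximate the Jaccard dissimilarities
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(jaccard_dist)
```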
We decided to use StandardScaler with with_std=False for this pull request and make a new issue to discuss switching to a better alternative.
@agitter Do this PR last. It will need to merge with the updated master after #193, #207, and #208 are merged. (Hopefully this will remove the repeated files throughout the PRs.)
Included in this PR:
Fix in this PR: #231
It is confusing that the chosen pathway is the empty one. I was expecting test/evaluate/input/data-test-params-789/pathway.txt to be the chosen pathway, but it's not the closest one to the centroid.
```
datapoint_labels        PC1                     PC2
data-test-params-123    -0.6564351782133608     -0.7071067811865477
data-test-params-456     0.9008137349989621      2.498001805406602e-16
data-test-params-789    -0.6564351782133608      0.7071067811865476
data-test-params-empty   0.41205662142775956    -1.942890293094024e-16
centroid                 0.0                    -2.0816681711721685e-17
```
Is this the best way to calculate the centroid then?
```python
# calculating the centroid
centroid = np.mean(X_pca, axis=0)  # mean of each principal component across all samples
```
Why did you expect a different pathway to be selected as closest to the centroid?
Do we want to eliminate empty pathways from the closest-to-centroid selection if there exist any non-empty alternatives?
spras/evaluation.py
Outdated
```python
# Distance from each run's PCA coordinates to the centroid, over all PC columns
pc_columns = [col for col in coord_df.columns if col.startswith('PC')]
coord_df['Distance To Centroid'] = np.sqrt(sum((coord_df[pc] - centroid[i]) ** 2 for i, pc in enumerate(pc_columns)))
closest_to_centroid = coord_df.sort_values(by='Distance To Centroid').iloc[0]
```
How do we deal with a tiebreaker?
If tied, add all of the ties to the rep_pathways list?
Can we easily access the pathway sizes? I would use smallest size as the tiebreaker.
I think we would need to read it from summary.txt, or read the graphs into nx directly (a sketch of the nx approach is below).
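A hypothetical sketch of the nx approach; the tab-separated edge list format and the variable `tied_pathways` are assumptions, not the actual SPRAS pathway file schema:

```python
import networkx as nx

def pathway_size(pathway_file: str) -> tuple[int, int]:
    # Assumes each line starts with "node1<TAB>node2"; extra columns are ignored
    graph = nx.read_edgelist(pathway_file, delimiter='\t', data=False)
    return graph.number_of_edges(), graph.number_of_nodes()

# Smallest pathway wins; the file name breaks any remaining ties deterministically
chosen = min(tied_pathways, key=lambda f: (*pathway_size(f), f))
```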
```
@@ -95,6 +96,9 @@ def make_final_input(wildcards):
    final_input.extend(expand('{out_dir}{sep}{dataset}-ml{sep}jaccard-matrix.txt',out_dir=out_dir,sep=SEP,dataset=dataset_labels,algorithm_params=algorithms_with_params))
    final_input.extend(expand('{out_dir}{sep}{dataset}-ml{sep}jaccard-heatmap.png',out_dir=out_dir,sep=SEP,dataset=dataset_labels,algorithm_params=algorithms_with_params))

    if generate_kde:
```
I don't think we need to save the KDE plot, but I left temporary code in case we want to.
Per our meeting discussion, the plot is nice to have, but if it introduces a great deal of additional complexity in terms of new code or the Snakemake rule, we can skip it.
Edit: it only makes sense to expose the KDE parameters bandwidth and kernel if we save the plots. Otherwise a user has no way to see the effects of these parameters.
Which KDE metrics are important to include / give the user access to?
Yes:
No:
This idea will only choose a pathway from a pathway reconstruction algorithm if it has multiple parameter combinations. AllPairs and algorithms with one chosen parameter combination will never be given a PCA-chosen pathway. Should we add the one pathway they produce to the list of representative pathways?
Locally this is the pathway I expect, but I think on GitHub Actions the expected one is the 789 pathway.
The visualization that sns.kdeplot produces looks better than the one I am using right now. (TODO: add images to show.)
spras/analysis/ml.py
Outdated
```python
# TODO: TEMP; still choosing the one maximum for now
max_row = max_rows.iloc[0]
kde_row = ['kde_peak', max_row["X_coordinate"], max_row["Y_coordinate"]]
```
Closest one to (0, 0) or some other value?
How to deal with KDE max tiebreakers: when deciding which maximum KDE point to highlight in a 2D PCA plot, I think it makes much more sense to use the point closest to the centroid instead of the origin (0, 0). The centroid actually reflects where the data points are concentrated in PCA space. On the other hand, (0, 0) is just a fixed mathematical point that doesn't tell me anything about how the data are distributed; especially after PCA, the points often aren't centered at the origin because of how the variance is spread across components. By using the centroid, I'm making sure that the selected KDE peak represents the overall structure and central tendency of the data, rather than just being arbitrarily close to (0, 0). The centroid is the average location of all points in PCA space (I take the mean value of the top 2 PCA components across my points). It mathematically reflects the "center of mass" of the data, unlike (0, 0), which doesn't consider where the data actually are. A sketch of this tiebreaker is below.
For now:
In the future:
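A hedged sketch of the centroid tiebreaker; `max_rows` and its columns follow the names in the diff above, while `X_pca` and the distance logic are illustrative:

```python
import numpy as np

# Centroid of the runs in PCA space: the mean of PC1 and PC2 across all points
centroid = np.mean(X_pca[:, :2], axis=0)

# Among the tied KDE maxima, keep the one nearest the centroid
peaks = max_rows[["X_coordinate", "Y_coordinate"]].to_numpy()
distances = np.linalg.norm(peaks - centroid, axis=1)
max_row = max_rows.iloc[int(np.argmin(distances))]
```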
How to deal with multiple representative pathway tiebreakers: the plan is to pass in summary.txt to help find the pathways that are smallest. Smallest in this case means the smallest number of edges and nodes. If those are all equal, then break the tie by name to keep it deterministic (see the sketch below).
In the future, we can choose other statistics to use for tiebreaking.
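A hypothetical sketch of that plan; the summary.txt column names ('Name', 'Nodes', 'Edges') and the variable `tied_pathway_names` are assumptions, not the actual schema:

```python
import pandas as pd

summary = pd.read_csv('summary.txt', sep='\t')
tied = summary[summary['Name'].isin(tied_pathway_names)]

# Fewest edges first, then fewest nodes, then name for a deterministic choice
chosen = tied.sort_values(by=['Edges', 'Nodes', 'Name']).iloc[0]['Name']
```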
Due to different machines rounding differently, the test case for the PCA-chosen pathway kept failing. I tried rounding everything for PCA (the KDE values, PCA coordinates, PCA variance, and all the tiebreaker distances) to 8 decimal places, and this seemed to fix the issue.
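A minimal sketch of that rounding, applied before any values are written or compared; the variable names echo the earlier snippets:

```python
import numpy as np

X_pca = np.round(X_pca, 8)                                      # PCA coordinates
variance = np.round(pca_instance.explained_variance_ratio_, 8)  # PCA variance
distances = np.round(distances, 8)                              # tiebreaker distances
```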
```yaml
# hyperparameter to control the smoothness of the estimated kernel density
# single float, 'scott', or 'silverman'
bandwidth: 1.0
# hyperparameter kernel to use for kernel_density
# 'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'
kernel: 'gaussian'
```
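A hedged sketch of how these config values could be passed through to scikit-learn's KernelDensity, which supports the kernel options listed above; the grid evaluation is illustrative, not the actual rule:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

kde = KernelDensity(bandwidth=1.0, kernel='gaussian')  # values from the config above
kde.fit(X_pca[:, :2])                                  # fit on the PC1/PC2 coordinates

# Evaluate the density on a grid and take the peak
xs = np.linspace(X_pca[:, 0].min(), X_pca[:, 0].max(), 100)
ys = np.linspace(X_pca[:, 1].min(), X_pca[:, 1].max(), 100)
grid = np.array([[x, y] for x in xs for y in ys])
log_density = kde.score_samples(grid)
peak = grid[np.argmax(log_density)]
```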
I noted this in a comment elsewhere already. If we don't save the PCA plots, I'm not sure how a user will know how to tune these, which may imply we don't need to expose them.
There are new merge conflicts to resolve.
TODOs: