
Param tuning code integration: pca chosen #209


Open
wants to merge 48 commits into base: main

Conversation

ntalluri (Collaborator) commented Mar 3, 2025:

TODOs:

  1. Test case config
  2. Test case Eval
  3. Test case ML
  4. remove kde file (last thing)
  5. How to handle visualization when no PCA-chosen pathway exists: the Snakefile uses the algorithms_mult_param_combos idea; how do we deal with an empty []?

@@ -142,8 +142,14 @@ def pca(dataframe: pd.DataFrame, output_png: str, output_var: str, output_coord:
if not isinstance(labels, bool):
raise ValueError(f"labels={labels} must be True or False")

scaler = StandardScaler()
#TODO: MinMaxScaler changes nothing about the data
# scaler = MinMaxScaler()
ntalluri (Collaborator Author) commented Mar 10, 2025:

How to do PCA on binary data?



ntalluri (Collaborator Author):

It might be best to keep these features as one-hot encoded values (which they already are).

ntalluri (Collaborator Author) commented Jun 4, 2025:

StandardScaler (with_std=True) is the wrong choice for this use case.

  • The data is binary (0/1) and not Gaussian-distributed — standardizing based on mean and variance doesn't make sense here.
  • Applying standard scaling can distort sparse binary features: columns with mostly 0s and very few 1s (e.g., 99% zeros) will have a very small standard deviation.
  • When divided by this small std, rare 1s become disproportionately large, which causes PCA to overemphasize those columns.
  • Additionally, standardization transforms values outside the binary range [0, 1], making interpretation of PCA results less meaningful.

Two options seem like better alternatives:

  1. Pass the binary matrix as is:
  • Each column is a 0/1 vector indicating whether an output includes an edge.
  • No preprocessing will be done (like centering or standardization)
  • Edges that are frequently included across runs (common edges) will contribute more to the total variance.
  • As a result, PCA will favor these high-frequency edges, and principal components will be aligned with global edge popularity.
  • Runs that include many common edges will appear more similar, even if they don't share rare or distinctive edges.
  • This leads to PCA grouping runs based on how many total edges they selected, rather than on which specific edges they selected.
  • The result is that PCA reflects output volume more than meaningful edge inclusion patterns.
  2. StandardScaler with with_std=False
  • I want PCA to answer: Which algorithm runs have similar edge inclusion patterns?
  • This transforms each column (edge) by subtracting its mean inclusion rate across all runs, centering values around 0.
  • This removes the effect of certain edges being globally common or rare, for example, edges that are always selected or never selected won't drive similarity.
  • Runs are "similar" if they include or avoid the same edges more than expected, compared to other runs.
  • This helps PCA focus on meaningful differences in edge preference behavior across runs, not just raw counts.
  • Without centering, common edges dominate the total variance simply because they are included frequently.
    • PCA is variance-driven, so it will align the first components with these common edges, even though they contribute no meaningful information.
    • Meanwhile, rare but distinctive edges are underweighted.
  • Centering fixes this by neutralizing the influence of globally common edges and allowing rare or selectively included edges to shape the principal components.
  • Example:
    • Suppose edge E1 is included in 90% of runs. If two runs both include E1, that's not informative.
    • But if they both include rare edge E5 (included in only 10% of runs), they’ll share a strong positive deviation, and PCA will pick up on this.
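
A minimal sketch of option 2 with an illustrative binary matrix (the variable names and data here are assumptions, not from the actual SPRAS code):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# rows = algorithm runs, columns = edges (1 = the run includes that edge)
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
], dtype=float)

# Center each edge column on its mean inclusion rate; do NOT divide by the
# standard deviation, so rare edges are not inflated by their tiny stds.
scaler = StandardScaler(with_std=False)
X_centered = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_centered)

# The always-included first edge has zero variance after centering, so it
# cannot drive the principal components.
```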

Collaborator:

The background section of https://arxiv.org/abs/1510.06112 gives some good explanation and earlier work. I'm doubtful sklearn supports any of those options directly. A previous graduate student worked on general matrix factorization with multiple types of data and wrote custom code to support different data types, e.g. binary.

If we don't find correct PCA implementations, we could switch to a different visualization like multidimensional scaling on the Jaccard similarities. Or use option 2 above and document in the code and manuscript why we chose that even though it is not what we really want.
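
For reference, a hedged sketch of the suggested fallback, multidimensional scaling on Jaccard dissimilarities (the data and names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# rows = algorithm runs, columns = edges (True = edge included)
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
], dtype=bool)

# pdist's 'jaccard' metric returns dissimilarities (1 - Jaccard similarity)
D = squareform(pdist(X, metric='jaccard'))

# metric MDS on the precomputed dissimilarity matrix
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(D)
```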

Collaborator:

We decided to use StandardScaler with with_std=False for this pull request and make a new issue to discuss switching to a better alternative.

ntalluri (Collaborator Author):

@agitter Do this PR Last

Will need to merge with updated master after #193, #207, and #208 are merged. (Hopefully this will remove the repeated files throughout the PRs.)

  • there will be merge conflicts with the Snakefile, evaluation.py and the test suite.

Included in this PR:

  1. Update to evaluation.py to include precision_and_recall and the PCA-chosen pathway (precision_and_recall is also used in Param tuning code integration: no param tuning #208)
  2. A new test suite (evaluate) for the PCA-chosen pathway
  3. Updates to the Snakemake file to run evaluation per dataset and per algorithm-dataset pair for the PCA-chosen pathway
  4. Update to the ML code for PCA to add centroids
  5. Still need to figure out how to rescale the binary data
  6. Update to the ML test suite for expected PCA coordinates and the ML test code

ntalluri (Collaborator Author):

fix in this PR: #231

@tristan-f-r tristan-f-r added the tuning Workflow-spanning algorithm tuning label May 30, 2025
ntalluri (Collaborator Author):

This is confusing: the chosen pathway is the empty one. I was expecting test/evaluate/input/data-test-params-789/pathway.txt to be the chosen pathway, but it's not the closest one to the centroid.

datapoint_labels	PC1	PC2
data-test-params-123	-0.6564351782133608	-0.7071067811865477
data-test-params-456	0.9008137349989621	2.498001805406602e-16
data-test-params-789	-0.6564351782133608	0.7071067811865476
data-test-params-empty	0.41205662142775956	-1.942890293094024e-16
centroid	0.0	-2.0816681711721685e-17

ntalluri (Collaborator Author) commented Jun 4, 2025:

Is this the best way to calculate the centroid then?

# calculating the centroid
centroid = np.mean(X_pca, axis=0)  # mean of each principal component across all samples

Collaborator:

Why did you expect a different pathway to be selected as closest to the centroid?

Do we want to eliminate empty pathways from the closest-to-centroid selection if there exist any non-empty alternatives?


pc_columns = [col for col in coord_df.columns if col.startswith('PC')]
coord_df['Distance To Centroid'] = np.sqrt(sum((coord_df[pc] - centroid[i]) ** 2 for i, pc in enumerate(pc_columns)))
closest_to_centroid = coord_df.sort_values(by='Distance To Centroid').iloc[0]
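
A runnable version of the snippet above with made-up coordinates (the run labels and values are illustrative):

```python
import numpy as np
import pandas as pd

coord_df = pd.DataFrame({
    'datapoint_labels': ['run-123', 'run-456', 'run-789'],
    'PC1': [-0.65, 0.90, -0.25],
    'PC2': [-0.71, 0.00, 0.71],
})
# centroid = mean of each principal component across all samples
centroid = coord_df[['PC1', 'PC2']].mean().to_numpy()

pc_columns = [col for col in coord_df.columns if col.startswith('PC')]
coord_df['Distance To Centroid'] = np.sqrt(
    sum((coord_df[pc] - centroid[i]) ** 2 for i, pc in enumerate(pc_columns)))
closest_to_centroid = coord_df.sort_values(by='Distance To Centroid').iloc[0]
```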
ntalluri (Collaborator Author):

How to deal with a tiebreaker?

ntalluri (Collaborator Author) commented Jun 4, 2025:

If tied, add all ties to the rep_pathways list?

Collaborator:

Can we easily access the pathway sizes? I would use smallest size as the tiebreaker.

ntalluri (Collaborator Author):

I think we would need to read it from summary.txt. Or read in the graphs into nx directly.

@@ -95,6 +96,9 @@ def make_final_input(wildcards):
final_input.extend(expand('{out_dir}{sep}{dataset}-ml{sep}jaccard-matrix.txt',out_dir=out_dir,sep=SEP,dataset=dataset_labels,algorithm_params=algorithms_with_params))
final_input.extend(expand('{out_dir}{sep}{dataset}-ml{sep}jaccard-heatmap.png',out_dir=out_dir,sep=SEP,dataset=dataset_labels,algorithm_params=algorithms_with_params))

if generate_kde:
ntalluri (Collaborator Author):

I don't think we need to save the KDE plot, but I left temporary code in case we want to.

agitter (Collaborator) commented Jul 11, 2025:

Per our meeting discussion, the plot is nice to have but if it introduces a great deal of additional complexity in terms of new code or the Snakemake rule, we can skip it.

Edit: it only makes sense to expose the KDE parameters bandwidth and kernel if we save the plots. Otherwise a user has no way to see the effects of these parameters.

ntalluri (Collaborator Author) commented Jun 30, 2025:

Which KDE metrics are important to include/give the user access to:

Yes:

  • bandwidth
  • kernel
    Bandwidth controls smoothing scale for each bump, kernel defines bump shape; together they determine density estimate sharpness and shape.

No:

  • metric; should always be euclidean because the normalization of the density output is correct only for euclidean
  • algorithm; this and the rest below are used for speed and have no effect on results. I am planning on keeping the defaults provided.
  • atol, rtol
  • breadth_first
  • leaf_size
  • metric_params
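
As a sketch, exposing only those two hyperparameters to sklearn's KernelDensity might look like this (the PCA coordinates here are synthetic placeholders):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_pca = rng.normal(size=(50, 2))  # stand-in for real PCA coordinates

# only bandwidth and kernel are user-configurable; metric stays at its
# default 'euclidean', and the speed-related parameters keep their defaults
kde = KernelDensity(bandwidth=1.0, kernel='gaussian')
kde.fit(X_pca)

# score_samples returns log-density; exponentiate for the density itself
density = np.exp(kde.score_samples(X_pca))
```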

ntalluri (Collaborator Author) commented Jun 30, 2025:

This idea will only choose a pathway from a pathway reconstruction algorithm if it has multiple parameter combinations. Allpairs and algorithms with one chosen parameter combination will never be given a PCA-chosen pathway. Should we add the one pathway they produce to the list of representative pathways?

ntalluri (Collaborator Author):

Locally this is my expected pathway, but I think on GitHub Actions the expected one is the 789 pathway.

ntalluri (Collaborator Author) commented Jul 3, 2025:

The visualization that sns.kde produces looks better than the one I am using right now. (TODO: add images to show)


# TODO: TEMP; still choosing the one maximum for now
max_row = max_rows.iloc[0]
kde_row = ['kde_peak', max_row["X_coordinate"], max_row["Y_coordinate"]]
ntalluri (Collaborator Author):

Closest one to 0.0, or some other value?

ntalluri (Collaborator Author) commented Jul 10, 2025:

The functions contourf and tricontourf in Matplotlib are both used to create smooth contour plots, but they differ in how they handle the underlying data points.

contourf requires data evaluated on a regular grid (mesh), making it ideal when we have uniformly spaced evaluation points such as those generated using np.meshgrid.

In contrast, tricontourf is specifically designed for scattered (irregular) points and internally performs a Delaunay triangulation to define contour regions.

Because of this, tricontourf requires at least three unique, non-colinear points to build a valid triangulation. If the points are too similar, collapsed, or aligned along a single line, the triangulation fails, and the method cannot produce a contour plot.

contourf can always visualize the KDE, even when there are fewer than three points or when points are colinear, by simply rendering a flat or nearly empty KDE on top of the PCA plot.
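
A hedged sketch of the difference, with a toy Gaussian density standing in for the real KDE:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

# contourf wants data on a regular grid, e.g. from np.meshgrid
xs, ys = np.meshgrid(np.linspace(-2, 2, 50), np.linspace(-2, 2, 50))
zs = np.exp(-(xs ** 2 + ys ** 2))  # toy "density"

fig, ax = plt.subplots()
cs = ax.contourf(xs, ys, zs, levels=10, cmap='viridis')  # always succeeds

# tricontourf triangulates scattered points instead; with fewer than three
# unique non-colinear points the Delaunay triangulation fails, so guard it
pts_x, pts_y = np.array([0.0, 1.0]), np.array([0.0, 1.0])  # only 2 points
try:
    ax.tricontourf(pts_x, pts_y, np.array([0.5, 0.5]))
    triangulation_ok = True
except (ValueError, RuntimeError):
    triangulation_ok = False
plt.close(fig)
```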

ntalluri (Collaborator Author) commented Jul 10, 2025:

How to deal with KDE Max tiebreakers:

When deciding which maximum KDE point to highlight in a 2D PCA plot, I think it makes much more sense to use the point closest to the centroid instead of the origin (0,0). The centroid actually reflects where the data points are concentrated in PCA space. On the other hand, (0,0) is just a fixed mathematical point that doesn't really tell me anything about how my data are distributed. Especially after PCA, the points often aren't centered at the origin because of how the variance is spread across components. By using the centroid, I'm making sure that the selected KDE peak represents the overall structure and central tendency of my data, rather than just being arbitrarily close to (0,0).

The centroid is just the average location of all points in PCA space (I take the mean of the top 2 PCA components across my points). It mathematically reflects the "center of mass" of my data, unlike (0,0), which is just a fixed point that doesn't consider where the data actually are.

For now:
I'm going to pick the coordinate (associated with the max density) closest to (0,0) as the KDE peak to use. If all the coordinates are equidistant from (0,0), I will pick the one with the smallest index.

In the future:
One idea I have is picking the coordinate (associated with the max density) closest to the centroid as the KDE peak to use. If all the coordinates are equidistant from the centroid, I will pick the centroid itself as the chosen coordinate. I will need to work through different test cases.
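
The current rule can be sketched as follows (the coordinates and densities are illustrative):

```python
import numpy as np

coords = np.array([[1.0, 1.0], [-0.5, 0.0], [0.5, 0.0]])
density = np.array([0.9, 0.9, 0.9])  # three-way tie for the maximum

# indices of all coordinates tied for the maximum density
candidates = np.flatnonzero(density == density.max())

# distance to the origin (0, 0) for each tied candidate
dists = np.linalg.norm(coords[candidates], axis=1)

# argmin returns the first (smallest-index) minimum, which handles the
# equal-distance tiebreak deterministically
kde_peak = coords[candidates[np.argmin(dists)]]
```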

ntalluri (Collaborator Author):

How to deal with multiple representative pathways tiebreakers:

The plan is to use summary.txt to help find the smallest pathways. Smallest in this case means the smallest number of edges and nodes. If they are all equal, then break the tie by name to keep it deterministic.

  1. choose smallest number of edges and smallest number of nodes
  2. end all be all, choose the first one based on name

In the future, we can choose other stats to use for tiebreaking.
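
The two-step tiebreak can be sketched with made-up summary stats; Python's tuple ordering gives the deterministic name fallback:

```python
# hypothetical stats, as they might be read from summary.txt
tied_pathways = [
    {'name': 'data-test-params-456', 'edges': 10, 'nodes': 8},
    {'name': 'data-test-params-123', 'edges': 9, 'nodes': 8},
    {'name': 'data-test-params-789', 'edges': 9, 'nodes': 8},
]

# tuples compare lexicographically: fewest edges, then fewest nodes, then name
representative = min(tied_pathways,
                     key=lambda p: (p['edges'], p['nodes'], p['name']))
```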

ntalluri (Collaborator Author) commented Jul 11, 2025:

Due to different machines rounding differently, the test case for the PCA-chosen pathway kept failing. Rounding everything for PCA (the KDE values, PCA coordinates, PCA variance, and all the tiebreaker distances) to 8 decimal places seemed to fix the issue.
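
A tiny illustration of why rounding to 8 decimal places hides platform-level floating-point noise:

```python
import numpy as np

a = 0.1 + 0.2  # 0.30000000000000004 on IEEE-754 doubles
b = 0.3
assert a != b  # raw comparison fails

# rounding both sides to 8 decimals before comparison removes the noise
coords_a = np.round(np.array([a]), 8)
coords_b = np.round(np.array([b]), 8)
```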

Comment on lines +177 to +182
# hyperparameter to control the smoothness of estimated kernel density
# single float, 'scott', 'silverman'
bandwidth: 1.0
# hyperparameter kernel to use for kernel_density
# 'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'
kernel: 'gaussian'
Collaborator:

I noted this in a comment elsewhere already. If we don't save the PCA plots, I'm not sure how a user will know how to tune these, which may imply we don't need to expose them.

agitter (Collaborator) commented Jul 11, 2025:

There are new merge conflicts to resolve

Labels: needed for benchmarking (Priority PRs needed for the benchmarking paper), tuning (Workflow-spanning algorithm tuning)
3 participants