Param tuning code integration: pca chosen #209
base: main
Conversation
spras/analysis/ml.py
Outdated
```
@@ -142,8 +142,14 @@ def pca(dataframe: pd.DataFrame, output_png: str, output_var: str, output_coord:
    if not isinstance(labels, bool):
        raise ValueError(f"labels={labels} must be True or False")

    scaler = StandardScaler()
    # TODO: MinMaxScaler changes nothing about the data
    # scaler = MinMaxScaler()
```
How to do PCA on Binary Data?
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html (the data is not Gaussian-distributed, so it does not make sense to use StandardScaler)
https://stackoverflow.com/questions/40795141/pca-for-categorical-features
https://stats.stackexchange.com/questions/159705/would-pca-work-for-boolean-binary-data-types
Based on several different forum discussions, people suggest not using PCA on binary data.
It might be best to keep these features as one-hot encoded values (which they already are).
StandardScaler (with_std=True) is the wrong choice for this use case.
- The data is binary (0/1) and not Gaussian-distributed, so standardizing based on mean and variance doesn't make sense here.
- Applying standard scaling can distort sparse binary features: columns with mostly 0s and very few 1s (e.g., 99% zeros) will have a very small standard deviation.
- When divided by this small std, the rare 1s become disproportionately large, which causes PCA to overemphasize those columns.
- Additionally, standardization transforms values outside the binary range [0, 1], making interpretation of the PCA results less meaningful. (A quick numeric sketch follows this list.)
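A minimal sketch of the sparse-column effect, using a made-up 10x2 binary runs-by-edges matrix; the data and variable names are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy runs-by-edges matrix: edge 0 is rare (1 of 10 runs), edge 1 is common (9 of 10)
X = np.zeros((10, 2))
X[0, 0] = 1
X[1:, 1] = 1

print(X.std(axis=0))  # both columns have std 0.3, small because they are skewed

scaled = StandardScaler().fit_transform(X)
print(scaled[0, 0])   # the lone 1 in the rare column is inflated to 3.0
```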
Two options seem better (a sketch of the second follows this list):
- Pass the binary matrix as is:
  - Each column is a 0/1 vector representing whether each output includes an edge.
  - No preprocessing is done (no centering or standardization).
  - Edges that are frequently included across runs (common edges) will contribute more to the total variance.
  - As a result, PCA will favor these high-frequency edges, and the principal components will be aligned with global edge popularity.
  - Runs that include many common edges will appear more similar, even if they don't share rare or distinctive edges.
  - This leads to PCA grouping runs based on how many total edges they selected, rather than on which specific edges they selected.
  - The result is that PCA reflects output volume more than meaningful edge inclusion patterns.
- StandardScaler with with_std=False:
  - I want PCA to answer: which algorithm runs have similar edge inclusion patterns?
  - This transforms each column (edge) by subtracting its mean inclusion rate across all runs, centering values around 0.
  - This removes the effect of certain edges being globally common or rare; for example, edges that are always selected or never selected won't drive similarity.
  - Runs are "similar" if they include or avoid the same edges more than expected, compared to other runs.
  - This helps PCA focus on meaningful differences in edge preference behavior across runs, not just raw counts.
  - Without centering, common edges dominate the total variance simply because they are included frequently.
  - PCA is variance-driven, so it will align the first components with these common edges, even though they contribute no meaningful information.
  - Meanwhile, rare but distinctive edges are underweighted.
  - Centering fixes this by neutralizing the influence of globally common edges and allowing rare or selectively included edges to define the principal components.
  - Example:
    - Suppose edge E1 is included in 90% of runs. If two runs both include E1, that's not informative.
    - But if they both include rare edge E5 (included in only 10% of runs), they'll share a strong positive deviation, and PCA will pick up on this.
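A minimal sketch of option 2, assuming `dataframe` is the runs-by-edges binary matrix from the function in the diff above; note that scikit-learn's PCA also centers its input internally, so the explicit scaler mainly makes the centering choice visible in the code:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Center each edge column on its mean inclusion rate, but do not rescale it
scaler = StandardScaler(with_std=False)
X_centered = scaler.fit_transform(dataframe)

# Project the centered runs onto the top two principal components
pca_instance = PCA(n_components=2)
X_pca = pca_instance.fit_transform(X_centered)
```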
The background section of https://arxiv.org/abs/1510.06112 gives some good explanation and earlier work. I'm doubtful sklearn supports any of those options directly. A previous graduate student worked on general matrix factorization with multiple types of data and wrote custom code to support different data types, e.g. binary.
If we don't find correct PCA implementations, we could switch to a different visualization, like multidimensional scaling on the Jaccard similarities (sketched below). Or use option 2 above and document in the code and manuscript why we chose it even though it is not what we really want.
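A hedged sketch of the multidimensional scaling fallback, assuming `dataframe` is the runs-by-edges 0/1 matrix; this is not existing SPRAS code:

```python
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Pairwise Jaccard dissimilarities between runs, as a square matrix
jaccard_dist = squareform(pdist(dataframe.to_numpy().astype(bool), metric='jaccard'))

# Embed the runs in 2D so that distances approximate the Jaccard dissimilarities
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(jaccard_dist)
```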
We decided to use StandardScaler with with_std=False for this pull request and make a new issue to discuss switching to a better alternative.
@agitter Do this PR last. It will need to merge with the updated master after #193, #207, and #208 are merged. (Hopefully this will remove the repeated files throughout the PRs.)
Included in this PR:
Fix in this PR: #231
It is confusing that the chosen pathway is the empty one. I was expecting test/evaluate/input/data-test-params-789/pathway.txt to be the chosen pathway, but it's not the closest one to the centroid.
```
datapoint_labels        PC1                     PC2
data-test-params-123    -0.6564351782133608     -0.7071067811865477
data-test-params-456     0.9008137349989621      2.498001805406602e-16
data-test-params-789    -0.6564351782133608      0.7071067811865476
data-test-params-empty   0.41205662142775956    -1.942890293094024e-16
centroid                 0.0                    -2.0816681711721685e-17
```
Is this the best way to calculate the centroid then?
```python
# calculating the centroid
centroid = np.mean(X_pca, axis=0)  # mean of each principal component across all samples
```
Why did you expect a different pathway to be selected as closest to the centroid?
Do we want to eliminate empty pathways from the closest-to-centroid selection if there exist any non-empty alternatives?
spras/evaluation.py
Outdated
```python
# Distance from each run's PCA coordinates to the centroid, over all PC columns
pc_columns = [col for col in coord_df.columns if col.startswith('PC')]
coord_df['Distance To Centroid'] = np.sqrt(sum((coord_df[pc] - centroid[i]) ** 2 for i, pc in enumerate(pc_columns)))
closest_to_centroid = coord_df.sort_values(by='Distance To Centroid').iloc[0]
```
How do we deal with a tiebreaker?
If tied, add all of the ties to the rep_pathways list?
Can we easily access the pathway sizes? I would use smallest size as the tiebreaker.
I think we would need to read it from summary.txt, or read the graphs into nx directly (a sketch of the nx approach is below).
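A hypothetical sketch of the nx approach; the tab-separated edge list format and the variable `tied_pathways` are assumptions, not the actual SPRAS pathway file schema:

```python
import networkx as nx

def pathway_size(pathway_file: str) -> tuple[int, int]:
    # Assumes each line starts with "node1<TAB>node2"; extra columns are ignored
    graph = nx.read_edgelist(pathway_file, delimiter='\t', data=False)
    return graph.number_of_edges(), graph.number_of_nodes()

# Smallest pathway wins; the file name breaks any remaining ties deterministically
chosen = min(tied_pathways, key=lambda f: (*pathway_size(f), f))
```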
```
@@ -95,6 +96,9 @@ def make_final_input(wildcards):
    final_input.extend(expand('{out_dir}{sep}{dataset}-ml{sep}jaccard-matrix.txt',out_dir=out_dir,sep=SEP,dataset=dataset_labels,algorithm_params=algorithms_with_params))
    final_input.extend(expand('{out_dir}{sep}{dataset}-ml{sep}jaccard-heatmap.png',out_dir=out_dir,sep=SEP,dataset=dataset_labels,algorithm_params=algorithms_with_params))

    if generate_kde:
```
I don't think we need to save the KDE plot, but I left temporary code in case we want to.
Per our meeting discussion, the plot is nice to have, but if it introduces a great deal of additional complexity in terms of new code or the Snakemake rule, we can skip it.
Edit: it only makes sense to expose the KDE parameters bandwidth and kernel if we save the plots. Otherwise a user has no way to see the effects of these parameters.
Which KDE metrics are important to include / give the user access to?
Yes:
No:
This idea will only choose a pathway from a pathway reconstruction algorithm if it has multiple parameter combinations. AllPairs and algorithms with one chosen parameter combination will never be given a PCA-chosen pathway. Should we add the one pathway they produce to the list of representative pathways?
Locally this is the pathway I expect, but I think on GitHub Actions the expected one is the 789 pathway.
The visualization that sns.kdeplot produces looks better than the one I am using right now. (TODO: add images to show.)
spras/analysis/ml.py
Outdated
```python
# TODO: TEMP; still choosing the one maximum for now
max_row = max_rows.iloc[0]
kde_row = ['kde_peak', max_row["X_coordinate"], max_row["Y_coordinate"]]
```
Closest one to (0, 0) or some other value?
How to deal with KDE max tiebreakers: when deciding which maximum KDE point to highlight in a 2D PCA plot, I think it makes much more sense to use the point closest to the centroid instead of the origin (0, 0). The centroid actually reflects where the data points are concentrated in PCA space. On the other hand, (0, 0) is just a fixed mathematical point that doesn't tell me anything about how the data are distributed; especially after PCA, the points often aren't centered at the origin because of how the variance is spread across components. By using the centroid, I'm making sure that the selected KDE peak represents the overall structure and central tendency of the data, rather than just being arbitrarily close to (0, 0). The centroid is the average location of all points in PCA space (I take the mean value of the top 2 PCA components across my points). It mathematically reflects the "center of mass" of the data, unlike (0, 0), which doesn't consider where the data actually are. A sketch of this tiebreaker is below.
For now:
In the future:
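A hedged sketch of the centroid tiebreaker; `max_rows` and its columns follow the names in the diff above, while `X_pca` and the distance logic are illustrative:

```python
import numpy as np

# Centroid of the runs in PCA space: the mean of PC1 and PC2 across all points
centroid = np.mean(X_pca[:, :2], axis=0)

# Among the tied KDE maxima, keep the one nearest the centroid
peaks = max_rows[["X_coordinate", "Y_coordinate"]].to_numpy()
distances = np.linalg.norm(peaks - centroid, axis=1)
max_row = max_rows.iloc[int(np.argmin(distances))]
```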
How to deal with multiple representative pathway tiebreakers: the plan is to pass in summary.txt to help find the pathways that are smallest. Smallest in this case means the smallest number of edges and nodes. If those are all equal, then break the tie by name to keep it deterministic (see the sketch below).
In the future, we can choose other statistics to use for tiebreaking.
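A hypothetical sketch of that plan; the summary.txt column names ('Name', 'Nodes', 'Edges') and the variable `tied_pathway_names` are assumptions, not the actual schema:

```python
import pandas as pd

summary = pd.read_csv('summary.txt', sep='\t')
tied = summary[summary['Name'].isin(tied_pathway_names)]

# Fewest edges first, then fewest nodes, then name for a deterministic choice
chosen = tied.sort_values(by=['Edges', 'Nodes', 'Name']).iloc[0]['Name']
```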
Due to different machines rounding differently, the test case for the PCA-chosen pathway kept failing. I tried rounding everything for PCA (the KDE values, PCA coordinates, PCA variance, and all the tiebreaker distances) to 8 decimal places, and this seemed to fix the issue.
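A minimal sketch of that rounding, applied before any values are written or compared; the variable names echo the earlier snippets:

```python
import numpy as np

X_pca = np.round(X_pca, 8)                                      # PCA coordinates
variance = np.round(pca_instance.explained_variance_ratio_, 8)  # PCA variance
distances = np.round(distances, 8)                              # tiebreaker distances
```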
```yaml
# hyperparameter to control the smoothness of the estimated kernel density
# single float, 'scott', or 'silverman'
bandwidth: 1.0
# hyperparameter kernel to use for kernel_density
# 'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'
kernel: 'gaussian'
```
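A hedged sketch of how these config values could be passed through to scikit-learn's KernelDensity, which supports the kernel options listed above; the grid evaluation is illustrative, not the actual rule:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

kde = KernelDensity(bandwidth=1.0, kernel='gaussian')  # values from the config above
kde.fit(X_pca[:, :2])                                  # fit on the PC1/PC2 coordinates

# Evaluate the density on a grid and take the peak
xs = np.linspace(X_pca[:, 0].min(), X_pca[:, 0].max(), 100)
ys = np.linspace(X_pca[:, 1].min(), X_pca[:, 1].max(), 100)
grid = np.array([[x, y] for x in xs for y in ys])
log_density = kde.score_samples(grid)
peak = grid[np.argmax(log_density)]
```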
I noted this in a comment elsewhere already. If we don't save the PCA plots, I'm not sure how a user will know how to tune these, which may imply we don't need to expose them.
There are new merge conflicts to resolve.
TODOs: