Param tuning: ensembling (version 2 but all the same code as version 1) #212


Merged: 32 commits merged into Reed-CompBio:main on Jul 11, 2025

Conversation

@ntalluri (Collaborator) commented Mar 24, 2025

@agitter my param-tuning-ensembling branch #207 was out of sync with the changes I had locally, so I needed to redo the branch to bring it up to date.

@agitter Review this PR second (then follow up with pull requests #208 and #209 after this one is merged).

Will need to merge with updated master after #193 is merged. (Hopefully this will remove the repeated files throughout the PRs.)
Included in this PR:

• an update to evaluation.py that computes node ensemble frequencies and then creates a node PR curve (see the sketch after this list)
• a new evaluate test suite covering only the ensembling idea
• updates to the Snakemake file that run evaluation per dataset and per algorithm-dataset pair
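For orientation, here is a minimal sketch of the node-ensembling idea (hypothetical names such as `node_ensemble_frequencies` and `node_pr_curve`; this is not the actual evaluation.py API): each algorithm-dataset pair yields several pathway outputs, one per parameter combination, a node's frequency is the fraction of those outputs containing it, and the frequencies are scored against gold standard membership to build the PR curve.

```python
# Hypothetical sketch, not the evaluation.py implementation.
from sklearn.metrics import average_precision_score, precision_recall_curve


def node_ensemble_frequencies(pathways: list[set[str]]) -> dict[str, float]:
    """Fraction of pathway outputs (one per parameter combination) containing each node."""
    counts: dict[str, int] = {}
    for nodes in pathways:
        for node in nodes:
            counts[node] = counts.get(node, 0) + 1
    return {node: count / len(pathways) for node, count in counts.items()}


def node_pr_curve(frequencies: dict[str, float], gold_standard_nodes: set[str]):
    """Score node frequencies against gold standard membership."""
    # Gold standard nodes absent from every output get frequency 0,
    # so they are counted as false negatives in the recall.
    scores = dict(frequencies)
    for node in gold_standard_nodes - scores.keys():
        scores[node] = 0.0
    nodes = sorted(scores)
    y_true = [int(node in gold_standard_nodes) for node in nodes]
    y_score = [scores[node] for node in nodes]
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    return precision, recall, thresholds, average_precision_score(y_true, y_score)
```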

@ntalluri (Collaborator, Author)

Reminder: There was one unresolved comment that we can continue discussing here #193 (comment).

@ntalluri (Collaborator, Author)

Issue #232 can be updated in this PR.

@ntalluri ntalluri requested a review from agitter May 29, 2025 20:07
@tristan-f-r tristan-f-r added the tuning (Workflow-spanning algorithm tuning) label May 30, 2025
@agitter (Collaborator) left a comment

I pushed formatting changes.

> Reminder: There was one unresolved comment that we can continue discussing here #193 (comment).

I opened #259 for this so we don't need to track it here.

Comment on lines 176 to 180
'Threshold': ["None"],
'Precision': ["None"],
'Recall': ["None"],
'Average_Precison': ["None"],
'Baseline_Precision': ["None"]
@agitter (Collaborator)

Is there a reason these are strings and not None? Can we set them to actual values? The precision and recall may default to 0; we can look for precedent. The AP may as well. The baseline precision we actually do have.

@ntalluri (Collaborator, Author)

I get a specific error: If using all scalar values, you must pass an index. To avoid this error, I need to add an index when I do pd.DataFrame(data). Another solution to get around that is to make everything a list.
https://stackoverflow.com/questions/17839973/constructing-dataframe-from-values-in-variables-yields-valueerror-if-using-all#:~:text=The%20error%20message,3%20%202%20%203

@ntalluri (Collaborator, Author)

Another option is pd.DataFrame.from_dict(dictionary, orient="index").

@ntalluri (Collaborator, Author)

> I get a specific error: If using all scalar values, you must pass an index. To avoid this error, I need to add an index when I do pd.DataFrame(data). Another solution to get around that is to make everything a list. https://stackoverflow.com/questions/17839973/constructing-dataframe-from-values-in-variables-yields-valueerror-if-using-all#:~:text=The%20error%20message,3%20%202%20%203

I'm planning on using the list method
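For reference, a minimal reproduction of the error and the workarounds discussed in this thread (illustrative values only, not the actual evaluation.py code):

```python
import pandas as pd

data = {'Threshold': "None", 'Precision': "None"}  # all scalar values

# pd.DataFrame(data) raises:
#   ValueError: If using all scalar values, you must pass an index

# Workaround 1: pass an index explicitly
df_with_index = pd.DataFrame(data, index=[0])

# Workaround 2 (the list method chosen here): wrap each value in a list
df_from_lists = pd.DataFrame({key: [value] for key, value in data.items()})

# Workaround 3: build the frame from a dict keyed by row label
df_from_dict = pd.DataFrame.from_dict(data, orient="index")
```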

@ntalluri (Collaborator, Author) commented Jun 11, 2025

Thinking about the conversation from our meeting:

The way I have been thinking about evaluation, I have been asking "Given the nodes the algorithm selected in its subnetworks, how well does the algorithm recover the gold standard ones?". This assesses the quality of what the algorithm selected.

I think if I was asking, "Did the algorithm select the correct nodes (the gold standard nodes) from the entire network?", then including all of the nodes from the full interactome with a frequency of 0 makes sense. I would be seeing if the algorithms were able to distinguish between the relevant gold standard nodes from the entire "universe" of possible nodes within all of their outputs.

  • I think this is what was done in Pathlinker as well

The right approach depends on which question we are trying to answer.

Adding only the missing gold standard nodes with a frequency of 0 ensures accurate recall calculation by capturing the correct number of false negatives (the gold standard nodes that should have been recovered but were not).

By adding the entire network's nodes with frequency 0, we would be penalizing an algorithm for not predicting the whole network, and this might also penalize methods that return sparser, smaller networks.
However, this seems to be the correct way to look at the ensembles: the gold standard nodes are the positive samples and the unpredicted part of the network provides the negative samples. This allows us to evaluate how well the algorithm prioritizes relevant nodes (the gold standard) over all possible alternatives.

add only the gold standard
Gold standard nodes: {"A", "B", "C"}

| Node | Is Gold Standard (y_true) | Frequency (y_score) |
|------|---------------------------|---------------------|
| A    | 1                         | 0.9                 |
| D    | 0                         | 0.8                 |
| B    | 1                         | 0.0                 |
| C    | 1                         | 0.0                 |
  • True Positives (TP): A
  • False Positives (FP): D
  • False Negatives (FN): B, C
  • Precision = TP / (TP + FP) = 1 / (1 + 1) = 0.5
  • Recall = TP / (TP + FN) = 1 / (1 + 2) = 0.33

add the whole network

| Node | Is Gold Standard | Frequency |
|------|------------------|-----------|
| A    | 1                | 0.9       |
| D    | 0                | 0.8       |
| B    | 1                | 0.0       |
| C    | 1                | 0.0       |
| X    | 0                | 0.0       |
| ...  | ...              | ...       |
| X997 | 0                | 0.0       |
  • True Positives (TP): A
  • False Positives (FP): D, X -> X997
  • False Negatives (FN): B, C
  • Precision = TP / (TP + FP) = 1 / (1 + 998) = 0.001
  • Recall = TP / (TP + FN) = 1 / (1 + 2) = 0.33

How I have been defining TP, FP, FN, TN for nodes:

| Term | Meaning |
|------|---------|
| True Positive (TP) | A predicted node that is also in the gold standard pathway |
| False Positive (FP) | A predicted node that is not in the gold standard pathway |
| False Negative (FN) | A node in the gold standard pathway that was not predicted |
| True Negative (TN) | A node that was not predicted and is not in the gold standard pathway |
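To make the comparison above concrete, here is a hedged sketch that builds the two labelings and computes PR curves with scikit-learn (toy values from the example; variable names are illustrative, not the evaluation.py implementation):

```python
from sklearn.metrics import precision_recall_curve

gold_standard = {"A", "B", "C"}
ensemble_freq = {"A": 0.9, "D": 0.8}  # nodes that appeared in at least one output

# Option 1: add only the missing gold standard nodes with frequency 0
option1 = {**ensemble_freq, **{n: 0.0 for n in gold_standard - ensemble_freq.keys()}}

# Option 2: add every node in the interactome with frequency 0
interactome = gold_standard | ensemble_freq.keys() | {f"X{i}" for i in range(1, 998)}
option2 = {n: ensemble_freq.get(n, 0.0) for n in interactome}


def pr_curve(freqs: dict[str, float]):
    nodes = sorted(freqs)
    y_true = [int(n in gold_standard) for n in nodes]
    y_score = [freqs[n] for n in nodes]
    return precision_recall_curve(y_true, y_score)


# Option 2 adds ~1000 extra negatives with score 0, so precision at low
# thresholds collapses toward |gold standard| / |interactome|, while the
# recall at each threshold is unchanged (TP and FN counts do not change).
p1, r1, _ = pr_curve(option1)
p2, r2, _ = pr_curve(option2)
```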

@ntalluri (Collaborator, Author) commented Jun 11, 2025

add the whole network plot:
[Screenshot 2025-06-11 at 17 02 15]

The baseline precision and average precision get messed up because I added the whole network.

@agitter (Collaborator) commented Jun 13, 2025

"Did the algorithm select the correct nodes (the gold standard nodes) from the entire network?"

I also believe this is the version of the question we want to ask. It makes for the most straightforward and fair comparison of evaluation metrics across methods that predict variable size networks.

> The baseline precision and average precision get messed up because I added the whole network.

Are they wrong in the plot now? Or are they different and lower? It may be okay if they are low. Do we have any example where a pathway reconstruction algorithm actually does recover all of the gold standard nodes so we can confirm the PR curve and AP look as expected in that case?

@ntalluri (Collaborator, Author)

The baseline precision is now calculated the same way for every algorithm: including all gold standard nodes (with a frequency of 0 when missing) gives a baseline of (number of gold standard nodes) / (total nodes). There is no longer a baseline per algorithm, since this value is identical for every algorithm.
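As a tiny illustration of that constant baseline (hypothetical variable names): with the whole network as the node universe, the baseline precision is just the prevalence of gold standard nodes, so it is the same regardless of which algorithm produced the ensemble.

```python
# Baseline precision when the whole network is the node universe:
# the fraction of nodes that are gold standard, independent of the algorithm.
gold_standard_nodes = {"A", "B", "C"}
all_network_nodes = gold_standard_nodes | {"D"} | {f"X{i}" for i in range(1, 998)}

baseline_precision = len(gold_standard_nodes) / len(all_network_nodes)  # 3 / 1001
```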

Average precision (AP) tends to be low because adding the entire network's nodes and edges to the ensembles introduces a large number of negatives, which skews the score. In a smaller test case where all gold standard nodes are recovered, the AP remains higher than in the image above (run on the egfr dataset) since the total number of nodes is limited.

Example:

Node Ensemble (bolded are the gold standard nodes)

| Node | Frequency |
|------|-----------|
| A    | 0.5       |
| B    | 0.5       |
| C    | 0.75      |
| D    | 0.75      |
| E    | 0.9       |
| F    | 0.9       |
| L    | 0.5       |
| M    | 0.5       |
| N    | 0.25      |
| O    | 0.25      |
| P    | 0.25      |
| Q    | 0.25      |
| Z    | 0.01      |
| G    | 0.0       |
| H    | 0.0       |
| I    | 0.0       |
| J    | 0.0       |
| K    | 0.0       |
| R    | 0.0       |
[Screenshot 2025-06-16 at 16 22 48]

@agitter (Collaborator) left a comment

I wanted to test this with the EGFR config, but it isn't set up to use the gold standard. Should we add that to the config as part of this pull request to demonstrate the behavior on a real dataset? The toy datasets are too small to see how the PR curves work.

@ntalluri (Collaborator, Author)

> I wanted to test this with the EGFR config, but it isn't set up to use the gold standard. Should we add that to the config as part of this pull request to demonstrate the behavior on a real dataset? The toy datasets are too small to see how the PR curves work.

I agree; I was testing the egfr dataset locally. I will add the evaluation dataset and more of the algorithms/parameter settings, though not all of the parameter settings from the egfr param tuning config file.

@ntalluri ntalluri added the needed for benchmarking (Priority PRs needed for the benchmarking paper) label Jun 25, 2025
@ntalluri ntalluri requested a review from agitter July 1, 2025 21:26
@agitter (Collaborator) left a comment

I pushed a few formatting changes. We can merge once the tests pass.

@agitter agitter merged commit 85a3184 into Reed-CompBio:main Jul 11, 2025
14 checks passed
Labels: needed for benchmarking (Priority PRs needed for the benchmarking paper), tuning (Workflow-spanning algorithm tuning)
Projects: None yet
3 participants