Validate multihit branch lengths #124

willdumm · 2025-03-05T22:10:48Z

See #123 for discussion, which references a test in this PR.

willdumm · 2025-04-23T23:37:27Z

The issues that ended up causing the discrepancy were:

- Gradient-based optimization was giving consistently incorrect results (though close to correct on real data). We now use the gradient-free "Brent" method from scipy.optimize.minimize_scalar, with some contortions to avoid numerical issues, since our log-probabilities are quite flat w.r.t. branch lengths, especially when the optimal branch length is very close to 0.
- We were unnecessarily clamping linear-space probabilities to a floor that was too large. We now use torch.finfo(dtype).tiny instead of torch.finfo(dtype).eps. The point of the clamp is to avoid taking the log of something too close to zero, and the new value is still adequate for that purpose.

Also, there are two completely independent code paths computing codon probabilities: one for branch length optimization and one for loss computation (in DASM). This PR doesn't fix that, but it does test both paths. A future PR should fix this; see issue #134, "Independent code paths for codon probability computations".
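For readers unfamiliar with the gradient-free approach mentioned above, here is a minimal sketch of Brent's method via scipy.optimize.minimize_scalar on a toy branch-length likelihood. The toy model (each of N_SITES sites mutates independently with probability 1 - exp(-t)), the constants, and the bracket values are assumptions made for illustration. This is not the objective or the bracketing logic used in netam/molevol.py.

```python
import math

from scipy.optimize import minimize_scalar

# Toy data (assumed for illustration): 3 of 100 sites differ between parent and child.
N_SITES = 100
N_MUTATED = 3


def neg_log_likelihood(t):
    """Negative log-likelihood of branch length t under the toy model."""
    if t <= 0.0:
        # Branch lengths must be positive; reject anything else.
        return math.inf
    # log(1 - exp(-t)), computed with expm1 so it stays accurate for small t.
    log_p_mut = math.log(-math.expm1(-t))
    log_p_same = -t
    return -(N_MUTATED * log_p_mut + (N_SITES - N_MUTATED) * log_p_same)


# A three-point bracket (a, b, c) with f(b) < f(a) and f(b) < f(c)
# tells Brent's method where to search; no gradients are required.
result = minimize_scalar(
    neg_log_likelihood, bracket=(1e-4, 0.05, 1.0), method="brent"
)
t_hat = result.x

# The toy model has a closed-form optimum, which makes the result easy to check.
closed_form = -math.log(1.0 - N_MUTATED / N_SITES)
print(t_hat, closed_form)
```

Because the toy optimum is known in closed form, the same pattern can serve as a regression test: compare the optimizer's answer to the analytic one with a tight tolerance.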

netam/dxsm.py

netam/molevol.py

Copilot

Pull Request Overview

This PR implements changes to validate multihit branch lengths and improve branch length optimization along with related refactoring. Key changes include:

- Addition of new utility functions (flatten_codon_idxs, unflatten_codon_idxs, nt_idx_tensor_of_str) and corresponding tests.
- Refactored optimization routines in molevol.py using SciPy’s bracket and minimize_scalar functions.
- Consistent updating of branch length handling and multihit correction across modules (multihit, dxsm, dnsm, ddsm, dasm, framework).
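The codon index utilities named above are not shown in this conversation. As a rough sketch of the idea, a codon can be treated as three nucleotide indices and packed into a single index in the range 0..63. The function names, the "ACGT" ordering, and the scalar signatures below are hypothetical stand-ins, not netam's actual API (which presumably operates on tensors).

```python
# Hypothetical stand-ins for illustration; not netam's actual functions.
NT_ORDER = "ACGT"  # assumed nucleotide ordering


def nt_idxs_of_str(seq):
    """Map a nucleotide string to a list of integer indices."""
    return [NT_ORDER.index(nt) for nt in seq]


def flatten_codon_idx(first, second, third):
    """Pack three nucleotide indices (each 0..3) into one index in 0..63."""
    return 16 * first + 4 * second + third


def unflatten_codon_idx(flat):
    """Invert flatten_codon_idx, recovering the three nucleotide indices."""
    return (flat // 16, (flat // 4) % 4, flat % 4)


atg_flat = flatten_codon_idx(*nt_idxs_of_str("ATG"))
atg_triple = unflatten_codon_idx(atg_flat)
```

Under these assumptions "ATG" maps to indices (0, 3, 2) and to flat index 14, and the two conversions are inverses over all 64 codons.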

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

File	Description
tests/test_sequences.py	Added tests for new codon index conversion functions.
setup.py	Added the scipy dependency.
netam/sequences.py	Introduced new codon indexing utilities supporting multihit workflows.
netam/multihit.py	Updated HitClassDataset to subclass BranchLengthDataset and refactored branch length handling.
netam/molevol.py	Overhauled optimization functions and added new codon reshaping utilities.
netam/models.py	Updated forward docstring to reflect new input expectations.
netam/hit_class.py	Modified multihit correction using the new normalization routine.
netam/framework.py	Added parallel branch length optimization with multiprocessing.
netam/dxsm.py, dnsm.py, ddsm.py, dasm.py	Replaced deprecated branch length references with the updated ones.

tests/test_sequences.py

netam/molevol.py

netam/models.py

matsen

Merge when happy

netam/dasm.py

matsen · 2025-04-24T16:52:45Z

netam/dasm.py


        # We have to clamp the predictions to avoid log(0) issues.
-        preds = torch.clamp(preds, min=torch.finfo(preds.dtype).eps)
+        preds = torch.clamp(preds, min=torch.finfo(preds.dtype).tiny)
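To see why the size of the floor matters, the sketch below compares the two clamp values for float32 using only the standard library. The constants are written as exact powers of two (2^-23 for eps, 2^-126 for tiny), which are the float32 values that torch.finfo reports; the example probability of 1e-12 and the helper name clamped_log are assumptions chosen for illustration.

```python
import math

# float32 machine constants as exact powers of two, so no torch import is needed.
FLOAT32_EPS = 2.0 ** -23    # value of torch.finfo(torch.float32).eps
FLOAT32_TINY = 2.0 ** -126  # value of torch.finfo(torch.float32).tiny


def clamped_log(p, floor):
    """Log of a probability after clamping it from below (hypothetical helper)."""
    return math.log(max(p, floor))


# An assumed small-but-legitimate probability, well above tiny but below eps.
true_p = 1e-12

log_with_eps = clamped_log(true_p, FLOAT32_EPS)    # floor binds: about -15.9
log_with_tiny = clamped_log(true_p, FLOAT32_TINY)  # floor does not bind: about -27.6
print(log_with_eps, log_with_tiny)
```

With eps as the floor, every probability below roughly 1.2e-7 yields the same log value, which flattens the objective exactly where small branch lengths need resolution. With tiny, the log is still finite but the floor rarely binds.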


netam/molevol.py

willdumm · 2025-04-24T17:36:06Z

Look how wrong the branch lengths were! (I think they were more wrong with multihit adjustment, but we can't make that comparison because we also changed how the multihit model gets applied.)

These plots use 50k PCPs (parent-child pairs) from the Jaffe heavy chain dataset.
branch_length_comparison.pdf

branch_length_comparison_origin.pdf

add branch length tests
remove old comment
update comment
add lots of tests
maybe fixed?
fix multihit normalization and add tests
tweaks to branch lengths
finalized branch length optimization
fixed tests again
cleanup
fix copilot suggestion
dasm training performs well
cleanup and fix tests
format
lint
remove unnecessary warning

willdumm commented Apr 23, 2025

netam/dxsm.py (outdated, resolved)

willdumm commented Apr 23, 2025

netam/molevol.py (resolved)

willdumm marked this pull request as ready for review April 23, 2025 23:42

willdumm requested review from Copilot and matsen April 23, 2025 23:43

Copilot AI reviewed Apr 23, 2025

tests/test_sequences.py (resolved)

netam/molevol.py (outdated, resolved)

netam/models.py (resolved)

matsen approved these changes Apr 24, 2025

willdumm force-pushed the 123-multihit-branch-lengths branch from 0cec219 to bb412a1 on April 28, 2025 17:53

willdumm mentioned this pull request Apr 28, 2025

Fit new multihit model matsengrp/thrifty-experiments-1#26

Merged

willdumm merged commit 0d36aee into main Apr 28, 2025
2 checks passed

willdumm deleted the 123-multihit-branch-lengths branch April 28, 2025 18:04

willdumm restored the 123-multihit-branch-lengths branch April 29, 2025 20:38

willdumm deleted the 123-multihit-branch-lengths branch April 29, 2025 21:23
