March backport changes by mmdanziger · Pull Request #31 · BiomedSciAI/biomed-multi-omic

mmdanziger · 2026-03-08T09:28:32Z

This PR significantly simplifies the core modeling architecture by consolidating task-specific models (MLM, Sequence Classification, etc.) into a unified MultiTask architecture. It also introduces a user-friendly, Scanpy-style Python inference API, improves how model configurations are merged from checkpoints, and ensures backward compatibility for older checkpoints.

Key Changes

1. Unified MultiTask Architecture

**Removed collation_strategy & ModelingStrategy**: The codebase no longer relies on tracking separate modeling strategies (mlm, sequence_classification, sequence_labeling). All models (SCBert, Performer, Nystromformer, Llama, ModernBert) now exclusively return and instantiate their ForMultiTaskModeling variants.
HuggingFace AutoModel registrations have been updated so that both AutoModelForMaskedLM and AutoModelForSequenceClassification correctly resolve to the multitask models.
Replaced dynamic training module resolution with direct usage of MultiTaskTrainingModule (or SequenceLabelingTrainingModule if enable_perturbation_metrics is set).
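
The direct module selection described in the last item can be sketched as follows. This is a minimal illustration, not the repository's code: the two classes are empty stand-ins for the real training modules, and `select_training_module` is a hypothetical helper name introduced only for this example.

```python
class MultiTaskTrainingModule:
    """Empty stand-in for the real training module (assumption)."""


class SequenceLabelingTrainingModule:
    """Empty stand-in; the real class and its base classes may differ (assumption)."""


def select_training_module(enable_perturbation_metrics: bool = False) -> type:
    """Hypothetical helper: pick the module class directly, with no strategy lookup."""
    if enable_perturbation_metrics:
        return SequenceLabelingTrainingModule
    return MultiTaskTrainingModule


default_module = select_training_module()
perturbation_module = select_training_module(enable_perturbation_metrics=True)
```

The point of the sketch is that a single boolean replaces the former per-strategy dispatch.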

2. New Python Inference API

Introduced bmfm_targets.inference.inference to allow seamless zero-shot predictions directly on AnnData objects.
This feature acts similarly to a Scanpy tool (bmfm.inference(adata)), handling layer swapping, tokenization, prediction extraction, and appending the resulting embeddings and metadata labels back into the adata object automatically.
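
The Scanpy-style pattern (read from the object, write results back into it) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_inference`, its parameters, and the output keys are hypothetical, a `SimpleNamespace` stands in for a real AnnData object, and the "model" is a placeholder with no tokenization step. The actual signature of `bmfm.inference` is not shown in this PR description.

```python
from types import SimpleNamespace


def toy_inference(adata, layer=None, embedding_key="X_embedding", label_key="predicted_label"):
    """Hypothetical Scanpy-style tool: reads from adata and writes results back in place."""
    # Layer swapping: read from a named layer if requested, otherwise from X.
    matrix = adata.layers[layer] if layer is not None else adata.X
    # Placeholder "model": embedding = (total counts, number of expressed genes).
    embeddings = [[sum(row), sum(1 for value in row if value > 0)] for row in matrix]
    # Placeholder "label head": threshold on total counts.
    labels = ["high" if emb[0] >= 10 else "low" for emb in embeddings]
    # Append embeddings and labels to the object, as Scanpy tools do.
    adata.obsm[embedding_key] = embeddings
    adata.obs[label_key] = labels
    return adata


# Stand-in for an AnnData object with two cells and three genes.
adata = SimpleNamespace(
    X=[[0, 3, 9], [1, 0, 2]],
    layers={"raw": [[0, 30, 90], [10, 0, 20]]},
    obs={},
    obsm={},
)
toy_inference(adata)
```

After the call, the results live on `adata` itself (`adata.obsm` and `adata.obs`), so downstream Scanpy-style steps can consume them without extra plumbing.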

3. Robust Config & Checkpoint Merging

Refactored SCBertMainConfig to handle config merging intelligently between YAML specifications and checkpoint hyperparameters (_merge_fields, _merge_label_columns, and _merge_configs_from_checkpoint).
Clarified precedence: checkpoint configs are largely authoritative during inference/prediction, so that the configuration matches the trained weights, while YAML configs take precedence over, or augment, checkpoint values during training.
Handles edge cases gracefully, such as clearing label_columns in predict mode if a checkpoint lacks label decoder weights.
A generalized merge_configs function replaces merge_trainer_configs in task_utils.py.
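
The precedence rule above can be sketched with plain dictionaries. This is a simplified illustration under stated assumptions, not the repository's `merge_configs`: the function name `merge_with_precedence`, its parameters, and the config keys are hypothetical, and the sketch treats the checkpoint as fully (rather than "largely") authoritative in predict mode.

```python
def merge_with_precedence(yaml_cfg, ckpt_cfg, mode, ckpt_has_label_decoder=True):
    """Hypothetical merge: checkpoint wins in predict mode, YAML wins in train mode."""
    if mode == "predict":
        base, override = yaml_cfg, ckpt_cfg
    elif mode == "train":
        base, override = ckpt_cfg, yaml_cfg
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    merged = dict(base)
    # A None value means "not specified", so it never erases a value from the base.
    merged.update({key: value for key, value in override.items() if value is not None})
    # Edge case from the PR text: no label decoder weights means no labels to predict.
    if mode == "predict" and not ckpt_has_label_decoder:
        merged["label_columns"] = []
    return merged


yaml_cfg = {"batch_size": 8, "pooling_method": None, "label_columns": ["cell_type"]}
ckpt_cfg = {"batch_size": 32, "pooling_method": "first_token"}

predict_cfg = merge_with_precedence(yaml_cfg, ckpt_cfg, "predict", ckpt_has_label_decoder=False)
train_cfg = merge_with_precedence(yaml_cfg, ckpt_cfg, "train")
```

In this toy run, `predict_cfg` takes `batch_size` from the checkpoint and ends up with empty `label_columns`, while `train_cfg` keeps the YAML `batch_size` and labels and only inherits `pooling_method` from the checkpoint.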

4. Backward Compatibility & Migration

Added migrate_checkpoint_if_needed to dynamically detect and convert legacy checkpoints (e.g., pure MLM or pure SeqCls) into the expected multitask format at runtime, eliminating the need to manually re-train old models.
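
One common way to implement this kind of runtime migration is to detect legacy parameter names in the state dict and rename them. The sketch below shows that idea only; it is not the logic of `migrate_checkpoint_if_needed`, and the key prefixes and the function name `migrate_state_dict_sketch` are hypothetical.

```python
# Hypothetical mapping from legacy head prefixes to a multitask layout.
LEGACY_PREFIXES = {
    "cls.predictions.": "multitask.field_decoder.",
    "classifier.": "multitask.label_decoder.",
}


def migrate_state_dict_sketch(state_dict):
    """Return (state_dict, migrated); keys are renamed only if legacy prefixes are found."""
    if not any(key.startswith(tuple(LEGACY_PREFIXES)) for key in state_dict):
        return state_dict, False  # Already in the new format: nothing to do.
    migrated = {}
    for key, value in state_dict.items():
        for old, new in LEGACY_PREFIXES.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        migrated[key] = value
    return migrated, True


legacy = {"encoder.layer.0.weight": 1, "classifier.weight": 2}
converted, was_migrated = migrate_state_dict_sketch(legacy)
```

Because detection is based on the keys themselves, running the function on an already converted state dict is a no-op, which is what makes it safe to call on every load.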

5. Model & Layer Refinements

Pooling Defaults: Changed the default pooling_method in TrainerConfig from "pooling_layer" to "first_token", which provides a safer default (especially for pure MLM models where the pooler is untrained).
MVC Embeddings: Fixed missing dictionary unpacking for mvc_query_embeddings across multiple predictive layers, ensuring the correct query tensors are routed to MVC decoders.
PEFT Support: Added output_attentions and return_dict arguments to forward signatures of model wrappers to preserve compatibility with standard PEFT/LoRA integrations.
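
To illustrate why "first_token" is the safer pooling default, the sketch below contrasts the two options in pure Python. It assumes that "pooling_layer" is a BERT-style pooler (a dense layer plus tanh applied to the first token); that assumption, and both function names, are not confirmed by this PR.

```python
import math


def first_token_pool(hidden_states):
    """Return the first token's hidden vector unchanged (no learned parameters)."""
    return hidden_states[0]


def pooling_layer_pool(hidden_states, weight, bias):
    """BERT-style pooler (assumed): tanh(W @ h_first + b); depends on trained weights."""
    first = hidden_states[0]
    return [
        math.tanh(sum(w * x for w, x in zip(row, first)) + b)
        for row, b in zip(weight, bias)
    ]


hidden = [[1.0, 2.0], [3.0, 4.0]]  # two tokens, hidden size 2
direct = first_token_pool(hidden)
pooled = pooling_layer_pool(hidden, weight=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```

The first function has no parameters, so its output is meaningful for any checkpoint. The second depends on `weight` and `bias`, which stay at their initial values if the pooler was never trained, as in a pure MLM model.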

6. Cleanup and Documentation

Removed all references to collation_strategy across tutorial notebooks, GitHub Actions CI workflows, and .yaml config files.
Updated README.md files to reflect the simplified data module configuration requirements.

mmdanziger added 15 commits March 8, 2026 11:24

backport changes

e18f61e

add missing file

0e12de8

update pyproject

9a38cb8

remove obsolete references to collation_strategy

204087f

better handling of old ckpts

2dff9dd

remove test file that is missing data

51134e8

fix inference test problems

70f8568

intercept legacy configs that have label_columns but do not actually use them; this is nonsensical in the new paradigm, but the cruft is already there in the legacy ckpts

778532f

skip if data is missing

8a2e27e

handle no yaml fields correctly

1891862

identify scmodernbert when migrating

e3f2df7

add two poolers (we may need to fix this architecturally, but as long as there are two poolers they should be inited properly)

5850491

scmodernbert ckpt loading

2519e88

fix scmodernbert inheritance

a6b946f

remove bad asserts

14788b1

mmdanziger marked this pull request as ready for review March 8, 2026 15:16

lora checkpoint load

37c76a1

mmdanziger merged commit 0a944f5 into main Mar 26, 2026
8 checks passed

mmdanziger deleted the march-backport branch March 26, 2026 15:28

mmdanziger merged 16 commits into main from march-backport
