adds block sampling #1040

Open
skothenhill-nv wants to merge 6 commits into main from hillst/block-sampling3

@skothenhill-nv skothenhill-nv commented Aug 14, 2025

Description

Implements scDataset-style block sampling. Both map-style and iter-style datasets are provided, along with the license from the original source. We add our internal permute calls to ensure RNG compatibility, and then test for equivalence. Additionally, randomness tests were performed outside the framework to compare the np.permute method with the bionemo internal permute.

  • Adds performance optimizations for process_item in the geneformer Dataset object
  • Fixes a major performance issue in MultiEpochResamplerDataset
  • Adds __getitems__ implementations for a few Dataset classes (note that these do nothing unless the class is the top-level dataset in a DataLoader)
  • Adds a performance benchmarking script to compare the scDataset implementation (iter-style) with our implementation (map-style). They are approximately equivalent in runtime and randomness; the timings below differ by about 6%. The original scDataset implementation still has some issues around the edges, such as breaking out of a loop before the generator is exhausted. Timings use a batch size of 1024, a fetch factor of 8, and a block size of 64 (settings that seem appropriate for pretraining) with 16 workers. Relative performance changes with the order of execution, probably because of some caching behavior.
IterStyleDataset: 278.7125172615051 seconds
IterStyleDataset: 2939.2293107215432 samples per second
MapStyleScDataset: 294.61863374710083 seconds
MapStyleScDataset: 2780.5437476273046 samples per second
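
The `__getitems__` caveat above can be illustrated with a minimal sketch. It assumes the loader looks for `__getitems__` only on the dataset object it is handed, which is my understanding of how a map-style fetch works. `SquaresDataset`, `PassThrough`, and `fetch_batch` are hypothetical names for illustration; none of them come from this PR or from PyTorch.

```python
class SquaresDataset:
    """Toy map-style dataset: item i is i squared."""

    def __init__(self, n):
        self.n = n
        self.batched_calls = 0  # counts how often the batched path is used

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx * idx

    def __getitems__(self, indices):
        # Serve the whole batch in one call instead of one call per index.
        self.batched_calls += 1
        return [i * i for i in indices]


class PassThrough:
    """Hypothetical wrapper that only forwards __getitem__."""

    def __init__(self, inner):
        self.inner = inner

    def __len__(self):
        return len(self.inner)

    def __getitem__(self, idx):
        return self.inner[idx]


def fetch_batch(dataset, indices):
    """Simplified stand-in for a loader's fetch step (an assumption, not PyTorch's code)."""
    if hasattr(dataset, "__getitems__"):
        return dataset.__getitems__(indices)
    return [dataset[i] for i in indices]


inner = SquaresDataset(10)
direct = fetch_batch(inner, [1, 2, 3])                # uses the batched path
wrapped = fetch_batch(PassThrough(inner), [1, 2, 3])  # falls back to per-item access
```

Both calls return the same items, but only the direct call reaches `__getitems__`. Once the dataset is wrapped, the batched method is hidden unless the wrapper forwards it too.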

The purpose of this work is to test the consequences of block randomness on model training.
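
For readers unfamiliar with block randomness, here is a rough sketch of the idea as I understand it from the description: shuffle contiguous blocks of indices, fetch `batch_size * fetch_factor` indices at a time, shuffle within each fetch, and split the fetch into batches. It is not the code in this PR. `block_shuffled_batches` is a hypothetical name, and it uses Python's `random` module rather than the bionemo internal permute.

```python
import random


def block_shuffled_batches(n_samples, block_size, batch_size, fetch_factor, seed=0):
    """Yield batches of indices drawn from shuffled contiguous blocks (illustrative only)."""
    rng = random.Random(seed)
    # Split [0, n_samples) into contiguous blocks; the last block may be short.
    blocks = [
        list(range(start, min(start + block_size, n_samples)))
        for start in range(0, n_samples, block_size)
    ]
    rng.shuffle(blocks)  # randomize block order, keep each block contiguous
    order = [idx for block in blocks for idx in block]

    fetch_size = batch_size * fetch_factor
    for lo in range(0, n_samples, fetch_size):
        fetched = order[lo:lo + fetch_size]
        rng.shuffle(fetched)  # mix samples from different blocks within one fetch
        for b in range(0, len(fetched), batch_size):
            yield fetched[b:b + batch_size]


batches = list(
    block_shuffled_batches(n_samples=100, block_size=4, batch_size=8, fetch_factor=2)
)
```

Every index appears exactly once per pass, but samples within a batch come from only a few blocks. That reduced randomness is the effect whose impact on training is in question.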

Type of changes

  • New feature (non-breaking change which adds functionality)

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

Note

By default, the notebooks validation tests are skipped unless explicitly enabled.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Usage

# TODO: Add code snippet

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

codecov-commenter commented Aug 15, 2025

Codecov Report

❌ Patch coverage is 59.64126% with 90 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.83%. Comparing base (21b1442) to head (b2f03c9).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rmer/src/bionemo/geneformer/data/block_sampling.py | 59.27% | 90 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1040      +/-   ##
==========================================
- Coverage   81.29%   80.83%   -0.47%     
==========================================
  Files         152      153       +1     
  Lines       10271    10494     +223     
==========================================
+ Hits         8350     8483     +133     
- Misses       1921     2011      +90     

| Files with missing lines | Coverage Δ |
|---|---|
| ...ages/bionemo-core/src/bionemo/core/data/permute.py | 100.00% <100.00%> (ø) |
| ...rmer/src/bionemo/geneformer/data/block_sampling.py | 59.27% <59.27%> (ø) |

Signed-off-by: Steven <skothenhill@nvidia.com>
Signed-off-by: Steven <skothenhill@nvidia.com>
@skothenhill-nv skothenhill-nv force-pushed the hillst/block-sampling3 branch from 7c4b597 to 2eacb12 Compare August 15, 2025 16:34
nominally adds block sampling to the geneformer datamodule (probably will change more)
adds a small performance script (will change)

Signed-off-by: Steven <skothenhill@nvidia.com>
…amework-fresh/bionemo-framework/. into hillst/block-sampling3

Signed-off-by: Steven <skothenhill@nvidia.com>
@skothenhill-nv skothenhill-nv force-pushed the hillst/block-sampling3 branch from b2f03c9 to 9b86370 Compare August 21, 2025 23:01