
Benchmark whichmut dense vs sparse formats for re-implementation decision #171

@jgallowa07

Description

Final Results: GPU Benchmarking Complete - Dense is 2.3x Faster!

Status: COMPLETE - CPU and GPU benchmarking finished with surprising results!

Major Finding: Dense format is 2.3x FASTER than sparse on GPU at production scale, completely reversing the initial expectations.

Recommendation: Use DENSE format for production GPU training.


Executive Summary

What We Discovered

The original 6,600x performance gap was not due to the data format but to inefficient Python loops. After implementing a vectorized dense path:

CPU Results:

  • Sparse: 0.070s/epoch
  • Dense: 0.084s/epoch (20% slower)
  • Both formats viable

GPU Results (Game Changer!):

  • Small dataset (225 seqs): Nearly identical performance
  • Large dataset (64k seqs): Dense is 2.3x FASTER than sparse!
  • Dense: 24.1s/epoch vs Sparse: 54.9s/epoch

Final Recommendation

For Production (GPU Training):
Use DENSE format

  • 2.3x faster on GPU with large datasets
  • Simpler code, easier to debug
  • Better GPU utilization through simpler indexing
  • Minimal memory overhead (~9%)

For CPU-Only Workflows:

  • Use SPARSE format (20% faster than dense)

Complete Results

Experiment Summary

| Experiment | Device | Dataset | Batch Size | N Seqs | Sparse Time/Epoch | Dense Time/Epoch | Ratio | Notes |
|---|---|---|---|---|---|---|---|---|
| cpu_test_bs32 | CPU | 225 | 32 | 180/45 | 0.070s | 0.084s | 1.20x slower | CPU baseline |
| gpu_test_bs32 | GPU | 225 | 32 | 180/45 | 0.127s | 0.132s | 1.04x slower | GPU small |
| gpu_21K_bs128 | GPU | 64,465 | 128 | 64,465 | 54.95s | 24.10s | 2.28x FASTER | Production scale |

Key Findings

  1. Bug Fix: Fixed dense tensor shape from (L_codon, 65, 65) to (L_codon, 65)

    • Branch: viral-dasm-experiments-1/fix-dense-neutral-rates-tensor-shape
    • PR: matsengrp/viral-dasm-experiments-1#42
  2. Vectorization: Eliminated Python loops, achieved 5,785x speedup

    • Old dense (iterative): 462.8s/epoch (6,600x slower)
    • New dense (vectorized): 0.084s/epoch on CPU, 24.1s/epoch on GPU
    • Branch: 171-vectorized-dense-implementation (netam)
  3. Mathematical Equivalence: Both formats produce identical results (see the check sketched after this list)

    • Same validation losses
    • Same model parameters (diff < 1e-8)
    • Same predictions (diff < 1e-8)
  4. GPU Performance Reversal: Dense is significantly faster on GPU

    • Simpler indexing patterns → better GPU parallelization
    • Coalesced memory access patterns
    • Scales better with dataset size and batch size
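
The equivalence claim comes down to elementwise comparisons at tight tolerance. A minimal sketch of that kind of check, assuming two trained models (model_sparse, model_dense) and a shared validation batch; the function and argument names here are hypothetical, not the actual netam API:

import torch

def check_equivalence(model_sparse, model_dense, val_batch, atol=1e-8):
    # Learned parameters should match pairwise
    for p_s, p_d in zip(model_sparse.parameters(), model_dense.parameters()):
        assert torch.allclose(p_s, p_d, atol=atol), "parameter mismatch"
    # Predictions on the same validation batch should match
    with torch.no_grad():
        pred_sparse = model_sparse(val_batch)
        pred_dense = model_dense(val_batch)
    assert torch.allclose(pred_sparse, pred_dense, atol=atol), "prediction mismatch"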

Why Dense is Faster on GPU

Technical Explanation:

  1. Simpler Indexing: Dense uses straightforward tensor indexing, sparse requires complex lookup operations
  2. Better GPU Utilization: Dense operations are more amenable to GPU parallelization
  3. Coalesced Memory Access: Dense format enables better memory coalescing on GPU
  4. Scales Better: Performance advantage grows with dataset size and batch size

Vectorized Dense Implementation:

# Build lookup table once (65 codons, not millions of data points)
child_indices = parent_to_children[codon_parents_idxss]  # (N, L, 9)

# Parallel tensor operations - NO PYTHON LOOPS
neutral_rates_gathered = neutral_rates_tensor[batch_idx, pos_idx, child_indices]
selection_factors_gathered = selection_factors[batch_idx, pos_idx, child_aa_idx]

# Vectorized multiply and sum (masking of padded positions elided from this snippet)
products = neutral_rates_gathered * selection_factors_gathered
Z = products.sum(dim=(1, 2))
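
For contrast, a minimal self-contained sketch (toy shapes and random data, purely illustrative) of the per-element Python loop pattern that caused the original slowdown, next to the batched-gather pattern above; both compute the same per-sequence normalizer:

import torch

# Toy shapes for illustration only: N sequences, L codon sites, 65 codons,
# 20 amino acids, 9 single-mutant children per codon
N, L, C, A = 4, 10, 65, 20
neutral_rates_tensor = torch.rand(N, L, C)
selection_factors = torch.rand(N, L, A)
child_indices = torch.randint(0, C, (N, L, 9))
child_aa_idx = torch.randint(0, A, (N, L, 9))

# Iterative pattern: one Python-level operation per (sequence, site, child)
Z_loop = torch.zeros(N)
for n in range(N):
    for l in range(L):
        for k in range(9):
            Z_loop[n] += (neutral_rates_tensor[n, l, child_indices[n, l, k]]
                          * selection_factors[n, l, child_aa_idx[n, l, k]])

# Vectorized pattern: two batched gathers and one reduction
batch_idx = torch.arange(N)[:, None, None]
pos_idx = torch.arange(L)[None, :, None]
Z_vec = (neutral_rates_tensor[batch_idx, pos_idx, child_indices]
         * selection_factors[batch_idx, pos_idx, child_aa_idx]).sum(dim=(1, 2))

assert torch.allclose(Z_loop, Z_vec, atol=1e-5)

The loop issues N x L x 9 tiny indexing operations from the Python interpreter, which is what dominated the old 462.8s/epoch runtime; the gather version dispatches the same arithmetic as a handful of batched kernels.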

Implementation Details

Changes Made

netam Repository:

  • Branch: 171-vectorized-dense-implementation
  • File: netam/whichmut_trainer.py:337-429
  • Status: All 27 whichmut tests pass
  • Created vectorized dense implementation alongside existing sparse

viral-dasm-experiments-1 Repository:

  • Branch: 171-benchmark-sparse-v-dense-whichmut (experimental, not for merging)
  • Branch: fix-dense-neutral-rates-tensor-shape (bug fix, PR matsengrp/viral-dasm-experiments-1#42)
  • Bug fix: src/viraldasmex/dasm_utils.py:232
  • Tests: tests/test_dasm_utils.py:281
  • Comprehensive benchmarking infrastructure

Benchmarking Infrastructure

Created reusable infrastructure in viral-dasm-experiments-1/scv2-dasm/benchmark_sparse_v_dense/:

  • Parameterized experiment runner
  • Memory and timing tracking
  • Model comparison verification
  • Organized experiment results
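
As a rough illustration of the tracking involved, here is a minimal sketch of per-epoch timing and peak-GPU-memory measurement; run_timed_epoch and the train_one_epoch callable are hypothetical names, not the actual runner's API:

import time
import torch

def run_timed_epoch(train_one_epoch, device):
    # Reset peak-memory stats and synchronize so timings reflect completed GPU work
    if device.type == "cuda":
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.synchronize(device)
    start = time.perf_counter()
    train_one_epoch()
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    elapsed_s = time.perf_counter() - start
    # Peak reserved memory (assumed to correspond to the "reserved" figures quoted below)
    peak_mb = (torch.cuda.max_memory_reserved(device) / 1e6
               if device.type == "cuda" else float("nan"))
    return elapsed_s, peak_mb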

Complete Documentation:
📊 Full Results & Analysis


Decision Framework

Use DENSE if:

✅ Training on GPU (2.3x faster at scale!) - RECOMMENDED
✅ Code simplicity is a priority
✅ Debugging and understanding code is important
✅ Using large batch sizes (>64)
✅ Production training with thousands of sequences
✅ New implementation or fresh codebase

Use SPARSE if:

  • Training on CPU only (20% faster than dense)
  • GPU memory is severely constrained
  • Already using sparse format in existing codebase
  • Team has significant investment in sparse data structures

Experiment Results

CPU Baseline (225 sequences, batch 32)

  • Device: CPU (MPS not compatible)
  • Sparse: 0.070s/epoch, 66.3 MB memory
  • Dense: 0.084s/epoch, 66.1 MB memory
  • Outcome: Sparse 20% faster on CPU

GPU Small Dataset (225 sequences, batch 32)

  • Device: CUDA (orca04)
  • Sparse: 0.127s/epoch, 48 MB GPU memory
  • Dense: 0.132s/epoch, 60 MB GPU memory
  • Outcome: Nearly identical (~4% difference)

GPU Production Scale (64,465 sequences, batch 128)

  • Device: CUDA (ermine)
  • Sparse: 54.95s/epoch, 194 MB GPU memory
  • Dense: 24.10s/epoch, 212 MB GPU memory
  • Outcome: Dense 2.28x FASTER than sparse!

What Changed from Original Plan

Original Plan (Not Executed)

  • ❌ Synthetic data generation
  • ❌ On-the-fly neutral rate computation (infrastructure doesn't exist)
  • ❌ 10,000 sequences benchmark (used real datasets: 225 for validation, 64k for production)

Actual Approach (What We Did)

  1. Used Empirical Data: SARS-CoV-2 spike sequences from existing pipeline
  2. Found Critical Bug: Dense tensor shape was incorrect
  3. Discovered Root Cause: 6,600x slowdown from Python loops, not data format
  4. Implemented Vectorized Dense: Achieved 5,785x speedup
  5. Verified Equivalence: Both implementations mathematically identical
  6. GPU Benchmarking: Discovered dense is 2.3x faster on GPU!

Memory Analysis

CPU

  • Dense and sparse have similar memory footprints for small datasets
  • Difference becomes more apparent at scale

GPU (Production Scale - 64k sequences)

  • Sparse: 194 MB GPU memory reserved
  • Dense: 212 MB GPU memory reserved
  • Difference: ~9% more for dense (minimal and acceptable)

Both formats fit comfortably in GPU memory for typical workloads. Dense's 2.3x speed advantage far outweighs the 9% memory increase.
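
For reference, the headline ratios follow directly from the reported production-scale measurements:

# Production-scale (64,465 seqs, batch 128) numbers from the table above
sparse_s, dense_s = 54.95, 24.10      # seconds per epoch
sparse_mb, dense_mb = 194, 212        # MB GPU memory reserved

print(f"speedup: {sparse_s / dense_s:.2f}x")        # ~2.28x faster for dense
print(f"memory:  {dense_mb / sparse_mb - 1:.1%}")   # ~9% more memory for dense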


Timeline

  • Day 1-2: Initial benchmarking, discovered 6,600x slowdown
  • Day 3: Root cause analysis, identified Python loops
  • Day 4: Implemented vectorized dense, achieved 5,785x speedup
  • Day 5: Verified equivalence, reorganized infrastructure
  • Day 6-7: GPU benchmarking, discovered dense is 2.3x faster!

Key Documentation

Main Results

📊 SUMMARY.md - Complete Results & Analysis

Supporting Documentation

Experiment Data

All experiment results available in:
viral-dasm-experiments-1/scv2-dasm/benchmark_sparse_v_dense/experiments/

  • 2025-11-05_cpu_test_bs32/ - CPU baseline
  • 2025-11-05_gpu_test_bs32/ - GPU small dataset
  • 2025-11-06_gpu_21K_bs128/ - GPU production scale

Related PRs and Branches

netam

  • Branch: 171-vectorized-dense-implementation
  • File: netam/whichmut_trainer.py:337-429
  • Tests: All 27 whichmut tests pass

viral-dasm-experiments-1

  • Bug Fix Branch: fix-dense-neutral-rates-tensor-shape
  • Bug Fix PR: matsengrp/viral-dasm-experiments-1#42 (ready for merge)
  • Benchmarking Branch: 171-benchmark-sparse-v-dense-whichmut (experimental, not for merging)

Next Steps

  1. ✅ CPU validation complete
  2. ✅ GPU benchmarks complete
  3. ✅ Production-scale testing (64k sequences) complete
  4. ✅ Discovered dense is 2.3x faster on GPU!
  5. ✅ Merge vectorized dense implementation to netam main
  6. ⏳ Update netam documentation with GPU-based recommendations
  7. ⏳ Consider deprecating iterative dense implementation
  8. ⏳ Update whichmut documentation to recommend dense for GPU training

Conclusion

This investigation revealed a fundamental performance reversal based on hardware:

Before Investigation

  • Dense was 6,600x slower (unusable due to Python loops)
  • Sparse was the only option

After Vectorization

  • CPU: Sparse is 20% faster (modest advantage)
  • GPU: Dense is 2.3x faster (major advantage!)

Final Recommendation:

  • Production GPU training: Use DENSE (faster + simpler)
  • CPU-only workflows: Use SPARSE (faster on CPU)

The performance winner depends on the hardware: dense's straightforward indexing, roughly neutral on CPU, becomes a major advantage on GPU, delivering a 2.3x speedup at production scale while also being conceptually simpler and easier to maintain.

For re-implementation: Use DENSE format for GPU-based production training.
