Final Results: GPU Benchmarking Complete - Dense is 2.3x Faster!
Status: ✅ COMPLETE - CPU and GPU benchmarking finished with surprising results!
Major Finding: Dense format is 2.3x FASTER than sparse on GPU at production scale, completely reversing the initial expectations.
Recommendation: Use DENSE format for production GPU training.
Executive Summary
What We Discovered
The original 6,600x performance gap was not due to the data format but to inefficient Python loops. After vectorizing the dense implementation:
CPU Results:
- Sparse: 0.070s/epoch
- Dense: 0.084s/epoch (+20% slower)
- Both formats viable
GPU Results (Game Changer!):
- Small dataset (225 seqs): Nearly identical performance
- Large dataset (64k seqs): Dense is 2.3x FASTER than sparse!
- Dense: 24.1s/epoch vs Sparse: 54.9s/epoch
Final Recommendation
For Production (GPU Training):
✅ Use DENSE format
- 2.3x faster on GPU with large datasets
- Simpler code, easier to debug
- Better GPU utilization through simpler indexing
- Minimal memory overhead (~9%)
For CPU-Only Workflows:
- Use SPARSE format (20% faster than dense)
Complete Results
Experiment Summary
| Experiment | Device | Dataset Size (seqs) | Batch Size | Train/Val Seqs | Sparse Time/Epoch | Dense Time/Epoch | Dense vs Sparse | Notes |
|---|---|---|---|---|---|---|---|---|
| cpu_test_bs32 | CPU | 225 | 32 | 180/45 | 0.070s | 0.084s | 1.20x slower | CPU baseline |
| gpu_test_bs32 | GPU | 225 | 32 | 180/45 | 0.127s | 0.132s | 1.04x slower | GPU small |
| gpu_21K_bs128 | GPU | 64,465 | 128 | 64,465 | 54.95s | 24.10s | 2.28x FASTER | Production scale |
Key Findings
- Bug Fix: Fixed dense tensor shape from (L_codon, 65, 65) to (L_codon, 65)
  - Branch: viral-dasm-experiments-1/fix-dense-neutral-rates-tensor-shape
  - PR: matsengrp/viral-dasm-experiments-1#42
- Vectorization: Eliminated Python loops, achieved 5,785x speedup
  - Old dense (iterative): 462.8s/epoch (6,600x slower)
  - New dense (vectorized): 0.084s/epoch on CPU, 24.1s/epoch on GPU
  - Branch: 171-vectorized-dense-implementation (netam)
- Mathematical Equivalence: Both formats produce identical results (see the sketch after this list)
  - Same validation losses
  - Same model parameters (diff < 1e-8)
  - Same predictions (diff < 1e-8)
- GPU Performance Reversal: Dense is significantly faster on GPU
  - Simpler indexing patterns → better GPU parallelization
  - Coalesced memory access patterns
  - Scales better with dataset size and batch size
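To illustrate how this kind of equivalence can be checked, here is a minimal sketch with assumed model and variable names (not netam's actual test code):

```python
import torch

def check_equivalence(model_sparse, model_dense, batch, tol=1e-8):
    """Sketch: compare two trained models' parameters and predictions to within tol."""
    # Parameter-by-parameter comparison of the trained weights
    for p_s, p_d in zip(model_sparse.parameters(), model_dense.parameters()):
        assert torch.allclose(p_s, p_d, atol=tol), "parameters diverge beyond tolerance"

    # Predictions on the same held-out batch should also agree
    with torch.no_grad():
        assert torch.allclose(model_sparse(batch), model_dense(batch), atol=tol)
```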
Why Dense is Faster on GPU
Technical Explanation:
- Simpler Indexing: Dense uses straightforward tensor indexing, sparse requires complex lookup operations
- Better GPU Utilization: Dense operations are more amenable to GPU parallelization
- Coalesced Memory Access: Dense format enables better memory coalescing on GPU
- Scales Better: Performance advantage grows with dataset size and batch size
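To make the indexing contrast concrete, here is a small sketch with illustrative shapes and names (the sparse side is schematic, not netam's actual sparse code), comparing a single dense gather against per-entry lookups into a keyed store:

```python
import torch

# Illustrative shapes: N sequences, L codon sites, 9 single-mutation children, 65 codon states
N, L, K, C = 4, 12, 9, 65
dense_rates = torch.rand(N, L, C)                      # dense: rates indexed directly by child codon
child_indices = torch.randint(0, C, (N, L, K))

# Dense access: one fused advanced-indexing gather; regular strides coalesce well on GPU
batch_idx = torch.arange(N).view(N, 1, 1)
pos_idx = torch.arange(L).view(1, L, 1)
gathered = dense_rates[batch_idx, pos_idx, child_indices]   # (N, L, K)

# Sparse-style access (schematic): a keyed store forces per-entry lookups that
# cannot be fused into a single coalesced gather
sparse_store = {(n, l, int(c)): dense_rates[n, l, c].item()
                for n in range(N) for l in range(L) for c in child_indices[n, l]}
one_site = [sparse_store[(0, 0, int(c))] for c in child_indices[0, 0]]
```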
Vectorized Dense Implementation (sketch; the masking variable name is illustrative):

```python
# Build lookup table once (65 codons, not millions of data points)
child_indices = parent_to_children[codon_parents_idxss]  # (N, L, 9)

# Parallel tensor operations - NO PYTHON LOOPS
neutral_rates_gathered = neutral_rates_tensor[batch_idx, pos_idx, child_indices]
selection_factors_gathered = selection_factors[batch_idx, pos_idx, child_aa_idx]

# Vectorized multiply, mask out padded positions, and sum
products = neutral_rates_gathered * selection_factors_gathered
masked_products = products * valid_mask  # valid_mask: illustrative name for the padding mask
Z = masked_products.sum(dim=(1, 2))
```

Implementation Details
Changes Made
netam Repository:
- Branch: 171-vectorized-dense-implementation
- File: netam/whichmut_trainer.py:337-429
- Status: All 27 whichmut tests pass
- Created vectorized dense implementation alongside existing sparse
viral-dasm-experiments-1 Repository:
- Branch: 171-benchmark-sparse-v-dense-whichmut (experimental, not for merging)
- Branch: fix-dense-neutral-rates-tensor-shape (bug fix, PR matsengrp/viral-dasm-experiments-1#42)
- Bug fix: src/viraldasmex/dasm_utils.py:232
- Tests: tests/test_dasm_utils.py:281
- Comprehensive benchmarking infrastructure
Benchmarking Infrastructure
Created reusable infrastructure in viral-dasm-experiments-1/scv2-dasm/benchmark_sparse_v_dense/:
- Parameterized experiment runner
- Memory and timing tracking
- Model comparison verification
- Organized experiment results
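For reference, a minimal sketch of the kind of per-epoch timing and memory probe such a runner can use (function and variable names are assumptions, not the actual runner code):

```python
import time
import torch

def timed_epoch(run_one_epoch, device):
    """Run one epoch and return (wall seconds, peak reserved GPU MB). Illustrative helper."""
    if device.type == "cuda":
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.synchronize(device)   # flush any queued work before timing
    start = time.perf_counter()
    run_one_epoch()
    if device.type == "cuda":
        torch.cuda.synchronize(device)   # wait for async kernels before stopping the clock
        peak_mb = torch.cuda.max_memory_reserved(device) / 1e6
    else:
        peak_mb = float("nan")           # CPU memory is tracked separately (e.g. via psutil)
    return time.perf_counter() - start, peak_mb
```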
Complete Documentation:
📊 Full Results & Analysis
Decision Framework
Use DENSE if:
✅ Training on GPU (2.3x faster at scale!) - RECOMMENDED
✅ Code simplicity is a priority
✅ Debugging and understanding code is important
✅ Using large batch sizes (>64)
✅ Production training with thousands of sequences
✅ New implementation or fresh codebase
Use SPARSE if:
- Training on CPU only (20% faster than dense)
- GPU memory is severely constrained
- Already using sparse format in existing codebase
- Team has significant investment in sparse data structures
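As a compact restatement of this framework, a hypothetical helper (not part of netam) might read:

```python
def recommended_format(device_type: str, gpu_memory_constrained: bool = False) -> str:
    """Encode the decision framework above: dense on GPU, sparse for CPU-only or tight GPU memory."""
    if device_type == "cuda" and not gpu_memory_constrained:
        return "dense"    # 2.3x faster at production scale, simpler indexing
    return "sparse"       # CPU-only workflows (~20% faster) or severely constrained GPU memory
```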
Experiment Results
CPU Baseline (225 sequences, batch 32)
- Device: CPU (MPS not compatible)
- Sparse: 0.070s/epoch, 66.3 MB memory
- Dense: 0.084s/epoch, 66.1 MB memory
- Outcome: Sparse 20% faster on CPU
GPU Small Dataset (225 sequences, batch 32)
- Device: CUDA (orca04)
- Sparse: 0.127s/epoch, 48 MB GPU memory
- Dense: 0.132s/epoch, 60 MB GPU memory
- Outcome: Nearly identical (~4% difference)
GPU Production Scale (64,465 sequences, batch 128)
- Device: CUDA (ermine)
- Sparse: 54.95s/epoch, 194 MB GPU memory
- Dense: 24.10s/epoch, 212 MB GPU memory
- Outcome: Dense 2.28x FASTER than sparse!
What Changed from Original Plan
Original Plan (Not Executed)
- ❌ Synthetic data generation
- ❌ On-the-fly neutral rate computation (infrastructure doesn't exist)
- ❌ 10,000 sequences benchmark (used real datasets: 225 for validation, 64k for production)
Actual Approach (What We Did)
- Used Empirical Data: SARS-CoV-2 spike sequences from existing pipeline
- Found Critical Bug: Dense tensor shape was incorrect
- Discovered Root Cause: 6,600x slowdown from Python loops, not data format
- Implemented Vectorized Dense: Achieved 5,785x speedup
- Verified Equivalence: Both implementations mathematically identical
- GPU Benchmarking: Discovered dense is 2.3x faster on GPU!
Memory Analysis
CPU
- Dense and sparse have similar memory footprints for small datasets
- Difference becomes more apparent at scale
GPU (Production Scale - 64k sequences)
- Sparse: 194 MB GPU memory reserved
- Dense: 212 MB GPU memory reserved
- Difference: ~9% more for dense (minimal and acceptable)
Both formats fit comfortably in GPU memory for typical workloads. Dense's 2.3x speed advantage far outweighs the 9% memory increase.
Timeline
- Day 1-2: Initial benchmarking, discovered 6,600x slowdown
- Day 3: Root cause analysis, identified Python loops
- Day 4: Implemented vectorized dense, achieved 5,785x speedup
- Day 5: Verified equivalence, reorganized infrastructure
- Day 6-7: GPU benchmarking, discovered dense is 2.3x faster!
Key Documentation
Main Results
📊 SUMMARY.md - Complete Results & Analysis
Supporting Documentation
- Performance Analysis - Root cause of Python loop bottleneck
- Vectorization Results - Detailed vectorization improvements
- Benchmark Summary - Initial findings
- README - Benchmarking methodology
Experiment Data
All experiment results available in:
viral-dasm-experiments-1/scv2-dasm/benchmark_sparse_v_dense/experiments/
- 2025-11-05_cpu_test_bs32/ - CPU baseline
- 2025-11-05_gpu_test_bs32/ - GPU small dataset
- 2025-11-06_gpu_21K_bs128/ - GPU production scale
Related PRs and Branches
netam
- Branch: 171-vectorized-dense-implementation
- File: netam/whichmut_trainer.py:337-429
- Tests: All 27 whichmut tests pass
viral-dasm-experiments-1
- Bug Fix Branch: fix-dense-neutral-rates-tensor-shape
- Bug Fix PR: matsengrp/viral-dasm-experiments-1#42 (ready for merge)
- Benchmarking Branch: 171-benchmark-sparse-v-dense-whichmut (experimental, not for merging)
Next Steps
- ✅ CPU validation complete
- ✅ GPU benchmarks complete
- ✅ Production-scale testing (64k sequences) complete
- ✅ Discovered dense is 2.3x faster on GPU!
- ✅ Merge vectorized dense implementation to netam main
- ⏳ Update netam documentation with GPU-based recommendations
- ⏳ Consider deprecating iterative dense implementation
- ⏳ Update whichmut documentation to recommend dense for GPU training
Conclusion
This investigation revealed a fundamental performance reversal based on hardware:
Before Investigation
- Dense was 6,600x slower (unusable due to Python loops)
- Sparse was the only option
After Vectorization
- CPU: Sparse is 20% faster (modest advantage)
- GPU: Dense is 2.3x faster (major advantage!)
Final Recommendation:
- ✅ Production GPU training: Use DENSE (faster + simpler)
- ✅ CPU-only workflows: Use SPARSE (faster on CPU)
The performance winner depends on the hardware. Dense's simpler indexing patterns, roughly neutral on CPU, become a major advantage on GPU, delivering a 2.3x speedup at production scale while keeping the code conceptually simpler and easier to maintain.
For re-implementation: Use DENSE format for GPU-based production training.