| 1 | +# Multithreaded SSv2 Evaluation |
| 2 | + |
| 3 | +This directory contains a high-performance multiprocessing evaluation script for the SSv2 video classifier. |
| 4 | + |
| 5 | +## Features |
| 6 | + |
| 7 | +- **Parallel Processing**: Uses Python's `ProcessPoolExecutor` to run multiple worker processes in parallel |
| 8 | +- **Independent Models**: Each worker loads its own copy of the encoder and classifier |
| 9 | +- **Memory Efficient**: Each worker uses ~2 GB of RAM, which allows up to 10 workers on a MacBook with 32 GB of RAM |
| 10 | +- **Progress Tracking**: Real-time progress bar showing evaluation status |
| 11 | +- **Comprehensive Logging**: Detailed logs saved to file with configurable log levels |
| 12 | +- **Comprehensive Metrics**: Generates accuracy, precision, recall, F1-score, confusion matrix, and full classification report |
| 13 | + |
| 14 | +## Usage |
| 15 | + |
| 16 | +### Quick Start |
| 17 | + |
| 18 | +```bash |
| 19 | +# Run with default settings (10 workers, full test set) |
| 20 | +./scripts/evaluate_ssv2_mt.sh |
| 21 | + |
| 22 | +# Run with custom number of workers |
| 23 | +NUM_WORKERS=8 ./scripts/evaluate_ssv2_mt.sh |
| 24 | + |
| 25 | +# Test with a small subset |
| 26 | +SUBSET_SIZE=100 NUM_WORKERS=4 ./scripts/evaluate_ssv2_mt.sh |
| 27 | + |
| 28 | +# Enable debug logging |
| 29 | +LOG_LEVEL=DEBUG NUM_WORKERS=4 ./scripts/evaluate_ssv2_mt.sh |
| 30 | +``` |
| 31 | + |
| 32 | +### Direct Python Invocation |
| 33 | + |
| 34 | +```bash |
| 35 | +python3 evaluate_ssv2_multithreaded.py \ |
| 36 | + --videos-dir videos/20bn-something-something-v2 \ |
| 37 | + --test-csv videos/labels/test-answers.csv \ |
| 38 | + --labels-json videos/labels/labels.json \ |
| 39 | + --encoder-weights weights/vitl_mlx.safetensors \ |
| 40 | + --classifier-weights output_ssv2_classifier/best_classifier.safetensors \ |
| 41 | + --num-workers 10 \ |
| 42 | + --output-dir evaluation_results \ |
| 43 | + --log-level INFO |
| 44 | +``` |
| 45 | + |
| 46 | +## Arguments |
| 47 | + |
| 48 | +- `--videos-dir`: Path to the videos directory |
| 49 | +- `--test-csv`: Path to test answers CSV file |
| 50 | +- `--labels-json`: Path to labels JSON file |
| 51 | +- `--encoder-weights`: Path to pretrained encoder weights |
| 52 | +- `--classifier-weights`: Path to trained classifier weights |
| 53 | +- `--num-frames`: Number of frames per clip (default: 16) |
| 54 | +- `--resolution`: Video resolution (default: 224) |
| 55 | +- `--tubelet-size`: Tubelet size for 3D patch embedding (default: 2) |
| 56 | +- `--num-classes`: Number of classes (default: 174) |
| 57 | +- `--num-workers`: Number of worker processes (default: CPU count; the Quick Start wrapper script runs with 10 by default) |
| 58 | +- `--output-dir`: Output directory for results (default: evaluation_results) |
| 59 | +- `--subset-size`: Evaluate only this many samples instead of the full test set; useful for quick test runs |
| 60 | +- `--log-level`: Logging level: DEBUG, INFO, WARNING, or ERROR (default: INFO) |
| 61 | + |
| 62 | +## Output Files |
| 63 | + |
| 64 | +The script generates multiple files in the output directory: |
| 65 | + |
| 66 | +1. **evaluation_summary.json**: Overall metrics including accuracy, precision, recall, F1-scores, and timing information |
| 67 | +2. **classification_report.txt**: Detailed per-class metrics from scikit-learn |
| 68 | +3. **confusion_matrix.npy**: Confusion matrix saved as a NumPy array (loadable with `numpy.load`) |
| 69 | +4. **evaluation_YYYYMMDD_HHMMSS.log**: Main process log with orchestration details |
| 70 | +5. **worker_N_YYYYMMDD_HHMMSS.log**: Individual log file for each worker process (N = worker ID) |
| 71 | + |
| 72 | +### Log File Details |
| 73 | + |
| 74 | +The evaluation creates separate log files for better debugging and analysis: |
| 75 | + |
| 76 | +**Main Log (`evaluation_*.log`)**: |
| 77 | +- Overall configuration and setup |
| 78 | +- Data distribution across workers |
| 79 | +- Worker task submission |
| 80 | +- Aggregate results and metrics |
| 81 | +- Final timing and throughput |
| 82 | + |
| 83 | +**Worker Logs (`worker_N_*.log`)**: |
| 84 | +- Model initialization for each worker |
| 85 | +- Video processing progress |
| 86 | +- Per-video errors and warnings |
| 87 | +- Worker-specific performance metrics |
| 88 | +- Individual worker completion status |
| 89 | + |
| 90 | +Example main log entry: |
| 91 | +``` |
| 92 | +2025-11-16 22:04:24 - ssv2_evaluation - INFO - Starting parallel evaluation... |
| 93 | +2025-11-16 22:04:24 - ssv2_evaluation - INFO - Submitted 2 worker tasks |
| 94 | +``` |
| 95 | + |
| 96 | +Example worker log entries: |
| 97 | +``` |
| 98 | +2025-11-16 22:04:26 - worker_0 - INFO - Worker 0: Models loaded successfully |
| 99 | +2025-11-16 22:04:26 - worker_0 - INFO - Worker 0: Processing 3 samples... |
| 100 | +2025-11-16 22:04:27 - worker_0 - INFO - Worker 0: Completed 3 samples in 1.5s (1.94 videos/sec) - Success: 3, Failed: 0 |
| 101 | +``` |
| 102 | + |
| 103 | +This separation makes it easy to: |
| 104 | +- Track individual worker performance |
| 105 | +- Identify which worker encountered errors |
| 106 | +- Debug specific video processing issues |
| 107 | +- Analyze parallel execution patterns |
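
The example entries above follow a `timestamp - logger name - level - message` layout. A minimal sketch of a logger that produces the same layout, assuming the standard `logging` module is used (the script's actual setup is not shown in this README, and `make_logger` is a hypothetical helper):

```python
import io
import logging

# Layout inferred from the example log lines; the script's real format string may differ.
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def make_logger(name, stream, level="INFO"):
    """Hypothetical helper: one named logger per process, writing to its own stream."""
    logger = logging.getLogger(name)
    logger.setLevel(getattr(logging, level))
    logger.propagate = False  # keep these lines out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
    logger.addHandler(handler)
    return logger


# An in-memory buffer stands in for a per-worker file such as worker_0_*.log.
buffer = io.StringIO()
worker_log = make_logger("worker_0", buffer)
worker_log.info("Worker 0: Models loaded successfully")
worker_log.debug("skipped because the level is INFO")
log_text = buffer.getvalue()
print(log_text, end="")
```

To write to a real file, `logging.FileHandler(path)` can take the place of the `StreamHandler`.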
| 108 | + |
| 109 | +## Performance |
| 110 | + |
| 111 | +On a MacBook with an M-series chip: |
| 112 | +- **10 workers**: ~20 GB RAM usage, a good fit for 32 GB systems |
| 113 | +- **8 workers**: ~16 GB RAM usage; this leaves no headroom on a 16 GB system, so use fewer workers there |
| 114 | +- **4 workers**: ~8 GB RAM usage; likewise too tight for an 8 GB system, but comfortable on 16 GB |
| 115 | + |
| 116 | +Processing time depends on: |
| 117 | +- Number of workers |
| 118 | +- Video resolution and length |
| 119 | +- Model size |
| 120 | +- Disk I/O speed |
| 121 | + |
| 122 | +Typical throughput: ~2-5 videos per second per worker |
| 123 | + |
| 124 | +## Technical Notes |
| 125 | + |
| 126 | +### Why Multiprocessing Instead of Multithreading? |
| 127 | + |
| 128 | +MLX uses GPU command buffers that cannot be safely shared across threads. Using separate processes ensures each worker has its own MLX context and GPU resources. |
| 129 | + |
| 130 | +### Memory Considerations |
| 131 | + |
| 132 | +Each worker loads a complete copy of both the encoder (~1.5 GB) and the classifier (~0.5 GB), about 2 GB in total. Monitor memory usage with: |
| 133 | + |
| 134 | +```bash |
| 135 | +# macOS: pgrep may match several processes, so pass each PID to top with its own -pid flag |
| 136 | +top $(pgrep -f evaluate_ssv2_multithreaded | sed 's/^/-pid /') |
| 137 | + |
| 138 | +# Or use Activity Monitor |
| 139 | +``` |
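
The per-worker sizes above give a quick way to estimate the total footprint before launching. A rough sketch using only the sizes quoted in this README; `estimated_memory_gb` and `max_workers_within` are hypothetical helpers, and the budget passed in is your own choice, not something the script enforces:

```python
ENCODER_GB = 1.5     # approximate encoder size quoted above
CLASSIFIER_GB = 0.5  # approximate classifier size quoted above
PER_WORKER_GB = ENCODER_GB + CLASSIFIER_GB


def estimated_memory_gb(num_workers):
    """Total model memory if every worker holds its own full copy."""
    return num_workers * PER_WORKER_GB


def max_workers_within(budget_gb):
    """Largest worker count whose estimated footprint fits the budget (at least 1)."""
    return max(1, int(budget_gb // PER_WORKER_GB))


print(estimated_memory_gb(10))  # 20.0
print(max_workers_within(20))   # 10
```

The budget should be lower than the machine's total RAM so the operating system and video decoding have room to work.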
| 140 | + |
| 141 | +### Batch Distribution |
| 142 | + |
| 143 | +The script automatically splits the test set into near-equal batches, one per worker. When the sample count does not divide evenly, the first few workers each receive one extra sample, so batch sizes differ by at most one. |
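
One way to implement this kind of split, shown as a sketch rather than the script's actual code (`split_batches` is a hypothetical name): every worker gets `len(samples) // num_workers` items, and the first `len(samples) % num_workers` workers get one more.

```python
def split_batches(samples, num_workers):
    """Split samples into num_workers contiguous batches whose sizes differ by at most 1."""
    base, extra = divmod(len(samples), num_workers)
    batches = []
    start = 0
    for worker_id in range(num_workers):
        # The first `extra` workers absorb the remainder, one sample each.
        size = base + (1 if worker_id < extra else 0)
        batches.append(samples[start:start + size])
        start += size
    return batches


sizes = [len(b) for b in split_batches(list(range(27158)), 10)]
print(sizes)
```

For the 27158-sample test set and 10 workers this gives eight batches of 2716 samples followed by two of 2715, which agrees with the `Worker 0: 2716 samples` line in the example output.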
| 144 | + |
| 145 | +## Troubleshooting |
| 146 | + |
| 147 | +**Out of Memory Error**: Reduce the number of workers |
| 148 | +```bash |
| 149 | +NUM_WORKERS=4 ./scripts/evaluate_ssv2_mt.sh |
| 150 | +``` |
| 151 | + |
| 152 | +**Slow Performance**: Check disk I/O and ensure videos are on a fast drive (SSD recommended) |
| 153 | + |
| 154 | +**Process Crashes**: Ensure you have enough available RAM and that no other heavy processes are running |
| 155 | + |
| 156 | +## Example Output |
| 157 | + |
| 158 | +``` |
| 159 | +Starting multithreaded evaluation with 10 workers |
| 160 | +MLX device: Device(gpu, 0) |
| 161 | +Expected memory per worker: ~2 GB |
| 162 | +Total expected memory: ~20 GB |
| 163 | + |
| 164 | +Loading labels... |
| 165 | +Loaded 174 classes |
| 166 | +Loading test data... |
| 167 | +Loaded 27158 test samples |
| 168 | + |
| 169 | +Split data into 10 batches |
| 170 | + Worker 0: 2716 samples |
| 171 | + Worker 1: 2716 samples |
| 172 | + ... |
| 173 | + |
| 174 | +Starting evaluation... |
| 175 | +Evaluating: 100%|██████████| 27158/27158 [15:32<00:00, 29.12it/s] |
| 176 | + |
| 177 | +Evaluation complete! |
| 178 | +Successfully evaluated: 27158 samples |
| 179 | +Failed: 0 samples |
| 180 | + |
| 181 | +================================================================================ |
| 182 | +EVALUATION RESULTS |
| 183 | +================================================================================ |
| 184 | + |
| 185 | +Test Set Accuracy: 0.6542 (65.42%) |
| 186 | +Correct Predictions: 17765 / 27158 |
| 187 | + |
| 188 | +Macro Average: |
| 189 | + Precision: 0.6234 |
| 190 | + Recall: 0.6189 |
| 191 | + F1-Score: 0.6211 |
| 192 | + |
| 193 | +Weighted Average: |
| 194 | + Precision: 0.6498 |
| 195 | + Recall: 0.6542 |
| 196 | + F1-Score: 0.6519 |
| 197 | +``` |
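
The macro and weighted averages in the report differ only in how per-class scores are combined. A pure-Python sketch on a toy 3-class confusion matrix (the matrix and the `per_class_recall` helper are made up for illustration; per this README, the real report comes from scikit-learn) shows why weighted recall equals accuracy, as it does in the output above (0.6542):

```python
# Toy confusion matrix: rows are true classes, columns are predicted classes.
confusion = [
    [8, 1, 1],  # 10 samples of class 0
    [2, 6, 2],  # 10 samples of class 1
    [0, 1, 3],  # 4 samples of class 2
]


def per_class_recall(matrix):
    """Recall per class: correct predictions divided by true samples of that class."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


support = [sum(row) for row in confusion]  # true samples per class
total = sum(support)
recalls = per_class_recall(confusion)

accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total
macro_recall = sum(recalls) / len(recalls)  # every class counts equally
weighted_recall = sum(r * s for r, s in zip(recalls, support)) / total  # weighted by support

print(f"accuracy={accuracy:.4f} macro={macro_recall:.4f} weighted={weighted_recall:.4f}")
```

Macro averages treat rare and common classes alike, so they move away from accuracy when performance varies across classes of different sizes.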