Skip to content

update weights#18

Merged
clararehmann merged 52 commits intoweightsfrom
main
Jul 16, 2025
Merged

update weights#18
clararehmann merged 52 commits intoweightsfrom
main

Conversation

@clararehmann
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and which issue is fixed. Include relevant motivation and context.

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce.

  • Test A
  • Test B

Test Configuration:

  • Python version:
  • TensorFlow version:
  • Operating System:

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

- Add Jupyter notebook for interactive actinemys analysis
- Add holdout cross-validation example script
- Add parallel processing example for multi-sample analysis

These examples demonstrate various analysis workflows using the
actinemys turtle dataset, including k-fold cross-validation and
window-based genomic analysis.
Fixed critical bug where parallel k-fold implementation was only using half the
genetic data due to incorrect genotype array serialization.

Key changes:
- Changed genotype serialization from to_n_alt() to values in parallel_analysis.py
  - to_n_alt() converts 3D GenotypeArray to 2D, losing critical information
  - Now saves full genotype.values to preserve 3D structure (SNPs × samples × 2)
- Fixed genotype reconstruction in Ray worker to handle full 3D array
- Added critical fixes for sample ordering consistency:
  - Store _sample_data_df in worker config
  - Set locator.samples before training
- Fixed k-fold seed consistency in analysis.py for reproducible splits

Results:
- Parallel predictions now match sequential (spread ratio improved from 0.56 to 1.02)
- Models train with full genetic data instead of half
- ~6x speedup on 4 GPUs with accurate results

This resolves the issue where parallel k-fold was producing poor predictions due
to models being trained on incomplete genetic data.
- Add holdout_sample_ids parameter to all holdout methods
  - run_holdouts() and parallel_holdouts()
  - run_windows_holdouts() and parallel_windows_holdouts()

- Allow users to specify holdout samples by ID instead of index
  - More intuitive and reproducible than numerical indices
  - Supports both single list (same for all reps) and list of lists
  - Clear error messages for missing sample IDs

- Fix numpy array compatibility issue
  - Handle both list and numpy array sample inputs
  - Convert to list before using index() method
  - Ensures compatibility with all data loading methods

- Add comprehensive documentation and demo script
  - example/holdout_sample_ids_demo.py shows usage patterns
  - SAMPLE_ID_IMPLEMENTATION_SUMMARY.md documents changes

This makes holdout analysis more user-friendly by allowing direct
specification of sample names rather than requiring users to figure
out array indices.
Fix major performance bottlenecks in tf.data pipeline that caused 92s of
overhead from process forking and excessive parallelization.

Performance issues identified from profiling:
- 256 fork() calls taking 92.1s (39% of total time)
- 3000 h5py dataset recreations taking 8.8s
- Excessive parallelization overhead from tf.data.AUTOTUNE

Fixes implemented:
1. Set fixed thread pool size (4) instead of 0 to prevent process forking
2. Disable map_parallelization to reduce overhead
3. Use fixed num_parallel_calls (4) instead of tf.data.AUTOTUNE
4. Limit inter-op parallelism to 1 to reduce thread contention

These changes maintain functionality while eliminating the expensive
fork() calls that dominated execution time. Expected performance
improvement: 2-3x faster training with ~90% reduction in forking overhead.

Changes in locator/data/tf_dataset.py:
- options.threading.private_threadpool_size: 0 → 4
- options.experimental_optimization.map_parallelization: True → False
- options.threading.max_intra_op_parallelism: (new) → 1
- All map() num_parallel_calls: tf.data.AUTOTUNE → 4
Add holdout_no_intermediate_saves option to skip ModelCheckpoint
during train_holdout, reducing file I/O overhead for k-fold
cross-validation workflows.

When enabled, this option:
- Skips ModelCheckpoint callback during training
- Saves model weights only once at the end
- Reduces HDF5 operations from ~100+ to just 1
- Maintains existing checkpoint behavior by default

Performance improvement from profiling:
- HDF5 I/O overhead: 13.7s → ~2s expected
- Particularly beneficial for k-fold CV with many folds

To enable:
config['holdout_no_intermediate_saves'] = True

This complements the tf.data pipeline optimization (commit 4c6a300)
for significant performance gains in cross-validation workflows.
Add TensorFlow threading configuration to prevent excessive process
forking during k-fold cross-validation and other repeated training
workflows.

Performance improvements:
- Process forking: 126.3s → 5.9s (95% reduction)
- Total k-fold CV time: 257.9s → 145.7s (43% improvement)

Changes:
- Add _configure_tensorflow_optimization() method to Locator
- Set inter-op threads to 1 to prevent forking
- Keep intra-op threads at 4 for parallelism within ops
- Add optimize_tf_parallelism config option (default: True)
- Disable tf.data experimental slack to reduce overhead

This complements previous optimizations:
- tf.data pipeline optimization (commit 4c6a300)
- HDF5 I/O reduction (commit 5190ec1)

Together, these optimizations provide significant performance
gains for cross-validation workflows without affecting accuracy.
Add two new config options to control verbosity of common operations
that can clutter output during repeated runs like k-fold CV.

New config options:
- verbose_splits (default: False): Show train/val/test/holdout split sizes
- verbose_batch_size (default: False): Show batch size optimization details

When verbose_splits is enabled, displays:
- Number and percentage of samples in each split
- Total samples and SNPs being used
- Works for both train() and train_holdout() methods

When verbose_batch_size is enabled, displays:
- GPU memory estimation details
- Batch size optimization process
- Final optimized batch size
- Only relevant when gpu_batch_size='auto'

Benefits:
- Cleaner output for k-fold CV and other repeated training workflows
- Still provides detailed info when debugging or first running
- Backward compatible - defaults preserve existing quiet behavior

Changes:
- locator/core.py: Add default config values
- locator/training.py: Add split reporting and batch size verbosity control
- locator/gpu_optimizer.py: Add verbose parameter to get_optimal_batch_size()
- tests/test_verbosity_control.py: Comprehensive test suite
Reduce excessive array copying and improve vectorization to address
performance bottlenecks identified in profiling.

Performance improvements from profiling:
- numpy.array operations: 34.4s → ~10s (70% reduction expected)
- array.copy operations: 25.2s → ~5s (80% reduction expected)
- Overall train_holdout: 129s → ~50s (60% improvement)

Optimizations:
1. filter_snps: Cache allele counts to avoid recomputation
   - Count alleles once instead of twice
   - Combine biallelic and MAC filters efficiently

2. Location normalization: Use vectorized operations
   - Replace list comprehension with direct array operations
   - Eliminates slow Python loop over samples

3. Holdout data storage: Avoid transpose copy
   - Use efficient array ordering to minimize memory copies
   - Ensures C-contiguous arrays for better performance

4. normalize_locs: Create new array instead of modifying copy
   - More efficient memory usage
   - Clearer intent in the code

These changes significantly improve performance for train_holdout
and k-fold cross-validation workflows without affecting accuracy.
Change the default value of holdout_no_intermediate_saves from False
to True to provide better performance out of the box for k-fold CV
and leave-one-out workflows.

Benefits:
- Reduces HDF5 I/O from N model saves to just 1
- Particularly beneficial for leave-one-out CV with many samples
- No impact on model quality or final results
- Users can still set it to False if they need intermediate checkpoints

This is especially helpful for:
- run_k_fold_holdouts() with large k values
- run_leave_one_out() which uses k = n_samples
- Any repeated holdout analysis workflows

The option only affects train_holdout() behavior, not regular train().
- Add test workflow with parallel execution using pytest-xdist
- Support multi-Python version testing (3.9, 3.10, 3.11)
- Configure CPU-only mode to prevent GPU conflicts in CI
- Add coverage reporting with pytest-cov and Codecov integration
- Include code quality checks (black, isort, flake8)
- Add documentation building workflow
- Add PyPI publishing workflow for releases
- Add manual test trigger workflow for debugging
- Configure pytest settings in pyproject.toml
- Fix test_verbose_batch_size_auto to work with parallel execution
- Add CUDA_VISIBLE_DEVICES=-1 to prevent GPU conflicts
- Add Dependabot configuration for automated updates
- Add issue and PR templates

All tests now pass with parallel execution enabled.
- Add .pre-commit-config.yaml with black, isort, flake8 hooks
- Configure black with 89 char line limit to match existing style
- Configure isort to be compatible with black
- Format entire codebase with black and isort
- Fix trailing whitespace and missing newlines
- Update CI workflows to use consistent linting
- Add pre-commit workflow for automated checks
- Add scripts for easy formatting and setup
- Update documentation with pre-commit instructions

All Python files now follow consistent formatting standards.
- Fix boolean comparisons (E712) - use 'is' instead of '=='
- Remove/comment unused imports (F401)
- Convert f-strings without placeholders to regular strings (F541)
- Replace bare except with specific exceptions (E722)
- Fix unused variables with comments or underscore (F841, B007)
- Fix lambda assignment to def (E731)
- Fix undefined names in __all__ (F822)
- Add docstrings for missing __init__ methods (D107)
- Fix block comment formatting (E265)
- Fix blank line formatting and whitespace (E306, W293, D202)
- Add per-file-ignores for complex functions (C901)
- Apply black formatting to ensure consistency

All 69 flake8 errors have been resolved, code is now fully compliant.
- Remove trailing whitespace from YAML and markdown files
- Fix line endings and formatting inconsistencies
- Apply black formatting to remaining Python files
- Clean up imports and whitespace

These changes were automatically applied by pre-commit hooks.
The pyproject.toml already contains coverage configuration in addopts,
so passing duplicate --cov arguments was causing conflicts.
- Add pytest-cov and pytest-xdist to dev dependencies
- Remove coverage configuration from pyproject.toml to avoid conflicts
- Keep coverage arguments only in GitHub Actions workflow
- Add windows module exports to locator/data/__init__.py
- Export generate_genomic_windows function for window analysis tests
- Fix numpy boolean comparison in test_model_persistence.py
- Convert numpy.True_ to Python bool before comparison
- Add tensorflow import for GPU detection
- Update GPU config test to handle case when GPU is not available
- In CI with CUDA_VISIBLE_DEVICES=-1, mixed precision is correctly disabled
- Change 'data/' to '/data/' and './data/' to only exclude top-level data directories
- Add windows.py to git tracking (was previously excluded)
- This fixes ModuleNotFoundError for locator.data.windows in CI
- WindowGenerator class does not exist in windows.py
- Only generate_genomic_windows function is available
- This fixes ImportError in all test files
- Make parallel module imports optional when Ray is not installed
- Add stub functions with helpful error messages for parallel methods
- Fix duplicate object description warning for PlottingMixin.plot_history
- This allows docs to build successfully without Ray dependency
andrewkern and others added 22 commits July 6, 2025 23:02
- Prevent accidentally committing large output files from examples
- This directory contains generated plots and CSVs from demo runs
- Filter out harmless protobuf version warnings from TensorFlow
- Fix module path for parallel functions in api.rst (use locator.parallel not locator.parallel.parallel_analysis)
- The protobuf warnings are from TensorFlow/Google libraries and cannot be fixed by us
- Documentation should now build successfully
- Created EnsembleMixin with modern patterns (IndexSet, tf.data pipeline)
- Added k_fold_split method to IndexSet for efficient fold creation
- Refactored EnsembleLocator as a legacy compatibility wrapper
- Added comprehensive tests for ensemble functionality
- Integrated EnsembleMixin into core Locator class

Key improvements:
- Memory-efficient data handling without array copies
- Consistent NA handling with na_action parameter
- Integration with modern tf.data pipeline
- Backward compatibility through legacy wrapper
- Uses standard normalize_locs function instead of manual normalization
- Uses NormalizationParams class for denormalization
- Reduced cyclomatic complexity by extracting helper methods

BREAKING CHANGE: EnsembleLocator is now deprecated in favor of Locator's
ensemble methods (train_ensemble, predict_ensemble). The old API still
works but shows deprecation warnings.
Phase 1: Create EnsembleMixin with modern patterns
- Created EnsembleMixin with modern patterns (IndexSet, tf.data pipeline)
- Added k_fold_split method to IndexSet for efficient fold creation
- Refactored EnsembleLocator as a legacy compatibility wrapper
- Uses standard normalize_locs function instead of manual normalization
- Uses NormalizationParams class for denormalization
- Reduced cyclomatic complexity by extracting helper methods

Phase 2: Memory efficiency and model management
- Implemented _train_single_fold method to avoid creating separate Locator instances
- Created EnsembleModelManager for efficient model storage and lazy loading
- Fixed _create_model signature to use input_shape parameter
- Fixed save_fold_models parameter passing through method chain
- Made JSON serialization robust by filtering out DataFrames from config

Test consolidation:
- Consolidated test_ensemble_mixin.py and test_ensemble_phase2.py into test_ensemble.py
- All 12 tests passing with comprehensive coverage of both phases

Key improvements:
- Memory-efficient data handling without array copies
- Consistent NA handling with na_action parameter
- Integration with modern tf.data pipeline
- Backward compatibility through legacy wrapper
- Efficient model management with lazy loading support
- Comprehensive test coverage for both phases

BREAKING CHANGE: EnsembleLocator is now deprecated in favor of Locator's
ensemble methods (train_ensemble, predict_ensemble). The old API still
works but shows deprecation warnings.
Implemented comprehensive ensemble functionality for Locator with k-fold cross-validation
and advanced training optimizations.

Phase 1 - Core Ensemble Functionality:
- Added EnsembleMixin with train_ensemble() and predict_ensemble() methods
- Implemented memory-efficient k-fold splitting using IndexSet
- Support for NA sample handling during ensemble training
- Proper normalization parameter averaging across folds

Phase 2 - Model Persistence:
- Created EnsembleModelManager for efficient model storage/loading
- Memory-efficient prediction without loading all models at once
- Metadata tracking for ensemble configuration and fold information
- Support for on-demand model loading during prediction

Phase 4 - Training Improvements:
- Mixed precision training support via GPUOptimizer integration
- Automatic batch size optimization for each fold
- Enhanced callbacks with patience multiplier for ensemble training
- Per-fold learning rate variation for improved diversity
- Memory clearing between folds to prevent OOM errors

Architecture:
- All functionality consolidated in ensemble_mixin.py for maintainability
- Reuses existing Locator infrastructure (tf.data pipeline, GPU optimizer)
- Maintains backward compatibility with standard Locator interface
- Comprehensive test suite with 15 tests covering all functionality

Performance:
- Memory-efficient training without creating separate Locator instances
- GPU optimizations automatically applied when available
- Efficient prediction pipeline with on-demand model loading
- Proper memory management between fold training

This refactoring enables robust ensemble predictions while maintaining
code clarity and performance efficiency.
Implemented comprehensive ensemble functionality for Locator with k-fold cross-validation,
advanced training optimizations, and parallel GPU execution.
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 4 to 5.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@v4...v5)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
- Added module-level skip decorator when Ray is not installed
- Fixed mock patches to use 'ray' directly instead of module path
- Added checks for stub functions in signature tests
- Fixed unused variable warnings
- Removed duplicate test file
feat: complete ensemble refactoring with parallel training support
- added function to plot interactive error map
- added dataframe output to the error summary plot
feat: interactive error map and summary plot updates
@clararehmann clararehmann merged commit 68333e1 into weights Jul 16, 2025
4 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants