Release v0.9.0 - Performance Optimizations and Architectural Improvements · OldCrow/libstats

🚀 DUAL API BATCH PROCESSING SYSTEM

NEW: Auto-dispatch batch processing API with intelligent strategy selection
NEW: Power-user explicit strategy control for fine-tuned performance optimization
SIMD and parallel processing strategies automatically selected based on data size and CPU capabilities
Performance hints system for guiding optimization decisions (MINIMIZE_LATENCY, MAXIMIZE_THROUGHPUT, etc.)
Thread-safe batch operations: getProbability(), getLogProbability(), getCumulativeProbability()
Comprehensive strategy options: SCALAR, SIMD_BATCH, PARALLEL_SIMD, WORK_STEALING, CACHE_AWARE
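
As a rough illustration of how size- and capability-based dispatch can work, here is a minimal sketch. It is not the libstats implementation or API: `selectStrategy`, `CpuInfo`, `Hint`, and both thresholds are hypothetical names and placeholder values chosen for this example. Only the strategy and hint names are taken from the notes above.

```cpp
#include <cassert>
#include <cstddef>

// Illustration only: every type and function here is hypothetical,
// not part of the libstats API.
enum class Strategy { SCALAR, SIMD_BATCH, PARALLEL_SIMD, WORK_STEALING, CACHE_AWARE };
enum class Hint { NONE, MINIMIZE_LATENCY, MAXIMIZE_THROUGHPUT };

struct CpuInfo {
    bool hasSimd;        // a SIMD instruction set was detected at runtime
    unsigned coreCount;  // number of logical cores available
};

// Placeholder thresholds (assumptions, not measured values).
constexpr std::size_t kSimdMinSize = 8;
constexpr std::size_t kParallelMinSize = 10000;

// Pick a strategy from the batch size, the CPU capabilities, and an optional hint.
Strategy selectStrategy(std::size_t n, const CpuInfo& cpu, Hint hint = Hint::NONE) {
    assert(cpu.coreCount >= 1);

    // Tiny batches: vector setup cost outweighs any gain.
    if (n < kSimdMinSize) {
        return Strategy::SCALAR;
    }

    // Stay on one thread when the batch is small, the machine has a single
    // core, or the caller asked for low latency (thread start-up adds delay).
    const bool singleThreaded = n < kParallelMinSize
                             || cpu.coreCount == 1
                             || hint == Hint::MINIMIZE_LATENCY;
    if (singleThreaded) {
        return cpu.hasSimd ? Strategy::SIMD_BATCH : Strategy::SCALAR;
    }

    // Large batch on a multi-core machine. A fuller dispatcher would also
    // weigh WORK_STEALING and CACHE_AWARE here; this sketch does not.
    return Strategy::PARALLEL_SIMD;
}
```

In this sketch an auto-dispatch entry point would call something like `selectStrategy` internally, while the explicit power-user path would skip the selection and take a `Strategy` value directly from the caller.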

📐 HEADER ARCHITECTURE CONSOLIDATION

MAJOR: Consolidated header architecture reducing redundant includes by ~60%
NEW: Modular header system with clear dependency levels (0-6)
NEW: Consolidated convenience headers: distribution_common.h, distribution_platform_common.h
Enhanced build performance through better header organization and dependency management
Maintained backward compatibility while optimizing compilation efficiency

📚 DOCUMENTATION OVERHAUL

MAJOR: Streamlined README.md into a concise overview that points to the comprehensive documentation guides
NEW: Four detailed documentation guides covering all aspects of the library:
- BUILD_SYSTEM_GUIDE.md - Complete build system, cross-platform support, SIMD detection
- HEADER_ARCHITECTURE_GUIDE.md - Modular headers, dependency management, usage patterns
- PARALLEL_BATCH_PROCESSING_GUIDE.md - High-performance APIs, optimization guidelines
- WINDOWS_SUPPORT_GUIDE.md - Windows development environment support
Clear separation between quick-start content and detailed reference material

✅ BUILD SYSTEM ENHANCEMENTS

Enhanced CMake configuration with better error handling and cross-platform support
Improved parallel build detection and automatic optimization
Better SIMD detection and configuration across platforms
Comprehensive threading system detection (TBB, OpenMP, pthreads, GCD, Windows Thread Pool)

🎯 PERFORMANCE IMPROVEMENTS

Intelligent auto-dispatch eliminates the need for manual performance optimization in most cases
SIMD optimization: 2-70× speedup for suitable operations, depending on distribution complexity
Parallel processing: up to N× speedup for large batch operations, where N is the number of CPU cores
Work-stealing thread pools provide superior load balancing for irregular workloads
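
To show where the "up to N×" figure comes from, the sketch below splits a batch into one contiguous chunk per thread using only the C++ standard library. It is an assumption-laden illustration, not libstats code: `parallelPdf` and `normalPdf` are hypothetical helpers, and static chunking stands in for the library's work-stealing pools.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Standard normal density; stands in for any per-element distribution call.
inline double normalPdf(double x) {
    constexpr double kInvSqrt2Pi = 0.3989422804014327;
    return kInvSqrt2Pi * std::exp(-0.5 * x * x);
}

// Hypothetical helper (not a libstats function): evaluate the density for
// every element of `in`, giving each thread one contiguous chunk.
void parallelPdf(const std::vector<double>& in, std::vector<double>& out,
                 unsigned threads = std::thread::hardware_concurrency()) {
    assert(in.size() == out.size());
    if (threads == 0) threads = 1;  // hardware_concurrency() may return 0

    const std::size_t n = in.size();
    const std::size_t chunk = (n + threads - 1) / threads;  // ceiling division

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no elements left for further threads

        // Chunks do not overlap, so the writes to `out` need no locking.
        workers.emplace_back([&in, &out, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                out[i] = normalPdf(in[i]);
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

With equal-cost elements each of the N threads does about 1/N of the work, which is the best case behind the N× bound. Thread start-up and memory bandwidth keep real speedups below it, and uneven per-element cost leaves some threads idle under static chunking, which is the situation work stealing addresses.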
