
Serendipity Finder

Advanced Pattern Discovery in Distribution Extremes - Finding Correlations Where Others Don't Look


Author: Cazandra Aporbo, MS 2025 | Contact: becaziam@gmail.com


Table of Contents

  1. Executive Summary
  2. Core Concept
  3. Technical Implementation
  4. Skills Demonstrated
  5. Installation
  6. Usage
  7. Example Results
  8. File Documentation
  9. Mathematical Foundation
  10. Applications
  11. Performance Characteristics
  12. Future Enhancements
  13. Citation
  14. Contact
  15. License


Executive Summary


Serendipity Finder is an advanced data-analysis framework designed to surface hidden relationships in the tails of distributions—patterns that classic correlation methods systematically miss. It focuses on extreme conditions and edge cases where real scientific breakthroughs often emerge, prioritizing outliers over averages to reveal signal that’s invisible to mean-centered analytics.

What this is (and why)

This project is more than code. It’s a learning system that improves with each iteration—documenting reasoning, trade-offs, and experiments so others can follow (or challenge) the path to better models. Every release includes notes on what changed and why, not just what “worked.”


Transparency & authorship

  • The original baseline file was generated by AI to bootstrap ideas quickly.
  • Subsequent structured versions are authored and edited by me, with clearer architecture, documentation, and guardrails.
  • Later production-ready versions will be fully human-written, refactored for clarity and rigor, and backed by empirical tests on real datasets.

Versioning philosophy

  • v1: AI-generated baseline (exploration, scaffolding, quick prototypes).
  • v2: Human-structured refactor (clear modules, docstrings, tests begin).
  • v3 (current): Human-led iterations + early dataset trials and evaluation harnesses.
  • v4+: Real-world datasets, stronger validation, benchmarks, and reproducible reports.

Current status: v3. I’m testing initial data sources and validation workflows. Real datasets will be integrated next, with accompanying methods, metrics, and ablation studies.


How it works (short version)

Instead of relying on global correlation coefficients, Serendipity Finder takes four steps (a toy sketch follows the list):

  1. Probes distribution tails and conditional regimes.
  2. Runs localized tests where standard assumptions break.
  3. Compares stability across subsamples to avoid “lucky” artifacts.
  4. Reports candidate relationships with uncertainty bounds and falsification prompts.
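
The sketch below is a standalone toy version of steps 1-3 on synthetic data; every name in it is hypothetical, and it illustrates the idea rather than the package's internal code:

```python
# Toy sketch: tail-restricted correlation plus a subsample-stability check
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2000)

# Couple y to x only inside x's lower tail; centering the tail signal
# keeps the global correlation near zero
lower = x <= np.quantile(x, 0.15)
signal = np.zeros_like(x)
signal[lower] = 2.0 * (x[lower] - x[lower].mean())
y = signal + rng.normal(size=x.size)

def lower_tail_corr(x, y, q=0.15):
    """Pearson correlation restricted to the lower tail of x."""
    mask = x <= np.quantile(x, q)
    return np.corrcoef(x[mask], y[mask])[0, 1]

# Step 3: stability across random subsamples; a 'lucky' artifact would vary wildly
corrs = [lower_tail_corr(x[i], y[i])
         for i in (rng.choice(x.size, x.size // 2, replace=False) for _ in range(200))]
print(f"global: {np.corrcoef(x, y)[0, 1]:+.2f}   "
      f"lower tail: {np.mean(corrs):+.2f} ± {np.std(corrs):.2f}")
```

A relationship that survives 200 random half-samples with a tight spread is far less likely to be a sampling artifact.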

Responsible use

Tail discoveries are powerful and risky. All surfaced relationships are hypotheses, not causal claims. Use with rigorous validation, domain review, and ethical safeguards. I’ll document failure modes and negative results as the project matures.

Roadmap (high level)

  • Expand tail-focused estimators and stress tests
  • Add dataset loaders + reproducible evaluation reports
  • Publish benchmarks and examples (notebooks + CLI)
  • Harden tests and CI for reliability

Follow along: I’ll keep commits and release notes detailed so anyone can trace decisions, reproduce results, and suggest better approaches.


Candidate Real-World Datasets

1. S&P 500 Stock Returns Dataset (Best for Financial Tail Dependencies)

  • Source: Yahoo Finance API or CRSP database
  • Why it's perfect: During the 2008 crisis, correlations between normally uncorrelated stocks jumped from 0.3 to 0.9+; the dataset captures real contagion effects
  • What to look for: Tech vs. finance sector correlations, international market spillovers
  • Size: Daily data from 2000 to present (~6,000 days × 500 stocks)
  • Access: yfinance Python package or Kaggle datasets

2. FDA Adverse Event Reporting System (FAERS)

  • Source: FDA OpenFDA API (https://open.fda.gov/apis/)
  • Why it's perfect: Contains millions of adverse drug event reports in which serious side effects often appear only in specific patient subgroups (exactly the pattern this tool targets)
  • What to look for: Drug dose vs. adverse events by age group, drug interactions
  • Size: 15+ million reports
  • Real discovery potential: This is where actual drug recalls originate

3. NOAA Climate Data - Global Temperature Anomalies

  • Source: NOAA Climate Data Online (https://www.ncdc.noaa.gov/cdo-web/)
  • Why it's perfect: Contains documented tipping points and regime shifts in climate systems
  • What to look for: Temperature vs. hurricane intensity, Arctic ice vs. global temperatures
  • Size: Daily measurements from thousands of stations since 1880
  • Known patterns: El Niño/La Niña transitions show completely different correlations

4. Cryptocurrency Market Data

  • Source: CoinGecko API or CryptoCompare
  • Why it's perfect: Extreme volatility with documented "alt-season" effects where correlations flip
  • What to look for: Bitcoin dominance vs. altcoin correlations, stablecoin flows during crashes
  • Size: Minute-level data for 1,000+ coins
  • Interesting aspect: Correlations can swing from -0.5 to +0.9 during liquidation cascades

5. California Housing Prices Dataset (Enhanced Version)

  • Source: California census data + Zillow Research Data
  • Why it's perfect: Housing crashes show non-linear price relationships
  • What to look for: Price-to-income ratios in bubble conditions, coastal vs. inland divergence
  • Size: 20,000+ observations with temporal data
  • Real pattern: 2008 showed a complete correlation breakdown between traditionally linked areas

6. MIMIC-III Clinical Database

  • Source: PhysioNet (requires an ethics training certificate)
  • Why it's perfect: ICU data in which vital-sign correlations change dramatically near critical events
  • What to look for: Blood pressure vs. heart rate in shock states, lab values before organ failure
  • Size: 40,000+ ICU stays
  • Clinical value: These patterns predict patient deterioration

Recommended starting point:

```python
# Example with S&P 500 data
import yfinance as yf
import pandas as pd

# Get S&P 500 tickers (subset for demo)
sp500_tickers = ['AAPL', 'MSFT', 'JPM', 'BAC', 'XOM']

# Download closing prices spanning the 2008 crisis
data = yf.download(sp500_tickers, start='2007-01-01', end='2010-01-01')['Close']

# Calculate daily returns
returns = data.pct_change().dropna()
```

This dataset will show:

- Normal times (2007): Low cross-sector correlations

- Crisis (late 2008): Everything correlates strongly

- The tool surfaces this regime shift without being told when the crisis occurred (the rolling-correlation sketch below shows it directly)
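
Assuming the `returns` frame from the snippet above downloaded successfully, a rolling correlation between one tech and one finance name makes the shift visible directly:

```python
# Rolling 60-day correlation between a tech and a finance stock;
# expect low values through 2007 and a spike in late 2008
rolling_corr = returns['AAPL'].rolling(60).corr(returns['JPM'])
print(rolling_corr.loc['2007'].mean())               # calm regime
print(rolling_corr.loc['2008-09':'2009-03'].mean())  # crisis regime
```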

Data Preprocessing Tips:

  • Financial data: use log returns rather than raw prices
  • Medical data: normalize by population demographics
  • Climate data: detrend seasonal patterns first
  • All datasets: check for survivorship bias and missing-data patterns

Two of these tips are sketched below.
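
A minimal pandas sketch of the log-return and detrending tips; the `prices` and `temps` series below are synthetic stand-ins invented purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for a real price series and a real temperature series
idx = pd.date_range('2000-01-01', periods=1500, freq='D')
prices = pd.Series(100 * np.exp(rng.normal(0, 0.01, len(idx)).cumsum()), index=idx)
season = 8 * np.sin(2 * np.pi * idx.dayofyear.to_numpy() / 365.25)
temps = pd.Series(10 + season + rng.normal(0, 1, len(idx)), index=idx)

# Financial data: log returns rather than raw prices
log_returns = np.log(prices).diff().dropna()

# Climate data: subtract a day-of-year climatology to remove the seasonal cycle
climatology = temps.groupby(idx.dayofyear).transform('mean')
anomalies = temps - climatology
```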

Why These Beat Synthetic Data:

  • Regulatory decisions are made on these datasets (FDA, SEC)
  • Published papers have validated the tail patterns
  • Timestamps let you verify discoveries against known events
  • Their size provides the statistical power needed for bootstrap confidence intervals (sketched below)
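
On that last point, a minimal sketch of a percentile-bootstrap confidence interval for a tail correlation, using synthetic data and hypothetical names:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=(2, 5000))

# Restrict to the lower tail of x, then resample that region with replacement
tail = x <= np.quantile(x, 0.15)
xt, yt = x[tail], y[tail]
boot = []
for _ in range(1000):
    i = rng.integers(0, xt.size, size=xt.size)
    boot.append(np.corrcoef(xt[i], yt[i])[0, 1])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI for the lower-tail correlation: [{lo:.2f}, {hi:.2f}]")
```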

The S&P 500 dataset during 2007-2009 is probably the best showcase: the correlation-structure changes are dramatic, well documented in the academic literature, and have clear real-world implications for portfolio risk management.

Philosophy: The most interesting discoveries happen not in the average, but in the extremes.


Core Concept

Traditional correlation analysis examines relationships across entire datasets, often missing critical patterns that only appear in extreme conditions. The Serendipity Finder addresses this limitation by:

  • Separating data into core and tail regions
  • Computing correlations within each region independently
  • Identifying cases where tail correlations are strong despite weak global correlations
  • Quantifying the significance of these discoveries (a minimal sketch of these steps follows)
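
A minimal sketch of those four steps for a single feature pair, assuming NumPy arrays; the actual implementation in serendipity_finder.py may differ in detail:

```python
import numpy as np

def tail_correlations(x, y, q=0.15):
    """Global vs. tail-region Pearson correlations for one feature pair."""
    lower = x <= np.quantile(x, q)          # lower-tail region of x
    upper = x >= np.quantile(x, 1 - q)      # upper-tail region of x
    g = np.corrcoef(x, y)[0, 1]
    lo = np.corrcoef(x[lower], y[lower])[0, 1]
    hi = np.corrcoef(x[upper], y[upper])[0, 1]
    # A strong tail correlation despite a weak global correlation scores highest
    score = (1 - abs(g)) * max(abs(lo), abs(hi))
    return g, lo, hi, score
```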


Technical Implementation

Python Module: serendipity_finder.py

The core Python implementation provides a complete framework for serendipitous discovery.

Class Structure

```python
class SerendipityFinder:
    def __init__(self, tail_threshold: float = 0.15, correlation_threshold: float = 0.6)
    def generate_serendipitous_data(n_samples, n_features, hidden_pairs, noise_level)
    def find_hidden_correlations(data)
    def visualize_discovery(feature1, feature2, save_path)
    def generate_report()
    def export_discoveries(filepath)
```

Key Methods

| Method | Purpose | Technical Details |
| --- | --- | --- |
| `generate_serendipitous_data()` | Creates synthetic data with hidden tail patterns | Uses NumPy to inject strong correlations in distribution extremes while keeping the global correlation weak |
| `find_hidden_correlations()` | Discovers hidden patterns | Computes Pearson correlations for the global, lower-tail (≤ 15th percentile), and upper-tail (≥ 85th percentile) regions |
| `_calculate_significance()` | Quantifies discovery importance | Score = (1 - \|ρ_global\|) × max(\|ρ_lower\|, \|ρ_upper\|) |
| `visualize_discovery()` | Creates a three-panel visualization | Uses Matplotlib to contrast the global view with both tail patterns, including trend lines |
| `generate_report()` | Produces a formatted text report | Ranks discoveries by significance score |
| `export_discoveries()` | Saves findings to CSV | Enables further analysis in other tools |
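
One plausible way to plant such a tail-only pattern, in the spirit of what generate_serendipitous_data() is described as doing (a sketch, not the method's actual code):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=1000)
y = rng.normal(size=1000)                 # pure noise outside the tail

# Couple y to x only in the upper tail; centering the tail signal
# keeps the global correlation close to zero
upper = x >= np.quantile(x, 0.85)
y[upper] = 2.0 * (x[upper] - x[upper].mean()) + 0.5 * rng.normal(size=upper.sum())

print("global:", round(np.corrcoef(x, y)[0, 1], 3))                    # weak
print("upper tail:", round(np.corrcoef(x[upper], y[upper])[0, 1], 3))  # strong
```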

HTML Visualization: serendipity_visualization.html

The interactive HTML dashboard provides:

Technical Components

| Component | Technology | Functionality |
| --- | --- | --- |
| Particle background | Particles.js | Creates a dynamic visual environment representing data points |
| Mathematical formulas | MathJax | Renders LaTeX equations for the statistical methods |
| Interactive plots | D3.js | Real-time scatter plots with correlation calculations |
| Statistical computing | simple-statistics.js | Client-side correlation and regression analysis |
| Parameter controls | Custom sliders | Adjust tail threshold (5-30%), noise level (10-90%), hidden strength (100-300%) |

Visualization Features

  1. Three-Panel Analysis: Simultaneous display of global, lower tail, and upper tail correlations
  2. Real-time Computation: Instant recalculation as parameters change
  3. Discovery Alerts: Visual notifications when serendipitous patterns are found
  4. Academic Context: Links to peer-reviewed research and mathematical foundations

Skills Demonstrated

Technical Competencies

| Skill Category | Specific Skills | Implementation Evidence |
| --- | --- | --- |
| Statistical analysis | Conditional correlation, tail analysis, significance testing | Region-specific correlation analysis with bootstrap confidence |
| Data science | Pattern discovery, anomaly detection, feature engineering | Creates derived features and identifies non-linear relationships |
| Software engineering | Object-oriented design, error handling, documentation | Clean class structure with comprehensive docstrings |
| Visualization | Multi-panel plots, interactive dashboards, real-time updates | Matplotlib for static plots, D3.js for dynamic visualizations |
| Mathematical modeling | Synthetic data generation, noise injection, correlation manipulation | Programmatically creates data with specific statistical properties |
| Web development | HTML5, CSS3, JavaScript ES6 | Responsive design with modern web technologies |

Research & Innovation

| Aspect | Description | Implementation |
| --- | --- | --- |
| Novel approach | Conditional correlation analysis | Separates tail behavior from global patterns |
| Scientific thinking | Hypothesis-driven discovery | Mimics how breakthroughs occur in extremes |
| Practical application | Risk management, drug discovery, climate science | Documented use cases with real examples |

Installation

Requirements

Python 3 with NumPy, Pandas, Matplotlib, Seaborn, and SciPy (see requirements.txt).

Setup

```bash
# Clone repository
git clone https://github.com/Cazzy-Aporbo/Serendipity-Finder.git
cd Serendipity-Finder

# Install dependencies
pip install -r requirements.txt

# Run demonstration
python serendipity_finder.py
```

Usage

Basic Example

```python
from serendipity_finder import SerendipityFinder

# Initialize finder
finder = SerendipityFinder(tail_threshold=0.15, correlation_threshold=0.6)

# Generate synthetic data with hidden patterns
data = finder.generate_serendipitous_data(n_samples=1000, n_features=10)

# Discover hidden correlations (pass the data, per the method signature)
discoveries = finder.find_hidden_correlations(data)

# Generate report
print(finder.generate_report())

# Visualize top discovery
finder.visualize_discovery('feature_00', 'feature_01')
```

With Your Own Data

```python
import pandas as pd
from serendipity_finder import SerendipityFinder

# Load your data
df = pd.read_csv('your_data.csv')

# Find hidden patterns
finder = SerendipityFinder()
discoveries = finder.find_hidden_correlations(df)

# Export results
finder.export_discoveries('discoveries.csv')
```

Example Results

Actual Output from Execution

When run on synthetic data with 1000 samples and 12 features:

```text
Initializing Serendipity Finder...
--------------------------------------------------
Generating synthetic data with hidden tail correlations...
   Created dataset with 1000 samples and 12 features

Searching for serendipitous discoveries...

DISCOVERIES FOUND:
   • Hidden tail correlations: 14
   • Inverted patterns: 0
```

Top Discoveries

| Feature Pair | Global Correlation | Lower Tail | Upper Tail | Significance Score |
| --- | --- | --- | --- | --- |
| feature_02 ↔ derived_product | -0.057 | -0.315 | -0.850 | 0.801 |
| feature_02 ↔ feature_07 | -0.031 | -0.556 | -0.639 | 0.619 |
| feature_00 ↔ feature_07 | -0.026 | -0.553 | -0.635 | 0.618 |
| feature_00 ↔ feature_09 | -0.038 | -0.635 | -0.589 | 0.611 |
| feature_01 ↔ derived_product | -0.036 | -0.632 | -0.448 | 0.609 |

Interpretation

These results demonstrate the core value proposition:

  • Global correlations near zero (-0.026 to -0.057) would lead traditional analysis to conclude "no relationship"
  • Tail correlations exceeding 0.6 (up to -0.850) reveal strong hidden relationships
  • Significance scores above 0.6 indicate highly serendipitous discoveries

File Documentation

Core Files

| File | Purpose | Key Features |
| --- | --- | --- |
| serendipity_finder.py | Main algorithmic implementation | Pattern detection, visualization, reporting |
| serendipity_visualization.html | Interactive exploration interface | Real-time analysis, parameter adjustment, visual storytelling |
| README.md | Technical documentation | Usage instructions, mathematical foundation, examples |

Generated Outputs

| File | Description | Format |
| --- | --- | --- |
| serendipity_discoveries.csv | Exported findings | CSV with correlations and significance scores |
| serendipity_discovery.png | Top discovery visualization | Three-panel Matplotlib figure |

Mathematical Foundation

Core Algorithms

Global Correlation

ρ_global = Cov(X,Y) / (σ_X × σ_Y)

Tail Correlation

ρ_tail = Cov(X,Y | X ∈ T or Y ∈ T) / (σ_X|T × σ_Y|T)

Where T represents the tail region (e.g., ≤15th or ≥85th percentile)

Serendipity Score

S = (1 - |ρ_global|) × max(|ρ_lower|, |ρ_upper|)

Statistical Significance

t = ρ_tail × sqrt((n_tail - 2) / (1 - ρ_tail²))
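
As a worked instance of this formula, take the top discovery's upper-tail correlation of -0.850 and assume the tail held roughly 150 of the 1,000 samples (the tail size is an assumption here; SciPy is used only for the p-value):

```python
import numpy as np
from scipy import stats

rho_tail, n_tail = -0.850, 150        # assumed tail size: ~15% of 1000 samples
t = rho_tail * np.sqrt((n_tail - 2) / (1 - rho_tail**2))
p = 2 * stats.t.sf(abs(t), df=n_tail - 2)   # two-sided p-value
print(f"t = {t:.1f}, p = {p:.1e}")
```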

Applications

Industry Use Cases

| Domain | Application | Value Proposition |
| --- | --- | --- |
| Finance | Risk management | Identify correlations that emerge during market stress |
| Healthcare | Drug safety | Detect side effects in specific patient subgroups |
| Climate | Tipping points | Find threshold behaviors in complex systems |
| Manufacturing | Quality control | Discover failure modes in extreme conditions |
| Research | Scientific discovery | Identify anomalies that lead to breakthroughs |

Performance Characteristics

| Metric | Value | Notes |
| --- | --- | --- |
| Time complexity | O(n² × m) | n features, m samples |
| Space complexity | O(n × m) | Stores the correlation matrix |
| Typical runtime | < 1 second | For 1,000 samples, 10 features |
| Scalability | Up to ~1M rows | Can be parallelized for larger datasets (see the sketch below) |
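
On the parallelization note, one straightforward pattern is to fan the feature-pair loop out across processes. The sketch below assumes per-pair work is heavy enough to amortize process overhead; for plain Pearson correlations alone, a single vectorized np.corrcoef call over the whole matrix would be faster:

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations
import numpy as np

def pair_stats(args):
    """Per-pair work unit; in practice this would include the tail analysis."""
    (i, j), x, y = args
    return i, j, np.corrcoef(x, y)[0, 1]

if __name__ == "__main__":
    data = np.random.default_rng(0).normal(size=(20, 10_000))   # 20 features
    jobs = [((i, j), data[i], data[j]) for i, j in combinations(range(len(data)), 2)]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(pair_stats, jobs, chunksize=16))
```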

Future Enhancements

  1. Algorithm Extensions

    • Implement copula-based dependency measures
    • Add time-series tail correlation analysis
    • Include multivariate tail dependence
  2. Performance Optimizations

    • Parallel processing for large datasets
    • GPU acceleration for correlation computation
    • Incremental updates for streaming data
  3. Additional Features

    • API endpoint for integration
    • Machine learning models trained on tail patterns
    • Automated threshold optimization

Citation

If you use this tool in your research or work, please cite:

```bibtex
@software{serendipity_finder_2025,
  title = {Serendipity Finder: Advanced Pattern Discovery in Distribution Extremes},
  author = {Aporbo, Cazandra},
  year = {2025},
  url = {https://github.com/Cazzy-Aporbo/Serendipity-Finder},
  email = {becaziam@gmail.com}
}
```

Contact


Cazandra Aporbo, MS 2025

For questions, collaboration opportunities, or implementation support, please reach out via email.


License

This project is licensed under the MIT License. See LICENSE file for details.
