Curious Coder - Mathematical & Scientific Algorithms, Explained

What this repository is

A focused, continually-improving collection of clear, rigorous explanations of algorithms I use to analyze complex systems, especially in biological and medical contexts where data are noisy, high-dimensional, and full of edge cases.
I care about mathematics. You will see it: definitions first, assumptions stated, and trade-offs made explicit. But everything is written to be approachable; you should not need to fight the notation to understand the idea. These notes are a continual work in progress, so check back for updates.

This is not a tutorial farm, a code dump, or a catalog of buzzwords. It's a place to understand how an algorithm works, why it's appropriate, and when to move on to a neighboring method.

How to use these notes

  1. Start with context. Each entry begins with a short statement of what the method is good at and the assumptions it quietly relies on.
  2. Scan the formulation. A compact, precise formulation follows (objective, loss, constraints). No heavy derivations unless they matter for usage.
  3. Check "When to prefer / avoid." Practical decision criteria so you can move quickly.
  4. Look sideways. Every method lists a few close neighbors (e.g., PCA ↔ ICA; K-means ↔ GMM/EM; Lasso ↔ Ridge/Elastic-Net).
  5. Apply with discipline. Metrics and diagnostics are included so results don't become anecdotes.

Reading guide for each algorithm entry

| Section | Purpose | What you'll see |
|---|---|---|
| Intent | One-sentence "what problem does this solve?" | Clear problem statement |
| Formulation | The core mathematics without ceremony | Objective/loss, constraints, variables |
| Assumptions | The part that breaks silently if ignored | IID, linearity, separability, smoothness, stationarity, etc. |
| When to prefer | Practical conditions where it excels | Data size/shape, noise regime, feature types |
| When to avoid | Failure modes and edge cases | Multicollinearity, non-convexity traps, class imbalance, etc. |
| Neighbor methods | Closely related alternatives | Swap-ins worth testing |
| Diagnostics | How to know it worked | Residual checks, calibration, stability, uncertainty (see the sketch below) |
| Biological applications | Where this has bite | Imaging, genomics, EHR, epidemiology, physiology |
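
As one concrete instance of the Diagnostics row, here is a minimal calibration check for a binary classifier. It assumes scikit-learn is installed; the synthetic dataset and logistic-regression model are placeholders for illustration, not a recommendation.

```python
# A minimal diagnostics sketch: reliability bins and Brier score for a
# binary classifier, using scikit-learn on synthetic placeholder data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = clf.predict_proba(X_te)[:, 1]

# Reliability: observed fraction of positives per predicted-probability bin.
frac_pos, mean_pred = calibration_curve(y_te, p, n_bins=10)
print("Brier score:", brier_score_loss(y_te, p))
for m, f in zip(mean_pred, frac_pos):
    print(f"predicted {m:.2f} -> observed {f:.2f}")
```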

A compact decision orientation

| Data shape | Often a good starting point | If that stalls, try |
|---|---|---|
| Tabular, small→medium | Regularized GLMs; GBDT (XGB/LGBM/CatBoost) | Nonlinear SVM; simple MLP |
| Sequences / longitudinal | Transformers (baseline); temporal CNNs | RNN/LSTM/GRU; HMM/Kalman for structure |
| Spatial grids / images | CNNs; U-Net for segmentation | Vision Transformers; diffusion for generation |
| Graphs / molecular | Message-passing GNNs | Graph Transformers; spectral methods |
| Very high-dimensional, low labels | Contrastive pretraining; masked modeling | Autoencoders/VAEs; self-distillation |
| Uncertainty is critical | Bayesian GLMs; calibrated ensembles | Bayesian deep learning; conformal prediction |
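
The last row mentions conformal prediction; a minimal split-conformal sketch for regression intervals follows. It assumes scikit-learn; the gradient-boosting model, synthetic data, and miscoverage level alpha are illustrative choices only.

```python
# Split conformal prediction: calibrate an interval half-width on held-out
# residuals, then check empirical coverage on new points.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=3000, n_features=10, noise=10.0, random_state=0)
X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_new, y_cal, y_new = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)

# Absolute residuals on the calibration split give a distribution-free quantile.
alpha = 0.1
scores = np.abs(y_cal - model.predict(X_cal))
q = np.quantile(scores, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))

pred = model.predict(X_new)
lower, upper = pred - q, pred + q
print("empirical coverage:", np.mean((y_new >= lower) & (y_new <= upper)))
```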

Why this curation exists

I enjoy the structure and honesty of mathematics. In applied work, especially in biology and medicine, results improve when the assumptions are explicit, the algorithms are chosen for the data (not for fashion), and the limits are respected. These notes are written to make that process fast, transparent, and repeatable.

You will see a mix of:

  • Core methods (GLMs, trees/ensembles, SVMs, PCA, clustering)
  • Advanced learning (Transformers, diffusion, contrastive/self-supervised learning, GNNs)
  • Frontier topics (flows, neural ODEs, causal estimation, federated/multitask learning)
  • Biological ML where signal is subtle and mechanisms matter (U-Net families, AlphaFold-style structure models, EHR sequence models)

Template for an algorithm entry

Use this structure to keep entries consistent and quick to scan.
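
An illustrative skeleton follows; it simply mirrors the section names from the reading guide above, and the angle-bracket text is meant to be replaced per entry.

Intent: <one-sentence problem statement>
Formulation: <objective, loss, constraints>
Assumptions: <what breaks silently if ignored>
When to prefer: <practical conditions where it excels>
When to avoid: <failure modes and edge cases>
Neighbor methods: <swap-ins worth testing>
Diagnostics: <how to know it worked>
Biological applications: <where it has bite>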

Curious Coder - Algorithms Index

This section collects core, advanced, and frontier methods I study and use. Entries focus on:

  • Used for (primary purpose)
  • When (practical decision criteria)
  • Similar (adjacent methods to consider)

The intent is clarity: rigorous enough for research, direct enough for practice.


Core Algorithms for DS/AI/ML Engineers

| Method | Used for | When | Similar |
|---|---|---|---|
| Linear Regression | Linear relationship to continuous target | Interpretable baseline; trend testing | Ridge, Lasso |
| Logistic Regression | Binary classification with calibrated probs | Probabilities + interpretability; baseline | Probit; Softmax Regression (multiclass) |
| Decision Trees | Rule-based classification/regression | Nonlinear patterns; mixed feature types | Random Forest; Gradient Boosted Trees |
| Random Forest | Ensemble of trees (bagging) | Robust tabular performance; low overfit | ExtraTrees; GBM |
| Gradient Boosting (XGB/LGBM/CatBoost) | Boosted trees for strong tabular accuracy | State-of-the-art on many tabular tasks | AdaBoost; Random Forest |
| K-Nearest Neighbors | Instance-based classification/regression | Simple nonparametric baseline; low-dim data | KDE; RBF-kernel SVM |
| Support Vector Machines | Max-margin classification/regression | Medium-sized data; robustness to outliers | Logistic (linear); NNs (nonlinear) |
| Naïve Bayes | Generative classification with independence | Text; very high-dimensional sparse features | Logistic Regression; LDA |
| PCA | Orthogonal dimensionality reduction | Compression; de-correlation; visualization | SVD; ICA |
| K-Means | Hard-partition clustering | Fast baseline clustering | GMM (soft clusters); DBSCAN |
| Expectation-Maximization | Latent-variable MLE (e.g., GMM) | Overlapping distributions; soft assignments | K-Means; Variational Inference |
| Apriori / FP-Growth | Association rule mining | Frequent itemsets; basket analysis | Eclat |
| Dynamic Programming | Optimal substructure optimization | Overlapping subproblems | Greedy (approximate) |
| Gradient Descent | Continuous optimization (worked sketch after this table) | Differentiable models; large-scale training | SGD; Adam; RMSProp |
| Neural Networks (MLP) | Flexible nonlinear mapping | Complex patterns; large data | CNN; RNN |
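
As a worked example of the Gradient Descent row, the sketch below runs batch gradient descent on a ridge-regularized least-squares objective and compares the result to the closed-form solution. It uses only numpy; the step size, iteration count, and synthetic data are illustrative, not tuned settings.

```python
# Batch gradient descent for ridge regression:
# objective J(w) = (1/2n) * ||Xw - y||^2 + (lam/2) * ||w||^2
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1.0
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w):
    # Gradient of the ridge objective above.
    return X.T @ (X @ w - y) / n + lam * w

w = np.zeros(d)
lr = 0.1
for _ in range(500):
    w -= lr * grad(w)

# Closed-form ridge solution for comparison.
w_closed = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
print("max abs difference vs closed form:", np.max(np.abs(w - w_closed)))
```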

Advanced Algorithms

| Method | Used for | When | Similar |
|---|---|---|---|
| CNNs | Spatial representation learning | Vision; local structure | ViTs; Graph Convolutions |
| RNN / LSTM / GRU | Sequence modeling with memory | Time series; language; speech | Transformers; Temporal CNNs |
| Transformers | Attention-based sequence modeling | Language; multimodal; long context | RNNs; Attentional CNNs |
| Autoencoders | Compression; anomaly detection (sketch after this table) | Representation learning | PCA; VAE |
| Variational Autoencoders | Probabilistic generative modeling | Latent structure + generation | GANs; Normalizing Flows |
| GANs | Adversarial generative modeling | Realistic synthesis; augmentation | VAEs; Diffusion |
| Diffusion Models | Score-based generation | Diversity + stability | GANs; Score Matching |
| Reinforcement Learning (Q-Learning) | Value-based decision policies | Discrete actions; tabular/compact states | Policy Gradient; DQN |
| Policy Gradient / Actor-Critic | Direct policy optimization | Continuous/high-dim actions | REINFORCE; PPO |
| K-Means++ / Advanced Clustering | Improved initialization | Reduce bad local minima | Spectral; GMM; DBSCAN |
| DBSCAN | Density-based clustering with noise | Arbitrary shapes; outliers | OPTICS; HDBSCAN |
| Spectral Clustering | Graph-Laplacian embeddings | Manifold/complex geometry | GNNs; Laplacian Eigenmaps |
| HMMs | Probabilistic sequence models | Hidden state dynamics | Kalman Filters; CRF |
| Kalman Filters | State estimation with noise | Real-time tracking | Particle Filters; HMM |
| Graph Neural Networks | Learning on graphs | Relational structure > features | CNN (grids); Graph Transformers |
| MCMC | Sampling complex posteriors | Bayesian inference | Variational Inference; HMC |
| GBDT (XGB/LGBM/CatBoost) | Top performance on tabular data | Accuracy with moderate compute | Random Forest; AdaBoost |
| Recommenders (MF: SVD/ALS) | Collaborative filtering | Sparse user-item matrices | NCF; Graph-based Recsys |
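
To make the Autoencoders row concrete, below is a minimal PyTorch autoencoder for tabular features, where reconstruction error can double as an anomaly score. The layer sizes, optimizer settings, and random data are placeholders, and training is full-batch purely to keep the sketch short.

```python
# Minimal tabular autoencoder: encode to a small latent space, decode back,
# and minimize mean-squared reconstruction error.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1024, 20)  # stand-in for a real feature matrix

class AutoEncoder(nn.Module):
    def __init__(self, d_in=20, d_latent=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 32), nn.ReLU(), nn.Linear(32, d_in))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):
    opt.zero_grad()
    recon, _ = model(X)
    loss = loss_fn(recon, X)   # reconstruction error doubles as an anomaly score
    loss.backward()
    opt.step()

print("final reconstruction MSE:", loss.item())
```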

Frontier / Expert Topics

| Method | Used for | When | Similar |
|---|---|---|---|
| Normalizing Flows | Exact-likelihood generative modeling | Need density + sampling | VAE; Diffusion |
| Diffusion Transformers | Diffusion + Transformer backbones | Scaled multimodal generation | DDPM; GANs |
| Neural ODEs | Continuous-time dynamics | Physics/biology/finance signals | RNNs; SDEs |
| Graph Transformers / Message Passing | Expressive graph learning | Complex relational structure | Spectral GNNs |
| Neural Tangent Kernel | Infinite-width NN theory | Generalization & convergence study | Kernels; GPs |
| Meta-Learning (MAML, ProtoNets) | Rapid adaptation | Few-shot; transfer | Bayesian Opt; Fine-tuning |
| Bayesian Deep Learning | Uncertainty-aware deep models | High-stakes decisions | MCMC; VI |
| Causal Inference (DoWhy, EconML) | Estimating causal effects | Policy/health interventions | IV; Propensity Scores |
| Federated Learning (FedAvg, FedProx) | Privacy-preserving distributed training (sketch after this table) | Decentralized sensitive data | Distributed SGD; DP |
| Contrastive Learning (SimCLR, CLIP) | Self-supervised representations | Limited labels; large raw data | Autoencoders; Distillation |
| Energy-Based Models | Unnormalized density modeling | Intractable partition functions | Boltzmann Machines |
| RL: PPO / SAC / DDPG | Scalable policy optimization | Continuous/high-dim control | REINFORCE; Q-Learning |
| Multi-Agent RL | Interacting agents | Markets; autonomy; swarms | Game Theory; Single-agent RL |
| Mixture-of-Experts / Sparse Transformers | Efficient scaling | Conditional computation | Standard Transformers; LoRA |
| Quantum ML (VQE, QAOA) | Quantum optimization/chemistry | NISQ-era research | Classical Variational Methods |
| Neurosymbolic AI | Neural perception + symbolic reasoning | Tasks needing both pattern and logic | Knowledge Graphs |
| Masked Self-Supervision (BERT, MAE) | Representation pretraining | Large unlabeled corpora | Contrastive; Autoencoders |
| Prompting / Few-Shot Adaptation | LLM task transfer without updates | Generalization to unseen tasks | Meta-Learning; Instruction Tuning |
| Curriculum Learning | Staged difficulty schedules | Unstable/complex training | RL Shaping; Augmentation |
| Neural Architecture Search | Automated model design | Edge constraints; task specificity | Bayesian/Hyperparameter Opt |
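
As a schematic view of the Federated Learning row, the numpy sketch below runs FedAvg rounds on a shared linear model: each client takes local gradient steps, and the server averages parameters weighted by local sample counts. Client data, step counts, and learning rate are illustrative.

```python
# Schematic FedAvg: local least-squares gradient steps per client,
# then a data-weighted parameter average on the server.
import numpy as np

rng = np.random.default_rng(0)
d = 5
w_global = np.zeros(d)
w_true = rng.normal(size=d)

def local_update(w, X, y, lr=0.1, steps=10):
    w = w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)   # local gradient step
    return w

# Three clients with differently sized local datasets.
clients = []
for n in (50, 120, 200):
    X = rng.normal(size=(n, d))
    clients.append((X, X @ w_true + 0.1 * rng.normal(size=n)))

for round_ in range(20):
    updates, sizes = [], []
    for X, y in clients:
        updates.append(local_update(w_global, X, y))
        sizes.append(len(y))
    # FedAvg: weight each client's parameters by its local sample count.
    w_global = np.average(updates, axis=0, weights=sizes)

print("distance to generating weights:", np.linalg.norm(w_global - w_true))
```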

Medical & Biological AI (Selected)

| Area | Model | Used for | Similar |
|---|---|---|---|
| Neuro / Brain Imaging | U-Net, V-Net, nnU-Net; BrainAGE; GLM (SPM/FSL) | Segmentation; age prediction; activation modeling | SegNet; DeepLab |
| Radiology | Radiomics+ML; DeepMedic; CheXNet | Quantitative features; lesion segmentation; X-ray Dx | ResNet/EfficientNet variants |
| Genomics | DeepSEA; AlphaFold; SpliceAI; DeepCpG/EpiDeep | Variant effect; protein structure; splicing; epigenetics | Basset; Basenji; RoseTTAFold |
| Cardiology | ECGNet/DeepECG; EchoNet | Arrhythmia classification; EF estimation | 1D CNNs; video CNNs |
| Pathology | HoVer-Net; CLAM (MIL); tile-based classifiers | Nucleus segmentation; WSI classification | Mask R-CNN; MIL variants |
| Population & EHR | RETAIN; DeepPatient; BEHRT | Longitudinal risk; multi-outcome prediction | RNNs; Transformers for EHR |
| Epidemiology | Compartmental (SIR/SEIR/SEIRD); ABM | Spread modeling; intervention simulation (sketch after this table) | System dynamics |
| Multimodal Medical AI | MedCLIP; BioViL; Bio/ClinicalBERT | Image-text alignment; biomedical NLP | CLIP; BERT |
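
For the Epidemiology row, below is a minimal SIR compartmental model integrated with scipy's solve_ivp. The transmission rate, recovery rate, population size, and horizon are purely illustrative, not calibrated to any disease.

```python
# SIR dynamics: dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma, N = 0.3, 0.1, 1_000_000   # transmission rate, recovery rate, population

def sir(t, y):
    S, I, R = y
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return [dS, dI, dR]

y0 = [N - 10, 10, 0]                    # initial susceptible, infected, recovered
sol = solve_ivp(sir, (0, 365), y0, t_eval=np.linspace(0, 365, 366))

peak_day = int(sol.t[np.argmax(sol.y[1])])
print("basic reproduction number R0 =", beta / gamma)
print("peak infections on day", peak_day, "with", int(sol.y[1].max()), "infected")
```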

Why these families

  • Clinical Core: U-Net, RETAIN, SIR - established workhorses.
  • Research-Grade: AlphaFold, DeepSEA, SpliceAI - molecular scale.
  • Practice-Changing: CheXNet, EchoNet, CLAM - real clinical impact.
  • Emerging Frontier: MedCLIP, BEHRT, BioBERT - multimodal and longitudinal.

Library Reference & Imports (Curated)

Core Python

| Library | Role | Notes |
|---|---|---|
| os, sys, Path | Filesystem, environment, paths | Portable path handling via pathlib |
| re, json, csv | Regex, serialization, CSV I/O | Use jsonlines for large JSONL |
| math, random, time, datetime | Math, RNG, timing | dt alias for concise timestamps |
| Counter, defaultdict | Counting, default dicts | Efficient tallies and grouping |

Data Handling

| Library | Role | Notes |
|---|---|---|
| numpy | Arrays, vectorized math | Foundation for most stacks |
| pandas | Tabular data | Wide ecosystem; groupby, time series |
| pyarrow | Columnar memory, parquet | High-perf interchange with pandas |
| polars | Fast DataFrame (Rust engine) | Laziness, speed on medium/large data |

Visualization

| Library | Role | Notes |
|---|---|---|
| matplotlib, seaborn | Static plotting | Seaborn for statistical charts |
| plotly.express, graph_objects | Interactive plots | Browser-ready, tooltips, zoom |
| altair | Declarative grammar | Readable specs; Vega-Lite backend |

Machine Learning / AI

| Library | Role | Notes |
|---|---|---|
| scikit-learn | Classical ML, metrics, preprocessing | Baselines, pipelines, grid search |
| XGBoost, LightGBM, CatBoost | Gradient boosting | SOTA tabular; categorical support (CatBoost) |
| PyTorch | Deep learning | Define-by-run, custom training loops |
| TensorFlow / Keras | Deep learning | High-level layers, production tooling |
| transformers | LLMs, transfer learning | Tokenizers, pipelines, model zoo |

Math, Statistics, SciPy

| Library | Role | Notes |
|---|---|---|
| scipy (stats, signal, optimize, integrate) | Scientific routines | Tests, filters, solvers |
| statistics | Built-in descriptive stats | Lightweight helpers |
| sympy | Symbolic math | Derivations, simplifications |

NLP / Text

| Library | Role | Notes |
|---|---|---|
| nltk, spacy | Tokenization, parsing | spaCy for pipelines; NLTK utilities |
| gensim | Word2Vec, LDA | Topic modeling and embeddings |
| wordcloud | Visual summaries | Exploratory visuals |

Utilities & Workflow

| Library | Role | Notes |
|---|---|---|
| tqdm | Progress bars | Notebook-friendly via tqdm.notebook |
| logging, warnings | Diagnostics | Set handlers, suppress noise selectively |
| joblib, pickle | Model I/O | Persist artifacts; mind security |

Data I/O

| Library | Role | Notes |
|---|---|---|
| csv, sqlite3 | Flat files, local DB | Good for lightweight pipelines |
| h5py | HDF5 storage | Large arrays, hierarchical datasets |
| requests | HTTP APIs | Timeouts, retries, backoff (sketch below) |
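
A small sketch of the requests row above: a session with a timeout and automatic retry/backoff via urllib3's Retry. The URL is a placeholder, and the retry settings are illustrative.

```python
# HTTP client with retries and backoff: mount a retrying adapter on a session,
# then always pass an explicit timeout.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retry))

resp = session.get("https://example.com/", timeout=10)  # placeholder URL; swap in a real endpoint
resp.raise_for_status()
print(resp.status_code, len(resp.content), "bytes")
```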

Visualization Add-ons

| Library | Role | Notes |
|---|---|---|
| networkx | Graphs/networks | Topology, centrality measures |
| geopandas, folium | Geospatial viz | Interactive maps and overlays |

Advanced Data & Big Data

| Library | Role | Notes |
|---|---|---|
| dask.dataframe | Out-of-core pandas | Parallelize wide workflows |
| vaex, modin | Lazy or distributed DataFrame | Scale on single machine or cluster |
| pyspark | Spark API | Cluster compute for very large data |

Deep Learning & GPU

| Library | Role | Notes |
|---|---|---|
| torch, nn, optim, F | Core training | Custom loops, modules |
| torch.distributed, TensorBoard | Multi-GPU, logging | DDP for scale-out |
| tensorflow, keras | DL stacks | High-level layers and fit loops |
| jax, jnp, flax, optax | JIT DL, functional NN | Fast grad, pure functions |

Advanced AI / Transformers / LLMs

| Library | Role | Notes |
|---|---|---|
| transformers | LLMs, pipelines | Text, vision, audio models |
| peft, bitsandbytes | Efficient finetuning, quantization | LoRA, 8-bit/4-bit training |
| accelerate, sentence_transformers | Distributed, embeddings | Multi-GPU orchestration, retrieval |
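
A minimal transformers sketch using the high-level pipeline API. Calling a pipeline without naming a model downloads a library default, so this assumes network access; the example strings are placeholders.

```python
# High-level pipelines: text classification and token-level embeddings.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")          # text classification pipeline
print(classifier("The assay results were surprisingly consistent."))

embedder = pipeline("feature-extraction")            # token-level embeddings
vectors = embedder("noisy high-dimensional data")
print(len(vectors[0]), "tokens embedded")
```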

Advanced Visualization & Dashboards

| Library | Role | Notes |
|---|---|---|
| bokeh, holoviews, hvplot | Interactive viz stacks | Linked brushing, high-level APIs |
| panel, dash, streamlit | Dashboards/apps | From notebook to app quickly |
| pyvis, pyvista | Networks, 3D | Explorable graphs and volumes |

Statistics, Bayesian, Probabilistic

| Library | Role | Notes |
|---|---|---|
| pymc, arviz | Bayesian inference, diagnostics | Priors, posteriors, PPC (sketch below) |
| statsmodels | Regression, time series | GLM, ARIMA families |
| lifelines, prophet | Survival, forecasting | Kaplan-Meier; components/trends |
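
To make the pymc/arviz row concrete, a minimal Bayesian model: estimating a mean and noise scale with weakly informative priors, then summarizing the posterior with ArviZ. The data, priors, and sampler settings are illustrative, and the API shown targets recent PyMC releases (v4+).

```python
# Estimate mean and noise scale of normal data with NUTS sampling.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=100)

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)        # prior on the mean
    sigma = pm.HalfNormal("sigma", sigma=5.0)        # prior on the noise scale
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

print(az.summary(idata, var_names=["mu", "sigma"]))
```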

Optimization & Math

| Library | Role | Notes |
|---|---|---|
| cvxpy, pulp, ortools | Convex, LP/MIP, routing | Solvers and modeling (sketch below) |
| numba | JIT acceleration | Speed up Python loops |
| sympy | Symbolic math | Closed forms, derivations |
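
A small cvxpy sketch for the first row above: lasso-style regression written as an explicit convex program. Problem dimensions and the regularization weight lam are illustrative.

```python
# Lasso as a convex program: (1/2n)||Ax - b||^2 + lam * ||x||_1
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.normal(size=(n, d))
b = A @ (rng.normal(size=d) * (rng.random(d) < 0.3)) + 0.05 * rng.normal(size=n)

x = cp.Variable(d)
lam = 0.1
objective = cp.Minimize(cp.sum_squares(A @ x - b) / (2 * n) + lam * cp.norm1(x))
problem = cp.Problem(objective)
problem.solve()

print("status:", problem.status)
print("nonzero coefficients:", int(np.sum(np.abs(x.value) > 1e-4)))
```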

Graphs, Knowledge, Advanced Data

| Library | Role | Notes |
|---|---|---|
| networkx, neo4j | Graph analysis, DB | Topology + graph stores |
| dgl, torch_geometric, stellargraph | Graph ML | Message passing, link prediction |

Advanced NLP / Text

| Library | Role | Notes |
|---|---|---|
| stanza, flair | NLP pipelines, embeddings | Strong pretrained components |
| yake, textblob | Keywords, sentiment | Lightweight tasks |
| gensim LdaModel | Topic modeling | Classical LDA workflow |

Advanced Utilities & Parallelism

| Library | Role | Notes |
|---|---|---|
| ray, joblib | Distributed, parallel pipelines | Scale compute across cores/nodes |
| ThreadPoolExecutor, ProcessPoolExecutor | Concurrency APIs | IO vs CPU bound tasks |
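
A short sketch of the two standard-library executors from the last row: threads for I/O-bound calls, processes for CPU-bound work. The workloads are toy stand-ins, and the worker counts are illustrative.

```python
# ThreadPoolExecutor for I/O-bound tasks, ProcessPoolExecutor for CPU-bound tasks.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import math

def io_like(x):             # stand-in for a blocking I/O call
    return x * 2

def cpu_heavy(x):           # stand-in for CPU-bound work
    return sum(math.sqrt(i) for i in range(x))

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=8) as tp:
        print(list(tp.map(io_like, range(10))))

    with ProcessPoolExecutor(max_workers=4) as pp:
        print([round(v, 1) for v in pp.map(cpu_heavy, [100_000] * 4)])
```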

Computer Vision & Image/Video

| Library | Role | Notes |
|---|---|---|
| opencv | Image/video processing | Transforms, codecs, tracking |
| mediapipe | Pose/gesture | Prebuilt inference graphs |
| albumentations, skimage | Augmentation, analysis | Training-ready pipelines |
| imageio, tifffile | I/O, large images | Microscopy, GeoTIFFs |

Geospatial & Maps

| Library | Role | Notes |
|---|---|---|
| geopandas, shapely | Geo tables, geometry ops | Buffers, intersections |
| rasterio, cartopy | Rasters, cartography | CRS management |
| folium, contextily | Interactive maps, basemaps | Tiles and layers |

Ultra / Rare Imports (HPC, Research, Frontier)

| Area | Examples | Use |
|---|---|---|
| HPC & GPU Kernels | triton, mpi4py, pycuda, pyopencl, numexpr | Custom kernels, multi-node, speed |
| Large-Scale Training | deepspeed, fairscale, megatron | Sharded models, parallelism |
| Probabilistic Programming | pyro, edward2, gpytorch | Bayesian deep learning, GPs |
| Causal ML | dowhy, econml, causalinference | Effects, policy evaluation |
| Science & Bio | biopython, deepchem, mdtraj, openmm | Genomics, chemistry, MD |
| Quantum | qiskit, cirq, pennylane, qutip | VQA, simulation |
| Advanced Viz | datashader, mayavi, k3d, fastplotlib | Huge data, 3D interactive |
| Privacy & Federated | opacus, tensorflow_privacy, syft | Differential privacy, FL |
| Infra & MLOps | prefect, dagster, kedro, mlflow, hydra, feast | Pipelines, tracking, configs |
