Curious Coder - Mathematical & Scientific Algorithms, Explained

What this repository is

A focused, continually-improving collection of clear, rigorous explanations of algorithms I use to analyze complex systems, especially in biological and medical contexts where data are noisy, high-dimensional, and full of edge cases.
I care about mathematics. You will see it: definitions first, assumptions stated, and trade-offs made explicit. But everything is written to be approachable; you should not need to fight the notation to understand the idea. These notes are a continual work in progress, so check back for updates.

This is not a tutorial farm, a code dump, or a catalog of buzzwords. It's a place to understand how an algorithm works, why it's appropriate, and when to move on to a neighboring method.

How to use these notes

  1. Start with context. Each entry begins with a short statement of what the method is good at and the assumptions it quietly relies on.
  2. Scan the formulation. A compact, precise formulation follows (objective, loss, constraints). No heavy derivations unless they matter for usage.
  3. Check "When to prefer / avoid." Practical decision criteria so you can move quickly.
  4. Look sideways. Every method lists a few close neighbors (e.g., PCA ↔ ICA; K-means ↔ GMM/EM; Lasso ↔ Ridge/Elastic-Net).
  5. Apply with discipline. Metrics and diagnostics are included so results don't become anecdotes.

Reading guide for each algorithm entry

| Section | Purpose | What you'll see |
|---|---|---|
| Intent | One-sentence "what problem does this solve?" | Clear problem statement |
| Formulation | The core mathematics without ceremony | Objective/loss, constraints, variables |
| Assumptions | The part that breaks silently if ignored | IID, linearity, separability, smoothness, stationarity, etc. |
| When to prefer | Practical conditions where it excels | Data size/shape, noise regime, feature types |
| When to avoid | Failure modes and edge cases | Multicollinearity, non-convexity traps, class imbalance, etc. |
| Neighbor methods | Closely related alternatives | Swap-ins worth testing |
| Diagnostics | How to know it worked | Residual checks, calibration, stability, uncertainty (see the sketch below) |
| Biological applications | Where this has bite | Imaging, genomics, EHR, epidemiology, physiology |
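
As one concrete instance of the Diagnostics row, here is a minimal calibration check for a binary classifier. It assumes scikit-learn is installed; the synthetic dataset and logistic-regression model are placeholders for illustration, not a recommendation.

```python
# A minimal diagnostics sketch: reliability bins and Brier score for a
# binary classifier, using scikit-learn on synthetic placeholder data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = clf.predict_proba(X_te)[:, 1]

# Reliability: observed fraction of positives per predicted-probability bin.
frac_pos, mean_pred = calibration_curve(y_te, p, n_bins=10)
print("Brier score:", brier_score_loss(y_te, p))
for m, f in zip(mean_pred, frac_pos):
    print(f"predicted {m:.2f} -> observed {f:.2f}")
```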

A compact decision orientation

| Data shape | Often a good starting point | If that stalls, try |
|---|---|---|
| Tabular, small→medium | Regularized GLMs; GBDT (XGB/LGBM/CatBoost) | Nonlinear SVM; simple MLP |
| Sequences / longitudinal | Transformers (baseline); temporal CNNs | RNN/LSTM/GRU; HMM/Kalman for structure |
| Spatial grids / images | CNNs; U-Net for segmentation | Vision Transformers; diffusion for generation |
| Graphs / molecular | Message-passing GNNs | Graph Transformers; spectral methods |
| Very high-dimensional, low labels | Contrastive pretraining; masked modeling | Autoencoders/VAEs; self-distillation |
| Uncertainty is critical | Bayesian GLMs; calibrated ensembles | Bayesian deep learning; conformal prediction |
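
The last row mentions conformal prediction; a minimal split-conformal sketch for regression intervals follows. It assumes scikit-learn; the gradient-boosting model, synthetic data, and miscoverage level alpha are illustrative choices only.

```python
# Split conformal prediction: calibrate an interval half-width on held-out
# residuals, then check empirical coverage on new points.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=3000, n_features=10, noise=10.0, random_state=0)
X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_new, y_cal, y_new = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)

# Absolute residuals on the calibration split give a distribution-free quantile.
alpha = 0.1
scores = np.abs(y_cal - model.predict(X_cal))
q = np.quantile(scores, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))

pred = model.predict(X_new)
lower, upper = pred - q, pred + q
print("empirical coverage:", np.mean((y_new >= lower) & (y_new <= upper)))
```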

Why this curation exists

I enjoy the structure and honesty of mathematics. In applied work, especially in biology and medicine, results improve when the assumptions are explicit, the algorithms are chosen for the data (not for fashion), and the limits are respected. These notes are written to make that process fast, transparent, and repeatable.

You will see a mix of:

  • Core methods (GLMs, trees/ensembles, SVMs, PCA, clustering)
  • Advanced learning (Transformers, diffusion, contrastive/self-supervised learning, GNNs)
  • Frontier topics (flows, neural ODEs, causal estimation, federated/multitask learning)
  • Biological ML where signal is subtle and mechanisms matter (U-Net families, AlphaFold-style structure models, EHR sequence models)

Template for an algorithm entry

Use this structure to keep entries consistent and quick to scan.
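
An illustrative skeleton follows; it simply mirrors the section names from the reading guide above, and the angle-bracket text is meant to be replaced per entry.

Intent: <one-sentence problem statement>
Formulation: <objective, loss, constraints>
Assumptions: <what breaks silently if ignored>
When to prefer: <practical conditions where it excels>
When to avoid: <failure modes and edge cases>
Neighbor methods: <swap-ins worth testing>
Diagnostics: <how to know it worked>
Biological applications: <where it has bite>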

Curious Coder - Algorithms Index

This section collects core, advanced, and frontier methods I study and use. Entries focus on:

  • Used for (primary purpose)
  • When (practical decision criteria)
  • Similar (adjacent methods to consider)

The intent is clarity: rigorous enough for research, direct enough for practice.


Core Algorithms for DS/AI/ML Engineers

| Method | Used for | When | Similar |
|---|---|---|---|
| Linear Regression | Linear relationship to continuous target | Interpretable baseline; trend testing | Ridge, Lasso |
| Logistic Regression | Binary classification with calibrated probs | Probabilities + interpretability; baseline | Probit; Softmax Regression (multiclass) |
| Decision Trees | Rule-based classification/regression | Nonlinear patterns; mixed feature types | Random Forest; Gradient Boosted Trees |
| Random Forest | Ensemble of trees (bagging) | Robust tabular performance; low overfit | ExtraTrees; GBM |
| Gradient Boosting (XGB/LGBM/CatBoost) | Boosted trees for strong tabular accuracy | State-of-the-art on many tabular tasks | AdaBoost; Random Forest |
| K-Nearest Neighbors | Instance-based classification/regression | Simple nonparametric baseline; low-dim data | KDE; RBF-kernel SVM |
| Support Vector Machines | Max-margin classification/regression | Medium-sized data; robustness to outliers | Logistic (linear); NNs (nonlinear) |
| Naïve Bayes | Generative classification with independence | Text; very high-dimensional sparse features | Logistic Regression; LDA |
| PCA | Orthogonal dimensionality reduction | Compression; de-correlation; visualization | SVD; ICA |
| K-Means | Hard-partition clustering | Fast baseline clustering | GMM (soft clusters); DBSCAN |
| Expectation-Maximization | Latent-variable MLE (e.g., GMM) | Overlapping distributions; soft assignments | K-Means; Variational Inference |
| Apriori / FP-Growth | Association rule mining | Frequent itemsets; basket analysis | Eclat |
| Dynamic Programming | Optimal substructure optimization | Overlapping subproblems | Greedy (approximate) |
| Gradient Descent | Continuous optimization (worked sketch after this table) | Differentiable models; large-scale training | SGD; Adam; RMSProp |
| Neural Networks (MLP) | Flexible nonlinear mapping | Complex patterns; large data | CNN; RNN |
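
As a worked example of the Gradient Descent row, the sketch below runs batch gradient descent on a ridge-regularized least-squares objective and compares the result to the closed-form solution. It uses only numpy; the step size, iteration count, and synthetic data are illustrative, not tuned settings.

```python
# Batch gradient descent for ridge regression:
# objective J(w) = (1/2n) * ||Xw - y||^2 + (lam/2) * ||w||^2
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1.0
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w):
    # Gradient of the ridge objective above.
    return X.T @ (X @ w - y) / n + lam * w

w = np.zeros(d)
lr = 0.1
for _ in range(500):
    w -= lr * grad(w)

# Closed-form ridge solution for comparison.
w_closed = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
print("max abs difference vs closed form:", np.max(np.abs(w - w_closed)))
```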

Advanced Algorithms

| Method | Used for | When | Similar |
|---|---|---|---|
| CNNs | Spatial representation learning | Vision; local structure | ViTs; Graph Convolutions |
| RNN / LSTM / GRU | Sequence modeling with memory | Time series; language; speech | Transformers; Temporal CNNs |
| Transformers | Attention-based sequence modeling | Language; multimodal; long context | RNNs; Attentional CNNs |
| Autoencoders | Compression; anomaly detection (sketch after this table) | Representation learning | PCA; VAE |
| Variational Autoencoders | Probabilistic generative modeling | Latent structure + generation | GANs; Normalizing Flows |
| GANs | Adversarial generative modeling | Realistic synthesis; augmentation | VAEs; Diffusion |
| Diffusion Models | Score-based generation | Diversity + stability | GANs; Score Matching |
| Reinforcement Learning (Q-Learning) | Value-based decision policies | Discrete actions; tabular/compact states | Policy Gradient; DQN |
| Policy Gradient / Actor-Critic | Direct policy optimization | Continuous/high-dim actions | REINFORCE; PPO |
| K-Means++ / Advanced Clustering | Improved initialization | Reduce bad local minima | Spectral; GMM; DBSCAN |
| DBSCAN | Density-based clustering with noise | Arbitrary shapes; outliers | OPTICS; HDBSCAN |
| Spectral Clustering | Graph-Laplacian embeddings | Manifold/complex geometry | GNNs; Laplacian Eigenmaps |
| HMMs | Probabilistic sequence models | Hidden state dynamics | Kalman Filters; CRF |
| Kalman Filters | State estimation with noise | Real-time tracking | Particle Filters; HMM |
| Graph Neural Networks | Learning on graphs | Relational structure > features | CNN (grids); Graph Transformers |
| MCMC | Sampling complex posteriors | Bayesian inference | Variational Inference; HMC |
| GBDT (XGB/LGBM/CatBoost) | Top performance on tabular data | Accuracy with moderate compute | Random Forest; AdaBoost |
| Recommenders (MF: SVD/ALS) | Collaborative filtering | Sparse user-item matrices | NCF; Graph-based Recsys |
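
To make the Autoencoders row concrete, below is a minimal PyTorch autoencoder for tabular features, where reconstruction error can double as an anomaly score. The layer sizes, optimizer settings, and random data are placeholders, and training is full-batch purely to keep the sketch short.

```python
# Minimal tabular autoencoder: encode to a small latent space, decode back,
# and minimize mean-squared reconstruction error.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1024, 20)  # stand-in for a real feature matrix

class AutoEncoder(nn.Module):
    def __init__(self, d_in=20, d_latent=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 32), nn.ReLU(), nn.Linear(32, d_in))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):
    opt.zero_grad()
    recon, _ = model(X)
    loss = loss_fn(recon, X)   # reconstruction error doubles as an anomaly score
    loss.backward()
    opt.step()

print("final reconstruction MSE:", loss.item())
```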

Frontier / Expert Topics

| Method | Used for | When | Similar |
|---|---|---|---|
| Normalizing Flows | Exact-likelihood generative modeling | Need density + sampling | VAE; Diffusion |
| Diffusion Transformers | Diffusion + Transformer backbones | Scaled multimodal generation | DDPM; GANs |
| Neural ODEs | Continuous-time dynamics | Physics/biology/finance signals | RNNs; SDEs |
| Graph Transformers / Message Passing | Expressive graph learning | Complex relational structure | Spectral GNNs |
| Neural Tangent Kernel | Infinite-width NN theory | Generalization & convergence study | Kernels; GPs |
| Meta-Learning (MAML, ProtoNets) | Rapid adaptation | Few-shot; transfer | Bayesian Opt; Fine-tuning |
| Bayesian Deep Learning | Uncertainty-aware deep models | High-stakes decisions | MCMC; VI |
| Causal Inference (DoWhy, EconML) | Estimating causal effects | Policy/health interventions | IV; Propensity Scores |
| Federated Learning (FedAvg, FedProx) | Privacy-preserving distributed training (sketch after this table) | Decentralized sensitive data | Distributed SGD; DP |
| Contrastive Learning (SimCLR, CLIP) | Self-supervised representations | Limited labels; large raw data | Autoencoders; Distillation |
| Energy-Based Models | Unnormalized density modeling | Intractable partition functions | Boltzmann Machines |
| RL: PPO / SAC / DDPG | Scalable policy optimization | Continuous/high-dim control | REINFORCE; Q-Learning |
| Multi-Agent RL | Interacting agents | Markets; autonomy; swarms | Game Theory; Single-agent RL |
| Mixture-of-Experts / Sparse Transformers | Efficient scaling | Conditional computation | Standard Transformers; LoRA |
| Quantum ML (VQE, QAOA) | Quantum optimization/chemistry | NISQ-era research | Classical Variational Methods |
| Neurosymbolic AI | Neural perception + symbolic reasoning | Tasks needing both pattern and logic | Knowledge Graphs |
| Masked Self-Supervision (BERT, MAE) | Representation pretraining | Large unlabeled corpora | Contrastive; Autoencoders |
| Prompting / Few-Shot Adaptation | LLM task transfer without updates | Generalization to unseen tasks | Meta-Learning; Instruction Tuning |
| Curriculum Learning | Staged difficulty schedules | Unstable/complex training | RL Shaping; Augmentation |
| Neural Architecture Search | Automated model design | Edge constraints; task specificity | Bayesian/Hyperparameter Opt |
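
As a schematic view of the Federated Learning row, the numpy sketch below runs FedAvg rounds on a shared linear model: each client takes local gradient steps, and the server averages parameters weighted by local sample counts. Client data, step counts, and learning rate are illustrative.

```python
# Schematic FedAvg: local least-squares gradient steps per client,
# then a data-weighted parameter average on the server.
import numpy as np

rng = np.random.default_rng(0)
d = 5
w_global = np.zeros(d)
w_true = rng.normal(size=d)

def local_update(w, X, y, lr=0.1, steps=10):
    w = w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)   # local gradient step
    return w

# Three clients with differently sized local datasets.
clients = []
for n in (50, 120, 200):
    X = rng.normal(size=(n, d))
    clients.append((X, X @ w_true + 0.1 * rng.normal(size=n)))

for round_ in range(20):
    updates, sizes = [], []
    for X, y in clients:
        updates.append(local_update(w_global, X, y))
        sizes.append(len(y))
    # FedAvg: weight each client's parameters by its local sample count.
    w_global = np.average(updates, axis=0, weights=sizes)

print("distance to generating weights:", np.linalg.norm(w_global - w_true))
```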

Medical & Biological AI (Selected)

| Area | Model | Used for | Similar |
|---|---|---|---|
| Neuro / Brain Imaging | U-Net, V-Net, nnU-Net; BrainAGE; GLM (SPM/FSL) | Segmentation; age prediction; activation modeling | SegNet; DeepLab |
| Radiology | Radiomics+ML; DeepMedic; CheXNet | Quantitative features; lesion segmentation; X-ray Dx | ResNet/EfficientNet variants |
| Genomics | DeepSEA; AlphaFold; SpliceAI; DeepCpG/EpiDeep | Variant effect; protein structure; splicing; epigenetics | Basset; Basenji; RoseTTAFold |
| Cardiology | ECGNet/DeepECG; EchoNet | Arrhythmia classification; EF estimation | 1D CNNs; video CNNs |
| Pathology | HoVer-Net; CLAM (MIL); tile-based classifiers | Nucleus segmentation; WSI classification | Mask R-CNN; MIL variants |
| Population & EHR | RETAIN; DeepPatient; BEHRT | Longitudinal risk; multi-outcome prediction | RNNs; Transformers for EHR |
| Epidemiology | Compartmental (SIR/SEIR/SEIRD); ABM | Spread modeling; intervention simulation (sketch after this table) | System dynamics |
| Multimodal Medical AI | MedCLIP; BioViL; Bio/ClinicalBERT | Image-text alignment; biomedical NLP | CLIP; BERT |
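
For the Epidemiology row, below is a minimal SIR compartmental model integrated with scipy's solve_ivp. The transmission rate, recovery rate, population size, and horizon are purely illustrative, not calibrated to any disease.

```python
# SIR dynamics: dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma, N = 0.3, 0.1, 1_000_000   # transmission rate, recovery rate, population

def sir(t, y):
    S, I, R = y
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return [dS, dI, dR]

y0 = [N - 10, 10, 0]                    # initial susceptible, infected, recovered
sol = solve_ivp(sir, (0, 365), y0, t_eval=np.linspace(0, 365, 366))

peak_day = int(sol.t[np.argmax(sol.y[1])])
print("basic reproduction number R0 =", beta / gamma)
print("peak infections on day", peak_day, "with", int(sol.y[1].max()), "infected")
```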

Why these families

  • Clinical Core: U-Net, RETAIN, SIR - established workhorses.
  • Research-Grade: AlphaFold, DeepSEA, SpliceAI - molecular scale.
  • Practice-Changing: CheXNet, EchoNet, CLAM - real clinical impact.
  • Emerging Frontier: MedCLIP, BEHRT, BioBERT - multimodal and longitudinal.

Library Reference & Imports (Curated)

Core Python

| Library | Role | Notes |
|---|---|---|
| os, sys, Path | Filesystem, environment, paths | Portable path handling via pathlib |
| re, json, csv | Regex, serialization, CSV I/O | Use jsonlines for large JSONL |
| math, random, time, datetime | Math, RNG, timing | dt alias for concise timestamps |
| Counter, defaultdict | Counting, default dicts | Efficient tallies and grouping |

Data Handling

| Library | Role | Notes |
|---|---|---|
| numpy | Arrays, vectorized math | Foundation for most stacks |
| pandas | Tabular data | Wide ecosystem; groupby, time series |
| pyarrow | Columnar memory, parquet | High-perf interchange with pandas |
| polars | Fast DataFrame (Rust engine) | Laziness, speed on medium/large data |

Visualization

| Library | Role | Notes |
|---|---|---|
| matplotlib, seaborn | Static plotting | Seaborn for statistical charts |
| plotly.express, graph_objects | Interactive plots | Browser-ready, tooltips, zoom |
| altair | Declarative grammar | Readable specs; Vega-Lite backend |

Machine Learning / AI

| Library | Role | Notes |
|---|---|---|
| scikit-learn | Classical ML, metrics, preprocessing | Baselines, pipelines, grid search |
| XGBoost, LightGBM, CatBoost | Gradient boosting | SOTA tabular; categorical support (CatBoost) |
| PyTorch | Deep learning | Define-by-run, custom training loops |
| TensorFlow / Keras | Deep learning | High-level layers, production tooling |
| transformers | LLMs, transfer learning | Tokenizers, pipelines, model zoo |

Math, Statistics, SciPy

| Library | Role | Notes |
|---|---|---|
| scipy (stats, signal, optimize, integrate) | Scientific routines | Tests, filters, solvers |
| statistics | Built-in descriptive stats | Lightweight helpers |
| sympy | Symbolic math | Derivations, simplifications |

NLP / Text

| Library | Role | Notes |
|---|---|---|
| nltk, spacy | Tokenization, parsing | spaCy for pipelines; NLTK utilities |
| gensim | Word2Vec, LDA | Topic modeling and embeddings |
| wordcloud | Visual summaries | Exploratory visuals |

Utilities & Workflow

| Library | Role | Notes |
|---|---|---|
| tqdm | Progress bars | Notebook-friendly via tqdm.notebook |
| logging, warnings | Diagnostics | Set handlers, suppress noise selectively |
| joblib, pickle | Model I/O | Persist artifacts; mind security |

Data I/O

| Library | Role | Notes |
|---|---|---|
| csv, sqlite3 | Flat files, local DB | Good for lightweight pipelines |
| h5py | HDF5 storage | Large arrays, hierarchical datasets |
| requests | HTTP APIs | Timeouts, retries, backoff (sketch below) |
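
A small sketch of the requests row above: a session with a timeout and automatic retry/backoff via urllib3's Retry. The URL is a placeholder, and the retry settings are illustrative.

```python
# HTTP client with retries and backoff: mount a retrying adapter on a session,
# then always pass an explicit timeout.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retry))

resp = session.get("https://example.com/", timeout=10)  # placeholder URL; swap in a real endpoint
resp.raise_for_status()
print(resp.status_code, len(resp.content), "bytes")
```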

Visualization Add-ons

| Library | Role | Notes |
|---|---|---|
| networkx | Graphs/networks | Topology, centrality measures |
| geopandas, folium | Geospatial viz | Interactive maps and overlays |

Advanced Data & Big Data

| Library | Role | Notes |
|---|---|---|
| dask.dataframe | Out-of-core pandas | Parallelize wide workflows |
| vaex, modin | Lazy or distributed DataFrame | Scale on single machine or cluster |
| pyspark | Spark API | Cluster compute for very large data |

Deep Learning & GPU

| Library | Role | Notes |
|---|---|---|
| torch, nn, optim, F | Core training | Custom loops, modules |
| torch.distributed, TensorBoard | Multi-GPU, logging | DDP for scale-out |
| tensorflow, keras | DL stacks | High-level layers and fit loops |
| jax, jnp, flax, optax | JIT DL, functional NN | Fast grad, pure functions |

Advanced AI / Transformers / LLMs

| Library | Role | Notes |
|---|---|---|
| transformers | LLMs, pipelines | Text, vision, audio models |
| peft, bitsandbytes | Efficient finetuning, quantization | LoRA, 8-bit/4-bit training |
| accelerate, sentence_transformers | Distributed, embeddings | Multi-GPU orchestration, retrieval |
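
A minimal transformers sketch using the high-level pipeline API. Calling a pipeline without naming a model downloads a library default, so this assumes network access; the example strings are placeholders.

```python
# High-level pipelines: text classification and token-level embeddings.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")          # text classification pipeline
print(classifier("The assay results were surprisingly consistent."))

embedder = pipeline("feature-extraction")            # token-level embeddings
vectors = embedder("noisy high-dimensional data")
print(len(vectors[0]), "tokens embedded")
```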

Advanced Visualization & Dashboards

| Library | Role | Notes |
|---|---|---|
| bokeh, holoviews, hvplot | Interactive viz stacks | Linked brushing, high-level APIs |
| panel, dash, streamlit | Dashboards/apps | From notebook to app quickly |
| pyvis, pyvista | Networks, 3D | Explorable graphs and volumes |

Statistics, Bayesian, Probabilistic

| Library | Role | Notes |
|---|---|---|
| pymc, arviz | Bayesian inference, diagnostics | Priors, posteriors, PPC (sketch below) |
| statsmodels | Regression, time series | GLM, ARIMA families |
| lifelines, prophet | Survival, forecasting | Kaplan-Meier; components/trends |
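
To make the pymc/arviz row concrete, a minimal Bayesian model: estimating a mean and noise scale with weakly informative priors, then summarizing the posterior with ArviZ. The data, priors, and sampler settings are illustrative, and the API shown targets recent PyMC releases (v4+).

```python
# Estimate mean and noise scale of normal data with NUTS sampling.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=100)

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)        # prior on the mean
    sigma = pm.HalfNormal("sigma", sigma=5.0)        # prior on the noise scale
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

print(az.summary(idata, var_names=["mu", "sigma"]))
```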

Optimization & Math

| Library | Role | Notes |
|---|---|---|
| cvxpy, pulp, ortools | Convex, LP/MIP, routing | Solvers and modeling (sketch below) |
| numba | JIT acceleration | Speed up Python loops |
| sympy | Symbolic math | Closed forms, derivations |
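
A small cvxpy sketch for the first row above: lasso-style regression written as an explicit convex program. Problem dimensions and the regularization weight lam are illustrative.

```python
# Lasso as a convex program: (1/2n)||Ax - b||^2 + lam * ||x||_1
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.normal(size=(n, d))
b = A @ (rng.normal(size=d) * (rng.random(d) < 0.3)) + 0.05 * rng.normal(size=n)

x = cp.Variable(d)
lam = 0.1
objective = cp.Minimize(cp.sum_squares(A @ x - b) / (2 * n) + lam * cp.norm1(x))
problem = cp.Problem(objective)
problem.solve()

print("status:", problem.status)
print("nonzero coefficients:", int(np.sum(np.abs(x.value) > 1e-4)))
```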

Graphs, Knowledge, Advanced Data

| Library | Role | Notes |
|---|---|---|
| networkx, neo4j | Graph analysis, DB | Topology + graph stores |
| dgl, torch_geometric, stellargraph | Graph ML | Message passing, link prediction |

Advanced NLP / Text

| Library | Role | Notes |
|---|---|---|
| stanza, flair | NLP pipelines, embeddings | Strong pretrained components |
| yake, textblob | Keywords, sentiment | Lightweight tasks |
| gensim LdaModel | Topic modeling | Classical LDA workflow |

Advanced Utilities & Parallelism

| Library | Role | Notes |
|---|---|---|
| ray, joblib | Distributed, parallel pipelines | Scale compute across cores/nodes |
| ThreadPoolExecutor, ProcessPoolExecutor | Concurrency APIs | IO vs CPU bound tasks |
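
A short sketch of the two standard-library executors from the last row: threads for I/O-bound calls, processes for CPU-bound work. The workloads are toy stand-ins, and the worker counts are illustrative.

```python
# ThreadPoolExecutor for I/O-bound tasks, ProcessPoolExecutor for CPU-bound tasks.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import math

def io_like(x):             # stand-in for a blocking I/O call
    return x * 2

def cpu_heavy(x):           # stand-in for CPU-bound work
    return sum(math.sqrt(i) for i in range(x))

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=8) as tp:
        print(list(tp.map(io_like, range(10))))

    with ProcessPoolExecutor(max_workers=4) as pp:
        print([round(v, 1) for v in pp.map(cpu_heavy, [100_000] * 4)])
```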

Computer Vision & Image/Video

| Library | Role | Notes |
|---|---|---|
| opencv | Image/video processing | Transforms, codecs, tracking |
| mediapipe | Pose/gesture | Prebuilt inference graphs |
| albumentations, skimage | Augmentation, analysis | Training-ready pipelines |
| imageio, tifffile | I/O, large images | Microscopy, GeoTIFFs |

Geospatial & Maps

| Library | Role | Notes |
|---|---|---|
| geopandas, shapely | Geo tables, geometry ops | Buffers, intersections |
| rasterio, cartopy | Rasters, cartography | CRS management |
| folium, contextily | Interactive maps, basemaps | Tiles and layers |

Ultra / Rare Imports (HPC, Research, Frontier)

| Area | Examples | Use |
|---|---|---|
| HPC & GPU Kernels | triton, mpi4py, pycuda, pyopencl, numexpr | Custom kernels, multi-node, speed |
| Large-Scale Training | deepspeed, fairscale, megatron | Sharded models, parallelism |
| Probabilistic Programming | pyro, edward2, gpytorch | Bayesian deep learning, GPs |
| Causal ML | dowhy, econml, causalinference | Effects, policy evaluation |
| Science & Bio | biopython, deepchem, mdtraj, openmm | Genomics, chemistry, MD |
| Quantum | qiskit, cirq, pennylane, qutip | VQA, simulation |
| Advanced Viz | datashader, mayavi, k3d, fastplotlib | Huge data, 3D interactive |
| Privacy & Federated | opacus, tensorflow_privacy, syft | Differential privacy, FL |
| Infra & MLOps | prefect, dagster, kedro, mlflow, hydra, feast | Pipelines, tracking, configs |
