Vision • Features • Architecture • Installation • Quick Start • Learning Path • Core Modules • Math • Experiments • Performance • Contributing • License
Why build from scratch?
In an era of high-level APIs and auto-ML, the fundamental mechanisms of machine learning are often obscured. This lab exists to peel back the layers—to implement every algorithm, every optimization step, and every backpropagation pass with transparent, readable code. It is a research-grade educational repository for those who believe that true understanding comes from building.
Educational & Research Goals
- Provide a self-contained curriculum that progresses from linear algebra foundations to transformer architectures.
- Serve as a reference implementation for researchers prototyping new ideas without heavy framework dependencies.
- Demonstrate engineering best practices in a research context: modular design, comprehensive testing, and reproducible benchmarks.
- Foster a community of learners who can experiment, extend, and share their insights.
Engineering Philosophy
- Mathematical Transparency: Every equation in a paper should map directly to a line of code.
- Performance Awareness: Write efficient NumPy code without sacrificing clarity.
- Educational First: Code is documentation; notebooks are tutorials.
| Icon | Feature | Description |
|---|---|---|
| 🧮 | Pure NumPy Implementations | No black boxes—everything from linear regression to multi-head attention is built with np.dot, np.einsum, and manual gradients. |
| 📉 | Gradient-Based Optimization | Implement SGD, Momentum, Adam, and more with explicit gradient computation and optional autodiff for educational clarity. |
| 🧠 | Deep Learning Foundations | Modular layers (Dense, Conv1D, RNN, LSTM), activation functions, loss functions, and backpropagation through time. |
| 🔍 | Transformer Basics | Scaled dot-product attention, positional encoding, encoder/decoder blocks—all from scratch, ready for experimentation. |
| 🔬 | Research Experiments | Easily swap components, log metrics, and compare convergence against reference libraries. |
| ⚡ | Benchmarking vs. sklearn | Automated tests assert numerical correctness and performance parity with scikit-learn’s reference implementations. |
| 🧪 | Comprehensive Testing | Unit tests, integration tests, and gradient checks ensure reliability. |
| 📚 | 40+ Jupyter Notebooks | A structured learning path from math foundations to advanced topics. |
| 🐍 | Pythonic & Modular | Clean, object-oriented design that follows industry best practices. |
| 📊 | Visual Learning | Every notebook includes rich visualizations of loss landscapes, decision boundaries, and attention maps. |
| 🧩 | Modular Components | Easily swap optimizers, activation functions, or regularizers to observe effects. |
graph TB
subgraph "Repository Root"
direction TB
A[.github/] --> A1[workflows/]
A1 --> A2[test.yml]
A1 --> A3[notebooks.yml]
A1 --> A4[docs.yml]
B[notebooks/] --> B1[00_mathematical_foundations]
B --> B2[01_linear_models]
B --> B3[02_tree_models]
B --> B4[03_neural_networks]
B --> B5[04_transformers]
B --> B6[05_experiments]
C[src/ml_from_scratch/] --> C1[core/]
C --> C2[optimizers/]
C --> C3[models/]
C --> C4[neural/]
C --> C5[datasets/]
C --> C6[utils/]
D[tests/]
E[benchmarks/]
F[examples/]
G[configs/]
H[docs/]
I[scripts/]
J[.gitignore]
K[LICENSE]
L[README.md]
M[CONTRIBUTING.md]
N[CODE_OF_CONDUCT.md]
O[CHANGELOG.md]
P[pyproject.toml]
Q[setup.cfg]
R[requirements.txt]
S[requirements-dev.txt]
end
| Path | Description |
|---|---|
.github/workflows/ |
CI/CD pipelines: test, notebook validation, doc deployment |
notebooks/ |
Interactive learning path (40+ Jupyter notebooks) |
src/ml_from_scratch/ |
Core library (installable package) |
tests/ |
Unit & integration tests |
benchmarks/ |
Performance & correctness benchmarks against scikit-learn |
examples/ |
Standalone usage scripts |
configs/ |
YAML configuration files for experiments |
docs/ |
Sphinx documentation source |
scripts/ |
Utility scripts (data download, benchmark runner) |
- Python 3.9 or later
- pip (Python package manager)
- (Optional) virtualenv or conda for environment management
git clone https://github.com/ammmanism/ml-from-scratch-lab.git
cd ml-from-scratch-labWindows
python -m venv venv
venv\Scripts\activatemacOS/Linux
python3 -m venv venv
source venv/bin/activatepip install -e .pip install -r requirements-dev.txtpython -c "from ml_from_scratch import __version__; print(__version__)"docker build -t ml-from-scratch-lab .
docker run -it --rm -p 8888:8888 ml-from-scratch-labClick the "Open in Colab" badge at the top to run notebooks directly in your browser with zero setup.
from ml_from_scratch.models import LinearRegression
from ml_from_scratch.datasets import make_regression
from ml_from_scratch.utils import train_test_split
# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=5, noise=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train model
model = LinearRegression()
model.fit(X_train, y_train, lr=0.01, epochs=1000, verbose=False)
# Predict
predictions = model.predict(X_test)
mse = ((predictions - y_test) ** 2).mean()
print(f"Test MSE: {mse:.4f}")from ml_from_scratch.models import LogisticRegression
from ml_from_scratch.datasets import make_classification
from ml_from_scratch.metrics import accuracy_score
X, y = make_classification(n_samples=200, n_features=10, n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train, lr=0.1, epochs=500)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))from ml_from_scratch.neural import Model, Dense, ReLU, Softmax
from ml_from_scratch.optimizers import Adam
from ml_from_scratch.losses import CrossEntropy
from ml_from_scratch.datasets import load_mnist
# Load data (or use synthetic)
X_train, y_train, X_test, y_test = load_mnist(normalize=True)
# Build model
model = Model()
model.add(Dense(128, input_dim=784))
model.add(ReLU())
model.add(Dense(64))
model.add(ReLU())
model.add(Dense(10))
model.add(Softmax())
# Compile
model.compile(optimizer=Adam(learning_rate=0.001),
loss=CrossEntropy())
# Train
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)
# Evaluate
y_pred = model.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))from ml_from_scratch.neural.layers import MultiHeadAttention, FeedForward
import numpy as np
# Example sequence: batch=2, seq_len=5, dim=16
x = np.random.randn(2, 5, 16)
# Multi-head attention (8 heads)
mha = MultiHeadAttention(d_model=16, num_heads=8)
attn_out = mha(x, x, x) # self-attention
# Feed-forward network
ffn = FeedForward(d_model=16, d_ff=64)
out = ffn(attn_out)
print(out.shape) # (2, 5, 16)Explore the notebooks in the recommended order. Each notebook builds on the previous one, forming a cohesive learning journey.
graph LR
subgraph "00 Math Foundations"
A1[01_vectors_matrices.ipynb] --> A2[02_eigenvalues_svd.ipynb]
A2 --> A3[03_probability_distributions.ipynb]
A3 --> A4[04_calculus_gradients.ipynb]
end
subgraph "01 Linear Models"
B1[01_linear_regression.ipynb] --> B2[02_logistic_regression.ipynb]
B2 --> B3[03_regularization.ipynb]
end
subgraph "02 Tree Models"
C1[01_decision_trees.ipynb] --> C2[02_random_forests.ipynb]
C2 --> C3[03_gradient_boosting.ipynb]
end
subgraph "03 Neural Networks"
D1[01_backpropagation.ipynb] --> D2[02_mlp.ipynb]
D2 --> D3[03_cnn.ipynb]
D3 --> D4[04_rnn_lstm.ipynb]
end
subgraph "04 Transformers"
E1[01_attention.ipynb] --> E2[02_transformer_implementation.ipynb]
E2 --> E3[03_positional_encoding.ipynb]
end
subgraph "05 Experiments"
F1[01_hyperparameter_tuning.ipynb]
F2[02_ablation_studies.ipynb]
F3[03_gradient_flow_analysis.ipynb]
end
A4 --> B1
B3 --> C1
C3 --> D1
D4 --> E1
E3 --> F1
F1 --> F2
F2 --> F3
| Notebook | Topic | Key Concepts | Link |
|---|---|---|---|
| 00_01 | Vectors & Matrices | Norms, dot product, linear transformations | Open |
| 00_02 | Eigenvalues & SVD | Spectral theorem, PCA intuition | Open |
| 00_03 | Probability Distributions | Gaussian, Bernoulli, MLE | Open |
| 00_04 | Calculus & Gradients | Partial derivatives, chain rule, Jacobians | Open |
| 01_01 | Linear Regression | Closed-form, gradient descent, R² | Open |
| 01_02 | Logistic Regression | Sigmoid, cross-entropy, decision boundary | Open |
| 01_03 | Regularization | Ridge, Lasso, ElasticNet | Open |
| 02_01 | Decision Trees | Entropy, information gain, pruning | Open |
| 02_02 | Random Forests | Bagging, feature importance | Open |
| 02_03 | Gradient Boosting | AdaBoost, XGBoost intuition | Open |
| 03_01 | Backpropagation | Chain rule, computational graph | Open |
| 03_02 | Multilayer Perceptron | MLP, activation functions, initializations | Open |
| 03_03 | Convolutional Networks | Convolution, pooling, receptive fields | Open |
| 03_04 | RNN & LSTM | Recurrent cells, backprop through time | Open |
| 04_01 | Attention Mechanism | Scaled dot-product, multi-head | Open |
| 04_02 | Transformer from Scratch | Encoder-decoder, positional encoding | Open |
| 04_03 | Positional Encoding | Sinusoidal, learned embeddings | Open |
| 05_01 | Hyperparameter Tuning | Grid search, random search, Bayesian opt | Open |
| 05_02 | Ablation Studies | Removing components, measuring impact | Open |
| 05_03 | Gradient Flow Analysis | Vanishing/exploding gradients, visualization | Open |
- autodiff.py: A minimal computational graph for automatic differentiation (educational, not for production). Supports forward/backward passes with dynamic graph building.
- parameter.py: Wrapper for trainable parameters with gradient storage. Supports parameter sharing and gradient accumulation.
- initializers.py: Xavier, He, random normal/uniform initializations.
- SGD: Stochastic gradient descent with momentum (Nesterov optional).
- Adam: Adaptive Moment Estimation with bias correction.
- RMSprop: Root Mean Square Propagation.
- Adagrad: Adaptive gradient algorithm.
- LearningRateSchedulers: Step decay, exponential decay, cosine annealing.
- LinearRegression, LogisticRegression: Ordinary and logistic regression with L1/L2 regularization.
- DecisionTreeClassifier, DecisionTreeRegressor: CART algorithm with entropy/Gini.
- RandomForestClassifier, RandomForestRegressor: Ensemble of trees with bagging.
- GradientBoostingClassifier: Gradient boosted trees (simplified).
- Sequential: Container for neural network layers.
| Layer | Description |
|---|---|
Dense |
Fully connected layer with configurable activation |
Conv2D |
2D convolution with stride, padding, dilation |
MaxPooling2D |
Max pooling |
Flatten |
Flattens input |
RNN |
Recurrent layer with tanh activation |
LSTM |
Long Short-Term Memory cell |
Embedding |
Token embedding layer |
MultiHeadAttention |
Scaled dot-product attention with multiple heads |
FeedForward |
Position-wise FFN used in transformers |
ReLU,LeakyReLU,ELU,Sigmoid,Tanh,Softmax
MSE,MAE,CrossEntropy,BinaryCrossEntropy,KLDivergence,HuberLoss
Dropout,BatchNormalization,L1L2Regularizer
make_regression,make_classification,make_blobs,make_circles,make_moonsload_mnist,load_cifar10,load_fashion_mnist(auto-download)TimeSeriesGeneratorfor RNN/LSTM data (sine wave, AR process)
train_test_split,batch_iteratoraccuracy_score,precision_recall_fscore,confusion_matrix,roc_auc_scorenormalize,standardize,one_hot_encode,to_categoricalvisualize_decision_boundary,plot_loss_curve,plot_confusion_matrix
Every implementation is accompanied by rigorous mathematical documentation. Key formulas are directly translated into code:
| Concept | Equation | Code |
|---|---|---|
| Linear Regression (closed-form) | beta = np.linalg.inv(X.T @ X) @ X.T @ y |
|
| Gradient of MSE | grad = (2/n) * X.T @ (X @ beta - y) |
|
| Backpropagation for Dense layer | delta = (W_next.T @ delta_next) * activation_derivative(z) |
|
| Attention Scores |
scores = np.einsum('bqd,bkd->bqk', Q, K) / np.sqrt(d_k)attn = softmax(scores)out = np.einsum('bqk,bkd->bqd', attn, V)
|
|
| Adam Update |
$\hat{v}t = v_t / (1-\beta_2^t)$ $\theta{t+1} = \theta_t - \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ |
See optimizers/adam.py
|
For deeper dives, refer to the notebooks/00_mathematical_foundations/ series, which includes:
- Visualizations of eigenvectors and SVD.
- Interactive probability distributions.
- Gradient descent on 2D loss surfaces.
The benchmarks/ directory contains scripts to compare our implementations with scikit-learn on metrics, speed, and memory usage. Run a benchmark:
python benchmarks/compare_logistic_regression.py --samples 10000 --features 100 --epochs 100Sample output:
Scikit-learn LogisticRegression: Accuracy = 0.876, Time = 0.23s
Our LogisticRegression: Accuracy = 0.875, Time = 0.41s
Difference: 0.001 (within tolerance ✓)
We also provide scripts to generate convergence plots comparing different optimizers:
xychart-beta
title "Convergence of Optimizers on MNIST (MLP)"
x-axis ["Epoch 0", "Epoch 25", "Epoch 50", "Epoch 75", "Epoch 100"]
y-axis "Loss" 0 --> 2.5
line "SGD" [2.3, 1.2, 0.8, 0.6, 0.5]
line "Momentum" [2.3, 0.9, 0.5, 0.3, 0.2]
line "Adam" [2.3, 0.7, 0.3, 0.15, 0.1]
Note: The above chart is a static representation. Actual interactive plots are available in benchmarks/convergence_plots/.
The notebooks/05_experiments/ folder contains in-depth studies:
| Experiment | Description | Key Findings |
|---|---|---|
01_hyperparameter_tuning.ipynb |
Grid search vs random search for MLP | Random search is 3x faster for same performance |
02_ablation_studies.ipynb |
Remove dropout, batch norm, skip connections | Dropout critical for generalization; skip connections enable deeper nets |
03_gradient_flow_analysis.ipynb |
Visualize gradient norms in deep networks | Vanishing gradients appear after 5 layers with sigmoid; ReLU mitigates |
- Unit Tests:
pytestruns over 200 tests covering:- Gradient checks (finite difference).
- Shape consistency.
- Equality to scikit-learn on small datasets.
- Continuous Integration: GitHub Actions run tests on every push and PR.
- Coverage: Maintains >90% code coverage.
- We vectorize operations using NumPy's optimized C backend.
- Memory usage is monitored with
memory_profiler. - For large-scale experiments, consider using the optional
cupybackend (in development).
| Model | Dataset (size) | sklearn time | our time | speed ratio |
|---|---|---|---|---|
| Linear Regression | Boston (506,13) | 0.002s | 0.003s | 0.67x |
| Logistic Regression | Digits (1797,64) | 0.12s | 0.18s | 0.67x |
| Random Forest (10 trees) | Wine (178,13) | 0.08s | 0.15s | 0.53x |
| MLP (2 hidden, 100 units) | MNIST (60000,784) | 5.2s | 8.1s | 0.64x |
| Transformer (1 block) | IMDB (5000,100) | 1.8s | 2.4s | 0.75x |
Our implementations are typically 0.5x–0.75x the speed of highly optimized libraries – a reasonable tradeoff for transparency and educational value.
pie title Memory Usage (MNIST MLP)
"Parameters" : 45
"Activations" : 30
"Gradients" : 20
"Overhead" : 5
We welcome contributions from everyone! Whether you're fixing a bug, adding a new model, improving documentation, or suggesting an experiment, your help is appreciated.
- Fork the repository.
- Create a branch (
git checkout -b feature/amazing-model). - Write tests for your changes.
- Ensure code quality:
black src/ tests/ flake8 src/ tests/ pytest tests/
- Commit (
git commit -m 'Add amazing model'). - Push (
git push origin feature/amazing-model). - Open a Pull Request with a clear description.
- Follow PEP 8.
- Use Black for formatting (line length 88).
- Write docstrings in Google format.
- Place the model in the appropriate subdirectory under
src/ml_from_scratch/. - Inherit from
BaseModelif applicable. - Implement
fitandpredictmethods. - Add unit tests in
tests/. - Create a notebook in
notebooks/05_experiments/demonstrating usage.
See CONTRIBUTING.md for full details.
- Discord: Join our server for real-time discussions.
- Twitter: Follow @mlscratchlab for updates.
- GitHub Discussions: Ask questions, share ideas, or show off your experiments in the Discussions tab.
- Issues: Report bugs or request features via GitHub Issues.
Please note that this project adheres to the Contributor Covenant Code of Conduct. By participating, you are expected to uphold this code.
This project is licensed under the MIT License – see the LICENSE file for details.
If you use this repository in your research or teaching, please cite it as:
@misc{ml-from-scratch-lab,
author = {Amman Hussain Ansari},
title = {ML From Scratch Lab: Implementing Machine Learning from First Principles with NumPy},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ammmanism/ml-from-scratch-lab}}
}
If you find this project valuable, please consider giving it a ⭐ on GitHub!
Happy learning, and may your gradients always converge! 🚀

