scratch-ml

A learning project comparing four deep learning architectures on a time-series regression task: a naive LSTM, an attention-augmented LSTM, a Transformer encoder, and a temporal convolutional network (TCN). Each model is implemented from scratch, trained under identical conditions, and compared on the same data to understand the tradeoffs between recurrence, attention, and convolution.


The Data

The dataset is fully synthetic, generated by src/generate.py. It contains 2,000 timesteps sampled at 100 Hz over 20 seconds, rescaled to the range [1, 5] V to simulate a sensor reading.

Three input features (the signals the model can see):

Signal     Description
sine       1 Hz sine wave + small Gaussian noise
square     2 Hz square wave (50% duty cycle) + noise
triangle   1 Hz triangle wave + noise

One target (what the model must predict):

The target is a nonlinear mix of the three inputs, designed to require memory:

y_base   = 1.2*sine + 0.5*(sine * triangle_lag5) + 0.6*(square * triangle) + 0.3*triangle^2
envelope = 1 + 0.4*sin(2*pi * 0.2 * t)
target   = envelope * y_base + noise

Two things make this hard:

  • Lag - the triangle_lag5 term means the target depends on the triangle wave from 5 steps ago. A model that only looks at the current input will consistently miss this.
  • Slow envelope - the amplitude of the relationship changes sinusoidally over time (0.2 Hz), so the model must track both the fast signals and this slow modulation simultaneously.
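
To make the lag and envelope terms concrete, here is a minimal NumPy sketch of how such a target could be generated. It is illustrative only: variable names, the noise amplitude, and the omission of the per-signal noise and the [1, 5] V rescaling may not match src/generate.py.

import numpy as np

rng = np.random.default_rng(0)
t = np.arange(2000) / 100.0                                  # 20 s sampled at 100 Hz
sine     = np.sin(2 * np.pi * 1.0 * t)
square   = np.sign(np.sin(2 * np.pi * 2.0 * t))              # 2 Hz, 50% duty cycle
triangle = 2 * np.abs(2 * ((1.0 * t) % 1) - 1) - 1           # 1 Hz triangle in [-1, 1]
triangle_lag5 = np.roll(triangle, 5)                         # triangle from 5 steps ago (wraparound at the start ignored in this sketch)
y_base   = 1.2*sine + 0.5*(sine * triangle_lag5) + 0.6*(square * triangle) + 0.3*triangle**2
envelope = 1 + 0.4 * np.sin(2 * np.pi * 0.2 * t)
target   = envelope * y_base + 0.05 * rng.standard_normal(t.shape)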

Split: 60% training (1,200 samples) / 40% validation (800 samples), in temporal order. No shuffling - future data must never influence training.


Models

All four models share the same interface: they take a lookback window of shape (B, 256, 3) (batch x time x features) and output a single predicted value (B, 1).

They are trained under identical conditions: Adam at lr=0.001, batch size 64, MSE loss. Epochs are set in CONFIG["epochs"] in src/train.py.
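
For reference, that shared setup corresponds to a training loop roughly like the one below. This is a sketch only; the real loop in src/train.py also tracks validation RMSE, timing, plotting, and per-model extras such as gradient clipping and LR scheduling.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def fit(model: nn.Module, x: torch.Tensor, y: torch.Tensor, epochs: int) -> None:
    loader  = DataLoader(TensorDataset(x, y), batch_size=64)
    opt     = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for xb, yb in loader:                 # xb: (B, 256, 3), yb: (B, 1)
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()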


1. Naive LSTM

What it does: Processes the 256-step input sequentially, one timestep at a time, maintaining an internal "memory" (the hidden state). After seeing all 256 steps, it uses only the final hidden state to make a prediction.

Architecture:

LSTM(F=3 -> H=32, 2 layers, dropout=0.1)
  |  take h_T  (last timestep only)
Linear(32 -> 1)

The key weakness: The hidden states at steps h_1 ... h_{T-1} are thrown away. All the memory from earlier in the window is compressed into h_T, which may not preserve everything.

Training: Plain Adam, no gradient clipping, no LR schedule.
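
A minimal PyTorch sketch of this model, using the hyperparameters listed above; the actual class in src/models/naive_lstm.py may differ in detail.

import torch
from torch import nn

class NaiveLSTMSketch(nn.Module):
    def __init__(self, n_features: int = 3, hidden: int = 32) -> None:
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, dropout=0.1, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, 256, 3)
        _, (h_n, _) = self.lstm(x)                         # h_n: (num_layers, B, H)
        return self.head(h_n[-1])                          # last layer's h_T only -> (B, 1)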


2. Improved LSTM

What it does: Same LSTM, but instead of discarding h_1 ... h_{T-1}, it uses soft attention to compute a weighted average of all T hidden states. The model learns which timesteps are most informative and concentrates weight on those.

Architecture:

LSTM(F=3 -> H=64, 3 layers, dropout=0.1)
  |  LayerNorm(H)
  |  Attention:
       scores[t] = w * h_t              (learned vector w gives one scalar score per timestep)
       alpha     = softmax(scores)      (weights that sum to 1 across T)
       context   = sum_t alpha_t * h_t  (weighted average: shape H)
  |  Linear(64 -> 1)

Why it's better:

  • Soft attention lets important earlier timesteps contribute directly to the prediction.
  • Larger hidden size (64 vs 32) and more layers (3 vs 2) give more capacity.
  • Gradient clipping (max_norm=1.0) prevents training instability.
  • ReduceLROnPlateau halves the learning rate when validation RMSE stops improving.
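
A sketch of the attention-pooling idea; illustrative only, not necessarily the exact code in src/models/improved_lstm.py.

import torch
from torch import nn

class AttentionPoolLSTMSketch(nn.Module):
    def __init__(self, n_features: int = 3, hidden: int = 64) -> None:
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=3, dropout=0.1, batch_first=True)
        self.norm = nn.LayerNorm(hidden)
        self.attn = nn.Linear(hidden, 1, bias=False)       # learned vector w: one score per timestep
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(x)                                 # (B, T, H) -- all T hidden states kept
        h = self.norm(h)
        alpha = torch.softmax(self.attn(h), dim=1)          # (B, T, 1), sums to 1 over T
        context = (alpha * h).sum(dim=1)                    # (B, H) weighted average
        return self.head(context)                           # (B, 1)

The gradient clipping listed above would just be a torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) call between loss.backward() and opt.step() in the training loop.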

3. Transformer (encoder-only)

What it does: Processes all 256 timesteps in parallel using multi-head self-attention. Every timestep "looks at" every other timestep simultaneously - no sequential computation.

Architecture:

Linear(F=3 -> d=64)              embed each timestep into d_model dimensions
  |  PositionalEncoding          add position info (Transformers are order-agnostic)
  |  3 x TransformerEncoderLayer
       MultiHeadAttention(8 heads, head_dim=8)
       FeedForward(64 -> 256 -> 64)
       LayerNorm + residuals
  |  mean pool over time         collapse 256 tokens into 1 vector
  |  Linear(64 -> 1)

Positional encoding - because self-attention treats input as an unordered set, we must explicitly encode each position. We use the sinusoidal formula from "Attention is All You Need" (Vaswani et al., 2017):

PE(pos, 2i)   = sin(pos / 10000^(2i/64))
PE(pos, 2i+1) = cos(pos / 10000^(2i/64))

These fixed sinusoids are added to the input embeddings before attention, giving the model a unique "fingerprint" for each position without requiring any learned parameters.
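
A sketch of how these fixed encodings can be built for d_model=64 and T=256 (assumed helper name; the project code may organise this differently):

import math
import torch

def sinusoidal_pe(seq_len: int = 256, d_model: int = 64) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)              # (T, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                          # 1 / 10000^(2i/d)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                                         # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)                                         # odd dimensions
    return pe                                                                  # added to the (B, T, d_model) embeddings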

Multi-head attention - 8 attention heads each compute attention over an 8-dimensional subspace in parallel, then concatenate their outputs. This lets the model attend to different aspects of the sequence simultaneously.


4. TCN - Temporal Convolutional Network

What it does: Uses 1-D convolutions instead of recurrence or attention. The key innovations are causal and dilated convolutions stacked in a residual network.

From "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling" (Bai et al., 2018).

Architecture:

6 x TemporalBlock (dilation = 1, 2, 4, 8, 16, 32)
  each block:
    causal Conv1d(dilation=d, kernel=3) + WeightNorm + ReLU + Dropout
    causal Conv1d(dilation=d, kernel=3) + WeightNorm + ReLU + Dropout
    residual connection (1x1 conv if channels differ)
  |  mean pool over time
  |  Linear(64 -> 1)

Causal convolution - a regular Conv1d with padding=p pads both sides, so the output at time t can depend on inputs at time t+1, t+2, ... - this leaks the future. We fix this by setting padding=0 and manually left-padding only:

F.pad(x, (pad, 0))   # pad on the left; nothing on the right

Dilated convolution - with dilation d and kernel size k, the filter reads positions t, t-d, t-2d, ..., t-(k-1)d. A kernel of size 3 with dilation 32 covers timesteps that are 64 steps apart, giving a wide receptive field without extra parameters.
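
A sketch of a single causal, dilated convolution implementing the left-padding trick described above (illustrative; the real block in src/models/tcn.py adds weight norm, ReLU, dropout, and residuals):

import torch
import torch.nn.functional as F
from torch import nn

class CausalConv1dSketch(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3, dilation: int = 1) -> None:
        super().__init__()
        self.left_pad = (kernel - 1) * dilation                       # how far back the filter reaches
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, dilation=dilation, padding=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:               # x: (B, C, T)
        x = F.pad(x, (self.left_pad, 0))                               # pad on the left only
        return self.conv(x)                                            # output at t sees only t, t-d, t-2d, ...

Output length equals input length because the left padding exactly matches what the dilated kernel consumes, so stacking blocks preserves the (B, C, 256) shape.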

Receptive field - each causal conv with kernel size k and dilation d looks an extra (k-1)*d steps into the past. Stacking 6 blocks of two convs each, with dilations 1, 2, 4, 8, 16, 32 and kernel size 3:

total = 1 + 2 * (3-1) * (1+2+4+8+16+32) = 1 + 4 * 63 = 253 timesteps

This spans nearly the entire 256-step lookback window, and the mean pool over time lets every input step influence the prediction.

Weight normalisation - decouples each weight's magnitude from its direction (w = g * v/||v||), which can stabilise training without relying on batch statistics the way batch norm does.


Code Quality

This project uses ruff for linting and formatting. The configuration lives in pyproject.toml under [tool.ruff].

Ruff rules in use

Rule set               Code   What it checks
pycodestyle errors     E      Indentation, whitespace, and basic syntax style
pycodestyle warnings   W      Whitespace before comments, blank lines between blocks
pyflakes               F      Undefined names, unused imports, shadowed variables
isort                  I      Import ordering: stdlib first, then third-party, then first-party
pyupgrade              UP     Modern Python syntax (e.g. list[int] instead of List[int])
flake8-bugbear         B      Common bugs and design issues (e.g. mutable default args)
flake8-annotations     ANN    Type annotation coverage rules

Ignored rules (documented in pyproject.toml):

Code   Reason ignored
E221   Multiple spaces before operator - allowed for vertical alignment of related assignments
E241   Multiple spaces after comma - same reason
E203   Whitespace before : in slices (kept compatible with Black-style formatting)
E501   Line length handled separately (line-length = 200)

Additional Ruff config:

  • line-length = 200
  • target-version = "py312"
  • exclude = ["*.ipynb"]

Running ruff manually

# Lint
uv run ruff check src

# Auto-fix lint issues where possible
uv run ruff check --fix src

# Format
uv run ruff format src

Pre-commit hook

Pre-commit hooks are configured in .pre-commit-config.yaml and run Ruff lint/format checks before commit once installed.

# Run this to setup the pre-commit hook the first time
uv run pre-commit install

How to Run

Install uv (one-time local setup)

uv is required for dependency management and running commands in this project.

macOS / Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows (PowerShell):

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Alternative install method:

pipx install uv

Verify installation:

uv --version

For additional install options, see the official docs: https://docs.astral.sh/uv/getting-started/installation/

Project setup

Install dependencies:

uv sync

Install dev tools (ruff):

uv sync --group dev

Regenerate the synthetic data (optional - data is already committed):

uv run python src/generate.py

Run the full inference pipeline (recommended):

src/pipeline.py chains all four steps:

  1. data generation
  2. model training + export (.pth, .pt, .onnx)
  3. C++ build (inference, plus trt_inference if TensorRT is installed)
  4. unified benchmark (src/benchmark.py)

# Recommended command
uv run src/pipeline.py

ONNX Runtime resolution order in pipeline.py:

  1. --ort-root /path/to/onnxruntime
  2. ORT_ROOT environment variable
  3. $HOME/onnxruntime (if include/ + lib/ are present)
  4. system library search (apt-installed ORT)

Examples:

# Explicit ORT tarball location
uv run src/pipeline.py --ort-root $HOME/onnxruntime

# Same, but from environment variable
export ORT_ROOT=$HOME/onnxruntime
uv run src/pipeline.py

# Skip work already done
uv run src/pipeline.py --skip-generate --skip-train --skip-build

Or run each step individually:

Train all four models and produce comparison artifacts:

uv run python src/train.py

This will:

  1. Print per-epoch progress every 50 epochs for each model
  2. Save individual loss curves to artifacts/{ModelName}_loss.png
  3. Save a 6-panel comparison figure to artifacts/comparison.png
  4. Generate a synthetic holdout and save a 2x2 inference overlay figure to artifacts/inference_comparison.png
  5. Export each model to .pth (state dict), .pt (TorchScript), and .onnx in artifacts/onnx/
  6. Write holdout binary data (holdout_input.bin, holdout_target.bin, holdout_meta.json) to artifacts/onnx/
  7. Print a summary table with parameter counts, best val RMSE, and training times

All hyperparameters are in the CONFIG dict at the top of src/train.py. Inference holdout controls:

  • CONFIG["inference_holdout_enabled"]
  • CONFIG["inference_duration_s"]
  • CONFIG["inference_fs"]
  • CONFIG["inference_seed"]
  • CONFIG["inference_plot_path"]

ONNX export controls:

  • CONFIG["onnx_export_enabled"]
  • CONFIG["onnx_dir"]
  • CONFIG["onnx_opset"]

Results

Latest committed run (100 epochs):

Model          Params    Best Val RMSE   Train Time (s)   Epochs
NaiveLSTM       13,217   0.2821           3.9             100
ImprovedLSTM    84,481   0.2919           5.5             100
Transformer    150,273   0.2781          16.0             100
TCN            137,601   0.2815           9.1             100

ONNX Export

After training, src/train.py automatically exports each model to ONNX format when CONFIG["onnx_export_enabled"] is True (the default).

What is ONNX? ONNX (Open Neural Network Exchange) is a portable, language-agnostic format for representing trained models. Once a model is serialised to .onnx, it can be loaded by any runtime that implements the ONNX spec -- ONNX Runtime, TensorRT, OpenVINO, etc. -- without any Python or PyTorch dependency.

Export mechanics

All four models are exported with a fully static input shape of (1, 256, 3) (batch=1, time=256, features=3). Batch size is fixed to 1 because C++ inference always processes one window at a time. Export uses torch.onnx.export with dynamo=True (the torch.export-based exporter, the default from PyTorch 2.9 onward) and opset_version=17.

dynamo=True uses torch.export.export to capture the computation graph as an ExportedProgram before converting it to ONNX, rather than executing the model via TorchScript tracing. fallback=True is also set: if the dynamo-to-ONNX translator encounters an unsupported primitive op (a current onnxscript limitation for LSTM h₀/c₀ allocation, Transformer reshapes, and TCN weight_norm), it retries automatically with the legacy TorchScript path. The resulting .onnx file is identical either way.

Model          Export path            Note
NaiveLSTM      TorchScript fallback   prims.empty_strided (LSTM h₀/c₀ init) not yet in onnxscript
ImprovedLSTM   TorchScript fallback   Same reason as NaiveLSTM
Transformer    TorchScript fallback   prims.collapse_view (reshape before attention) not yet in onnxscript
TCN            TorchScript fallback   prims.copy_to (weight_norm copy-back) not yet in onnxscript

After export, each graph is validated with onnx.checker.check_model() and a round-trip forward pass via onnxruntime asserts the output shape is (1, 1).
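
A condensed sketch of that export-and-verify flow; the output path and the model variable are placeholders, and the real code in src/train.py wraps this per model:

import numpy as np
import onnx
import onnxruntime as ort
import torch

# model: one of the four trained nn.Module instances (placeholder here)
dummy = torch.zeros(1, 256, 3)                           # static (batch=1, time=256, features=3)
torch.onnx.export(model, (dummy,), "artifacts/onnx/Model.onnx",
                  opset_version=17, dynamo=True, fallback=True)
onnx.checker.check_model("artifacts/onnx/Model.onnx")    # structural validation
sess = ort.InferenceSession("artifacts/onnx/Model.onnx", providers=["CPUExecutionProvider"])
feed = {sess.get_inputs()[0].name: np.zeros((1, 256, 3), dtype=np.float32)}
(out,) = sess.run(None, feed)
assert out.shape == (1, 1)                               # round-trip shape check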

Holdout data serialisation

The same holdout windows used by the Python inference plots are written to raw binary files so the C++ binary can load them without any Python dependency. The pre-processing is identical to infer_on_dataframe(): features are scaled with the StandardScaler fitted on training data, then make_windows() builds the sliding windows. This guarantees C++ sees bit-identical floats to Python.
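
As a sanity check, the binaries can be read back with nothing but the metadata file. A sketch, assuming the float32 row-major layout described above:

import json
import numpy as np

with open("artifacts/onnx/holdout_meta.json") as f:
    meta = json.load(f)                                   # {n_windows, lookback, n_features}

x = np.fromfile("artifacts/onnx/holdout_input.bin", dtype=np.float32)
x = x.reshape(meta["n_windows"], meta["lookback"], meta["n_features"])
y = np.fromfile("artifacts/onnx/holdout_target.bin", dtype=np.float32)
assert y.shape == (meta["n_windows"],)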

Exported artifacts (written to artifacts/onnx/):

File                   Description
NaiveLSTM.pth          State dict (load with model class + torch.load)
ImprovedLSTM.pth       State dict
Transformer.pth        State dict
TCN.pth                State dict
NaiveLSTM.pt           TorchScript traced graph (torch.jit.load, no class needed)
ImprovedLSTM.pt        TorchScript traced graph
Transformer.pt         TorchScript traced graph
TCN.pt                 TorchScript traced graph
NaiveLSTM.onnx         ONNX graph (opset 17, static input (1,256,3))
ImprovedLSTM.onnx      ONNX graph
Transformer.onnx       ONNX graph
TCN.onnx               ONNX graph
holdout_input.bin      Raw float32 input windows, shape (N, 256, 3), row-major
holdout_target.bin     Raw float32 targets, shape (N,)
holdout_meta.json      {n_windows, lookback, n_features}
{Name}.engine          Cached TensorRT engine (written by trt_inference on first run)
cpp_metrics.json       ONNX/ORT C++ latency results
trt_metrics.json       TensorRT C++ latency results (GPU only)
python_metrics.json    Python .pth and .pt latency results
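
To illustrate the practical difference between the two PyTorch formats (class name and constructor below are hypothetical; check src/models/ for the real ones):

import torch
from src.models.naive_lstm import NaiveLSTM                # hypothetical import; adjust to the real class

# .pth: a state dict -- the model class is needed to rebuild the architecture first
model = NaiveLSTM()
model.load_state_dict(torch.load("artifacts/onnx/NaiveLSTM.pth", map_location="cpu"))
model.eval()

# .pt: a self-contained TorchScript graph -- no Python class required
scripted = torch.jit.load("artifacts/onnx/NaiveLSTM.pt", map_location="cpu")

x = torch.zeros(1, 256, 3)
with torch.no_grad():
    assert model(x).shape == scripted(x).shape == (1, 1)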

ONNX Runtime C++ Inference

cpp/inference.cpp benchmarks all ONNX models on:

  • CPUExecutionProvider (always)
  • CUDAExecutionProvider (when available)

It writes artifacts/onnx/cpp_metrics.json.

Install prerequisites

sudo apt update
sudo apt install -y build-essential cmake git

Install ONNX Runtime (choose one)

Option A: apt deb (CPU only, Ubuntu 24.10+)

sudo apt update
sudo apt install -y libonnxruntime libonnxruntime-dev

Option B: GitHub tar.gz (CPU or GPU, any Ubuntu)

CPU build:

cd ~
wget https://github.com/microsoft/onnxruntime/releases/download/v1.24.2/onnxruntime-linux-x64-1.24.2.tgz
tar -xf onnxruntime-linux-x64-1.24.2.tgz
mv onnxruntime-linux-x64-1.24.2 onnxruntime

GPU build:

cd ~
wget https://github.com/microsoft/onnxruntime/releases/download/v1.24.2/onnxruntime-linux-x64-gpu-1.24.2.tgz
tar -xf onnxruntime-linux-x64-gpu-1.24.2.tgz
mv onnxruntime-linux-x64-gpu-1.24.2 onnxruntime

Expected layout:

~/onnxruntime/
  include/
  lib/

Build C++ binaries

mkdir -p cpp/build
cd cpp/build

# apt install path (system detection)
cmake .. -DCMAKE_BUILD_TYPE=Release

# OR explicit tarball path
# cmake .. -DCMAKE_BUILD_TYPE=Release -DORT_ROOT=$HOME/onnxruntime

cmake --build . --parallel

Notes:

  • When ORT_ROOT points to a tarball install, CMake copies libonnxruntime*.so* and provider libs next to the binary.
  • TensorRT target is auto-enabled only if TensorRT headers/libs are present.

Run ONNX C++ inference directly

./cpp/build/inference artifacts/onnx

If running directly and using ORT GPU tarball, you may need runtime library paths:

export LD_LIBRARY_PATH="$HOME/onnxruntime/lib:$LD_LIBRARY_PATH"
./cpp/build/inference artifacts/onnx

Expected output includes CPU and, when configured correctly, CUDA blocks:

=== Provider: CPUExecutionProvider ===
...
=== Provider: CUDAExecutionProvider ===
...

If CUDA provider is unavailable, only CPU rows are emitted.


TensorRT C++ Inference

cpp/trt_inference.cpp builds/loads TensorRT engines and benchmarks GPU inference only. It writes artifacts/onnx/trt_metrics.json.

Install TensorRT

sudo apt update
sudo apt install -y tensorrt tensorrt-dev

Verify:

dpkg -l | grep tensorrt
ls /usr/include/x86_64-linux-gnu/NvInfer.h

Build

mkdir -p cpp/build
cd cpp/build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --parallel

If TensorRT is detected, CMake prints:

  • TensorRT found -- trt_inference target enabled

Run

./cpp/build/trt_inference artifacts/onnx

First run builds .engine files; later runs reuse cached engines.


Unified Inference Benchmark

src/benchmark.py runs:

  • Python .pth (CPU/GPU)
  • Python .pt (CPU/GPU)
  • ONNX Runtime C++ (CPU/GPU when available)
  • TensorRT C++ (GPU)

Run benchmark only

uv run python src/benchmark.py

Optional flags:

uv run python src/benchmark.py --skip-python
uv run python src/benchmark.py --onnx-dir artifacts/onnx --inference-bin cpp/build/inference --trt-bin cpp/build/trt_inference

Run full end-to-end pipeline

uv run src/pipeline.py

This is the preferred command for reproducible inference validation because it:

  • regenerates data
  • retrains and exports all formats
  • rebuilds C++ binaries
  • runs the unified benchmark

Why .pt LSTM GPU latency is higher than .pth

You may observe that Python (.pt) GPU latency for NaiveLSTM and ImprovedLSTM is significantly higher than Python (.pth) GPU latency. This is expected.

Reason:

  1. .pt exports are trace-based TorchScript modules.
  2. For traced LSTM .pt, PyTorch does not expose a usable flatten_parameters() method after load.
  3. cuDNN then emits the contiguous-weight warning (RNN module weights are not part of single contiguous chunk of memory).
  4. To guarantee warning-free inference, benchmark code disables cuDNN for .pt LSTM GPU runs.

Impact:

  • No warning spam during benchmark.
  • Outputs are unchanged and remain correct.
  • LSTM .pt GPU latency is higher because those runs use the non-cuDNN path.

Scope:

  • This applies only to Python .pt LSTM GPU rows.
  • .pth LSTM GPU runs still use cuDNN and remain fast.
  • Transformer and TCN .pt GPU rows are not affected.
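
The guard amounts to disabling cuDNN only around those runs, roughly like this; a sketch where traced_lstm and x are placeholders, not the exact code in src/benchmark.py:

import torch

prev = torch.backends.cudnn.enabled
torch.backends.cudnn.enabled = False          # avoid the contiguous-weight warning for traced LSTMs
try:
    with torch.no_grad():
        pred = traced_lstm(x.cuda())          # traced_lstm: a torch.jit.load-ed LSTM, x: (1, 256, 3)
finally:
    torch.backends.cudnn.enabled = prev       # restore cuDNN for every other model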

CUDA dependency handling for ONNX C++ benchmarks

When src/benchmark.py launches C++ binaries, it prepends CUDA library paths from the active uv environment (.venv/site-packages/nvidia/*/lib) to LD_LIBRARY_PATH. This prevents common ORT CUDA load failures such as:

  • Failed to load ... libonnxruntime_providers_cuda.so
  • libcudnn.so.9: cannot open shared object file

If you run C++ binaries manually, set LD_LIBRARY_PATH yourself.
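
A sketch of what that prepending looks like; the glob pattern below is an assumption about the venv layout, not necessarily the exact code in src/benchmark.py:

import glob
import os
import subprocess

cuda_libs = glob.glob(".venv/lib/python*/site-packages/nvidia/*/lib")     # cudnn, cublas, ...
env = os.environ.copy()
env["LD_LIBRARY_PATH"] = ":".join(cuda_libs + [env.get("LD_LIBRARY_PATH", "")])
subprocess.run(["cpp/build/inference", "artifacts/onnx"], env=env, check=True)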


Repository Structure

.
|-- LICENSE
|-- README.md
|-- pyproject.toml
|-- uv.lock
|-- artifacts
|   |-- comparison.png
|   |-- inference_comparison.png
|   |-- ImprovedLSTM_loss.png
|   |-- NaiveLSTM_loss.png
|   |-- TCN_loss.png
|   |-- Transformer_loss.png
|   `-- onnx
|       |-- NaiveLSTM.{pth,pt,onnx,engine}     <- all four export formats per model
|       |-- ImprovedLSTM.{pth,pt,onnx,engine}
|       |-- Transformer.{pth,pt,onnx,engine}
|       |-- TCN.{pth,pt,onnx,engine}
|       |-- holdout_input.bin
|       |-- holdout_target.bin
|       |-- holdout_meta.json
|       |-- cpp_metrics.json                    <- ONNX/ORT C++ results
|       |-- trt_metrics.json                    <- TensorRT C++ results
|       `-- python_metrics.json                 <- Python .pth/.pt results
|-- cpp
|   |-- CMakeLists.txt
|   |-- inference.cpp                           <- ONNX/ORT benchmark binary
|   `-- trt_inference.cpp                       <- TensorRT benchmark binary
|-- data
|   `-- input
|       |-- data_signals.csv
|       `-- data_signals.png
|-- notebooks
|   `-- explore.ipynb
`-- src
    |-- __init__.py
    |-- generate.py
    |-- path.py
    |-- pipeline.py                             <- end-to-end pipeline runner
    |-- train.py                                <- trains models, exports all formats
    |-- benchmark.py                            <- unified 8-row comparison table
    `-- models
        |-- __init__.py
        |-- improved_lstm.py
        |-- naive_lstm.py
        |-- tcn.py
        `-- transformer.py

Note: this tree uses plain ASCII to render reliably in Markdown preview. If you use lt/lsd, disable icons before pasting output into docs.
