scratch-ml

A learning project comparing four deep learning architectures on a time-series regression task: a naive LSTM, an attention-augmented LSTM, a Transformer encoder, and a temporal convolutional network (TCN). Each model is implemented from scratch, trained under identical conditions, and compared on the same data to understand the tradeoffs between recurrence, attention, and convolution.


The Data

The dataset is fully synthetic, generated by src/generate.py. It contains 2,000 timesteps sampled at 100 Hz over 20 seconds, rescaled to the range [1, 5] V to simulate a sensor reading.

Three input features (the signals the model can see):

Signal     Description
sine       1 Hz sine wave + small Gaussian noise
square     2 Hz square wave (50% duty cycle) + noise
triangle   1 Hz triangle wave + noise

One target (what the model must predict):

The target is a nonlinear mix of the three inputs, designed to require memory:

y_base   = 1.2*sine + 0.5*(sine * triangle_lag5) + 0.6*(square * triangle) + 0.3*triangle^2
envelope = 1 + 0.4*sin(2*pi * 0.2 * t)
target   = envelope * y_base + noise

Two things make this hard:

  • Lag - the triangle_lag5 term means the target depends on the triangle wave from 5 steps ago. A model that only looks at the current input will consistently miss this.
  • Slow envelope - the amplitude of the relationship changes sinusoidally over time (0.2 Hz), so the model must track both the fast signals and this slow modulation simultaneously.
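
To make the lag and envelope terms concrete, here is a minimal NumPy sketch of how such a target could be generated. It is illustrative only: variable names, the noise amplitude, and the omission of the per-signal noise and the [1, 5] V rescaling may not match src/generate.py.

import numpy as np

rng = np.random.default_rng(0)
t = np.arange(2000) / 100.0                                  # 20 s sampled at 100 Hz
sine     = np.sin(2 * np.pi * 1.0 * t)
square   = np.sign(np.sin(2 * np.pi * 2.0 * t))              # 2 Hz, 50% duty cycle
triangle = 2 * np.abs(2 * ((1.0 * t) % 1) - 1) - 1           # 1 Hz triangle in [-1, 1]
triangle_lag5 = np.roll(triangle, 5)                         # triangle from 5 steps ago (wraparound at the start ignored in this sketch)
y_base   = 1.2*sine + 0.5*(sine * triangle_lag5) + 0.6*(square * triangle) + 0.3*triangle**2
envelope = 1 + 0.4 * np.sin(2 * np.pi * 0.2 * t)
target   = envelope * y_base + 0.05 * rng.standard_normal(t.shape)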

Split: 60% training (1,200 samples) / 40% validation (800 samples), in temporal order. No shuffling - future data must never influence training.


Models

All four models share the same interface: they take a lookback window of shape (B, 256, 3) (batch x time x features) and output a single predicted value (B, 1).

They are trained under identical conditions: Adam at lr=0.001, batch size 64, MSE loss. Epochs are set in CONFIG["epochs"] in src/train.py.
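
For reference, that shared setup corresponds to a training loop roughly like the one below. This is a sketch only; the real loop in src/train.py also tracks validation RMSE, timing, plotting, and per-model extras such as gradient clipping and LR scheduling.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def fit(model: nn.Module, x: torch.Tensor, y: torch.Tensor, epochs: int) -> None:
    loader  = DataLoader(TensorDataset(x, y), batch_size=64)
    opt     = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for xb, yb in loader:                 # xb: (B, 256, 3), yb: (B, 1)
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()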


1. Naive LSTM

What it does: Processes the 256-step input sequentially, one timestep at a time, maintaining an internal "memory" (the hidden state). After seeing all 256 steps, it uses only the final hidden state to make a prediction.

Architecture:

LSTM(F=3 -> H=32, 2 layers, dropout=0.1)
  |  take h_T  (last timestep only)
Linear(32 -> 1)

The key weakness: The hidden states at steps h_1 ... h_{T-1} are thrown away. All the memory from earlier in the window is compressed into h_T, which may not preserve everything.

Training: Plain Adam, no gradient clipping, no LR schedule.
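
A minimal PyTorch sketch of this model, using the hyperparameters listed above; the actual class in src/models/naive_lstm.py may differ in detail.

import torch
from torch import nn

class NaiveLSTMSketch(nn.Module):
    def __init__(self, n_features: int = 3, hidden: int = 32) -> None:
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, dropout=0.1, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, 256, 3)
        _, (h_n, _) = self.lstm(x)                         # h_n: (num_layers, B, H)
        return self.head(h_n[-1])                          # last layer's h_T only -> (B, 1)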


2. Improved LSTM

What it does: Same LSTM, but instead of discarding h_1 ... h_{T-1}, it uses soft attention to compute a weighted average of all T hidden states. The model learns which timesteps are most informative and concentrates weight on those.

Architecture:

LSTM(F=3 -> H=64, 3 layers, dropout=0.1)
  |  LayerNorm(H)
  |  Attention:
       scores[t] = w * h_t              (learned vector w gives one scalar score per timestep)
       alpha     = softmax(scores)      (weights that sum to 1 across T)
       context   = sum_t alpha_t * h_t  (weighted average: shape H)
  |  Linear(64 -> 1)

Why it's better:

  • Soft attention lets important earlier timesteps contribute directly to the prediction.
  • Larger hidden size (64 vs 32) and more layers (3 vs 2) give more capacity.
  • Gradient clipping (max_norm=1.0) prevents training instability.
  • ReduceLROnPlateau halves the learning rate when validation RMSE stops improving.
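
A sketch of the attention-pooling idea; illustrative only, not necessarily the exact code in src/models/improved_lstm.py.

import torch
from torch import nn

class AttentionPoolLSTMSketch(nn.Module):
    def __init__(self, n_features: int = 3, hidden: int = 64) -> None:
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=3, dropout=0.1, batch_first=True)
        self.norm = nn.LayerNorm(hidden)
        self.attn = nn.Linear(hidden, 1, bias=False)       # learned vector w: one score per timestep
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(x)                                 # (B, T, H) -- all T hidden states kept
        h = self.norm(h)
        alpha = torch.softmax(self.attn(h), dim=1)          # (B, T, 1), sums to 1 over T
        context = (alpha * h).sum(dim=1)                    # (B, H) weighted average
        return self.head(context)                           # (B, 1)

The gradient clipping listed above would just be a torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) call between loss.backward() and opt.step() in the training loop.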

3. Transformer (encoder-only)

What it does: Processes all 256 timesteps in parallel using multi-head self-attention. Every timestep "looks at" every other timestep simultaneously - no sequential computation.

Architecture:

Linear(F=3 -> d=64)              embed each timestep into d_model dimensions
  |  PositionalEncoding          add position info (Transformers are order-agnostic)
  |  3 x TransformerEncoderLayer
       MultiHeadAttention(8 heads, head_dim=8)
       FeedForward(64 -> 256 -> 64)
       LayerNorm + residuals
  |  mean pool over time         collapse 256 tokens into 1 vector
  |  Linear(64 -> 1)

Positional encoding - because self-attention treats input as an unordered set, we must explicitly encode each position. We use the sinusoidal formula from "Attention is All You Need" (Vaswani et al., 2017):

PE(pos, 2i)   = sin(pos / 10000^(2i/64))
PE(pos, 2i+1) = cos(pos / 10000^(2i/64))

These fixed sinusoids are added to the input embeddings before attention, giving the model a unique "fingerprint" for each position without requiring any learned parameters.
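
A sketch of how these fixed encodings can be built for d_model=64 and T=256 (assumed helper name; the project code may organise this differently):

import math
import torch

def sinusoidal_pe(seq_len: int = 256, d_model: int = 64) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)              # (T, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                          # 1 / 10000^(2i/d)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                                         # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)                                         # odd dimensions
    return pe                                                                  # added to the (B, T, d_model) embeddings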

Multi-head attention - 8 attention heads each compute attention over an 8-dimensional subspace in parallel, then concatenate their outputs. This lets the model attend to different aspects of the sequence simultaneously.


4. TCN - Temporal Convolutional Network

What it does: Uses 1-D convolutions instead of recurrence or attention. The key innovations are causal and dilated convolutions stacked in a residual network.

From "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling" (Bai et al., 2018).

Architecture:

6 x TemporalBlock (dilation = 1, 2, 4, 8, 16, 32)
  each block:
    causal Conv1d(dilation=d, kernel=3) + WeightNorm + ReLU + Dropout
    causal Conv1d(dilation=d, kernel=3) + WeightNorm + ReLU + Dropout
    residual connection (1x1 conv if channels differ)
  |  mean pool over time
  |  Linear(64 -> 1)

Causal convolution - a regular Conv1d with padding=p pads both sides, so the output at time t can depend on inputs at time t+1, t+2, ... - this leaks the future. We fix this by setting padding=0 and manually left-padding only:

F.pad(x, (pad, 0))   # pad on the left; nothing on the right

Dilated convolution - with dilation d and kernel size k, the filter reads positions t, t-d, t-2d, ..., t-(k-1)d. A kernel of size 3 with dilation 32 covers timesteps that are 64 steps apart, giving a wide receptive field without extra parameters.
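
A sketch of a single causal, dilated convolution implementing the left-padding trick described above (illustrative; the real block in src/models/tcn.py adds weight norm, ReLU, dropout, and residuals):

import torch
import torch.nn.functional as F
from torch import nn

class CausalConv1dSketch(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3, dilation: int = 1) -> None:
        super().__init__()
        self.left_pad = (kernel - 1) * dilation                       # how far back the filter reaches
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, dilation=dilation, padding=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:               # x: (B, C, T)
        x = F.pad(x, (self.left_pad, 0))                               # pad on the left only
        return self.conv(x)                                            # output at t sees only t, t-d, t-2d, ...

Output length equals input length because the left padding exactly matches what the dilated kernel consumes, so stacking blocks preserves the (B, C, 256) shape.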

Receptive field - each causal conv with kernel size k and dilation d looks an extra (k-1)*d steps into the past. Stacking 6 blocks of two convs each, with dilations 1, 2, 4, 8, 16, 32 and kernel size 3:

total = 1 + 2 * (3-1) * (1+2+4+8+16+32) = 1 + 4 * 63 = 253 timesteps

This spans nearly the entire 256-step lookback window, and the mean pool over time lets every input step influence the prediction.

Weight normalisation - decouples each weight's magnitude from its direction (w = g * v/||v||), which can stabilise training without relying on batch statistics the way batch norm does.


Code Quality

This project uses ruff for linting and formatting. The configuration lives in pyproject.toml under [tool.ruff].

Ruff rules in use

Rule set               Code   What it checks
pycodestyle errors     E      Indentation, whitespace, and basic syntax style
pycodestyle warnings   W      Whitespace before comments, blank lines between blocks
pyflakes               F      Undefined names, unused imports, shadowed variables
isort                  I      Import ordering: stdlib first, then third-party, then first-party
pyupgrade              UP     Modern Python syntax (e.g. list[int] instead of List[int])
flake8-bugbear         B      Common bugs and design issues (e.g. mutable default args)
flake8-annotations     ANN    Type annotation coverage rules

Ignored rules (documented in pyproject.toml):

Code   Reason ignored
E221   Multiple spaces before operator - allowed for vertical alignment of related assignments
E241   Multiple spaces after comma - same reason
E203   Whitespace before : in slices (kept compatible with Black-style formatting)
E501   Line length handled separately (line-length = 200)

Additional Ruff config:

  • line-length = 200
  • target-version = "py312"
  • exclude = ["*.ipynb"]

Running ruff manually

# Lint
uv run ruff check src

# Auto-fix lint issues where possible
uv run ruff check --fix src

# Format
uv run ruff format src

Pre-commit hook

Pre-commit hooks are configured in .pre-commit-config.yaml and run Ruff lint/format checks before commit once installed.

# Run this to setup the pre-commit hook the first time
uv run pre-commit install

How to Run

Install uv (one-time local setup)

uv is required for dependency management and running commands in this project.

macOS / Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows (PowerShell):

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Alternative install method:

pipx install uv

Verify installation:

uv --version

For additional install options, see the official docs: https://docs.astral.sh/uv/getting-started/installation/

Project setup

Install dependencies:

uv sync

Install dev tools (ruff):

uv sync --group dev

Regenerate the synthetic data (optional - data is already committed):

uv run python src/generate.py

Run the full inference pipeline (recommended):

src/pipeline.py chains all four steps:

  1. data generation
  2. model training + export (.pth, .pt, .onnx)
  3. C++ build (inference, plus trt_inference if TensorRT is installed)
  4. unified benchmark (src/benchmark.py)

# Recommended command
uv run src/pipeline.py

ONNX Runtime resolution order in pipeline.py:

  1. --ort-root /path/to/onnxruntime
  2. ORT_ROOT environment variable
  3. $HOME/onnxruntime (if include/ + lib/ are present)
  4. system library search (apt-installed ORT)

Examples:

# Explicit ORT tarball location
uv run src/pipeline.py --ort-root $HOME/onnxruntime

# Same, but from environment variable
export ORT_ROOT=$HOME/onnxruntime
uv run src/pipeline.py

# Skip work already done
uv run src/pipeline.py --skip-generate --skip-train --skip-build

Or run each step individually:

Train all four models and produce comparison artifacts:

uv run python src/train.py

This will:

  1. Print per-epoch progress every 50 epochs for each model
  2. Save individual loss curves to artifacts/{ModelName}_loss.png
  3. Save a 6-panel comparison figure to artifacts/comparison.png
  4. Generate a synthetic holdout and save a 2x2 inference overlay figure to artifacts/inference_comparison.png
  5. Export each model to .pth (state dict), .pt (TorchScript), and .onnx in artifacts/onnx/
  6. Write holdout binary data (holdout_input.bin, holdout_target.bin, holdout_meta.json) to artifacts/onnx/
  7. Print a summary table with parameter counts, best val RMSE, and training times

All hyperparameters are in the CONFIG dict at the top of src/train.py. Inference holdout controls:

  • CONFIG["inference_holdout_enabled"]
  • CONFIG["inference_duration_s"]
  • CONFIG["inference_fs"]
  • CONFIG["inference_seed"]
  • CONFIG["inference_plot_path"]

ONNX export controls:

  • CONFIG["onnx_export_enabled"]
  • CONFIG["onnx_dir"]
  • CONFIG["onnx_opset"]

Results

Latest committed run (100 epochs):

Model          Params    Best Val RMSE   Train Time (s)   Epochs
NaiveLSTM       13,217   0.2821           3.9             100
ImprovedLSTM    84,481   0.2919           5.5             100
Transformer    150,273   0.2781          16.0             100
TCN            137,601   0.2815           9.1             100

ONNX Export

After training, src/train.py automatically exports each model to ONNX format when CONFIG["onnx_export_enabled"] is True (the default).

What is ONNX? ONNX (Open Neural Network Exchange) is a portable, language-agnostic format for representing trained models. Once a model is serialised to .onnx, it can be loaded by any runtime that implements the ONNX spec -- ONNX Runtime, TensorRT, OpenVINO, etc. -- without any Python or PyTorch dependency.

Export mechanics

All four models are exported with a fully static input shape of (1, 256, 3) (batch=1, time=256, features=3). Batch size is fixed to 1 because C++ inference always processes one window at a time. Export uses torch.onnx.export with dynamo=True (the torch.export-based exporter, the default from PyTorch 2.9 onward) and opset_version=17.

dynamo=True uses torch.export.export to capture the computation graph as an ExportedProgram before converting it to ONNX, rather than executing the model via TorchScript tracing. fallback=True is also set: if the dynamo-to-ONNX translator encounters an unsupported primitive op (a current onnxscript limitation for LSTM h₀/c₀ allocation, Transformer reshapes, and TCN weight_norm), it retries automatically with the legacy TorchScript path. The resulting .onnx file is identical either way.

Model          Export path            Note
NaiveLSTM      TorchScript fallback   prims.empty_strided (LSTM h₀/c₀ init) not yet in onnxscript
ImprovedLSTM   TorchScript fallback   Same reason as NaiveLSTM
Transformer    TorchScript fallback   prims.collapse_view (reshape before attention) not yet in onnxscript
TCN            TorchScript fallback   prims.copy_to (weight_norm copy-back) not yet in onnxscript

After export, each graph is validated with onnx.checker.check_model() and a round-trip forward pass via onnxruntime asserts the output shape is (1, 1).
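
A condensed sketch of that export-and-verify flow; the output path and the model variable are placeholders, and the real code in src/train.py wraps this per model:

import numpy as np
import onnx
import onnxruntime as ort
import torch

# model: one of the four trained nn.Module instances (placeholder here)
dummy = torch.zeros(1, 256, 3)                           # static (batch=1, time=256, features=3)
torch.onnx.export(model, (dummy,), "artifacts/onnx/Model.onnx",
                  opset_version=17, dynamo=True, fallback=True)
onnx.checker.check_model("artifacts/onnx/Model.onnx")    # structural validation
sess = ort.InferenceSession("artifacts/onnx/Model.onnx", providers=["CPUExecutionProvider"])
feed = {sess.get_inputs()[0].name: np.zeros((1, 256, 3), dtype=np.float32)}
(out,) = sess.run(None, feed)
assert out.shape == (1, 1)                               # round-trip shape check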

Holdout data serialisation

The same holdout windows used by the Python inference plots are written to raw binary files so the C++ binary can load them without any Python dependency. The pre-processing is identical to infer_on_dataframe(): features are scaled with the StandardScaler fitted on training data, then make_windows() builds the sliding windows. This guarantees C++ sees bit-identical floats to Python.
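
As a sanity check, the binaries can be read back with nothing but the metadata file. A sketch, assuming the float32 row-major layout described above:

import json
import numpy as np

with open("artifacts/onnx/holdout_meta.json") as f:
    meta = json.load(f)                                   # {n_windows, lookback, n_features}

x = np.fromfile("artifacts/onnx/holdout_input.bin", dtype=np.float32)
x = x.reshape(meta["n_windows"], meta["lookback"], meta["n_features"])
y = np.fromfile("artifacts/onnx/holdout_target.bin", dtype=np.float32)
assert y.shape == (meta["n_windows"],)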

Exported artifacts (written to artifacts/onnx/):

File                   Description
NaiveLSTM.pth          State dict (load with model class + torch.load)
ImprovedLSTM.pth       State dict
Transformer.pth        State dict
TCN.pth                State dict
NaiveLSTM.pt           TorchScript traced graph (torch.jit.load, no class needed)
ImprovedLSTM.pt        TorchScript traced graph
Transformer.pt         TorchScript traced graph
TCN.pt                 TorchScript traced graph
NaiveLSTM.onnx         ONNX graph (opset 17, static input (1,256,3))
ImprovedLSTM.onnx      ONNX graph
Transformer.onnx       ONNX graph
TCN.onnx               ONNX graph
holdout_input.bin      Raw float32 input windows, shape (N, 256, 3), row-major
holdout_target.bin     Raw float32 targets, shape (N,)
holdout_meta.json      {n_windows, lookback, n_features}
{Name}.engine          Cached TensorRT engine (written by trt_inference on first run)
cpp_metrics.json       ONNX/ORT C++ latency results
trt_metrics.json       TensorRT C++ latency results (GPU only)
python_metrics.json    Python .pth and .pt latency results
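
To illustrate the practical difference between the two PyTorch formats (class name and constructor below are hypothetical; check src/models/ for the real ones):

import torch
from src.models.naive_lstm import NaiveLSTM                # hypothetical import; adjust to the real class

# .pth: a state dict -- the model class is needed to rebuild the architecture first
model = NaiveLSTM()
model.load_state_dict(torch.load("artifacts/onnx/NaiveLSTM.pth", map_location="cpu"))
model.eval()

# .pt: a self-contained TorchScript graph -- no Python class required
scripted = torch.jit.load("artifacts/onnx/NaiveLSTM.pt", map_location="cpu")

x = torch.zeros(1, 256, 3)
with torch.no_grad():
    assert model(x).shape == scripted(x).shape == (1, 1)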

ONNX Runtime C++ Inference

cpp/inference.cpp benchmarks all ONNX models on:

  • CPUExecutionProvider (always)
  • CUDAExecutionProvider (when available)

It writes artifacts/onnx/cpp_metrics.json.

Install prerequisites

sudo apt update
sudo apt install -y build-essential cmake git

Install ONNX Runtime (choose one)

Option A: apt deb (CPU only, Ubuntu 24.10+)

sudo apt update
sudo apt install -y libonnxruntime libonnxruntime-dev

Option B: GitHub tar.gz (CPU or GPU, any Ubuntu)

CPU build:

cd ~
wget https://github.com/microsoft/onnxruntime/releases/download/v1.24.2/onnxruntime-linux-x64-1.24.2.tgz
tar -xf onnxruntime-linux-x64-1.24.2.tgz
mv onnxruntime-linux-x64-1.24.2 onnxruntime

GPU build:

cd ~
wget https://github.com/microsoft/onnxruntime/releases/download/v1.24.2/onnxruntime-linux-x64-gpu-1.24.2.tgz
tar -xf onnxruntime-linux-x64-gpu-1.24.2.tgz
mv onnxruntime-linux-x64-gpu-1.24.2 onnxruntime

Expected layout:

~/onnxruntime/
  include/
  lib/

Build C++ binaries

mkdir -p cpp/build
cd cpp/build

# apt install path (system detection)
cmake .. -DCMAKE_BUILD_TYPE=Release

# OR explicit tarball path
# cmake .. -DCMAKE_BUILD_TYPE=Release -DORT_ROOT=$HOME/onnxruntime

cmake --build . --parallel

Notes:

  • When ORT_ROOT points to a tarball install, CMake copies libonnxruntime*.so* and provider libs next to the binary.
  • TensorRT target is auto-enabled only if TensorRT headers/libs are present.

Run ONNX C++ inference directly

./cpp/build/inference artifacts/onnx

If running directly and using ORT GPU tarball, you may need runtime library paths:

export LD_LIBRARY_PATH="$HOME/onnxruntime/lib:$LD_LIBRARY_PATH"
./cpp/build/inference artifacts/onnx

Expected output includes CPU and, when configured correctly, CUDA blocks:

=== Provider: CPUExecutionProvider ===
...
=== Provider: CUDAExecutionProvider ===
...

If CUDA provider is unavailable, only CPU rows are emitted.


TensorRT C++ Inference

cpp/trt_inference.cpp builds/loads TensorRT engines and benchmarks GPU inference only. It writes artifacts/onnx/trt_metrics.json.

Install TensorRT

sudo apt update
sudo apt install -y tensorrt tensorrt-dev

Verify:

dpkg -l | grep tensorrt
ls /usr/include/x86_64-linux-gnu/NvInfer.h

Build

mkdir -p cpp/build
cd cpp/build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --parallel

If TensorRT is detected, CMake prints:

  • TensorRT found -- trt_inference target enabled

Run

./cpp/build/trt_inference artifacts/onnx

First run builds .engine files; later runs reuse cached engines.


Unified Inference Benchmark

src/benchmark.py runs:

  • Python .pth (CPU/GPU)
  • Python .pt (CPU/GPU)
  • ONNX Runtime C++ (CPU/GPU when available)
  • TensorRT C++ (GPU)

Run benchmark only

uv run python src/benchmark.py

Optional flags:

uv run python src/benchmark.py --skip-python
uv run python src/benchmark.py --onnx-dir artifacts/onnx --inference-bin cpp/build/inference --trt-bin cpp/build/trt_inference

Run full end-to-end pipeline

uv run src/pipeline.py

This is the preferred command for reproducible inference validation because it:

  • regenerates data
  • retrains and exports all formats
  • rebuilds C++ binaries
  • runs the unified benchmark

Why .pt LSTM GPU latency is higher than .pth

You may observe that Python (.pt) GPU latency for NaiveLSTM and ImprovedLSTM is significantly higher than Python (.pth) GPU latency. This is expected.

Reason:

  1. .pt exports are trace-based TorchScript modules.
  2. For traced LSTM .pt, PyTorch does not expose a usable flatten_parameters() method after load.
  3. cuDNN then emits the contiguous-weight warning (RNN module weights are not part of single contiguous chunk of memory).
  4. To guarantee warning-free inference, benchmark code disables cuDNN for .pt LSTM GPU runs.

Impact:

  • No warning spam during benchmark.
  • Outputs are unchanged and remain correct.
  • LSTM .pt GPU latency is higher because those runs use the non-cuDNN path.

Scope:

  • This applies only to Python .pt LSTM GPU rows.
  • .pth LSTM GPU runs still use cuDNN and remain fast.
  • Transformer and TCN .pt GPU rows are not affected.
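
The guard amounts to disabling cuDNN only around those runs, roughly like this; a sketch where traced_lstm and x are placeholders, not the exact code in src/benchmark.py:

import torch

prev = torch.backends.cudnn.enabled
torch.backends.cudnn.enabled = False          # avoid the contiguous-weight warning for traced LSTMs
try:
    with torch.no_grad():
        pred = traced_lstm(x.cuda())          # traced_lstm: a torch.jit.load-ed LSTM, x: (1, 256, 3)
finally:
    torch.backends.cudnn.enabled = prev       # restore cuDNN for every other model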

CUDA dependency handling for ONNX C++ benchmarks

When src/benchmark.py launches C++ binaries, it prepends CUDA library paths from the active uv environment (.venv/site-packages/nvidia/*/lib) to LD_LIBRARY_PATH. This prevents common ORT CUDA load failures such as:

  • Failed to load ... libonnxruntime_providers_cuda.so
  • libcudnn.so.9: cannot open shared object file

If you run C++ binaries manually, set LD_LIBRARY_PATH yourself.
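
A sketch of what that prepending looks like; the glob pattern below is an assumption about the venv layout, not necessarily the exact code in src/benchmark.py:

import glob
import os
import subprocess

cuda_libs = glob.glob(".venv/lib/python*/site-packages/nvidia/*/lib")     # cudnn, cublas, ...
env = os.environ.copy()
env["LD_LIBRARY_PATH"] = ":".join(cuda_libs + [env.get("LD_LIBRARY_PATH", "")])
subprocess.run(["cpp/build/inference", "artifacts/onnx"], env=env, check=True)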


Repository Structure

.
|-- LICENSE
|-- README.md
|-- pyproject.toml
|-- uv.lock
|-- artifacts
|   |-- comparison.png
|   |-- inference_comparison.png
|   |-- ImprovedLSTM_loss.png
|   |-- NaiveLSTM_loss.png
|   |-- TCN_loss.png
|   |-- Transformer_loss.png
|   `-- onnx
|       |-- NaiveLSTM.{pth,pt,onnx,engine}     <- all four export formats per model
|       |-- ImprovedLSTM.{pth,pt,onnx,engine}
|       |-- Transformer.{pth,pt,onnx,engine}
|       |-- TCN.{pth,pt,onnx,engine}
|       |-- holdout_input.bin
|       |-- holdout_target.bin
|       |-- holdout_meta.json
|       |-- cpp_metrics.json                    <- ONNX/ORT C++ results
|       |-- trt_metrics.json                    <- TensorRT C++ results
|       `-- python_metrics.json                 <- Python .pth/.pt results
|-- cpp
|   |-- CMakeLists.txt
|   |-- inference.cpp                           <- ONNX/ORT benchmark binary
|   `-- trt_inference.cpp                       <- TensorRT benchmark binary
|-- data
|   `-- input
|       |-- data_signals.csv
|       `-- data_signals.png
|-- notebooks
|   `-- explore.ipynb
`-- src
    |-- __init__.py
    |-- generate.py
    |-- path.py
    |-- pipeline.py                             <- end-to-end pipeline runner
    |-- train.py                                <- trains models, exports all formats
    |-- benchmark.py                            <- unified 8-row comparison table
    `-- models
        |-- __init__.py
        |-- improved_lstm.py
        |-- naive_lstm.py
        |-- tcn.py
        `-- transformer.py

Note: this tree uses plain ASCII to render reliably in Markdown preview. If you use lt/lsd, disable icons before pasting output into docs.
