A learning project for comparing four deep learning architectures on a time-series regression task. The goal is to understand the tradeoffs between LSTMs, Transformers, and convolutional networks by implementing each one from scratch, training them under identical conditions, and comparing the results.
The dataset is fully synthetic, generated by src/generate.py. It contains 2,000 timesteps
sampled at 100 Hz over 20 seconds, rescaled to the range [1, 5] V to simulate a sensor reading.
Three input features (the signals the model can see):
| Signal | Description |
|---|---|
| sine | 1 Hz sine wave + small Gaussian noise |
| square | 2 Hz square wave (50% duty cycle) + noise |
| triangle | 1 Hz triangle wave + noise |
One target (what the model must predict):
The target is a nonlinear mix of the three inputs, designed to require memory:
y_base = 1.2*sine + 0.5*(sine * triangle_lag5) + 0.6*(square * triangle) + 0.3*triangle^2
envelope = 1 + 0.4*sin(2*pi * 0.2 * t)
target = envelope * y_base + noise
Two things make this hard:
- Lag - the `triangle_lag5` term means the target depends on the triangle wave from 5 steps ago. A model that only looks at the current input will consistently miss this.
- Slow envelope - the amplitude of the relationship changes sinusoidally over time (0.2 Hz), so the model must track both the fast signals and this slow modulation simultaneously.
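For concreteness, a minimal sketch of how such a target could be generated (the real code is src/generate.py; signal amplitudes, noise scales, the seed, and the triangle-wave phase below are illustrative assumptions, and the final rescaling to [1, 5] V is omitted):

```python
import numpy as np

fs, duration = 100, 20                      # 100 Hz for 20 s -> 2,000 timesteps
t = np.arange(fs * duration) / fs
rng = np.random.default_rng(0)              # illustrative seed

# Three noisy input signals (amplitudes and noise scale are assumptions)
sine = np.sin(2 * np.pi * 1.0 * t) + 0.05 * rng.standard_normal(t.size)
square = np.sign(np.sin(2 * np.pi * 2.0 * t)) + 0.05 * rng.standard_normal(t.size)
triangle = 2 * np.abs(2 * (t % 1.0) - 1) - 1 + 0.05 * rng.standard_normal(t.size)

# triangle_lag5: the triangle wave delayed by 5 samples
triangle_lag5 = np.roll(triangle, 5)
triangle_lag5[:5] = triangle[0]             # crude fill for the first 5 samples

y_base = (1.2 * sine
          + 0.5 * sine * triangle_lag5
          + 0.6 * square * triangle
          + 0.3 * triangle ** 2)
envelope = 1 + 0.4 * np.sin(2 * np.pi * 0.2 * t)
target = envelope * y_base + 0.05 * rng.standard_normal(t.size)
```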
Split: 60% training (1,200 samples) / 40% validation (800 samples), in temporal order. No shuffling - future data must never influence training.
All four models share the same interface: they take a lookback window of shape (B, 256, 3) (batch x time x features) and output a single predicted value (B, 1).
They are trained under identical conditions: Adam at lr=0.001, batch size 64, MSE loss.
Epochs are set in CONFIG["epochs"] in src/train.py.
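As a concrete illustration of that shared contract, one training step looks roughly like the sketch below; the `model` here is only a stand-in with the same input/output shapes, while the optimiser, learning rate, batch size, and loss match the settings above:

```python
import torch
from torch import nn

# Stand-in for any of the four models: maps a (B, 256, 3) window to a (B, 1) prediction
model = nn.Sequential(nn.Flatten(), nn.Linear(256 * 3, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, lr=0.001
loss_fn = nn.MSELoss()

x = torch.randn(64, 256, 3)   # one batch of lookback windows (batch size 64)
y = torch.randn(64, 1)        # matching targets

pred = model(x)               # every model returns shape (B, 1)
loss = loss_fn(pred, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```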
What it does: Processes the 256-step input sequentially, one timestep at a time, maintaining an internal "memory" (the hidden state). After seeing all 256 steps, it uses only the final hidden state to make a prediction.
Architecture:
LSTM(F=3 -> H=32, 2 layers, dropout=0.1)
| take h_T (last timestep only)
Linear(32 -> 1)
The key weakness: The hidden states at steps h_1 ... h_{T-1} are thrown away. All the memory from earlier in the window is compressed into h_T, which may not preserve everything.
Training: Plain Adam, no gradient clipping, no LR schedule.
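A minimal sketch of this architecture (class and variable names are illustrative, not those in src/models/naive_lstm.py):

```python
import torch
from torch import nn

class NaiveLSTMSketch(nn.Module):
    """Illustrative re-implementation: 2-layer LSTM, last hidden state -> Linear."""

    def __init__(self, n_features: int = 3, hidden: int = 32) -> None:
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            dropout=0.1, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)          # out: (B, 256, 32) - all hidden states
        return self.head(out[:, -1])   # keep only h_T, discard the rest
```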
What it does: Same LSTM, but instead of discarding h_1 ... h_{T-1}, it uses soft attention to compute a weighted average of all T hidden states. The model learns which timesteps are most informative and concentrates weight on those.
Architecture:
LSTM(F=3 -> H=64, 3 layers, dropout=0.1)
| LayerNorm(H)
| Attention:
scores[t] = w * h_t (one learned scalar per hidden state)
alpha = softmax(scores) (weights that sum to 1 across T)
context = sum_t alpha_t * h_t (weighted average: shape H)
| Linear(64 -> 1)
Why it's better:
- Soft attention lets important earlier timesteps contribute directly to the prediction.
- Larger hidden size (64 vs 32) and more layers (3 vs 2) give more capacity.
- Gradient clipping (`max_norm=1.0`) prevents training instability. `ReduceLROnPlateau` halves the learning rate when validation RMSE stops improving.
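The attention pooling step itself is only a few lines; this sketch mirrors the equations above, though the exact module layout in src/models/improved_lstm.py may differ:

```python
import torch
from torch import nn

class AttentionPoolSketch(nn.Module):
    """Soft attention over T hidden states: scores -> softmax -> weighted average."""

    def __init__(self, hidden: int = 64) -> None:
        super().__init__()
        self.w = nn.Linear(hidden, 1, bias=False)   # one learned scalar score per h_t

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, H) - all LSTM hidden states, already layer-normalised
        scores = self.w(h)                          # (B, T, 1)
        alpha = torch.softmax(scores, dim=1)        # weights sum to 1 across T
        return (alpha * h).sum(dim=1)               # (B, H) weighted average
```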
What it does: Processes all 256 timesteps in parallel using multi-head self-attention. Every timestep "looks at" every other timestep simultaneously - no sequential computation.
Architecture:
Linear(F=3 -> d=64) embed each timestep into d_model dimensions
| PositionalEncoding add position info (Transformers are order-agnostic)
| 3 x TransformerEncoderLayer
MultiHeadAttention(8 heads, head_dim=8)
FeedForward(64 -> 256 -> 64)
LayerNorm + residuals
| mean pool over time collapse 256 tokens into 1 vector
| Linear(64 -> 1)
Positional encoding - because self-attention treats input as an unordered set, we must explicitly encode each position. We use the sinusoidal formula from "Attention is All You Need" (Vaswani et al., 2017):
PE(pos, 2i) = sin(pos / 10000^(2i/64))
PE(pos, 2i+1) = cos(pos / 10000^(2i/64))
These fixed sinusoids are added to the input embeddings before attention, giving the model a unique "fingerprint" for each position without requiring any learned parameters.
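A sketch of those fixed sinusoids for d_model = 64 (function and variable names are illustrative):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int = 256, d_model: int = 64) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = torch.arange(seq_len).unsqueeze(1).float()       # (T, 1)
    two_i = torch.arange(0, d_model, 2).float()            # even dimension indices 2i
    div = torch.exp(-math.log(10000.0) * two_i / d_model)  # 1 / 10000^(2i/d)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                              # added to the (T, d) embeddings
```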
Multi-head attention - 8 attention heads each compute attention over an 8-dimensional subspace in parallel, then concatenate their outputs. This lets the model attend to different aspects of the sequence simultaneously.
What it does: Uses 1-D convolutions instead of recurrence or attention. The key innovations are causal and dilated convolutions stacked in a residual network.
From "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling" (Bai et al., 2018).
Architecture:
6 x TemporalBlock (dilation = 1, 2, 4, 8, 16, 32)
each block:
causal Conv1d(dilation=d, kernel=3) + WeightNorm + ReLU + Dropout
causal Conv1d(dilation=d, kernel=3) + WeightNorm + ReLU + Dropout
residual connection (1x1 conv if channels differ)
| mean pool over time
| Linear(64 -> 1)
Causal convolution - a regular Conv1d with padding=p pads both sides, so the output at
time t can depend on inputs at time t+1, t+2, ... - this leaks the future. We fix this by
setting padding=0 and manually left-padding only:
F.pad(x, (pad, 0))  # pad on the left; nothing on the right

Dilated convolution - with dilation d and kernel size k, the filter reads positions t, t-d, t-2d, ..., t-(k-1)d. A kernel of size 3 with dilation 32 covers timesteps that are 64 steps apart, giving a wide receptive field without extra parameters.
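A minimal sketch of one causal, dilated convolution along these lines (the real TemporalBlock in src/models/tcn.py adds weight norm, dropout, and the residual connection; names here are illustrative):

```python
import torch
from torch import nn
import torch.nn.functional as F

class CausalConv1dSketch(nn.Module):
    """Left-pad by (kernel - 1) * dilation so the output at t never sees t+1, t+2, ..."""

    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3, dilation: int = 1) -> None:
        super().__init__()
        self.pad = (kernel - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, dilation=dilation, padding=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T); pad on the left only, nothing on the right
        return self.conv(F.pad(x, (self.pad, 0)))
```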
Receptive field - each block applies two causal convolutions, so stacking 6 blocks with dilation 1, 2, 4, 8, 16, 32 and kernel size 3 gives:
total = 1 + 2 * (3-1) * (1+2+4+8+16+32) = 1 + 4 * 63 = 253 timesteps
This spans almost the entire 256-step lookback window.
Weight normalisation - decouples the weight's magnitude from its direction
(w = g * v/||v||), which can stabilise training compared to plain batch norm.
This project uses ruff for linting and formatting.
The configuration lives in pyproject.toml under [tool.ruff].
| Rule set | Code | What it checks |
|---|---|---|
| pycodestyle errors | E | Indentation, whitespace, and basic syntax style |
| pycodestyle warnings | W | Whitespace before comments, blank lines between blocks |
| pyflakes | F | Undefined names, unused imports, shadowed variables |
| isort | I | Import ordering: stdlib first, then third-party, then first-party |
| pyupgrade | UP | Modern Python syntax (e.g. `list[int]` instead of `List[int]`) |
| flake8-bugbear | B | Common bugs and design issues (e.g. mutable default args) |
| flake8-annotations | ANN | Type annotation coverage rules |
Ignored rules (documented in pyproject.toml):
| Code | Reason ignored |
|---|---|
| E221 | Multiple spaces before operator - allowed for vertical alignment of related assignments |
| E241 | Multiple spaces after comma - same reason |
| E203 | Whitespace before `:` in slices (kept compatible with Black-style formatting) |
| E501 | Line length handled separately (`line-length = 200`) |
Additional Ruff config:
- `line-length = 200`
- `target-version = "py312"`
- `exclude = ["*.ipynb"]`
# Lint
uv run ruff check src
# Auto-fix lint issues where possible
uv run ruff check --fix src
# Format
uv run ruff format src

Pre-commit hooks are configured in .pre-commit-config.yaml and run Ruff lint/format checks
before commit once installed.
# Run this to setup the pre-commit hook the first time
uv run pre-commit install

uv is required for dependency management and running commands in this project.
macOS / Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh

Windows (PowerShell):

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Alternative install method:

pipx install uv

Verify installation:

uv --version

For additional install options, see the official docs: https://docs.astral.sh/uv/getting-started/installation/
Install dependencies:
uv sync

Install dev tools (ruff):

uv sync --group dev

Regenerate the synthetic data (optional - data is already committed):

uv run python src/generate.py

Run the full inference pipeline (recommended):
src/pipeline.py chains all four steps:
- data generation
- model training + export (`.pth`, `.pt`, `.onnx`)
- C++ build (`inference`, plus `trt_inference` if TensorRT is installed)
- unified benchmark (`src/benchmark.py`)
# Recommended command
uv run src/pipeline.py

ONNX Runtime resolution order in pipeline.py:
- `--ort-root /path/to/onnxruntime`
- `ORT_ROOT` environment variable
- `$HOME/onnxruntime` (if `include/` + `lib/` are present)
- system library search (apt-installed ORT)
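For reference, that resolution order amounts to something like the sketch below; the function name and exact checks are illustrative, not copied from src/pipeline.py:

```python
import os
from pathlib import Path

def resolve_ort_root(cli_ort_root: str | None) -> Path | None:
    """Return the ONNX Runtime root directory, or None to fall back to the system library search."""
    candidates = [
        cli_ort_root,                     # 1. --ort-root flag
        os.environ.get("ORT_ROOT"),       # 2. ORT_ROOT environment variable
        Path.home() / "onnxruntime",      # 3. $HOME/onnxruntime
    ]
    for cand in candidates:
        if cand is None:
            continue
        root = Path(cand)
        if (root / "include").is_dir() and (root / "lib").is_dir():
            return root
    return None                           # 4. system library search (apt-installed ORT)
```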
Examples:
# Explicit ORT tarball location
uv run src/pipeline.py --ort-root $HOME/onnxruntime
# Same, but from environment variable
export ORT_ROOT=$HOME/onnxruntime
uv run src/pipeline.py
# Skip work already done
uv run src/pipeline.py --skip-generate --skip-train --skip-build

Or run each step individually:
Train all four models and produce comparison artifacts:
uv run python src/train.py

This will:
- Print training progress every 50 epochs for each model
- Save individual loss curves to `artifacts/{ModelName}_loss.png`
- Save a 6-panel comparison figure to `artifacts/comparison.png`
- Generate a synthetic holdout and save a 2x2 inference overlay figure to `artifacts/inference_comparison.png`
- Export each model to `.pth` (state dict), `.pt` (TorchScript), and `.onnx` in `artifacts/onnx/`
- Write holdout binary data (`holdout_input.bin`, `holdout_target.bin`, `holdout_meta.json`) to `artifacts/onnx/`
- Print a summary table with parameter counts, best val RMSE, and training times
All hyperparameters are in the CONFIG dict at the top of src/train.py.
Inference holdout controls:
CONFIG["inference_holdout_enabled"]CONFIG["inference_duration_s"]CONFIG["inference_fs"]CONFIG["inference_seed"]CONFIG["inference_plot_path"]
ONNX export controls:
CONFIG["onnx_export_enabled"]CONFIG["onnx_dir"]CONFIG["onnx_opset"]
Latest committed run (100 epochs):
| Model | Params | Best Val RMSE | Train Time (s) | Epochs |
|---|---|---|---|---|
| NaiveLSTM | 13,217 | 0.2821 | 3.9 | 100 |
| ImprovedLSTM | 84,481 | 0.2919 | 5.5 | 100 |
| Transformer | 150,273 | 0.2781 | 16.0 | 100 |
| TCN | 137,601 | 0.2815 | 9.1 | 100 |
After training, src/train.py automatically exports each model to ONNX format when
CONFIG["onnx_export_enabled"] is True (the default).
What is ONNX?
ONNX (Open Neural Network Exchange) is a portable, language-agnostic format for
representing trained models. Once a model is serialised to .onnx, it can be
loaded by any runtime that implements the ONNX spec -- ONNX Runtime, TensorRT,
OpenVINO, etc. -- without any Python or PyTorch dependency.
Export mechanics
All four models are exported with a fully static input shape of (1, 256, 3)
(batch=1, time=256, features=3). Batch size is fixed to 1 because C++ inference
always processes one window at a time. Export uses torch.onnx.export with
dynamo=True (the torch.export-based exporter, the default from PyTorch 2.9
onward) and opset_version=17.
dynamo=True uses torch.export.export to capture the computation graph as an
ExportedProgram before converting it to ONNX, rather than executing the model
via TorchScript tracing. fallback=True is also set: if the dynamo-to-ONNX
translator encounters an unsupported primitive op (a current onnxscript limitation
for LSTM h₀/c₀ allocation, Transformer reshapes, and TCN weight_norm), it retries
automatically with the legacy TorchScript path. The resulting .onnx file is
identical either way.
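In isolation, the export call presumably resembles this sketch; the model and output path are placeholders, and `dynamo`, `fallback`, and `opset_version` are keyword arguments of `torch.onnx.export` in recent PyTorch releases:

```python
import torch
from torch import nn

# Placeholder model with the project's shared interface: (1, 256, 3) -> (1, 1)
model = nn.Sequential(nn.Flatten(), nn.Linear(256 * 3, 1)).eval()
dummy = torch.randn(1, 256, 3)           # static (batch=1, time=256, features=3)

torch.onnx.export(
    model,
    (dummy,),
    "example.onnx",                      # the real paths are artifacts/onnx/{Name}.onnx
    opset_version=17,
    dynamo=True,                         # torch.export-based exporter
    fallback=True,                       # retry via the legacy TorchScript path if needed
)
```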
| Model | Export path | Note |
|---|---|---|
| NaiveLSTM | TorchScript fallback | prims.empty_strided (LSTM h₀/c₀ init) not yet in onnxscript |
| ImprovedLSTM | TorchScript fallback | Same reason as NaiveLSTM |
| Transformer | TorchScript fallback | prims.collapse_view (reshape before attention) not yet in onnxscript |
| TCN | TorchScript fallback | prims.copy_to (weight_norm copy-back) not yet in onnxscript |
After export, each graph is validated with onnx.checker.check_model() and a
round-trip forward pass via onnxruntime asserts the output shape is (1, 1).
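A hedged sketch of that validation step (the real code lives in src/train.py; the path and array names here are illustrative):

```python
import numpy as np
import onnx
import onnxruntime as ort

path = "artifacts/onnx/NaiveLSTM.onnx"               # example exported model
onnx.checker.check_model(onnx.load(path))            # structural validation

sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
x = np.random.randn(1, 256, 3).astype(np.float32)
(out,) = sess.run(None, {sess.get_inputs()[0].name: x})
assert out.shape == (1, 1)                           # round-trip forward pass
```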
Holdout data serialisation
The same holdout windows used by the Python inference plots are written to raw
binary files so the C++ binary can load them without any Python dependency.
The pre-processing is identical to infer_on_dataframe(): features are scaled
with the StandardScaler fitted on training data, then make_windows() builds
the sliding windows. This guarantees C++ sees bit-identical floats to Python.
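The serialisation itself amounts to little more than the following sketch (the arrays are placeholders for the scaled windows and targets produced in src/train.py):

```python
import json
from pathlib import Path
import numpy as np

out_dir = Path("artifacts/onnx")
out_dir.mkdir(parents=True, exist_ok=True)

windows = np.zeros((10, 256, 3), dtype=np.float32)   # placeholder: scaled sliding windows
targets = np.zeros((10,), dtype=np.float32)          # placeholder: matching targets

windows.tofile(out_dir / "holdout_input.bin")        # raw float32, row-major (N, 256, 3)
targets.tofile(out_dir / "holdout_target.bin")       # raw float32, shape (N,)
with open(out_dir / "holdout_meta.json", "w") as f:
    json.dump({"n_windows": len(windows), "lookback": 256, "n_features": 3}, f)
```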
Exported artifacts (written to artifacts/onnx/):
| File | Description |
|---|---|
| NaiveLSTM.pth | State dict (load with model class + `torch.load`) |
| ImprovedLSTM.pth | State dict |
| Transformer.pth | State dict |
| TCN.pth | State dict |
| NaiveLSTM.pt | TorchScript traced graph (`torch.jit.load`, no class needed) |
| ImprovedLSTM.pt | TorchScript traced graph |
| Transformer.pt | TorchScript traced graph |
| TCN.pt | TorchScript traced graph |
| NaiveLSTM.onnx | ONNX graph (opset 17, static input (1,256,3)) |
| ImprovedLSTM.onnx | ONNX graph |
| Transformer.onnx | ONNX graph |
| TCN.onnx | ONNX graph |
| holdout_input.bin | Raw float32 input windows, shape (N, 256, 3), row-major |
| holdout_target.bin | Raw float32 targets, shape (N,) |
| holdout_meta.json | `{n_windows, lookback, n_features}` |
| {Name}.engine | Cached TensorRT engine (written by trt_inference on first run) |
| cpp_metrics.json | ONNX/ORT C++ latency results |
| trt_metrics.json | TensorRT C++ latency results (GPU only) |
| python_metrics.json | Python .pth and .pt latency results |
cpp/inference.cpp benchmarks all ONNX models on:
- `CPUExecutionProvider` (always)
- `CUDAExecutionProvider` (when available)
It writes artifacts/onnx/cpp_metrics.json.
sudo apt update
sudo apt install -y build-essential cmake git

Option A: apt deb (CPU only, Ubuntu 24.10+)

sudo apt update
sudo apt install -y libonnxruntime libonnxruntime-dev

Option B: GitHub tar.gz (CPU or GPU, any Ubuntu)
CPU build:
cd ~
wget https://github.com/microsoft/onnxruntime/releases/download/v1.24.2/onnxruntime-linux-x64-1.24.2.tgz
tar -xf onnxruntime-linux-x64-1.24.2.tgz
mv onnxruntime-linux-x64-1.24.2 onnxruntime

GPU build:
cd ~
wget https://github.com/microsoft/onnxruntime/releases/download/v1.24.2/onnxruntime-linux-x64-gpu-1.24.2.tgz
tar -xf onnxruntime-linux-x64-gpu-1.24.2.tgz
mv onnxruntime-linux-x64-gpu-1.24.2 onnxruntime

Expected layout:
~/onnxruntime/
include/
lib/
mkdir -p cpp/build
cd cpp/build
# apt install path (system detection)
cmake .. -DCMAKE_BUILD_TYPE=Release
# OR explicit tarball path
# cmake .. -DCMAKE_BUILD_TYPE=Release -DORT_ROOT=$HOME/onnxruntime
cmake --build . --parallel

Notes:
- When `ORT_ROOT` points to a tarball install, CMake copies `libonnxruntime*.so*` and provider libs next to the binary.
- TensorRT target is auto-enabled only if TensorRT headers/libs are present.
./cpp/build/inference artifacts/onnx

If running directly and using the ORT GPU tarball, you may need runtime library paths:
export LD_LIBRARY_PATH="$HOME/onnxruntime/lib:$LD_LIBRARY_PATH"
./cpp/build/inference artifacts/onnx

Expected output includes CPU and, when configured correctly, CUDA blocks:
=== Provider: CPUExecutionProvider ===
...
=== Provider: CUDAExecutionProvider ===
...
If CUDA provider is unavailable, only CPU rows are emitted.
cpp/trt_inference.cpp builds/loads TensorRT engines and benchmarks GPU inference only.
It writes artifacts/onnx/trt_metrics.json.
sudo apt update
sudo apt install -y tensorrt tensorrt-dev

Verify:

dpkg -l | grep tensorrt
ls /usr/include/x86_64-linux-gnu/NvInfer.h

mkdir -p cpp/build
cd cpp/build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --parallel

If TensorRT is detected, CMake prints:
TensorRT found -- trt_inference target enabled
./cpp/build/trt_inference artifacts/onnx

First run builds .engine files; later runs reuse cached engines.
src/benchmark.py runs:
- Python `.pth` (CPU/GPU)
- Python `.pt` (CPU/GPU)
- ONNX Runtime C++ (CPU/GPU when available)
- TensorRT C++ (GPU)
uv run python src/benchmark.py

Optional flags:
uv run python src/benchmark.py --skip-python
uv run python src/benchmark.py --onnx-dir artifacts/onnx --inference-bin cpp/build/inference --trt-bin cpp/build/trt_inference

Alternatively, run the full pipeline:

uv run src/pipeline.py

This is the preferred command for reproducible inference validation because it:
- regenerates data
- retrains and exports all formats
- rebuilds C++ binaries
- runs the unified benchmark
You may observe Python (.pt) GPU latency for NaiveLSTM and ImprovedLSTM
significantly higher than Python (.pth) GPU latency. This is expected.
Reason:
- `.pt` exports are trace-based TorchScript modules.
- For traced LSTM `.pt`, PyTorch does not expose a usable `flatten_parameters()` method after load.
- cuDNN then emits the contiguous-weight warning (`RNN module weights are not part of single contiguous chunk of memory`).
- To guarantee warning-free inference, benchmark code disables cuDNN for `.pt` LSTM GPU runs.
Impact:
- No warning spam during benchmark.
- Outputs remain correct and unchanged.
- LSTM `.pt` GPU latency is higher because those runs use the non-cuDNN path.
Scope:
- This applies only to Python `.pt` LSTM GPU rows.
- `.pth` LSTM GPU runs still use cuDNN and remain fast.
- Transformer and TCN `.pt` GPU rows are not affected.
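Conceptually the workaround is a guarded toggle around the affected runs; a sketch of the idea (not the exact code in src/benchmark.py), assuming a GPU and an exported `.pt` file are available:

```python
import torch

scripted_lstm = torch.jit.load("artifacts/onnx/NaiveLSTM.pt").to("cuda").eval()
x = torch.randn(1, 256, 3, device="cuda")

prev = torch.backends.cudnn.enabled
torch.backends.cudnn.enabled = False          # avoid the contiguous-weight warning for traced LSTMs
try:
    with torch.no_grad():
        y = scripted_lstm(x)                  # slower non-cuDNN path, identical outputs
finally:
    torch.backends.cudnn.enabled = prev       # .pth runs keep using cuDNN
```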
When src/benchmark.py launches C++ binaries, it prepends CUDA library paths
from the active uv environment (.venv/site-packages/nvidia/*/lib) to
LD_LIBRARY_PATH. This prevents common ORT CUDA load failures such as:
- `Failed to load ... libonnxruntime_providers_cuda.so`
- `libcudnn.so.9: cannot open shared object file`
If you run C++ binaries manually, set LD_LIBRARY_PATH yourself.
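A sketch of how such paths could be collected and prepended before launching a C++ binary; the glob pattern and exact logic are assumptions, not the actual implementation in src/benchmark.py:

```python
import glob
import os
import subprocess

# CUDA runtime libs shipped inside the uv-managed virtual environment
nvidia_libs = glob.glob(".venv/lib/python*/site-packages/nvidia/*/lib")
env = os.environ.copy()
env["LD_LIBRARY_PATH"] = os.pathsep.join(
    nvidia_libs + [env.get("LD_LIBRARY_PATH", "")]
).strip(os.pathsep)

subprocess.run(["./cpp/build/inference", "artifacts/onnx"], env=env, check=True)
```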
.
|-- LICENSE
|-- README.md
|-- pyproject.toml
|-- uv.lock
|-- artifacts
| |-- comparison.png
| |-- inference_comparison.png
| |-- ImprovedLSTM_loss.png
| |-- NaiveLSTM_loss.png
| |-- TCN_loss.png
| |-- Transformer_loss.png
| `-- onnx
| |-- NaiveLSTM.{pth,pt,onnx,engine} <- all four export formats per model
| |-- ImprovedLSTM.{pth,pt,onnx,engine}
| |-- Transformer.{pth,pt,onnx,engine}
| |-- TCN.{pth,pt,onnx,engine}
| |-- holdout_input.bin
| |-- holdout_target.bin
| |-- holdout_meta.json
| |-- cpp_metrics.json <- ONNX/ORT C++ results
| |-- trt_metrics.json <- TensorRT C++ results
| `-- python_metrics.json <- Python .pth/.pt results
|-- cpp
| |-- CMakeLists.txt
| |-- inference.cpp <- ONNX/ORT benchmark binary
| `-- trt_inference.cpp <- TensorRT benchmark binary
|-- data
| `-- input
| |-- data_signals.csv
| `-- data_signals.png
|-- notebooks
| `-- explore.ipynb
`-- src
|-- __init__.py
|-- generate.py
|-- path.py
|-- pipeline.py <- end-to-end pipeline runner
|-- train.py <- trains models, exports all formats
|-- benchmark.py <- unified 8-row comparison table
`-- models
|-- __init__.py
|-- improved_lstm.py
|-- naive_lstm.py
|-- tcn.py
`-- transformer.py
Note: this tree uses plain ASCII to render reliably in Markdown preview. If you use lt/lsd, disable icons before pasting output into docs.