Financial transaction data is one of the richest signals available in the enterprise. Every swipe, transfer, and payment encodes patterns of human behavior — from daily spending habits to subtle shifts that precede fraud. Traditional approaches rely on hand-crafted features and rules that are brittle, slow to adapt, and blind to the deep sequential structure in transaction histories. Foundation models change this equation: by pretraining on large volumes of unlabeled transaction sequences, they learn general-purpose representations of financial behavior that transfer to a wide range of downstream tasks — fraud detection, anomaly scoring, customer segmentation, and personalized financial services.
This developer example shows how to build such a model end-to-end on NVIDIA GPUs:
- Custom GPU-accelerated tokenizer — A modular, RAPIDS-powered tokenizer converts heterogeneous tabular fields (merchant category, amount, time deltas, and more) into domain-specific token sequences. The pipeline is designed to be flexible: swap or add tokenizer components to match any transaction schema.
- Scalable pretraining with NeMo AutoModel — A decoder-only foundation model is trained with causal language modeling through NVIDIA NeMo AutoModel. NeMo AutoModel handles distributed training out of the box, scaling seamlessly from a single GPU to multi-node clusters, and can incorporate essentially any HuggingFace-compatible decoder architecture. This example uses Llama, but the framework is architecture-agnostic.
- Embedding extraction and downstream evaluation — Learned embeddings are extracted via last-token pooling and evaluated on fraud detection with XGBoost, demonstrating clear lift over hand-crafted feature baselines.
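As a rough illustration of the tokenizer idea, the sketch below bins and stringifies a few transaction fields into per-transaction tokens. It uses pandas for portability; the actual pipeline runs on RAPIDS cuDF (which mirrors the pandas API), and the column names and binning scheme here are invented for the example, not the project's real schema.

```python
import pandas as pd

# Toy transactions; column names are illustrative, not the project's schema.
df = pd.DataFrame({
    "merchant_category": ["grocery", "travel", "grocery"],
    "amount": [23.40, 812.00, 5.99],
    "time_delta_s": [0, 3600, 120],
})

# Quantize continuous fields into coarse buckets so each becomes a small vocabulary.
df["amount_bin"] = pd.cut(df["amount"], bins=[0, 10, 100, 1000],
                          labels=["amt_low", "amt_mid", "amt_high"])
df["delta_bin"] = pd.cut(df["time_delta_s"], bins=[-1, 300, 7200],
                         labels=["dt_short", "dt_long"])

# One token per field per transaction; histories of these token groups
# form the sequences the decoder is pretrained on.
df["tokens"] = ("mcc_" + df["merchant_category"] + " "
                + df["amount_bin"].astype(str) + " "
                + df["delta_bin"].astype(str))
print(df["tokens"].tolist())
# → ['mcc_grocery amt_mid dt_short', 'mcc_travel amt_high dt_long', 'mcc_grocery amt_low dt_short']
```

Because cuDF exposes the same dataframe operations, the same logic moves to GPU by swapping the import.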
The example builds on the following software stack:

- NVIDIA NeMo AutoModel — Foundation model training and inference
- NVIDIA RAPIDS (cuDF, cuML) — GPU-accelerated data processing and tokenization
- PyTorch 2.x — Deep learning framework
- HuggingFace Transformers — Model checkpointing and loading
- XGBoost — Gradient-boosted trees for fraud detection
- scikit-learn — Classical ML preprocessing, metrics, and baseline utilities
- pandas — CPU dataframe operations and interoperability with GPU pipelines
- NumPy — Array operations used across preprocessing and inference
- CuPy — GPU array operations for tokenizer and embedding workflows
- matplotlib — Static visualizations
- seaborn — Statistical plotting for dataset exploration
- plotly — Interactive 3D embedding visualization
- tqdm — Progress bars in notebook inference workflows
- ipywidgets — Notebook widget support
- torchdata — Stateful data loading for model training
**Third-Party Software Notice**: This project will download and install additional third-party open source software projects. Please review the license terms of these open source projects before use.
| # | Notebook | Description |
|---|---|---|
| 1 | 01_dataset_baseline.ipynb | Load the TabFormer financial transaction dataset, create temporal train/val/test splits, and train a GPU-accelerated XGBoost baseline for fraud detection. |
| 2 | 02_seq_preproc_tokenization.ipynb | Build a custom GPU-accelerated tokenizer pipeline that converts transaction records into domain-specific token sequences. |
| 3 | 03_foundation_model_training.ipynb | Pretrain a decoder-only foundation model (~29M parameters) on tokenized transaction sequences using NeMo AutoModel with causal language modeling. |
| 4 | 04_inference_embedding_extraction.ipynb | Load the pretrained model, run GPU inference, extract 512-dimensional embeddings via last-token pooling, and visualize with UMAP. |
| 5 | 05_xgboost_fraud_detection.ipynb | Compare XGBoost fraud detection using raw features, foundation model embeddings, and combined features. |
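Last-token pooling (used in notebook 04) reduces a decoder's per-token hidden states to a single vector per sequence by taking the hidden state at the final non-padded position. A minimal NumPy sketch of the idea — the notebook does this on GPU with PyTorch, and the shapes and names here are illustrative:

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of each sequence's last real (non-padded) token.

    hidden_states:  (batch, seq_len, dim) decoder outputs
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    # Index of the last position where the mask is 1, per sequence.
    last_idx = attention_mask.sum(axis=1) - 1                          # (batch,)
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]  # (batch, dim)

# Two sequences of length 4; the second is padded after 2 tokens.
h = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])
emb = last_token_pool(h, mask)
print(emb.shape)  # → (2, 3)
```

In a causal model the last real token has attended to the entire history, which is why its hidden state serves as the sequence embedding.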
1. Pull and launch the NeMo Framework container (25.09.01+) with GPU access and port mapping:

   ```bash
   docker run --gpus all --rm -it \
     -v $(pwd):/workspace \
     --shm-size=8g \
     -p 8888:8888 \
     --ulimit memlock=-1 \
     nvcr.io/nvidia/nemo:25.09.01
   ```

   - `--shm-size=8g` — increases shared memory to prevent DataLoader crashes under PyTorch multi-process loading
   - `-p 8888:8888` — publishes the Jupyter port to the host browser
   - `--ulimit memlock=-1` — removes the locked-memory limit required by some CUDA operations

2. Inside the container, install Git LFS, fetch the checkpoint artifacts, and start Jupyter:

   ```bash
   git config --global --add safe.directory /workspace
   apt-get update && apt-get install -y git-lfs
   git lfs install
   git lfs pull
   jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root
   ```

   Note: The `safe.directory` line is required because the repository is bind-mounted from the host, which causes a Git ownership mismatch inside the container. Without it, `git lfs pull` will fail.

   Each notebook installs its own dependencies (e.g. `%pip install xgboost ...`), so a separate `requirements.txt` is not needed.

   Open `http://localhost:8888/?token=...` in your browser.

3. Run `01_dataset_baseline.ipynb` to download the dataset and establish an XGBoost baseline.

4. Continue through notebooks 02–05 sequentially.
**Pre-trained Model Checkpoint** (required for notebooks 04–05)
Notebooks 04 and 05 load the pre-trained checkpoint from `models/decoder-foundation-model/`. This checkpoint (~56 MB) is tracked with Git LFS and was trained for ~3,000 steps on 8× A100 GPUs. You must run `git lfs pull` (step 2 above) to download it.
Notebook 03 runs a short 30-step demo to illustrate the training pipeline; its output is saved to a separate directory (`models/decoder-demo/`) and is not used by notebooks 04–05. Whether you run notebook 03 or skip it, the pre-trained checkpoint from Git LFS is what notebooks 04 and 05 expect.
| Component | Requirement |
|---|---|
| GPU | 1× NVIDIA A100 (80 GB) or H100 |
| System RAM | 32 GB |
| OS | Ubuntu 22.04+ |
| CUDA | CUDA 12 (provided via the NeMo Framework container) |
| Container | NeMo Framework 25.09.01+ |
| Python | 3.10+ |
1. Pull the NeMo container:

   ```bash
   docker pull nvcr.io/nvidia/nemo:25.09.01
   ```

2. Launch with GPU access, mount this repository, and publish the Jupyter port:

   ```bash
   docker run --gpus all --rm -it \
     -v $(pwd):/workspace \
     --shm-size=8g \
     -p 8888:8888 \
     --ulimit memlock=-1 \
     nvcr.io/nvidia/nemo:25.09.01
   ```

   Remote host: If running on a remote machine, add SSH port forwarding (`ssh -L 8888:localhost:8888 user@host`) so Jupyter is reachable from your local browser.

3. Install Git LFS and pull the pre-trained checkpoint:

   ```bash
   git config --global --add safe.directory /workspace
   apt-get update && apt-get install -y git-lfs
   git lfs install
   git lfs pull
   ```

   Note: The `safe.directory` line is needed because bind-mounting the repo causes a Git ownership mismatch inside the container.

   Each notebook installs its own dependencies inline, so no separate `requirements.txt` is needed.

4. Start Jupyter inside the container:

   ```bash
   jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root
   ```

   Open the URL printed in the terminal (e.g. `http://localhost:8888/?token=...`) in your browser.

5. Run notebooks 01–05 sequentially. Notebooks 04 and 05 require the pre-trained checkpoint downloaded by `git lfs pull` in step 3 (see Pre-trained Model Checkpoint above).
The developer example is designed for extensibility:

- Tokenizer — The modular tokenizer pipeline (`src/tokenizer/`) can be adapted to different transaction schemas by adding or replacing individual tokenizer components.
- Model Architecture — Training hyperparameters and model configuration are in `configs/pretrain_financial_decoder.yaml`. Swap in any HuggingFace-compatible decoder architecture by updating this config.
- Downstream Tasks — Replace XGBoost with any classifier that accepts fixed-length feature vectors.
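On the downstream side, the "combined features" variant of notebook 05 amounts to concatenating the raw tabular features with the learned embeddings into one fixed-length vector per transaction. A small sketch with synthetic arrays (the shapes and the 12-feature count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
raw_features = rng.normal(size=(n, 12))    # hand-crafted tabular features (width is illustrative)
embeddings   = rng.normal(size=(n, 512))   # foundation-model embeddings (last-token pooled)

# Combined representation: one fixed-length vector per transaction.
combined = np.hstack([raw_features, embeddings])
print(combined.shape)  # → (1000, 524)

# Any model that accepts fixed-length vectors can consume `combined`,
# e.g. an xgboost.XGBClassifier or a scikit-learn classifier.
```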
The included example uses a Llama decoder architecture, but NeMo AutoModel supports any HuggingFace-compatible decoder model.
| Parameter | Value |
|---|---|
| Architecture | Llama (decoder-only transformer) |
| Parameters | ~29M |
| Hidden size | 512 |
| Layers | 8 |
| Attention | Grouped Query Attention (8 query heads, 2 KV heads) |
| Context window | 8,192 tokens (RoPE) |
| Activation | SwiGLU |
| Normalization | RMSNorm |
| Vocabulary | ~6,251 domain-specific tokens |
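As a sanity check on the ~29M figure, the table's numbers can be turned into a back-of-envelope Llama-style parameter count. The MLP intermediate size is not listed and embedding tying is not stated, so the values marked below are assumptions; the point is that the table is internally consistent with a model in the high-20M range:

```python
# Rough Llama-style parameter count from the table above.
hidden, layers, vocab = 512, 8, 6251
q_heads, kv_heads = 8, 2
head_dim = hidden // q_heads                      # 64
intermediate = 1536                               # assumption: not stated in the table

embed = vocab * hidden                            # token embedding matrix
attn = (hidden * q_heads * head_dim               # Q projection
        + 2 * hidden * kv_heads * head_dim        # K and V (GQA: fewer KV heads)
        + q_heads * head_dim * hidden)            # output projection
mlp = 3 * hidden * intermediate                   # SwiGLU: gate, up, and down projections
norms = 2 * hidden                                # two RMSNorms per layer

total = embed + layers * (attn + mlp + norms) + embed  # assumption: untied LM head
print(f"{total/1e6:.1f}M")  # → 30.5M, near the quoted ~29M; exact value depends on the assumed sizes
```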
Unless otherwise noted, the contents of this repository are licensed under the Apache License, Version 2.0.
This project downloads and uses additional third-party open source software and datasets. Those third-party components are governed by their respective licenses and terms.
This project is currently not accepting contributions.
Please report security vulnerabilities or NVIDIA AI concerns here.
