Pretrained embeddings underperform raw XGBoost baseline in fraud detection evaluation

Hi NVIDIA team,

Thank you for releasing this transaction foundation model example. I am trying to reproduce and understand the downstream fraud detection results using the pretrained checkpoint.

I followed the notebook flow described in the repository:

- `01_dataset_baseline.ipynb`: raw-feature XGBoost baseline
- `04_inference_embedding_extraction.ipynb`: pretrained model inference and embedding extraction
- `05_xgboost_fraud_detection.ipynb`: downstream XGBoost evaluation using raw features, embeddings, and combined features

The README says that notebooks 04 and 05 use the pretrained checkpoint from `models/decoder-foundation-model/`, and that notebook 04 extracts 512-dimensional embeddings via last-token pooling.

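However, in my run, the pretrained embeddings alone perform much worse than the raw-feature baseline: ROC-AUC drops from 0.9885 to 0.8689 and AP from 0.1238 to 0.0139. Note that the embedding row below uses 64-dimensional PCA-reduced embeddings, not the full 512-dimensional embeddings mentioned in the README: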

| Model | Feature Dim | ROC-AUC | Average Precision (AP) |
|---|---:|---:|---:|
| Raw Features baseline | 13d | 0.9885 | 0.1238 |
| 64d PCA Embeddings | 64d | 0.8689 | 0.0139 |
| Combined raw + embedding features | 77d | 0.9872 | 0.1569 |
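
For context on reading the AP column: on heavily imbalanced data, a random scorer's AP is roughly the positive rate, so an AP of 0.0139 only means something relative to the fraud rate in the test split (which I have not listed here). Below is a minimal pure-Python sketch of how both metrics can be computed. The function names are my own and are not taken from the notebooks, and tie handling may differ slightly from library implementations.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in the number of rows, so this is
    only meant for small sanity checks, not full test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@k taken at the rank k of each positive example.

    Rows are ranked by descending score; tied scores keep input order.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


def positive_rate(labels):
    """Chance-level reference point for AP on this label set."""
    return sum(labels) / len(labels)


# Toy data, purely illustrative (not from the fraud dataset).
toy_labels = [0, 0, 1, 0, 1, 0, 0, 0]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.3, 0.05]
print(roc_auc(toy_labels, toy_scores))
print(average_precision(toy_labels, toy_scores))
print(positive_rate(toy_labels))
```

Comparing `average_precision` against `positive_rate` on the same split shows how far above chance each feature set actually is.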

For comparison, notebook 01 reports approximately:

- ROC-AUC: 0.98
- AUPRC/AP: 0.14

My questions are:

1. Is it expected that the pretrained embeddings alone underperform the raw-feature XGBoost baseline by this margin?
2. Are the pretrained embeddings intended mainly as complementary features rather than a replacement for raw tabular features?
3. Should the embedding evaluation in notebook 05 use the full 512-dimensional embeddings instead of 64d PCA embeddings?
4. Are there any recommended preprocessing, pooling, normalization, or XGBoost hyperparameter settings for evaluating the pretrained embeddings?
5. Could this gap indicate that I missed a step or made a setup error, such as skipping `git lfs pull`, loading the wrong checkpoint, using a mismatched train/test split, or running the notebooks in a different order?
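
To make question 5 concrete, this is the kind of alignment check I have in mind for ruling out a split or row-order mismatch between the extracted embeddings and the labels. It is only a sketch: it assumes each row carries some unique identifier, and the names `embedding_ids`, `label_ids`, `train_ids`, and `test_ids` are hypothetical. I do not know whether the notebooks expose such identifiers.

```python
def check_alignment(embedding_ids, label_ids, train_ids, test_ids):
    """Return a list of human-readable problems; an empty list means no issue found.

    All arguments are sequences of row identifiers (hypothetical; any hashable
    key that uniquely identifies a transaction row would work).
    """
    problems = []

    # Embedding rows must line up one-to-one with label rows, in the same order,
    # otherwise the classifier is trained on shuffled targets.
    if list(embedding_ids) != list(label_ids):
        problems.append("embedding rows and label rows are not in the same order")

    # Train and test must not share rows.
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        problems.append(f"{len(overlap)} ids appear in both train and test")

    # Every row used in a split should have an embedding.
    missing = (set(train_ids) | set(test_ids)) - set(embedding_ids)
    if missing:
        problems.append(f"{len(missing)} split ids have no embedding")

    return problems


# Toy usage with made-up ids.
print(check_alignment([1, 2, 3], [1, 2, 3], [1, 2], [3]))
print(check_alignment([1, 2, 3], [1, 3, 2], [1, 2], [2, 4]))
```

If a check like this passes and the gap remains, that would point away from a setup error on my side and toward the embeddings themselves.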

Any guidance on the expected metrics for the pretrained checkpoint would be very helpful.

Thanks!
