Hi NVIDIA team,
Thank you for releasing this transaction foundation model example. I am trying to reproduce and understand the downstream fraud detection results using the pretrained checkpoint.
I followed the notebook flow described in the repository:
- 01_dataset_baseline.ipynb: raw-feature XGBoost baseline
- 04_inference_embedding_extraction.ipynb: pretrained model inference and embedding extraction
- 05_xgboost_fraud_detection.ipynb: downstream XGBoost evaluation using raw features, embeddings, and combined features
The README says that notebooks 04 and 05 use the pretrained checkpoint from models/decoder-foundation-model/, and that notebook 04 extracts 512-dimensional embeddings via last-token pooling.
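To make sure I am reading "last-token pooling" the same way the notebook does, here is a minimal, dependency-free sketch of my understanding. It is not the repository's code: the function name `last_token_pool`, the nested-list layout, and the use of a 0/1 attention mask to find the last real token are all my assumptions, and the toy hidden size is 2 rather than 512.

```python
def last_token_pool(hidden_states, attention_mask):
    """Return one vector per sequence: the hidden state of its last real token.

    hidden_states:  nested lists shaped [batch][seq_len][hidden_dim]
    attention_mask: nested lists shaped [batch][seq_len], 1 = real token, 0 = padding
    (Both layouts are assumptions for illustration, not the notebook's actual tensors.)
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        # Positions of real (non-padding) tokens in this sequence.
        real_positions = [i for i, m in enumerate(mask) if m == 1]
        if not real_positions:
            raise ValueError("sequence contains no real tokens")
        # Take the hidden state at the last real position, skipping any padding.
        pooled.append(list(states[real_positions[-1]]))
    return pooled


# Toy batch: two sequences of length 3; the first has one padding position.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],
]
mask = [
    [1, 1, 0],
    [1, 1, 1],
]
embeddings = last_token_pool(hidden, mask)
```

If the notebook instead takes the final position regardless of padding, the pooled vector for padded sequences would come from a padding token, which is one thing I would like to rule out.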
However, in my run, the pretrained embeddings alone perform much worse than the raw-feature baseline:
| Model | Feature Dim | ROC-AUC | Average Precision (AP) |
| --- | --- | --- | --- |
| Raw Features baseline | 13d | 0.9885 | 0.1238 |
| 64d PCA Embeddings | 64d | 0.8689 | 0.0139 |
| Combined raw + embedding features | 77d | 0.9872 | 0.1569 |
For comparison, notebook 01 reports approximately:
- ROC-AUC: 0.98
- AUPRC/AP: 0.14
My questions are:
- Is it expected that the pretrained embeddings alone can underperform the raw XGBoost baseline so significantly?
- Are the pretrained embeddings intended mainly as complementary features rather than a replacement for raw tabular features?
- Should the embedding evaluation in notebook 05 use the full 512-dimensional embeddings instead of 64d PCA embeddings?
- Are there any recommended preprocessing, pooling, normalization, or XGBoost hyperparameter settings for evaluating the pretrained embeddings?
- Could this gap indicate that I missed a step, such as `git lfs pull`, using the wrong checkpoint, a mismatched train/test split, or a different notebook execution order?
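
To rule out the `git lfs pull` possibility on my side, a check along these lines could be used. It is only a sketch: it assumes that files not yet fetched by Git LFS are still small text pointer files beginning with the Git LFS spec line, and the helper names `is_lfs_pointer` and `find_lfs_pointers` are mine, not part of the repository.

```python
from pathlib import Path

# A file that Git LFS has not downloaded yet is a small text stub starting with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file still looks like a Git LFS pointer stub."""
    with Path(path).open("rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_lfs_pointers(directory):
    """List files under `directory` that are still LFS pointers (empty list if none)."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file() and is_lfs_pointer(p))


if __name__ == "__main__":
    # Checkpoint directory named in the README.
    for pointer in find_lfs_pointers("models/decoder-foundation-model"):
        print(f"still an LFS pointer, weights not downloaded: {pointer}")
```

If this prints nothing for the checkpoint directory, the files are at least not LFS stubs, although that alone does not confirm they are the intended checkpoint.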
Any guidance on the expected metrics for the pretrained checkpoint would be very helpful.
Thanks!