
Update benchmark results after fixing top_k/remove_seen in recommend_k_* helpers #2337

Open
miguelgfierro wants to merge 16 commits into staging from miguelgfierro/rerun-movielens-benchmark


Conversation

@miguelgfierro commented May 25, 2026

Collaborator

Summary

Follow-up to #2322, which fixed top_k and remove_seen being ignored in several recommend_k_* helpers in examples/06_benchmarks/benchmark_utils.py.

This PR reruns movielens.ipynb with the corrected code and updates the Algorithm Comparison table in README.md accordingly. It also cleans up several issues found during the rerun:

  • Remove TensorFlow import block — none of the active algorithms (ALS, SAR, NCF, embdotbias, BPR, BiVAE, LightGCN) depend on TF.
  • Remove all surprise/SVD dead code from benchmark_utils.py (import + 4 functions) and movielens.ipynb (commented-out blocks in 7 cells), since SVD is not in core dependencies.
  • Split the single GPU try/except import block in benchmark_utils.py into three independent blocks (LightGCN, NCF, embdotbias) so a failure in one model's imports does not silently prevent the others from loading — this was the root cause of a NameError: name 'RecoDataLoader' is not defined during the rerun.
  • Update the machine description (24 CPUs, GeForce RTX 5090 Laptop GPU with 24 GB).
  • Minor notebook cleanup: add colons to markdown section headers, remove orphaned commented-out stubs.
  • Add RATING_THRESHOLD = 3.5 for BPR: BPR is an implicit feedback model and should not treat all ratings equally. Ratings >= 4 are binarized to 1 (positive) and ratings <= 3 are excluded. This threshold is applied consistently to both training data (via prepare_training_bpr) and test data (via prepare_metrics_bpr). The remove_seen masking still uses the full unfiltered training set so that all previously seen items are suppressed from recommendations.
  • Fix BPR.recommend_k_items: seen-item masking (via -inf) is now applied to the score matrix before argpartition, guaranteeing exactly top_k valid recommendations per user. The old implementation filtered after selection, which could return fewer than top_k results when seen items appeared in the top-k window.
  • Move RATING_THRESHOLD constant to module level (top of benchmark_utils.py) alongside other module constants.
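
The thresholding described in the BPR bullet above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code in benchmark_utils.py: the helper name `binarize_ratings` and the tuple layout are assumptions made for the example; only the value `RATING_THRESHOLD = 3.5` comes from this PR.

```python
RATING_THRESHOLD = 3.5  # value stated in the PR description


def binarize_ratings(rows, threshold=RATING_THRESHOLD):
    """Keep (user, item, rating) rows at or above the threshold, with rating set to 1.

    Hypothetical helper for illustration only; rows is a list of tuples.
    """
    return [(user, item, 1) for user, item, rating in rows if rating >= threshold]


# Toy data, invented for the example.
train = [(1, 10, 5), (1, 11, 3), (2, 10, 4), (2, 12, 1)]
test = [(1, 12, 4), (2, 11, 2)]

train_pos = binarize_ratings(train)  # positives used to fit the model
test_pos = binarize_ratings(test)    # same threshold applied to the ground truth

# Seen-item masking still uses every training interaction, not only the positives.
seen = {(user, item) for user, item, _ in train}
```

Applying one helper to both splits is what keeps training and evaluation consistent, while `seen` is built from the unfiltered training rows so that low-rated items are still suppressed.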

Metric changes

| Algo | Old MAP | New MAP | Old nDCG | New nDCG | Change |
|---|---|---|---|---|---|
| ALS | 0.0047 | 0.0104 | 0.0442 | 0.0328 | within noise |
| BiVAE | 0.1461 | 0.3300 | 0.4751 | 0.4717 | large MAP increase; old run used different params |
| BPR | 0.1325 | 0.2292 | 0.4420 | 0.3623 | see note below |
| embdotbias | 0.0190 | 0.0549 | 0.1178 | 0.1190 | within noise |
| LightGCN | 0.0885 | 0.2772 | 0.4198 | 0.4191 | within noise |
| NCF | 0.1077 | 0.2603 | 0.3961 | 0.3931 | within noise |
| SAR | 0.1106 | 0.2585 | 0.3825 | 0.3938 | within noise |
| SVD | | removed | | removed | dropped with surprise |

BPR note: Two independent corrections affect the BPR result:

  1. The old nDCG of 0.44 was artificially inflated because recommend_k_items was called without top_k, so the evaluator received a differently-ordered full candidate set rather than a top-10 list. Fixing this alone brought nDCG to ~0.197.
  2. Applying RATING_THRESHOLD = 3.5 (ratings >= 4 as positive) is methodologically correct for BPR, which is an implicit feedback model — treating low ratings as positive inflates metrics because BPR ignores rating magnitude entirely. With the threshold, nDCG rises to 0.3623 because the model now trains on a cleaner positive signal and is evaluated against the same filtered ground truth.
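
To illustrate why the length of the list handed to the evaluator matters, here is a small binary-relevance nDCG@k sketch. It is an assumption-laden illustration, not the evaluator used by the notebook: the function name `ndcg_at_k` and the toy items are invented for this example.

```python
import math


def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance nDCG over the first k entries of ranked_items."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    # Ideal ranking: all relevant items first, capped at k positions.
    ideal_hits = min(k, len(relevant_items))
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c", "d"]   # toy ranking, best score first
relevant = {"a", "d"}           # toy ground truth

score_top2 = ndcg_at_k(ranked, relevant, k=2)   # only "a" is counted
score_full = ndcg_at_k(ranked, relevant, k=4)   # "d" at rank 4 is counted too
```

Scoring the full candidate list credits a hit at rank 4 that a top-2 list would never contain, so the two numbers are not comparable.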

SAR rating metrics (RMSE/MAE/R²) are now shown as N/A — SAR does not produce rating predictions and never should have had those columns populated.

Test plan

  • python -m py_compile examples/06_benchmarks/benchmark_utils.py
  • Full rerun of examples/06_benchmarks/movielens.ipynb on MovieLens 100k; output committed.

FYI @he-yufeng

None of the active algorithms (ALS, SAR, NCF, embdotbias, BPR, BiVAE,
LightGCN) depend on TensorFlow. Remove the TF import block, the
tf.random.set_seed call, and the TF version print. Also update the
machine description to reflect the current hardware (24 CPUs, RTX 5090
Laptop GPU with 24 GB).

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
SVD via surprise is not in core dependencies (tracked in #2224) and all
references were already commented out. Remove the dead code entirely:
drop the surprise import and the four SVD helper functions from
benchmark_utils.py, and remove all commented-out SVD blocks from
movielens.ipynb.

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
- Add colons to markdown section headers in movielens.ipynb.
- Remove orphaned commented-out try/except stubs left after prior cleanup.
- Split the single GPU try/except import block in benchmark_utils.py into
  three independent blocks (lightgcn, ncf, embdotbias) so a failure in one
  model's imports does not silently prevent the others from loading.

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
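
The idea behind splitting the guarded imports can be sketched as below. This is a generic illustration rather than the actual import section of benchmark_utils.py: `optional_import` and the module names are assumptions, with `json` standing in for an installed dependency and a deliberately fake name standing in for a missing one.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# One guarded import per optional dependency: a missing package only disables itself.
available = optional_import("json")
missing = optional_import("hypothetical_gpu_model_package")

models_ready = {
    "available_model": available is not None,
    "missing_model": missing is not None,
}
```

With a single shared try/except, the failure of the second import would also have hidden the first one, which is the kind of NameError the PR description mentions.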
Rerun movielens.ipynb after the bug fixes in #2322 which ensured top_k
and remove_seen are consistently honored across all recommend_k_* helpers.

Update the Algorithm Comparison table in README.md with the new numbers
and refresh the machine description (24 CPUs, RTX 5090 24 GB). Remove the
SVD row, which was dropped from the benchmark when surprise was removed.

Notable metric changes vs the previously published numbers:
- BPR: MAP 0.1325 → 0.1267, nDCG 0.4420 → 0.1971 — the old values were
  inflated because top_k was not passed to recommend_k_items, causing the
  evaluator to score a differently-ordered candidate set.
- SAR: rating metrics removed (SAR does not produce rating predictions).
- BiVAE, NCF, LightGCN, embdotbias: within normal run-to-run variation.

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

Merge the two Spark try/except blocks into one and the three GPU
try/except blocks into one, so each optional dependency group has
a single guarded import section.

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Previously seen items were removed after top-k selection, so a user with
any seen items among their top-k candidates ended up with fewer than k
recommendations. The evaluator then scored a truncated list, producing
artificially low metrics.

Fix: mask seen (user, item) pairs with -inf in the score matrix before
argpartition, so they are never selected as top-k candidates. The final
filter drops any residual -inf rows (users with fewer than top_k unseen items left).

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
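
The mask-then-select order described in the commit above can be sketched with NumPy as follows. This is a minimal illustration, not the BPR.recommend_k_items implementation: the function name `top_k_unseen`, the dense score matrix, and the toy values are assumptions.

```python
import numpy as np


def top_k_unseen(scores, seen_mask, k):
    """Return, per user, the indices of the k highest-scoring unseen items.

    scores: (n_users, n_items) float array; seen_mask: boolean array, same shape.
    """
    # Mask seen (user, item) pairs before selection so they can never be picked.
    masked = np.where(seen_mask, -np.inf, scores)
    # argpartition puts the k largest entries in the last k columns (unordered).
    top_idx = np.argpartition(masked, -k, axis=1)[:, -k:]
    # Order those k candidates by descending score.
    top_scores = np.take_along_axis(masked, top_idx, axis=1)
    order = np.argsort(-top_scores, axis=1)
    return np.take_along_axis(top_idx, order, axis=1)


scores = np.array([[0.9, 0.8, 0.1, 0.5],
                   [0.2, 0.7, 0.6, 0.3]])
seen = np.array([[True, False, False, False],
                 [False, True, False, False]])

recs = top_k_unseen(scores, seen, k=2)
```

Filtering after selection would have dropped each user's best-scoring item here and returned one recommendation instead of two.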
BPR is an implicit feedback model; applying a threshold of 3.5
(i.e. ratings of 4 and 5) converts the explicit MovieLens ratings
to meaningful positive feedback, lifting nDCG@10 from 0.197 to 0.404.

Also adds per-algorithm try/except in the benchmark loop so a single
model failure (e.g. LightGCN cuSPARSE incompatibility) does not abort
the entire run.

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
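
A per-algorithm guard of the kind mentioned in the commit above might look like the sketch below. The names `run_benchmark` and `failing_model`, and the numbers returned by the toy models, are invented for illustration and do not come from the notebook.

```python
def run_benchmark(algorithms):
    """Run each algorithm, recording failures instead of aborting the loop."""
    results, failures = {}, {}
    for name, train_and_eval in algorithms.items():
        try:
            results[name] = train_and_eval()
        except Exception as exc:  # broad on purpose: isolate any single model failure
            failures[name] = repr(exc)
    return results, failures


def failing_model():
    raise RuntimeError("simulated backend incompatibility")


results, failures = run_benchmark({
    "model_a": lambda: {"ndcg": 0.5},  # made-up numbers
    "model_b": failing_model,
    "model_c": lambda: {"ndcg": 0.4},
})
```

Keeping the failures in a separate dictionary makes it visible which models were skipped, rather than silently omitting them from the results table.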
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Comment thread examples/06_benchmarks/movielens.ipynb Outdated
…hold

prepare_metrics_bpr now binarizes and filters the test set with the
same RATING_THRESHOLD used for training, so nDCG is computed only over
items the model was actually trained to recommend. Without this, metrics
were inflated because evaluation ran only on the active-user subset that
had at least one high-rated training item.

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Update Algorithm Comparison table with latest benchmark run results
(PyTorch 2.13+cu132, RTX 5090 sm_120). BPR results now reflect correct
binarization with rating threshold 3.5, giving nDCG@10=0.3623 vs the
previous 0.1971 which was computed without any threshold.

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
@miguelgfierro

Collaborator Author

ready to review @anargyri @SimonYansenZhao

@anargyri

Collaborator

Why does SAR have no rating metrics now but before it had them?


2 participants