Update benchmark results after fixing top_k/remove_seen in recommend_k_* helpers by miguelgfierro · Pull Request #2337 · recommenders-team/recommenders

miguelgfierro · 2026-05-25T14:37:51Z

Summary

Follow-up to #2322, which fixed top_k and remove_seen being ignored in several recommend_k_* helpers in examples/06_benchmarks/benchmark_utils.py.

This PR reruns movielens.ipynb with the corrected code and updates the Algorithm Comparison table in README.md accordingly. It also cleans up several issues found during the rerun:

Remove TensorFlow import block — none of the active algorithms (ALS, SAR, NCF, embdotbias, BPR, BiVAE, LightGCN) depend on TF.
Remove all surprise/SVD dead code from benchmark_utils.py (import + 4 functions) and movielens.ipynb (commented-out blocks in 7 cells), since SVD is not in core dependencies.
Split the single GPU try/except import block in benchmark_utils.py into three independent blocks (LightGCN, NCF, embdotbias) so a failure in one model's imports does not silently prevent the others from loading — this was the root cause of a NameError: name 'RecoDataLoader' is not defined during the rerun.
Update the machine description (24 CPUs, GeForce RTX 5090 Laptop GPU with 24 GB).
Minor notebook cleanup: add colons to markdown section headers, remove orphaned commented-out stubs.
Add RATING_THRESHOLD = 3.5 for BPR: BPR is an implicit feedback model and should not treat all ratings equally. Ratings >= 4 are binarized to 1 (positive) and ratings <= 3 are excluded. This threshold is applied consistently to both training data (via prepare_training_bpr) and test data (via prepare_metrics_bpr). The remove_seen masking still uses the full unfiltered training set so that all previously seen items are suppressed from recommendations.
Fix BPR.recommend_k_items: seen-item masking (via -inf) is now applied to the score matrix before argpartition, guaranteeing exactly top_k valid recommendations per user. The old implementation filtered after selection, which could return fewer than top_k results when seen items appeared in the top-k window.
Move RATING_THRESHOLD constant to module level (top of benchmark_utils.py) alongside other module constants.
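The difference between filtering seen items after selection and masking them before selection can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in BPR.recommend_k_items: it works on a single user's score list and uses heapq.nlargest in place of argpartition on a score matrix, and both function names are hypothetical.

```python
import heapq
import math


def top_k_filter_after(scores, seen, k):
    """Old behaviour (sketch): select top-k first, then drop seen items."""
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [item for item in top if item not in seen]


def top_k_mask_before(scores, seen, k):
    """New behaviour (sketch): mask seen items with -inf, then select top-k."""
    masked = [-math.inf if item in seen else s for item, s in enumerate(scores)]
    top = heapq.nlargest(k, range(len(masked)), key=masked.__getitem__)
    # Drop residual -inf entries, which appear only when the user
    # has fewer than k unseen items left.
    return [item for item in top if masked[item] != -math.inf]


scores = [0.9, 0.8, 0.7, 0.6, 0.5]  # one user's scores, indexed by item id
seen = {0, 2}                        # items already in the training set

print(top_k_filter_after(scores, seen, k=3))  # [1]
print(top_k_mask_before(scores, seen, k=3))   # [1, 3, 4]
```

With filtering after selection, two of the three selected items are seen and only one recommendation survives. With masking first, the user still receives three recommendations.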

Metric changes

Algo	Old MAP	New MAP	Old nDCG	New nDCG	Change
ALS	0.0047	0.0104	0.0442	0.0328	within noise
BiVAE	0.1461	0.3300	0.4751	0.4717	large MAP increase — old run used different params
BPR	0.1325	0.2292	0.4420	0.3623	see note below
embdotbias	0.0190	0.0549	0.1178	0.1190	within noise
LightGCN	0.0885	0.2772	0.4198	0.4191	within noise
NCF	0.1077	0.2603	0.3961	0.3931	within noise
SAR	0.1106	0.2585	0.3825	0.3938	within noise
SVD	removed	—	removed	—	dropped with surprise

BPR note: Two independent corrections affect the BPR result:

The old nDCG of 0.44 was artificially inflated because recommend_k_items was called without top_k, so the evaluator received a differently-ordered full candidate set rather than a top-10 list. Fixing this alone brought nDCG to ~0.197.
Applying RATING_THRESHOLD = 3.5 (ratings >= 4 as positive) is methodologically correct for BPR, which is an implicit feedback model; treating low ratings as positive inflates metrics because BPR ignores rating magnitude entirely. With the threshold, nDCG rises from ~0.197 to 0.3623 because the model now trains on a cleaner positive signal and is evaluated against the same filtered ground truth.
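A minimal sketch of the thresholding described above, assuming interactions are plain (user, item, rating) tuples. The helper names to_implicit and seen_pairs are hypothetical and are not the functions in benchmark_utils.py; only the RATING_THRESHOLD value comes from this PR.

```python
RATING_THRESHOLD = 3.5  # value stated in this PR


def to_implicit(interactions, threshold=RATING_THRESHOLD):
    """Keep only pairs rated above the threshold and label them 1 (positive)."""
    return [
        (user, item, 1)
        for user, item, rating in interactions
        if rating > threshold
    ]


def seen_pairs(interactions):
    """All (user, item) pairs regardless of rating, for remove_seen masking."""
    return {(user, item) for user, item, _ in interactions}


# Toy data for illustration only: (user, item, rating)
train = [(1, 10, 5), (1, 11, 3), (2, 10, 4), (2, 12, 2)]

positives = to_implicit(train)  # what the model would train on
seen = seen_pairs(train)        # what gets masked at recommendation time

print(positives)  # [(1, 10, 1), (2, 10, 1)]
print(len(seen))  # 4
```

The same to_implicit step would be applied to the test split so that training and evaluation share one definition of a positive, while the seen set is built from the unfiltered training data.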

SAR rating metrics (RMSE/MAE/R²) are now shown as N/A — SAR does not produce rating predictions and never should have had those columns populated.

Test plan

python -m py_compile examples/06_benchmarks/benchmark_utils.py
Full rerun of examples/06_benchmarks/movielens.ipynb on Movielens 100k — output committed.

FYI @he-yufeng

None of the active algorithms (ALS, SAR, NCF, embdotbias, BPR, BiVAE, LightGCN) depend on TensorFlow. Remove the TF import block, the tf.random.set_seed call, and the TF version print. Also update the machine description to reflect the current hardware (24 CPUs, RTX 5090 Laptop GPU with 24 GB). Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

SVD via surprise is not in core dependencies (tracked in #2224) and all references were already commented out. Remove the dead code entirely: drop the surprise import and the four SVD helper functions from benchmark_utils.py, and remove all commented-out SVD blocks from movielens.ipynb. Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

- Add colons to markdown section headers in movielens.ipynb.
- Remove orphaned commented-out try/except stubs left after prior cleanup.
- Split the single GPU try/except import block in benchmark_utils.py into three independent blocks (lightgcn, ncf, embdotbias) so a failure in one model's imports does not silently prevent the others from loading.

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
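The idea of independent guarded imports can be sketched as follows. This is an illustration only: optional_import is a hypothetical helper, and the standard-library module names stand in for the three model dependencies so the snippet runs anywhere.

```python
import importlib


def optional_import(module_name):
    """Return the module if it imports cleanly, otherwise None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Each dependency is guarded on its own, so one failure does not
# prevent the others from loading. Module names are placeholders.
first = optional_import("json")
second = optional_import("a_missing_module_for_illustration")
third = optional_import("math")

available = [m.__name__ for m in (first, second, third) if m is not None]
print(available)  # ['json', 'math']
```

Had all three imports shared one try block, the failure of the second would have skipped the third as well, leaving its names undefined.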

Rerun movielens.ipynb after the bug fixes in #2322, which ensured top_k and remove_seen are consistently honored across all recommend_k_* helpers. Update the Algorithm Comparison table in README.md with the new numbers and refresh the machine description (24 CPUs, RTX 5090 24 GB). Remove the SVD row, which was dropped from the benchmark when surprise was removed. Notable metric changes vs the previously published numbers:

- BPR: MAP 0.1325 → 0.1267, nDCG 0.4420 → 0.1971. The old values were inflated because top_k was not passed to recommend_k_items, causing the evaluator to score a differently-ordered candidate set.
- SAR: rating metrics removed (SAR does not produce rating predictions).
- BiVAE, NCF, LightGCN, embdotbias: within normal run-to-run variation.

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

Merge the two Spark try/except blocks into one and the three GPU try/except blocks into one, so each optional dependency group has a single guarded import section. Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

Previously, seen items were removed after top-k selection, so a user with seen items among their top-k candidates ended up with fewer than k recommendations. The evaluator then scored a truncated list, producing artificially low metrics. Fix: mask seen (user, item) pairs with -inf in the score matrix before argpartition, so they are never selected as top-k candidates. The final filter drops any residual -inf rows (users with fewer than k unseen candidates). Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

BPR is an implicit feedback model; applying a threshold of 3.5 (i.e. keeping ratings of 4 and 5) converts the explicit MovieLens ratings to meaningful positive feedback, lifting nDCG@10 from 0.197 to 0.404. Also adds per-algorithm try/except in the benchmark loop so a single model failure (e.g. LightGCN cuSPARSE incompatibility) does not abort the entire run. Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
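A per-algorithm guard of this kind might look like the sketch below. The function run_all, the model names, and the returned values are all hypothetical placeholders; the notebook's actual loop is not reproduced here.

```python
def run_all(algorithms):
    """Run every algorithm, recording failures instead of aborting the loop."""
    results, failures = {}, {}
    for name, train_and_evaluate in algorithms.items():
        try:
            results[name] = train_and_evaluate()
        except Exception as exc:  # broad on purpose: any model error is isolated
            failures[name] = f"{type(exc).__name__}: {exc}"
    return results, failures


def broken_model():
    # Stand-in for a model that fails at runtime (e.g. a backend incompatibility).
    raise RuntimeError("simulated backend failure")


results, failures = run_all(
    {
        "model_a": lambda: {"score": 1.0},  # placeholder output, not a real metric
        "model_b": broken_model,
        "model_c": lambda: {"score": 2.0},
    }
)

print(sorted(results))  # ['model_a', 'model_c']
print(failures)         # {'model_b': 'RuntimeError: simulated backend failure'}
```

Recording the failure message alongside the successful results keeps the failed model visible in the output instead of silently dropping it.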

…hold prepare_metrics_bpr now binarizes and filters the test set with the same RATING_THRESHOLD used for training, so NDCG is computed only over items the model was actually trained to recommend. Without this, metrics were inflated because evaluation ran only on the active-user subset that had at least one high-rated training item. Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

Update Algorithm Comparison table with latest benchmark run results (PyTorch 2.13+cu132, RTX 5090 sm_120). BPR results now reflect correct binarization with rating threshold 3.5, giving nDCG@10=0.3623 vs the previous 0.1971 which was computed without any threshold. Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

miguelgfierro · 2026-05-25T20:46:00Z

ready to review @anargyri @SimonYansenZhao

anargyri · 2026-05-26T08:42:04Z

Why does SAR have no rating metrics now but before it had them?

miguelgfierro added 4 commits May 25, 2026 16:10

miguelgfierro requested review from SimonYansenZhao, anargyri, gramhagen, loomlike and wav8k as code owners May 25, 2026 14:37

miguelgfierro added 6 commits May 25, 2026 16:42

Use binarize() to convert ratings to implicit feedback in BPR deep dive

d1237f3

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

Explain unbounded prediction scores in BPR deep dive

99ddfd9

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

Use binarize() in prepare_training_bpr and rerun benchmark

1280d4e

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

anargyri reviewed May 25, 2026

Comment thread examples/06_benchmarks/movielens.ipynb Outdated

miguelgfierro added 6 commits May 25, 2026 20:43

Update BPR deep dive to max_iter=200 to match benchmark

f92c89e

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

Rerun benchmark with PyTorch 2.13+cu132 (RTX 5090 sm_120 support)

664e913

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

Move RATING_THRESHOLD constant to module level at top of benchmark_utils

f1b13d0

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

Add unit tests for BPR.recommend_k_items: top_k and remove_seen

c0aef19

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

miguelgfierro wants to merge 16 commits into staging from miguelgfierro/rerun-movielens-benchmark
