Update benchmark results after fixing top_k/remove_seen in recommend_k_* helpers #2337
Open
miguelgfierro wants to merge 16 commits into
None of the active algorithms (ALS, SAR, NCF, embdotbias, BPR, BiVAE, LightGCN) depend on TensorFlow. Remove the TF import block, the tf.random.set_seed call, and the TF version print. Also update the machine description to reflect the current hardware (24 CPUs, RTX 5090 Laptop GPU with 24 GB). Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
SVD via surprise is not in core dependencies (tracked in #2224) and all references were already commented out. Remove the dead code entirely: drop the surprise import and the four SVD helper functions from benchmark_utils.py, and remove all commented-out SVD blocks from movielens.ipynb. Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
- Add colons to markdown section headers in movielens.ipynb.
- Remove orphaned commented-out try/except stubs left after prior cleanup.
- Split the single GPU try/except import block in benchmark_utils.py into three independent blocks (lightgcn, ncf, embdotbias) so a failure in one model's imports does not silently prevent the others from loading.

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
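A minimal sketch of the independent guarded-import pattern described in this commit. The helper `try_import` and the module names are hypothetical placeholders, not the actual imports in benchmark_utils.py; a stdlib module stands in for an installed dependency and a made-up name for a missing one.

```python
import importlib


def try_import(module_name):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Only this dependency is marked unavailable; the others still load.
        return None


# Hypothetical optional dependency groups (names invented for illustration).
OPTIONAL_MODULES = {
    "present_model": "json",
    "missing_model": "a_module_that_does_not_exist_12345",
}

# Each entry is imported in its own guarded call.
loaded = {name: try_import(mod) for name, mod in OPTIONAL_MODULES.items()}
available = sorted(name for name, mod in loaded.items() if mod is not None)
print(available)
```

With a single shared try/except, the failing import would have skipped every statement after it in the block, which is how an unrelated name can end up undefined.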
Rerun movielens.ipynb after the bug fixes in #2322, which ensured top_k and remove_seen are consistently honored across all recommend_k_* helpers. Update the Algorithm Comparison table in README.md with the new numbers and refresh the machine description (24 CPUs, RTX 5090 24 GB). Remove the SVD row, which was dropped from the benchmark when surprise was removed.

Notable metric changes vs the previously published numbers:
- BPR: MAP 0.1325 → 0.1267, nDCG 0.4420 → 0.1971. The old values were inflated because top_k was not passed to recommend_k_items, causing the evaluator to score a differently-ordered candidate set.
- SAR: rating metrics removed (SAR does not produce rating predictions).
- BiVAE, NCF, LightGCN, embdotbias: within normal run-to-run variation.

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Merge the two Spark try/except blocks into one and the three GPU try/except blocks into one, so each optional dependency group has a single guarded import section. Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Previously seen items were removed after top-k selection, so a user with k seen items in their top-k candidates ended up with fewer than k recommendations. The evaluator then scored a truncated list, producing artificially low metrics. Fix: mask seen (user, item) pairs with -inf in the score matrix before argpartition, so they are never selected as top-k candidates. The final filter drops any residual -inf rows (items where all candidates were seen). Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
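The mask-before-selection idea in this commit can be sketched with NumPy as follows. This is an illustration under assumptions, not the code in BPR.recommend_k_items: the function name `top_k_unseen`, the dense boolean `seen_mask`, and the toy scores are all invented here.

```python
import numpy as np


def top_k_unseen(scores, seen_mask, k):
    """Return, per user, the indices of the k best unseen items, best first.

    scores:    (n_users, n_items) float array of predicted scores.
    seen_mask: (n_users, n_items) bool array, True where the item was seen.
    """
    # Mask seen pairs BEFORE selection so they can never win a top-k slot.
    masked = np.where(seen_mask, -np.inf, scores)
    k = min(k, masked.shape[1])
    # argpartition leaves the k largest entries of each row in the last k slots.
    top_idx = np.argpartition(masked, -k, axis=1)[:, -k:]
    top_scores = np.take_along_axis(masked, top_idx, axis=1)
    # Order the k candidates by descending score.
    order = np.argsort(-top_scores, axis=1)
    top_idx = np.take_along_axis(top_idx, order, axis=1)
    top_scores = np.take_along_axis(top_scores, order, axis=1)
    # Drop residual -inf entries (users with fewer than k unseen items).
    return [
        row_idx[np.isfinite(row_scores)].tolist()
        for row_idx, row_scores in zip(top_idx, top_scores)
    ]


# Toy data: user 0 has seen the two highest-scored items.
scores = np.array([[0.9, 0.8, 0.7, 0.6, 0.5],
                   [0.1, 0.2, 0.3, 0.4, 0.5]])
seen = np.array([[True, True, False, False, False],
                 [True, True, True, True, False]])
recs = top_k_unseen(scores, seen, k=2)
print(recs)
```

In this toy case, selecting the top 2 first and filtering afterwards would leave user 0 with no recommendations, because both top-scored items were already seen. Masking first yields items 2 and 3 instead. User 1 has only one unseen item, so the final filter returns a single recommendation.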
BPR is an implicit feedback model; applying a threshold of 3.5 (i.e. treating ratings of 4 and 5 as positives) converts the explicit MovieLens ratings to meaningful positive feedback, lifting nDCG@10 from 0.197 to 0.404. Also adds per-algorithm try/except in the benchmark loop so a single model failure (e.g. LightGCN cuSPARSE incompatibility) does not abort the entire run. Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
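A minimal sketch of a per-algorithm try/except in a benchmark loop, assuming each algorithm can be wrapped as a zero-argument callable. The function `run_benchmark`, the model names, and the metric values are hypothetical placeholders, not the notebook's actual loop or results.

```python
def run_benchmark(algorithms):
    """Run each algorithm independently and collect results and failures.

    algorithms: dict mapping a name to a zero-argument callable returning metrics.
    """
    results, failures = {}, {}
    for name, run in algorithms.items():
        try:
            results[name] = run()
        except Exception as exc:  # broad on purpose: one model must not abort the run
            failures[name] = f"{type(exc).__name__}: {exc}"
    return results, failures


def failing_model():
    # Stand-in for a model whose backend is incompatible with the machine.
    raise RuntimeError("simulated backend incompatibility")


# Hypothetical stand-ins for real train-and-evaluate functions.
algorithms = {
    "model_a": lambda: {"ndcg_at_10": 0.5},
    "model_b": failing_model,
    "model_c": lambda: {"ndcg_at_10": 0.25},
}
results, failures = run_benchmark(algorithms)
print(sorted(results), failures)
```

Recording the failure message rather than silently skipping the model keeps the missing row explainable when the results table is assembled.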
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
anargyri
reviewed
May 25, 2026
…hold prepare_metrics_bpr now binarizes and filters the test set with the same RATING_THRESHOLD used for training, so NDCG is computed only over items the model was actually trained to recommend. Without this, metrics were inflated because evaluation ran only on the active-user subset that had at least one high-rated training item. Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
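A minimal sketch of applying one threshold to both splits, as this commit describes. The helper `binarize` and the toy interactions are invented for illustration and are not the prepare_training_bpr or prepare_metrics_bpr implementations; only the threshold value 3.5 comes from the PR.

```python
RATING_THRESHOLD = 3.5  # value stated in the PR description


def binarize(interactions, threshold=RATING_THRESHOLD):
    """Keep (user, item, rating) rows at or above the threshold, as (user, item, 1)."""
    return [
        (user, item, 1)
        for user, item, rating in interactions
        if rating >= threshold
    ]


# Toy data, invented for illustration: (user, item, rating).
train = [(1, "a", 5), (1, "b", 2), (2, "a", 4), (2, "c", 3)]
test = [(1, "c", 4), (1, "d", 1), (2, "b", 5)]

# The same threshold is applied to training data and to the evaluation ground truth.
train_pos = binarize(train)
test_pos = binarize(test)

# Seen pairs for remove_seen come from the FULL training set, not the filtered one,
# so low-rated items the user already interacted with are still suppressed.
seen = {(user, item) for user, item, _ in train}
print(train_pos, test_pos)
```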
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Update Algorithm Comparison table with latest benchmark run results (PyTorch 2.13+cu132, RTX 5090 sm_120). BPR results now reflect correct binarization with rating threshold 3.5, giving nDCG@10=0.3623 vs the previous 0.1971 which was computed without any threshold. Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Collaborator
Author
ready to review @anargyri @SimonYansenZhao
Collaborator
Why does SAR have no rating metrics now, when it had them before?
Summary
Follow-up to #2322, which fixed top_k and remove_seen being ignored in several recommend_k_* helpers in examples/06_benchmarks/benchmark_utils.py.

This PR reruns movielens.ipynb with the corrected code and updates the Algorithm Comparison table in README.md accordingly. It also cleans up several issues found during the rerun:

- Removed the dead SVD code from benchmark_utils.py (import + 4 functions) and movielens.ipynb (commented-out blocks in 7 cells), since SVD is not in core dependencies.
- Split the single GPU try/except import block in benchmark_utils.py into three independent blocks (LightGCN, NCF, embdotbias) so a failure in one model's imports does not silently prevent the others from loading. This was the root cause of a NameError: name 'RecoDataLoader' is not defined during the rerun.
- Introduced RATING_THRESHOLD = 3.5 for BPR: BPR is an implicit feedback model and should not treat all ratings equally. Ratings >= 4 are binarized to 1 (positive) and ratings <= 3 are excluded. This threshold is applied consistently to both training data (via prepare_training_bpr) and test data (via prepare_metrics_bpr). The remove_seen masking still uses the full unfiltered training set so that all previously seen items are suppressed from recommendations.
- Fixed BPR.recommend_k_items: seen-item masking (via -inf) is now applied to the score matrix before argpartition, guaranteeing top_k valid recommendations per user whenever the user has at least top_k unseen items. The old implementation filtered after selection, which could return fewer than top_k results when seen items appeared in the top-k window.
- Moved the RATING_THRESHOLD constant to module level (top of benchmark_utils.py) alongside other module constants.

Metric changes
BPR note: Two independent corrections affect the BPR result:

1. recommend_k_items was called without top_k, so the evaluator received a differently-ordered full candidate set rather than a top-10 list. Fixing this alone brought nDCG to ~0.197.
2. RATING_THRESHOLD = 3.5 (ratings >= 4 as positive) is methodologically correct for BPR, which is an implicit feedback model: treating low ratings as positive inflates metrics because BPR ignores rating magnitude entirely. With the threshold, nDCG rises to 0.3623 because the model now trains on a cleaner positive signal and is evaluated against the same filtered ground truth.

SAR rating metrics (RMSE/MAE/R²) are now shown as N/A. SAR does not produce rating predictions and never should have had those columns populated.
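For readers cross-checking the nDCG@10 figures in the BPR note, here is a minimal sketch of binary-relevance nDCG@k for a single user. It follows the standard textbook definition and is an assumption about the metric's shape, not the evaluator code used in the benchmark; the function name `ndcg_at_k` and the toy items are invented for illustration.

```python
import math


def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance nDCG@k for one user.

    ranked_items:   recommended items, best first.
    relevant_items: ground-truth positives (after any threshold filtering).
    """
    relevant = set(relevant_items)
    # Only the first k recommendations are scored; a hit at 0-based rank r
    # contributes 1 / log2(r + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    # Ideal DCG: all relevant items (up to k) placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


score = ndcg_at_k(["a", "b", "c"], {"a", "c"}, k=3)
print(round(score, 4))
```

Because the metric depends on both the order of the list and the set of positives, changing either the candidate list handed to the evaluator or the ground-truth filter changes the reported number.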
Test plan
- python -m py_compile examples/06_benchmarks/benchmark_utils.py
- Ran examples/06_benchmarks/movielens.ipynb on MovieLens 100k; output committed.

FYI @he-yufeng