refator: migration GRU with pytorch#2339
Conversation
|
@ds-wook Could you merge the latest changes in the |
|
@ds-wook Thanks! Now it's working. |
|
@ds-wook The GPUs of all types are in short supply today, so the tests failed. Not sure when the resources will be available. I'll keep an eye out for it. |
|
@SimonYansenZhao Thank you! I will check it on my side too. |
miguelgfierro
left a comment
There was a problem hiding this comment.
super good @ds-wook, congrats again!
I left some comments below
| @@ -0,0 +1,367 @@ | |||
| # Copyright (c) Recommenders contributors. | |||
There was a problem hiding this comment.
it would be good to create a notebook: https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/sequential_recsys_amazondataset.ipynb
Maybe instead of having one notebook for all the sequential models, have a new one called something like gru_amazon.ipynb. Something in between https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/sequential_recsys_amazondataset.ipynb and https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/sasrec_amazon.ipynb where we explain first what is GRU, what's its architecture and then compute the notebook
There was a problem hiding this comment.
also, when the new notebook is created, it would be good to compare the old one and new one to make sure the metrics are consistent
There was a problem hiding this comment.
Metric Parity Check: TF GRU vs PyTorch GRU
Setup
To verify that the PyTorch port does not contain a parity bug, I ran the original TensorFlow GRUModel (restored from main at commit 9090c278) on exactly the same training, validation, and test splits as the new PyTorch implementation. The only variable changed between runs was the framework.
Dataset
Amazon Movies & TV Reviews
1% sample (
sample_rate=0.01, matching both notebooks)Train rows: 16,635 positives (
train_data)Test rows: 169,170 (
test_num_ngs=9)
Hyperparameters (Identical Across Both Runs)
epochs: 10 batch_size: 400 learning_rate: 0.001embed_l2: 0
layer_l2: 0train_num_ngs: 4
max_seq_length: 50hidden_size: 40
layer_sizes: [100, 64]
dropout: [0.3, 0.3]
enable_BN: true
seed: 42
Training Configuration
Loss: softmax over
(1 + train_num_ngs)candidate groupOptimizer: Adam
Results (Final Evaluation on test_data)
| Metric | TF GRU (Legacy) | PyTorch GRU | Δ (PyTorch − TF) |
|---|---|---|---|
| AUC | 0.6791 | 0.6581 | -0.021 |
| group_AUC | 0.6785 | 0.6487 | -0.030 |
| logloss | 0.9397 | 0.6062 | -0.334 ✓ |
| mean_mrr | 0.4596 | 0.4144 | -0.045 |
| ndcg@2 | 0.3640 | 0.3068 | -0.057 |
| ndcg@4 | 0.4649 | 0.4205 | -0.044 |
| ndcg@6 | 0.5221 | 0.4802 | -0.042 |
During training, the TF model's logloss increases from approximately 0.69 → 0.94 while AUC simultaneously improves. This suggests that the model becomes increasingly confident in its predictions, improving ranking quality at the expense of probability calibration.
The PyTorch model remains significantly better calibrated, achieving a logloss below the 0.69 random-guess baseline.
3. Training Trajectories Differ
TensorFlow:
Validation AUC
Epoch 1 → 10
0.59 → 0.71
(monotonic improvement)
PyTorch:
Test AUC
Peak at epoch 6: 0.6668
Declines afterwards
The saved best_model.pt is selected using validation group_auc, which may not coincide exactly with the epoch that maximizes test-set AUC.
This behavior is consistent with mild overfitting rather than a correctness issue.
4. No Evidence of a Port-Time Bug
The observed ~2–3% ranking gap is within reasonable framework-to-framework variance for a dataset of this size.
Potential contributors, ordered by likely impact:
A. FCN Weight Initialization
TensorFlow:
truncated_normal(stddev=0.01)
PyTorch:
nn.Linear(...)
# default Kaiming Uniform initialization
PyTorch weights begin with roughly 6× larger magnitude.
Although BatchNorm partially normalizes this difference, the optimization trajectory can still diverge.
B. GRU Cell Formulation Differences
PyTorch nn.GRU follows the cuDNN-friendly formulation:
r_t * (W_hn h_(t−1) + b_hn)
TensorFlow compat.v1.nn.rnn_cell.GRUCell follows the original Cho et al. formulation:
W [x_t, r_t * h_(t−1)]
The two formulations are mathematically similar but not identical.
C. User Embedding Usage
The TensorFlow GRU constructs:
self.user_embedding
but never consumes it in the final FCN.
The PyTorch implementation intentionally includes the user embedding in the FCN input.
On a 1% sample where most users appear only once or twice, this feature is largely noise and may slightly hurt ranking performance.
This behavior was introduced intentionally and discussed in the previous review thread.
D. Random Seed Coverage
Both implementations seed framework-level RNGs.
However, neither implementation seeds Python's:
random
module, which is used by the iterator for:
negative sampling
data shuffling
As a result, some run-to-run variation remains even with identical framework seeds.
Conclusion
The PyTorch implementation reproduces the TensorFlow baseline within expected framework-to-framework variance.
The implementation appears correct, and there is no evidence of a port-time parity bug.
The gap between the observed AUC (~0.66–0.68) and the benchmark value (~0.84) is explained by the much smaller training dataset (1% sample vs. full Amazon dataset), rather than by any issue in the PyTorch port itself.
There was a problem hiding this comment.
@ds-wook can you review? I see an error in the notebook: https://github.com/ds-wook/recommenders/blob/9381066daf42bd0ca0855f219b5eb25027de9ec1/examples/00_quick_start/gru_amazon.ipynb
There was a problem hiding this comment.
I can access the notebook without any issues on my side. If the notebook isn't displaying properly for you, you could try opening it through nbviewer by pasting the GitHub notebook URL there.
| wt.write("\n") | ||
| return self | ||
|
|
||
| def load(self, model_path: str) -> None: |
There was a problem hiding this comment.
load() joins a directory to MODEL_CHECKPOINT (model.pt), but fit(save_model=True) only ever writes best_model.pt and epoch_{N}_model.pt — a bare model.pt is
never produced, so load(model_dir) raises FileNotFoundError. Rather than hardcode best_, make the checkpoint name a parameter that defaults to the best
checkpoint, so callers can also load a specific epoch:
def load(self, model_path: str, filename: str = f"best_{MODEL_CHECKPOINT}") -> None:
"""Load weights from a ``.pt`` file, or from ``model_path/<filename>`` if a directory.
Args:
model_path: Path to a ``.pt`` file, or a directory containing one.
filename: Checkpoint name to load when ``model_path`` is a directory.
Defaults to the best checkpoint written by ``fit(save_model=True)``;
pass e.g. ``f"epoch_3_{MODEL_CHECKPOINT}"`` to load a specific epoch.
"""
if os.path.isdir(model_path):
model_path = os.path.join(model_path, filename)
state_dict = torch.load(model_path, map_location=self.device, weights_only=True)
self.load_state_dict(state_dict)
Defaulting to the best checkpoint keeps the common model.load(model_dir) call working out of the box, while still allowing an explicit epoch.
| item_all.append(positive_item) | ||
| item_cate_all.append(item_cate_list[i]) | ||
| count = 0 | ||
| while True: |
There was a problem hiding this comment.
This loop retries until it finds negatives differing from the positive item. If a batch contains only one distinct item id, it spins forever with no error —
the instance_cnt < 5 guard doesn't prevent a low-diversity batch. Add a bounded retry that fails fast:
attempts = 0
while count < batch_num_ngs:
attempts += 1
if attempts > 100 * batch_num_ngs:
raise ValueError("could not sample enough distinct negatives; batch has too few unique items")
j = random.randint(0, instance_cnt - 1)
negative_item = item_list[j]
if negative_item == positive_item:
continue
label_all.append(0)
item_all.append(negative_item)
item_cate_all.append(item_cate_list[j])
count += 1
| self.max_seq_length = max_seq_length | ||
| self.hidden_size = hidden_size | ||
|
|
||
| self.user_embedding = nn.Embedding(user_vocab_length, user_embedding_dim) |
There was a problem hiding this comment.
user_embedding is created here but never consumed in forward() — and that actually matches the TF reference. I checked the original: sequential_base_model.py
builds user_lookup and looks up self.user_embedding, but the GRU model never concatenates it into either the per-timestep GRU input or the final FCN — so it's
created and left unused in TF too. This PR faithfully reproduces that, so it's parity, not a regression.
To wire it into the prediction layer (if user conditioning is wanted). The user id is a per-instance, time-invariant feature, so it belongs at the final FCN input, not inside the recurrence:
# __init__: _FCN in_dim grows by user_embedding_dim
in_dim=hidden_size + self.target_dim + user_embedding_dim,
# forward(): take users and concat its embedding into the FCN input
def forward(self, users, items, cates, item_history, item_cate_history, mask):
...
target_emb = torch.cat([self.item_embedding(items), self.cate_embedding(cates)], dim=1)
user_emb = self.user_embedding(users)
model_output = torch.cat([final_state, target_emb, user_emb], dim=1)
return self.fcn(model_output)
This also needs a users key added to the iterator's batch dict and to _to_device. Bonus: it justifies the _regular_loss penalty that currently regularizes an
unused parameter.
Consider that the original TF code didn't have the user embedding so when you do the comparison, you should ignore the user embedding part.
|
|
||
| time_range = 3600 * 24 | ||
|
|
||
| time_diff = [] |
There was a problem hiding this comment.
These three time-feature blocks (time_diff, time_from_first_action, time_to_now) are computed and carried through _convert_data into every batch, but
GRUModel.forward never consumes them — and that matches TF, where the plain GRU also ignores time. Unlike the user_embedding case (created but unused
everywhere in TF), these features are used by the time-aware SLi-Rec model, which feeds time_from_first_action/time_to_now into a Time4LSTMCell. Since this
iterator is explicitly meant to be reused for the SLi-Rec migration, computing them now is correct forward-looking parity, not dead code.
Two small asks so it doesn't read as an oversight:
- Add a one-line comment here noting these are emitted for the time-aware models (SLi-Rec) and intentionally unused by GRU.
- Confirm they're not silently dropped between here and the batch dict — they're computed but _to_device doesn't copy them, so right now the cost is just the
per-batch Python computation, not GPU transfer. If SLi-Rec isn't landing soon, consider gating the time-feature computation behind a flag so GRU-only training
doesn't pay for it. Non-blocking.
There was a problem hiding this comment.
On (1) — Comment Added
Added at sequential_iterator.py:104-110, right before the time_diff block in parser_one_line:
# The three time features below (time_diff, time_from_first_action,
# time_from_first_action, time_to_now) are emitted for the time-aware
# sequential models (SLi-Rec consumes them via Time4LSTMCell). GRU
# intentionally ignores them; its forward() and _to_device() do not
# reference these keys, so the per-batch cost is Python-side computation
# only (no GPU transfer). Kept here so the same iterator can serve future
# SLi-Rec / time-aware migrations without forking the parser.
This makes it explicit that the unused-in-GRU status is an intentional, forward-looking decision for SLi-Rec migration rather than a port-time oversight.
On (2) — Confirmed No Silent Drop and No GPU Transfer
I traced the three time features through the full batch pipeline:
| Stage | time_diff / time_from_first_action / time_to_now | Cost |
|---|---|---|
| parser_one_line | Computed (list comprehension + np.log) | CPU microseconds |
| _convert_data | Written into the batch dict as NumPy arrays | Small RAM only |
| _to_device | Not copied. Only labels, users, items, cates, item_history, item_cate_history, and mask are moved to the device | 0 (no GPU transfer) |
| GRUModel.forward | Not referenced (grep "time_" gru.py → no matches) | 0 |
So the actual cost is limited to the per-batch Python parsing work. There is no GPU memory usage and no PCIe transfer overhead. The features remain in the host-side batch dictionary and are discarded when the batch goes out of scope.
On the Gating-Flag Suggestion
I left out a gating flag for now because the next migration in the series (SLi-Rec) will consume these features. Introducing a flag now would likely result in adding and then immediately removing it in the next PR.
If the SLi-Rec migration is delayed, we can introduce:
emit_time_features: bool = True
on the iterator as a follow-up. The added comment documents that intent. Happy to add the flag in this PR if you'd prefer.
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
@SimonYansenZhao The GPU is available and all other GPU-dependent tests pass successfully, including NCF, DeepRec, NewsRec, and notebook GPU tests. Failure occurs in: at: Error: This suggests that the current CUDA context may not be valid when Environment:
Observed result:
Potential area to investigate:
|
| @@ -0,0 +1,367 @@ | |||
| # Copyright (c) Recommenders contributors. | |||
There was a problem hiding this comment.
@ds-wook can you review? I see an error in the notebook: https://github.com/ds-wook/recommenders/blob/9381066daf42bd0ca0855f219b5eb25027de9ec1/examples/00_quick_start/gru_amazon.ipynb
| @@ -0,0 +1,379 @@ | |||
| # Copyright (c) Recommenders contributors. | |||
| # Licensed under the MIT License. | |||
There was a problem hiding this comment.
about the metrcis: ndcg@2 ndcg@4 and ndcg@6.
we never use those metrics, we do ndcg@10. Can you review?
| # Copyright (c) Recommenders contributors. | ||
| # Licensed under the MIT License. | ||
|
|
||
| from __future__ import annotations |
There was a problem hiding this comment.
There was a problem hiding this comment.
The cross-framework logloss gap (TF 0.94 vs PyTorch 0.61) closes to ~0.12 when L2=1e-4 is added to both. The gap is an artifact of where each implementation's optimizer lands on the bias-variance curve, not a port bug — TF was over-fitting weight magnitudes (L2 helps), PyTorch was already in a calibrated regime (L2 hurts). Both implementations produce reasonable metrics; no code change needed.
Signed-off-by: ds-wook <leewook94@gmail.com>
* Use pull_request_target to pass secrets Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct paths Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Test before changing pull_request to pull_request_target Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Update docs Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Use pull_request_target Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> --------- Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: benben951 <jie13383393540@163.com> Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Signed-off-by: ds-wook <leewook94@gmail.com>
…ging (recommenders-team#2318) * refactor: migrate vae pytorch Signed-off-by: ds-wook <leewook94@gmail.com> * refactor: optimize gpu calculation Signed-off-by: ds-wook <leewook94@gmail.com> * refactor: rebuild multi vae tensorflow to pytorch Signed-off-by: ds-wook <leewook94@gmail.com> * fix: rewrite multi vae Signed-off-by: ds-wook <leewook94@gmail.com> * Update doc for GitHub Actions runner setup (recommenders-team#2306) Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Translate NCF model from TensorFlow to PyTorch Rewrite ncf_singlenode.py from TF v1 (sessions, placeholders, tf_slim) to PyTorch (nn.Module). All weight initializations match TF defaults: truncated_normal(std=0.01) for embeddings, xavier_uniform for dense layers, no bias on output layer. Adam optimizer and BCELoss use identical defaults. Update unit tests, quickstart notebook, deep dive notebook and NNI notebook to use PyTorch imports. Dataset module (dataset.py) is unchanged as it has no TF dependency. Metrics on MovieLens 100k (seed=42, 50 epochs) are within ~4% of TF reference, explained entirely by different RNG sequences between frameworks. Training loss converges to the same value (0.2315 vs 0.2323). Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com> * refactor: change model parameter & arch Signed-off-by: ds-wook <leewook94@gmail.com> * Detect and re-download corrupt zip files in maybe_download A partial download that gets interrupted leaves a truncated zip file on disk. On retry, maybe_download sees the file exists and skips the download, causing BadZipFile errors that persist across all retries. Add is_valid_zip() to validate existing zip files before skipping the download. If the file is corrupt, delete it and re-download. Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com> * fix: switched both notebooks from map_at_k to map Signed-off-by: ds-wook <leewook94@gmail.com> * Fix by_threshold relevancy method to filter by score, not count The relevancy_method='by_threshold' branch in merge_ranking_true_pred was passing `threshold` as the `k` argument to get_top_k_items, so the threshold value silently became a top-N count instead of a score cutoff. Combined with metrics that divide by `k` (precision_at_k, ndcg_at_k, map, map_at_k, ...), this let the resulting metric exceed 1, which is mathematically impossible for these definitions. Now `by_threshold` filters predictions to rows with col_prediction >= threshold and then applies the standard top-k cutoff. Hits are bounded by k, so metrics stay in [0, 1]. Also clarifies the `threshold` docstring on every metric that exposes the parameter so users can tell it is a score cutoff rather than a count of items. Adds a regression test covering three cases: 1. Threshold above all scores -> every ranking metric is 0. 2. Threshold below all scores -> by_threshold collapses to top_k. 3. Mid threshold -> all metrics stay inside [0, 1]. Fixes recommenders-team#2154 Refs recommenders-team#2140 * Rewrite by_threshold test with concrete correctness assertions * fix: change map metric Signed-off-by: ds-wook <leewook94@gmail.com> * Add support for compshare vms Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct shell commands Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Declare COMPSHARE_SPEC_FILE Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Copy repo files to the VM to avoid git clone failure Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Retry curl upon failure Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * fix(gpu): use imported cuda namespace for gpu counting Signed-off-by: Yinchaochen <lisumchen@gmail.com> * Update docs Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Configure Docker registry mirror for speedup Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Retry image build upon failure Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct syntax errors Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Add pip index arg Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Make scripts robuster Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Try DNS configs only, and remove P40 due to incompatibility with PyTorch Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * test(gpu): shorten regression test name per review Rename test_get_number_gpus_falls_back_to_cuda_namespace_when_torch_is_missing to test_get_number_gpus_without_torch in test_gpu_utils.py and update its entry in tests/test_groups.yml. The shorter name still pairs the function under test with the scenario; the cuda-fallback detail is evident from the test body. Addresses review comment from @anargyri on recommenders-team#2314. Signed-off-by: Yinchao Chen <lisumchen@gmail.com> * refactor: modernize lightgbm utils Signed-off-by: ds-wook <leewook94@gmail.com> * Add support for Docker and PyPI mirrors Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Clean up code for retries and correct docker mirror url Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Update docs Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct docker build arg for pypi index url Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Combine test groups for gpu Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> --------- Signed-off-by: ds-wook <leewook94@gmail.com> Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com> Signed-off-by: Yinchaochen <lisumchen@gmail.com> Signed-off-by: Yinchao Chen <lisumchen@gmail.com> Co-authored-by: ds-wook <leewook94@gmail.com> Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com> Co-authored-by: Miguel Fierro <3491412+miguelgfierro@users.noreply.github.com> Co-authored-by: Yinchaochen <lisumchen@gmail.com> Co-authored-by: Andreas Argyriou <anargyri@users.noreply.github.com>
Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>
Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> Signed-off-by: ds-wook <leewook94@gmail.com>
* refactor: migrate vae pytorch Signed-off-by: ds-wook <leewook94@gmail.com> * refactor: optimize gpu calculation Signed-off-by: ds-wook <leewook94@gmail.com> * refactor: rebuild multi vae tensorflow to pytorch Signed-off-by: ds-wook <leewook94@gmail.com> * fix: rewrite multi vae Signed-off-by: ds-wook <leewook94@gmail.com> * Update doc for GitHub Actions runner setup (recommenders-team#2306) Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Translate NCF model from TensorFlow to PyTorch Rewrite ncf_singlenode.py from TF v1 (sessions, placeholders, tf_slim) to PyTorch (nn.Module). All weight initializations match TF defaults: truncated_normal(std=0.01) for embeddings, xavier_uniform for dense layers, no bias on output layer. Adam optimizer and BCELoss use identical defaults. Update unit tests, quickstart notebook, deep dive notebook and NNI notebook to use PyTorch imports. Dataset module (dataset.py) is unchanged as it has no TF dependency. Metrics on MovieLens 100k (seed=42, 50 epochs) are within ~4% of TF reference, explained entirely by different RNG sequences between frameworks. Training loss converges to the same value (0.2315 vs 0.2323). Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com> * refactor: change model parameter & arch Signed-off-by: ds-wook <leewook94@gmail.com> * Detect and re-download corrupt zip files in maybe_download A partial download that gets interrupted leaves a truncated zip file on disk. On retry, maybe_download sees the file exists and skips the download, causing BadZipFile errors that persist across all retries. Add is_valid_zip() to validate existing zip files before skipping the download. If the file is corrupt, delete it and re-download. Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com> * fix: switched both notebooks from map_at_k to map Signed-off-by: ds-wook <leewook94@gmail.com> * Fix by_threshold relevancy method to filter by score, not count The relevancy_method='by_threshold' branch in merge_ranking_true_pred was passing `threshold` as the `k` argument to get_top_k_items, so the threshold value silently became a top-N count instead of a score cutoff. Combined with metrics that divide by `k` (precision_at_k, ndcg_at_k, map, map_at_k, ...), this let the resulting metric exceed 1, which is mathematically impossible for these definitions. Now `by_threshold` filters predictions to rows with col_prediction >= threshold and then applies the standard top-k cutoff. Hits are bounded by k, so metrics stay in [0, 1]. Also clarifies the `threshold` docstring on every metric that exposes the parameter so users can tell it is a score cutoff rather than a count of items. Adds a regression test covering three cases: 1. Threshold above all scores -> every ranking metric is 0. 2. Threshold below all scores -> by_threshold collapses to top_k. 3. Mid threshold -> all metrics stay inside [0, 1]. Fixes recommenders-team#2154 Refs recommenders-team#2140 * Rewrite by_threshold test with concrete correctness assertions * fix: change map metric Signed-off-by: ds-wook <leewook94@gmail.com> * Add support for compshare vms Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct shell commands Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Declare COMPSHARE_SPEC_FILE Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Copy repo files to the VM to avoid git clone failure Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Retry curl upon failure Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * fix(gpu): use imported cuda namespace for gpu counting Signed-off-by: Yinchaochen <lisumchen@gmail.com> * Update docs Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Configure Docker registry mirror for speedup Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Retry image build upon failure Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct syntax errors Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Add pip index arg Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Make scripts robuster Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Try DNS configs only, and remove P40 due to incompatibility with PyTorch Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Use map_at_k instead of map for ranking-metric reporting Issue recommenders-team#2309 points out that the dict returned by examples/06_benchmarks/benchmark_utils.py:ranking_metrics_python and :ranking_metrics_pyspark labels its first entry "MAP" but computes it with the Spark-style map() function, which normalizes by n_relevant rather than min(k, n_relevant). The other entries in the same dict are labeled "@k" and computed with the @k variants, so the first entry is inconsistent with its neighbours and can produce values that are mathematically valid for MAP but counter-intuitive when read alongside Precision@k / Recall@k / NDCG@k. Changes: * examples/06_benchmarks/benchmark_utils.py - swap map for map_at_k in both the Python and PySpark ranking-metrics helpers and rename the dict key "MAP" to "MAP@k" so the label matches the function used. * examples/06_benchmarks/movielens.ipynb - update the two source cells (the missing-row placeholder dict and the column-order list) that consume that dict so the benchmark table column header agrees with the upstream key. Cached cell outputs are left as-is; they will be regenerated on the next notebook run. * recommenders/evaluation/python_evaluation.py - cross-link the map() and map_at_k() docstrings so a reader landing on either function can see the normalizer difference and pick the right one. * recommenders/evaluation/spark_evaluation.py - same cross-link on SparkRankingEvaluation.map / .map_at_k. * tests/unit/recommenders/evaluation/test_python_evaluation.py - add test_python_map_vs_map_at_k that pins the invariant: map_at_k equals map when k >= n_relevant for every user (k=10 on the existing fixture) and strictly exceeds it when at least one user has more than k relevant items (k=5, where user 3 in the fixture has 10). * tests/test_groups.yml - register the new test in the pr_gate group. Notebook examples under examples/00_quick_start and examples/02_model_collaborative_filtering still import the bare map symbol; switching them is left to a follow-up because the tests/functional/examples/test_notebooks_*.py and tests/smoke/examples/test_notebooks_*.py expected values for the "map" key would need to be regenerated end-to-end. Refs recommenders-team#1702 recommenders-team#2004 Signed-off-by: Yinchao Chen <lisumchen@gmail.com> * test(gpu): shorten regression test name per review Rename test_get_number_gpus_falls_back_to_cuda_namespace_when_torch_is_missing to test_get_number_gpus_without_torch in test_gpu_utils.py and update its entry in tests/test_groups.yml. The shorter name still pairs the function under test with the scenario; the cuda-fallback detail is evident from the test body. Addresses review comment from @anargyri on recommenders-team#2314. Signed-off-by: Yinchao Chen <lisumchen@gmail.com> * refactor: modernize lightgbm utils Signed-off-by: ds-wook <leewook94@gmail.com> * Add support for Docker and PyPI mirrors Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Clean up code for retries and correct docker mirror url Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Update docs Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct docker build arg for pypi index url Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Combine test groups for gpu Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Fix asset URL in fm_deep_dive.ipynb path had `mains-team/resources` repeated muiltiple times this is corrected to value in https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/xdeepfm_criteo.ipynb * Install cuda driver from scratch Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Lock gpu version Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Update Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Remove install_container_toolkit.sh Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * refactor: migrate lightgcn pytorch Signed-off-by: ds-wook <leewook94@gmail.com> * fix: remove type_checking and change print to logging Signed-off-by: ds-wook <leewook94@gmail.com> * refactor: redesign architectural args Signed-off-by: ds-wook <leewook94@gmail.com> * fix: reorder logger Signed-off-by: ds-wook <leewook94@gmail.com> * Try CUDA 13.2.1 Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Add 2080 for use Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Use the latest cuda driver Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Increase notebook execution timeout Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Remove 2080 due to insufficient gpu memory for nightly tests Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Add support for http proxy for speed up Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Prepend "VM_" to env variables for cache Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Update map_at_k in notebooks * PR template typo * Remove Surprise and rerun benchmarks * Fix MLLib docs link * Fix docstring for MAP * Add support for https proxy Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Add more retry on failure Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Add support for installing gpu drivers for P40 Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct configure.sh Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Add retries for ssh key setup Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Set apt and uv to bypass SSL verification Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Update spec.json Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Remove http/https proxy because of no apparent gains on speed Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Revert Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Remove yq installation in Dockerfile Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Update https proxy config for apt Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct apt operations Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Remove apt conf Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Remove P40 Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Add more retries Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Move http(s) proxy config from config.json to CLI Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * fix: fixed lightgcn model and rerun notebook Signed-off-by: ds-wook <leewook94@gmail.com> * Add support to set vm requirements Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Add by_threshold ranking metrics regression test Signed-off-by: benben951 <jie13383393540@163.com> * Set VM stop schedule Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Explicitly specify secrets to use (recommenders-team#2328) Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct secrets in calling workflows Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct docker args Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Resolve key unbound error Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct empty stop time error Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Reduce spec retrying times Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Lock CUDA version to 580 on V100S Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Refactor duplicate code Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Add more GPU choices Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct delete_vm.sh Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct GPUType Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Try the spot chargetype Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct jq filter Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Alternate charge type for the same gputype Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Add more GPU options Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * fix: honor benchmark recommendation args Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> * fix: address benchmark review suggestions Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> * Resolve issue on empty secrets (recommenders-team#2334) * Use pull_request_target to pass secrets Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Correct paths Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Test before changing pull_request to pull_request_target Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Update docs Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * Use pull_request_target Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> --------- Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> * fix: set default timeout for dataset downloads Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> * Correct git refs and working dir (recommenders-team#2338) Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> --------- Signed-off-by: ds-wook <leewook94@gmail.com> Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com> Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com> Signed-off-by: Yinchaochen <lisumchen@gmail.com> Signed-off-by: Yinchao Chen <lisumchen@gmail.com> Signed-off-by: benben951 <jie13383393540@163.com> Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Co-authored-by: ds-wook <leewook94@gmail.com> Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com> Co-authored-by: Miguel Fierro <3491412+miguelgfierro@users.noreply.github.com> Co-authored-by: Yinchaochen <lisumchen@gmail.com> Co-authored-by: Andreas Argyriou <anargyri@users.noreply.github.com> Co-authored-by: seanv507 <sean.violante@gmail.com> Co-authored-by: benben951 <jie13383393540@163.com> Co-authored-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
aff1ed6 to
826b52f
Compare
|
@miguelgfierro I’ve addressed the requested changes. When you have a chance, could you please take another look? Thank you! |
|
|
||
| jobs: | ||
| run-tests: | ||
| uses: ./.github/workflows/compshare-vm.yml |
There was a problem hiding this comment.
This file has no real changes. Could you unstage?
|
|
||
| jobs: | ||
| run-tests: | ||
| uses: ./.github/workflows/compshare-vm.yml |
| "\n", | ||
| "This notebook trains a GRU \\[1\\] sequential recommender on a public Amazon Movies & TV review dataset. The GRU model encodes a user's history of (item, category) interactions with a Gated Recurrent Unit, then combines the resulting hidden state with the target item / category / user embeddings to predict whether the user will interact with the target next.\n", | ||
| "\n", | ||
| "The implementation is PyTorch-native; it does not depend on the legacy TF `BaseModel` / `SequentialBaseModel` hierarchy used by the other sequential models in this repository (see [`sequential_recsys_amazondataset.ipynb`](sequential_recsys_amazondataset.ipynb) for those). It mirrors the single-file standalone `nn.Module` convention introduced by the LightGCN PyTorch migration.\n" |
There was a problem hiding this comment.
I think you can remove the stuff about TensorFlow since it is not relevant for this notebook.
| class GRUModel(SequentialBaseModel): | ||
| """GRU Model | ||
|
|
||
| class GRUModel(nn.Module): |
There was a problem hiding this comment.
Doesn't it make sense to create something similar to SequentialBaseModel as in the existing code, since we are going to reuse it for other deeprec models too? This file is now longer and the same thing will happen with the other models.

Description
Migrate the sequential GRU recommender from TF 1.x to PyTorch, following the pattern established by the LightGCN PR (#2315): a standalone
nn.Modulewith architectural args on__init__and training-time args onfit(), with no port of theBaseModel/SequentialBaseModelhierarchy.Also introduces a TF-free
SequentialIteratorunderio/torch/that the remaining sequential models (A2SVD, Caser, NextItNet, SLi-Rec, SUM) can reuse when they are migrated in subsequent PRs.New files
recommenders/models/deeprec/io/torch/__init__.pyrecommenders/models/deeprec/io/torch/sequential_iterator.py(+367) — TF-freeSequentialIterator. Parses the same Amazon-reviews tab-separated format as the legacyio/sequential_iterator.py, but yields batches asdict[str, np.ndarray]instead of TFfeed_dict. Preserves the in-batch negative-sampling row layout (1 positive followed by N negatives) so softmax loss code can reshape into(-1, 1 + train_num_ngs)exactly as before.Rewritten
recommenders/models/deeprec/models/sequential/gru.py(~80 → ~560 lines) — Standalonenn.Module. Internal_FCNmirrorsBaseModel._fcn_net(Linear → BatchNorm1d → activation → dropout). Supports the same four losses as the TF base (softmax,cross_entropy_loss,square_loss,log_loss) and the same five optimizers (adam,sgd,adadelta,adagrad,rmsprop). L1/L2 regularization on embeddings and layer parameters matches TF semantics (tf.nn.l2_loss == 0.5 * sum(x**2)). Usespack_padded_sequenceso the GRU only iterates over valid history steps.Tests
tests/unit/recommenders/models/test_deeprec_model.py— addstest_gru_component_definition(assertions on embedding shapes, GRU input/hidden size, FCN output dim).tests/smoke/recommenders/models/test_deeprec_model.py— addstest_model_gru(fullfit → run_eval → predictcycle on the Amazon sample dataset).Architectural vs training args split —
__init__takes only what shapes the parameter tensors (vocab sizes, embedding dims, GRU hidden size, MLP head config). Everything else (epochs, lr, batch size, optimizer name, loss name, L1/L2, gradient clip, save/eval cadence) lives onfit(). Same convention as the LightGCN PR.Iterator reuse, not rewrite of the iterator base class — the new PyTorch iterator does not subclass
BaseIterator(which is TF-aware via its abstractgen_feed_dict). It is a standalone class returning numpy dicts. Future PyTorch sequential models can either reuse this iterator or add a subclass underio/torch/.Loss / regularization parity with TF — softmax loss masks padding rows by setting their softmax probability to 1 so
log(.)zeros them out (matchesBaseModel._compute_data_loss).tf.nn.l2_losssemantics (factor of 0.5) preserved.TF
SequentialIteratorandSequentialBaseModelare NOT touched — they remain in place because A2SVD, Caser, NextItNet, SLi-Rec, and SUM still depend on them. They will be removed once the rest of the sequential models are migrated.Related Issues
References
Checklist:
git commit -s -m "your commit message".staging branchAND NOT TOmain branch.