Skip to content

refator: migration GRU with pytorch#2339

Open
ds-wook wants to merge 15 commits into
recommenders-team:stagingfrom
ds-wook:refactor/migrate-gru-pytorch
Open

refator: migration GRU with pytorch#2339
ds-wook wants to merge 15 commits into
recommenders-team:stagingfrom
ds-wook:refactor/migrate-gru-pytorch

Conversation

@ds-wook

@ds-wook ds-wook commented May 26, 2026

Copy link
Copy Markdown
Contributor

Description

Migrate the sequential GRU recommender from TF 1.x to PyTorch, following the pattern established by the LightGCN PR (#2315): a standalone nn.Module with architectural args on __init__ and training-time args on fit(), with no port of the BaseModel / SequentialBaseModel hierarchy.

Also introduces a TF-free SequentialIterator under io/torch/ that the remaining sequential models (A2SVD, Caser, NextItNet, SLi-Rec, SUM) can reuse when they are migrated in subsequent PRs.

New files

  • recommenders/models/deeprec/io/torch/__init__.py
  • recommenders/models/deeprec/io/torch/sequential_iterator.py (+367) — TF-free SequentialIterator. Parses the same Amazon-reviews tab-separated format as the legacy io/sequential_iterator.py, but yields batches as dict[str, np.ndarray] instead of TF feed_dict. Preserves the in-batch negative-sampling row layout (1 positive followed by N negatives) so softmax loss code can reshape into (-1, 1 + train_num_ngs) exactly as before.

Rewritten

  • recommenders/models/deeprec/models/sequential/gru.py (~80 → ~560 lines) — Standalone nn.Module. Internal _FCN mirrors BaseModel._fcn_net (Linear → BatchNorm1d → activation → dropout). Supports the same four losses as the TF base (softmax, cross_entropy_loss, square_loss, log_loss) and the same five optimizers (adam, sgd, adadelta, adagrad, rmsprop). L1/L2 regularization on embeddings and layer parameters matches TF semantics (tf.nn.l2_loss == 0.5 * sum(x**2)). Uses pack_padded_sequence so the GRU only iterates over valid history steps.

Tests

  • tests/unit/recommenders/models/test_deeprec_model.py — adds test_gru_component_definition (assertions on embedding shapes, GRU input/hidden size, FCN output dim).

  • tests/smoke/recommenders/models/test_deeprec_model.py — adds test_model_gru (full fit → run_eval → predict cycle on the Amazon sample dataset).

  • Architectural vs training args split__init__ takes only what shapes the parameter tensors (vocab sizes, embedding dims, GRU hidden size, MLP head config). Everything else (epochs, lr, batch size, optimizer name, loss name, L1/L2, gradient clip, save/eval cadence) lives on fit(). Same convention as the LightGCN PR.

  • Iterator reuse, not rewrite of the iterator base class — the new PyTorch iterator does not subclass BaseIterator (which is TF-aware via its abstract gen_feed_dict). It is a standalone class returning numpy dicts. Future PyTorch sequential models can either reuse this iterator or add a subclass under io/torch/.

  • Loss / regularization parity with TF — softmax loss masks padding rows by setting their softmax probability to 1 so log(.) zeros them out (matches BaseModel._compute_data_loss). tf.nn.l2_loss semantics (factor of 0.5) preserved.

  • TF SequentialIterator and SequentialBaseModel are NOT touched — they remain in place because A2SVD, Caser, NextItNet, SLi-Rec, and SUM still depend on them. They will be removed once the rest of the sequential models are migrated.

Related Issues

References

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • I have signed the commits, e.g. git commit -s -m "your commit message".
  • This PR is being made to staging branch AND NOT TO main branch.

@SimonYansenZhao

Copy link
Copy Markdown
Collaborator

@ds-wook Could you merge the latest changes in the staging branch? I just put a new fix onto the staging branch to correct the tests on Checks.

@SimonYansenZhao

Copy link
Copy Markdown
Collaborator

@ds-wook Thanks! Now it's working.

@SimonYansenZhao

Copy link
Copy Markdown
Collaborator

@ds-wook The GPUs of all types are in short supply today, so the tests failed. Not sure when the resources will be available. I'll keep an eye out for it.

@ds-wook

ds-wook commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

@SimonYansenZhao Thank you! I will check it on my side too.

@miguelgfierro miguelgfierro left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super good @ds-wook, congrats again!

I left some comments below

@@ -0,0 +1,367 @@
# Copyright (c) Recommenders contributors.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be good to create a notebook: https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/sequential_recsys_amazondataset.ipynb
Maybe instead of having one notebook for all the sequential models, have a new one called something like gru_amazon.ipynb. Something in between https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/sequential_recsys_amazondataset.ipynb and https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/sasrec_amazon.ipynb where we explain first what is GRU, what's its architecture and then compute the notebook

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, when the new notebook is created, it would be good to compare the old one and new one to make sure the metrics are consistent

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metric Parity Check: TF GRU vs PyTorch GRU

Setup

To verify that the PyTorch port does not contain a parity bug, I ran the original TensorFlow GRUModel (restored from main at commit 9090c278) on exactly the same training, validation, and test splits as the new PyTorch implementation. The only variable changed between runs was the framework.

Dataset

  • Amazon Movies & TV Reviews

  • 1% sample (sample_rate=0.01, matching both notebooks)

  • Train rows: 16,635 positives (train_data)

  • Test rows: 169,170 (test_num_ngs=9)

Hyperparameters (Identical Across Both Runs)

epochs: 10
batch_size: 400
learning_rate: 0.001

embed_l2: 0
layer_l2: 0

train_num_ngs: 4
max_seq_length: 50

hidden_size: 40
layer_sizes: [100, 64]
dropout: [0.3, 0.3]
enable_BN: true

seed: 42

Training Configuration

  • Loss: softmax over (1 + train_num_ngs) candidate group

  • Optimizer: Adam


Results (Final Evaluation on test_data)

Metric TF GRU (Legacy) PyTorch GRU Δ (PyTorch − TF)
AUC 0.6791 0.6581 -0.021
group_AUC 0.6785 0.6487 -0.030
logloss 0.9397 0.6062 -0.334 ✓
mean_mrr 0.4596 0.4144 -0.045
ndcg@2 0.3640 0.3068 -0.057
ndcg@4 0.4649 0.4205 -0.044
ndcg@6 0.5221 0.4802 -0.042

During training, the TF model's logloss increases from approximately 0.69 → 0.94 while AUC simultaneously improves. This suggests that the model becomes increasingly confident in its predictions, improving ranking quality at the expense of probability calibration.

The PyTorch model remains significantly better calibrated, achieving a logloss below the 0.69 random-guess baseline.


3. Training Trajectories Differ

TensorFlow:

Validation AUC
Epoch 1 → 10
0.59 → 0.71
(monotonic improvement)

PyTorch:

Test AUC
Peak at epoch 6: 0.6668
Declines afterwards

The saved best_model.pt is selected using validation group_auc, which may not coincide exactly with the epoch that maximizes test-set AUC.

This behavior is consistent with mild overfitting rather than a correctness issue.


4. No Evidence of a Port-Time Bug

The observed ~2–3% ranking gap is within reasonable framework-to-framework variance for a dataset of this size.

Potential contributors, ordered by likely impact:

A. FCN Weight Initialization

TensorFlow:

truncated_normal(stddev=0.01)

PyTorch:

nn.Linear(...)
# default Kaiming Uniform initialization

PyTorch weights begin with roughly 6× larger magnitude.

Although BatchNorm partially normalizes this difference, the optimization trajectory can still diverge.


B. GRU Cell Formulation Differences

PyTorch nn.GRU follows the cuDNN-friendly formulation:

r_t * (W_hn h_(t−1) + b_hn)

TensorFlow compat.v1.nn.rnn_cell.GRUCell follows the original Cho et al. formulation:

W [x_t, r_t * h_(t−1)]

The two formulations are mathematically similar but not identical.


C. User Embedding Usage

The TensorFlow GRU constructs:

self.user_embedding

but never consumes it in the final FCN.

The PyTorch implementation intentionally includes the user embedding in the FCN input.

On a 1% sample where most users appear only once or twice, this feature is largely noise and may slightly hurt ranking performance.

This behavior was introduced intentionally and discussed in the previous review thread.


D. Random Seed Coverage

Both implementations seed framework-level RNGs.

However, neither implementation seeds Python's:

random

module, which is used by the iterator for:

  • negative sampling

  • data shuffling

As a result, some run-to-run variation remains even with identical framework seeds.


Conclusion

The PyTorch implementation reproduces the TensorFlow baseline within expected framework-to-framework variance.

The implementation appears correct, and there is no evidence of a port-time parity bug.

The gap between the observed AUC (~0.66–0.68) and the benchmark value (~0.84) is explained by the much smaller training dataset (1% sample vs. full Amazon dataset), rather than by any issue in the PyTorch port itself.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

I can access the notebook without any issues on my side. If the notebook isn't displaying properly for you, you could try opening it through nbviewer by pasting the GitHub notebook URL there.

wt.write("\n")
return self

def load(self, model_path: str) -> None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load() joins a directory to MODEL_CHECKPOINT (model.pt), but fit(save_model=True) only ever writes best_model.pt and epoch_{N}_model.pt — a bare model.pt is
never produced, so load(model_dir) raises FileNotFoundError. Rather than hardcode best_, make the checkpoint name a parameter that defaults to the best
checkpoint, so callers can also load a specific epoch:

def load(self, model_path: str, filename: str = f"best_{MODEL_CHECKPOINT}") -> None:
    """Load weights from a ``.pt`` file, or from ``model_path/<filename>`` if a directory.

    Args:
        model_path: Path to a ``.pt`` file, or a directory containing one.
        filename: Checkpoint name to load when ``model_path`` is a directory.
            Defaults to the best checkpoint written by ``fit(save_model=True)``;
            pass e.g. ``f"epoch_3_{MODEL_CHECKPOINT}"`` to load a specific epoch.
    """
    if os.path.isdir(model_path):
        model_path = os.path.join(model_path, filename)
    state_dict = torch.load(model_path, map_location=self.device, weights_only=True)
    self.load_state_dict(state_dict)

Defaulting to the best checkpoint keeps the common model.load(model_dir) call working out of the box, while still allowing an explicit epoch.

item_all.append(positive_item)
item_cate_all.append(item_cate_list[i])
count = 0
while True:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop retries until it finds negatives differing from the positive item. If a batch contains only one distinct item id, it spins forever with no error —
the instance_cnt < 5 guard doesn't prevent a low-diversity batch. Add a bounded retry that fails fast:

attempts = 0
while count < batch_num_ngs:
    attempts += 1
    if attempts > 100 * batch_num_ngs:
        raise ValueError("could not sample enough distinct negatives; batch has too few unique items")
    j = random.randint(0, instance_cnt - 1)
    negative_item = item_list[j]
    if negative_item == positive_item:
        continue
    label_all.append(0)
    item_all.append(negative_item)
    item_cate_all.append(item_cate_list[j])
    count += 1

self.max_seq_length = max_seq_length
self.hidden_size = hidden_size

self.user_embedding = nn.Embedding(user_vocab_length, user_embedding_dim)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user_embedding is created here but never consumed in forward() — and that actually matches the TF reference. I checked the original: sequential_base_model.py
builds user_lookup and looks up self.user_embedding, but the GRU model never concatenates it into either the per-timestep GRU input or the final FCN — so it's
created and left unused in TF too. This PR faithfully reproduces that, so it's parity, not a regression.
To wire it into the prediction layer (if user conditioning is wanted). The user id is a per-instance, time-invariant feature, so it belongs at the final FCN input, not inside the recurrence:

   # __init__: _FCN in_dim grows by user_embedding_dim
   in_dim=hidden_size + self.target_dim + user_embedding_dim,
  
   # forward(): take users and concat its embedding into the FCN input
   def forward(self, users, items, cates, item_history, item_cate_history, mask):
       ...
       target_emb = torch.cat([self.item_embedding(items), self.cate_embedding(cates)], dim=1)
       user_emb = self.user_embedding(users)
       model_output = torch.cat([final_state, target_emb, user_emb], dim=1)
       return self.fcn(model_output)

This also needs a users key added to the iterator's batch dict and to _to_device. Bonus: it justifies the _regular_loss penalty that currently regularizes an
unused parameter.

Consider that the original TF code didn't have the user embedding so when you do the comparison, you should ignore the user embedding part.


time_range = 3600 * 24

time_diff = []

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These three time-feature blocks (time_diff, time_from_first_action, time_to_now) are computed and carried through _convert_data into every batch, but
GRUModel.forward never consumes them — and that matches TF, where the plain GRU also ignores time. Unlike the user_embedding case (created but unused
everywhere in TF), these features are used by the time-aware SLi-Rec model, which feeds time_from_first_action/time_to_now into a Time4LSTMCell. Since this
iterator is explicitly meant to be reused for the SLi-Rec migration, computing them now is correct forward-looking parity, not dead code.

Two small asks so it doesn't read as an oversight:

  1. Add a one-line comment here noting these are emitted for the time-aware models (SLi-Rec) and intentionally unused by GRU.
  2. Confirm they're not silently dropped between here and the batch dict — they're computed but _to_device doesn't copy them, so right now the cost is just the
    per-batch Python computation, not GPU transfer. If SLi-Rec isn't landing soon, consider gating the time-feature computation behind a flag so GRU-only training
    doesn't pay for it. Non-blocking.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On (1) — Comment Added

Added at sequential_iterator.py:104-110, right before the time_diff block in parser_one_line:

# The three time features below (time_diff, time_from_first_action,
# time_from_first_action, time_to_now) are emitted for the time-aware
# sequential models (SLi-Rec consumes them via Time4LSTMCell). GRU
# intentionally ignores them; its forward() and _to_device() do not
# reference these keys, so the per-batch cost is Python-side computation
# only (no GPU transfer). Kept here so the same iterator can serve future
# SLi-Rec / time-aware migrations without forking the parser.

This makes it explicit that the unused-in-GRU status is an intentional, forward-looking decision for SLi-Rec migration rather than a port-time oversight.


On (2) — Confirmed No Silent Drop and No GPU Transfer

I traced the three time features through the full batch pipeline:

Stage time_diff / time_from_first_action / time_to_now Cost
parser_one_line Computed (list comprehension + np.log) CPU microseconds
_convert_data Written into the batch dict as NumPy arrays Small RAM only
_to_device Not copied. Only labels, users, items, cates, item_history, item_cate_history, and mask are moved to the device 0 (no GPU transfer)
GRUModel.forward Not referenced (grep "time_" gru.py → no matches) 0

So the actual cost is limited to the per-batch Python parsing work. There is no GPU memory usage and no PCIe transfer overhead. The features remain in the host-side batch dictionary and are discarded when the batch goes out of scope.


On the Gating-Flag Suggestion

I left out a gating flag for now because the next migration in the series (SLi-Rec) will consume these features. Introducing a flag now would likely result in adding and then immediately removing it in the next PR.

If the SLi-Rec migration is delayed, we can introduce:

emit_time_features: bool = True

on the iterator as a follow-up. The added comment documents that intent. Happy to add the flag in this PR if you'd prefer.

@review-notebook-app

Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@ds-wook

ds-wook commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

@SimonYansenZhao
While running the full GPU test suite, only tests/unit/recommenders/utils/test_gpu_utils.py::test_get_gpu_info
fails.

The GPU is available and all other GPU-dependent tests pass successfully, including NCF, DeepRec, NewsRec, and notebook GPU tests.

Failure occurs in:

recommenders/utils/gpu_utils.py

at:

meminfo = cuda.current_context().get_memory_info()

Error:

CudaAPIError: [201] Call to cuCtxGetDevice results in CUDA_ERROR_INVALID_CONTEXT

This suggests that the current CUDA context may not be valid when get_gpu_info() is called, potentially due to Numba/CUDA context handling changes or interaction with previously executed GPU tests.

Environment:

  • Python 3.9.25
  • pytest 8.4.2
  • numba (version?)
  • CUDA (version?)
  • GPU tests executed successfully except get_gpu_info()

Observed result:

  • 62 tests passed
  • 3 skipped
  • 1 failed (test_get_gpu_info)

Potential area to investigate:

  • Usage of cuda.current_context() inside get_gpu_info()
  • Context push/pop behavior in the "with gpu:" block
  • Compatibility with newer Numba/CUDA versions

@@ -0,0 +1,367 @@
# Copyright (c) Recommenders contributors.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,379 @@
# Copyright (c) Recommenders contributors.
# Licensed under the MIT License.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

about the metrcis: ndcg@2 ndcg@4 and ndcg@6.
we never use those metrics, we do ndcg@10. Can you review?

# Copyright (c) Recommenders contributors.
# Licensed under the MIT License.

from __future__ import annotations

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the logloss difference, do you know why this could be happening?

Image

FYI @anargyri

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cross-framework logloss gap (TF 0.94 vs PyTorch 0.61) closes to ~0.12 when L2=1e-4 is added to both. The gap is an artifact of where each implementation's optimizer lands on the bias-variance curve, not a port bug — TF was over-fitting weight magnitudes (L2 helps), PyTorch was already in a calibrated regime (L2 hurts). Both implementations produce reasonable metrics; no code change needed.

ds-wook and others added 14 commits June 2, 2026 16:56
Signed-off-by: ds-wook <leewook94@gmail.com>
* Use pull_request_target to pass secrets

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct paths

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Test before changing pull_request to pull_request_target

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update docs

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Use pull_request_target

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

---------

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: benben951 <jie13383393540@163.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
…ging (recommenders-team#2318)

* refactor: migrate vae pytorch

Signed-off-by: ds-wook <leewook94@gmail.com>

* refactor: optimize gpu calculation

Signed-off-by: ds-wook <leewook94@gmail.com>

* refactor: rebuild multi vae tensorflow to pytorch

Signed-off-by: ds-wook <leewook94@gmail.com>

* fix: rewrite multi vae

Signed-off-by: ds-wook <leewook94@gmail.com>

* Update doc for GitHub Actions runner setup (recommenders-team#2306)

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Translate NCF model from TensorFlow to PyTorch

Rewrite ncf_singlenode.py from TF v1 (sessions, placeholders, tf_slim) to
PyTorch (nn.Module). All weight initializations match TF defaults:
truncated_normal(std=0.01) for embeddings, xavier_uniform for dense layers,
no bias on output layer. Adam optimizer and BCELoss use identical defaults.

Update unit tests, quickstart notebook, deep dive notebook and NNI notebook
to use PyTorch imports. Dataset module (dataset.py) is unchanged as it has
no TF dependency.

Metrics on MovieLens 100k (seed=42, 50 epochs) are within ~4% of TF
reference, explained entirely by different RNG sequences between frameworks.
Training loss converges to the same value (0.2315 vs 0.2323).

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* refactor: change model parameter & arch

Signed-off-by: ds-wook <leewook94@gmail.com>

* Detect and re-download corrupt zip files in maybe_download

A partial download that gets interrupted leaves a truncated zip file
on disk. On retry, maybe_download sees the file exists and skips the
download, causing BadZipFile errors that persist across all retries.

Add is_valid_zip() to validate existing zip files before skipping
the download. If the file is corrupt, delete it and re-download.

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* fix: switched both notebooks from map_at_k to map

Signed-off-by: ds-wook <leewook94@gmail.com>

* Fix by_threshold relevancy method to filter by score, not count

The relevancy_method='by_threshold' branch in merge_ranking_true_pred
was passing `threshold` as the `k` argument to get_top_k_items, so the
threshold value silently became a top-N count instead of a score cutoff.
Combined with metrics that divide by `k` (precision_at_k, ndcg_at_k,
map, map_at_k, ...), this let the resulting metric exceed 1, which is
mathematically impossible for these definitions.

Now `by_threshold` filters predictions to rows with col_prediction >=
threshold and then applies the standard top-k cutoff. Hits are bounded
by k, so metrics stay in [0, 1].

Also clarifies the `threshold` docstring on every metric that exposes
the parameter so users can tell it is a score cutoff rather than a
count of items.

Adds a regression test covering three cases:
1. Threshold above all scores -> every ranking metric is 0.
2. Threshold below all scores -> by_threshold collapses to top_k.
3. Mid threshold -> all metrics stay inside [0, 1].

Fixes recommenders-team#2154
Refs recommenders-team#2140

* Rewrite by_threshold test with concrete correctness assertions

* fix: change map metric

Signed-off-by: ds-wook <leewook94@gmail.com>

* Add support for compshare vms

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct shell commands

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Declare COMPSHARE_SPEC_FILE

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Copy repo files to the VM to avoid git clone failure

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Retry curl upon failure

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* fix(gpu): use imported cuda namespace for gpu counting

Signed-off-by: Yinchaochen <lisumchen@gmail.com>

* Update docs

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Configure Docker registry mirror for speedup

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Retry image build upon failure

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct syntax errors

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add pip index arg

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Make scripts robuster

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try DNS configs only, and remove P40 due to incompatibility with PyTorch

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* test(gpu): shorten regression test name per review

Rename test_get_number_gpus_falls_back_to_cuda_namespace_when_torch_is_missing
to test_get_number_gpus_without_torch in test_gpu_utils.py and update its
entry in tests/test_groups.yml. The shorter name still pairs the function
under test with the scenario; the cuda-fallback detail is evident from the
test body.

Addresses review comment from @anargyri on recommenders-team#2314.

Signed-off-by: Yinchao Chen <lisumchen@gmail.com>

* refactor: modernize lightgbm utils

Signed-off-by: ds-wook <leewook94@gmail.com>

* Add support for Docker and PyPI mirrors

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Clean up code for retries and correct docker mirror url

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update docs

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct docker build arg for pypi index url

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Combine test groups for gpu

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

---------

Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: Yinchaochen <lisumchen@gmail.com>
Signed-off-by: Yinchao Chen <lisumchen@gmail.com>
Co-authored-by: ds-wook <leewook94@gmail.com>
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Co-authored-by: Miguel Fierro <3491412+miguelgfierro@users.noreply.github.com>
Co-authored-by: Yinchaochen <lisumchen@gmail.com>
Co-authored-by: Andreas Argyriou <anargyri@users.noreply.github.com>
Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>
Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
* refactor: migrate vae pytorch

Signed-off-by: ds-wook <leewook94@gmail.com>

* refactor: optimize gpu calculation

Signed-off-by: ds-wook <leewook94@gmail.com>

* refactor: rebuild multi vae tensorflow to pytorch

Signed-off-by: ds-wook <leewook94@gmail.com>

* fix: rewrite multi vae

Signed-off-by: ds-wook <leewook94@gmail.com>

* Update doc for GitHub Actions runner setup (recommenders-team#2306)

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Translate NCF model from TensorFlow to PyTorch

Rewrite ncf_singlenode.py from TF v1 (sessions, placeholders, tf_slim) to
PyTorch (nn.Module). All weight initializations match TF defaults:
truncated_normal(std=0.01) for embeddings, xavier_uniform for dense layers,
no bias on output layer. Adam optimizer and BCELoss use identical defaults.

Update unit tests, quickstart notebook, deep dive notebook and NNI notebook
to use PyTorch imports. Dataset module (dataset.py) is unchanged as it has
no TF dependency.

Metrics on MovieLens 100k (seed=42, 50 epochs) are within ~4% of TF
reference, explained entirely by different RNG sequences between frameworks.
Training loss converges to the same value (0.2315 vs 0.2323).

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* refactor: change model parameter & arch

Signed-off-by: ds-wook <leewook94@gmail.com>

* Detect and re-download corrupt zip files in maybe_download

A partial download that gets interrupted leaves a truncated zip file
on disk. On retry, maybe_download sees the file exists and skips the
download, causing BadZipFile errors that persist across all retries.

Add is_valid_zip() to validate existing zip files before skipping
the download. If the file is corrupt, delete it and re-download.

Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* fix: switched both notebooks from map_at_k to map

Signed-off-by: ds-wook <leewook94@gmail.com>

* Fix by_threshold relevancy method to filter by score, not count

The relevancy_method='by_threshold' branch in merge_ranking_true_pred
was passing `threshold` as the `k` argument to get_top_k_items, so the
threshold value silently became a top-N count instead of a score cutoff.
Combined with metrics that divide by `k` (precision_at_k, ndcg_at_k,
map, map_at_k, ...), this let the resulting metric exceed 1, which is
mathematically impossible for these definitions.

Now `by_threshold` filters predictions to rows with col_prediction >=
threshold and then applies the standard top-k cutoff. Hits are bounded
by k, so metrics stay in [0, 1].

Also clarifies the `threshold` docstring on every metric that exposes
the parameter so users can tell it is a score cutoff rather than a
count of items.

Adds a regression test covering three cases:
1. Threshold above all scores -> every ranking metric is 0.
2. Threshold below all scores -> by_threshold collapses to top_k.
3. Mid threshold -> all metrics stay inside [0, 1].

Fixes recommenders-team#2154
Refs recommenders-team#2140

* Rewrite by_threshold test with concrete correctness assertions

* fix: change map metric

Signed-off-by: ds-wook <leewook94@gmail.com>

* Add support for compshare vms

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct shell commands

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Declare COMPSHARE_SPEC_FILE

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Copy repo files to the VM to avoid git clone failure

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Retry curl upon failure

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* fix(gpu): use imported cuda namespace for gpu counting

Signed-off-by: Yinchaochen <lisumchen@gmail.com>

* Update docs

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Configure Docker registry mirror for speedup

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Retry image build upon failure

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct syntax errors

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add pip index arg

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Make scripts robuster

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try DNS configs only, and remove P40 due to incompatibility with PyTorch

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Use map_at_k instead of map for ranking-metric reporting

Issue recommenders-team#2309 points out that the dict returned by
examples/06_benchmarks/benchmark_utils.py:ranking_metrics_python and
:ranking_metrics_pyspark labels its first entry "MAP" but computes it
with the Spark-style map() function, which normalizes by n_relevant
rather than min(k, n_relevant). The other entries in the same dict are
labeled "@k" and computed with the @k variants, so the first entry is
inconsistent with its neighbours and can produce values that are
mathematically valid for MAP but counter-intuitive when read alongside
Precision@k / Recall@k / NDCG@k.

Changes:

* examples/06_benchmarks/benchmark_utils.py - swap map for map_at_k in
  both the Python and PySpark ranking-metrics helpers and rename the
  dict key "MAP" to "MAP@k" so the label matches the function used.
* examples/06_benchmarks/movielens.ipynb - update the two source cells
  (the missing-row placeholder dict and the column-order list) that
  consume that dict so the benchmark table column header agrees with
  the upstream key. Cached cell outputs are left as-is; they will be
  regenerated on the next notebook run.
* recommenders/evaluation/python_evaluation.py - cross-link the map()
  and map_at_k() docstrings so a reader landing on either function can
  see the normalizer difference and pick the right one.
* recommenders/evaluation/spark_evaluation.py - same cross-link on
  SparkRankingEvaluation.map / .map_at_k.
* tests/unit/recommenders/evaluation/test_python_evaluation.py - add
  test_python_map_vs_map_at_k that pins the invariant: map_at_k equals
  map when k >= n_relevant for every user (k=10 on the existing
  fixture) and strictly exceeds it when at least one user has more
  than k relevant items (k=5, where user 3 in the fixture has 10).
* tests/test_groups.yml - register the new test in the pr_gate group.

Notebook examples under examples/00_quick_start and
examples/02_model_collaborative_filtering still import the bare map
symbol; switching them is left to a follow-up because the
tests/functional/examples/test_notebooks_*.py and
tests/smoke/examples/test_notebooks_*.py expected values for the
"map" key would need to be regenerated end-to-end.

Refs recommenders-team#1702 recommenders-team#2004

Signed-off-by: Yinchao Chen <lisumchen@gmail.com>

* test(gpu): shorten regression test name per review

Rename test_get_number_gpus_falls_back_to_cuda_namespace_when_torch_is_missing
to test_get_number_gpus_without_torch in test_gpu_utils.py and update its
entry in tests/test_groups.yml. The shorter name still pairs the function
under test with the scenario; the cuda-fallback detail is evident from the
test body.

Addresses review comment from @anargyri on recommenders-team#2314.

Signed-off-by: Yinchao Chen <lisumchen@gmail.com>

* refactor: modernize lightgbm utils

Signed-off-by: ds-wook <leewook94@gmail.com>

* Add support for Docker and PyPI mirrors

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Clean up code for retries and correct docker mirror url

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update docs

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct docker build arg for pypi index url

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Combine test groups for gpu

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Fix asset URL in fm_deep_dive.ipynb

path had `mains-team/resources` repeated muiltiple times

this is corrected to  value in https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/xdeepfm_criteo.ipynb

* Install cuda driver from scratch

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Lock gpu version

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Remove install_container_toolkit.sh

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* refactor: migrate lightgcn pytorch

Signed-off-by: ds-wook <leewook94@gmail.com>

* fix: remove type_checking and change print to logging

Signed-off-by: ds-wook <leewook94@gmail.com>

* refactor: redesign architectural args

Signed-off-by: ds-wook <leewook94@gmail.com>

* fix: reorder logger

Signed-off-by: ds-wook <leewook94@gmail.com>

* Try CUDA 13.2.1

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add 2080 for use

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Use the latest cuda driver

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Increase notebook execution timeout

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Remove 2080 due to insufficient gpu memory for nightly tests

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add support for http proxy for speed up

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Prepend "VM_" to env variables for cache

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update map_at_k in notebooks

* PR template typo

* Remove Surprise and rerun benchmarks

* Fix MLLib docs link

* Fix docstring for MAP

* Add support for https proxy

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add more retry on failure

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add support for installing gpu drivers for P40

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct configure.sh

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add retries for ssh key setup

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Set apt and uv to bypass SSL verification

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update spec.json

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Remove http/https proxy because of no apparent gains on speed

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Revert

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Remove yq installation in Dockerfile

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update https proxy config for apt

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct apt operations

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Remove apt conf

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Remove P40

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add more retries

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Move http(s) proxy config from config.json to CLI

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* fix: fixed lightgcn model and rerun notebook

Signed-off-by: ds-wook <leewook94@gmail.com>

* Add support to set vm requirements

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add by_threshold ranking metrics regression test

Signed-off-by: benben951 <jie13383393540@163.com>

* Set VM stop schedule

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Explicitly specify secrets to use (recommenders-team#2328)

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct secrets in calling workflows

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct docker args

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Resolve key unbound error

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct empty stop time error

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Reduce spec retrying times

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Lock CUDA version to 580 on V100S

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Refactor duplicate code

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add more GPU choices

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct delete_vm.sh

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct GPUType

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Try the spot chargetype

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct jq filter

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Alternate charge type for the same gputype

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Add more GPU options

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* fix: honor benchmark recommendation args

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

* fix: address benchmark review suggestions

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

* Resolve issue on empty secrets (recommenders-team#2334)

* Use pull_request_target to pass secrets

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Correct paths

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Test before changing pull_request to pull_request_target

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Update docs

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* Use pull_request_target

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

---------

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

* fix: set default timeout for dataset downloads

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

* Correct git refs and working dir (recommenders-team#2338)

Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>

---------

Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: Simon Zhao <simonyansenzhao@gmail.com>
Signed-off-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Signed-off-by: Yinchaochen <lisumchen@gmail.com>
Signed-off-by: Yinchao Chen <lisumchen@gmail.com>
Signed-off-by: benben951 <jie13383393540@163.com>
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: ds-wook <leewook94@gmail.com>
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Co-authored-by: Miguel Fierro <3491412+miguelgfierro@users.noreply.github.com>
Co-authored-by: Yinchaochen <lisumchen@gmail.com>
Co-authored-by: Andreas Argyriou <anargyri@users.noreply.github.com>
Co-authored-by: seanv507 <sean.violante@gmail.com>
Co-authored-by: benben951 <jie13383393540@163.com>
Co-authored-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
Signed-off-by: ds-wook <leewook94@gmail.com>
@ds-wook ds-wook force-pushed the refactor/migrate-gru-pytorch branch from aff1ed6 to 826b52f Compare June 4, 2026 09:45
@ds-wook

ds-wook commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

@miguelgfierro I’ve addressed the requested changes. When you have a chance, could you please take another look? Thank you!


jobs:
run-tests:
uses: ./.github/workflows/compshare-vm.yml

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file has no real changes. Could you unstage?


jobs:
run-tests:
uses: ./.github/workflows/compshare-vm.yml

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one too.

"\n",
"This notebook trains a GRU \\[1\\] sequential recommender on a public Amazon Movies & TV review dataset. The GRU model encodes a user's history of (item, category) interactions with a Gated Recurrent Unit, then combines the resulting hidden state with the target item / category / user embeddings to predict whether the user will interact with the target next.\n",
"\n",
"The implementation is PyTorch-native; it does not depend on the legacy TF `BaseModel` / `SequentialBaseModel` hierarchy used by the other sequential models in this repository (see [`sequential_recsys_amazondataset.ipynb`](sequential_recsys_amazondataset.ipynb) for those). It mirrors the single-file standalone `nn.Module` convention introduced by the LightGCN PyTorch migration.\n"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove the stuff about TensorFlow since it is not relevant for this notebook.

class GRUModel(SequentialBaseModel):
"""GRU Model

class GRUModel(nn.Module):

@anargyri anargyri Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it make sense to create something similar to SequentialBaseModel as in the existing code, since we are going to reuse it for other deeprec models too? This file is now longer and the same thing will happen with the other models.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants