Fix HLLM embedding/vocab alignment; add generative data download workflow #241
Merged
1985312383 merged 6 commits on May 15, 2026
Add --no_download/--data_source/--overwrite CLI options to the generative data preprocessing scripts so MovieLens-1M and Amazon Books raw or ByteDance-processed data can be fetched and validated from a single command. Update READMEs and the HSTU reproduction doc to describe the new flow.
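A minimal sketch of how the three flags could be wired with `argparse`. Only the flag names and the `bytedance` value come from this PR; the `raw` choice name, the help strings, and the `should_download` helper are assumptions made for illustration and may not match the actual scripts.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the three flag names come from the PR text."""
    parser = argparse.ArgumentParser(
        description="Preprocess a generative recommendation dataset.")
    # Skip fetching entirely and rely on files already on disk.
    parser.add_argument("--no_download", action="store_true",
                        help="do not download; expect the data locally")
    # "bytedance" appears in the PR; "raw" is an assumed name for the raw source.
    parser.add_argument("--data_source", choices=["raw", "bytedance"],
                        default="bytedance")
    # Fetch again even when the target file is already present.
    parser.add_argument("--overwrite", action="store_true",
                        help="re-download even if the file exists")
    return parser


def should_download(args: argparse.Namespace, file_exists: bool) -> bool:
    """Decide whether to fetch, under the assumed flag semantics."""
    if args.no_download:
        return False
    return args.overwrite or not file_exists
```

Under these assumed semantics, `--no_download` takes precedence over `--overwrite`, and a plain invocation downloads only when the file is missing.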
HLLMModel now requires item_embeddings to be indexed by vocab token id (row 0 = PAD) and validates shape against vocab_size/d_model. Item embeddings are L2-normalized once at load and the user representation is normalized at scoring, so logits are cos-sim/temperature with temperature defaulting to 0.07 to match the official ByteDance HLLM scale. The MovieLens and Amazon Books HLLM preprocessing scripts now read vocab.pkl and write embeddings keyed by token id. Example run scripts compute vocab_size from item_to_idx and use temperature=1.0 in the NCE loss to avoid double scaling. Add smoke tests covering the new validation, normalization and bounded logits behavior.
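The scoring rule described above can be sketched in plain Python. This is an illustration of the idea, not the `HLLMModel` code: the function names, the list-of-lists representation, and the error message wording are assumptions, and the real model presumably operates on tensors.

```python
import math

PAD_ID = 0  # row 0 of the embedding table is reserved for PAD, per the PR


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is left unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def load_item_embeddings(rows, vocab_size, d_model):
    """Validate the (vocab_size, d_model) shape, then normalize once at load."""
    if len(rows) != vocab_size or any(len(r) != d_model for r in rows):
        raise ValueError(
            f"item_embeddings must have shape ({vocab_size}, {d_model})")
    return [l2_normalize(r) for r in rows]


def score(user_repr, item_embeddings, temperature=0.07):
    """Return cos-sim(user, item) / temperature for every vocab row."""
    u = l2_normalize(user_repr)
    return [sum(a * b for a, b in zip(u, e)) / temperature
            for e in item_embeddings]


# Row 0 is the PAD row; rows 1 and 2 are two toy items.
items = load_item_embeddings([[0, 0], [3, 4], [-1, 0]], vocab_size=3, d_model=2)
logits = score([6, 8], items)
```

Because both sides are unit length, every logit lies in `[-1/temperature, 1/temperature]`. That is also why applying a second temperature in the loss would scale the logits twice.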
Run config/format_code.py to apply isort + yapf to the generative
data download workflow and HLLM bug fix changes, and to fix
pre-existing style drift in torch_rechub/{trainers,utils} touched
by the same pass.
…ion filename
Force item_id and user_id to str in read_interactions so the vocab keys built from interactions line up with the str-keyed metadata used by the HLLM pipeline; otherwise numeric ByteDance item_ids produced an all-zero embeddings tensor because the embedding-generation lookup missed every entry. Rename the ByteDance interaction default filename to amazon_books_interactions.csv so it no longer collides with the raw ratings_Books.csv and a stale raw download cannot silently be re-used under --data_source=bytedance; pair this with a schema-vs-source mismatch error in read_interactions for extra safety. Add tests covering numeric item_id coercion, str-keyed vocab/metadata lookup, and the new schema mismatch error.
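The key-type mismatch behind the all-zero embeddings can be reproduced with plain dictionaries. This is a toy illustration under assumed names (`build_vocab`, `count_metadata_hits`, and the sample ids are hypothetical), not the project's `read_interactions` code.

```python
def build_vocab(item_ids, coerce_to_str=True):
    """Map item ids to token ids, reserving token id 0 for PAD."""
    vocab = {}
    for item_id in item_ids:
        key = str(item_id) if coerce_to_str else item_id
        if key not in vocab:
            vocab[key] = len(vocab) + 1  # token ids start at 1
    return vocab


def count_metadata_hits(vocab, metadata):
    """Count vocab keys that the str-keyed metadata lookup can resolve."""
    return sum(1 for key in vocab if key in metadata)


# Metadata is keyed by str, while the interactions carry numeric ids.
metadata = {"101": "Book A", "202": "Book B"}
interaction_ids = [101, 202, 101]

hits_without = count_metadata_hits(
    build_vocab(interaction_ids, coerce_to_str=False), metadata)
hits_with = count_metadata_hits(build_vocab(interaction_ids), metadata)
```

Without coercion, the integer key `101` never equals the string key `"101"`, so every lookup misses and each embedding row would stay at its zero initial value. With coercion, both items resolve.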
Pull Request
What does this PR do?
Fix HLLM item-embedding/vocab alignment so embeddings are indexed by vocab token id (row 0 = PAD). HLLMModel now validates `item_embeddings.shape` against `(vocab_size, d_model)`, L2-normalizes item embeddings once at load and normalizes the user representation at scoring, so logits = cos-sim(x, emb) / temperature with default `temperature=0.07` to match the official ByteDance HLLM logit scale. MovieLens-1M and Amazon Books HLLM preprocessing scripts now read `vocab.pkl` and write embeddings keyed by token id; the example run scripts compute `vocab_size` from `item_to_idx` and use `temperature=1.0` in the NCE loss to avoid double scaling.

Also fix two Amazon Books data bugs. First, the interaction loader now coerces `item_id`/`user_id` to `str` so the vocab built from numeric ByteDance ids lines up with the str-keyed metadata used by the embedding-generation lookup (previously this silently produced an all-zero embeddings tensor under the default `--data_source bytedance`). Second, the ByteDance interaction filename is split from `ratings_Books.csv` to `amazon_books_interactions.csv`, with a schema-vs-source mismatch error so a stale raw download cannot silently be re-used.

Add a unified dataset download workflow with `--no_download`/`--data_source`/`--overwrite` CLI flags on the MovieLens-1M and Amazon Books preprocessing scripts, supporting Stanford SNAP raw data and ByteDance HLLM preprocessed data, with progress bars, automatic format detection and basic schema validation.

Add smoke tests covering HLLM vocab/embedding validation, bounded cosine logits, shape error messages, numeric `item_id` coercion, str-keyed vocab/metadata lookup and schema-mismatch handling, and update the HLLM reproduction docs (EN/ZH) and data README files accordingly.

Type of Change
How to Test
Checklist
Code follows the project style (ran python config/format_code.py)