Fix HLLM embedding/vocab alignment; add generative data download workflow by TyndaleLym · Pull Request #241 · datawhalechina/torch-rechub

TyndaleLym · 2026-05-14T12:30:30Z

Pull Request

What does this PR do?

Fix HLLM item-embedding/vocab alignment so embeddings are indexed by vocab token id (row 0 = PAD). HLLMModel now validates item_embeddings.shape against (vocab_size, d_model), L2-normalizes item embeddings once at load, and normalizes the user representation at scoring, so logits = cos-sim(x, emb) / temperature with default temperature=0.07 to match the official ByteDance HLLM logit scale. The MovieLens-1M and Amazon Books HLLM preprocessing scripts now read vocab.pkl and write embeddings keyed by token id; the example run scripts compute vocab_size from item_to_idx and use temperature=1.0 in the NCE loss to avoid double scaling.

Also fix two Amazon Books data bugs. First, the interaction loader now coerces item_id/user_id to str so the vocab built from numeric ByteDance ids lines up with the str-keyed metadata used by the embedding-generation lookup (previously this silently produced an all-zero embeddings tensor under the default --data_source bytedance). Second, the ByteDance interaction filename is split from ratings_Books.csv to amazon_books_interactions.csv, with a schema-vs-source mismatch error, so a stale raw download cannot silently be re-used.

Add a unified dataset download workflow with --no_download / --data_source / --overwrite CLI flags on the MovieLens-1M and Amazon Books preprocessing scripts, supporting Stanford SNAP raw data and ByteDance HLLM preprocessed data, with progress bars, automatic format detection and basic schema validation.

Add smoke tests covering HLLM vocab/embedding validation, bounded cosine logits, shape error messages, numeric item_id coercion, str-keyed vocab/metadata lookup and schema-mismatch handling, and update the HLLM reproduction docs (EN/ZH) and data README files accordingly.
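The "double scaling" concern can be made concrete with a short derivation. This is a sketch that assumes the NCE loss divides the logits it receives by its own temperature before the softmax; the symbols $\tau_{\text{model}}$ and $\tau_{\text{loss}}$ are introduced here for illustration and are not names from the code.

```latex
z_i = \frac{\cos(x, e_i)}{\tau_{\text{model}}}, \qquad
p_i = \frac{\exp\!\left(z_i / \tau_{\text{loss}}\right)}{\sum_j \exp\!\left(z_j / \tau_{\text{loss}}\right)}
    = \frac{\exp\!\left(\cos(x, e_i) / (\tau_{\text{model}}\,\tau_{\text{loss}})\right)}
           {\sum_j \exp\!\left(\cos(x, e_j) / (\tau_{\text{model}}\,\tau_{\text{loss}})\right)}
```

Under that assumption the effective temperature is the product $\tau_{\text{model}}\,\tau_{\text{loss}}$. With both set to 0.07 it would be $0.07 \times 0.07 = 0.0049$; setting the loss temperature to 1.0 keeps the effective value at 0.07.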

Type of Change

🐛 Bug fix
✨ New model/feature
📝 Documentation
🔧 Maintenance

How to Test

# Correctness smoke tests
python -m pytest tests/test_hllm_embedding_alignment.py tests/test_amazon_books_preprocess.py -v

# MovieLens-1M data preparation
python examples/generative/data/ml-1m/preprocess_ml_hstu.py
python examples/generative/data/ml-1m/preprocess_hllm_data.py --model_type tinyllama --device cuda

# Amazon Books data preparation (ByteDance preprocessed source, default)
python examples/generative/data/amazon-books/preprocess_amazon_books.py --data_source bytedance
python examples/generative/data/amazon-books/preprocess_amazon_books_hllm.py --data_source bytedance --model_type tinyllama --device cuda

Checklist

Code follows project style (ran python config/format_code.py)
Added tests for new functionality
Updated documentation if needed
All tests pass locally

Add --no_download/--data_source/--overwrite CLI options to the generative data preprocessing scripts so MovieLens-1M and Amazon Books raw or ByteDance-processed data can be fetched and validated from a single command. Update READMEs and the HSTU reproduction doc to describe the new flow.
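How the three flags might interact can be sketched with argparse. This is a minimal illustration, not the parser from the preprocessing scripts: only the flag names and the `bytedance` value come from this PR, while the `snap` choice name, the `build_parser` and `plan_action` helpers, and the decision rules are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing the three flags named in this PR."""
    parser = argparse.ArgumentParser(description="Prepare a generative-recsys dataset")
    # "bytedance" appears in the PR's test commands; "snap" is an assumed
    # name for the Stanford SNAP raw source.
    parser.add_argument("--data_source", choices=["bytedance", "snap"], default="bytedance")
    parser.add_argument("--no_download", action="store_true", help="only use files already on disk")
    parser.add_argument("--overwrite", action="store_true", help="re-fetch even if files exist")
    return parser


def plan_action(args, file_exists):
    """Decide what to do for one required file (illustrative rules only)."""
    if file_exists and not args.overwrite:
        return "reuse"
    if args.no_download:
        # Downloading is disabled: fall back to the local copy, or fail if there is none.
        return "reuse" if file_exists else "error"
    return "download"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.data_source, plan_action(args, file_exists=False))
```

Keeping the decision in a small pure function like `plan_action` makes the flag combinations easy to cover in tests without touching the network.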

HLLMModel now requires item_embeddings to be indexed by vocab token id (row 0 = PAD) and validates shape against vocab_size/d_model. Item embeddings are L2-normalized once at load and the user representation is normalized at scoring, so logits are cos-sim/temperature with temperature defaulting to 0.07 to match the official ByteDance HLLM scale. The MovieLens and Amazon Books HLLM preprocessing scripts now read vocab.pkl and write embeddings keyed by token id. Example run scripts compute vocab_size from item_to_idx and use temperature=1.0 in the NCE loss to avoid double scaling. Add smoke tests covering the new validation, normalization and bounded logits behavior.
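The scoring rule described above can be illustrated in plain Python. This is a sketch of the arithmetic only, not the HLLMModel code: the helper names are hypothetical, and lists of floats stand in for tensors.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit length; an all-zero row (e.g. PAD) stays all-zero."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def validate_item_embeddings(item_embeddings, vocab_size, d_model):
    """Raise if the table is not (vocab_size, d_model); row 0 is the PAD row."""
    rows = len(item_embeddings)
    cols = len(item_embeddings[0]) if rows else 0
    if (rows, cols) != (vocab_size, d_model):
        raise ValueError(
            f"item_embeddings has shape {(rows, cols)}, expected {(vocab_size, d_model)}"
        )


def cosine_logits(user_vec, item_embeddings, temperature=0.07):
    """Return cos-sim(user, item) / temperature for every row of the table."""
    user = l2_normalize(user_vec)
    items = [l2_normalize(row) for row in item_embeddings]
    return [sum(u * e for u, e in zip(user, row)) / temperature for row in items]


# Row index equals vocab token id; row 0 is PAD.
table = [[0.0, 0.0], [3.0, 4.0], [-3.0, -4.0]]
validate_item_embeddings(table, vocab_size=3, d_model=2)
logits = cosine_logits([6.0, 8.0], table)
```

Because both sides are unit-normalized, every logit lies within ±1/temperature (about ±14.29 at the default 0.07), which is the bounded-logits property the smoke tests check.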

Run config/format_code.py to apply isort + yapf to the generative data download workflow and HLLM bug fix changes, and to fix pre-existing style drift in torch_rechub/{trainers,utils} touched by the same pass.

…ion filename

Force item_id and user_id to str in read_interactions so the vocab keys built from interactions line up with the str-keyed metadata used by the HLLM pipeline; otherwise numeric ByteDance item_ids produced an all-zero embeddings tensor because the embedding-generation lookup missed every entry. Rename the ByteDance interaction default filename to amazon_books_interactions.csv so it no longer collides with the raw ratings_Books.csv and a stale raw download cannot silently be re-used under --data_source=bytedance; pair this with a schema-vs-source mismatch error in read_interactions for extra safety. Add tests covering numeric item_id coercion, str-keyed vocab/metadata lookup, and the new schema mismatch error.
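The key-type mismatch behind the all-zero embeddings can be reproduced in a few lines. This is a toy sketch, not the read_interactions code: the ids, the `build_vocab`, `count_metadata_hits` and `check_columns` helpers, and the column names passed to the check are all made up for illustration.

```python
def build_vocab(item_ids):
    """Map each distinct item id to a token id, reserving 0 for PAD."""
    vocab = {}
    for item_id in item_ids:
        vocab.setdefault(item_id, len(vocab) + 1)
    return vocab


def count_metadata_hits(vocab, metadata):
    """Count vocab keys that can be found in the metadata dict."""
    return sum(1 for item_id in vocab if item_id in metadata)


def check_columns(columns, required, data_source):
    """Fail loudly when a file does not have the columns expected for its source."""
    missing = sorted(set(required) - set(columns))
    if missing:
        raise ValueError(f"--data_source={data_source} expects columns {missing}, got {list(columns)}")


metadata = {"101": "Book A", "202": "Book B"}  # str-keyed, as in the HLLM pipeline
raw_ids = [101, 202, 101]                      # numeric ids as parsed from a CSV

# Integer keys never match str keys, so every lookup misses.
hits_before = count_metadata_hits(build_vocab(raw_ids), metadata)
# Coercing to str first makes the two key spaces line up.
hits_after = count_metadata_hits(build_vocab(str(i) for i in raw_ids), metadata)
```

A lookup that misses every entry does not raise, which is why the bug surfaced only as an all-zero tensor; an explicit check such as `check_columns` turns a stale or mismatched file into an immediate error instead.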

TyndaleLym added 4 commits May 14, 2026 19:53

style: apply project code formatter

32cb7f4


github-actions bot added the labels documentation, enhancement, model, and bug on May 14, 2026

TyndaleLym added 2 commits May 15, 2026 10:22

style: apply CI formatter

85eda1d

fix: avoid optional tiger dependency during hllm tests

4ad700b

1985312383 merged commit ab9c66b into datawhalechina:main May 15, 2026
11 of 12 checks passed

1985312383 merged 6 commits into datawhalechina:main from TyndaleLym:feature/generative-data-download
