Fix HLLM embedding/vocab alignment; add generative data download workflow#241

Merged
1985312383 merged 6 commits into datawhalechina:main from TyndaleLym:feature/generative-data-download on May 15, 2026

@TyndaleLym commented on May 14, 2026

Contributor

Pull Request

What does this PR do?

Fix HLLM item-embedding/vocab alignment so embeddings are indexed by vocab token id (row 0 = PAD). HLLMModel now validates item_embeddings.shape against (vocab_size, d_model), L2-normalizes item embeddings once at load and normalizes the user representation at scoring, so logits = cos-sim(x, emb) / temperature with default temperature=0.07 to match the official ByteDance HLLM logit scale. MovieLens-1M and Amazon Books HLLM preprocessing scripts now read vocab.pkl and write embeddings keyed by token id; the example run scripts compute vocab_size from item_to_idx and use temperature=1.0 in the NCE loss to avoid double scaling.

Also fix two Amazon Books data bugs: the interaction loader now coerces item_id/user_id to str so the vocab built from numeric ByteDance ids lines up with the str-keyed metadata used by the embedding-generation lookup (previously this silently produced an all-zero embeddings tensor under the default --data_source bytedance), and the ByteDance interaction filename is split from ratings_Books.csv to amazon_books_interactions.csv with a schema-vs-source mismatch error so a stale raw download cannot silently be re-used.

Add a unified dataset download workflow with --no_download / --data_source / --overwrite CLI flags on the MovieLens-1M and Amazon Books preprocessing scripts, supporting Stanford SNAP raw data and ByteDance HLLM preprocessed data, with progress bars, automatic format detection and basic schema validation.

Add smoke tests covering HLLM vocab/embedding validation, bounded cosine logits, shape error messages, numeric item_id coercion, str-keyed vocab/metadata lookup and schema-mismatch handling, and update the HLLM reproduction docs (EN/ZH) and data README files accordingly.
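The scoring rule described above can be sketched without any framework. This is a minimal pure-Python illustration, not the code in HLLMModel: the function names `load_item_embeddings` and `cosine_logits` are hypothetical, and plain lists stand in for tensors. It assumes row 0 is an all-zero PAD row, validates the shape against (vocab_size, d_model), normalizes item rows once, and normalizes the user vector at scoring time.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit L2 norm; an all-zero (PAD) row stays all-zero."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def load_item_embeddings(raw, vocab_size, d_model):
    """Hypothetical loader: validate shape, then normalize each row once."""
    shape = (len(raw), len(raw[0]) if raw else 0)
    if shape != (vocab_size, d_model) or any(len(r) != d_model for r in raw):
        raise ValueError(
            f"item_embeddings has shape {shape}, expected {(vocab_size, d_model)}")
    return [l2_normalize(row) for row in raw]


def cosine_logits(user_repr, normed_items, temperature=0.07):
    """logits[i] = cos-sim(user, item_i) / temperature."""
    x = l2_normalize(user_repr)
    return [sum(a * b for a, b in zip(x, e)) / temperature for e in normed_items]


# Row index == vocab token id; row 0 is PAD.
items = load_item_embeddings([[0.0, 0.0], [3.0, 4.0], [0.0, -2.0]],
                             vocab_size=3, d_model=2)
logits = cosine_logits([6.0, 8.0], items)
```

Because both sides are unit-normalized, every logit lies in [-1/temperature, 1/temperature], which is the bounded behavior the smoke tests are described as checking.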

Type of Change

  • 🐛 Bug fix
  • ✨ New model/feature
  • 📝 Documentation
  • 🔧 Maintenance

How to Test

# Correctness smoke tests
python -m pytest tests/test_hllm_embedding_alignment.py tests/test_amazon_books_preprocess.py -v

# MovieLens-1M data preparation
python examples/generative/data/ml-1m/preprocess_ml_hstu.py
python examples/generative/data/ml-1m/preprocess_hllm_data.py --model_type tinyllama --device cuda

# Amazon Books data preparation (ByteDance preprocessed source, default)
python examples/generative/data/amazon-books/preprocess_amazon_books.py --data_source bytedance
python examples/generative/data/amazon-books/preprocess_amazon_books_hllm.py --data_source bytedance --model_type tinyllama --device cuda

Checklist

  • Code follows project style (ran python config/format_code.py)
  • Added tests for new functionality
  • Updated documentation if needed
  • All tests pass locally

Add --no_download/--data_source/--overwrite CLI options to the
generative data preprocessing scripts so MovieLens-1M and Amazon
Books raw or ByteDance-processed data can be fetched and validated
from a single command. Update READMEs and the HSTU reproduction
doc to describe the new flow.
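For orientation, the three flags named in this commit could be wired up roughly as follows. This is a hedged sketch only: the flag names come from the commit message, but the `choices` values (in particular `"raw"`), the defaults, and the `should_download` policy are assumptions made for illustration and may differ from the actual scripts.

```python
import argparse


def build_parser():
    """Hypothetical parser; only the three flag names are taken from the PR."""
    parser = argparse.ArgumentParser(description="Preprocessing CLI sketch")
    # "raw" is a placeholder name for the SNAP raw source (assumption).
    parser.add_argument("--data_source", choices=["raw", "bytedance"],
                        default="bytedance",
                        help="which upstream dataset variant to use")
    parser.add_argument("--no_download", action="store_true",
                        help="use local files only, never fetch")
    parser.add_argument("--overwrite", action="store_true",
                        help="re-fetch even if the file already exists")
    return parser


def should_download(file_exists, no_download, overwrite):
    """Assumed policy: --no_download always wins; otherwise fetch when the
    file is missing or --overwrite was given."""
    if no_download:
        return False
    return overwrite or not file_exists


args = build_parser().parse_args(["--data_source", "bytedance", "--overwrite"])
fetch = should_download(file_exists=True,
                        no_download=args.no_download,
                        overwrite=args.overwrite)
```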
HLLMModel now requires item_embeddings to be indexed by vocab token
id (row 0 = PAD) and validates shape against vocab_size/d_model.
Item embeddings are L2-normalized once at load and the user
representation is normalized at scoring, so logits are
cos-sim/temperature with temperature defaulting to 0.07 to match
the official ByteDance HLLM scale. The MovieLens and Amazon Books
HLLM preprocessing scripts now read vocab.pkl and write embeddings
keyed by token id. Example run scripts compute vocab_size from
item_to_idx and use temperature=1.0 in the NCE loss to avoid double
scaling. Add smoke tests covering the new validation, normalization
and bounded logits behavior.
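The "double scaling" concern can be checked with a few lines of arithmetic. The sketch below assumes a softmax-style loss that divides its input logits by its own temperature; the helper names are hypothetical and the cosine similarities are made-up illustrative values. If the model already returns cos-sim / 0.07 and the loss divides by 0.07 again, the effective temperature becomes 0.07 * 0.07 = 0.0049.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_probs(model_logits, loss_temperature=1.0):
    """Probabilities a softmax-style loss would see (assumed form)."""
    return softmax([v / loss_temperature for v in model_logits])


cos_sims = [0.9, 0.5, -0.2]          # illustrative values only
model_temperature = 0.07
model_logits = [c / model_temperature for c in cos_sims]

single = loss_probs(model_logits, loss_temperature=1.0)    # scaled once
double = loss_probs(model_logits, loss_temperature=0.07)   # scaled twice
direct = softmax([c / (0.07 * 0.07) for c in cos_sims])    # same as double
```

With the loss temperature left at 1.0, the distribution reflects the intended 0.07 scale; scaling twice makes it far sharper than intended.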
Run config/format_code.py to apply isort + yapf to the generative
data download workflow and HLLM bug fix changes, and to fix
pre-existing style drift in torch_rechub/{trainers,utils} touched
by the same pass.
…ion filename

Force item_id and user_id to str in read_interactions so the
vocab keys built from interactions line up with the str-keyed
metadata used by the HLLM pipeline; otherwise numeric ByteDance
item_ids produced an all-zero embeddings tensor because the
embedding-generation lookup missed every entry. Rename the
ByteDance interaction default filename to
amazon_books_interactions.csv so it no longer collides with the
raw ratings_Books.csv and a stale raw download cannot silently
be re-used under --data_source=bytedance; pair this with a
schema-vs-source mismatch error in read_interactions for extra
safety. Add tests covering numeric item_id coercion, str-keyed
vocab/metadata lookup, and the new schema mismatch error.
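The key-type mismatch behind the all-zero embeddings can be reproduced with plain dictionaries. This is an illustrative sketch, not the project's `read_interactions`: `build_vocab` and `count_metadata_hits` are hypothetical helpers, and the ids and titles are invented.

```python
def build_vocab(item_ids):
    """Map each distinct item id to a token id, reserving 0 for PAD."""
    vocab = {}
    for item_id in item_ids:
        if item_id not in vocab:
            vocab[item_id] = len(vocab) + 1
    return vocab


def count_metadata_hits(vocab, metadata):
    """Number of vocab keys that can be found in the metadata dict."""
    return sum(1 for item_id in vocab if item_id in metadata)


metadata = {"101": "Book A", "202": "Book B"}   # str-keyed, as in the pipeline
raw_ids = [101, 202, 101]                       # numeric ids parsed as ints

# Without coercion, int keys never match str keys: every lookup misses.
hits_before = count_metadata_hits(build_vocab(raw_ids), metadata)

# Coercing to str first makes the vocab keys line up with the metadata.
coerced_vocab = build_vocab(str(i) for i in raw_ids)
hits_after = count_metadata_hits(coerced_vocab, metadata)
```

When every lookup misses, no item ever receives a generated embedding, which is consistent with the all-zero tensor described in the commit message.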
@github-actions (bot) added the documentation, enhancement, model, and bug labels on May 14, 2026
@1985312383 merged commit ab9c66b into datawhalechina:main on May 15, 2026
11 of 12 checks passed

Labels

bug, documentation, enhancement, model

2 participants