Conversation

@Frederick-Zhu
Summary:
Add Zipf distribution support for generating embedding indices in test utilities.

This change adds an optional `zipf_alpha` parameter that, when set, uses `np.random.zipf()` instead of uniform random sampling to
generate indices. This produces a skewed access pattern where some embedding rows are accessed much more frequently than others,
enabling benchmarking scenarios with hot/cold data characteristics.
With alpha ≈ 1.1-1.2, approximately 20% of embedding rows receive ~80% of accesses. This can be useful for testing:
- SSD-offloading strategies
- Caching policies
- Tiered storage performance
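
The hot/cold skew described above can be checked empirically. The sketch below (an illustration, not the PR's code) samples Zipf-distributed values with `np.random.Generator.zipf`, folds them into a 1000-row table with a modulo (since `zipf` returns unbounded positive integers), and measures what share of accesses the hottest 20% of rows receive; the folding strategy and row count are assumptions.

```python
import numpy as np

# Sample Zipf-distributed indices and fold them into a fixed-size table.
rng = np.random.default_rng(0)
num_rows, num_samples = 1000, 100_000
# rng.zipf returns integers >= 1 with a heavy tail; map into [0, num_rows).
indices = (rng.zipf(1.2, size=num_samples) - 1) % num_rows

# Share of accesses captured by the hottest 20% of rows.
counts = np.bincount(indices, minlength=num_rows)
hot = np.sort(counts)[::-1][: num_rows // 5]
share = hot.sum() / num_samples  # well above half for alpha near 1.2
```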

## Implementation details
- Zipf logic is encapsulated in `ModelInput._generate_zipf_indices()` static method
- `numpy` is lazy-imported to avoid breaking backward compatibility when `numpy` is unavailable
- Falls back to uniform random distribution if `numpy` import fails
- Optional seed parameter allows controlling `numpy`'s random state independently of PyTorch's seed

## Changes
- `input_config.py`: Add `zipf_alpha` field to `InputConfig` dataclass
- `model_input.py`: Add `_generate_zipf_indices()` helper; plumb `zipf_alpha` through `generate()` and `_create_features_lengths_indices()`
- `test_model.py`: Add `_generate_zipf_indices()` helper; add `zipf_alpha` support to `generate()` for both regular and weighted tables
- `tests/test_model_input.py`: Add unit tests for Zipf distribution functionality
- `tests/BUCK`: Add test target for `test_model_input`
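
As a rough sketch of how the new `InputConfig` field might be consumed (the real dataclass in `input_config.py` has more fields; this stripped-down version is an assumption):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InputConfig:
    # Hypothetical minimal version; only the new field is from the PR summary.
    num_embeddings: int = 1000
    zipf_alpha: Optional[float] = None  # None keeps the existing uniform sampling

# Opting in to skewed index generation:
cfg = InputConfig(zipf_alpha=1.1)
```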

Differential Revision: D91909007
@meta-codesync

meta-codesync bot commented Jan 30, 2026

@Frederick-Zhu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91909007.

meta-cla bot added the CLA Signed label on Jan 30, 2026