Conversation

@Frederick-Zhu
Summary:
Add Zipf distribution support for generating embedding indices in test utilities.

This change adds an optional `zipf_alpha` parameter that, when set, uses `np.random.zipf()` instead of uniform random sampling to
generate indices. This produces a skewed access pattern where some embedding rows are accessed much more frequently than others,
enabling benchmarking scenarios with hot/cold data characteristics.
With alpha ≈ 1.1-1.2, approximately 20% of embedding rows receive ~80% of accesses. This can be useful for testing:
- SSD-offloading strategies
- Caching policies
- Tiered storage performance
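
The hot/cold skew described above can be checked empirically. The sketch below (an illustration, not the PR's code) samples Zipf-distributed values with `np.random.Generator.zipf`, folds them into a 1000-row table with a modulo (since `zipf` returns unbounded positive integers), and measures what share of accesses the hottest 20% of rows receive; the folding strategy and row count are assumptions.

```python
import numpy as np

# Sample Zipf-distributed indices and fold them into a fixed-size table.
rng = np.random.default_rng(0)
num_rows, num_samples = 1000, 100_000
# rng.zipf returns integers >= 1 with a heavy tail; map into [0, num_rows).
indices = (rng.zipf(1.2, size=num_samples) - 1) % num_rows

# Share of accesses captured by the hottest 20% of rows.
counts = np.bincount(indices, minlength=num_rows)
hot = np.sort(counts)[::-1][: num_rows // 5]
share = hot.sum() / num_samples  # well above half for alpha near 1.2
```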

## Implementation details
- Zipf logic is encapsulated in `ModelInput._generate_zipf_indices()` static method
- `numpy` is lazy-imported to avoid breaking backward compatibility when `numpy` is unavailable
- Falls back to uniform random distribution if `numpy` import fails
- Optional seed parameter allows controlling `numpy`'s random state independently of PyTorch's seed

## Changes
- `input_config.py`: Add `zipf_alpha` field to `InputConfig` dataclass
- `model_input.py`: Add `_generate_zipf_indices()` helper; plumb `zipf_alpha` through `generate()` and `_create_features_lengths_indices()`
- `test_model.py`: Add `_generate_zipf_indices()` helper; add `zipf_alpha` support to `generate()` for both regular and weighted tables
- `tests/test_model_input.py`: Add unit tests for Zipf distribution functionality
- `tests/BUCK`: Add test target for `test_model_input`
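
As a rough sketch of how the new `InputConfig` field might be consumed (the real dataclass in `input_config.py` has more fields; this stripped-down version is an assumption):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InputConfig:
    # Hypothetical minimal version; only the new field is from the PR summary.
    num_embeddings: int = 1000
    zipf_alpha: Optional[float] = None  # None keeps the existing uniform sampling

# Opting in to skewed index generation:
cfg = InputConfig(zipf_alpha=1.1)
```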

Differential Revision: D91909007
@meta-codesync

meta-codesync bot commented Jan 30, 2026

@Frederick-Zhu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91909007.

meta-cla bot added the CLA Signed label on Jan 30, 2026