INFERENG-6076: Fix granite-4.0-h-small TP for HPU#81

Open
wjhrdy wants to merge 1 commit into `main` from `INFERENG-6076/granite-hpu-tp-fix`

Conversation

@wjhrdy (Contributor) commented Apr 14, 2026

Summary

  • Reduce tensor-parallel-size from 2 to 1 for ibm-granite/granite-4.0-h-small performance server config
  • granite-4.0-h-small is a hybrid Mamba/Attention model (1.8B params) with GQA group counts that are not divisible by 2, causing `assert n_groups % self.tp_size == 0` to fail on HPU
  • TP=1 is sufficient for this small model and matches the pattern of other working HPU models
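The constraint behind the failing assertion can be sketched as follows. This is a minimal illustration (not vLLM source code), and the odd group count used below is hypothetical — the actual group count for granite-4.0-h-small is not stated in this PR:

```python
# Sketch of the tensor-parallel divisibility constraint described above:
# a model's group count must split evenly across tensor-parallel ranks,
# mirroring the failing `assert n_groups % self.tp_size == 0` check.

def valid_tp_sizes(n_groups: int, candidates=(1, 2, 4, 8)):
    """Return the candidate TP sizes that divide the group count evenly."""
    return [tp for tp in candidates if n_groups % tp == 0]

# A hypothetical odd group count rules out TP=2 but always allows TP=1:
print(valid_tp_sizes(9))  # -> [1]
print(valid_tp_sizes(8))  # -> [1, 2, 4, 8]
```

TP=1 is always valid under this check, which is why dropping the tensor-parallel size is a safe fix for a model that fits on one card.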

Test plan

  • Re-run HPU smoke tests via `ocp-test.yml` with `config_ref=INFERENG-6076/granite-hpu-tp-fix` to verify granite-4.0-h-small passes

🤖 Generated with Claude Code

granite-4.0-h-small is a hybrid Mamba/Attention model with GQA group
counts that are not divisible by 2, causing an assertion error on HPU
when tensor-parallel-size=2. Reduce to 1 since the model is small
enough (1.8B params) to run on a single card.
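The change itself amounts to a one-line config edit. The fragment below is a hypothetical sketch — the actual file path and key names in this repo's performance server config are not shown in the PR:

```yaml
# Hypothetical server config fragment illustrating the fix described above
model: ibm-granite/granite-4.0-h-small
# tensor-parallel-size: 2   # fails: n_groups % tp_size == 0 assertion on HPU
tensor-parallel-size: 1     # model is small enough to run on a single card
```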

Signed-off-by: Willy Hardy <whardy@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
