
Add Paraformer-zh ASR model support with OpenVINO inference and INT8 quantization#1629

Open
padatta wants to merge 4 commits into huggingface:main from padatta:add-paraformer-inference-model

Conversation


padatta commented Mar 3, 2026

This PR adds comprehensive support for Paraformer-zh (FunASR) automatic speech recognition models in optimum-intel, including:

  1. OpenVINO inference model (OVParaformerForSpeechSeq2Seq) following the established patterns from OVModelForTextToSpeechSeq2Seq
  2. INT8 quantization support with NNCF weight compression during export
  3. GPU optimization with dtype fixes and dynamic shape handling for Intel ARC/iGPU

What's Changed

1. Paraformer Inference Model (modeling_speech2text.py)

  • New OVParaformerForSpeechSeq2Seq class for running Paraformer models with OpenVINO
  • Supports both single-model and component-based architectures (encoder, predictor, decoder)
  • Compatible with FP32, FP16, and INT8 quantized models
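The single-model vs. component-based distinction can be pictured as a simple file-layout check. This is a minimal sketch: the IR file names below are illustrative assumptions, not necessarily the ones the PR's class uses to resolve components.

```python
from pathlib import Path

# Hypothetical component IR file names; the actual names in the PR may differ.
COMPONENT_IRS = (
    "openvino_encoder_model.xml",
    "openvino_predictor_model.xml",
    "openvino_decoder_model.xml",
)

def detect_architecture(model_dir: str) -> str:
    """Return 'component-based' if separate encoder/predictor/decoder IRs
    exist in the export directory, otherwise assume a single fused model."""
    d = Path(model_dir)
    if all((d / name).is_file() for name in COMPONENT_IRS):
        return "component-based"
    return "single-model"
```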

2. INT8 Quantization Support

  • Integrated INT8_SYM weight compression using NNCF during export
  • Export command properly handles Paraformer quantization workflow

3. GPU Compatibility Fixes

  • Fixed dtype issues: use int32 for positions/ranges, int64 for large indices
  • Implemented dynamic mask creation using shape operations (ONNX/OpenVINO compatible)
  • Added max_label_len clamping (4096) to prevent memory issues
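The dynamic-mask and clamping fixes can be sketched with a small PyTorch helper, assuming an `arange` + broadcasted-comparison formulation (the real logic lives inside the exported graph and may differ in detail):

```python
import torch

def make_length_mask(lengths: torch.Tensor, max_label_len: int = 4096) -> torch.Tensor:
    # Clamp the longest sequence, mirroring the PR's max_label_len cap of 4096.
    max_len = lengths.max().clamp(max=max_label_len)
    # arange + broadcasted comparison uses only shape-derived ops, so ONNX/
    # OpenVINO tracing keeps the time axis dynamic instead of baking in a
    # Python int. int32 matches the dtype fix for positions/ranges.
    positions = torch.arange(max_len, dtype=torch.int32, device=lengths.device)
    return positions.unsqueeze(0) < lengths.unsqueeze(1)

mask = make_length_mask(torch.tensor([3, 5], dtype=torch.int32))
# row 0 is True for the first 3 positions, row 1 for all 5
```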

4. Comprehensive Test Suite

Added tests/openvino/test_paraformer.py with 10 test cases:

  • Model loading from pretrained path
  • Basic forward pass inference
  • Batch inference with variable lengths
  • NumPy array input support
  • generate() API compatibility
  • CPU ↔ GPU device switching
  • Model serialization (save/load)
  • Decoding logic without token_num
  • Model properties validation
  • Output dataclass structure
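For orientation, the output structure the last test exercises can be pictured roughly as follows. This is a hypothetical mirror built only from the fields shown in the usage example; the PR's actual dataclass may carry additional fields.

```python
from dataclasses import dataclass
import torch

@dataclass
class ParaformerOutput:
    # Hypothetical sketch of the fields shown in the usage example;
    # the PR's actual output class may differ.
    logits: torch.Tensor      # [batch, seq_len, vocab_size]
    token_ids: torch.Tensor   # [batch, seq_len]
    token_num: torch.Tensor   # [batch]

out = ParaformerOutput(
    logits=torch.zeros(1, 4, 8404),
    token_ids=torch.zeros(1, 4, dtype=torch.int64),
    token_num=torch.tensor([4]),
)
```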

Usage Example

from optimum.intel.openvino import OVParaformerForSpeechSeq2Seq
import torch

# Load INT8 model on GPU
model = OVParaformerForSpeechSeq2Seq.from_pretrained(
    "path/to/paraformer-zh-int8/ov_models",
    device="GPU"
)

# Prepare input (extracted features from audio)
speech = torch.randn(1, 100, 560)  # [batch, time_frames, features]
speech_lengths = torch.tensor([100], dtype=torch.int32)

# Run inference
output = model(speech, speech_lengths)

# Access results
print(output.logits)      # [batch, seq_len, vocab_size]
print(output.token_ids)   # [batch, seq_len] - decoded tokens
print(output.token_num)   # [batch] - number of tokens per sequence
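For the batch-inference case with variable lengths, utterances can be zero-padded along the time axis before the call above. A sketch, assuming zero-padding is the right strategy for this model's feature inputs:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two utterances of different lengths, same 560-dim features as above.
utts = [torch.randn(80, 560), torch.randn(100, 560)]
speech = pad_sequence(utts, batch_first=True)  # [2, 100, 560], zero-padded
speech_lengths = torch.tensor([u.shape[0] for u in utts], dtype=torch.int32)
# output = model(speech, speech_lengths)  # model loaded as shown earlier
```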

Export to OpenVINO with INT8 Quantization

optimum-cli export openvino \
  --model damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch \
  --task automatic-speech-recognition \
  --weight-format int8 \
  paraformer-zh-int8

Testing

All 10 test cases pass:

export PARAFORMER_TEST_MODEL=/path/to/paraformer-zh/ov_models
python -m unittest tests.openvino.test_paraformer -v
test_batch_inference ... ok
test_decode_without_token_num ... ok
test_device_switching ... ok
test_generate_api ... ok
test_load_model_from_pretrained ... ok
test_model_inference ... ok
test_model_output_dataclass ... ok
test_model_properties ... ok
test_numpy_input ... ok
test_save_and_load ... ok

----------------------------------------------------------------------
Ran 10 tests in 13.071s

OK

Checklist

  • Implementation follows existing optimum-intel patterns
  • Comprehensive test suite added (10 test cases)
  • All tests passing
  • GPU and CPU support verified
  • INT8 quantization validated
  • Performance benchmarks included
  • Code follows style guidelines
  • Documentation inline with code

Note: This PR builds upon the existing Paraformer export support (commit f79a331) and adds the inference runtime capabilities to make the exported models usable within the optimum-intel framework.

vliu13 and others added 4 commits January 12, 2026 09:39
- Implement modeling_speech2text.py following modeling_text2speech.py pattern
- Add OVParaformerForSpeechSeq2Seq for Paraformer ASR model inference
- Support single model and component-based architectures
- Add comprehensive test suite with 10 test cases
- CPU/GPU support with dynamic device switching
- FP32/FP16/INT8 model support
- Includes encoder, predictor, and decoder components
- Integrate INT8_SYM weight compression during export using NNCF
- Use compress_to_fp16=True to store FP32 constants as FP16 for GPU
- Skip redundant main_quantize pass for Paraformer models
- Fix dtype issues: use int32 for positions/ranges and indices
- Implement dynamic mask creation using shape operations for ONNX/OpenVINO
- Fix CIF predictor tensor assignments for ScatterNDUpdate op
- All changes enable successful GPU inference with INT8 models
