feat(nemotron): offline CoreML/ANE encoder (drop-in for the MLX Conformer)#13
Closed
beshkenadze wants to merge 5 commits into
Add `--palettize N` to convert_encoder.py: `8` = 8-bit uniform palettize, `-1` = per-channel linear int8 (robust to weight outliers). Smaller model (~2x) + faster ANE compute, accuracy validated per-model. Also port the `aten::Int` coremltools patch (the converter otherwise breaks on torch >= 2.8: "only 0-dimensional arrays can be converted to Python scalars"). Findings: 8-bit works cleanly for RNN-T encoders; Parakeet's TDT decoder is more quant-sensitive — 8-bit uniform crushes its outlier-heavy weights (encoder cosine 0.21), per-channel linear int8 recovers it (word-identical per-window), and long audio needs a smaller chunk (TDT + padded final chunk + quant). Runners: _remote_parakeet_linear.sh, _remote_offline_palettize.sh.
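A toy illustration of why per-channel linear int8 can survive weight outliers that crush a single shared 8-bit uniform grid. This is a pure-Python sketch, not the coremltools implementation; the function names and the example weights are hypothetical, and it does not reproduce the 0.21 cosine reported above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def quantize_uniform_per_tensor(rows, nbits=8):
    """Snap every weight to one evenly spaced grid shared by the whole tensor."""
    flat = [w for row in rows for w in row]
    lo, hi = min(flat), max(flat)
    levels = (1 << nbits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    return [[lo + round((w - lo) / step) * step for w in row] for row in rows]


def quantize_linear_per_channel(rows, nbits=8):
    """Symmetric linear quantization with a separate scale per row (channel)."""
    qmax = (1 << (nbits - 1)) - 1  # 127 for int8
    out = []
    for row in rows:
        scale = max(abs(w) for w in row) / qmax or 1.0
        out.append([max(-qmax, min(qmax, round(w / scale))) * scale for w in row])
    return out


# Hypothetical weights: channel 0 is small, channel 1 carries one large outlier.
rows = [
    [0.01, -0.01, 0.02, -0.02],
    [100.0, 0.5, -0.5, 0.25],
]
uniform = quantize_uniform_per_tensor(rows)
per_channel = quantize_linear_per_channel(rows)

print("channel 0 cosine, shared uniform grid:", cosine(rows[0], uniform[0]))
print("channel 0 cosine, per-channel int8:   ", cosine(rows[0], per_channel[0]))
```

In this toy case the outlier in channel 1 stretches the shared grid so far that every weight in channel 0 lands on the same level, while the per-channel scale keeps channel 0 close to its original direction.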
…rmer) Run the Nemotron 3.5 ASR FastConformer encoder on the ANE via CoreML for the offline (decode) path, mirroring the Parakeet CoreML/ANE encoder. The FastConformer has the same fixed-shape I/O, so the generic ConformerCoreMLEncoder is reused (typealias, not duplicated). The prompt MLP + RNN-T decoder stay in MLX.
- NemotronASRModel: optional coreMLEncoder; decode() runs it (falls back to MLX on any failure); generate() auto-clamps chunkDuration to the model's fixed length so overlap-merge stitches long audio.
- enableCoreMLEncoder(modelURL:) + --coreml-encoder <path> for Nemotron.
- ParakeetCoreMLEncoder.fixedFrames made public (used for the clamp).
- CI-safe tests.

Streaming CoreML is a follow-up.
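The chunk clamp and overlapping windows described in this commit could look roughly like the Python sketch below. The real code is Swift and is not shown on this page, so every name here is hypothetical. The 100 frames-per-second rate (1000 frames = 10 s) and the 2 s overlap are assumptions made for illustration.

```python
def clamp_chunk_duration(requested_s, fixed_frames, frames_per_second=100.0):
    """Limit the requested chunk length to what the fixed-shape encoder accepts.

    frames_per_second=100.0 is an assumed feature rate, not a value taken
    from the PR.
    """
    max_s = fixed_frames / frames_per_second
    return min(requested_s, max_s)


def chunk_windows(total_s, chunk_s, overlap_s):
    """Return (start, end) windows in seconds that cover the audio with overlap."""
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")
    windows = []
    start = 0.0
    step = chunk_s - overlap_s
    while True:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# A 30 s request against a 1000-frame encoder is clamped before windowing.
chunk = clamp_chunk_duration(30.0, fixed_frames=1000)
for start, end in chunk_windows(25.0, chunk, overlap_s=2.0):
    print(f"{start:5.1f} s to {end:5.1f} s")
```

The last window is usually shorter than the fixed length, so a fixed-shape encoder would need it padded before inference.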
…oder Adds enableCoreMLEncoder(repo:) + defaultANEEncoderRepo (the matched 3.5 .mlpackage published on HF), reusing Parakeet's downloader (the encoder package is generic). mlx-audio-swift-stt --ane now auto-downloads + runs the Nemotron encoder on the ANE (no manual --coreml-encoder path). Verified end-to-end: downloads + transcribes.
Re-upload the offline Nemotron 3.5 encoder palettized to 8-bit (564 MB, ~2x smaller than fp16; transcript word-identical to MLX — only the int8/fp16-vs-bf16 floor) under the existing production package name, so `--ane` is unchanged. Card + upload script. Convert with `convert_encoder.py --model nvidia/nemotron-3.5-asr-streaming-0.6b --frames 1000 --palettize 8`.
Machine-specific conversion runners + HF upload/card scripts are dev scaffolding, not part of the upstream feature (the converter convert_encoder.py + the Swift encoder are).
Superseded by upstream Blaizzy#202 (offline Nemotron ANE).
Stacked on Blaizzy#199 (Parakeet CoreML/ANE encoder), base `feat/parakeet-coreml-ane-encoder`; reuses the generic `ConformerCoreMLEncoder` it introduces.

What

Runs the Nemotron 3.5 ASR FastConformer encoder on the ANE via CoreML for the offline (`decode`) path. The FastConformer has the same fixed-shape I/O as Parakeet's Conformer, so the offline encoder is the same `ConformerCoreMLEncoder` (typealias, not duplicated). The prompt MLP + RNN-T decoder stay in MLX.

Changes

- `NemotronASRModel`: optional `coreMLEncoder`; `decode()` runs it (falls back to MLX on any failure); `generate()` auto-clamps `chunkDuration` to the model's fixed length so overlap-merge stitches arbitrary-length audio.
- `enableCoreMLEncoder(modelURL:)` + `--coreml-encoder <path>` for Nemotron in the STT CLI.
- `ParakeetCoreMLEncoder.fixedFrames` made `public` (used for the clamp).
- CI-safe tests (`NemotronCoreMLEncoderTests`).

Measured (M1 Max, Python hybrid via `coremltools.predict` on the ANE)

Encoder on ANE ≈ 2× faster than MLX-fp32 (43 vs 80 ms per 10 s of audio); ~1.31× end-to-end offline; GPU power reduced roughly 9× (encoder off the GPU). MLX-Nemotron runs fp32 (no bf16 path), which is why the ANE wins here (filed beshkenadze/mlx-audio#25 to add bf16).

Conversion

`nvidia/nemotron-speech-streaming-en-0.6b` (or 3.5) encoder → CoreML `.mlpackage` via `tools/coreml-ane/convert_encoder.py --model <id>` (99% ANE-native ops); pass it to `--coreml-encoder`.

Follow-ups

- `.mlpackage` artifact + `--ane` auto-download (the published encoder must match the MLX weights; the 3.5 multilingual encoder needs NeMo `main` to convert).

Production-safe: public `MLModel` + `MLComputeUnits` only (macOS 14+), no private APIs.
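
The "falls back to MLX on any failure" behaviour of `decode()` can be pictured with the Python sketch below. The actual implementation is Swift and is not reproduced here; `encode_with_fallback` and both encoder callables are hypothetical stand-ins, so this shows only the control flow.

```python
def encode_with_fallback(features, coreml_encoder, mlx_encoder):
    """Prefer the optional CoreML encoder; fall back to MLX if it is absent or fails.

    Returns (encoded, backend) so the caller can tell which path actually ran.
    """
    if coreml_encoder is not None:
        try:
            return coreml_encoder(features), "coreml"
        except Exception:
            # Any CoreML failure is treated as non-fatal: use the MLX path instead.
            pass
    return mlx_encoder(features), "mlx"


def broken_coreml(features):
    # Hypothetical encoder that always fails, e.g. on a shape mismatch.
    raise RuntimeError("fixed-shape input mismatch")


def mlx_stub(features):
    # Hypothetical stand-in for the MLX encoder.
    return [2 * x for x in features]


print(encode_with_fallback([1, 2, 3], broken_coreml, mlx_stub))
print(encode_with_fallback([1, 2, 3], None, mlx_stub))
```

Returning the backend name alongside the result is a choice made for this sketch; it makes a silent fallback visible to callers and tests.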