feat(nemotron): offline CoreML/ANE encoder (drop-in for the MLX Conformer)#13

Closed
beshkenadze wants to merge 5 commits into feat/parakeet-coreml-ane-encoder from feat/nemotron-coreml-ane-encoder

Conversation

@beshkenadze

Owner

Stacked on Blaizzy#199 (Parakeet CoreML/ANE encoder) — base feat/parakeet-coreml-ane-encoder; reuses the generic ConformerCoreMLEncoder it introduces.

What

Runs the Nemotron 3.5 ASR FastConformer encoder on the ANE via CoreML for the offline (decode) path. The FastConformer has the same fixed-shape I/O as Parakeet's Conformer, so the offline encoder is the same ConformerCoreMLEncoder (typealias, not duplicated). The prompt MLP + RNN-T decoder stay in MLX.

Changes

  • NemotronASRModel: optional coreMLEncoder; decode() runs it (falls back to MLX on any failure); generate() auto-clamps chunkDuration to the model's fixed length so overlap-merge stitches arbitrary-length audio.
  • enableCoreMLEncoder(modelURL:) + --coreml-encoder <path> for Nemotron in the STT CLI.
  • ParakeetCoreMLEncoder.fixedFrames made public (used for the clamp).
  • CI-safe tests (NemotronCoreMLEncoderTests).
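
The decode-path behavior listed above can be sketched roughly as follows. This is an illustration in Python, not the PR's Swift code: `clamp_chunk_duration`, `encode_with_fallback`, and the 10 ms frame hop are assumptions made for the example and are not taken from the repository.

```python
FRAME_HOP_SECONDS = 0.01  # assumed mel-frame hop; the PR does not state it


def clamp_chunk_duration(requested_seconds, fixed_frames,
                         frame_hop_seconds=FRAME_HOP_SECONDS):
    """Limit the chunk length to what a fixed-shape encoder can accept."""
    max_seconds = fixed_frames * frame_hop_seconds
    return min(requested_seconds, max_seconds)


def encode_with_fallback(features, coreml_encoder, mlx_encoder):
    """Try the optional CoreML encoder first; use MLX on any failure."""
    if coreml_encoder is not None:
        try:
            return coreml_encoder(features)
        except Exception:
            pass  # any CoreML failure falls through to the MLX encoder
    return mlx_encoder(features)
```

With a hypothetical fixed length of 1000 frames, a requested 30 s chunk would be clamped to about 10 s, while a 5 s request would pass through unchanged.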

Measured (M1 Max, Python hybrid via coremltools.predict on the ANE)

The encoder is faster on the ANE than in MLX fp32 (43 ms vs 80 ms per 10 s of audio), giving ~1.31× end-to-end offline and cutting GPU power by roughly 9× (the encoder moves off the GPU). MLX-Nemotron runs in fp32 (there is no bf16 path), which is why the ANE wins here; beshkenadze/mlx-audio#25 was filed to add bf16.
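
As a sanity check on the figures above, the arithmetic can be worked through as below. The encoder speedup follows directly from the two timings; the encoder's share of offline runtime is only an inference via Amdahl's law, under the assumption that nothing except the encoder changed between the two runs. The function names are made up for this sketch.

```python
def speedup(baseline_ms, new_ms):
    """Ratio of baseline time to new time."""
    return baseline_ms / new_ms


def implied_encoder_share(end_to_end_speedup, encoder_speedup):
    """Amdahl's law solved for the fraction f of baseline time in the encoder.

    1 / ((1 - f) + f / s) = S   =>   f = (1 - 1/S) / (1 - 1/s)
    """
    return (1 - 1 / end_to_end_speedup) / (1 - 1 / encoder_speedup)


encoder_speedup = speedup(80.0, 43.0)                  # about 1.86
share = implied_encoder_share(1.31, encoder_speedup)   # about 0.51
print(f"encoder speedup ~{encoder_speedup:.2f}x, implied encoder share ~{share:.0%}")
```

Under that assumption, the encoder would account for roughly half of the MLX fp32 offline runtime, which is consistent with a ~1.31× end-to-end gain from a ~1.86× encoder speedup.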

Conversion

nvidia/nemotron-speech-streaming-en-0.6b (or 3.5) encoder → CoreML .mlpackage via tools/coreml-ane/convert_encoder.py --model <id> (99% ANE-native ops); pass it to --coreml-encoder.

Follow-ups

  • Prebuilt HF .mlpackage artifact + --ane auto-download (the published encoder must match the MLX weights; the 3.5 multilingual encoder needs NeMo main to convert).
  • Streaming CoreML encoder (cache-aware, functional cache I/O) — validated convertible (98% ANE), Swift integration pending.

Production-safe: public MLModel + MLComputeUnits only (macOS 14+), no private APIs.

@beshkenadze beshkenadze force-pushed the feat/nemotron-coreml-ane-encoder branch from 34dc197 to dcf49b8 Compare June 10, 2026 13:50
@beshkenadze beshkenadze force-pushed the feat/parakeet-coreml-ane-encoder branch from 02976b7 to cb58a8f Compare June 10, 2026 14:22
@beshkenadze beshkenadze force-pushed the feat/nemotron-coreml-ane-encoder branch from dcf49b8 to f1db220 Compare June 10, 2026 14:22
Add `--palettize N` to convert_encoder.py: `8` = 8-bit uniform palettize, `-1` = per-channel
linear int8 (robust to weight outliers). Smaller model (~2x) + faster ANE compute, accuracy
validated per-model. Also port the `aten::Int` coremltools patch (the converter otherwise
breaks on torch >= 2.8: "only 0-dimensional arrays can be converted to Python scalars").

Findings: 8-bit works cleanly for RNN-T encoders; Parakeet's TDT decoder is more quant-sensitive
— 8-bit uniform crushes its outlier-heavy weights (encoder cosine 0.21), per-channel linear int8
recovers it (word-identical per-window), and long audio needs a smaller chunk (TDT + padded final
chunk + quant). Runners: _remote_parakeet_linear.sh, _remote_offline_palettize.sh.
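
The outlier effect described in the findings can be illustrated with a toy example. This is a simplified sketch, not the converter's code: a single per-tensor 8-bit grid stands in for uniform palettization, the weights are invented, and the cosine here is measured on weights rather than on encoder outputs as in the commit message.

```python
import math


def quantize_symmetric(values, scale):
    """Round to signed 8-bit integers at the given scale, then dequantize."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def per_tensor(weights):
    """One scale shared by every channel (row) of the tensor."""
    scale = max(abs(v) for row in weights for v in row) / 127 or 1.0
    return [quantize_symmetric(row, scale) for row in weights]


def per_channel(weights):
    """A separate scale for each channel (row)."""
    out = []
    for row in weights:
        scale = max(abs(v) for v in row) / 127 or 1.0
        out.append(quantize_symmetric(row, scale))
    return out


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


weights = [
    [0.010, -0.020, 0.015, 0.005],  # typical small-magnitude channel
    [50.0, -40.0, 30.0, -20.0],     # outlier-heavy channel
]

shared = cosine(weights[0], per_tensor(weights)[0])     # small channel collapses to zeros
separate = cosine(weights[0], per_channel(weights)[0])  # small channel is preserved
```

With a shared scale, the outlier channel sets the step size and the small channel rounds entirely to zero; with per-channel scales, each row keeps its own resolution.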
…rmer)

Run the Nemotron 3.5 ASR FastConformer encoder on the ANE via CoreML for the
offline (decode) path, mirroring the Parakeet CoreML/ANE encoder. The FastConformer
has the same fixed-shape I/O, so the generic ConformerCoreMLEncoder is reused
(typealias, not duplicated). The prompt MLP + RNN-T decoder stay in MLX.

- NemotronASRModel: optional coreMLEncoder; decode() runs it (falls back to MLX on
  any failure); generate() auto-clamps chunkDuration to the model's fixed length so
  overlap-merge stitches long audio.
- enableCoreMLEncoder(modelURL:) + --coreml-encoder <path> for Nemotron.
- ParakeetCoreMLEncoder.fixedFrames made public (used for the clamp).
- CI-safe tests.

Streaming CoreML is a follow-up.
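
To show how a clamped chunk length still covers long audio, here is a minimal sketch of planning overlapping windows. `plan_windows` and the 2 s overlap are hypothetical; the PR does not state the overlap value or the name of the function that does the stitching.

```python
def plan_windows(total_seconds, chunk_seconds, overlap_seconds):
    """Return (start, end) windows covering the audio with a fixed overlap."""
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap must be non-negative and shorter than the chunk")
    step = chunk_seconds - overlap_seconds
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return windows


# 25 s of audio, 10 s chunks (the clamped length), 2 s overlap (assumed)
print(plan_windows(25.0, 10.0, 2.0))
# [(0.0, 10.0), (8.0, 18.0), (16.0, 25.0)]
```

The last window is shorter than the chunk length, so a fixed-shape encoder would need it padded up to the full size before inference.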
…oder

Adds enableCoreMLEncoder(repo:) + defaultANEEncoderRepo (the matched 3.5 .mlpackage
published on HF), reusing Parakeet's downloader (the encoder package is generic).
mlx-audio-swift-stt --ane now auto-downloads + runs the Nemotron encoder on the ANE
(no manual --coreml-encoder path). Verified end-to-end: downloads + transcribes.
@beshkenadze beshkenadze force-pushed the feat/parakeet-coreml-ane-encoder branch from cb58a8f to c332b3b Compare June 10, 2026 14:26
Re-upload the offline Nemotron 3.5 encoder palettized to 8-bit (564 MB, ~2x smaller than
fp16; transcript word-identical to MLX — only the int8/fp16-vs-bf16 floor) under the existing
production package name, so `--ane` is unchanged. Card + upload script. Convert with
`convert_encoder.py --model nvidia/nemotron-3.5-asr-streaming-0.6b --frames 1000 --palettize 8`.
@beshkenadze beshkenadze force-pushed the feat/nemotron-coreml-ane-encoder branch from f1db220 to 1343501 Compare June 10, 2026 14:26
Machine-specific conversion runners + HF upload/card scripts are dev scaffolding, not part
of the upstream feature (the converter convert_encoder.py + the Swift encoder are).
@beshkenadze beshkenadze marked this pull request as draft June 10, 2026 14:47
@beshkenadze

Owner Author

Superseded by upstream Blaizzy#202 (offline Nemotron ANE).
