feat(nemotron): cache-aware streaming CoreML/ANE encoder #203
Draft
beshkenadze wants to merge 3 commits into
Run the Parakeet Conformer encoder on the Apple Neural Engine via CoreML — the .mlpackage is auto-downloaded from Hugging Face while the TDT decoder stays in MLX. Opt-in encoder API (aneEncoder:) + CI-safe tests.
…rmer) Offline CoreML/ANE encoder for Nemotron 3.5 ASR — a drop-in for the MLX Conformer encoder, with the .mlpackage auto-downloaded from Hugging Face. Stacks on the Parakeet CoreML/ANE encoder.
Cache-aware streaming FastConformer encoder on the ANE via CoreML (caches as explicit in/out tensors); the .mlpackage is auto-downloaded from Hugging Face. Stacks on the offline CoreML/ANE encoder.
74a580e to 6907cee
Cache-aware streaming CoreML/ANE encoder for Nemotron
Runs the Nemotron 3.5 streaming FastConformer encoder on the ANE in
generateStream (validated uniform-121 feeding + manual cache threading; the prompt MLP and
RNN-T decode stay in MLX). 8-bit palettized (~28% faster on the ANE, transcript word-identical).

--stream --ane (auto-download) / --coreml-stream-encoder.

This is a power/thermal feature, not a speed one, and honestly only a partial GPU offload:
just the encoder moves to the ANE; the decoder still runs on the GPU for each chunk. For realtime
streaming the speed-vs-MLX difference is moot (RTF ~45–58×); the value is freeing the GPU for
concurrent work, plus lower power draw and cooler operation for always-on / battery streaming.
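
For reviewers unfamiliar with the pattern, here is a minimal sketch of what "uniform feeding + manual cache threading" could look like. It is not the code in this PR. It assumes that "uniform-121" means every encoder call receives exactly 121 feature frames, and it zero-pads the final chunk, which is also an assumption. `StreamingEncoder`, `uniformChunks`, `encodeStream`, and `ToyEncoder` are hypothetical names; `ToyEncoder` stands in for the CoreML model so the sketch runs without an .mlpackage.

```swift
// Hypothetical stand-in for a cache-aware encoder: each call takes one
// fixed-size chunk plus the previous cache and returns the encoded output
// plus the updated cache (caches as explicit in/out values).
protocol StreamingEncoder {
    associatedtype Cache
    func initialCache() -> Cache
    func step(chunk: [[Float]], cache: Cache) -> (encoded: [[Float]], cache: Cache)
}

/// Splits `frames` into chunks of exactly `size` frames.
/// The last chunk is zero-padded (an assumption of this sketch).
func uniformChunks(_ frames: [[Float]], size: Int, featureDim: Int) -> [[[Float]]] {
    precondition(size > 0, "chunk size must be positive")
    var chunks: [[[Float]]] = []
    var start = 0
    while start < frames.count {
        let end = min(start + size, frames.count)
        var chunk = Array(frames[start..<end])
        if chunk.count < size {
            let pad = [Float](repeating: 0, count: featureDim)
            chunk += Array(repeating: pad, count: size - chunk.count)
        }
        chunks.append(chunk)
        start = end
    }
    return chunks
}

/// Feeds uniform chunks to the encoder, threading the cache by hand.
func encodeStream<E: StreamingEncoder>(
    _ encoder: E, frames: [[Float]], chunkSize: Int, featureDim: Int
) -> [[Float]] {
    var cache = encoder.initialCache()
    var output: [[Float]] = []
    for chunk in uniformChunks(frames, size: chunkSize, featureDim: featureDim) {
        let result = encoder.step(chunk: chunk, cache: cache)
        output += result.encoded
        cache = result.cache  // the returned cache becomes the next call's input
    }
    return output
}

// Toy encoder for illustration only: its cache is the number of chunks seen
// so far, and it emits one vector per chunk: [chunks seen before, chunk length].
struct ToyEncoder: StreamingEncoder {
    func initialCache() -> Int { 0 }
    func step(chunk: [[Float]], cache: Int) -> (encoded: [[Float]], cache: Int) {
        return ([[Float(cache), Float(chunk.count)]], cache + 1)
    }
}
```

With 250 two-dimensional frames and a chunk size of 121, this yields three chunks (the last holds 8 real frames plus 113 padded ones), and the toy output shows the cache advancing 0, 1, 2 across calls. Keeping every call the same shape is what allows a fixed-shape compiled model to be reused for each chunk.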