Run the Parakeet Conformer encoder on the Apple Neural Engine (CoreML)#12
Closed
beshkenadze wants to merge 8 commits into
Closed
Run the Parakeet Conformer encoder on the Apple Neural Engine (CoreML)#12beshkenadze wants to merge 8 commits into
beshkenadze wants to merge 8 commits into
Conversation
…aming) (Blaizzy#193) * perf(sortformer): single bulk readback in predsToSegments predsToSegments built its result with a per-frame .item() GPU->CPU read (one synchronous round-trip per frame, per speaker, per call). On long streaming runs that is ~95k syncs and dominates the non-encoder time. Replace it with a single bulk asArray() readback followed by pure-Swift change detection. Output is bit-identical (verified 0.0% DER on a 32-min 2-speaker file vs the previous implementation; the existing SortformerPostprocessingTests cover basic / empty / min-duration / merge-gap / sorted cases). ~1.8x faster streaming end to end. * test(sortformer): pin predsToSegments boundary times incl. trailing segment The existing post-processing tests assert only segment counts/order, not exact times, and never exercise a speaker active through the final frame (the tail branch). Add predsToSegmentsExactBoundaries which pins the start edge, the inactive-close edge, and the active-to-last-frame case — locking the exact frame->time mapping the bulk-readback refactor preserves.
Co-authored-by: vanch <vanchye@outlook.com>
…aizzy#196) - Fix blocking weight-load crash: prompt_kernel used integer @ModuleInfo keys (0/2) -> MLX-swift array misinterpretation. Remap to linear0/linear2. - Add cache-aware streaming (NemotronASRStreaming.swift): per-layer attention + causal-conv caches + incremental causal subsampling (16-frame mel cache); generateStream now streams O(n) with no recompute. Validated vs NeMo CUDA reference (FLEURS en-US 200u): offline 9.62%, streaming 9.43% (CUDA 9.58%); single-clip token-exact.
648dac3 to
bb8ac70
Compare
ParakeetCoreMLEncoder is a drop-in for the MLX encoder that runs the Conformer on the Apple Neural Engine via CoreML, wired through the existing EncoderExecutionImplementation hook (.coreML case + enableCoreMLEncoder). Decoder and chunking stay in MLX. The model is fixed-shape (ANE requirement): chunk mel is padded to the fixed length and the output cropped via the subsampling formula; the stride-padded ANE output is read by strides. Falls back to the MLX encoder if CoreML is unavailable. CLI: --coreml-encoder <path>. Public MLModel + MLComputeUnits only.
convert_encoder.py traces the Conformer encoder at a fixed shape; convert_traced.py runs the coremltools conversion in an isolated numpy<2 env (coremltools 9.0 + numpy>=2 fails on a folded aten::Int const). Produces the fp16 MLProgram .mlpackage for --coreml-encoder. README documents conversion + usage.
Expose subsampledLength as a static helper and verify it matches the dw-striding output-length formula; assert a missing .mlpackage throws (model then falls back to MLX). No ANE/model/network needed -> runs in CI. swift test: 2/2 pass.
Add ANEEncoder enum (.off default / .on / .repo(String) / .package(URL)) and ParakeetModel.fromPretrained(aneEncoder:) so callers just flip it on. .on/.repo download the .mlpackage from Hugging Face (default beshkenadze/parakeet-tdt-0.6b-v3-coreml-ane) via the existing HubClient, cached. CLI gains --ane. Verified end-to-end (download + transcribe, 2802 words, ~105x RT); swift test 3/3.
6b0cd38 to
1a7b407
Compare
Owner
Author
|
Superseded by upstream PR Blaizzy#199 (rebased onto upstream/main). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Run the Parakeet Conformer encoder on the Apple Neural Engine (CoreML)
Summary
Adds an optional CoreML/ANE path for the Parakeet Conformer encoder, behind a flag.
The encoder (≈95% of the compute) runs on the Apple Neural Engine via CoreML while the
TDT decoder and chunking stay in MLX. Same transcript, lower power, and a small speedup
on top.
It plugs into the existing
EncoderExecutionImplementationhook inParakeetModel(new
.coreMLcase +enableCoreMLEncoder(modelURL:)), so the decode path is untouched.It falls back to the MLX encoder if CoreML is unavailable. Public
MLModel+MLComputeUnitsonly — no private ANE APIs.# auto-downloads the prebuilt encoder from Hugging Face mlx-audio-swift-stt --model beshkenadze/parakeet-tdt-0.6b-v3-mlx-fp16 \ --audio in.wav --output-path out --ane --chunk-duration 9.95// .off (default) · .on (default HF repo) · .repo("id") · .package(localURL) let model = try await ParakeetModel.fromPretrained(repo, aneEncoder: .on)Why
ANE has no public API — CoreML is the only sanctioned route, and MLX is GPU/Metal only.
Splitting the graph at the encoder boundary (static feed-forward → CoreML/ANE; the
autoregressive TDT loop → MLX) is a clean, lossless way to reach the ANE. The payoff is
mostly power/thermal (the encoder leaves the GPU), with a speedup as a bonus.
Results
Measured on M1 Max,
parakeet-tdt-0.6b-v3, a 20.8-min TED-LIUM 3 talk, chunk 9.95s.fp16-vs-bf16 difference (CoreML-fp16 is actually closer to fp32 than the shipped
MLX-bf16 encoder).
How it works
residency; a dynamic (
RangeDim) time axis drops it to 0%. The Swift wrapper pads eachchunk's mel to the fixed length and crops the output via the subsampling formula. Keep
--chunk-duration ≤ frames·10ms.MLMultiArrayis stride-padded, so the wrapper reads it by strides..mlpackage) lives intools/coreml-ane/(convert_encoder.py+convert_traced.py); see the README.--fp16-iogives 100% ANE / 0 CPU ops.Scope
ParakeetCoreMLEncoder.swift,ParakeetModel.swift(the.coreMLcase),App.swift(the--coreml-encoderflag).tools/coreml-ane/converter + README.Limitations / follow-ups
.mlpackageis not bundled (it's large). A prebuilt one is hosted on Hugging Face(
beshkenadze/parakeet-tdt-0.6b-v3-coreml-ane);users can also convert it via the tooling.
MLMultiArraywould lift the Swift RTF further (the power win is independent).Testing
Tests/ParakeetCoreMLEncoderTests.swift, swift-testing): theoutput-length math matches the dw-striding formula, and a missing
.mlpackagethrows(→ MLX fallback). No ANE/model/network needed;
swift test: 2/2 pass.release); transcript parity verified against the all-MLX path on thefull talk.