Commit a5cb49c

perf(embed): engage hybrid CPU twin under throughput and parallelize batch prep
Balance the throughput-profile hybrid split so the GPU arm and the CPU twin each receive a real share of every batch. The split target now adds the long-entity work to the GPU's share of the short work (long entities load the CPU side), so long-heavy batches keep the GPU saturated instead of emptying its subset. When the GPU subset is empty, the work runs on the CPU twin rather than being routed back to the GPU by the auto resolver, so the idle cores are used.

Parallelize the embed prep hot path: extract tokenized ids and masks with a parallel iterator, then hold them as contiguous length-sorted arrays so each dispatch borrows its sub-batch instead of re-cloning token buffers per chunk. Consolidate the Metal/CPU dispatch into one routine shared by the plain and hybrid paths.

The aggressive hybrid stays gated on the throughput resource profile; the proof path is unchanged. The batch budget continues to read the throughput plan's hardware-scaled token and attention-area caps, and now surfaces detected unified memory to the plan so the caps track the host.

Result order is stable regardless of how a batch splits across the two arms. Add partition coverage/determinism unit tests and an ignored real-model parity test asserting the split is value- and order-preserving.

Signed-off-by: Troy Fortin <troy@firelock.ai>
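
The coverage, determinism, and order-preservation properties named in the message can be illustrated with a minimal Rust sketch. Everything in it is an assumption made for illustration: the `Split` type, the `partition` and `embed_split` functions, the `gpu_fraction` parameter, and the `fake_embed` stand-in do not come from the commit, and the cutoff rule is a simple placeholder rather than the commit's split target.

```rust
/// Indices of one batch divided between the two arms (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Split {
    gpu: Vec<usize>, // original indices routed to the GPU arm
    cpu: Vec<usize>, // original indices routed to the CPU twin
}

/// Walk entities from shortest to longest and hand them to the GPU arm while
/// its token total stays within `gpu_fraction` of the batch; the rest go to
/// the CPU twin. Placeholder rule, not the commit's split target.
fn partition(lengths: &[usize], gpu_fraction: f64) -> Split {
    let mut order: Vec<usize> = (0..lengths.len()).collect();
    // Break ties by original index so the same input always yields the same split.
    order.sort_by_key(|&i| (lengths[i], i));

    let total: usize = lengths.iter().sum();
    let target = (total as f64 * gpu_fraction) as usize;

    let mut split = Split { gpu: Vec::new(), cpu: Vec::new() };
    let mut assigned = 0usize;
    for i in order {
        if assigned + lengths[i] <= target {
            assigned += lengths[i];
            split.gpu.push(i);
        } else {
            split.cpu.push(i);
        }
    }
    split
}

/// Stand-in for a real embedding call: a pure function of the token length.
fn fake_embed(len: usize) -> u64 {
    len as u64 * 31 + 7
}

/// Process each arm's subset, then write every result back to its original
/// index so the output order does not depend on how the batch was split.
fn embed_split(lengths: &[usize], split: &Split, embed: fn(usize) -> u64) -> Vec<u64> {
    let mut out = vec![0u64; lengths.len()];
    for &i in &split.gpu {
        out[i] = embed(lengths[i]);
    }
    for &i in &split.cpu {
        out[i] = embed(lengths[i]);
    }
    out
}
```

Calling `partition(&[5, 40, 3, 12], 0.5)` under this placeholder rule sends indices 2, 0, and 3 to the GPU arm and index 1 to the CPU twin. `embed_split` then returns the same vector as an unsplit pass over the four entities, including when the GPU subset is empty and all work lands on the CPU side.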
1 parent 2697c45 commit a5cb49c

2 files changed

Lines changed: 555 additions & 250 deletions
