Commit a5cb49c
committed
perf(embed): engage hybrid CPU twin under throughput and parallelize batch prep
Balance the throughput-profile hybrid split so the GPU arm and the CPU twin
each receive a real share of every batch. The split target now adds the
long-entity work to the GPU's share of the short work (long entities load the
CPU side), so long-heavy batches keep the GPU saturated instead of emptying its
subset. When the GPU subset is empty, the work runs on the CPU twin rather than
being routed back to the GPU by the auto resolver, so the idle cores are used.
Parallelize the embed prep hot path: extract tokenized ids and masks with a
parallel iterator, then hold them as contiguous length-sorted arrays so each
dispatch borrows its sub-batch instead of re-cloning token buffers per chunk.
Consolidate the Metal/CPU dispatch into one routine shared by the plain and
hybrid paths.
The aggressive hybrid stays gated on the throughput resource profile; the proof
path is unchanged. The batch budget continues to read the throughput plan's
hardware-scaled token and attention-area caps, and now surfaces detected
unified memory to the plan so the caps track the host. Result order is stable
regardless of how a batch splits across the two arms.
Add partition coverage/determinism unit tests and an ignored real-model parity
test asserting the split is value- and order-preserving.
Signed-off-by: Troy Fortin <troy@firelock.ai>1 parent 2697c45 commit a5cb49c
2 files changed
Lines changed: 555 additions & 250 deletions
0 commit comments