Commit 241bbc3

JiuChen0, dannywillowliu-uchi, and root authored
Optimize shared memory usage, clean up legacy quantization, and remove unused modules (#34)
* Add batch inference support and CPU compatibility
  - Add --batch_size CLI argument for parallel sequence processing
  - Add conditional CUDA stream creation for CPU-only mode
  - Add device-aware ExecutionEnv and Policy resource distribution
  - Fix MPS compatibility on macOS
* fix hardcode of model loading and support batch size
* Resolving dependency conflicts
* docs: refine README setup and usage sections for clarity and correctness
* Add batch size related updates
* delete ddebug output
* delete .id files
* fix max token size problem
* add prompt
* Reduce /dev/shm peak usage during warmup/prefill stage
* delete dead code
* chore: comment out unused compare_tensors function
* delete bitsandbytes quant
* support flexgen 4bit quant
* clean debug output for server id
* add effective throughput
* clean up unnecessary files

---------

Co-authored-by: Danny Willow Liu <dannywillowliu@uchicago.edu>
Co-authored-by: root <root@investorairig80.maas>
1 parent 862bd3b commit 241bbc3

91 files changed

Lines changed: 449 additions & 17405 deletions

FlexLLMGen/.gitignore

Lines changed: 0 additions & 35 deletions
This file was deleted.

FlexLLMGen/LICENSE

Lines changed: 0 additions & 203 deletions
This file was deleted.
