Commit 241bbc3
Optimize shared memory usage, clean up legacy quantization, and remove unused modules (#34)
* Add batch inference support and CPU compatibility
- Add --batch_size CLI argument for parallel sequence processing
- Add conditional CUDA stream creation for CPU-only mode
- Add device-aware ExecutionEnv and Policy resource distribution
- Fix MPS compatibility on macOS
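The device-aware fallback described above (CUDA, then Apple MPS, then CPU, creating CUDA streams only when a GPU is actually present) can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the actual BloomBee code; availability flags are injected so the logic is shown without requiring torch:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceCaps:
    """Injected availability flags (in real code these would come from
    torch.cuda.is_available() / torch.backends.mps.is_available())."""
    cuda: bool = False
    mps: bool = False


def pick_device(caps: DeviceCaps) -> str:
    """Prefer CUDA, fall back to Apple MPS, then CPU."""
    if caps.cuda:
        return "cuda"
    if caps.mps:
        return "mps"
    return "cpu"


def make_stream(device: str) -> Optional[object]:
    """Only CUDA supports explicit streams; return None elsewhere so
    callers can skip stream-dependent code paths in CPU-only mode."""
    if device == "cuda":
        # In real code this would be torch.cuda.Stream(); a placeholder
        # handle stands in here.
        return object()
    return None
```

Callers then branch on `make_stream(...) is None` instead of assuming a GPU, which is what makes CPU-only and macOS/MPS runs work.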
* fix hardcoded model loading and support batch size
* resolve dependency conflicts
* docs: refine README setup and usage sections for clarity and correctness
* Add batch size related updates
* delete debug output
* delete .id files
* fix max token size problem
* add prompt
* Reduce /dev/shm peak usage during warmup/prefill stage
* delete dead code
* chore: comment out unused compare_tensors function
* delete bitsandbytes quant
* support flexgen 4bit quant
* clean debug output for server id
* add effective throughput
* clean up unnecessary files
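The `--batch_size` flag and the batched-prompt handling mentioned above might look roughly like this. A minimal sketch only: the flag name comes from the changelog, but the parser structure, the `--max_tokens` flag, and the `chunk_prompts` helper are illustrative assumptions, not the repository's actual CLI:

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical subset of the CLI; the real flag set lives under
    # src/bloombee/cli in the repository.
    p = argparse.ArgumentParser(description="batch inference sketch")
    p.add_argument("--batch_size", type=int, default=1,
                   help="number of sequences processed in parallel")
    p.add_argument("--max_tokens", type=int, default=128,
                   help="cap on generated tokens per sequence")
    return p


def chunk_prompts(prompts: List[str], batch_size: int) -> List[List[str]]:
    """Split prompts into batches of at most batch_size, so the
    engine can run each batch through the model in one forward pass."""
    return [prompts[i:i + batch_size]
            for i in range(0, len(prompts), batch_size)]
```

Bounding each batch at `batch_size` is also what keeps peak memory predictable, which matters for the /dev/shm reduction during warmup/prefill noted above.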
---------
Co-authored-by: Danny Willow Liu <dannywillowliu@uchicago.edu>
Co-authored-by: root <root@investorairig80.maas>
1 parent: 862bd3b
91 files changed
Lines changed: 449 additions, 17405 deletions
File tree
- FlexLLMGen
- docs
- flexgen_tp
- apps
- data_wrangle
- utils
- flexllmgen
- apps
- data_wrangle
- utils
- benchmarks
- src/bloombee
- client
- cli
- models/llama
- server
- utils