Commit 241bbc3
Optimize shared memory usage, clean up legacy quantization, and remove unused modules (#34)
* Add batch inference support and CPU compatibility
- Add --batch_size CLI argument for parallel sequence processing
- Add conditional CUDA stream creation for CPU-only mode
- Add device-aware ExecutionEnv and Policy resource distribution
- Fix MPS compatibility on macOS
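The device-aware fallback described above (CUDA, then Apple MPS, then CPU, creating CUDA streams only when a GPU is actually present) can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the actual BloomBee code; availability flags are injected so the logic is shown without requiring torch:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceCaps:
    """Injected availability flags (in real code these would come from
    torch.cuda.is_available() / torch.backends.mps.is_available())."""
    cuda: bool = False
    mps: bool = False


def pick_device(caps: DeviceCaps) -> str:
    """Prefer CUDA, fall back to Apple MPS, then CPU."""
    if caps.cuda:
        return "cuda"
    if caps.mps:
        return "mps"
    return "cpu"


def make_stream(device: str) -> Optional[object]:
    """Only CUDA supports explicit streams; return None elsewhere so
    callers can skip stream-dependent code paths in CPU-only mode."""
    if device == "cuda":
        # In real code this would be torch.cuda.Stream(); a placeholder
        # handle stands in here.
        return object()
    return None
```

Callers then branch on `make_stream(...) is None` instead of assuming a GPU, which is what makes CPU-only and macOS/MPS runs work.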
* fix hardcoded model loading and support batch size
* resolve dependency conflicts
* docs: refine README setup and usage sections for clarity and correctness
* Add batch size related updates
* delete debug output
* delete .id files
* fix max token size problem
* add prompt
* Reduce /dev/shm peak usage during warmup/prefill stage
* delete dead code
* chore: comment out unused compare_tensors function
* delete bitsandbytes quant
* support flexgen 4bit quant
* clean debug output for server id
* add effective throughput
* clean up unnecessary files
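The `--batch_size` flag and the batched-prompt handling mentioned above might look roughly like this. A minimal sketch only: the flag name comes from the changelog, but the parser structure, the `--max_tokens` flag, and the `chunk_prompts` helper are illustrative assumptions, not the repository's actual CLI:

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical subset of the CLI; the real flag set lives under
    # src/bloombee/cli in the repository.
    p = argparse.ArgumentParser(description="batch inference sketch")
    p.add_argument("--batch_size", type=int, default=1,
                   help="number of sequences processed in parallel")
    p.add_argument("--max_tokens", type=int, default=128,
                   help="cap on generated tokens per sequence")
    return p


def chunk_prompts(prompts: List[str], batch_size: int) -> List[List[str]]:
    """Split prompts into batches of at most batch_size, so the
    engine can run each batch through the model in one forward pass."""
    return [prompts[i:i + batch_size]
            for i in range(0, len(prompts), batch_size)]
```

Bounding each batch at `batch_size` is also what keeps peak memory predictable, which matters for the /dev/shm reduction during warmup/prefill noted above.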
---------
Co-authored-by: Danny Willow Liu <dannywillowliu@uchicago.edu>
Co-authored-by: root <root@investorairig80.maas>
1 parent: 862bd3b
91 files changed
Lines changed: 449 additions, 17405 deletions
File tree
- FlexLLMGen
- docs
- flexgen_tp
- apps
- data_wrangle
- utils
- flexllmgen
- apps
- data_wrangle
- utils
- benchmarks
- src/bloombee
- client
- cli
- models/llama
- server
- utils