Optimize shared memory usage, clean up legacy quantization, and remove unused modules #34

Merged
HaibaraAiChan merged 18 commits into ai-decentralized:main from JiuChen0:dev_shm
Nov 13, 2025

Conversation

@JiuChen0
Contributor

Description

  1. Optimized shared memory usage

    • Reduced the peak /dev/shm usage during warmup and inference, improving stability on servers with limited shared memory (e.g., 64 MB /dev/shm).
  2. Cleaned up legacy quantization from Petals

    • Removed redundant quantization logic inherited from Petals’ standard Transformer implementation.
    • Preserved and verified the FlexGen 4-bit quantization, which now works correctly for weight compression.
  3. Removed FlexLLMGen folder

    • Deleted this unintegrated and unused module to simplify the repository structure.
  4. Removed --quant_type CLI argument

    • Quantization is now configured directly in server.py, improving code clarity and reducing CLI surface.
  5. Removed unnecessary debug outputs

    • Cleaned up verbose logs for better readability and reduced runtime overhead.
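To illustrate item 1, here is a minimal sketch of the general technique for bounding peak /dev/shm usage: stream a large tensor through one small, reusable shared-memory block instead of allocating a segment as large as the tensor. This is an assumption-laden illustration, not the repository's actual code; `send_via_shm` and `CHUNK_BYTES` are hypothetical names.

```python
import numpy as np
from multiprocessing import shared_memory

CHUNK_BYTES = 1 << 20  # 1 MiB reusable buffer; this bounds peak /dev/shm usage


def send_via_shm(arr: np.ndarray, sink: list) -> None:
    """Stream `arr` through one small reusable shared-memory block (sketch).

    Allocating a /dev/shm segment as large as the tensor spikes peak usage
    during warmup; copying it out in fixed-size chunks keeps the shared
    segment at CHUNK_BYTES regardless of tensor size.
    """
    flat = arr.reshape(-1).view(np.uint8)  # raw bytes of a contiguous array
    shm = shared_memory.SharedMemory(create=True, size=CHUNK_BYTES)
    try:
        for start in range(0, flat.nbytes, CHUNK_BYTES):
            chunk = flat[start:start + CHUNK_BYTES]
            shm.buf[:len(chunk)] = chunk.tobytes()
            # A consumer process would read shm.buf here; we just collect.
            sink.append(bytes(shm.buf[:len(chunk)]))
    finally:
        shm.close()
        shm.unlink()
```

With this pattern, peak shared-memory footprint is a constant (here 1 MiB) rather than proportional to the largest tensor transferred, which is what makes a 64 MB /dev/shm workable.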
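For item 2, the preserved scheme is group-wise 4-bit weight quantization in the FlexGen style. The sketch below shows the general technique only (per-group min/max affine mapping to integers 0..15); it is not FlexGen's or this repository's actual implementation, and `quant4`, `dequant4`, and `GROUP` are hypothetical names.

```python
import numpy as np

GROUP = 64  # group size; weights are quantized in small independent groups


def quant4(w: np.ndarray):
    """Group-wise affine 4-bit quantization (illustrative sketch).

    Each group of GROUP values gets its own min/scale, so an outlier in
    one group does not degrade the precision of the others.
    """
    flat = w.reshape(-1, GROUP)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0          # 4 bits -> 16 levels
    q = np.clip(np.round((flat - lo) / scale), 0, 15).astype(np.uint8)
    return q, lo, scale


def dequant4(q, lo, scale, shape):
    """Reconstruct float weights from 4-bit codes plus per-group metadata."""
    return (q.astype(np.float32) * scale + lo).reshape(shape)
```

Rounding error is bounded by half a quantization step per element, i.e. at most `scale / 2` within each group, which is why small group sizes keep 4-bit weight compression usable.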

@HaibaraAiChan HaibaraAiChan merged commit 241bbc3 into ai-decentralized:main Nov 13, 2025
@JiuChen0 JiuChen0 deleted the dev_shm branch November 21, 2025 01:30
JiuChen0 added a commit to JiuChen0/BloomBee that referenced this pull request Mar 22, 2026
Optimize shared memory usage, clean up legacy quantization, and remove unused modules (ai-decentralized#34)

* Add batch inference support and CPU compatibility

- Add --batch_size CLI argument for parallel sequence processing
- Add conditional CUDA stream creation for CPU-only mode
- Add device-aware ExecutionEnv and Policy resource distribution
- Fix MPS compatibility on macOS

* fix hardcode of model loading and support batch size

* Resolving dependency conflicts

* docs: refine README setup and usage sections for clarity and correctness

* Add batch size related updates

* delete debug output

* delete .id files

* fix max token size problem

* add prompt

* Reduce /dev/shm peak usage during warmup/prefill stage

* delete dead code

* chore: comment out unused compare_tensors function

* delete bitsandbytes quant

* support flexgen 4bit quant

* clean debug output for server id

* add effective throughput

* clean up unnecessary files

---------

Co-authored-by: Danny Willow Liu <dannywillowliu@uchicago.edu>
Co-authored-by: root <root@investorairig80.maas>
