
Remove O(prompt_len) prompt copies #35

Merged
HaibaraAiChan merged 23 commits into ai-decentralized:main from JiuChen0:main
Nov 21, 2025

Conversation

@JiuChen0
Contributor

  1. Remove redundant debug output
    prepare_inputs_for_generation prints a message whenever inputs_embeds is used, polluting stdout and adding synchronization overhead. This PR removes the print or routes it through a logger.

  2. Eliminate O(prompt_len) prompt copies per step
    OptimizedLlamaDecoderLayer.forward rebuilds output_ids and copies the full prompt on every forward call. This PR switches to a rolling buffer that appends only the newly generated token, avoiding unnecessary host→device copies on each decoding step.
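The logger switch in item 1 can be sketched as follows. This is a minimal illustration, not the PR's actual code; the simplified `prepare_inputs_for_generation` signature here is assumed for the example only.

```python
import logging

# Route the debug message through the logging module instead of print(),
# so it is emitted only when debug logging is explicitly enabled and
# stdout stays clean during generation.
logger = logging.getLogger(__name__)

def prepare_inputs_for_generation(inputs_embeds=None, **kwargs):
    # Hypothetical simplified signature for illustration only.
    if inputs_embeds is not None:
        # Lazy %-formatting: the message is only built if DEBUG is enabled.
        logger.debug("using inputs_embeds of shape %s",
                     getattr(inputs_embeds, "shape", None))
    return {"inputs_embeds": inputs_embeds, **kwargs}
```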
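The rolling-buffer idea in item 2 can be sketched in plain Python. This is an assumed illustration of the technique, not the PR's implementation: the prompt is copied exactly once into a preallocated buffer, and each generation step writes only the new token, so the per-step cost drops from O(prompt_len) to O(1).

```python
class RollingOutputBuffer:
    """Hypothetical sketch: preallocate once, append one token per step."""

    def __init__(self, prompt_ids, max_new_tokens):
        # The full prompt is copied exactly once, at construction time.
        self.buf = list(prompt_ids) + [0] * max_new_tokens
        self.length = len(prompt_ids)

    def append(self, token_id):
        # O(1) per decoding step: write the new token in place instead of
        # rebuilding output_ids from the prompt on every forward call.
        self.buf[self.length] = token_id
        self.length += 1

    @property
    def output_ids(self):
        # Valid prefix of the buffer: prompt followed by generated tokens.
        return self.buf[:self.length]
```

In the real model the buffer would be a device-resident tensor written with an in-place index assignment, which is what avoids the repeated host→device copies the PR describes.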

@HaibaraAiChan HaibaraAiChan merged commit b3126a6 into ai-decentralized:main Nov 21, 2025
JiuChen0 added a commit to JiuChen0/BloomBee that referenced this pull request Mar 22, 2026
* Add batch inference support and CPU compatibility

- Add --batch_size CLI argument for parallel sequence processing
- Add conditional CUDA stream creation for CPU-only mode
- Add device-aware ExecutionEnv and Policy resource distribution
- Fix MPS compatibility on macOS

* fix hardcode of model loading and support batch size

* Resolving dependency conflicts

* docs: refine README setup and usage sections for clarity and correctness

* Add batch size related updates

* delete debug output

* delete .id files

* fix max token size problem

* add prompt

* Reduce /dev/shm peak usage during warmup/prefill stage

* delete dead code

* chore: comment out unused compare_tensors function

* delete bitsandbytes quant

* support flexgen 4bit quant

* clean debug output for server id

* add effective throughput

* clean up unnecessary files

* fix the error of start compute time

* Use rolling buffer to avoid O(prompt_len) copy on each forward

* The debug I/O issue has been fixed

* Use rolling buffer to avoid O(prompt_len) copy on each forward

---------

Co-authored-by: Danny Willow Liu <dannywillowliu@uchicago.edu>
Co-authored-by: root <root@investorairig80.maas>
