
Remove O(prompt_len) prompt copies #35

Merged
HaibaraAiChan merged 23 commits into ai-decentralized:main from JiuChen0:main
Nov 21, 2025

Conversation

@JiuChen0
Contributor

  1. Remove redundant debug output
    prepare_inputs_for_generation prints a message whenever inputs_embeds is used, polluting stdout and adding synchronization overhead. This PR removes the print or routes it through a logger.

  2. Eliminate O(prompt_len) prompt copies per step
    OptimizedLlamaDecoderLayer.forward rebuilds output_ids and copies the full prompt on every forward call. This PR switches to a rolling buffer that appends only the newly generated token, avoiding unnecessary host→device copies on each decoding step.
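The logger switch in item 1 can be sketched as follows. This is a minimal illustration, not the PR's actual code; the simplified `prepare_inputs_for_generation` signature here is assumed for the example only.

```python
import logging

# Route the debug message through the logging module instead of print(),
# so it is emitted only when debug logging is explicitly enabled and
# stdout stays clean during generation.
logger = logging.getLogger(__name__)

def prepare_inputs_for_generation(inputs_embeds=None, **kwargs):
    # Hypothetical simplified signature for illustration only.
    if inputs_embeds is not None:
        # Lazy %-formatting: the message is only built if DEBUG is enabled.
        logger.debug("using inputs_embeds of shape %s",
                     getattr(inputs_embeds, "shape", None))
    return {"inputs_embeds": inputs_embeds, **kwargs}
```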
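The rolling-buffer idea in item 2 can be sketched in plain Python. This is an assumed illustration of the technique, not the PR's implementation: the prompt is copied exactly once into a preallocated buffer, and each generation step writes only the new token, so the per-step cost drops from O(prompt_len) to O(1).

```python
class RollingOutputBuffer:
    """Hypothetical sketch: preallocate once, append one token per step."""

    def __init__(self, prompt_ids, max_new_tokens):
        # The full prompt is copied exactly once, at construction time.
        self.buf = list(prompt_ids) + [0] * max_new_tokens
        self.length = len(prompt_ids)

    def append(self, token_id):
        # O(1) per decoding step: write the new token in place instead of
        # rebuilding output_ids from the prompt on every forward call.
        self.buf[self.length] = token_id
        self.length += 1

    @property
    def output_ids(self):
        # Valid prefix of the buffer: prompt followed by generated tokens.
        return self.buf[:self.length]
```

In the real model the buffer would be a device-resident tensor written with an in-place index assignment, which is what avoids the repeated host→device copies the PR describes.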

@HaibaraAiChan HaibaraAiChan merged commit b3126a6 into ai-decentralized:main Nov 21, 2025
JiuChen0 added a commit to JiuChen0/BloomBee that referenced this pull request Mar 22, 2026
* Add batch inference support and CPU compatibility

- Add --batch_size CLI argument for parallel sequence processing
- Add conditional CUDA stream creation for CPU-only mode
- Add device-aware ExecutionEnv and Policy resource distribution
- Fix MPS compatibility on macOS

* fix hardcode of model loading and support batch size

* Resolving dependency conflicts

* docs: refine README setup and usage sections for clarity and correctness

* Add batch size related updates

* delete debug output

* delete .id files

* fix max token size problem

* add prompt

* Reduce /dev/shm peak usage during warmup/prefill stage

* delete dead code

* chore: comment out unused compare_tensors function

* delete bitsandbytes quant

* support flexgen 4bit quant

* clean debug output for server id

* add effective throughput

* clean up unnecessary files

* fix the error of start compute time

* Use rolling buffer to avoid O(prompt_len) copy on each forward

* The debug I/O issue has been fixed

* Use rolling buffer to avoid O(prompt_len) copy on each forward

---------

Co-authored-by: Danny Willow Liu <dannywillowliu@uchicago.edu>
Co-authored-by: root <root@investorairig80.maas>
