
RFC: Async overlap of CPU and GPU compute during dynamic inference step #2019

@tdene


This RFC proposes the following plan for optimizing the dynamic inference step function.

There are several interconnected issues at play:

  • Dynamic sampling code is currently very unoptimized. There is a PR draft that reimplements it.
  • async_generate_output_tokens_dynamic_batch mixes CPU and GPU operations indiscriminately.
  • async_generate_output_tokens_dynamic_batch is declared async, but it has no good way of yielding the event loop, so a lot of CPU time is wasted waiting for the GPU. That time can be reclaimed.
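
To illustrate the last point, here is a minimal sketch in plain asyncio (no torch). The function names are hypothetical and are not taken from the codebase. It only shows that a coroutine which never awaits runs to completion before any other task gets a turn, while a single yield point lets other work interleave.

```python
import asyncio


async def step_without_yield(log):
    # Declared async, but never awaits: the event loop cannot switch away.
    log.append("step:start")
    log.append("step:end")


async def step_with_yield(log):
    log.append("step:start")
    # One explicit yield point: other ready tasks run here.
    await asyncio.sleep(0)
    log.append("step:end")


async def other_work(log):
    # Stand-in for any other coroutine sharing the event loop.
    log.append("other")


async def run(step):
    log = []
    await asyncio.gather(step(log), other_work(log))
    return log


if __name__ == "__main__":
    print(asyncio.run(run(step_without_yield)))  # other work runs only after the step ends
    print(asyncio.run(run(step_with_yield)))     # other work runs in the middle of the step
```

In the first run "other" is logged only after "step:end"; in the second it lands between "step:start" and "step:end".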

The ideal solution appears to be:

  • Fix dynamic sampling code.
  • Clearly separate CPU and GPU operations.
  • Provide a place to yield the event loop.

The PR series suggested by this RFC is:

  1. Break async_generate_output_tokens_dynamic_batch apart into multiple sub-methods, which are clearly labeled as "CPU compute" vs "GPU compute".
  2. Implement barebones unoptimized dynamic sampling code.
  3. Tensorize the dynamic sampling bookkeeping.
  4. Optimize dynamic sampling code via graphed FlashInfer sampling.
    • A draft has been written by @kanz-nv; @tdene will finish it.
  5. Refactor dynamic logprobs computation to follow the same style as the new sampling code.
    • A draft has been written by @tdene.
  6. Implement top_n_logprobs. Needed for compatibility.
  7. Reorder the sub-methods from point 1) so that CPU compute and GPU compute each form a separate contiguous block of code, and yield the event loop after the CPU compute via torch polling.
    • Due to all the prep work, this will be a tiny, extremely readable PR.
  8. Wait for a torch update, or brainstorm a way to yield the event loop without polling in the current version of PyTorch.
    • Maybe by sampling on a single rank, instead of the current sampling on every rank?
    • Will discuss further in comments.
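
As a rough sketch of what "yield the event loop via polling" in point 7 could look like, the snippet below polls a completion check and yields between polls. Everything here is an assumption for illustration: `yield_until` and `FakeGpuEvent` are hypothetical names, and the fake event simply reports completion after a fixed number of polls so the example runs without a GPU. With PyTorch, the completion check could plausibly be the `query()` method of a recorded `torch.cuda.Event`, which returns whether the captured work has finished without blocking.

```python
import asyncio
from typing import Callable


async def yield_until(is_done: Callable[[], bool], poll_interval: float = 0.0) -> int:
    """Yield the event loop until is_done() returns True.

    Returns how many times control was handed back to the loop.
    """
    yields = 0
    while not is_done():
        await asyncio.sleep(poll_interval)
        yields += 1
    return yields


class FakeGpuEvent:
    """Hypothetical stand-in for a GPU completion event.

    Reports "not done" for the first `polls_until_done` queries, then "done".
    """

    def __init__(self, polls_until_done: int) -> None:
        self._remaining = polls_until_done

    def query(self) -> bool:
        if self._remaining <= 0:
            return True
        self._remaining -= 1
        return False


async def demo() -> tuple:
    background_ticks = 0

    async def background() -> None:
        # Stand-in for other CPU-side work sharing the event loop.
        nonlocal background_ticks
        while True:
            background_ticks += 1
            await asyncio.sleep(0)

    task = asyncio.create_task(background())
    # The GPU work would have been launched before this point.
    yields = await yield_until(FakeGpuEvent(3).query)
    task.cancel()
    return yields, background_ticks


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off is the one point 8 alludes to: polling keeps the loop responsive but spends CPU cycles on repeated checks, and a nonzero `poll_interval` trades those cycles for latency.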
