Open
Description
Current PrefillPlan interface:
template <typename IdType>
inline cudaError_t PrefillPlan(void* float_buffer, size_t float_workspace_size_in_bytes,
                               void* int_buffer, void* page_locked_int_buffer,
                               size_t int_workspace_size_in_bytes, PrefillPlanInfo& plan_info,
                               IdType* qo_indptr_h, IdType* kv_indptr_h, uint32_t total_num_rows,
                               uint32_t batch_size, uint32_t num_qo_heads, uint32_t num_kv_heads,
                               uint32_t head_dim_qk, uint32_t head_dim_vo, uint32_t page_size,
                               bool enable_cuda_graph, uint32_t sizeof_dtype_o,
                               cudaStream_t stream);
float_workspace_size_in_bytes is an input parameter, but callers have no way to determine the required value in advance, whether through a function call or by any other means.
Currently, PrefillPlan may attempt to allocate buffers such as batch_prefill_tmp_v and batch_prefill_tmp_s without checking whether they would exceed the available float workspace size.
To prevent this, a function should be implemented that reports the required float workspace size to the user.
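To make the request concrete, here is a rough sketch of what such a size query and a bounds-checked allocation could look like. Everything in it is hypothetical: `PrefillFloatWorkspaceSize`, `CheckedBumpAllocator`, `AlignUp`, the `max_partial_rows` parameter, the 16-byte alignment, and the assumed shapes of the two temporary buffers are illustrations only, not the library's actual API or sizing logic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: round n up to a multiple of alignment (a power of two).
inline size_t AlignUp(size_t n, size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (n + alignment - 1) & ~(alignment - 1);
}

// Hypothetical size query. ASSUMPTION (not the library's real layout):
//   tmp_v holds max_partial_rows * num_qo_heads * head_dim_vo floats,
//   tmp_s holds max_partial_rows * num_qo_heads floats,
// and each buffer is padded to a 16-byte boundary.
inline size_t PrefillFloatWorkspaceSize(uint32_t max_partial_rows, uint32_t num_qo_heads,
                                        uint32_t head_dim_vo) {
  constexpr size_t kAlign = 16;
  const size_t tmp_v_bytes =
      static_cast<size_t>(max_partial_rows) * num_qo_heads * head_dim_vo * sizeof(float);
  const size_t tmp_s_bytes = static_cast<size_t>(max_partial_rows) * num_qo_heads * sizeof(float);
  return AlignUp(tmp_v_bytes, kAlign) + AlignUp(tmp_s_bytes, kAlign);
}

// Hypothetical bump allocator over a caller-provided workspace.
// It refuses any request that would run past the stated capacity.
// The base pointer is assumed to be aligned already.
class CheckedBumpAllocator {
 public:
  CheckedBumpAllocator(void* base, size_t capacity)
      : base_(static_cast<char*>(base)), capacity_(capacity), offset_(0) {}

  // Reserves bytes at the given alignment. Returns false, leaving the
  // allocator unchanged, if the request does not fit in the workspace.
  bool Allocate(size_t bytes, size_t alignment, void** out) {
    const size_t start = AlignUp(offset_, alignment);
    if (start > capacity_ || bytes > capacity_ - start) return false;
    *out = base_ + start;
    offset_ = start + bytes;
    return true;
  }

  size_t used() const { return offset_; }

 private:
  char* base_;
  size_t capacity_;
  size_t offset_;
};
```

With something along these lines, a caller could size the float workspace from the query before planning, and the planner could return an error when `Allocate` fails instead of writing past the end of the buffer.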