
Reuse CUDA quant output buffers instead of fresh allocation #34

@SaschaOnTour

Description

Problem / Motivation

In cuda/quantize.rs:116-117, both the packed_flat and scales_flat tensors are freshly allocated on every quantize call. This is the same issue as #33, but on the quantization path.

Solution

Pre-allocate scratch buffers for the quantization output and reuse them across calls.
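The reuse pattern can be sketched in plain Rust. This is an illustrative assumption, not the turboquant API: the real code operates on CUDA tensors, and `QuantScratch`, `quantize_into`, and the per-group int8 scheme below are hypothetical stand-ins for `packed_flat`/`scales_flat`.

```rust
/// Hypothetical scratch holder; the real buffers would be device tensors.
struct QuantScratch {
    packed_flat: Vec<u8>,  // reused quantized-output buffer
    scales_flat: Vec<f32>, // reused per-group scale buffer
}

impl QuantScratch {
    fn new() -> Self {
        Self { packed_flat: Vec::new(), scales_flat: Vec::new() }
    }

    /// Quantize `input` in groups of `group` values, writing into the
    /// reused buffers instead of allocating fresh ones per call.
    fn quantize_into(&mut self, input: &[f32], group: usize) -> (&[u8], &[f32]) {
        let n_groups = input.len() / group;
        // `resize` only allocates when capacity grows; steady-state
        // calls hit the already-allocated buffers.
        self.packed_flat.resize(input.len(), 0);
        self.scales_flat.resize(n_groups, 0.0);
        for g in 0..n_groups {
            let chunk = &input[g * group..(g + 1) * group];
            let max = chunk.iter().fold(0f32, |m, v| m.max(v.abs()));
            let scale = if max == 0.0 { 1.0 } else { max / 127.0 };
            self.scales_flat[g] = scale;
            for (i, v) in chunk.iter().enumerate() {
                self.packed_flat[g * group + i] = (v / scale).round() as i8 as u8;
            }
        }
        (&self.packed_flat, &self.scales_flat)
    }
}
```

The key design point is that the caller holds one `QuantScratch` for the lifetime of the cache, so after the first call no allocation happens on the hot path, matching the "no Tensor::zeros in the quant hot path" criterion below.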

Key files

  • turboquant/src/cache/cuda/quantize.rs:116-117 — current fresh allocations

Acceptance criteria

  • No Tensor::zeros in the quant hot path
  • Scratch buffers allocated once, reused
  • All tests pass: cargo nextest run --features cuda
  • cargo fmt --check clean
