
Status of prototype features #1807

Open
@msaroufim

Description

I was parsing through our prototype folder and wanted to give my take on what should be promoted, what should be deleted, and what requires further discussion.

  • spinquant, awq, autoround, hqq: All of these are algorithm implementations, and if we are convinced they are correct and pass reference checks relative to the original repos, we should promote them out of prototype. In particular, the benefits we should lean into are accelerated performance using torch.compile and serialization support with the HF hub (see the quantization sketch after this list) @jerryzh168
  • DORA: This technique didn't end up picking up as much traction as the low-bit optimizers. The idea of low rank as a form of compression is interesting, but I would lean toward deleting this, although I could be convinced that keeping some latent-space optimizers is relevant considering MLA is so hot right now. Delete DORA #1815
  • Profiler: This never worked with torch.compile, so it has limited utility for us unless @jeromeku can fix it; if not, we can delete it
  • float8nocompile: This doesn't feel like it should be a prototype feature, and I'd like to hear some detail on the promotion plan @danielvegamyhre
  • common: This folder probably shouldn't exist
  • quantized_training: This should remain in prototype because it primarily targets older or consumer GPUs; concretely, as our focus moves to Blackwell there won't be much of a difference between dtypes for inference vs. training. That said, stochastic rounding should be promoted as a utility (see the stochastic-rounding sketch after this list)
  • low_bit_optim: This is great work; we should promote it out of prototype (see the optimizer sketch after this list)
  • split_k: This just solves a very narrow problem, so it should be deleted in favor of using inductor's matmul templates. It was meant more as an example of how to ship Triton kernels with ao, which, well, isn't hard because it's all JIT. Remove split_k kernel #1816
  • mx_formats: With Blackwell out, this should be promoted out of prototype
  • Sparsity: 2:4 sparsity for inference should be moved out of prototype; it's likely going to continue being relevant, especially for future Flash Attention implementations
  • Kernel: This is a kernel autotuner; it should be deleted since we can just rely on inductor's max-autotune mechanism
  • dtypes: This is mostly for BitNet support; it should either be deleted or refactored into quantized_training
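
To make the spinquant/awq/autoround/hqq point concrete, here is roughly the user-facing flow I have in mind after promotion. This is a minimal sketch using the existing quantize_ entry point; int8_weight_only here is just a stand-in for whichever promoted algorithm configs we land on:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().bfloat16()

# in-place weight-only quantization; a promoted spinquant/awq/hqq config would
# ideally slot into this same entry point
quantize_(model, int8_weight_only())

# the quantized state_dict serializes with plain torch.save; the HF hub story
# would build on top of this
torch.save(model.state_dict(), "quantized_checkpoint.pt")

# torch.compile is where the accelerated inference comes from
model = torch.compile(model, mode="max-autotune")

x = torch.randn(8, 1024, device="cuda", dtype=torch.bfloat16)
out = model(x)
```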

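
On the stochastic rounding point in quantized_training: here is a hypothetical standalone sketch of what the promoted utility could look like (not the existing torchao function), just to show why it is useful as a general training utility — rounding fp32 to bf16 with zero error in expectation:

```python
import torch

def stochastic_round_to_bf16(x: torch.Tensor) -> torch.Tensor:
    # bf16 is the top 16 bits of the fp32 bit pattern; adding a random value to
    # the lower 16 bits before truncating implements stochastic rounding
    assert x.dtype == torch.float32
    bits = x.view(torch.int32)
    noise = torch.randint(0, 1 << 16, x.shape, device=x.device, dtype=torch.int32)
    rounded = (bits + noise) & ~0xFFFF
    return rounded.view(torch.float32).to(torch.bfloat16)
```
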
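
And for low_bit_optim, part of why promotion is attractive is that it is already essentially a drop-in optimizer swap. A minimal sketch, assuming the current prototype import path (which would change once promoted):

```python
import torch
from torchao.prototype.low_bit_optim import AdamW8bit

model = torch.nn.Linear(1024, 1024).cuda().bfloat16()
optim = AdamW8bit(model.parameters(), lr=1e-3)  # drop-in for torch.optim.AdamW

x = torch.randn(8, 1024, device="cuda", dtype=torch.bfloat16)
model(x).sum().backward()
optim.step()
optim.zero_grad()
```
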
I'd love to hear more from folks, especially if you disagree with anything!

cc @supriyar @jerryzh168 @drisspg @vkuzo @gau-nernst
