A collection of quarterly technology digests tracking the latest research and developments in AI optimization and inference.
Each digest covers model compression and inference optimization, including quantization, pruning, sparse attention, KV cache optimization, and related techniques for making AI models faster and more efficient.
AI research is expanding rapidly, and identifying notable results in the constant stream of new papers is increasingly difficult. While we strive to track state-of-the-art approaches, we cannot cover everything. We share what we've learned and welcome your input – if you spot something interesting, please open a pull request or let us know!