Stars
DeepEP: an efficient expert-parallel communication library
A high-throughput and memory-efficient inference and serving engine for LLMs
A highly capable, lightweight 2.4B LLM pre-trained on only 1T tokens of data, with all training details disclosed.
Fast and memory-efficient exact attention
Development repository for the Triton language and compiler
[NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLM inference, attention is computed with approximate, dynamic sparsity, which reduces pre-filling inference latency by up to 10x on an …
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
📖 A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉
ModelScope: bringing the notion of Model-as-a-Service to life.
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
Awesome LLM compression research papers and tools.
QAQ: Quality Adaptive Quantization for LLM KV Cache
📰 Must-read papers and blogs on Speculative Decoding ⚡️
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.
A scalable and robust tree-based speculative decoding algorithm.
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
AISystem mainly covers AI systems, including AI chips, AI compilers, AI inference and training frameworks, and other full-stack, low-level AI technologies.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
The first web service built with Fabric-sdk-go, including a chaincode service.
Robot vision and mobile robotics: VS-SLAM, ORB-SLAM2, deep-learning object detection with yolov3, action detection, OpenCV, PCL, machine learning, and autonomous driving.
Papers for deep neural network compression and acceleration
Kuboard is a microservice management UI for Kubernetes. It also provides free Kubernetes tutorials in Chinese, including a getting-started guide, an installation manual for the latest Kubernetes v1.23.4 (k8s install), and online Q&A, all continuously updated.
Strategies for Pre-training Graph Neural Networks
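Several entries above (EAGLE, Medusa, Lookahead Decoding, and the tree-based speculative decoder) are variants of one idea: a cheap draft model proposes several tokens, and the target model verifies them in a single pass. Below is a minimal greedy draft-then-verify sketch of that idea, not the implementation of any listed project; `draft_next` and `target_logits` are hypothetical stand-ins for real models.

```python
# Minimal sketch of greedy draft-then-verify speculative decoding.
# `draft_next` and `target_logits` are hypothetical stand-ins for real models.
from typing import Callable, List

def speculative_step(
    tokens: List[int],
    draft_next: Callable[[List[int]], int],                   # cheap model: next-token guess
    target_logits: Callable[[List[int]], List[List[float]]],  # big model: logits per position
    k: int = 4,                                               # draft tokens per step
) -> List[int]:
    # 1) Draft: the cheap model proposes k tokens autoregressively.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Verify: one target-model pass scores every drafted position at once.
    logits = target_logits(tokens + draft)  # logits[i] predicts token i+1

    # 3) Accept the longest prefix where the target's greedy choice agrees
    #    with the draft; on the first disagreement, take the target's token.
    accepted = list(tokens)
    for i, t in enumerate(draft):
        pos = len(tokens) + i - 1  # logits index that predicts draft[i]
        best = max(range(len(logits[pos])), key=logits[pos].__getitem__)
        if best != t:
            accepted.append(best)  # target's correction ends the step
            return accepted
        accepted.append(t)

    # All k drafts accepted: take one bonus token from the target for free.
    pos = len(accepted) - 1
    accepted.append(max(range(len(logits[pos])), key=logits[pos].__getitem__))
    return accepted
```

Each call advances the sequence by up to k+1 tokens (all drafts accepted plus one bonus token) or stops at the first disagreement, so the acceptance rate of the draft model is what drives the end-to-end speedup.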
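The quantization entries above (QServe's W4A8KV4, QAQ) both compress the KV cache to low-bit integers so long contexts fit in GPU memory. The sketch below shows per-token asymmetric 4-bit quantization of a plain NumPy KV tensor, only to illustrate the rough idea; the real systems add quality-adaptive bit allocation, outlier handling, packed storage, and fused kernels.

```python
# Minimal per-token asymmetric 4-bit KV-cache quantization sketch (NumPy only).
# Illustrative, not taken from QServe or QAQ; codes occupy a uint8 container
# here, whereas real kernels pack two 4-bit values per byte.
import numpy as np

def quantize_kv_u4(kv: np.ndarray):
    """kv: (tokens, head_dim) float32 -> 4-bit codes plus per-token scale/offset."""
    lo = kv.min(axis=-1, keepdims=True)
    hi = kv.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-8  # 4 bits -> 16 levels; epsilon avoids div-by-zero
    codes = np.clip(np.round((kv - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequantize_kv_u4(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

kv = np.random.randn(8, 64).astype(np.float32)
codes, scale, lo = quantize_kv_u4(kv)
err = np.abs(dequantize_kv_u4(codes, scale, lo) - kv).max()
print(f"max abs reconstruction error: {err:.4f}")
```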