DeepEP: an efficient expert-parallel communication library

Cuda 7,289 669 Updated Mar 18, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 42,529 6,444 Updated Mar 25, 2025

A highly capable 2.4B lightweight LLM using only 1T pre-training data with all details.

Python 165 12 Updated Mar 20, 2025

Fast and memory-efficient exact attention

Python 16,494 1,561 Updated Mar 25, 2025

Development repository for the Triton language and compiler

MLIR 14,968 1,884 Updated Mar 25, 2025

[NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLM inference, computes attention with approximate and dynamic sparsity, which reduces inference latency by up to 10x for pre-filling on an …

Python 944 47 Updated Feb 25, 2025

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Cuda 260 28 Updated Nov 22, 2024

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 2,911 186 Updated Mar 24, 2025
Python 39 1 Updated Nov 25, 2024

📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉

3,708 261 Updated Mar 4, 2025

ModelScope: bring the notion of Model-as-a-Service to life.

Python 7,599 781 Updated Mar 24, 2025

[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

C++ 614 40 Updated Mar 6, 2025

Awesome LLM compression research papers and tools.

1,433 92 Updated Mar 24, 2025

QAQ: Quality Adaptive Quantization for LLM KV Cache

Python 47 7 Updated Mar 27, 2024

📰 Must-read papers and blogs on Speculative Decoding ⚡️

656 32 Updated Mar 21, 2025

Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.

Python 1,090 118 Updated Mar 23, 2025

A scalable and robust tree-based speculative decoding algorithm

Python 339 38 Updated Jan 28, 2025

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

Jupyter Notebook 2,469 173 Updated Jun 25, 2024

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Python 1,219 73 Updated Mar 6, 2025

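AISystem mainly refers to AI systems, covering full-stack low-level AI technologies such as AI chips, AI compilers, and AI inference and training frameworks.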

Jupyter Notebook 12,986 1,869 Updated Mar 1, 2025

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Python 88,213 23,673 Updated Mar 25, 2025

Simplify your onnx model

C++ 4,007 393 Updated Sep 3, 2024

MegCC is a deep learning model compiler with an ultra-lightweight runtime that is efficient and easy to port.

C++ 482 59 Updated Oct 23, 2024

A first web service developed with Fabric-sdk-go, including a chaincode service

Go 1 Updated Jun 19, 2022

Single-cycle CPU in Verilog HDL (supports 10 instructions)

Verilog 5 2 Updated Jun 10, 2022
Go 1 1 Updated Oct 31, 2021

Robot vision, mobile robots, VS-SLAM, ORB-SLAM2, deep learning object detection, yolov3, behavior detection, opencv, PCL, machine learning, autonomous driving

C++ 8,209 2,794 Updated Jul 9, 2024

Papers for deep neural network compression and acceleration

397 80 Updated Jun 21, 2021

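Kuboard is a microservice management UI based on Kubernetes. It also provides a free Chinese-language Kubernetes tutorial, a getting-started guide, an installation manual for the latest Kubernetes v1.23.4 (k8s install), and online Q&A, and is continuously updated.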

JavaScript 23,226 1,552 Updated Mar 22, 2025

Strategies for Pre-training Graph Neural Networks

Python 992 165 Updated Jul 29, 2023