2025 Audio
Title & Authors Areas Tags Links
Arxiv
EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs
Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu, Jingjing Chen
Area Area Cost Paper
Arxiv Star
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
Area Area Area Area Paper
GitHub
Arxiv Star
Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality
Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik
Area Area Area Area Paper
GitHub
Publish Star
ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
Dongchao Yang, Songxiang Liu, Haohan Guo, Jiankun Zhao, Yuanyuan Wang, Helin Wang, Zeqian Ju, Xubo Liu, Xueyuan Chen, Xu Tan, Xixin Wu, Helen Meng
Area Type
Cost
Paper
GitHub
Publish Star
Token Pruning in Audio-Transformers: Optimizing Performance and Decoding Patch Importance
Taehan Lee, Hyukjun Lee
Area Type
Cost
Paper
GitHub
Model
Arxiv Star
Qwen2.5-Omni Technical Report
Qwen Team
Area Area Area Type
Cost
Paper
GitHub
Model
Publish Star
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro
Area Type
Cost
Paper
GitHub
Arxiv
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
Umberto Cappellazzo, Minsu Kim, Stavros Petridis
Area Type
Cost
Paper
Arxiv Star
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen
Area Type
Cost
Paper
GitHub
Model
Arxiv Star
LUCY: Linguistic Understanding and Control Yielding Early Stage of Her
Heting Gao, Hang Shao, Xiong Wang, Chaofan Qiu, Yunhang Shen, Siqi Cai, Yuchen Shi, Zihan Xu, Zuwei Long, Yike Zhang, Shaoqi Dong, Chaoyou Fu, Ke Li, Long Ma, Xing Sun
Area Type
Cost
Paper
GitHub
Model
Arxiv Star
OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia
ASLP@NPU
Area Type
Cost
Paper
GitHub
Model
2024 Audio
Title & Authors Areas Tags Links
Arxiv Star
Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models
Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, Kasima Tharnpipitchai
Area Type
Cost
Paper
GitHub
Model
Publish
SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
Yueqian Lin, Yuzhe Fu, Jingyang Zhang, Yudong Liu, Jianyi Zhang, Jingwei Sun, Hai "Helen" Li, Yiran Chen
Area Type Type
Cost
Paper
Publish Star
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia
Area Area Area Type
Cost
Paper
GitHub
Model
Dataset
Publish Star
Large Language Models are Strong Audio-Visual Speech Recognition Learners
Umberto Cappellazzo, Minsu Kim, Honglie Chen, Pingchuan Ma, Stavros Petridis, Daniele Falavigna, Alessio Brutti, Maja Pantic
Area Type
Cost
Paper
GitHub
Publish Star
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng
Area Type
Cost
Paper
GitHub
Model
Dataset
Arxiv Star
Qwen2-Audio Technical Report
Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou
Area Type
Cost
Paper
GitHub
Model
Publish Star
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang
Area Area Area Type
Cost
Paper
GitHub
Model
Publish
Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding
Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang
Area Type
Cost
Paper
Publish Star
FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation
Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani
Area Type
Cost
Paper
GitHub
Arxiv Star
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
Area Area Type
Cost
Paper
GitHub
Model
Publish
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li
Area Type
Cost
Paper
Arxiv
SpeechVerse: A Large-scale Generalizable Audio Language Model
AWS AI Team
Area Type
Cost
Paper
Arxiv
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen
Area Type
Cost
Paper
2023 Audio
Title & Authors Areas Tags Links
Publish Star
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
Area Type
Cost
Paper
GitHub
Model
Publish
Connecting Speech Encoder and Large Language Model for ASR
Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
Area Type
Cost
Paper
Publish
Prompting Large Language Models with Speech Recognition Abilities
Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
Area Type
Cost
Paper
Publish
Accelerating Transducers through Adjacent Token Merging
Yuang Li, Yu Wu, Jinyu Li, Shujie Liu
Area Type
Cost
Paper
Publish Star
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang, Xin Li, Lidong Bing
Area Area Area Type
Cost
Paper
GitHub
Model
2022 Audio
Title & Authors Areas Tags Links
Publish Star
HTS-AT: A Hierarchical Token-Semantic Audio-Transformer for Sound Classification and Detection
Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov
Area Type
Cost
Paper
GitHub