2025 Audio
2024 Audio
| Title & Authors | Areas | Tags | Links |
|---|---|---|---|
| Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models<br>Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, Kasima Tharnpipitchai | | | Paper, GitHub, Model |
| SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval<br>Yueqian Lin, Yuzhe Fu, Jingyang Zhang, Yudong Liu, Jianyi Zhang, Jingwei Sun, Hai "Helen" Li, Yiran Chen | | | Paper |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition<br>Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia | | | Paper, GitHub, Model, Dataset |
| Large Language Models are Strong Audio-Visual Speech Recognition Learners<br>Umberto Cappellazzo, Minsu Kim, Honglie Chen, Pingchuan Ma, Stavros Petridis, Daniele Falavigna, Alessio Brutti, Maja Pantic | | | Paper, GitHub |
| LLaMA-Omni: Seamless Speech Interaction with Large Language Models<br>Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng | | | Paper, GitHub, Model, Dataset |
| Qwen2-Audio Technical Report<br>Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou | | | Paper, GitHub, Model |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models<br>Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang | | | Paper, GitHub, Model |
| Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding<br>Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang | | | Paper |
| FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation<br>Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani | | | Paper, GitHub |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs<br>Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing | | | Paper, GitHub, Model |
| Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model<br>Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li | | | Paper |
| SpeechVerse: A Large-scale Generalizable Audio Language Model<br>AWS AI Team | | | Paper |
| An Embarrassingly Simple Approach for LLM with Strong ASR Capacity<br>Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen | | | Paper |
2023 Audio
| Title & Authors | Areas | Tags | Links |
|---|---|---|---|
| SALMONN: Towards Generic Hearing Abilities for Large Language Models<br>Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang | | | Paper, GitHub, Model |
| Connecting Speech Encoder and Large Language Model for ASR<br>Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang | | | Paper |
| Prompting Large Language Models with Speech Recognition Abilities<br>Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer | | | Paper |
| Accelerating Transducers through Adjacent Token Merging<br>Yuang Li, Yu Wu, Jinyu Li, Shujie Liu | | | Paper |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding<br>Hang Zhang, Xin Li, Lidong Bing | | | Paper, GitHub, Model |
2022 Audio
| Title & Authors | Areas | Tags | Links |
|---|---|---|---|
| HTS-AT: A Hierarchical Token-Semantic Audio-Transformer for Sound Classification and Detection<br>Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov | | | Paper, GitHub |