
Updated on 2025.10.24

Usage instructions: here

Table of Contents

Text to Speech
Text to Audio
Video to Audio
Voice Conversion
Video Generation
Image Generation
Music Generation
Audio Codec
Large Audio Language Model

Text to Speech

Publish Date	Title	Authors	PDF	Code
2025-10-22	Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent	Yangshijie Zhang et.al.	2510.19641	null
2025-10-22	Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment	Maureen de Seyssel et.al.	2510.19509	null
2025-10-22	EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection	Tong Zhang et.al.	2510.19414	null
2025-10-21	StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction	Qianheng Xu et.al.	2510.18938	null
2025-10-21	KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers	Mohd Ruhul Ameen et.al.	2510.18355	null
2025-10-21	ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation	Haowei Lou et.al.	2510.18308	null
2025-10-19	U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation	Xusheng Yang et.al.	2510.16718	null
2025-10-18	Edge-Based Speech Transcription and Synthesis for Kinyarwanda and Swahili Languages	Pacome Simon Mbonimpa et.al.	2510.16497	null
2025-10-18	TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model	Bin Yu et.al.	2510.16449	null
2025-10-22	VoiceMorph: How AI Voice Morphing Reveals the Boundaries of Auditory Self-Recognition	Kye Shimizu et.al.	2510.16192	null
2025-10-17	High order Tensor-Train-Based Schemes for High-Dimensional Mean Field Games	Elisabetta Carlini et.al.	2510.15603	null
2025-10-16	Hints for dynamical dark energy from warm inflation	Anupama B et.al.	2510.15051	null
2025-10-16	Improving Cybercrime Detection and Digital Forensics Investigations with Artificial Intelligence	Silvia Lucia Sanna et.al.	2510.14638	null
2025-10-16	RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF	Qing Yang et.al.	2510.14628	null
2025-10-16	The tt-structure for the quantum cohomology of complex Grassmannian*	Tadashi Udagawa et.al.	2510.14483	null
2025-10-20	Radiation pressure and equation of state are important in the envelope unbinding process in common envelope evolution	Zhuo Chen et.al.	2510.14173	null
2025-10-15	Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling	Peng Kuang et.al.	2510.13918	null
2025-10-15	Generative Universal Verifier as Multimodal Meta-Reasoner	Xinchen Zhang et.al.	2510.13804	null
2025-10-15	InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue	Wenwen Tong et.al.	2510.13747	null
2025-10-15	Closing the Gap Between Text and Speech Understanding in LLMs	Santiago Cuervo et.al.	2510.13632	null
2025-10-15	Functional tensor train neural network for solving high-dimensional PDEs	Yani Feng et.al.	2510.13386	null
2025-10-15	Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models	Yizhou Peng et.al.	2510.13293	null
2025-10-15	StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation	Xi Chen et.al.	2510.13194	null
2025-10-14	Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs	Xinlu He et.al.	2510.12995	null
2025-10-14	Toward First-Principles Multi-Messenger Predictions: Coupling Nuclear Networks with GR Radiation-MHD in Gmunu	Patrick Chi-Kit Cheong et.al.	2510.12978	null
2025-10-14	Content Anonymization for Privacy in Long-form Audio	Cristina Aggazzotti et.al.	2510.12780	null
2025-10-14	TerraCodec: Compressing Earth Observations	Julen Costa-Watanabe et.al.	2510.12670	null
2025-10-14	Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation	Greta Damo et.al.	2510.12316	null
2025-10-14	DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation	Yakun Song et.al.	2510.12210	null
2025-10-13	Actor-Enriched Time Series Forecasting of Process Performance	Aurelie Leribaux et.al.	2510.11856	null
2025-10-13	BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis	Jingyuan Xing et.al.	2510.11646	null
2025-10-13	Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker	Cheng Gong et.al.	2510.11124	null
2025-10-14	ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis	Mohammad Javad Ranjbar Kalahroodi et.al.	2510.10774	null
2025-10-14	MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations	Wenxiang Guo et.al.	2510.10396	null
2025-10-11	Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey	Jiaqi Wei et.al.	2510.09988	null
2025-10-10	Tensor-based compression of the sea temperature data	Ilya Kosolapov et.al.	2510.09778	null
2025-10-10	Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models	Donghang Wu et.al.	2510.09592	null
2025-10-10	A family of non-simple surfaces whose transport twistor spaces admit global blow-down maps	François Monard et.al.	2510.09518	null
2025-10-10	O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion	Huu Tuong Tu et.al.	2510.09061	null
2025-10-10	DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment	Zongcai Du et.al.	2510.09016	null
2025-10-09	Theoretical Analysis of Topotomography Using Small Intragranular Strain Approximations	Zheheng Liu et.al.	2510.08712	null
2025-10-09	DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching	Hanke Xie et.al.	2510.08373	null
2025-10-09	Structured covariance estimation via tensor-train decomposition	Artsiom Patarusau et.al.	2510.08174	null
2025-10-09	IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation	Wei Wang et.al.	2510.07979	null
2025-10-09	VoiceAgentBench: Are Voice Assistants ready for agentic tasks?	Dhruv Jain et.al.	2510.07978	null
2025-10-09	Self-Improving LLM Agents at Test-Time	Emre Can Acikgoz et.al.	2510.07841	null
2025-10-09	From Noisy to Native: LLM-driven Graph Restoration for Test-Time Graph Domain Adaptation	Xiangwei Lv et.al.	2510.07762	null
2025-10-09	Parallel Test-Time Scaling for Latent Reasoning Models	Runyang You et.al.	2510.07745	null
2025-10-08	AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding	Shuqing Luo et.al.	2510.07486	null
2025-10-08	Gauge Dependence of Scalar-Induced Gravitational Waves from Isocurvature Perturbations: Analytical Results	Arshad Ali et.al.	2510.07252	null

(back to top)

Text to Audio

Publish Date	Title	Authors	PDF	Code
2025-10-22	Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models	Xiaozhen Qiao et.al.	2510.19802	null
2025-10-16	Visible Imaging of Incoherent 1200-nm Light via Triplet–Triplet Annihilation Upconversion	Pournima Narayanan et.al.	2510.15184	null
2025-10-16	SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation	Jihyun Yu et.al.	2510.14634	null
2025-10-16	AudioEval: Automatic Dual-Perspective and Multi-Dimensional Evaluation of Text-to-Audio-Generation	Hui Wang et.al.	2510.14570	null
2025-10-15	UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE	Zhenyu Liu et.al.	2510.13344	null
2025-10-15	DP-TTA: Test-time Adaptation for Transient Electromagnetic Signal Denoising via Dictionary-driven Prior Regularization	Meng Yang et.al.	2510.13160	null
2025-10-14	Controllable Collision Scenario Generation via Collision Pattern Prediction	Pin-Lun Chen et.al.	2510.12206	null
2025-10-14	Audio Palette: A Diffusion Transformer with Multi-Signal Conditioning for Controllable Foley Synthesis	Junnuo Wang et.al.	2510.12175	null
2025-10-13	UALM: Unified Audio Language Model for Understanding, Generation and Reasoning	Jinchuan Tian et.al.	2510.12000	null
2025-10-13	Efficient Edge Test-Time Adaptation via Latent Feature Coordinate Correction	Xinyu Luo et.al.	2510.11068	null
2025-10-17	MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations	Wenxiang Guo et.al.	2510.10396	null
2025-10-10	MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation	Akira Takahashi et.al.	2510.09065	null
2025-10-10	ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling	Yuxuan Jiang et.al.	2510.08878	null
2025-10-13	Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation	Liyang Chen et.al.	2510.08078	null
2025-10-09	IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries	Harsh Kavediya et.al.	2510.07837	null
2025-10-08	HARP-NeXt: High-Speed and Accurate Range-Point Fusion Network for 3D LiDAR Semantic Segmentation	Samir Abou Haidar et.al.	2510.06876	null
2025-10-07	FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders	Riccardo Fosco Gramaccioni et.al.	2510.05829	null
2025-10-07	StereoSync: Spatially-Aware Stereo Audio Generation from Video	Christian Marinoni et.al.	2510.05828	null
2025-10-07	NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering	Alexander Murphy et.al.	2510.05635	null
2025-10-07	LATTA: Langevin-Anchored Test-Time Adaptation for Enhanced Robustness and Stability	Harshil Vejendla et.al.	2510.05530	null
2025-10-06	Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers	Juncheng Wang et.al.	2510.04577	null
2025-10-05	Pitch-Conditioned Instrument Sound Synthesis From an Interactive Timbre Latent Space	Christian Limberg et.al.	2510.04339	null
2025-10-05	The best performance in the CARE 2025 – Liver Task (LiSeg-Contrast): Contrast-Aware Semi-Supervised Segmentation with Domain Generalization and Test-Time Adaptation	Jincan Lou et.al.	2510.04243	null
2025-10-04	AI-Assisted Pleural Effusion Volume Estimation from Contrast-Enhanced CT Images	Sanhita Basu et.al.	2510.03856	null
2025-10-03	SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos	Amir Dellali et.al.	2510.02916	null
2025-10-03	Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models	Lihua Zhou et.al.	2510.02750	null
2025-10-02	SoundReactor: Frame-level Online Video-to-Audio Generation	Koichi Saito et.al.	2510.02110	null
2025-09-30	To Remember, To Adapt, To Preempt: A Stable Continual Test-Time Adaptation Framework for Remote Physiological Measurement in Dynamic Domain Shifts	Shuyang Chu et.al.	2510.01282	null
2025-10-01	PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation	Yujia Xiao et.al.	2510.00485	null
2025-10-01	VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors	Atif Belal et.al.	2510.00458	null
2025-09-30	Post-Training Quantization for Audio Diffusion Transformers	Tanmay Khandelwal et.al.	2510.00313	null
2025-09-30	Video Object Segmentation-Aware Audio Generation	Ilpo Viertola et.al.	2509.26604	null
2025-09-30	MARS: Audio Generation via Multi-Channel Autoregression on Spectrograms	Eleonora Ristori et.al.	2509.26007	null
2025-09-30	Annotation-Efficient Active Test-Time Adaptation with Conformal Prediction	Tingyu Shi et.al.	2509.25692	null
2025-09-30	Charge Transfer States in Donor Acceptor Bulk Heterojunctions as Triplet Triplet Annihilation Sensitizer for Solid-State Photon Upconversion	Maciej Klein et.al.	2509.25679	null
2025-09-29	EMO-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition	Jiacheng Shi et.al.	2509.25495	null
2025-09-29	A Robust Multi-Scale Framework with Test-Time Adaptation for sEEG-Based Speech Decoding	Suli Wang et.al.	2509.24700	null
2025-09-29	When Audio Generators Become Good Listeners: Generative Features for Understanding Tasks	Zeyu Xie et.al.	2509.24635	null
2025-09-29	Training-Free Multimodal Guidance for Video to Audio Generation	Eleonora Grassucci et.al.	2509.24550	null
2025-10-01	An Agent-Based Framework for Automated Higher-Voice Harmony Generation	Nia D'Souza Ganapathy et.al.	2509.24463	null
2025-09-29	UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities	Xuenan Xu et.al.	2509.24391	null
2025-09-28	AudioMoG: Guiding Audio Generation with Mixture-of-Guidance	Junyou Wang et.al.	2509.23727	null
2025-09-26	TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses	Sahar Dastani et.al.	2509.22813	null
2025-09-25	Prompt-aware classifier free guidance for diffusion models	Xuanhao Zhang et.al.	2509.22728	null
2025-09-26	Text2Move: Text-to-moving sound generation via trajectory prediction and temporal alignment	Yunyi Liu et.al.	2509.21919	null
2025-09-25	AIBA: Attention-based Instrument Band Alignment for Text-to-Audio Diffusion	Junyoung Koh et.al.	2509.20891	null
2025-09-24	MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization	Jianxuan Yang et.al.	2509.19999	null
2025-09-25	MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model	The Hieu Pham et.al.	2509.19881	null
2025-09-24	SCORE: Scaling audio generation using Standardized COmposite REwards	Jaemin Jung et.al.	2509.19831	null
2025-09-23	SynSonic: Augmenting Sound Event Detection through Text-to-Audio Diffusion ControlNet and Effective Sample Filtering	Jiarui Hai et.al.	2509.18603	null

(back to top)

Video to Audio

Publish Date	Title	Authors	PDF	Code
2025-10-10	MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation	Akira Takahashi et.al.	2510.09065	null
2025-10-13	Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation	Liyang Chen et.al.	2510.08078	null
2025-10-09	IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries	Harsh Kavediya et.al.	2510.07837	null
2025-10-07	FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders	Riccardo Fosco Gramaccioni et.al.	2510.05829	null
2025-10-07	StereoSync: Spatially-Aware Stereo Audio Generation from Video	Christian Marinoni et.al.	2510.05828	null
2025-10-03	SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos	Amir Dellali et.al.	2510.02916	null
2025-10-02	SoundReactor: Frame-level Online Video-to-Audio Generation	Koichi Saito et.al.	2510.02110	null
2025-09-29	Training-Free Multimodal Guidance for Video to Audio Generation	Eleonora Grassucci et.al.	2509.24550	null
2025-09-28	AudioMoG: Guiding Audio Generation with Mixture-of-Guidance	Junyou Wang et.al.	2509.23727	null
2025-09-26	WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM	Changli Tang et.al.	2509.21990	null
2025-09-26	Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers	Jibin Song et.al.	2509.21893	null
2025-09-24	MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization	Jianxuan Yang et.al.	2509.19999	null
2025-10-05	StereoFoley: Object-Aware Stereo Audio Generation from Video	Tornike Karchkhadze et.al.	2509.18272	null
2025-09-19	Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech	Xinlei Niu et.al.	2509.15492	null
2025-09-19	RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes	Fang Li et.al.	2509.15123	null
2025-09-08	MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation	Xiaoran Yang et.al.	2509.06389	null
2025-09-05	Efficient Video-to-Audio Generation via Multiple Foundation Models Mapper	Gehui Chen et.al.	2509.04957	null
2025-08-23	HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation	Sizhe Shan et.al.	2508.16930	null
2025-08-19	InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing	Shaoshu Yang et.al.	2508.14033	null
2025-08-21	FoleySpace: Vision-Aligned Binaural Spatial Audio Generation	Lei Zhao et.al.	2508.12918	null
2025-08-14	LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters	Haomin Zhang et.al.	2508.11074	null
2025-08-12	Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization	Chaoqun Cui et.al.	2508.08550	null
2025-07-14	DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis	Wenjie Tian et.al.	2507.10109	null
2025-07-13	Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation	Yingshan Liang et.al.	2507.04959	null
2025-06-23	Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions	Vineet Kumar Rakesh et.al.	2507.02900	null
2025-07-03	Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation	Feizhen Huang et.al.	2507.02271	null
2025-06-23	IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech	Siyi Zhou et.al.	2506.21619	null
2025-06-28	ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing	Huadai Liu et.al.	2506.21448	null
2025-06-27	Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance	Akio Hayakawa et.al.	2506.20995	null
2025-06-24	Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation	Jun Wang et.al.	2506.19774	null
2025-06-13	ViSAGe: Video-to-Spatial Audio Generation	Jaeyeon Kim et.al.	2506.12199	null
2025-05-31	Length Aware Speech Translation for Video Dubbing	Harveen Singh Chadha et.al.	2506.00740	null
2025-05-26	Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks	Chang Liu et.al.	2505.20038	link
2025-05-22	SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet	Zhi Zhong et.al.	2505.16195	null
2025-05-30	TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis	Yu Zhang et.al.	2505.14910	link
2025-05-28	Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model	Yong Ren et.al.	2505.13062	null
2025-06-03	OmniAudio: Generating Spatial Audio from 360-Degree Video	Huadai Liu et.al.	2504.14906	link
2025-04-17	CAFA: a Controllable Automatic Foley Artist	Roi Benita et.al.	2504.06778	link

(back to top)

Voice Conversion

Publish Date	Title	Authors	PDF	Code
2025-10-22	VBx for End-to-End Neural and Clustering-based Diarization	Petr Pálka et.al.	2510.19572	null
2025-10-20	Fast Agnostic Learners in the Plane	Talya Eden et.al.	2510.18057	null
2025-10-20	Joint upper Banach density, VC dimensions and Euclidean point configurations	Bruno Predojević et.al.	2510.17453	null
2025-10-23	The Parameterized Complexity of Computing the VC-Dimension	Florent Foucaud et.al.	2510.17451	null
2025-10-18	Truly Subquadratic Time Algorithms for Diameter and Related Problems in Graphs of Bounded VC-dimension	Timothy M. Chan et.al.	2510.16346	null
2025-10-22	VoiceMorph: How AI Voice Morphing Reveals the Boundaries of Auditory Self-Recognition	Kye Shimizu et.al.	2510.16192	null
2025-10-16	Deadlock-free routing for Full-mesh networks without using Virtual Channels	Alejandro Cano et.al.	2510.14730	null
2025-10-15	The VC-dimension and point configurations in $\mathbb{R}^d$	Alex Iosevich et.al.	2510.13984	null
2025-10-16	VC-Dimension vs Degree: An Uncertainty Principle for Boolean Functions	Fan Chang et.al.	2510.13705	null
2025-10-15	Model-assisted estimation for MRV: How to boost the economics of SOC sequestration projects without compromising on scientific integrity	Ahmad Awad et.al.	2510.13609	null
2025-10-15	Target Controllability Score	Kazuhiro Sato et.al.	2510.13354	null
2025-10-14	VCTR: A Transformer-Based Model for Non-parallel Voice Conversion	Maharnab Saikia et.al.	2510.12964	null
2025-10-15	(R)evolution of Programming: Vibe Coding as a Post-Coding Paradigm	Kevin Krings et.al.	2510.12364	null
2025-10-13	Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker	Cheng Gong et.al.	2510.11124	null
2025-10-13	VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents	Jiliang Hu et.al.	2510.11098	null
2025-10-10	A Scalable, Privacy-Preserving Decentralized Identity and Verifiable Data Sharing Framework based on Zero-Knowledge Proofs	Hui Yuan et.al.	2510.09715	null
2025-10-10	SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion	Zhao Guo et.al.	2510.09245	null
2025-10-10	O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion	Huu Tuong Tu et.al.	2510.09061	null
2025-10-09	MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows	Guobin Ma et.al.	2510.08392	null
2025-10-09	What Makes a Visualization Complex?	Mengdi Chu et.al.	2510.08332	null
2025-10-09	VoiceAgentBench: Are Voice Assistants ready for agentic tasks?	Dhruv Jain et.al.	2510.07978	null
2025-10-06	UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models	Wenhao Guan et.al.	2510.04593	null
2025-10-05	A Multilingual Framework for Dysarthria: Detection, Severity Classification, Speech-to-Text, and Clean Speech Generation	Ananya Raghu et.al.	2510.03986	null
2025-10-03	Online Learning in the Random Order Model	Martino Bernasconi et.al.	2510.02820	null
2025-10-02	Higher-arity PAC learning, VC dimension and packing lemma	Artem Chernikov et.al.	2510.02420	null
2025-09-30	BlockSDN-VC: A SDN-Based Virtual Coordinate-Enhanced Transaction Broadcast Framework for High-Performance Blockchains	Wenyang Jia et.al.	2510.00306	null
2025-09-29	MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech	Chengyao Wang et.al.	2509.25131	null
2025-10-02	Cofinal families of finite VC-dimension	Omer Ben-Neria et.al.	2509.24744	null
2025-09-29	VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning	Yixuan Zhou et.al.	2509.24650	null
2025-09-29	ISSE: An Instruction-Guided Speech Style Editing Dataset And Benchmark	Yun Chen et.al.	2509.24570	null
2025-09-29	Strong enhancement of d-wave superconductivity in an extended checkerboard Hubbard ladder	Xichen Huang et.al.	2509.24415	null
2025-09-26	ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection	Mohamed Maged et.al.	2509.22808	null
2025-09-26	Speaker Anonymisation for Speech-based Suicide Risk Detection	Ziyun Cui et.al.	2509.22148	null
2025-09-25	VC-Agent: An Interactive Agent for Customized Video Dataset Collection	Yidan Zhang et.al.	2509.21291	null
2025-09-24	Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation	Yang Cui et.al.	2509.19812	null
2025-09-22	Preconditioned Deformation Grids	Julian Kaltheuner et.al.	2509.18097	null
2025-09-21	MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances	Junhyeok Lee et.al.	2509.17143	null
2025-09-20	Advancing Reference-free Evaluation of Video Captions with Factual Analysis	Shubhashis Roy Dipta et.al.	2509.16538	null
2025-09-19	Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation	Qi Wang et.al.	2509.16010	null
2025-09-19	The Singing Voice Conversion Challenge 2025: From Singer Identity Conversion To Singing Style Conversion	Lester Phillip Violeta et.al.	2509.15629	null
2025-09-18	FCPE: A Fast Context-based Pitch Estimation Model	Yuxin Luo et.al.	2509.15140	null
2025-09-18	MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis	Keyu An et.al.	2509.14784	null
2025-09-20	Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis	Qingyu Liu et.al.	2509.14579	null
2025-09-17	VCBench: Benchmarking LLMs in Venture Capital	Rick Chen et.al.	2509.14448	null
2025-09-16	MSR-Codec: A Low-Bitrate Multi-Stream Residual Codec for High-Fidelity Speech Generation with Information Disentanglement	Jingyu Li et.al.	2509.13068	null
2025-09-16	A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis	Javeria Amir et.al.	2509.12831	null
2025-09-15	Preservation of Language Understanding Capabilities in Speech-aware Large Language Models	Marek Kubis et.al.	2509.12171	null
2025-09-14	Rate-Distortion Limits for Multimodal Retrieval: Theory, Optimal Codes, and Finite-Sample Guarantees	Thomas Y. Chen et.al.	2509.11054	null
2025-09-11	Altered Histories in Version Control System Repositories: Evidence from the Trenches	Solal Rapaport et.al.	2509.09294	null
2025-09-11	DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners	Xiaoxue Luo et.al.	2509.09201	null

(back to top)

Video Generation

Publish Date	Title	Authors	PDF	Code
2025-10-22	PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis	Qing Mao et.al.	2510.19527	null
2025-10-22	GigaBrain-0: A World Model-Powered Vision-Language-Action Model	GigaBrain Team et.al.	2510.19430	null
2025-10-22	Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks	Kai Zeng et.al.	2510.19195	null
2025-10-23	Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning	Takehiro Aoshima et.al.	2510.19193	null
2025-10-21	MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models	Aritra Bhowmik et.al.	2510.19022	null
2025-10-21	UltraGen: High-Resolution Video Generation with Hierarchical Attention	Teng Hu et.al.	2510.18775	null
2025-10-23	A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition	Peiqin Zhuang et.al.	2510.18705	null
2025-10-21	MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation	Weinan Jia et.al.	2510.18692	null
2025-10-21	Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model	Zhenxing Zhang et.al.	2510.18573	null
2025-10-22	FeatureFool: Zero-Query Fooling of Video Models via Feature Map	Duoxun Tang et.al.	2510.18362	null
2025-10-22	OmniNWM: Omniscient Driving Navigation World Models	Bohan Li et.al.	2510.18313	null
2025-10-20	World-in-World: World Models in a Closed-Loop World	Jiahan Zhang et.al.	2510.18135	null
2025-10-20	Demystifying Transition Matching: When and Why It Can Beat Flow Matching	Jaihoon Kim et.al.	2510.17991	null
2025-10-20	ConsistEdit: Highly Consistent and Precise Training-free Visual Editing	Zixin Yin et.al.	2510.17803	null
2025-10-22	MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models	Yongshun Zhang et.al.	2510.17519	null
2025-10-20	From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models	Zefan Cai et.al.	2510.17247	null
2025-10-19	An empirical study of the effect of video encoders on Temporal Video Grounding	Ignacio M. De la Jara et.al.	2510.17007	null
2025-10-19	From Mannequin to Human: A Pose-Aware and Identity-Preserving Video Generation Framework for Lifelike Clothing Display	Xiangyu Mu et.al.	2510.16833	null
2025-10-17	VISTA: A Test-Time Self-Improving Video Generation Agent	Do Xuan Long et.al.	2510.15831	null
2025-10-17	Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset	Qingyan Bai et.al.	2510.15742	null
2025-10-17	DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion	Weijie Wang et.al.	2510.15264	null
2025-10-16	TGT: Text-Grounded Trajectories for Locally Controlled Video Generation	Guofeng Zhang et.al.	2510.15104	null
2025-10-16	RealDPO: Real or Not Real, that is the Preference	Guo Cheng et.al.	2510.14955	null
2025-10-16	DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation	Yu Zhou et.al.	2510.14949	null
2025-10-16	3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation	JoungBin Lee et.al.	2510.14945	null
2025-10-16	ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints	Meiqi Wu et.al.	2510.14847	null
2025-10-16	In-Context Learning with Unpaired Clips for Instruction-based Video Editing	Xinyao Liao et.al.	2510.14648	null
2025-10-19	STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding	Zhifei Chen et.al.	2510.14588	null
2025-10-17	Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning	Xiangyu Meng et.al.	2510.14256	null
2025-10-16	Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization	Liao Shen et.al.	2510.14255	null
2025-10-16	Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures	Yuancheng Xu et.al.	2510.14179	null
2025-10-15	PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning	Sihui Ji et.al.	2510.13809	null
2025-10-15	CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas	Zian Li et.al.	2510.13669	null
2025-10-15	VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator	Hyojun Go et.al.	2510.13454	null
2025-10-15	Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation	Yi Zuo et.al.	2510.13084	null
2025-10-15	Counting Hallucinations in Diffusion Models	Shuai Fu et.al.	2510.13080	null
2025-10-14	SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models	Zhengxu Tang et.al.	2510.13042	null
2025-10-14	MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars	Felix Taubner et.al.	2510.12785	null
2025-10-14	Time-Correlated Video Bridge Matching	Viacheslav Vasilev et.al.	2510.12453	null
2025-10-14	Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding	Ye Chen et.al.	2510.12256	null
2025-10-14	BIGFix: Bidirectional Image Generation with Token Fixing	Victor Besnier et.al.	2510.12231	null
2025-10-14	Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback	Xingpei Ma et.al.	2510.12089	null
2025-10-14	VIDMP3: Video Editing by Representing Motion with Pose and Position Priors	Sandeep Mishra et.al.	2510.12069	null
2025-10-13	Point Prompting: Counterfactual Tracking with Video Diffusion Models	Ayush Shrivastava et.al.	2510.11715	null
2025-10-13	IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment	Yinan Chen et.al.	2510.11647	null
2025-10-13	MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps	Jiahui Lei et.al.	2510.11107	null
2025-10-12	AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes	Yu Li et.al.	2510.10670	null
2025-10-12	DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis	Peiyin Chen et.al.	2510.10650	null
2025-10-10	Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians	Jin-Chuan Shi et.al.	2510.09438	null
2025-10-10	Stable Video Infinity: Infinite-Length Video Generation with Error Recycling	Wuyang Li et.al.	2510.09212	null

(back to top)

Image Generation

Publish Date	Title	Authors	PDF	Code
2025-10-22	Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing	Yusu Qian et.al.	2510.19808	null
2025-10-22	The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models	Xiaofeng Zhang et.al.	2510.19557	null
2025-10-22	Predicting before Reconstruction: A generative prior framework for MRI acceleration	Juhyung Park et.al.	2510.19472	null
2025-10-22	D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation	Nobline Yoo et.al.	2510.19278	null
2025-10-21	DP$^2$O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution	Rongyuan Wu et.al.	2510.18851	null
2025-10-21	SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation	Siyong Jian et.al.	2510.18716	null
2025-10-21	UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation	Yibin Wang et.al.	2510.18701	null
2025-10-21	From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation	Ziwei Huang et.al.	2510.18263	null
2025-10-21	Beyond Frequency: Scoring-Driven Debiasing for Object Detection via Blueprint-Prompted Image Synthesis	Xinhao Cai et.al.	2510.18229	null
2025-10-22	Chimera: Compositional Image Generation using Part-based Concepting	Shivam Singh et.al.	2510.18083	null
2025-10-20	Fine-tuning Flow Matching Generative Models with Intermediate Feedback	Jiajun Fan et.al.	2510.18072	null
2025-10-20	Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models	Jiajun Fan et.al.	2510.18053	null
2025-10-20	Inference-Time Compute Scaling For Flow Matching	Adam Stecklov et.al.	2510.17786	null
2025-10-20	VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models	Qilin Liao et.al.	2510.17759	null
2025-10-21	PICABench: How Far Are We from Physically Realistic Image Editing?	Yuandong Pu et.al.	2510.17681	null
2025-10-21	CaMiT: A Time-Aware Car Model Dataset for Classification and Generation	Frédéric LIN et.al.	2510.17626	null
2025-10-20	Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling	Feihong Yan et.al.	2510.17171	null
2025-10-20	In-situ Autoguidance: Eliciting Self-Correction in Diffusion Models	Enhao Gu et.al.	2510.17136	null
2025-10-19	One-step Diffusion Models with Bregman Density Ratio Matching	Yuanzhi Zhu et.al.	2510.16983	null
2025-10-21	Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback	Zongjian Li et.al.	2510.16888	null
2025-10-19	Class-N-Diff: Classification-Induced Diffusion Model Can Make Fair Skin Cancer Diagnosis	Nusrat Munia et.al.	2510.16887	null
2025-10-19	Region in Context: Text-condition Image editing with Human-like semantic reasoning	Thuy Phuong Vu et.al.	2510.16772	null
2025-10-17	BLIP3o-NEXT: Next Frontier of Native Image Generation	Jiuhai Chen et.al.	2510.15857	null
2025-10-17	Controlling the image generation process with parametric activation functions	Ilia Pavlov et.al.	2510.15778	null
2025-10-17	NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation	Yitong Sun et.al.	2510.15752	null
2025-10-17	Unimedvl: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis	Junzhi Ning et.al.	2510.15710	null
2025-10-17	Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation	Xiaoming Zhu et.al.	2510.15564	null
2025-10-16	Salient Concept-Aware Generative Data Augmentation	Tianchen Zhao et.al.	2510.15194	null
2025-10-16	Constantly Improving Image Models Need Constantly Improving Benchmarks	Jiaxin Ge et.al.	2510.15021	link
2025-10-16	Coupled Diffusion Sampling for Training-Free Multi-View Image Editing	Hadi Alzayer et.al.	2510.14981	null
2025-10-16	Learning an Image Editing Model without Image Editing Pairs	Nupur Kumari et.al.	2510.14978	link
2025-10-16	WithAnyone: Towards Controllable and ID Consistent Image Generation	Hengyuan Xu et.al.	2510.14975	null
2025-10-16	ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention	Keli Liu et.al.	2510.14882	null
2025-10-16	FraQAT: Quantization Aware Training with Fractional bits	Luca Morreale et.al.	2510.14823	null
2025-10-16	In-Context Learning with Unpaired Clips for Instruction-based Video Editing	Xinyao Liao et.al.	2510.14648	null
2025-10-16	Adapting Self-Supervised Representations as a Latent Space for Efficient Generation	Ming Gui et.al.	2510.14630	null
2025-10-16	Consistent text-to-image generation via scene de-contextualization	Song Tang et.al.	2510.14553	null
2025-10-16	Exploring Image Representation with Decoupled Classical Visual Descriptors	Chenyuan Qu et.al.	2510.14536	null
2025-10-16	Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models	Yunze Tong et.al.	2510.14526	null
2025-10-15	Generative Universal Verifier as Multimodal Meta-Reasoner	Xinchen Zhang et.al.	2510.13804	null
2025-10-15	Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation	Yifu Luo et.al.	2510.13418	null
2025-10-15	End-to-End Multi-Modal Diffusion Mamba	Chunhao Lu et.al.	2510.13253	null
2025-10-15	Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation	Yi Zuo et.al.	2510.13084	null
2025-10-15	Counting Hallucinations in Diffusion Models	Shuai Fu et.al.	2510.13080	null
2025-10-14	UniFusion: Vision-Language Model as Unified Encoder in Image Generation	Kevin Li et.al.	2510.12789	null
2025-10-14	SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models	Weiyang Jin et.al.	2510.12784	null
2025-10-14	LayerSync: Self-aligning Intermediate Layers	Yasaman Haghighi et.al.	2510.12581	null
2025-10-14	AngularFuse: A Closer Look at Angle-based Perception for Spatial-Sensitive Multi-Modality Image Fusion	Xiaopeng Liu et.al.	2510.12260	null
2025-10-14	Local Background Features Matter in Out-of-Distribution Detection	Jinlun Ye et.al.	2510.12259	null
2025-10-14	FedMMKT: Co-Enhancing a Server Text-to-Image Model and Client Task Models in Multi-Modal Federated Learning	Ningxin He et.al.	2510.12254	null

(back to top)

Music Generation

Publish Date	Title	Authors	PDF	Code
2025-10-21	Steering Autoregressive Music Generation with Recursive Feature Machines	Daniel Zhao et.al.	2510.19127	null
2025-10-18	MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding	Jingyue Huang et.al.	2510.16273	null
2025-10-16	Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics?	Qixin Deng et.al.	2510.14249	null
2025-10-15	UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE	Zhenyu Liu et.al.	2510.13344	null
2025-10-17	MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations	Wenxiang Guo et.al.	2510.10396	null
2025-10-11	ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis	Stephen Ni-Hahn et.al.	2510.10249	null
2025-10-07	LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment	Jiahao Mei et.al.	2510.05875	null
2025-10-02	Bias beyond Borders: Global Inequalities in AI-Generated Music	Ahmet Solak et.al.	2510.01963	null
2025-10-15	SAGE-Music: Low-Latency Symbolic Music Generation via Attribute-Specialized Key-Value Head Sharing	Jiaye Tan et.al.	2510.00395	null
2025-10-04	HNote: Extending YNote with Hexadecimal Encoding for Fine-Tuning LLMs in Music Modeling	Hung-Ying Chu et.al.	2509.25694	null
2025-09-29	Ethics Statements in AI Music Papers: The Effective and the Ineffective	Julia Barnett et.al.	2509.25496	null
2025-09-29	Discovering "Words" in Music: Unsupervised Learning of Compositional Sparse Code for Symbolic Music	Tianle Wang et.al.	2509.24603	null
2025-10-01	An Agent-Based Framework for Automated Higher-Voice Harmony Generation	Nia D'Souza Ganapathy et.al.	2509.24463	null
2025-09-28	Time-Shifted Token Scheduling for Symbolic Music Generation	Ting-Kang Wang et.al.	2509.23749	null
2025-09-28	AudioMoG: Guiding Audio Generation with Mixture-of-Guidance	Junyou Wang et.al.	2509.23727	null
2025-09-27	AI-Assisted Music Production: A User Study on Text-to-Music Models	Francesca Ronchini et.al.	2509.23364	null
2025-09-26	Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach	Zijian Zhao et.al.	2509.22378	null
2025-09-26	MusicWeaver: Coherent Long-Range and Editable Music Generation from a Beat-Aligned Structural Plan	Xuanchen Wang et.al.	2509.21714	null
2025-09-21	Difficulty-Aware Score Generation for Piano Sight-Reading	Pedro Ramoneda et.al.	2509.16913	null
2025-09-17	Assessing Data Replication in Symbolic Music via Adapted Structural Similarity Index Measure	Shulei Ji et.al.	2509.13658	null
2025-09-13	A Traditional Approach to Symbolic Piano Continuation	Christian Zhou-Zheng et.al.	2509.12267	null
2025-09-14	Decoding Musical Origins: Distinguishing Human and AI Composers	Cheng-Yang Tsai et.al.	2509.11369	null
2025-09-14	STASE: A spatialized text-to-audio synthesis engine for music generation	Tutti Chi et.al.	2509.11124	null
2025-09-10	Segment Transformer: AI-Generated Music Detection via Music Structural Analysis	Yumin Kim et.al.	2509.08283	null
2025-09-09	Continuous Audio Language Models	Simon Rouard et.al.	2509.06926	null
2025-09-24	No Encore: Unlearning as Opt-Out in Music Generation	Jinju Kim et.al.	2509.06277	null
2025-09-07	UniVerse-1: Unified Audio-Video Generation via Stitching of Experts	Duomin Wang et.al.	2509.06155	null
2025-09-04	PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music	Hayeon Bang et.al.	2509.04215	null
2025-09-03	Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings	Dyah A. M. G. Wisnu et.al.	2509.03292	null
2025-09-01	The AudioMOS Challenge 2025	Wen-Chin Huang et.al.	2509.01336	null
2025-08-31	TinyMusician: On-Device Music Generation with Knowledge Distillation and Mixed Precision Quantization	Hainan Wang et.al.	2509.00914	null
2025-09-04	AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation	Gyehun Go et.al.	2509.00813	null
2025-08-31	The Name-Free Gap: Policy-Aware Stylistic Control in Music Generation	Ashwin Nagarajan et.al.	2509.00654	null
2025-08-24	A Survey on Evaluation Metrics for Music Generation	Faria Binte Kader et.al.	2509.00051	null
2025-08-28	Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music	Hongju Su et.al.	2508.20665	null
2025-08-27	The IRMA Dataset: A Structured Audio-MIDI Corpus for Iranian Classical Music	Sepideh Shafiei et.al.	2508.19876	null
2025-08-27	CompLex: Music Theory Lexicon Constructed by Autonomous Agents for Automatic Music Generation	Zhejing Hu et.al.	2508.19603	null
2025-08-08	MuSpike: A Benchmark and Evaluation Framework for Symbolic Music Generation with Spiking Neural Networks	Qian Liang et.al.	2508.19251	null
2025-08-12	QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems	Chien-Chun Wang et.al.	2508.08957	null
2025-08-12	Opening Musical Creativity? Embedded Ideologies in Generative-AI Music Systems	Liam Pram et.al.	2508.08805	null
2025-08-08	Live Music Models	Lyria Team et.al.	2508.04651	link
2025-08-03	Automatic Melody Reduction via Shortest Path Finding	Ziyu Wang et.al.	2508.01571	null
2025-07-31	DeformTune: A Deformable XAI Music Prototype for Non-Musicians	Ziqing Xu et.al.	2508.00160	null
2025-07-31	"I made this (sort of)": Negotiating authorship, confronting fraudulence, and exploring new musical spaces with prompt-based AI music generation	Bob L. T. Sturm et.al.	2507.23365	null
2025-07-28	Music Arena: Live Evaluation for Text-to-Music	Yonghyun Kim et.al.	2507.20900	null
2025-07-28	Controllable Video-to-Music Generation with Multiple Time-Varying Conditions	Junxian Wu et.al.	2507.20627	null
2025-07-27	Diffusion-based Symbolic Music Generation with Structured State Space Models	Shenghua Yuan et.al.	2507.20128	null
2025-08-07	SAMUeL: Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion	Hei Shing Cheung et.al.	2507.19991	null
2025-07-17	A new XML conversion process for mensural music encoding : CMME_to_MEI (via Verovio)	David Fiala et.al.	2507.15991	null
2025-07-17	WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling	Qihui Yang et.al.	2507.10534	null

(back to top)

Audio Codec

Publish Date	Title	Authors	PDF	Code
2025-10-19	SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization	Wenxi Chen et.al.	2510.16841	null
2025-10-19	U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation	Xusheng Yang et.al.	2510.16718	null
2025-10-17	LDCodec: A high quality neural audio codec with low-complexity decoder	Jiawei Jiang et.al.	2510.15364	null
2025-10-17	Extending Audio Context for Long-Form Understanding in Large Audio-Language Models	Yuatyong Chaichana et.al.	2510.15231	null
2025-10-17	LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models	Xiaohan Zhao et.al.	2510.15227	null
2025-10-16	TASLA: Text-Aligned Speech Tokens with Multiple Layer-Aggregation	Ming-Hao Hsu et.al.	2510.14934	null
2025-10-15	Acoustic Teleportation via Disentangled Neural Audio Codec Representations	Philipp Grundhuber et.al.	2510.13221	null
2025-10-13	UALM: Unified Audio Language Model for Understanding, Generation and Reasoning	Jinchuan Tian et.al.	2510.12000	null
2025-10-13	BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis	Jingyuan Xing et.al.	2510.11646	null
2025-10-12	FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec	Yurii Halychanskyi et.al.	2510.10785	null
2025-10-11	SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation	Zeyu Ling et.al.	2510.10069	null
2025-10-11	MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction	Jianjin Wang et.al.	2510.10003	null
2025-10-10	SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion	Zhao Guo et.al.	2510.09245	null
2025-10-08	AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs	Peize He et.al.	2510.07293	null
2025-10-07	Latent Speech-Text Transformer	Yen-Ju Lu et.al.	2510.06195	null
2025-10-07	EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS	Haoxun Li et.al.	2510.05758	null
2025-10-06	UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models	Wenhao Guan et.al.	2510.04593	null
2025-10-04	Désentrelacement Fréquentiel Doux pour les Codecs Audio Neuronaux	Benoît Giniès et.al.	2510.03741	null
2025-10-04	Soft Disentanglement in Frequency Bands for Neural Audio Codecs	Benoit Ginies et.al.	2510.03735	null
2025-10-02	High-Fidelity Speech Enhancement via Discrete Audio Tokens	Luca A. Lanzendörfer et.al.	2510.02187	null
2025-10-02	MelCap: A Unified Single-Codebook Neural Codec for High-Fidelity Audio Compression	Jingyi Li et.al.	2510.01903	null
2025-10-02	FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates	Jiaqi Li et.al.	2510.00981	null
2025-10-07	Baseline Systems For The 2025 Low-Resource Audio Codec Challenge	Yusuf Ziya Isik et.al.	2510.00264	null
2025-09-30	Scaling Spoken Language Models with Syllabic Speech Tokenization	Nicholas Lee et.al.	2509.26634	null
2025-09-30	Optimizing Speech Language Models for Acoustic Consistency	Morteza Rohanian et.al.	2509.26276	null
2025-09-29	MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech	Chengyao Wang et.al.	2509.25131	null
2025-09-29	VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning	Yixuan Zhou et.al.	2509.24650	null
2025-09-29	Assessing speech quality metrics for evaluation of neural audio codecs under clean speech conditions	Wolfgang Mack et.al.	2509.24457	null
2025-09-26	StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs	Yuhan Song et.al.	2509.22220	null
2025-09-26	Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling	Junjie Cao et.al.	2509.22062	null
2025-09-26	AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook	Yushen Chen et.al.	2509.21968	null
2025-09-25	X-Streamer: Unified Human World Modeling with Audiovisual Interaction	You Xie et.al.	2509.21574	null
2025-09-24	Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens	Ismail Rasim Ulgen et.al.	2509.20485	null
2025-09-25	From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training	Tianqiao Liu et.al.	2509.20072	null
2025-09-24	Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens	Pin-Jui Ku et.al.	2509.20060	null
2025-09-25	Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration	Yifan Yang et.al.	2509.19928	null
2025-09-24	Eliminating stability hallucinations in llm-based tts models via attention guidance	ShiMing Wang et.al.	2509.19852	null
2025-09-23	Improving Test-Time Performance of RVQ-based Neural Codecs	Hyeongju Kim et.al.	2509.19186	null
2025-09-23	Enhancing Noise Robustness for Neural Speech Codecs through Resource-Efficient Progressive Quantization Perturbation Simulation	Rui-Chen Zheng et.al.	2509.19025	null
2025-09-23	HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS	Sihang Nie et.al.	2509.19001	null
2025-09-23	Direct Preference Optimization for Speech Autoregressive Diffusion Models	Zhijun Liu et.al.	2509.18928	null
2025-09-23	Towards Evaluating Generative Audio: Insights from Neural Audio Codec Embedding Distances	Arijit Biswas et.al.	2509.18823	null
2025-09-22	Does Audio Matter for Modern Video-LLMs and Their Benchmarks?	Geewook Kim et.al.	2509.17901	null
2025-09-22	Qwen3-Omni Technical Report	Jin Xu et.al.	2509.17765	null
2025-09-21	MBCodec: Thorough disentangle for high-fidelity audio compression	Ruonan Zhang et.al.	2509.17006	null
2025-09-19	FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation	Luca Della Libera et.al.	2509.16195	null
2025-09-19	VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency	Nikita Torgashov et.al.	2509.15969	null
2025-09-18	A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication	Ryan Collette et.al.	2509.15462	null
2025-09-18	MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis	Keyu An et.al.	2509.14784	null
2025-09-17	A High-Quality and Low-Complexity Streamable Neural Speech Codec with Knowledge Distillation	En-Wei Zhang et.al.	2509.13670	null

(back to top)

Large Audio Language Model

Publish Date	Title	Authors	PDF	Code
2025-10-21	MMAO-Bench: MultiModal All in One Benchmark Reveals Compositional Law between Uni-modal and Omni-modal in OmniModels	Chen Chen et.al.	2510.18915	null
2025-10-20	Hearing Health in Home Healthcare: Leveraging LLMs for Illness Scoring and ALMs for Vocal Biomarker Extraction	Yu-Wen Chen et.al.	2510.18169	null
2025-10-20	SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering	Weilin Lin et.al.	2510.17633	null
2025-10-21	LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding	ZhaoYang Han et.al.	2510.17305	null
2025-10-22	OmniVIC: A Self-Improving Variable Impedance Controller with Vision-Language In-Context Learning for Safe Robotic Manipulation	Heng Zhang et.al.	2510.17150	null
2025-10-19	SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models	Chih-Kai Yang et.al.	2510.16917	null
2025-10-19	Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations	Bo-Han Feng et.al.	2510.16893	null
2025-10-19	The Augmented Lagrangian Methods: Overview and Recent Advances	Kangkang Deng et.al.	2510.16827	null
2025-10-17	OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM	Hanrong Ye et.al.	2510.15870	null
2025-10-17	Extending Audio Context for Long-Form Understanding in Large Audio-Language Models	Yuatyong Chaichana et.al.	2510.15231	null
2025-10-16	XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models	Xingrui Wang et.al.	2510.15148	null
2025-10-15	Yamaji effect in models of underdoped cuprates	Jing-Yu Zhao et.al.	2510.13943	null
2025-10-15	Generative Universal Verifier as Multimodal Meta-Reasoner	Xinchen Zhang et.al.	2510.13804	null
2025-10-15	InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue	Wenwen Tong et.al.	2510.13747	null
2025-10-16	NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching	Run Luo et.al.	2510.13721	null
2025-10-14	Adaptive vector steering: A training-free, layer-wise intervention for hallucination mitigation in large audio and multimodal models	Tsung-En Lin et.al.	2510.12851	null
2025-10-14	Detect Anything via Next Point Prediction	Qing Jiang et.al.	2510.12798	null
2025-10-14	Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception	Ziyang Ma et.al.	2510.12720	null
2025-10-15	SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model	Lin Lin et.al.	2510.12709	null
2025-10-14	The spin Hall conductivity in the hole-doped bilayer Haldane-Hubbard model with odd-parity ALM	Minghuan Zeng et.al.	2510.12602	null
2025-10-14	Not in Sync: Unveiling Temporal Bias in Audio Chat Models	Jiayu Yao et.al.	2510.12185	null
2025-10-14	An AI-Based Behavioral Health Safety Filter and Dataset for Identifying Mental Health Crises in Text-Based Conversations	Benjamin W. Nelson et.al.	2510.12083	null
2025-10-13	Bridging the gap between ultrafast optics and resonant photonics via omni-resonance	Abbas Shiri et.al.	2510.12002	null
2025-10-13	UALM: Unified Audio Language Model for Understanding, Generation and Reasoning	Jinchuan Tian et.al.	2510.12000	null
2025-10-13	ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?	Liu Yang et.al.	2510.11549	null
2025-10-13	Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning	Kuan-Yi Lee et.al.	2510.11454	null
2025-10-13	Optimizing Cross-Domain Transfer for Universal Machine Learning Interatomic Potentials	Jaesun Kim et.al.	2510.11241	null
2025-10-13	VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents	Jiliang Hu et.al.	2510.11098	null
2025-10-12	OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs	Caorui Li et.al.	2510.10689	null
2025-10-12	Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance	Jingyi Chen et.al.	2510.10444	null
2025-10-14	Integration of the TIAGo Robot into Isaac Sim with Mecanum Drive Modeling and Learned S-Curve Velocity Profiles	Vincent Schoenbach et.al.	2510.10273	null
2025-10-10	HANDO: Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation	Jingyuan Sun et.al.	2510.09221	null
2025-10-08	Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization	Rui Hu et.al.	2510.08618	null
2025-10-09	An efficient algorithm for kernel quantile regression	Shengxiang Deng et.al.	2510.07929	null
2025-10-08	AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMS with Audio-visual Cues	Krish Patel et.al.	2510.07355	null
2025-10-08	AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs	Peize He et.al.	2510.07293	null
2025-10-07	Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding	Yi Xin et.al.	2510.06308	null
2025-10-07	AQA-TTRL: Self-Adaptation in Audio Question Answering with Test-Time Reinforcement Learning	Haoyu Zhang et.al.	2510.05478	null
2025-10-06	Observation and modeling of a geo-effective event observed on 2011 May 28 from the solar surface to 1au	Nishu Karna et.al.	2510.05334	null
2025-10-06	AURA Score: A Metric For Holistic Audio Question Answering Evaluation	Satvik Dixit et.al.	2510.04934	null
2025-10-06	Robustness assessment of large audio language models in multiple-choice evaluation	Fernando López et.al.	2510.04584	null
2025-10-03	Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video	Mengyao Xu et.al.	2510.03458	null
2025-10-03	AudioToolAgent: An Agentic Framework for Audio-Language Models	Gijs Wijngaard et.al.	2510.02995	null
2025-10-02	Broadband entangled-photon omni-resonance in a planar optical cavity	Bryan L. Turo et.al.	2510.01595	null
2025-10-01	Hearing the Order: Investigating Selection Bias in Large Audio-Language Models	Yu-Xiang Lin et.al.	2510.00628	null
2025-10-01	When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models	Chen-An Li et.al.	2510.00626	null
2025-10-01	Multi-level Dynamic Style Transfer for NeRFs	Zesheng Li et.al.	2510.00592	null
2025-09-30	TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics	Yi-Cheng Lin et.al.	2509.26329	null
2025-09-30	OmniDFA: A Unified Framework for Open Set Synthesis Image Detection and Few-Shot Attribution	Shiyu Wu et.al.	2509.25682	null
2025-09-29	EMO-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition	Jiacheng Shi et.al.	2509.25495	null

(back to top)

ZhikangNiu/arxiv_daily
