ZhikangNiu/arxiv_daily

Updated on 2025.10.24

Usage instructions: here

Table of Contents
  1. Text to Speech
  2. Text to Audio
  3. Video to Audio
  4. Voice Conversion
  5. Video Generation
  6. Image Generation
  7. Music Generation
  8. Audio Codec
  9. Large Audio Language Model

Text to Speech

Publish Date Title Authors PDF Code
2025-10-22 Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent Yangshijie Zhang et.al. 2510.19641 null
2025-10-22 Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment Maureen de Seyssel et.al. 2510.19509 null
2025-10-22 EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection Tong Zhang et.al. 2510.19414 null
2025-10-21 StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction Qianheng Xu et.al. 2510.18938 null
2025-10-21 KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers Mohd Ruhul Ameen et.al. 2510.18355 null
2025-10-21 ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation Haowei Lou et.al. 2510.18308 null
2025-10-19 U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation Xusheng Yang et.al. 2510.16718 null
2025-10-18 Edge-Based Speech Transcription and Synthesis for Kinyarwanda and Swahili Languages Pacome Simon Mbonimpa et.al. 2510.16497 null
2025-10-18 TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model Bin Yu et.al. 2510.16449 null
2025-10-22 VoiceMorph: How AI Voice Morphing Reveals the Boundaries of Auditory Self-Recognition Kye Shimizu et.al. 2510.16192 null
2025-10-17 High order Tensor-Train-Based Schemes for High-Dimensional Mean Field Games Elisabetta Carlini et.al. 2510.15603 null
2025-10-16 Hints for dynamical dark energy from warm inflation Anupama B et.al. 2510.15051 null
2025-10-16 Improving Cybercrime Detection and Digital Forensics Investigations with Artificial Intelligence Silvia Lucia Sanna et.al. 2510.14638 null
2025-10-16 RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF Qing Yang et.al. 2510.14628 null
2025-10-16 The tt*-structure for the quantum cohomology of complex Grassmannian Tadashi Udagawa et.al. 2510.14483 null
2025-10-20 Radiation pressure and equation of state are important in the envelope unbinding process in common envelope evolution Zhuo Chen et.al. 2510.14173 null
2025-10-15 Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling Peng Kuang et.al. 2510.13918 null
2025-10-15 Generative Universal Verifier as Multimodal Meta-Reasoner Xinchen Zhang et.al. 2510.13804 null
2025-10-15 InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue Wenwen Tong et.al. 2510.13747 null
2025-10-15 Closing the Gap Between Text and Speech Understanding in LLMs Santiago Cuervo et.al. 2510.13632 null
2025-10-15 Functional tensor train neural network for solving high-dimensional PDEs Yani Feng et.al. 2510.13386 null
2025-10-15 Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models Yizhou Peng et.al. 2510.13293 null
2025-10-15 StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation Xi Chen et.al. 2510.13194 null
2025-10-14 Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs Xinlu He et.al. 2510.12995 null
2025-10-14 Toward First-Principles Multi-Messenger Predictions: Coupling Nuclear Networks with GR Radiation-MHD in Gmunu Patrick Chi-Kit Cheong et.al. 2510.12978 null
2025-10-14 Content Anonymization for Privacy in Long-form Audio Cristina Aggazzotti et.al. 2510.12780 null
2025-10-14 TerraCodec: Compressing Earth Observations Julen Costa-Watanabe et.al. 2510.12670 null
2025-10-14 Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation Greta Damo et.al. 2510.12316 null
2025-10-14 DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation Yakun Song et.al. 2510.12210 null
2025-10-13 Actor-Enriched Time Series Forecasting of Process Performance Aurelie Leribaux et.al. 2510.11856 null
2025-10-13 BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis Jingyuan Xing et.al. 2510.11646 null
2025-10-13 Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker Cheng Gong et.al. 2510.11124 null
2025-10-14 ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis Mohammad Javad Ranjbar Kalahroodi et.al. 2510.10774 null
2025-10-14 MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations Wenxiang Guo et.al. 2510.10396 null
2025-10-11 Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey Jiaqi Wei et.al. 2510.09988 null
2025-10-10 Tensor-based compression of the sea temperature data Ilya Kosolapov et.al. 2510.09778 null
2025-10-10 Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models Donghang Wu et.al. 2510.09592 null
2025-10-10 A family of non-simple surfaces whose transport twistor spaces admit global blow-down maps François Monard et.al. 2510.09518 null
2025-10-10 O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion Huu Tuong Tu et.al. 2510.09061 null
2025-10-10 DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment Zongcai Du et.al. 2510.09016 null
2025-10-09 Theoretical Analysis of Topotomography Using Small Intragranular Strain Approximations Zheheng Liu et.al. 2510.08712 null
2025-10-09 DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching Hanke Xie et.al. 2510.08373 null
2025-10-09 Structured covariance estimation via tensor-train decomposition Artsiom Patarusau et.al. 2510.08174 null
2025-10-09 IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation Wei Wang et.al. 2510.07979 null
2025-10-09 VoiceAgentBench: Are Voice Assistants ready for agentic tasks? Dhruv Jain et.al. 2510.07978 null
2025-10-09 Self-Improving LLM Agents at Test-Time Emre Can Acikgoz et.al. 2510.07841 null
2025-10-09 From Noisy to Native: LLM-driven Graph Restoration for Test-Time Graph Domain Adaptation Xiangwei Lv et.al. 2510.07762 null
2025-10-09 Parallel Test-Time Scaling for Latent Reasoning Models Runyang You et.al. 2510.07745 null
2025-10-08 AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding Shuqing Luo et.al. 2510.07486 null
2025-10-08 Gauge Dependence of Scalar-Induced Gravitational Waves from Isocurvature Perturbations: Analytical Results Arshad Ali et.al. 2510.07252 null

(back to top)

Text to Audio

Publish Date Title Authors PDF Code
2025-10-22 Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models Xiaozhen Qiao et.al. 2510.19802 null
2025-10-16 Visible Imaging of Incoherent 1200-nm Light via Triplet–Triplet Annihilation Upconversion Pournima Narayanan et.al. 2510.15184 null
2025-10-16 SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation Jihyun Yu et.al. 2510.14634 null
2025-10-16 AudioEval: Automatic Dual-Perspective and Multi-Dimensional Evaluation of Text-to-Audio-Generation Hui Wang et.al. 2510.14570 null
2025-10-15 UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE Zhenyu Liu et.al. 2510.13344 null
2025-10-15 DP-TTA: Test-time Adaptation for Transient Electromagnetic Signal Denoising via Dictionary-driven Prior Regularization Meng Yang et.al. 2510.13160 null
2025-10-14 Controllable Collision Scenario Generation via Collision Pattern Prediction Pin-Lun Chen et.al. 2510.12206 null
2025-10-14 Audio Palette: A Diffusion Transformer with Multi-Signal Conditioning for Controllable Foley Synthesis Junnuo Wang et.al. 2510.12175 null
2025-10-13 UALM: Unified Audio Language Model for Understanding, Generation and Reasoning Jinchuan Tian et.al. 2510.12000 null
2025-10-13 Efficient Edge Test-Time Adaptation via Latent Feature Coordinate Correction Xinyu Luo et.al. 2510.11068 null
2025-10-17 MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations Wenxiang Guo et.al. 2510.10396 null
2025-10-10 MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation Akira Takahashi et.al. 2510.09065 null
2025-10-10 ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling Yuxuan Jiang et.al. 2510.08878 null
2025-10-13 Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation Liyang Chen et.al. 2510.08078 null
2025-10-09 IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries Harsh Kavediya et.al. 2510.07837 null
2025-10-08 HARP-NeXt: High-Speed and Accurate Range-Point Fusion Network for 3D LiDAR Semantic Segmentation Samir Abou Haidar et.al. 2510.06876 null
2025-10-07 FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders Riccardo Fosco Gramaccioni et.al. 2510.05829 null
2025-10-07 StereoSync: Spatially-Aware Stereo Audio Generation from Video Christian Marinoni et.al. 2510.05828 null
2025-10-07 NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering Alexander Murphy et.al. 2510.05635 null
2025-10-07 LATTA: Langevin-Anchored Test-Time Adaptation for Enhanced Robustness and Stability Harshil Vejendla et.al. 2510.05530 null
2025-10-06 Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers Juncheng Wang et.al. 2510.04577 null
2025-10-05 Pitch-Conditioned Instrument Sound Synthesis From an Interactive Timbre Latent Space Christian Limberg et.al. 2510.04339 null
2025-10-05 The best performance in the CARE 2025 – Liver Task (LiSeg-Contrast): Contrast-Aware Semi-Supervised Segmentation with Domain Generalization and Test-Time Adaptation Jincan Lou et.al. 2510.04243 null
2025-10-04 AI-Assisted Pleural Effusion Volume Estimation from Contrast-Enhanced CT Images Sanhita Basu et.al. 2510.03856 null
2025-10-03 SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos Amir Dellali et.al. 2510.02916 null
2025-10-03 Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models Lihua Zhou et.al. 2510.02750 null
2025-10-02 SoundReactor: Frame-level Online Video-to-Audio Generation Koichi Saito et.al. 2510.02110 null
2025-09-30 To Remember, To Adapt, To Preempt: A Stable Continual Test-Time Adaptation Framework for Remote Physiological Measurement in Dynamic Domain Shifts Shuyang Chu et.al. 2510.01282 null
2025-10-01 PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation Yujia Xiao et.al. 2510.00485 null
2025-10-01 VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors Atif Belal et.al. 2510.00458 null
2025-09-30 Post-Training Quantization for Audio Diffusion Transformers Tanmay Khandelwal et.al. 2510.00313 null
2025-09-30 Video Object Segmentation-Aware Audio Generation Ilpo Viertola et.al. 2509.26604 null
2025-09-30 MARS: Audio Generation via Multi-Channel Autoregression on Spectrograms Eleonora Ristori et.al. 2509.26007 null
2025-09-30 Annotation-Efficient Active Test-Time Adaptation with Conformal Prediction Tingyu Shi et.al. 2509.25692 null
2025-09-30 Charge Transfer States in Donor Acceptor Bulk Heterojunctions as Triplet Triplet Annihilation Sensitizer for Solid-State Photon Upconversion Maciej Klein et.al. 2509.25679 null
2025-09-29 EMO-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition Jiacheng Shi et.al. 2509.25495 null
2025-09-29 A Robust Multi-Scale Framework with Test-Time Adaptation for sEEG-Based Speech Decoding Suli Wang et.al. 2509.24700 null
2025-09-29 When Audio Generators Become Good Listeners: Generative Features for Understanding Tasks Zeyu Xie et.al. 2509.24635 null
2025-09-29 Training-Free Multimodal Guidance for Video to Audio Generation Eleonora Grassucci et.al. 2509.24550 null
2025-10-01 An Agent-Based Framework for Automated Higher-Voice Harmony Generation Nia D'Souza Ganapathy et.al. 2509.24463 null
2025-09-29 UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities Xuenan Xu et.al. 2509.24391 null
2025-09-28 AudioMoG: Guiding Audio Generation with Mixture-of-Guidance Junyou Wang et.al. 2509.23727 null
2025-09-26 TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses Sahar Dastani et.al. 2509.22813 null
2025-09-25 Prompt-aware classifier free guidance for diffusion models Xuanhao Zhang et.al. 2509.22728 null
2025-09-26 Text2Move: Text-to-moving sound generation via trajectory prediction and temporal alignment Yunyi Liu et.al. 2509.21919 null
2025-09-25 AIBA: Attention-based Instrument Band Alignment for Text-to-Audio Diffusion Junyoung Koh et.al. 2509.20891 null
2025-09-24 MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization Jianxuan Yang et.al. 2509.19999 null
2025-09-25 MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model The Hieu Pham et.al. 2509.19881 null
2025-09-24 SCORE: Scaling audio generation using Standardized COmposite REwards Jaemin Jung et.al. 2509.19831 null
2025-09-23 SynSonic: Augmenting Sound Event Detection through Text-to-Audio Diffusion ControlNet and Effective Sample Filtering Jiarui Hai et.al. 2509.18603 null

(back to top)

Video to Audio

Publish Date Title Authors PDF Code
2025-10-10 MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation Akira Takahashi et.al. 2510.09065 null
2025-10-13 Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation Liyang Chen et.al. 2510.08078 null
2025-10-09 IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries Harsh Kavediya et.al. 2510.07837 null
2025-10-07 FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders Riccardo Fosco Gramaccioni et.al. 2510.05829 null
2025-10-07 StereoSync: Spatially-Aware Stereo Audio Generation from Video Christian Marinoni et.al. 2510.05828 null
2025-10-03 SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos Amir Dellali et.al. 2510.02916 null
2025-10-02 SoundReactor: Frame-level Online Video-to-Audio Generation Koichi Saito et.al. 2510.02110 null
2025-09-29 Training-Free Multimodal Guidance for Video to Audio Generation Eleonora Grassucci et.al. 2509.24550 null
2025-09-28 AudioMoG: Guiding Audio Generation with Mixture-of-Guidance Junyou Wang et.al. 2509.23727 null
2025-09-26 WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM Changli Tang et.al. 2509.21990 null
2025-09-26 Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers Jibin Song et.al. 2509.21893 null
2025-09-24 MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization Jianxuan Yang et.al. 2509.19999 null
2025-10-05 StereoFoley: Object-Aware Stereo Audio Generation from Video Tornike Karchkhadze et.al. 2509.18272 null
2025-09-19 Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech Xinlei Niu et.al. 2509.15492 null
2025-09-19 RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes Fang Li et.al. 2509.15123 null
2025-09-08 MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation Xiaoran Yang et.al. 2509.06389 null
2025-09-05 Efficient Video-to-Audio Generation via Multiple Foundation Models Mapper Gehui Chen et.al. 2509.04957 null
2025-08-23 HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation Sizhe Shan et.al. 2508.16930 null
2025-08-19 InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing Shaoshu Yang et.al. 2508.14033 null
2025-08-21 FoleySpace: Vision-Aligned Binaural Spatial Audio Generation Lei Zhao et.al. 2508.12918 null
2025-08-14 LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters Haomin Zhang et.al. 2508.11074 null
2025-08-12 Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization Chaoqun Cui et.al. 2508.08550 null
2025-07-14 DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis Wenjie Tian et.al. 2507.10109 null
2025-07-13 Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation Yingshan Liang et.al. 2507.04959 null
2025-06-23 Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions Vineet Kumar Rakesh et.al. 2507.02900 null
2025-07-03 Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation Feizhen Huang et.al. 2507.02271 null
2025-06-23 IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech Siyi Zhou et.al. 2506.21619 null
2025-06-28 ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing Huadai Liu et.al. 2506.21448 null
2025-06-27 Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance Akio Hayakawa et.al. 2506.20995 null
2025-06-24 Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation Jun Wang et.al. 2506.19774 null
2025-06-13 ViSAGe: Video-to-Spatial Audio Generation Jaeyeon Kim et.al. 2506.12199 null
2025-05-31 Length Aware Speech Translation for Video Dubbing Harveen Singh Chadha et.al. 2506.00740 null
2025-05-26 Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks Chang Liu et.al. 2505.20038 link
2025-05-22 SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet Zhi Zhong et.al. 2505.16195 null
2025-05-30 TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis Yu Zhang et.al. 2505.14910 link
2025-05-28 Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model Yong Ren et.al. 2505.13062 null
2025-06-03 OmniAudio: Generating Spatial Audio from 360-Degree Video Huadai Liu et.al. 2504.14906 link
2025-04-17 CAFA: a Controllable Automatic Foley Artist Roi Benita et.al. 2504.06778 link

(back to top)

Voice Conversion

Publish Date Title Authors PDF Code
2025-10-22 VBx for End-to-End Neural and Clustering-based Diarization Petr Pálka et.al. 2510.19572 null
2025-10-20 Fast Agnostic Learners in the Plane Talya Eden et.al. 2510.18057 null
2025-10-20 Joint upper Banach density, VC dimensions and Euclidean point configurations Bruno Predojević et.al. 2510.17453 null
2025-10-23 The Parameterized Complexity of Computing the VC-Dimension Florent Foucaud et.al. 2510.17451 null
2025-10-18 Truly Subquadratic Time Algorithms for Diameter and Related Problems in Graphs of Bounded VC-dimension Timothy M. Chan et.al. 2510.16346 null
2025-10-22 VoiceMorph: How AI Voice Morphing Reveals the Boundaries of Auditory Self-Recognition Kye Shimizu et.al. 2510.16192 null
2025-10-16 Deadlock-free routing for Full-mesh networks without using Virtual Channels Alejandro Cano et.al. 2510.14730 null
2025-10-15 The VC-dimension and point configurations in $\mathbb{R}^d$ Alex Iosevich et.al. 2510.13984 null
2025-10-16 VC-Dimension vs Degree: An Uncertainty Principle for Boolean Functions Fan Chang et.al. 2510.13705 null
2025-10-15 Model-assisted estimation for MRV: How to boost the economics of SOC sequestration projects without compromising on scientific integrity Ahmad Awad et.al. 2510.13609 null
2025-10-15 Target Controllability Score Kazuhiro Sato et.al. 2510.13354 null
2025-10-14 VCTR: A Transformer-Based Model for Non-parallel Voice Conversion Maharnab Saikia et.al. 2510.12964 null
2025-10-15 (R)evolution of Programming: Vibe Coding as a Post-Coding Paradigm Kevin Krings et.al. 2510.12364 null
2025-10-13 Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker Cheng Gong et.al. 2510.11124 null
2025-10-13 VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents Jiliang Hu et.al. 2510.11098 null
2025-10-10 A Scalable, Privacy-Preserving Decentralized Identity and Verifiable Data Sharing Framework based on Zero-Knowledge Proofs Hui Yuan et.al. 2510.09715 null
2025-10-10 SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion Zhao Guo et.al. 2510.09245 null
2025-10-10 O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion Huu Tuong Tu et.al. 2510.09061 null
2025-10-09 MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows Guobin Ma et.al. 2510.08392 null
2025-10-09 What Makes a Visualization Complex? Mengdi Chu et.al. 2510.08332 null
2025-10-09 VoiceAgentBench: Are Voice Assistants ready for agentic tasks? Dhruv Jain et.al. 2510.07978 null
2025-10-06 UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models Wenhao Guan et.al. 2510.04593 null
2025-10-05 A Multilingual Framework for Dysarthria: Detection, Severity Classification, Speech-to-Text, and Clean Speech Generation Ananya Raghu et.al. 2510.03986 null
2025-10-03 Online Learning in the Random Order Model Martino Bernasconi et.al. 2510.02820 null
2025-10-02 Higher-arity PAC learning, VC dimension and packing lemma Artem Chernikov et.al. 2510.02420 null
2025-09-30 BlockSDN-VC: A SDN-Based Virtual Coordinate-Enhanced Transaction Broadcast Framework for High-Performance Blockchains Wenyang Jia et.al. 2510.00306 null
2025-09-29 MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech Chengyao Wang et.al. 2509.25131 null
2025-10-02 Cofinal families of finite VC-dimension Omer Ben-Neria et.al. 2509.24744 null
2025-09-29 VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning Yixuan Zhou et.al. 2509.24650 null
2025-09-29 ISSE: An Instruction-Guided Speech Style Editing Dataset And Benchmark Yun Chen et.al. 2509.24570 null
2025-09-29 Strong enhancement of d-wave superconductivity in an extended checkerboard Hubbard ladder Xichen Huang et.al. 2509.24415 null
2025-09-26 ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection Mohamed Maged et.al. 2509.22808 null
2025-09-26 Speaker Anonymisation for Speech-based Suicide Risk Detection Ziyun Cui et.al. 2509.22148 null
2025-09-25 VC-Agent: An Interactive Agent for Customized Video Dataset Collection Yidan Zhang et.al. 2509.21291 null
2025-09-24 Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation Yang Cui et.al. 2509.19812 null
2025-09-22 Preconditioned Deformation Grids Julian Kaltheuner et.al. 2509.18097 null
2025-09-21 MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances Junhyeok Lee et.al. 2509.17143 null
2025-09-20 Advancing Reference-free Evaluation of Video Captions with Factual Analysis Shubhashis Roy Dipta et.al. 2509.16538 null
2025-09-19 Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation Qi Wang et.al. 2509.16010 null
2025-09-19 The Singing Voice Conversion Challenge 2025: From Singer Identity Conversion To Singing Style Conversion Lester Phillip Violeta et.al. 2509.15629 null
2025-09-18 FCPE: A Fast Context-based Pitch Estimation Model Yuxin Luo et.al. 2509.15140 null
2025-09-18 MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis Keyu An et.al. 2509.14784 null
2025-09-20 Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis Qingyu Liu et.al. 2509.14579 null
2025-09-17 VCBench: Benchmarking LLMs in Venture Capital Rick Chen et.al. 2509.14448 null
2025-09-16 MSR-Codec: A Low-Bitrate Multi-Stream Residual Codec for High-Fidelity Speech Generation with Information Disentanglement Jingyu Li et.al. 2509.13068 null
2025-09-16 A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis Javeria Amir et.al. 2509.12831 null
2025-09-15 Preservation of Language Understanding Capabilities in Speech-aware Large Language Models Marek Kubis et.al. 2509.12171 null
2025-09-14 Rate-Distortion Limits for Multimodal Retrieval: Theory, Optimal Codes, and Finite-Sample Guarantees Thomas Y. Chen et.al. 2509.11054 null
2025-09-11 Altered Histories in Version Control System Repositories: Evidence from the Trenches Solal Rapaport et.al. 2509.09294 null
2025-09-11 DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners Xiaoxue Luo et.al. 2509.09201 null

(back to top)

Video Generation

Publish Date Title Authors PDF Code
2025-10-22 PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis Qing Mao et.al. 2510.19527 null
2025-10-22 GigaBrain-0: A World Model-Powered Vision-Language-Action Model GigaBrain Team et.al. 2510.19430 null
2025-10-22 Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks Kai Zeng et.al. 2510.19195 null
2025-10-23 Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning Takehiro Aoshima et.al. 2510.19193 null
2025-10-21 MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models Aritra Bhowmik et.al. 2510.19022 null
2025-10-21 UltraGen: High-Resolution Video Generation with Hierarchical Attention Teng Hu et.al. 2510.18775 null
2025-10-23 A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition Peiqin Zhuang et.al. 2510.18705 null
2025-10-21 MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation Weinan Jia et.al. 2510.18692 null
2025-10-21 Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model Zhenxing Zhang et.al. 2510.18573 null
2025-10-22 FeatureFool: Zero-Query Fooling of Video Models via Feature Map Duoxun Tang et.al. 2510.18362 null
2025-10-22 OmniNWM: Omniscient Driving Navigation World Models Bohan Li et.al. 2510.18313 null
2025-10-20 World-in-World: World Models in a Closed-Loop World Jiahan Zhang et.al. 2510.18135 null
2025-10-20 Demystifying Transition Matching: When and Why It Can Beat Flow Matching Jaihoon Kim et.al. 2510.17991 null
2025-10-20 ConsistEdit: Highly Consistent and Precise Training-free Visual Editing Zixin Yin et.al. 2510.17803 null
2025-10-22 MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models Yongshun Zhang et.al. 2510.17519 null
2025-10-20 From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models Zefan Cai et.al. 2510.17247 null
2025-10-19 An empirical study of the effect of video encoders on Temporal Video Grounding Ignacio M. De la Jara et.al. 2510.17007 null
2025-10-19 From Mannequin to Human: A Pose-Aware and Identity-Preserving Video Generation Framework for Lifelike Clothing Display Xiangyu Mu et.al. 2510.16833 null
2025-10-17 VISTA: A Test-Time Self-Improving Video Generation Agent Do Xuan Long et.al. 2510.15831 null
2025-10-17 Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset Qingyan Bai et.al. 2510.15742 null
2025-10-17 DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion Weijie Wang et.al. 2510.15264 null
2025-10-16 TGT: Text-Grounded Trajectories for Locally Controlled Video Generation Guofeng Zhang et.al. 2510.15104 null
2025-10-16 RealDPO: Real or Not Real, that is the Preference Guo Cheng et.al. 2510.14955 null
2025-10-16 DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation Yu Zhou et.al. 2510.14949 null
2025-10-16 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation JoungBin Lee et.al. 2510.14945 null
2025-10-16 ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints Meiqi Wu et.al. 2510.14847 null
2025-10-16 In-Context Learning with Unpaired Clips for Instruction-based Video Editing Xinyao Liao et.al. 2510.14648 null
2025-10-19 STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding Zhifei Chen et.al. 2510.14588 null
2025-10-17 Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning Xiangyu Meng et.al. 2510.14256 null
2025-10-16 Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization Liao Shen et.al. 2510.14255 null
2025-10-16 Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures Yuancheng Xu et.al. 2510.14179 null
2025-10-15 PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning Sihui Ji et.al. 2510.13809 null
2025-10-15 CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas Zian Li et.al. 2510.13669 null
2025-10-15 VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator Hyojun Go et.al. 2510.13454 null
2025-10-15 Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation Yi Zuo et.al. 2510.13084 null
2025-10-15 Counting Hallucinations in Diffusion Models Shuai Fu et.al. 2510.13080 null
2025-10-14 SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models Zhengxu Tang et.al. 2510.13042 null
2025-10-14 MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars Felix Taubner et.al. 2510.12785 null
2025-10-14 Time-Correlated Video Bridge Matching Viacheslav Vasilev et.al. 2510.12453 null
2025-10-14 Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding Ye Chen et.al. 2510.12256 null
2025-10-14 BIGFix: Bidirectional Image Generation with Token Fixing Victor Besnier et.al. 2510.12231 null
2025-10-14 Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback Xingpei Ma et.al. 2510.12089 null
2025-10-14 VIDMP3: Video Editing by Representing Motion with Pose and Position Priors Sandeep Mishra et.al. 2510.12069 null
2025-10-13 Point Prompting: Counterfactual Tracking with Video Diffusion Models Ayush Shrivastava et.al. 2510.11715 null
2025-10-13 IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment Yinan Chen et.al. 2510.11647 null
2025-10-13 MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps Jiahui Lei et.al. 2510.11107 null
2025-10-12 AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes Yu Li et.al. 2510.10670 null
2025-10-12 DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis Peiyin Chen et.al. 2510.10650 null
2025-10-10 Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians Jin-Chuan Shi et.al. 2510.09438 null
2025-10-10 Stable Video Infinity: Infinite-Length Video Generation with Error Recycling Wuyang Li et.al. 2510.09212 null

(back to top)

Image Generation

Publish Date Title Authors PDF Code
2025-10-22 Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing Yusu Qian et.al. 2510.19808 null
2025-10-22 The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models Xiaofeng Zhang et.al. 2510.19557 null
2025-10-22 Predicting before Reconstruction: A generative prior framework for MRI acceleration Juhyung Park et.al. 2510.19472 null
2025-10-22 D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation Nobline Yoo et.al. 2510.19278 null
2025-10-21 DP$^2$O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution Rongyuan Wu et.al. 2510.18851 null
2025-10-21 SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation Siyong Jian et.al. 2510.18716 null
2025-10-21 UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation Yibin Wang et.al. 2510.18701 null
2025-10-21 From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation Ziwei Huang et.al. 2510.18263 null
2025-10-21 Beyond Frequency: Scoring-Driven Debiasing for Object Detection via Blueprint-Prompted Image Synthesis Xinhao Cai et.al. 2510.18229 null
2025-10-22 Chimera: Compositional Image Generation using Part-based Concepting Shivam Singh et.al. 2510.18083 null
2025-10-20 Fine-tuning Flow Matching Generative Models with Intermediate Feedback Jiajun Fan et.al. 2510.18072 null
2025-10-20 Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models Jiajun Fan et.al. 2510.18053 null
2025-10-20 Inference-Time Compute Scaling For Flow Matching Adam Stecklov et.al. 2510.17786 null
2025-10-20 VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models Qilin Liao et.al. 2510.17759 null
2025-10-21 PICABench: How Far Are We from Physically Realistic Image Editing? Yuandong Pu et.al. 2510.17681 null
2025-10-21 CaMiT: A Time-Aware Car Model Dataset for Classification and Generation Frédéric LIN et.al. 2510.17626 null
2025-10-20 Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling Feihong Yan et.al. 2510.17171 null
2025-10-20 In-situ Autoguidance: Eliciting Self-Correction in Diffusion Models Enhao Gu et.al. 2510.17136 null
2025-10-19 One-step Diffusion Models with Bregman Density Ratio Matching Yuanzhi Zhu et.al. 2510.16983 null
2025-10-21 Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback Zongjian Li et.al. 2510.16888 null
2025-10-19 Class-N-Diff: Classification-Induced Diffusion Model Can Make Fair Skin Cancer Diagnosis Nusrat Munia et.al. 2510.16887 null
2025-10-19 Region in Context: Text-condition Image editing with Human-like semantic reasoning Thuy Phuong Vu et.al. 2510.16772 null
2025-10-17 BLIP3o-NEXT: Next Frontier of Native Image Generation Jiuhai Chen et.al. 2510.15857 null
2025-10-17 Controlling the image generation process with parametric activation functions Ilia Pavlov et.al. 2510.15778 null
2025-10-17 NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation Yitong Sun et.al. 2510.15752 null
2025-10-17 Unimedvl: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis Junzhi Ning et.al. 2510.15710 null
2025-10-17 Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation Xiaoming Zhu et.al. 2510.15564 null
2025-10-16 Salient Concept-Aware Generative Data Augmentation Tianchen Zhao et.al. 2510.15194 null
2025-10-16 Constantly Improving Image Models Need Constantly Improving Benchmarks Jiaxin Ge et.al. 2510.15021 link
2025-10-16 Coupled Diffusion Sampling for Training-Free Multi-View Image Editing Hadi Alzayer et.al. 2510.14981 null
2025-10-16 Learning an Image Editing Model without Image Editing Pairs Nupur Kumari et.al. 2510.14978 link
2025-10-16 WithAnyone: Towards Controllable and ID Consistent Image Generation Hengyuan Xu et.al. 2510.14975 null
2025-10-16 ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention Keli Liu et.al. 2510.14882 null
2025-10-16 FraQAT: Quantization Aware Training with Fractional bits Luca Morreale et.al. 2510.14823 null
2025-10-16 In-Context Learning with Unpaired Clips for Instruction-based Video Editing Xinyao Liao et.al. 2510.14648 null
2025-10-16 Adapting Self-Supervised Representations as a Latent Space for Efficient Generation Ming Gui et.al. 2510.14630 null
2025-10-16 Consistent text-to-image generation via scene de-contextualization Song Tang et.al. 2510.14553 null
2025-10-16 Exploring Image Representation with Decoupled Classical Visual Descriptors Chenyuan Qu et.al. 2510.14536 null
2025-10-16 Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models Yunze Tong et.al. 2510.14526 null
2025-10-15 Generative Universal Verifier as Multimodal Meta-Reasoner Xinchen Zhang et.al. 2510.13804 null
2025-10-15 Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation Yifu Luo et.al. 2510.13418 null
2025-10-15 End-to-End Multi-Modal Diffusion Mamba Chunhao Lu et.al. 2510.13253 null
2025-10-15 Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation Yi Zuo et.al. 2510.13084 null
2025-10-15 Counting Hallucinations in Diffusion Models Shuai Fu et.al. 2510.13080 null
2025-10-14 UniFusion: Vision-Language Model as Unified Encoder in Image Generation Kevin Li et.al. 2510.12789 null
2025-10-14 SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models Weiyang Jin et.al. 2510.12784 null
2025-10-14 LayerSync: Self-aligning Intermediate Layers Yasaman Haghighi et.al. 2510.12581 null
2025-10-14 AngularFuse: A Closer Look at Angle-based Perception for Spatial-Sensitive Multi-Modality Image Fusion Xiaopeng Liu et.al. 2510.12260 null
2025-10-14 Local Background Features Matter in Out-of-Distribution Detection Jinlun Ye et.al. 2510.12259 null
2025-10-14 FedMMKT: Co-Enhancing a Server Text-to-Image Model and Client Task Models in Multi-Modal Federated Learning Ningxin He et.al. 2510.12254 null

(back to top)

Music Generation

Publish Date Title Authors PDF Code
2025-10-21 Steering Autoregressive Music Generation with Recursive Feature Machines Daniel Zhao et.al. 2510.19127 null
2025-10-18 MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding Jingyue Huang et.al. 2510.16273 null
2025-10-16 Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics? Qixin Deng et.al. 2510.14249 null
2025-10-15 UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE Zhenyu Liu et.al. 2510.13344 null
2025-10-17 MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations Wenxiang Guo et.al. 2510.10396 null
2025-10-11 ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis Stephen Ni-Hahn et.al. 2510.10249 null
2025-10-07 LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment Jiahao Mei et.al. 2510.05875 null
2025-10-02 Bias beyond Borders: Global Inequalities in AI-Generated Music Ahmet Solak et.al. 2510.01963 null
2025-10-15 SAGE-Music: Low-Latency Symbolic Music Generation via Attribute-Specialized Key-Value Head Sharing Jiaye Tan et.al. 2510.00395 null
2025-10-04 HNote: Extending YNote with Hexadecimal Encoding for Fine-Tuning LLMs in Music Modeling Hung-Ying Chu et.al. 2509.25694 null
2025-09-29 Ethics Statements in AI Music Papers: The Effective and the Ineffective Julia Barnett et.al. 2509.25496 null
2025-09-29 Discovering "Words" in Music: Unsupervised Learning of Compositional Sparse Code for Symbolic Music Tianle Wang et.al. 2509.24603 null
2025-10-01 An Agent-Based Framework for Automated Higher-Voice Harmony Generation Nia D'Souza Ganapathy et.al. 2509.24463 null
2025-09-28 Time-Shifted Token Scheduling for Symbolic Music Generation Ting-Kang Wang et.al. 2509.23749 null
2025-09-28 AudioMoG: Guiding Audio Generation with Mixture-of-Guidance Junyou Wang et.al. 2509.23727 null
2025-09-27 AI-Assisted Music Production: A User Study on Text-to-Music Models Francesca Ronchini et.al. 2509.23364 null
2025-09-26 Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach Zijian Zhao et.al. 2509.22378 null
2025-09-26 MusicWeaver: Coherent Long-Range and Editable Music Generation from a Beat-Aligned Structural Plan Xuanchen Wang et.al. 2509.21714 null
2025-09-21 Difficulty-Aware Score Generation for Piano Sight-Reading Pedro Ramoneda et.al. 2509.16913 null
2025-09-17 Assessing Data Replication in Symbolic Music via Adapted Structural Similarity Index Measure Shulei Ji et.al. 2509.13658 null
2025-09-13 A Traditional Approach to Symbolic Piano Continuation Christian Zhou-Zheng et.al. 2509.12267 null
2025-09-14 Decoding Musical Origins: Distinguishing Human and AI Composers Cheng-Yang Tsai et.al. 2509.11369 null
2025-09-14 STASE: A spatialized text-to-audio synthesis engine for music generation Tutti Chi et.al. 2509.11124 null
2025-09-10 Segment Transformer: AI-Generated Music Detection via Music Structural Analysis Yumin Kim et.al. 2509.08283 null
2025-09-09 Continuous Audio Language Models Simon Rouard et.al. 2509.06926 null
2025-09-24 No Encore: Unlearning as Opt-Out in Music Generation Jinju Kim et.al. 2509.06277 null
2025-09-07 UniVerse-1: Unified Audio-Video Generation via Stitching of Experts Duomin Wang et.al. 2509.06155 null
2025-09-04 PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music Hayeon Bang et.al. 2509.04215 null
2025-09-03 Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings Dyah A. M. G. Wisnu et.al. 2509.03292 null
2025-09-01 The AudioMOS Challenge 2025 Wen-Chin Huang et.al. 2509.01336 null
2025-08-31 TinyMusician: On-Device Music Generation with Knowledge Distillation and Mixed Precision Quantization Hainan Wang et.al. 2509.00914 null
2025-09-04 AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation Gyehun Go et.al. 2509.00813 null
2025-08-31 The Name-Free Gap: Policy-Aware Stylistic Control in Music Generation Ashwin Nagarajan et.al. 2509.00654 null
2025-08-24 A Survey on Evaluation Metrics for Music Generation Faria Binte Kader et.al. 2509.00051 null
2025-08-28 Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music Hongju Su et.al. 2508.20665 null
2025-08-27 The IRMA Dataset: A Structured Audio-MIDI Corpus for Iranian Classical Music Sepideh Shafiei et.al. 2508.19876 null
2025-08-27 CompLex: Music Theory Lexicon Constructed by Autonomous Agents for Automatic Music Generation Zhejing Hu et.al. 2508.19603 null
2025-08-08 MuSpike: A Benchmark and Evaluation Framework for Symbolic Music Generation with Spiking Neural Networks Qian Liang et.al. 2508.19251 null
2025-08-12 QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems Chien-Chun Wang et.al. 2508.08957 null
2025-08-12 Opening Musical Creativity? Embedded Ideologies in Generative-AI Music Systems Liam Pram et.al. 2508.08805 null
2025-08-08 Live Music Models Lyria Team et.al. 2508.04651 link
2025-08-03 Automatic Melody Reduction via Shortest Path Finding Ziyu Wang et.al. 2508.01571 null
2025-07-31 DeformTune: A Deformable XAI Music Prototype for Non-Musicians Ziqing Xu et.al. 2508.00160 null
2025-07-31 "I made this (sort of)": Negotiating authorship, confronting fraudulence, and exploring new musical spaces with prompt-based AI music generation Bob L. T. Sturm et.al. 2507.23365 null
2025-07-28 Music Arena: Live Evaluation for Text-to-Music Yonghyun Kim et.al. 2507.20900 null
2025-07-28 Controllable Video-to-Music Generation with Multiple Time-Varying Conditions Junxian Wu et.al. 2507.20627 null
2025-07-27 Diffusion-based Symbolic Music Generation with Structured State Space Models Shenghua Yuan et.al. 2507.20128 null
2025-08-07 SAMUeL: Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion Hei Shing Cheung et.al. 2507.19991 null
2025-07-17 A new XML conversion process for mensural music encoding : CMME_to_MEI (via Verovio) David Fiala et.al. 2507.15991 null
2025-07-17 WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling Qihui Yang et.al. 2507.10534 null

(back to top)

Audio Codec

Publish Date Title Authors PDF Code
2025-10-19 SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization Wenxi Chen et.al. 2510.16841 null
2025-10-19 U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation Xusheng Yang et.al. 2510.16718 null
2025-10-17 LDCodec: A high quality neural audio codec with low-complexity decoder Jiawei Jiang et.al. 2510.15364 null
2025-10-17 Extending Audio Context for Long-Form Understanding in Large Audio-Language Models Yuatyong Chaichana et.al. 2510.15231 null
2025-10-17 LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models Xiaohan Zhao et.al. 2510.15227 null
2025-10-16 TASLA: Text-Aligned Speech Tokens with Multiple Layer-Aggregation Ming-Hao Hsu et.al. 2510.14934 null
2025-10-15 Acoustic Teleportation via Disentangled Neural Audio Codec Representations Philipp Grundhuber et.al. 2510.13221 null
2025-10-13 UALM: Unified Audio Language Model for Understanding, Generation and Reasoning Jinchuan Tian et.al. 2510.12000 null
2025-10-13 BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis Jingyuan Xing et.al. 2510.11646 null
2025-10-12 FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec Yurii Halychanskyi et.al. 2510.10785 null
2025-10-11 SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation Zeyu Ling et.al. 2510.10069 null
2025-10-11 MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction Jianjin Wang et.al. 2510.10003 null
2025-10-10 SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion Zhao Guo et.al. 2510.09245 null
2025-10-08 AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs Peize He et.al. 2510.07293 null
2025-10-07 Latent Speech-Text Transformer Yen-Ju Lu et.al. 2510.06195 null
2025-10-07 EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS Haoxun Li et.al. 2510.05758 null
2025-10-06 UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models Wenhao Guan et.al. 2510.04593 null
2025-10-04 Désentrelacement Fréquentiel Doux pour les Codecs Audio Neuronaux Benoît Giniès et.al. 2510.03741 null
2025-10-04 Soft Disentanglement in Frequency Bands for Neural Audio Codecs Benoit Ginies et.al. 2510.03735 null
2025-10-02 High-Fidelity Speech Enhancement via Discrete Audio Tokens Luca A. Lanzendörfer et.al. 2510.02187 null
2025-10-02 MelCap: A Unified Single-Codebook Neural Codec for High-Fidelity Audio Compression Jingyi Li et.al. 2510.01903 null
2025-10-02 FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates Jiaqi Li et.al. 2510.00981 null
2025-10-07 Baseline Systems For The 2025 Low-Resource Audio Codec Challenge Yusuf Ziya Isik et.al. 2510.00264 null
2025-09-30 Scaling Spoken Language Models with Syllabic Speech Tokenization Nicholas Lee et.al. 2509.26634 null
2025-09-30 Optimizing Speech Language Models for Acoustic Consistency Morteza Rohanian et.al. 2509.26276 null
2025-09-29 MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech Chengyao Wang et.al. 2509.25131 null
2025-09-29 VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning Yixuan Zhou et.al. 2509.24650 null
2025-09-29 Assessing speech quality metrics for evaluation of neural audio codecs under clean speech conditions Wolfgang Mack et.al. 2509.24457 null
2025-09-26 StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs Yuhan Song et.al. 2509.22220 null
2025-09-26 Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling Junjie Cao et.al. 2509.22062 null
2025-09-26 AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook Yushen Chen et.al. 2509.21968 null
2025-09-25 X-Streamer: Unified Human World Modeling with Audiovisual Interaction You Xie et.al. 2509.21574 null
2025-09-24 Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens Ismail Rasim Ulgen et.al. 2509.20485 null
2025-09-25 From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training Tianqiao Liu et.al. 2509.20072 null
2025-09-24 Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens Pin-Jui Ku et.al. 2509.20060 null
2025-09-25 Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration Yifan Yang et.al. 2509.19928 null
2025-09-24 Eliminating stability hallucinations in llm-based tts models via attention guidance ShiMing Wang et.al. 2509.19852 null
2025-09-23 Improving Test-Time Performance of RVQ-based Neural Codecs Hyeongju Kim et.al. 2509.19186 null
2025-09-23 Enhancing Noise Robustness for Neural Speech Codecs through Resource-Efficient Progressive Quantization Perturbation Simulation Rui-Chen Zheng et.al. 2509.19025 null
2025-09-23 HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS Sihang Nie et.al. 2509.19001 null
2025-09-23 Direct Preference Optimization for Speech Autoregressive Diffusion Models Zhijun Liu et.al. 2509.18928 null
2025-09-23 Towards Evaluating Generative Audio: Insights from Neural Audio Codec Embedding Distances Arijit Biswas et.al. 2509.18823 null
2025-09-22 Does Audio Matter for Modern Video-LLMs and Their Benchmarks? Geewook Kim et.al. 2509.17901 null
2025-09-22 Qwen3-Omni Technical Report Jin Xu et.al. 2509.17765 null
2025-09-21 MBCodec: Thorough disentangle for high-fidelity audio compression Ruonan Zhang et.al. 2509.17006 null
2025-09-19 FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation Luca Della Libera et.al. 2509.16195 null
2025-09-19 VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency Nikita Torgashov et.al. 2509.15969 null
2025-09-18 A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication Ryan Collette et.al. 2509.15462 null
2025-09-18 MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis Keyu An et.al. 2509.14784 null
2025-09-17 A High-Quality and Low-Complexity Streamable Neural Speech Codec with Knowledge Distillation En-Wei Zhang et.al. 2509.13670 null

(back to top)

Large Audio Language Model

Publish Date Title Authors PDF Code
2025-10-21 MMAO-Bench: MultiModal All in One Benchmark Reveals Compositional Law between Uni-modal and Omni-modal in OmniModels Chen Chen et.al. 2510.18915 null
2025-10-20 Hearing Health in Home Healthcare: Leveraging LLMs for Illness Scoring and ALMs for Vocal Biomarker Extraction Yu-Wen Chen et.al. 2510.18169 null
2025-10-20 SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering Weilin Lin et.al. 2510.17633 null
2025-10-21 LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding ZhaoYang Han et.al. 2510.17305 null
2025-10-22 OmniVIC: A Self-Improving Variable Impedance Controller with Vision-Language In-Context Learning for Safe Robotic Manipulation Heng Zhang et.al. 2510.17150 null
2025-10-19 SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models Chih-Kai Yang et.al. 2510.16917 null
2025-10-19 Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations Bo-Han Feng et.al. 2510.16893 null
2025-10-19 The Augmented Lagrangian Methods: Overview and Recent Advances Kangkang Deng et.al. 2510.16827 null
2025-10-17 OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM Hanrong Ye et.al. 2510.15870 null
2025-10-17 Extending Audio Context for Long-Form Understanding in Large Audio-Language Models Yuatyong Chaichana et.al. 2510.15231 null
2025-10-16 XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models Xingrui Wang et.al. 2510.15148 null
2025-10-15 Yamaji effect in models of underdoped cuprates Jing-Yu Zhao et.al. 2510.13943 null
2025-10-15 Generative Universal Verifier as Multimodal Meta-Reasoner Xinchen Zhang et.al. 2510.13804 null
2025-10-15 InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue Wenwen Tong et.al. 2510.13747 null
2025-10-16 NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching Run Luo et.al. 2510.13721 null
2025-10-14 Adaptive vector steering: A training-free, layer-wise intervention for hallucination mitigation in large audio and multimodal models Tsung-En Lin et.al. 2510.12851 null
2025-10-14 Detect Anything via Next Point Prediction Qing Jiang et.al. 2510.12798 null
2025-10-14 Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception Ziyang Ma et.al. 2510.12720 null
2025-10-15 SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model Lin Lin et.al. 2510.12709 null
2025-10-14 The spin Hall conductivity in the hole-doped bilayer Haldane-Hubbard model with odd-parity ALM Minghuan Zeng et.al. 2510.12602 null
2025-10-14 Not in Sync: Unveiling Temporal Bias in Audio Chat Models Jiayu Yao et.al. 2510.12185 null
2025-10-14 An AI-Based Behavioral Health Safety Filter and Dataset for Identifying Mental Health Crises in Text-Based Conversations Benjamin W. Nelson et.al. 2510.12083 null
2025-10-13 Bridging the gap between ultrafast optics and resonant photonics via omni-resonance Abbas Shiri et.al. 2510.12002 null
2025-10-13 UALM: Unified Audio Language Model for Understanding, Generation and Reasoning Jinchuan Tian et.al. 2510.12000 null
2025-10-13 ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments? Liu Yang et.al. 2510.11549 null
2025-10-13 Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning Kuan-Yi Lee et.al. 2510.11454 null
2025-10-13 Optimizing Cross-Domain Transfer for Universal Machine Learning Interatomic Potentials Jaesun Kim et.al. 2510.11241 null
2025-10-13 VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents Jiliang Hu et.al. 2510.11098 null
2025-10-12 OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs Caorui Li et.al. 2510.10689 null
2025-10-12 Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance Jingyi Chen et.al. 2510.10444 null
2025-10-14 Integration of the TIAGo Robot into Isaac Sim with Mecanum Drive Modeling and Learned S-Curve Velocity Profiles Vincent Schoenbach et.al. 2510.10273 null
2025-10-10 HANDO: Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation Jingyuan Sun et.al. 2510.09221 null
2025-10-08 Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization Rui Hu et.al. 2510.08618 null
2025-10-09 An efficient algorithm for kernel quantile regression Shengxiang Deng et.al. 2510.07929 null
2025-10-08 AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMS with Audio-visual Cues Krish Patel et.al. 2510.07355 null
2025-10-08 AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs Peize He et.al. 2510.07293 null
2025-10-07 Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding Yi Xin et.al. 2510.06308 null
2025-10-07 AQA-TTRL: Self-Adaptation in Audio Question Answering with Test-Time Reinforcement Learning Haoyu Zhang et.al. 2510.05478 null
2025-10-06 Observation and modeling of a geo-effective event observed on 2011 May 28 from the solar surface to 1au Nishu Karna et.al. 2510.05334 null
2025-10-06 AURA Score: A Metric For Holistic Audio Question Answering Evaluation Satvik Dixit et.al. 2510.04934 null
2025-10-06 Robustness assessment of large audio language models in multiple-choice evaluation Fernando López et.al. 2510.04584 null
2025-10-03 Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video Mengyao Xu et.al. 2510.03458 null
2025-10-03 AudioToolAgent: An Agentic Framework for Audio-Language Models Gijs Wijngaard et.al. 2510.02995 null
2025-10-02 Broadband entangled-photon omni-resonance in a planar optical cavity Bryan L. Turo et.al. 2510.01595 null
2025-10-01 Hearing the Order: Investigating Selection Bias in Large Audio-Language Models Yu-Xiang Lin et.al. 2510.00628 null
2025-10-01 When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models Chen-An Li et.al. 2510.00626 null
2025-10-01 Multi-level Dynamic Style Transfer for NeRFs Zesheng Li et.al. 2510.00592 null
2025-09-30 TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics Yi-Cheng Lin et.al. 2509.26329 null
2025-09-30 OmniDFA: A Unified Framework for Open Set Synthesis Image Detection and Few-Shot Attribution Shiyu Wu et.al. 2509.25682 null
2025-09-29 EMO-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition Jiacheng Shi et.al. 2509.25495 null

(back to top)
