| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-10-22 | Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent | Yangshijie Zhang et.al. | 2510.19641 | null |
| 2025-10-22 | Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment | Maureen de Seyssel et.al. | 2510.19509 | null |
| 2025-10-22 | EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection | Tong Zhang et.al. | 2510.19414 | null |
| 2025-10-21 | StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction | Qianheng Xu et.al. | 2510.18938 | null |
| 2025-10-21 | KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers | Mohd Ruhul Ameen et.al. | 2510.18355 | null |
| 2025-10-21 | ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation | Haowei Lou et.al. | 2510.18308 | null |
| 2025-10-19 | U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation | Xusheng Yang et.al. | 2510.16718 | null |
| 2025-10-18 | Edge-Based Speech Transcription and Synthesis for Kinyarwanda and Swahili Languages | Pacome Simon Mbonimpa et.al. | 2510.16497 | null |
| 2025-10-18 | TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model | Bin Yu et.al. | 2510.16449 | null |
| 2025-10-22 | VoiceMorph: How AI Voice Morphing Reveals the Boundaries of Auditory Self-Recognition | Kye Shimizu et.al. | 2510.16192 | null |
| 2025-10-17 | High order Tensor-Train-Based Schemes for High-Dimensional Mean Field Games | Elisabetta Carlini et.al. | 2510.15603 | null |
| 2025-10-16 | Hints for dynamical dark energy from warm inflation | Anupama B et.al. | 2510.15051 | null |
| 2025-10-16 | Improving Cybercrime Detection and Digital Forensics Investigations with Artificial Intelligence | Silvia Lucia Sanna et.al. | 2510.14638 | null |
| 2025-10-16 | RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF | Qing Yang et.al. | 2510.14628 | null |
| 2025-10-16 | The tt*-structure for the quantum cohomology of complex Grassmannian | Tadashi Udagawa et.al. | 2510.14483 | null |
| 2025-10-20 | Radiation pressure and equation of state are important in the envelope unbinding process in common envelope evolution | Zhuo Chen et.al. | 2510.14173 | null |
| 2025-10-15 | Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling | Peng Kuang et.al. | 2510.13918 | null |
| 2025-10-15 | Generative Universal Verifier as Multimodal Meta-Reasoner | Xinchen Zhang et.al. | 2510.13804 | null |
| 2025-10-15 | InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue | Wenwen Tong et.al. | 2510.13747 | null |
| 2025-10-15 | Closing the Gap Between Text and Speech Understanding in LLMs | Santiago Cuervo et.al. | 2510.13632 | null |
| 2025-10-15 | Functional tensor train neural network for solving high-dimensional PDEs | Yani Feng et.al. | 2510.13386 | null |
| 2025-10-15 | Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models | Yizhou Peng et.al. | 2510.13293 | null |
| 2025-10-15 | StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation | Xi Chen et.al. | 2510.13194 | null |
| 2025-10-14 | Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs | Xinlu He et.al. | 2510.12995 | null |
| 2025-10-14 | Toward First-Principles Multi-Messenger Predictions: Coupling Nuclear Networks with GR Radiation-MHD in `Gmunu` | Patrick Chi-Kit Cheong et.al. | 2510.12978 | null |
| 2025-10-14 | Content Anonymization for Privacy in Long-form Audio | Cristina Aggazzotti et.al. | 2510.12780 | null |
| 2025-10-14 | TerraCodec: Compressing Earth Observations | Julen Costa-Watanabe et.al. | 2510.12670 | null |
| 2025-10-14 | Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation | Greta Damo et.al. | 2510.12316 | null |
| 2025-10-14 | DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation | Yakun Song et.al. | 2510.12210 | null |
| 2025-10-13 | Actor-Enriched Time Series Forecasting of Process Performance | Aurelie Leribaux et.al. | 2510.11856 | null |
| 2025-10-13 | BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis | Jingyuan Xing et.al. | 2510.11646 | null |
| 2025-10-13 | Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker | Cheng Gong et.al. | 2510.11124 | null |
| 2025-10-14 | ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis | Mohammad Javad Ranjbar Kalahroodi et.al. | 2510.10774 | null |
| 2025-10-14 | MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations | Wenxiang Guo et.al. | 2510.10396 | null |
| 2025-10-11 | Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey | Jiaqi Wei et.al. | 2510.09988 | null |
| 2025-10-10 | Tensor-based compression of the sea temperature data | Ilya Kosolapov et.al. | 2510.09778 | null |
| 2025-10-10 | Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models | Donghang Wu et.al. | 2510.09592 | null |
| 2025-10-10 | A family of non-simple surfaces whose transport twistor spaces admit global blow-down maps | François Monard et.al. | 2510.09518 | null |
| 2025-10-10 | O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion | Huu Tuong Tu et.al. | 2510.09061 | null |
| 2025-10-10 | DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment | Zongcai Du et.al. | 2510.09016 | null |
| 2025-10-09 | Theoretical Analysis of Topotomography Using Small Intragranular Strain Approximations | Zheheng Liu et.al. | 2510.08712 | null |
| 2025-10-09 | DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching | Hanke Xie et.al. | 2510.08373 | null |
| 2025-10-09 | Structured covariance estimation via tensor-train decomposition | Artsiom Patarusau et.al. | 2510.08174 | null |
| 2025-10-09 | IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation | Wei Wang et.al. | 2510.07979 | null |
| 2025-10-09 | VoiceAgentBench: Are Voice Assistants ready for agentic tasks? | Dhruv Jain et.al. | 2510.07978 | null |
| 2025-10-09 | Self-Improving LLM Agents at Test-Time | Emre Can Acikgoz et.al. | 2510.07841 | null |
| 2025-10-09 | From Noisy to Native: LLM-driven Graph Restoration for Test-Time Graph Domain Adaptation | Xiangwei Lv et.al. | 2510.07762 | null |
| 2025-10-09 | Parallel Test-Time Scaling for Latent Reasoning Models | Runyang You et.al. | 2510.07745 | null |
| 2025-10-08 | AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding | Shuqing Luo et.al. | 2510.07486 | null |
| 2025-10-08 | Gauge Dependence of Scalar-Induced Gravitational Waves from Isocurvature Perturbations: Analytical Results | Arshad Ali et.al. | 2510.07252 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-10-22 | Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models | Xiaozhen Qiao et.al. | 2510.19802 | null |
| 2025-10-16 | Visible Imaging of Incoherent 1200-nm Light via Triplet–Triplet Annihilation Upconversion | Pournima Narayanan et.al. | 2510.15184 | null |
| 2025-10-16 | SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation | Jihyun Yu et.al. | 2510.14634 | null |
| 2025-10-16 | AudioEval: Automatic Dual-Perspective and Multi-Dimensional Evaluation of Text-to-Audio-Generation | Hui Wang et.al. | 2510.14570 | null |
| 2025-10-15 | UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE | Zhenyu Liu et.al. | 2510.13344 | null |
| 2025-10-15 | DP-TTA: Test-time Adaptation for Transient Electromagnetic Signal Denoising via Dictionary-driven Prior Regularization | Meng Yang et.al. | 2510.13160 | null |
| 2025-10-14 | Controllable Collision Scenario Generation via Collision Pattern Prediction | Pin-Lun Chen et.al. | 2510.12206 | null |
| 2025-10-14 | Audio Palette: A Diffusion Transformer with Multi-Signal Conditioning for Controllable Foley Synthesis | Junnuo Wang et.al. | 2510.12175 | null |
| 2025-10-13 | UALM: Unified Audio Language Model for Understanding, Generation and Reasoning | Jinchuan Tian et.al. | 2510.12000 | null |
| 2025-10-13 | Efficient Edge Test-Time Adaptation via Latent Feature Coordinate Correction | Xinyu Luo et.al. | 2510.11068 | null |
| 2025-10-17 | MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations | Wenxiang Guo et.al. | 2510.10396 | null |
| 2025-10-10 | MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation | Akira Takahashi et.al. | 2510.09065 | null |
| 2025-10-10 | ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling | Yuxuan Jiang et.al. | 2510.08878 | null |
| 2025-10-13 | Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation | Liyang Chen et.al. | 2510.08078 | null |
| 2025-10-09 | IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries | Harsh Kavediya et.al. | 2510.07837 | null |
| 2025-10-08 | HARP-NeXt: High-Speed and Accurate Range-Point Fusion Network for 3D LiDAR Semantic Segmentation | Samir Abou Haidar et.al. | 2510.06876 | null |
| 2025-10-07 | FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders | Riccardo Fosco Gramaccioni et.al. | 2510.05829 | null |
| 2025-10-07 | StereoSync: Spatially-Aware Stereo Audio Generation from Video | Christian Marinoni et.al. | 2510.05828 | null |
| 2025-10-07 | NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering | Alexander Murphy et.al. | 2510.05635 | null |
| 2025-10-07 | LATTA: Langevin-Anchored Test-Time Adaptation for Enhanced Robustness and Stability | Harshil Vejendla et.al. | 2510.05530 | null |
| 2025-10-06 | Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers | Juncheng Wang et.al. | 2510.04577 | null |
| 2025-10-05 | Pitch-Conditioned Instrument Sound Synthesis From an Interactive Timbre Latent Space | Christian Limberg et.al. | 2510.04339 | null |
| 2025-10-05 | The best performance in the CARE 2025 – Liver Task (LiSeg-Contrast): Contrast-Aware Semi-Supervised Segmentation with Domain Generalization and Test-Time Adaptation | Jincan Lou et.al. | 2510.04243 | null |
| 2025-10-04 | AI-Assisted Pleural Effusion Volume Estimation from Contrast-Enhanced CT Images | Sanhita Basu et.al. | 2510.03856 | null |
| 2025-10-03 | SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos | Amir Dellali et.al. | 2510.02916 | null |
| 2025-10-03 | Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models | Lihua Zhou et.al. | 2510.02750 | null |
| 2025-10-02 | SoundReactor: Frame-level Online Video-to-Audio Generation | Koichi Saito et.al. | 2510.02110 | null |
| 2025-09-30 | To Remember, To Adapt, To Preempt: A Stable Continual Test-Time Adaptation Framework for Remote Physiological Measurement in Dynamic Domain Shifts | Shuyang Chu et.al. | 2510.01282 | null |
| 2025-10-01 | PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation | Yujia Xiao et.al. | 2510.00485 | null |
| 2025-10-01 | VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors | Atif Belal et.al. | 2510.00458 | null |
| 2025-09-30 | Post-Training Quantization for Audio Diffusion Transformers | Tanmay Khandelwal et.al. | 2510.00313 | null |
| 2025-09-30 | Video Object Segmentation-Aware Audio Generation | Ilpo Viertola et.al. | 2509.26604 | null |
| 2025-09-30 | MARS: Audio Generation via Multi-Channel Autoregression on Spectrograms | Eleonora Ristori et.al. | 2509.26007 | null |
| 2025-09-30 | Annotation-Efficient Active Test-Time Adaptation with Conformal Prediction | Tingyu Shi et.al. | 2509.25692 | null |
| 2025-09-30 | Charge Transfer States in Donor Acceptor Bulk Heterojunctions as Triplet Triplet Annihilation Sensitizer for Solid-State Photon Upconversion | Maciej Klein et.al. | 2509.25679 | null |
| 2025-09-29 | EMO-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition | Jiacheng Shi et.al. | 2509.25495 | null |
| 2025-09-29 | A Robust Multi-Scale Framework with Test-Time Adaptation for sEEG-Based Speech Decoding | Suli Wang et.al. | 2509.24700 | null |
| 2025-09-29 | When Audio Generators Become Good Listeners: Generative Features for Understanding Tasks | Zeyu Xie et.al. | 2509.24635 | null |
| 2025-09-29 | Training-Free Multimodal Guidance for Video to Audio Generation | Eleonora Grassucci et.al. | 2509.24550 | null |
| 2025-10-01 | An Agent-Based Framework for Automated Higher-Voice Harmony Generation | Nia D'Souza Ganapathy et.al. | 2509.24463 | null |
| 2025-09-29 | UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities | Xuenan Xu et.al. | 2509.24391 | null |
| 2025-09-28 | AudioMoG: Guiding Audio Generation with Mixture-of-Guidance | Junyou Wang et.al. | 2509.23727 | null |
| 2025-09-26 | TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses | Sahar Dastani et.al. | 2509.22813 | null |
| 2025-09-25 | Prompt-aware classifier free guidance for diffusion models | Xuanhao Zhang et.al. | 2509.22728 | null |
| 2025-09-26 | Text2Move: Text-to-moving sound generation via trajectory prediction and temporal alignment | Yunyi Liu et.al. | 2509.21919 | null |
| 2025-09-25 | AIBA: Attention-based Instrument Band Alignment for Text-to-Audio Diffusion | Junyoung Koh et.al. | 2509.20891 | null |
| 2025-09-24 | MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization | Jianxuan Yang et.al. | 2509.19999 | null |
| 2025-09-25 | MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model | The Hieu Pham et.al. | 2509.19881 | null |
| 2025-09-24 | SCORE: Scaling audio generation using Standardized COmposite REwards | Jaemin Jung et.al. | 2509.19831 | null |
| 2025-09-23 | SynSonic: Augmenting Sound Event Detection through Text-to-Audio Diffusion ControlNet and Effective Sample Filtering | Jiarui Hai et.al. | 2509.18603 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-10-10 | MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation | Akira Takahashi et.al. | 2510.09065 | null |
| 2025-10-13 | Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation | Liyang Chen et.al. | 2510.08078 | null |
| 2025-10-09 | IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries | Harsh Kavediya et.al. | 2510.07837 | null |
| 2025-10-07 | FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders | Riccardo Fosco Gramaccioni et.al. | 2510.05829 | null |
| 2025-10-07 | StereoSync: Spatially-Aware Stereo Audio Generation from Video | Christian Marinoni et.al. | 2510.05828 | null |
| 2025-10-03 | SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos | Amir Dellali et.al. | 2510.02916 | null |
| 2025-10-02 | SoundReactor: Frame-level Online Video-to-Audio Generation | Koichi Saito et.al. | 2510.02110 | null |
| 2025-09-29 | Training-Free Multimodal Guidance for Video to Audio Generation | Eleonora Grassucci et.al. | 2509.24550 | null |
| 2025-09-28 | AudioMoG: Guiding Audio Generation with Mixture-of-Guidance | Junyou Wang et.al. | 2509.23727 | null |
| 2025-09-26 | WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM | Changli Tang et.al. | 2509.21990 | null |
| 2025-09-26 | Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers | Jibin Song et.al. | 2509.21893 | null |
| 2025-09-24 | MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization | Jianxuan Yang et.al. | 2509.19999 | null |
| 2025-10-05 | StereoFoley: Object-Aware Stereo Audio Generation from Video | Tornike Karchkhadze et.al. | 2509.18272 | null |
| 2025-09-19 | Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech | Xinlei Niu et.al. | 2509.15492 | null |
| 2025-09-19 | RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes | Fang Li et.al. | 2509.15123 | null |
| 2025-09-08 | MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation | Xiaoran Yang et.al. | 2509.06389 | null |
| 2025-09-05 | Efficient Video-to-Audio Generation via Multiple Foundation Models Mapper | Gehui Chen et.al. | 2509.04957 | null |
| 2025-08-23 | HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation | Sizhe Shan et.al. | 2508.16930 | null |
| 2025-08-19 | InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing | Shaoshu Yang et.al. | 2508.14033 | null |
| 2025-08-21 | FoleySpace: Vision-Aligned Binaural Spatial Audio Generation | Lei Zhao et.al. | 2508.12918 | null |
| 2025-08-14 | LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters | Haomin Zhang et.al. | 2508.11074 | null |
| 2025-08-12 | Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization | Chaoqun Cui et.al. | 2508.08550 | null |
| 2025-07-14 | DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis | Wenjie Tian et.al. | 2507.10109 | null |
| 2025-07-13 | Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation | Yingshan Liang et.al. | 2507.04959 | null |
| 2025-06-23 | Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions | Vineet Kumar Rakesh et.al. | 2507.02900 | null |
| 2025-07-03 | Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation | Feizhen Huang et.al. | 2507.02271 | null |
| 2025-06-23 | IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech | Siyi Zhou et.al. | 2506.21619 | null |
| 2025-06-28 | ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing | Huadai Liu et.al. | 2506.21448 | null |
| 2025-06-27 | Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance | Akio Hayakawa et.al. | 2506.20995 | null |
| 2025-06-24 | Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation | Jun Wang et.al. | 2506.19774 | null |
| 2025-06-13 | ViSAGe: Video-to-Spatial Audio Generation | Jaeyeon Kim et.al. | 2506.12199 | null |
| 2025-05-31 | Length Aware Speech Translation for Video Dubbing | Harveen Singh Chadha et.al. | 2506.00740 | null |
| 2025-05-26 | Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks | Chang Liu et.al. | 2505.20038 | link |
| 2025-05-22 | SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet | Zhi Zhong et.al. | 2505.16195 | null |
| 2025-05-30 | TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis | Yu Zhang et.al. | 2505.14910 | link |
| 2025-05-28 | Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model | Yong Ren et.al. | 2505.13062 | null |
| 2025-06-03 | OmniAudio: Generating Spatial Audio from 360-Degree Video | Huadai Liu et.al. | 2504.14906 | link |
| 2025-04-17 | CAFA: a Controllable Automatic Foley Artist | Roi Benita et.al. | 2504.06778 | link |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-10-22 | VBx for End-to-End Neural and Clustering-based Diarization | Petr Pálka et.al. | 2510.19572 | null |
| 2025-10-20 | Fast Agnostic Learners in the Plane | Talya Eden et.al. | 2510.18057 | null |
| 2025-10-20 | Joint upper Banach density, VC dimensions and Euclidean point configurations | Bruno Predojević et.al. | 2510.17453 | null |
| 2025-10-23 | The Parameterized Complexity of Computing the VC-Dimension | Florent Foucaud et.al. | 2510.17451 | null |
| 2025-10-18 | Truly Subquadratic Time Algorithms for Diameter and Related Problems in Graphs of Bounded VC-dimension | Timothy M. Chan et.al. | 2510.16346 | null |
| 2025-10-22 | VoiceMorph: How AI Voice Morphing Reveals the Boundaries of Auditory Self-Recognition | Kye Shimizu et.al. | 2510.16192 | null |
| 2025-10-16 | Deadlock-free routing for Full-mesh networks without using Virtual Channels | Alejandro Cano et.al. | 2510.14730 | null |
| 2025-10-15 | The VC-dimension and point configurations in | Alex Iosevich et.al. | 2510.13984 | null |
| 2025-10-16 | VC-Dimension vs Degree: An Uncertainty Principle for Boolean Functions | Fan Chang et.al. | 2510.13705 | null |
| 2025-10-15 | Model-assisted estimation for MRV: How to boost the economics of SOC sequestration projects without compromising on scientific integrity | Ahmad Awad et.al. | 2510.13609 | null |
| 2025-10-15 | Target Controllability Score | Kazuhiro Sato et.al. | 2510.13354 | null |
| 2025-10-14 | VCTR: A Transformer-Based Model for Non-parallel Voice Conversion | Maharnab Saikia et.al. | 2510.12964 | null |
| 2025-10-15 | (R)evolution of Programming: Vibe Coding as a Post-Coding Paradigm | Kevin Krings et.al. | 2510.12364 | null |
| 2025-10-13 | Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker | Cheng Gong et.al. | 2510.11124 | null |
| 2025-10-13 | VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents | Jiliang Hu et.al. | 2510.11098 | null |
| 2025-10-10 | A Scalable, Privacy-Preserving Decentralized Identity and Verifiable Data Sharing Framework based on Zero-Knowledge Proofs | Hui Yuan et.al. | 2510.09715 | null |
| 2025-10-10 | SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion | Zhao Guo et.al. | 2510.09245 | null |
| 2025-10-10 | O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion | Huu Tuong Tu et.al. | 2510.09061 | null |
| 2025-10-09 | MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows | Guobin Ma et.al. | 2510.08392 | null |
| 2025-10-09 | What Makes a Visualization Complex? | Mengdi Chu et.al. | 2510.08332 | null |
| 2025-10-09 | VoiceAgentBench: Are Voice Assistants ready for agentic tasks? | Dhruv Jain et.al. | 2510.07978 | null |
| 2025-10-06 | UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models | Wenhao Guan et.al. | 2510.04593 | null |
| 2025-10-05 | A Multilingual Framework for Dysarthria: Detection, Severity Classification, Speech-to-Text, and Clean Speech Generation | Ananya Raghu et.al. | 2510.03986 | null |
| 2025-10-03 | Online Learning in the Random Order Model | Martino Bernasconi et.al. | 2510.02820 | null |
| 2025-10-02 | Higher-arity PAC learning, VC dimension and packing lemma | Artem Chernikov et.al. | 2510.02420 | null |
| 2025-09-30 | BlockSDN-VC: A SDN-Based Virtual Coordinate-Enhanced Transaction Broadcast Framework for High-Performance Blockchains | Wenyang Jia et.al. | 2510.00306 | null |
| 2025-09-29 | MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech | Chengyao Wang et.al. | 2509.25131 | null |
| 2025-10-02 | Cofinal families of finite VC-dimension | Omer Ben-Neria et.al. | 2509.24744 | null |
| 2025-09-29 | VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning | Yixuan Zhou et.al. | 2509.24650 | null |
| 2025-09-29 | ISSE: An Instruction-Guided Speech Style Editing Dataset And Benchmark | Yun Chen et.al. | 2509.24570 | null |
| 2025-09-29 | Strong enhancement of d-wave superconductivity in an extended checkerboard Hubbard ladder | Xichen Huang et.al. | 2509.24415 | null |
| 2025-09-26 | ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection | Mohamed Maged et.al. | 2509.22808 | null |
| 2025-09-26 | Speaker Anonymisation for Speech-based Suicide Risk Detection | Ziyun Cui et.al. | 2509.22148 | null |
| 2025-09-25 | VC-Agent: An Interactive Agent for Customized Video Dataset Collection | Yidan Zhang et.al. | 2509.21291 | null |
| 2025-09-24 | Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation | Yang Cui et.al. | 2509.19812 | null |
| 2025-09-22 | Preconditioned Deformation Grids | Julian Kaltheuner et.al. | 2509.18097 | null |
| 2025-09-21 | MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances | Junhyeok Lee et.al. | 2509.17143 | null |
| 2025-09-20 | Advancing Reference-free Evaluation of Video Captions with Factual Analysis | Shubhashis Roy Dipta et.al. | 2509.16538 | null |
| 2025-09-19 | Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation | Qi Wang et.al. | 2509.16010 | null |
| 2025-09-19 | The Singing Voice Conversion Challenge 2025: From Singer Identity Conversion To Singing Style Conversion | Lester Phillip Violeta et.al. | 2509.15629 | null |
| 2025-09-18 | FCPE: A Fast Context-based Pitch Estimation Model | Yuxin Luo et.al. | 2509.15140 | null |
| 2025-09-18 | MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis | Keyu An et.al. | 2509.14784 | null |
| 2025-09-20 | Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis | Qingyu Liu et.al. | 2509.14579 | null |
| 2025-09-17 | VCBench: Benchmarking LLMs in Venture Capital | Rick Chen et.al. | 2509.14448 | null |
| 2025-09-16 | MSR-Codec: A Low-Bitrate Multi-Stream Residual Codec for High-Fidelity Speech Generation with Information Disentanglement | Jingyu Li et.al. | 2509.13068 | null |
| 2025-09-16 | A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis | Javeria Amir et.al. | 2509.12831 | null |
| 2025-09-15 | Preservation of Language Understanding Capabilities in Speech-aware Large Language Models | Marek Kubis et.al. | 2509.12171 | null |
| 2025-09-14 | Rate-Distortion Limits for Multimodal Retrieval: Theory, Optimal Codes, and Finite-Sample Guarantees | Thomas Y. Chen et.al. | 2509.11054 | null |
| 2025-09-11 | Altered Histories in Version Control System Repositories: Evidence from the Trenches | Solal Rapaport et.al. | 2509.09294 | null |
| 2025-09-11 | DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners | Xiaoxue Luo et.al. | 2509.09201 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-10-22 | PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis | Qing Mao et.al. | 2510.19527 | null |
| 2025-10-22 | GigaBrain-0: A World Model-Powered Vision-Language-Action Model | GigaBrain Team et.al. | 2510.19430 | null |
| 2025-10-22 | Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks | Kai Zeng et.al. | 2510.19195 | null |
| 2025-10-23 | Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning | Takehiro Aoshima et.al. | 2510.19193 | null |
| 2025-10-21 | MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models | Aritra Bhowmik et.al. | 2510.19022 | null |
| 2025-10-21 | UltraGen: High-Resolution Video Generation with Hierarchical Attention | Teng Hu et.al. | 2510.18775 | null |
| 2025-10-23 | A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition | Peiqin Zhuang et.al. | 2510.18705 | null |
| 2025-10-21 | MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation | Weinan Jia et.al. | 2510.18692 | null |
| 2025-10-21 | Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model | Zhenxing Zhang et.al. | 2510.18573 | null |
| 2025-10-22 | FeatureFool: Zero-Query Fooling of Video Models via Feature Map | Duoxun Tang et.al. | 2510.18362 | null |
| 2025-10-22 | OmniNWM: Omniscient Driving Navigation World Models | Bohan Li et.al. | 2510.18313 | null |
| 2025-10-20 | World-in-World: World Models in a Closed-Loop World | Jiahan Zhang et.al. | 2510.18135 | null |
| 2025-10-20 | Demystifying Transition Matching: When and Why It Can Beat Flow Matching | Jaihoon Kim et.al. | 2510.17991 | null |
| 2025-10-20 | ConsistEdit: Highly Consistent and Precise Training-free Visual Editing | Zixin Yin et.al. | 2510.17803 | null |
| 2025-10-22 | MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models | Yongshun Zhang et.al. | 2510.17519 | null |
| 2025-10-20 | From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models | Zefan Cai et.al. | 2510.17247 | null |
| 2025-10-19 | An empirical study of the effect of video encoders on Temporal Video Grounding | Ignacio M. De la Jara et.al. | 2510.17007 | null |
| 2025-10-19 | From Mannequin to Human: A Pose-Aware and Identity-Preserving Video Generation Framework for Lifelike Clothing Display | Xiangyu Mu et.al. | 2510.16833 | null |
| 2025-10-17 | VISTA: A Test-Time Self-Improving Video Generation Agent | Do Xuan Long et.al. | 2510.15831 | null |
| 2025-10-17 | Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset | Qingyan Bai et.al. | 2510.15742 | null |
| 2025-10-17 | DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion | Weijie Wang et.al. | 2510.15264 | null |
| 2025-10-16 | TGT: Text-Grounded Trajectories for Locally Controlled Video Generation | Guofeng Zhang et.al. | 2510.15104 | null |
| 2025-10-16 | RealDPO: Real or Not Real, that is the Preference | Guo Cheng et.al. | 2510.14955 | null |
| 2025-10-16 | DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation | Yu Zhou et.al. | 2510.14949 | null |
| 2025-10-16 | 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation | JoungBin Lee et.al. | 2510.14945 | null |
| 2025-10-16 | ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints | Meiqi Wu et.al. | 2510.14847 | null |
| 2025-10-16 | In-Context Learning with Unpaired Clips for Instruction-based Video Editing | Xinyao Liao et.al. | 2510.14648 | null |
| 2025-10-19 | STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding | Zhifei Chen et.al. | 2510.14588 | null |
| 2025-10-17 | Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning | Xiangyu Meng et.al. | 2510.14256 | null |
| 2025-10-16 | Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization | Liao Shen et.al. | 2510.14255 | null |
| 2025-10-16 | Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures | Yuancheng Xu et.al. | 2510.14179 | null |
| 2025-10-15 | PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning | Sihui Ji et.al. | 2510.13809 | null |
| 2025-10-15 | CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas | Zian Li et.al. | 2510.13669 | null |
| 2025-10-15 | VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator | Hyojun Go et.al. | 2510.13454 | null |
| 2025-10-15 | Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation | Yi Zuo et.al. | 2510.13084 | null |
| 2025-10-15 | Counting Hallucinations in Diffusion Models | Shuai Fu et.al. | 2510.13080 | null |
| 2025-10-14 | SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models | Zhengxu Tang et.al. | 2510.13042 | null |
| 2025-10-14 | MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars | Felix Taubner et.al. | 2510.12785 | null |
| 2025-10-14 | Time-Correlated Video Bridge Matching | Viacheslav Vasilev et.al. | 2510.12453 | null |
| 2025-10-14 | Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding | Ye Chen et.al. | 2510.12256 | null |
| 2025-10-14 | BIGFix: Bidirectional Image Generation with Token Fixing | Victor Besnier et.al. | 2510.12231 | null |
| 2025-10-14 | Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback | Xingpei Ma et.al. | 2510.12089 | null |
| 2025-10-14 | VIDMP3: Video Editing by Representing Motion with Pose and Position Priors | Sandeep Mishra et.al. | 2510.12069 | null |
| 2025-10-13 | Point Prompting: Counterfactual Tracking with Video Diffusion Models | Ayush Shrivastava et.al. | 2510.11715 | null |
| 2025-10-13 | IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment | Yinan Chen et.al. | 2510.11647 | null |
| 2025-10-13 | MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps | Jiahui Lei et.al. | 2510.11107 | null |
| 2025-10-12 | AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes | Yu Li et.al. | 2510.10670 | null |
| 2025-10-12 | DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis | Peiyin Chen et.al. | 2510.10650 | null |
| 2025-10-10 | Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians | Jin-Chuan Shi et.al. | 2510.09438 | null |
| 2025-10-10 | Stable Video Infinity: Infinite-Length Video Generation with Error Recycling | Wuyang Li et.al. | 2510.09212 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-10-22 | Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing | Yusu Qian et.al. | 2510.19808 | null |
| 2025-10-22 | The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models | Xiaofeng Zhang et.al. | 2510.19557 | null |
| 2025-10-22 | Predicting before Reconstruction: A generative prior framework for MRI acceleration | Juhyung Park et.al. | 2510.19472 | null |
| 2025-10-22 | D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation | Nobline Yoo et.al. | 2510.19278 | null |
| 2025-10-21 | DP | Rongyuan Wu et.al. | 2510.18851 | null |
| 2025-10-21 | SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation | Siyong Jian et.al. | 2510.18716 | null |
| 2025-10-21 | UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation | Yibin Wang et.al. | 2510.18701 | null |
| 2025-10-21 | From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation | Ziwei Huang et.al. | 2510.18263 | null |
| 2025-10-21 | Beyond Frequency: Scoring-Driven Debiasing for Object Detection via Blueprint-Prompted Image Synthesis | Xinhao Cai et.al. | 2510.18229 | null |
| 2025-10-22 | Chimera: Compositional Image Generation using Part-based Concepting | Shivam Singh et.al. | 2510.18083 | null |
| 2025-10-20 | Fine-tuning Flow Matching Generative Models with Intermediate Feedback | Jiajun Fan et.al. | 2510.18072 | null |
| 2025-10-20 | Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models | Jiajun Fan et.al. | 2510.18053 | null |
| 2025-10-20 | Inference-Time Compute Scaling For Flow Matching | Adam Stecklov et.al. | 2510.17786 | null |
| 2025-10-20 | VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models | Qilin Liao et.al. | 2510.17759 | null |
| 2025-10-21 | PICABench: How Far Are We from Physically Realistic Image Editing? | Yuandong Pu et.al. | 2510.17681 | null |
| 2025-10-21 | CaMiT: A Time-Aware Car Model Dataset for Classification and Generation | Frédéric LIN et.al. | 2510.17626 | null |
| 2025-10-20 | Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling | Feihong Yan et.al. | 2510.17171 | null |
| 2025-10-20 | In-situ Autoguidance: Eliciting Self-Correction in Diffusion Models | Enhao Gu et.al. | 2510.17136 | null |
| 2025-10-19 | One-step Diffusion Models with Bregman Density Ratio Matching | Yuanzhi Zhu et.al. | 2510.16983 | null |
| 2025-10-21 | Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback | Zongjian Li et.al. | 2510.16888 | null |
| 2025-10-19 | Class-N-Diff: Classification-Induced Diffusion Model Can Make Fair Skin Cancer Diagnosis | Nusrat Munia et.al. | 2510.16887 | null |
| 2025-10-19 | Region in Context: Text-condition Image editing with Human-like semantic reasoning | Thuy Phuong Vu et.al. | 2510.16772 | null |
| 2025-10-17 | BLIP3o-NEXT: Next Frontier of Native Image Generation | Jiuhai Chen et.al. | 2510.15857 | null |
| 2025-10-17 | Controlling the image generation process with parametric activation functions | Ilia Pavlov et.al. | 2510.15778 | null |
| 2025-10-17 | NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation | Yitong Sun et.al. | 2510.15752 | null |
| 2025-10-17 | Unimedvl: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis | Junzhi Ning et.al. | 2510.15710 | null |
| 2025-10-17 | Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation | Xiaoming Zhu et.al. | 2510.15564 | null |
| 2025-10-16 | Salient Concept-Aware Generative Data Augmentation | Tianchen Zhao et.al. | 2510.15194 | null |
| 2025-10-16 | Constantly Improving Image Models Need Constantly Improving Benchmarks | Jiaxin Ge et.al. | 2510.15021 | link |
| 2025-10-16 | Coupled Diffusion Sampling for Training-Free Multi-View Image Editing | Hadi Alzayer et.al. | 2510.14981 | null |
| 2025-10-16 | Learning an Image Editing Model without Image Editing Pairs | Nupur Kumari et.al. | 2510.14978 | link |
| 2025-10-16 | WithAnyone: Towards Controllable and ID Consistent Image Generation | Hengyuan Xu et.al. | 2510.14975 | null |
| 2025-10-16 | ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention | Keli Liu et.al. | 2510.14882 | null |
| 2025-10-16 | FraQAT: Quantization Aware Training with Fractional bits | Luca Morreale et.al. | 2510.14823 | null |
| 2025-10-16 | In-Context Learning with Unpaired Clips for Instruction-based Video Editing | Xinyao Liao et.al. | 2510.14648 | null |
| 2025-10-16 | Adapting Self-Supervised Representations as a Latent Space for Efficient Generation | Ming Gui et.al. | 2510.14630 | null |
| 2025-10-16 | Consistent text-to-image generation via scene de-contextualization | Song Tang et.al. | 2510.14553 | null |
| 2025-10-16 | Exploring Image Representation with Decoupled Classical Visual Descriptors | Chenyuan Qu et.al. | 2510.14536 | null |
| 2025-10-16 | Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models | Yunze Tong et.al. | 2510.14526 | null |
| 2025-10-15 | Generative Universal Verifier as Multimodal Meta-Reasoner | Xinchen Zhang et.al. | 2510.13804 | null |
| 2025-10-15 | Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation | Yifu Luo et.al. | 2510.13418 | null |
| 2025-10-15 | End-to-End Multi-Modal Diffusion Mamba | Chunhao Lu et.al. | 2510.13253 | null |
| 2025-10-15 | Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation | Yi Zuo et.al. | 2510.13084 | null |
| 2025-10-15 | Counting Hallucinations in Diffusion Models | Shuai Fu et.al. | 2510.13080 | null |
| 2025-10-14 | UniFusion: Vision-Language Model as Unified Encoder in Image Generation | Kevin Li et.al. | 2510.12789 | null |
| 2025-10-14 | SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models | Weiyang Jin et.al. | 2510.12784 | null |
| 2025-10-14 | LayerSync: Self-aligning Intermediate Layers | Yasaman Haghighi et.al. | 2510.12581 | null |
| 2025-10-14 | AngularFuse: A Closer Look at Angle-based Perception for Spatial-Sensitive Multi-Modality Image Fusion | Xiaopeng Liu et.al. | 2510.12260 | null |
| 2025-10-14 | Local Background Features Matter in Out-of-Distribution Detection | Jinlun Ye et.al. | 2510.12259 | null |
| 2025-10-14 | FedMMKT:Co-Enhancing a Server Text-to-Image Model and Client Task Models in Multi-Modal Federated Learning | Ningxin He et.al. | 2510.12254 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-10-21 | Steering Autoregressive Music Generation with Recursive Feature Machines | Daniel Zhao et.al. | 2510.19127 | null |
| 2025-10-18 | MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding | Jingyue Huang et.al. | 2510.16273 | null |
| 2025-10-16 | Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics? | Qixin Deng et.al. | 2510.14249 | null |
| 2025-10-15 | UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE | Zhenyu Liu et.al. | 2510.13344 | null |
| 2025-10-17 | MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations | Wenxiang Guo et.al. | 2510.10396 | null |
| 2025-10-11 | ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis | Stephen Ni-Hahn et.al. | 2510.10249 | null |
| 2025-10-07 | LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment | Jiahao Mei et.al. | 2510.05875 | null |
| 2025-10-02 | Bias beyond Borders: Global Inequalities in AI-Generated Music | Ahmet Solak et.al. | 2510.01963 | null |
| 2025-10-15 | SAGE-Music: Low-Latency Symbolic Music Generation via Attribute-Specialized Key-Value Head Sharing | Jiaye Tan et.al. | 2510.00395 | null |
| 2025-10-04 | HNote: Extending YNote with Hexadecimal Encoding for Fine-Tuning LLMs in Music Modeling | Hung-Ying Chu et.al. | 2509.25694 | null |
| 2025-09-29 | Ethics Statements in AI Music Papers: The Effective and the Ineffective | Julia Barnett et.al. | 2509.25496 | null |
| 2025-09-29 | Discovering "Words" in Music: Unsupervised Learning of Compositional Sparse Code for Symbolic Music | Tianle Wang et.al. | 2509.24603 | null |
| 2025-10-01 | An Agent-Based Framework for Automated Higher-Voice Harmony Generation | Nia D'Souza Ganapathy et.al. | 2509.24463 | null |
| 2025-09-28 | Time-Shifted Token Scheduling for Symbolic Music Generation | Ting-Kang Wang et.al. | 2509.23749 | null |
| 2025-09-28 | AudioMoG: Guiding Audio Generation with Mixture-of-Guidance | Junyou Wang et.al. | 2509.23727 | null |
| 2025-09-27 | AI-Assisted Music Production: A User Study on Text-to-Music Models | Francesca Ronchini et.al. | 2509.23364 | null |
| 2025-09-26 | Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach | Zijian Zhao et.al. | 2509.22378 | null |
| 2025-09-26 | MusicWeaver: Coherent Long-Range and Editable Music Generation from a Beat-Aligned Structural Plan | Xuanchen Wang et.al. | 2509.21714 | null |
| 2025-09-21 | Difficulty-Aware Score Generation for Piano Sight-Reading | Pedro Ramoneda et.al. | 2509.16913 | null |
| 2025-09-17 | Assessing Data Replication in Symbolic Music via Adapted Structural Similarity Index Measure | Shulei Ji et.al. | 2509.13658 | null |
| 2025-09-13 | A Traditional Approach to Symbolic Piano Continuation | Christian Zhou-Zheng et.al. | 2509.12267 | null |
| 2025-09-14 | Decoding Musical Origins: Distinguishing Human and AI Composers | Cheng-Yang Tsai et.al. | 2509.11369 | null |
| 2025-09-14 | STASE: A spatialized text-to-audio synthesis engine for music generation | Tutti Chi et.al. | 2509.11124 | null |
| 2025-09-10 | Segment Transformer: AI-Generated Music Detection via Music Structural Analysis | Yumin Kim et.al. | 2509.08283 | null |
| 2025-09-09 | Continuous Audio Language Models | Simon Rouard et.al. | 2509.06926 | null |
| 2025-09-24 | No Encore: Unlearning as Opt-Out in Music Generation | Jinju Kim et.al. | 2509.06277 | null |
| 2025-09-07 | UniVerse-1: Unified Audio-Video Generation via Stitching of Experts | Duomin Wang et.al. | 2509.06155 | null |
| 2025-09-04 | PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music | Hayeon Bang et.al. | 2509.04215 | null |
| 2025-09-03 | Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings | Dyah A. M. G. Wisnu et.al. | 2509.03292 | null |
| 2025-09-01 | The AudioMOS Challenge 2025 | Wen-Chin Huang et.al. | 2509.01336 | null |
| 2025-08-31 | TinyMusician: On-Device Music Generation with Knowledge Distillation and Mixed Precision Quantization | Hainan Wang et.al. | 2509.00914 | null |
| 2025-09-04 | AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation | Gyehun Go et.al. | 2509.00813 | null |
| 2025-08-31 | The Name-Free Gap: Policy-Aware Stylistic Control in Music Generation | Ashwin Nagarajan et.al. | 2509.00654 | null |
| 2025-08-24 | A Survey on Evaluation Metrics for Music Generation | Faria Binte Kader et.al. | 2509.00051 | null |
| 2025-08-28 | Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music | Hongju Su et.al. | 2508.20665 | null |
| 2025-08-27 | The IRMA Dataset: A Structured Audio-MIDI Corpus for Iranian Classical Music | Sepideh Shafiei et.al. | 2508.19876 | null |
| 2025-08-27 | CompLex: Music Theory Lexicon Constructed by Autonomous Agents for Automatic Music Generation | Zhejing Hu et.al. | 2508.19603 | null |
| 2025-08-08 | MuSpike: A Benchmark and Evaluation Framework for Symbolic Music Generation with Spiking Neural Networks | Qian Liang et.al. | 2508.19251 | null |
| 2025-08-12 | QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems | Chien-Chun Wang et.al. | 2508.08957 | null |
| 2025-08-12 | Opening Musical Creativity? Embedded Ideologies in Generative-AI Music Systems | Liam Pram et.al. | 2508.08805 | null |
| 2025-08-08 | Live Music Models | Lyria Team et.al. | 2508.04651 | link |
| 2025-08-03 | Automatic Melody Reduction via Shortest Path Finding | Ziyu Wang et.al. | 2508.01571 | null |
| 2025-07-31 | DeformTune: A Deformable XAI Music Prototype for Non-Musicians | Ziqing Xu et.al. | 2508.00160 | null |
| 2025-07-31 | "I made this (sort of)": Negotiating authorship, confronting fraudulence, and exploring new musical spaces with prompt-based AI music generation | Bob L. T. Sturm et.al. | 2507.23365 | null |
| 2025-07-28 | Music Arena: Live Evaluation for Text-to-Music | Yonghyun Kim et.al. | 2507.20900 | null |
| 2025-07-28 | Controllable Video-to-Music Generation with Multiple Time-Varying Conditions | Junxian Wu et.al. | 2507.20627 | null |
| 2025-07-27 | Diffusion-based Symbolic Music Generation with Structured State Space Models | Shenghua Yuan et.al. | 2507.20128 | null |
| 2025-08-07 | SAMUeL: Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion | Hei Shing Cheung et.al. | 2507.19991 | null |
| 2025-07-17 | A new XML conversion process for mensural music encoding : CMME_to_MEI (via Verovio) | David Fiala et.al. | 2507.15991 | null |
| 2025-07-17 | WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling | Qihui Yang et.al. | 2507.10534 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-10-19 | SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization | Wenxi Chen et.al. | 2510.16841 | null |
| 2025-10-19 | U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation | Xusheng Yang et.al. | 2510.16718 | null |
| 2025-10-17 | LDCodec: A high quality neural audio codec with low-complexity decoder | Jiawei Jiang et.al. | 2510.15364 | null |
| 2025-10-17 | Extending Audio Context for Long-Form Understanding in Large Audio-Language Models | Yuatyong Chaichana et.al. | 2510.15231 | null |
| 2025-10-17 | LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models | Xiaohan Zhao et.al. | 2510.15227 | null |
| 2025-10-16 | TASLA: Text-Aligned Speech Tokens with Multiple Layer-Aggregation | Ming-Hao Hsu et.al. | 2510.14934 | null |
| 2025-10-15 | Acoustic Teleportation via Disentangled Neural Audio Codec Representations | Philipp Grundhuber et.al. | 2510.13221 | null |
| 2025-10-13 | UALM: Unified Audio Language Model for Understanding, Generation and Reasoning | Jinchuan Tian et.al. | 2510.12000 | null |
| 2025-10-13 | BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis | Jingyuan Xing et.al. | 2510.11646 | null |
| 2025-10-12 | FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec | Yurii Halychanskyi et.al. | 2510.10785 | null |
| 2025-10-11 | SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation | Zeyu Ling et.al. | 2510.10069 | null |
| 2025-10-11 | MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction | Jianjin Wang et.al. | 2510.10003 | null |
| 2025-10-10 | SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion | Zhao Guo et.al. | 2510.09245 | null |
| 2025-10-08 | AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs | Peize He et.al. | 2510.07293 | null |
| 2025-10-07 | Latent Speech-Text Transformer | Yen-Ju Lu et.al. | 2510.06195 | null |
| 2025-10-07 | EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS | Haoxun Li et.al. | 2510.05758 | null |
| 2025-10-06 | UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models | Wenhao Guan et.al. | 2510.04593 | null |
| 2025-10-04 | Désentrelacement Fréquentiel Doux pour les Codecs Audio Neuronaux | Benoît Giniès et.al. | 2510.03741 | null |
| 2025-10-04 | Soft Disentanglement in Frequency Bands for Neural Audio Codecs | Benoit Ginies et.al. | 2510.03735 | null |
| 2025-10-02 | High-Fidelity Speech Enhancement via Discrete Audio Tokens | Luca A. Lanzendörfer et.al. | 2510.02187 | null |
| 2025-10-02 | MelCap: A Unified Single-Codebook Neural Codec for High-Fidelity Audio Compression | Jingyi Li et.al. | 2510.01903 | null |
| 2025-10-02 | FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates | Jiaqi Li et.al. | 2510.00981 | null |
| 2025-10-07 | Baseline Systems For The 2025 Low-Resource Audio Codec Challenge | Yusuf Ziya Isik et.al. | 2510.00264 | null |
| 2025-09-30 | Scaling Spoken Language Models with Syllabic Speech Tokenization | Nicholas Lee et.al. | 2509.26634 | null |
| 2025-09-30 | Optimizing Speech Language Models for Acoustic Consistency | Morteza Rohanian et.al. | 2509.26276 | null |
| 2025-09-29 | MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech | Chengyao Wang et.al. | 2509.25131 | null |
| 2025-09-29 | VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning | Yixuan Zhou et.al. | 2509.24650 | null |
| 2025-09-29 | Assessing speech quality metrics for evaluation of neural audio codecs under clean speech conditions | Wolfgang Mack et.al. | 2509.24457 | null |
| 2025-09-26 | StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs | Yuhan Song et.al. | 2509.22220 | null |
| 2025-09-26 | Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling | Junjie Cao et.al. | 2509.22062 | null |
| 2025-09-26 | AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook | Yushen Chen et.al. | 2509.21968 | null |
| 2025-09-25 | X-Streamer: Unified Human World Modeling with Audiovisual Interaction | You Xie et.al. | 2509.21574 | null |
| 2025-09-24 | Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens | Ismail Rasim Ulgen et.al. | 2509.20485 | null |
| 2025-09-25 | From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training | Tianqiao Liu et.al. | 2509.20072 | null |
| 2025-09-24 | Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens | Pin-Jui Ku et.al. | 2509.20060 | null |
| 2025-09-25 | Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration | Yifan Yang et.al. | 2509.19928 | null |
| 2025-09-24 | Eliminating stability hallucinations in llm-based tts models via attention guidance | ShiMing Wang et.al. | 2509.19852 | null |
| 2025-09-23 | Improving Test-Time Performance of RVQ-based Neural Codecs | Hyeongju Kim et.al. | 2509.19186 | null |
| 2025-09-23 | Enhancing Noise Robustness for Neural Speech Codecs through Resource-Efficient Progressive Quantization Perturbation Simulation | Rui-Chen Zheng et.al. | 2509.19025 | null |
| 2025-09-23 | HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS | Sihang Nie et.al. | 2509.19001 | null |
| 2025-09-23 | Direct Preference Optimization for Speech Autoregressive Diffusion Models | Zhijun Liu et.al. | 2509.18928 | null |
| 2025-09-23 | Towards Evaluating Generative Audio: Insights from Neural Audio Codec Embedding Distances | Arijit Biswas et.al. | 2509.18823 | null |
| 2025-09-22 | Does Audio Matter for Modern Video-LLMs and Their Benchmarks? | Geewook Kim et.al. | 2509.17901 | null |
| 2025-09-22 | Qwen3-Omni Technical Report | Jin Xu et.al. | 2509.17765 | null |
| 2025-09-21 | MBCodec:Thorough disentangle for high-fidelity audio compression | Ruonan Zhang et.al. | 2509.17006 | null |
| 2025-09-19 | FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation | Luca Della Libera et.al. | 2509.16195 | null |
| 2025-09-19 | VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency | Nikita Torgashov et.al. | 2509.15969 | null |
| 2025-09-18 | A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication | Ryan Collette et.al. | 2509.15462 | null |
| 2025-09-18 | MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis | Keyu An et.al. | 2509.14784 | null |
| 2025-09-17 | A High-Quality and Low-Complexity Streamable Neural Speech Codec with Knowledge Distillation | En-Wei Zhang et.al. | 2509.13670 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-10-21 | MMAO-Bench: MultiModal All in One Benchmark Reveals Compositional Law between Uni-modal and Omni-modal in OmniModels | Chen Chen et.al. | 2510.18915 | null |
| 2025-10-20 | Hearing Health in Home Healthcare: Leveraging LLMs for Illness Scoring and ALMs for Vocal Biomarker Extraction | Yu-Wen Chen et.al. | 2510.18169 | null |
| 2025-10-20 | SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering | Weilin Lin et.al. | 2510.17633 | null |
| 2025-10-21 | LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding | ZhaoYang Han et.al. | 2510.17305 | null |
| 2025-10-22 | OmniVIC: A Self-Improving Variable Impedance Controller with Vision-Language In-Context Learning for Safe Robotic Manipulation | Heng Zhang et.al. | 2510.17150 | null |
| 2025-10-19 | SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models | Chih-Kai Yang et.al. | 2510.16917 | null |
| 2025-10-19 | Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations | Bo-Han Feng et.al. | 2510.16893 | null |
| 2025-10-19 | The Augmented Lagrangian Methods: Overview and Recent Advances | Kangkang Deng et.al. | 2510.16827 | null |
| 2025-10-17 | OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM | Hanrong Ye et.al. | 2510.15870 | null |
| 2025-10-17 | Extending Audio Context for Long-Form Understanding in Large Audio-Language Models | Yuatyong Chaichana et.al. | 2510.15231 | null |
| 2025-10-16 | XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models | Xingrui Wang et.al. | 2510.15148 | null |
| 2025-10-15 | Yamaji effect in models of underdoped cuprates | Jing-Yu Zhao et.al. | 2510.13943 | null |
| 2025-10-15 | Generative Universal Verifier as Multimodal Meta-Reasoner | Xinchen Zhang et.al. | 2510.13804 | null |
| 2025-10-15 | InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue | Wenwen Tong et.al. | 2510.13747 | null |
| 2025-10-16 | NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching | Run Luo et.al. | 2510.13721 | null |
| 2025-10-14 | Adaptive vector steering: A training-free, layer-wise intervention for hallucination mitigation in large audio and multimodal models | Tsung-En Lin et.al. | 2510.12851 | null |
| 2025-10-14 | Detect Anything via Next Point Prediction | Qing Jiang et.al. | 2510.12798 | null |
| 2025-10-14 | Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception | Ziyang Ma et.al. | 2510.12720 | null |
| 2025-10-15 | SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model | Lin Lin et.al. | 2510.12709 | null |
| 2025-10-14 | The spin Hall conductivity in the hole-doped bilayer Haldane-Hubbard model with odd-parity ALM | Minghuan Zeng et.al. | 2510.12602 | null |
| 2025-10-14 | Not in Sync: Unveiling Temporal Bias in Audio Chat Models | Jiayu Yao et.al. | 2510.12185 | null |
| 2025-10-14 | An AI-Based Behavioral Health Safety Filter and Dataset for Identifying Mental Health Crises in Text-Based Conversations | Benjamin W. Nelson et.al. | 2510.12083 | null |
| 2025-10-13 | Bridging the gap between ultrafast optics and resonant photonics via omni-resonance | Abbas Shiri et.al. | 2510.12002 | null |
| 2025-10-13 | UALM: Unified Audio Language Model for Understanding, Generation and Reasoning | Jinchuan Tian et.al. | 2510.12000 | null |
| 2025-10-13 | ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments? | Liu Yang et.al. | 2510.11549 | null |
| 2025-10-13 | Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning | Kuan-Yi Lee et.al. | 2510.11454 | null |
| 2025-10-13 | Optimizing Cross-Domain Transfer for Universal Machine Learning Interatomic Potentials | Jaesun Kim et.al. | 2510.11241 | null |
| 2025-10-13 | VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents | Jiliang Hu et.al. | 2510.11098 | null |
| 2025-10-12 | OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs | Caorui Li et.al. | 2510.10689 | null |
| 2025-10-12 | Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance | Jingyi Chen et.al. | 2510.10444 | null |
| 2025-10-14 | Integration of the TIAGo Robot into Isaac Sim with Mecanum Drive Modeling and Learned S-Curve Velocity Profiles | Vincent Schoenbach et.al. | 2510.10273 | null |
| 2025-10-10 | HANDO: Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation | Jingyuan Sun et.al. | 2510.09221 | null |
| 2025-10-08 | Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization | Rui Hu et.al. | 2510.08618 | null |
| 2025-10-09 | An efficient algorithm for kernel quantile regression | Shengxiang Deng et.al. | 2510.07929 | null |
| 2025-10-08 | AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMS with Audio-visual Cues | Krish Patel et.al. | 2510.07355 | null |
| 2025-10-08 | AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs | Peize He et.al. | 2510.07293 | null |
| 2025-10-07 | Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding | Yi Xin et.al. | 2510.06308 | null |
| 2025-10-07 | AQA-TTRL: Self-Adaptation in Audio Question Answering with Test-Time Reinforcement Learning | Haoyu Zhang et.al. | 2510.05478 | null |
| 2025-10-06 | Observation and modeling of a geo-effective event observed on 2011 May 28 from the solar surface to 1au | Nishu Karna et.al. | 2510.05334 | null |
| 2025-10-06 | AURA Score: A Metric For Holistic Audio Question Answering Evaluation | Satvik Dixit et.al. | 2510.04934 | null |
| 2025-10-06 | Robustness assessment of large audio language models in multiple-choice evaluation | Fernando López et.al. | 2510.04584 | null |
| 2025-10-03 | Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video | Mengyao Xu et.al. | 2510.03458 | null |
| 2025-10-03 | AudioToolAgent: An Agentic Framework for Audio-Language Models | Gijs Wijngaard et.al. | 2510.02995 | null |
| 2025-10-02 | Broadband entangled-photon omni-resonance in a planar optical cavity | Bryan L. Turo et.al. | 2510.01595 | null |
| 2025-10-01 | Hearing the Order: Investigating Selection Bias in Large Audio-Language Models | Yu-Xiang Lin et.al. | 2510.00628 | null |
| 2025-10-01 | When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models | Chen-An Li et.al. | 2510.00626 | null |
| 2025-10-01 | Multi-level Dynamic Style Transfer for NeRFs | Zesheng Li et.al. | 2510.00592 | null |
| 2025-09-30 | TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics | Yi-Cheng Lin et.al. | 2509.26329 | null |
| 2025-09-30 | OmniDFA: A Unified Framework for Open Set Synthesis Image Detection and Few-Shot Attribution | Shiyu Wu et.al. | 2509.25682 | null |
| 2025-09-29 | EMO-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition | Jiacheng Shi et.al. | 2509.25495 | null |