26th Interspeech 2025: Rotterdam, The Netherlands
- Odette Scharenborg, Catharine Oertel, Khiet Truong: 26th Annual Conference of the International Speech Communication Association, Interspeech 2025, Rotterdam, The Netherlands, 17-21 August 2025. ISCA 2025
Keynote 1 - Roger Moore: From Talking and Listening Devices to Intelligent Communicative Machines
- Roger K. Moore: From Talking and Listening Devices to Intelligent Communicative Machines.
Spoken Machine Translation 1
- Luca Ducceschi, Greta H. Franzini: Speech transcription from South Tyrolean Dialect to Standard German with Whisper.
- Aswin Shanmugam Subramanian, Harveen Singh Chadha, Vikas Joshi, Shubham Bansal, Jian Xue, Rupeshkumar Mehta, Jinyu Li: Length Aware Speech Translation for Video Dubbing.
- Vishal Kumar, Vinayak Abrol: ArticulateX: End-to-End Monolingual Speech Translation in Articulator Space.
- Jiale Ou, Hongying Zan: CMSP-ST: Cross-modal Mixup with Speech Purification for End-to-End Speech Translation.
- Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi: End-to-End Speech Translation Guided by Robust Translation Capability of Large Language Model.
- Yu Pu, Xiaoqian Liu, Guangyu Zhang, Zheng Yan, Wei-Qiang Zhang, Xie Chen: Empowering Large Language Models for End-to-End Speech Translation Leveraging Synthetic Data.
- Gerard I. Gállego, Oriol Pareras, Martí Cortada Garcia, Lucas Takanori, Javier Hernando: Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios.
- Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe: Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs.
- Aishwarya Pothula, Bhavana Akkiraju, Srihari Bandarupalli, Charan Devarkonda, Santosh Kesiraju, Anil Kumar Vuppala: End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data.
- Shaowen Wang, Xinyuan Chen, Yao Xu: Self-Improvement for Audio Large Language Model using Unlabeled Speech.
Real-time Speech Enhancement
- Yan Ru Pei, Ritik Shrivastava, Sidharth: Optimized Real-time Speech Enhancement with Deep SSMs on Raw Audio.
- Shenghui Lu, Hukai Huang, Jinanglong Yao, Kaidi Wang, Qingyang Hong, Lin Li: A Two-Stage Hierarchical Deep Filtering Framework for Real-Time Speech Enhancement.
- Teng Aleksandra Ma, Sile Yin, Li-Chia Yang, Shuo Zhang: Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations.
- Yonghun Song, Yeeun Kim, Yoonyoung Chung: Lightweight Speech Enhancement Model Based on Harmonic Attention and Phase Estimation with Skin-Attachable Accelerometer.
- Yi Gao, Hangting Chen, Siyu Zhang, Qingshan Yang, Jingcong Chen: TSDT-Net: Ultra-Low-Complexity Two-Stage Model Combining Dual-Path-Transformer and Transform-Average-Concatenate Network for Speech Enhancement.
- Chidambar B, Hanumanth Rao Naidu: Structured Codebook Based Hierarchical Framework for DNN for Computationally Efficient Speech Enhancement.
Multilinguality, Cross-linguistic Studies, L2 Speech
- Qian Zhou, Mathilde Hutin: Evaluation of Three Automatic Alignment Tools for the Processing of Non-native French.
- Hongchen Wu, Yixin Gu: CrossPhon: An Auto Phone Mapping Tool to Streamline Cross-language Modeling for Phone Alignment of Low-resource Languages.
- Ryo Magoshi, Shinsuke Sakai, Jaeyoung Lee, Tatsuya Kawahara: Multi-lingual and Zero-Shot Speech Recognition by Incorporating Classification of Language-Independent Articulatory Features.
- Haley Hsu, Dani Byrd, Khalil Iskarous, Louis Goldstein: Instantaneous changes in acoustic signals reflect syllable progression and cross-linguistic syllable variation.
- Linda Bakkouche, Brechtje Post: Influence of Proficiency and L2 Experience on Dynamic Spectral Cue Utilization in L2 Vowel Perception and Production.
- Kakeru Yazawa, Takayuki Konishi: A Bayesian Approach to L2 Fluency Ratings by Native and Nonnative Listeners.
- Silke Hamann, Andrea Alicehajic: Are loan sequences different from foreign sequences? A perception study with Japanese listeners on coronal obstruent - high front vowel sequences.
- Le Xuan Chan, Annika Heuser: Relative cue weighting in multilingual stop voicing production.
- Hannah White, Joshua Penney, Felicity Cox: Variability in Intervocalic /t/ and Community Diversity in Australian English.
Speech Emotion Recognition 1
- Pravin Mote, Donita Robinson, Elizabeth Richerson, Carlos Busso: Vector Quantized Cross-lingual Unsupervised Domain Adaptation for Speech Emotion Recognition.
- Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma: HYFuse: Aligning Heterogeneous Speech Pre-Trained Representations in Hyperbolic Space for Speech Emotion Recognition.
- Shi-Xin Fang, Liang-Yeh Shen, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee: Meta-PerSER: Few-Shot Listener Personalized Speech Emotion Recognition via Meta-learning.
- Yi Chang, Zhao Ren, Zhonghao Zhao, Thanh Tam Nguyen, Kun Qian, Tanja Schultz, Björn W. Schuller: Breaking Resource Barriers in Speech Emotion Recognition via Data Distillation.
- Mehedi Hasan Bijoy, Dejan Porjazovski, Tamás Grósz, Mikko Kurimo: Multi-Teacher Language-Aware Knowledge Distillation for Multilingual Speech Emotion Recognition.
- Ziwei Gong, Pengyuan Shi, Kaan Donbekci, Lin Ai, Run Chen, David Sasu, Zehui Wu, Julia Hirschberg: Learning More with Less: Self-Supervised Approaches for Low-Resource Speech Emotion Recognition.
Multimodal Resources
- Yong Ren, Chenxing Li, Le Xu, Hao Gu, Duzhen Zhang, Yujie Chen, Manjie Xu, Ruibo Fu, Shan Yang, Dong Yu: Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model.
- Le Xu, Chenxing Li, Yong Ren, Yujie Chen, Yu Gu, Ruibo Fu, Shan Yang, Dong Yu: Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning.
- Thai-Binh Nguyen, Thi Van Nguyen, Quoc Truong Do, Chi Mai Luong: ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition.
- Yihan Wu, Yichen Lu, Yijing Chen, Jiaqi Song, William Chen, Ruihua Song, Shinji Watanabe: GALAXY: A Large-Scale Open-Domain Dataset for Multimodal Learning.
- Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, Eng Siong Chng: FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems.
- Sho Inoue, Shuai Wang, Haizhou Li: PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs.
- Boya Dong, Wentao Lei, Li Liu: FFD: Fine-Finger Diffusion Model for Music to Fine-grained Finger Dance Generation.
- Manjie Xu, Chenxing Li, Yong Ren, Xinyi Tu, Ruibo Fu, Wei Liang, Dong Yu: Towards Diverse and Efficient Audio Captioning via Diffusion Models.
- Amit Sofer, Yoav Goldman, Shlomo E. Chazan: Pull It Together: Reducing the Modality Gap in Contrastive Learning.
Interpretability in Audio and Speech Technology
- Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D. Plumbley: EnvSDD: Benchmarking Environmental Sound Deepfake Detection.
- Dennis Fucci, Marco Gaido, Matteo Negri, Mauro Cettolo, Luisa Bentivogli: Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution.
- Cecilia Bolaños, Leonardo Pepino, Martín Meza, Luciana Ferrer: Benchmarking Time-localized Explanations for Audio Classification Models.
- Andrew Chang, Yike Li, Iran R. Roman, David Poeppel: Spectrotemporal Modulation: Efficient and Interpretable Feature Representation for Classifying Speech, Music, and Environmental Sounds.
- Kentaro Onda, Keisuke Imoto, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu: Discrete Tokens Exhibit Interlanguage Speech Intelligibility Benefit: an Analytical Study Towards Accent-robust ASR Only with Native Speech Data.
- Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Kohei Matsuura, Shota Horiguchi: Analysis of Semantic and Acoustic Token Variability Across Speech, Music, and Audio Domains.
- Yaroslav Getman, Tamás Grósz, Tommi Lehtonen, Mikko Kurimo: Is your model big enough? Training and interpreting large-scale monolingual speech foundation models.
- Andreas Patakis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou: Semantic-Aware Interpretable Multimodal Music Auto-Tagging.
- Asim Ersoy, Basel Ahmad Mousi, Shammur Absar Chowdhury, Firoj Alam, Fahim Dalvi, Nadir Durrani: From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models.
- Yen Meng, Sharon Goldwater, Hao Tang: Effective Context in Neural Speech Models.
- Martijn Bentum, Louis ten Bosch, Tomas O. Lentz: Word stress in self-supervised speech models: A cross-linguistic comparison.
- Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem H. Zuidema, Martijn Bentum: What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training.
- Robin Huo, Ewan Dunbar: Iterative Refinement, Not Training Objective, Makes HuBERT Behave Differently from wav2vec 2.0.
- Gaofei Shen, Hosein Mohebbi, Arianna Bisazza, Afra Alishahi, Grzegorz Chrupala: On the reliability of feature attribution methods for speech classification.
- Emma Cathrine Liisborg Leschly, Oliver Roesler, Michael Neumann, Jackson Liscombe, Abhishek Hosamath, Lakshmi Arbatti, Line H. Clemmensen, Melanie Ganz, Vikram Ramanarayanan: An Exploration of Interpretable Deep Learning Models for the Assessment of Mild Cognitive Impairment.
Summarization
- Steffen Freisinger, Philipp Seeberger, Thomas Ranzenberger, Tobias Bocklet, Korbinian Riedhammer: Towards Multi-Level Transcript Segmentation: LoRA Fine-Tuning for Table-of-Contents Generation.
- Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Ryo Fukuda, William Chen, Shinji Watanabe: Pick and Summarize: Integrating Extractive and Abstractive Speech Summarization.
- Othman Istaiteh, Salima Mdhaffar, Yannick Estève: Beyond Similarity Scoring: Detecting Entailment and Contradiction in Multilingual and Multimodal Contexts.
- Ziwei Gong, Lin Ai, Harsh Deshpande, Alexander Johnson, Emmy Phung, Zehui Wu, Ahmad Emami, Julia Hirschberg: Comparison-Based Automatic Evaluation for Meeting Summarization.
Show and Tell 1: ASR / Tools
- Alessandro De Luca, Srikanth Madikeri, Volker Dellwo: Voxplorer: Voice data exploration and projection in an interactive dashboard.
- Anand Kumar Rai, Satyam Rahangdale, Utkarsh Anand, Animesh Mukherjee: ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems.
- Christoph Draxler, Julian Pömp, Henk van den Heuvel, Fabio Ardolino, Arjan van Hessen: Transcribing Oral History Recordings Using the Transcription Portal.
- Teodora Vukovic, Jérémy Zehr, Jonathan Schaber, Igor Mustac, Nikolina Rajovic, Daniel McDonald, Johannes Graën, Noah Bubenhofer: LiRI Corpus Platform: Demonstration of a Web-Based Infrastructure for Multimodal Corpus Analysis.
- Zirong Li, Hongchen Wu, Yixin Gu, Yao Du, Yang Yue: Speech Annotation for A: Accuracy, Access, and Application.
- Arturs Znotins, Didzis Gosko, Normunds Gruzitis: LATE: Open Source Toolkit for Latvian and Latgalian Speech Transcription.
- Kumarmanas Nethil, Vaibhav Mishra, Kriti Anandan, Kavya Manohar: Scalable Offline ASR for Command-Style Dictation in Courtrooms.
Models of Speech Production
- Yijing Lu, Khalil Iskarous, Louis Goldstein: Towards a dynamical model of transitions between fluent and stuttered speech.
- Juliette Dindart, Agnès Rouxel, Crystal Lin, Trung Kien Bui, Muriel Lefort, Claire Pillot-Loiseau, Christophe Trésallet, Frédérique Frouin: Study of vocal fold vibration using M-mode ultrasound: a proof of concept.
- Jihwan Lee, Kevin Huang, Kleanthis Avramidis, Simon Pistrosch, Monica González Machorro, Yoonjeong Lee, Björn W. Schuller, Louis Goldstein, Shrikanth Narayanan: Articulatory Feature Prediction from Surface EMG during Speech Production.
- Saba Tabatabaee, Suzanne Boyce, Liran Oren, Mark Tiede, Carol Y. Espy-Wilson: Enhancing Acoustic-to-Articulatory Speech Inversion by Incorporating Nasality.
Speech and Grammar/Articulatory Analyses
- Anna Stein, Kevin Tang: Modeling Probabilistic Reduction using Information Theory and Naive Discriminative Learning.
- Zofia Malisz, Jan Foremski, Malgorzata Kul: Contextual predictability effects on acoustic distinctiveness in read Polish speech.
- Ivan Yuen, Katherine Demuth, Stefanie Shattuck-Hufnagel: How do both phonological and syntactic complexity influence speech planning?
- Anqi Xu, Yu-Yin Hsu: When focus shapes the flow: prosodic restructuring in Mandarin complex nominals.
- Sofoklis Kakouros: Investigating the Impact of Word Informativeness on Speech Emotion Recognition.
- Bowei Shao, Philipp Buech, Anne Hermes, Maria Giavazzi: Lexical stress affects lenition: The case of Italian palato-alveolar affricates.
- Peter Birkholz, Tianyi Zhang: Evaluation of a model for sound radiation from the vocal tract wall.
- Satu Hopponen, Tomi Kinnunen, Alexandre Nikolaev, Rosa González Hautamäki, Lauri Tavi, Einar Meister: FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents.
Speaking Styles, Register and Conversational Speech
- Yunzhuo Xiang, Jingyi Sun: Modeling Formant Dynamics in Mandarin /ai/: Effects of Speech Style and Speech Rate.
- Livia Qian, Carol Figueroa, Gabriel Skantze: Representation of Perceived Prosodic Similarity of Conversational Feedback.
- Oana Niculescu, Monica Vasileanu: Prolongation in Romanian.
- Kübra Bodur, Corinne Fredouille, Christine Meunier: Speech Reduction in French: The Relationship Between Vowel Space and Articulation Dynamics.
- Andre Batchelder-Schwab, Vasileios Michos, Jonathan Barnes: Stress in Spoken and Whistled Greek.
Emotional Distress in Speech
- Justyna Krzywdziak, Bartlomiej Eljasiak, Joanna Stepien, Michal Swiatek, Agnieszka Pruszek: Leveraging Text and Speech Processing for Suicide Risk Classification in Chinese Adolescents.
- Wen Wu, Ziyun Cui, Chang Lei, Yinan Duan, Diyang Qu, Ji Wu, Bowen Zhou, Runsen Chen, Chao Zhang: The 1st SpeechWellness Challenge: Detecting Suicide Risk Among Adolescents.
- Yifan Gao, Jiao Fu, Long Guo, Hong Liu: Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection.
- Xi Chen, Renzhe Yu, Yanshen Tan, Yiyi Li, Quan Qian, Ying Lin: Predicting Adolescent Suicidal Risk from Multi-task-based Speech: An Ensemble Learning Approach.
- Filomene Roquefort, Alexandre Ducorroy, Rachid Riad: In-context learning capabilities of Large Language Models to detect suicide risk among adolescents from speech transcripts.
- June-Woo Kim, Wonkyo Oh, Haram Yoon, Sung-Hoon Yoon, Dae-Jin Kim, Dong-Ho Lee, Sang-Yeol Lee, Chan-Mo Yang: Language-Agnostic Suicidal Risk Detection Using Large Language Models.
- Vincent P. Martin, Charles Brazier, Maxime Amblard, Michel Musiol, Jean-Luc Rouas: Network of acoustic characteristics for the automatic detection of suicide risk from speech. Contribution to the 2025 SpeechWellness challenge by the Semawave team.
Prosody in Speech Synthesis
- Eray Eren, Qingju Liu, Hyeongwoo Kim, Pablo Garrido, Abeer Alwan: ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs.
- Kyowoon Lee, Artyom Stitsyuk, Gunu Jho, Inchul Hwang, Jaesik Choi: Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models.
- Paul Mayer, Florian Lux, Alejandro Pérez González de Martos, Angelina Elizarova, Lindsey Vanderlyn, Dirk Väth, Ngoc Thang Vu: Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis.
- Tadashi Ogura, Takuma Okamoto, Yamato Ohtani, Erica Cooper, Tomoki Toda, Hisashi Kawai: GST-BERT-TTS: Prosody Prediction Without Accentual Labels For Multi-Speaker TTS Using BERT With Global Style Tokens.
- Anindita Mondal, Monica Surtani, Anil Kumar Vuppala, Parameswari Krishnamurthy, Chiranjeevi Yarra: ExagTTS: An Approach Towards Controllable Word Stress Incorporated TTS for Exaggerated Synthesized Speech Aiding Second Language Learners.
- Hoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim: Synthetic Data Generation for Phrase Break Prediction with Large Language Model.
Depression Detection and Assessment 1
- Lauren L. White, Ewan Carr, Judith Dineley, Catarina Botelho, Pauline Conde, Faith Matcham, Carolin Oetzmann, Amos Folarin, George Fairs, Agnes Norbury, Stefano Goria, Srinivasan Vairavan, Til Wykes, Richard J. B. Dobson, Vaibhav Naraya, Matthew Hotopf, Alberto Abad, Isabel Trancoso, Nicholas Cummins: Speech Reference Intervals: An Assessment of Feasibility in Depression Symptom Severity Prediction.
- Wenrui Liang, Rong Zhang, Xuezhen Zhang, Ying Ma, Wei-Qiang Zhang: DepressGEN: Synthetic Data Generation Framework for Depression Detection.
- Yuqiu Zhou, Yongjie Zhou, Yudong Yang, Yang Liu, Jun Huang, Shuzhi Zhao, Rongfeng Su, Lan Wang, Nan Yan: Emotion-Guided Graph Attention Networks for Speech-Based Depression Detection under Emotion-Inducting Tasks.
- Patawee Prakrankamanant, Shinji Watanabe, Ekapol Chuangsuwanich: Explainable Depression Detection using Masked Hard Instance Mining.
- Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore: Test-Time Training for Speech-based Depression Detection.
- Lishi Zuo, Man-Wai Mak: Leveraging Ordinal Information for Speech-based Depression Classification.
- Erfan Loweimi, Sofia de la Fuente Garcia, Saturnino Luz: Zero-Shot Speech-Based Depression and Anxiety Assessment with LLMs.
- Robert Lewis, Szymon Fedor, Nelson Hidalgo Julia, Joshua Curtiss, Jiyeon Kim, Noah Jones, David Mischoulon, Thomas F. Quatieri, Nicholas Cummins, Paola Pedrelli, Rosalind W. Picard: Towards the Objective Characterisation of Major Depressive Disorder Using Speech Data from a 12-week Observational Study with Daily Measurements.
- Sophie Young, Fuxiang Tao, Bahman Mirheidari, Madhurananda Pahar, Markus Reuber, Heidi Christensen: Can Speech Accurately Detect Depression in Patients With Comorbid Dementia? An Approach for Mitigating Confounding Effects of Depression and Dementia.
Speech Analysis, Detection and Classification 1
- Shaojie Li, Qintuya Si, De Hu: Temporal Convolutional Network with Smoothed and Weighted Losses for Distant Voice Activity and Overlapped Speech Detection.
- Kumud Tripathi, Chowdam Venkata Kumar, Pankaj Wasnik: Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion.
- Miseul Kim, Seyun Um, Hyeonjin Cha, Hong-Goo Kang: SpeechMLC: Speech Multi-label Classification.
- Dohyun Kim, Jiwook Hwang: Fully End-to-end Streaming Open-vocabulary Keyword Spotting with W-CTC Forced Alignment.
- Anna Seo Gyeong Choi, Alexander Richardson, Ryan Partlan, Sunny X. Tang, Sunghye Cho: Comparative Evaluation of Acoustic Feature Extraction Tools for Clinical Speech Analysis.
- Tahiya Chowdhury, Verónica Romero: Can We Trust Machine Learning? The Reliability of Features from Open-Source Speech Analysis Tools for Speech Modeling.
Speech-based Cognitive Assessment 1
- Vi Jun Sean Yong, Serkan Kumyol, Pau Le Lisa Low, Winnie Suk Wai Leung, Tristan Braud: HK-GenSpeech: A Generative AI Scene Creation Framework for Speech Based Cognitive Assessment.
- Parismita Gogoi, Vishwanath Pratap Singh, Seema Khadirnaikar, Soma Siddhartha, Sishir Kalita, Jagabandhu Mishra, Md. Sahidullah, Priyankoo Sarmah, S. R. M. Prasanna: Leveraging AM and FM Rhythm Spectrograms for Dementia Classification and Assessment.
- Yin-Long Liu, Yuanchao Li, Rui Feng, Liu He, Jia-Xin Chen, Yi-Ming Wang, Yu-Ang Chen, Yan-Han Peng, Jia-Hong Yuan, Zhen-Hua Ling: Leveraging Cascaded Binary Classification and Multimodal Fusion for Dementia Detection through Spontaneous Speech.
- Kaichen Jia, Jinpeng Li, Ke Li, Wei-Qiang Zhang: Whisper-Based Multilingual Alzheimer's Disease Detection and Improvements for Low-Resource Language.
- Qi Sun, Ziyue Qiu, Yu Pu, Jinpeng Li, Xuchu Chen, Wei-Qiang Zhang: PPGs-BERT: Leveraging Phoneme Sequence and BERT for Alzheimer's Disease Detection from Spontaneous Speech.
Large Language Models in Speech Recognition
- Te Ma, Min Bi, Saierdaer Yusuyin, Hao Huang, Zhijian Ou: LLM-based phoneme-to-grapheme for phoneme-based speech recognition.
- Jie Zhengjie, Gaofeng Cheng: Pinyin-Guided Chinese Speech Recognition with Large Language Model.
- Hang Su, Yuxiang Kong, Lichun Fan, Jian Luan: Text-Enhanced Audio Encoder for Large Language Model based Speech Recognition via Cross-Modality Pre-training with Unpaired Audio-Text Data.
- Jinda Zhang, Aanchan Mohan: Towards atypical speech transcription using LLM-based ASR.
- Jeena Prakash, Blessingh Kumar, Kadri Hacioglu, Bidisha Sharma, Sindhuja Gopalan, Malolan Chetlur, Shankar Venkatesan, Andreas Stolcke: Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLM.
- Tianyi Xu, Hongjie Chen, Qing Wang, Hang Lv, Jian Kang, Jie Li, Zhennan Lin, Yongxiang Li, Lei Xie: Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis.
Speech Coding and Echo Cancellation
- Shanhui Gan, Zijian Liang, Kai Niu, Ping Zhang: Synonymity-Based Semantic Coding for Efficient Speech Compression.
- Byeong Hyeon Kim, Hyungseob Lim, Inseon Jang, Hong-Goo Kang: Towards an Ultra-Low-Delay Neural Audio Coding with Computational Efficiency.
- Zixiang Wan, Guochang Zhang, Yifeng He, Jianqiang Wei: SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum Domain.
- Haibin Wu, Naoyuki Kanda, Sefik Emre Eskimez, Jinyu Li: TS3-Codec: Transformer-Based Simple Streaming Single Codec.
- Yunkee Chae, Kyogu Lee: Towards Bitrate-Efficient and Noise-Robust Speech Coding with Variable Bitrate RVQ.
- Bowen Zhang, Ian McLoughlin, Xiaoxiao Miao, A. S. Madhukumar: LSPnet: an ultra-low bitrate hybrid neural codec.
- Yao Guo, Yang Ai, Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Zhen-Hua Ling: Vision-Integrated High-Quality Neural Speech Coding.
- Woongjib Choi, Byeong Hyeon Kim, Hyungseob Lim, Inseon Jang, Hong-Goo Kang: Neural Spectral Band Generation for Audio Coding.
- Fei Zhao, Xueliang Zhang, Zhong-Qiu Wang: Multi-Channel Acoustic Echo Cancellation Based on Direction-of-Arrival Estimation.
Decoding Algorithms
- Koji Okabe, Hitoshi Yamamoto: Simultaneous Masked and Unmasked Decoding with Speculative Decoding Masking for Fast ASR without Accuracy Loss.
- Hainan Xu, Vladimir Bataev, Lilit Grigoryan, Boris Ginsburg: WIND: Accelerated RNN-T Decoding with Windowed Inference for Non-blank Detection.
- Vladimir Bataev, Andrei Andrusenko, Lilit Grigoryan, Aleksandr Laptev, Vitaly Lavrukhin, Boris Ginsburg: NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding.
- Lilit Grigoryan, Vladimir Bataev, Andrei Andrusenko, Hainan Xu, Vitaly Lavrukhin, Boris Ginsburg: Pushing the Limits of Beam Search Decoding for Transducer-based ASR models.
- Ashish R. Mittal, Darshan Prabhu, Sunita Sarawagi, Preethi Jyothi: Skip-Salsa: Skip Synchronous Fusion of ASR LLM Decoders.
- Kwok Chin Yuen, Jia Qi Yip: Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition.
Queer and Trans Speech Science and Technology
- Tara McAllister, Collin Eagen, Yi Shan, Peter Traver, Daphna Harel, Tae Hong Park, Vesna D. Novak: Web-Based Application for Real-Time Biofeedback of Vocal Resonance in Gender-Affirming Voice Training: Design and Usability Evaluation.
- Robin Netzorg, Naomi Carvalho, Andrea Guzman, Lydia Wang, Juliana Francis, Klo Vivienne Garoute, Keith Johnson, Gopala Anumanchipalli: On the Production and Perception of a Single Speaker's Gender.
- Alice Ross, Cliodhna Hughes, Eddie L. Ungless, Catherine Lai: Conveying Gender Through Speech: Insights from Trans Men.
- Ingo Siegert, Jan Marquenie, Sven Grawunder: Queer Waves: A German Speech Dataset Capturing Gender and Sexual Diversity from Podcasts and YouTube.
- Carlos Hartmann: Reddit FlairShare: A Human-Annotated Dataset of Gender-Progressive Online Discourse.
- Maxwell Hope, Éva Székely: Voices of 'cyborg awesomeness': Posthuman embodiment of nonbinary gender expression in AI speech technologies.
Tone
- Xiao Dong, Fengming Liu, Chien-Jer Charles Lin, Monica Nesbitt, Shuju Shi: Neutral Tone Variation in Beijing Mandarin: Is Neutral Tone Toneless?
- Siqi Lu, Hui Feng, Ziyu Xiong: The Role of Syntactic Structures in Shaping Directionality in Trisyllabic Tone Sandhi: Evidence from Tianjin Mandarin.
- Zhijie Li, Hui Feng: Acoustic Representation and Realization of Weak Elements Subcategories: In the Case of Tianjin Mandarin.
- Lishan Li, Yaolin Zhou, Xiaoying Xu: Lexical competition in the process of Cantonese tone merging: Diverse Impact Mechanisms Across Different Individuals and Tone Pairs.
- Zhenrui Zhang, Fang Hu: Tonal Perception in Changde Mandarin.
- Changhong Du, Fang Hu: Tonal Contrasts in the Malipo Variety of the Mienic Language.
Cross-Lingual and Multilingual Processing
- Yanir Marmor, Yair Lifshitz, Yoad Snapir, Kinneret Misgav: Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing.
- Ondrej Klejch, William Lamb, Peter Bell: A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic.
- Razhan Hameed, Sina Ahmadi, Hanah Hadi, Rico Sennrich: Automatic Speech Recognition for Low-Resourced Middle Eastern Languages.
- Zhaolin Li, Jan Niehues: In-context Language Learning for Endangered Languages in Speech Recognition.
- Brian Yan, Injy Hamed, Shuichiro Shimizu, Vasista Sai Lodagala, William Chen, Olga Iakovenko, Bashar Talafha, Amir Hussein, Alexander Polok, Kalvin Chang, Dominik Klement, Sara Althubaiti, Puyuan Peng, Matthew Wiesner, Thamar Solorio, Ahmed Ali, Sanjeev Khudanpur, Shinji Watanabe: CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset.
- Mingchen Shao, Xinfa Zhu, Chengyou Wang, Bingshen Mu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie: Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR.
- Tuan Nguyen, Huy Dat Tran: Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages.
- Leonora Vesterbacka, Faton Rekathati, Robin Kurtz, Justyna Sikora, Agnes Toftgård: Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition.
Echo Cancellation, Feedback Control, and Near-end Enhancement
- Fei Zhao, Shulin He, Xueliang Zhang: Room Impulse Response as a Prompt for Acoustic Echo Cancellation.
- Yuyang Wang, Yonghui Liu, Jianbing Liu, Kai Niu, Zhiqiang He: CAGCRN: Real-Time Speech Enhancement with a Lightweight Model for Joint Acoustic Echo Cancellation and Noise Suppression.
- Jinfu Wang, Ziteng Wang, Xin Liu, Yang Liu, Qing Shi, Zhengqiang Luo, Feiran Yang: Exploiting Echo Path Priors for Enhanced Stereo Acoustic Echo Cancellation.
- Quang Minh Dinh, Hoda Rezaee Kaviani, Mehrdad Hosseinzadeh, Yuanhao Yu: Extended Loss: Incorporating Long Context into Training Models when using Short Audio Frames.
- Filippo Villani, Wai-Yip Chan, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen: Analysis and Extension of a Near-End Listening Enhancement Method Based on Long-Term Fractile Noise Statistics.
- Yuan-Kuei Wu, Juan Azcarreta Ortiz, Kashyap Patel, Buye Xu, Jung-Suk Lee, Sanha Lee, Ashutosh Pandey: A Novel Deep Learning Framework for Efficient Multichannel Acoustic Feedback Control.
- Bunlong Lay, Rostilav Makarov, Timo Gerkmann: Diffusion Buffer: Online Diffusion-based Speech Enhancement with Sub-Second Latency.
Pathological Speech Analysis 1
- Xiaokang Liu, Xingfeng Li, Yudong Yang, Lan Wang, Nan Yan: Addressing Task Conflicts in Stuttering Detection via MMoE-Based Multi-Task Learning.
- Y. S. Upendra Vishwanath, Tanuka Bhattacharjee, Deekshitha G, Sathvik Udupa, Chowdam Venkata Thirumala Kumar, Madassu Keerthipriya, Darshan Chikktimmegowda, Dipti Baskar, Yamini Belur, Seena Vengalil, Atchayaram Nalini, Prasanta Kumar Ghosh: Comparison of Acoustic and Textual Features for Dysarthria Severity Classification in Amyotrophic Lateral Sclerosis.
- Suhita Ghosh, Mélanie Jouaiti, Jan-Ole Perschewski, Sebastian Stober: StutterCut: Uncertainty-Guided Normalised Cut for Dysfluency Segmentation.
- Giulia Sanguedolce, Jón Guðnason, Dragos-Cristian Gruia, Emilie D'Olne, Fatemeh Geranmayeh, Patrick A. Naylor: Physiologically-Informed Feature Analysis of Acquired Speech Disorders for Stroke Assessment.
Hearing Disorders
- Gloria Araiza-Illan, Luke Meyer, Bert Maat, Deniz Baskent: Robot-assisted Recognition of Vocal Emotions in Pseudospeech for Cochlear Implanted Adolescents.
- Ahsan J. Cheema, Sunil Puria: Using Neurogram Similarity Index Measure (NSIM) to Model Hearing Loss and Cochlear Neural Degeneration.
- Longbin Jin, Donghun Min, Jung Eun Shin, Eun Yi Kim: Contrastive Learning-based Syllable-Level Mispronunciation Detection and Diagnosis for Speech Audiometry.
- Hsin-Tien Chiang, John H. L. Hansen: A Deformable Convolution GAN Approach for Speech Dereverberation in Cochlear Implant Users.
- Fengyuan Hao, Brian C. J. Moore, Huiyong Zhang, Xiaodong Li, Chengshi Zheng: L3C-DeepMFC: Low-Latency Low-Complexity Deep Marginal Feedback Cancellation with Closed-Loop Fine Tuning for Hearing Aids.
- Man Wang, Yixin Ding, Niels O. Schiller: Semantic Processing During Spoken Word Production by Children with Cochlear Implants.
- Yuting Ding, Xuefei Wang, Fei Chen: Linguistic Masking and Its Release in Simulated Electric-acoustic Hearing.
Interspeech 2025 URGENT Challenge
- Wangyou Zhang, Kohei Saijo, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Wei Wang, Yihui Fu, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian: Lessons Learned from the URGENT 2024 Speech Enhancement Challenge.
- Kohei Saijo, Wangyou Zhang, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Yihui Fu, Wei Wang, Tim Fingscheidt, Shinji Watanabe: Interspeech 2025 URGENT Speech Enhancement Challenge.
- Xiaobin Rong, Dahan Wang, Qinwen Hu, Yushi Wang, Yuxiang Hu, Jing Lu: TS-URGENet: A Three-stage Universal Robust and Generalizable Speech Enhancement Network.
- Xiaohuai Le, Zhuangqi Chen, Siyu Sun, Xianjun Xia, Chuanzeng Huang: Multistage Universal Speech Enhancement System for URGENT Challenge.
- Zhihang Sun, Andong Li, Tong Lei, Rilin Chen, Meng Yu, Chengshi Zheng, Yi Zhou, Dong Yu: Scaling beyond Denoising: Submitted System and Findings in URGENT Challenge 2025.
- Sanberk Serbest, Tijana Stojkovic, Milos Cernak, Andrew Harper: DeepFilterGAN: A Full-band Real-time Speech Enhancement System with GAN-based Stochastic Regeneration.
- Nabarun Goswami, Tatsuya Harada: FUSE: Universal Speech Enhancement using Multi-Stage Fusion of Sparse Compression and Token Generation Models for the URGENT 2025 Challenge.
- Rong Chao, Rauf Nasretdinov, Yu-Chiang Frank Wang, Ante Jukic, Szu-Wei Fu, Yu Tsao: Universal Speech Enhancement with Regression and Generative Mamba.
Spoken Machine Translation 2
- Jean-Luc Rouas, Charles Brazier, Leila Ben Letaifa, Rafael Medina, Pedro Palacios, David Atienza, Giovanni Ansaloni: Structured pruning for efficient systolic array accelerated cascade Speech-to-Text Translation.
- Mohammad MohammadAmini, Aghilas Sini, Marie Tahon, Antoine Laurent: Scaling pseudo-labeling data for end-to-end low-resource speech translation (the case of Kurdish language).
- Kirandevraj R, Vinod K. Kurmi, Vinay P. Namboodiri, C. V. Jawahar: Multilingual Query-by-Example KWS for Indian Languages using Transliteration.
- Chenyang Le, Yinfeng Xia, Huiyan Li, Manhong Wang, Yutao Sun, Xingyang Ma, Yanmin Qian: Novel Parasitic Dual-Scale Modeling for Efficient and Accurate Multilingual Speech Translation.
- Verena Blaschke, Miriam Winkler, Constantin Förster, Gabriele Wenger-Glemser, Barbara Plank: A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation.
- Tahir Javed, Kaushal Santosh Bhogale, Mitesh M. Khapra: NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data.
Spatial Audio and Acoustics 1
- Sheng Lyu, Yuemin Yu, Chenshu Wu:

Temporal Modeling of Room Impulse Response Generation via Multi-Scale Autoregressive Learning. - Yunqi C. Zhang, Dhruv Jagmohan, Hong Kit Li, C. T. Justine Hui, Yusuke Hioka:

Effect of Noise Floor in Room Impulse Response on Speech Perception Under Spherical Harmonics-based Spatial Sound Reproduction. - Christopher Ick, Gordon Wichern, Yoshiki Masuyama, François G. Germain, Jonathan Le Roux:

Direction-Aware Neural Acoustic Fields for Few-Shot Interpolation of Ambisonic Impulse Responses. - Linya Fu, Yu Liu, Zhijie Liu, Zedong Yang, Zhong-Qiu Wang, Youfu Li, He Kong:

AuralNet: Hierarchical Attention-based 3D Binaural Localization of Overlapping Speakers. - Tuochao Chen, D. Shin, Hakan Erdogan, Sinan Hersek:

SoundSculpt: Direction and Semantics Driven Ambisonic Target Sound Extraction. - Yang Xiao, Rohan Kumar Das:

TF-Mamba: A Time-Frequency Network for Sound Source Localization.
Articulatory and Vocal Tract Modelling
- Frédéric Berthommier:

Articulatory modeling of the S-shaped F2 trajectories observed in Öhman's spectrographic analysis of VCV syllables. - Allan Vurma, Einar Meister, Lya Meister, Jaan Ross, Marju Raju, Veeda Kala, Tuuri Dede:

The Role of Voiced Consonant Duration in Sung Vowel-Consonant and Consonant-Vowel Recognition. - Riccarda Funk, Melanie Weirich, Adrian P. Simpson:

How sibilant spectra shape gender perception in prepubertal children: A voice morphing study. - Tharinda Piyadasa, Joan Glaunès, Amelia Gully, Michael Proctor, Kirrie J. Ballard, Tünde Szalay, Naeim Sanaei, Sheryl Foster, David Waddington, Craig T. Jin:

Constrained LDDMM for Dynamic Vocal Tract Morphing: Integrating Volumetric and Real-Time MRI. - Rongshuai Wu, Debasish Ray Mohapatra, Sidney Fels:

2D Immersed Boundary Method in Vocal Tract Acoustics: An Eulerian-Lagrangian Model for Simulation of Diphthongs. - Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie:

Reconstruction of the Complete Vocal Tract Contour Through Acoustic to Articulatory Inversion Using Real-Time MRI Data. - Yubin Zhang, Prakash Kumar, Ye Tian, Ziwei Zhao, Xuan Shi, Kevin Huang, Kevin Lee, Haley Hsu, Shrikanth Narayanan, Krishna S. Nayak, Louis Goldstein:

Co-registration of real-time MRI and respiration for speech research.
Acoustic Assessment of Respiratory Health
- Loes van Bemmel, Lauren G. Reinders, Folkert Brijker, Bas Holverda, Frits M. E. Franssen, Hanneke van Helvoort, Visara Urovi, Marieke Spreeuwenberg, Sami O. Simons:

SPEAKtoCOPD: a flashmob study to collect COPD speech. - Yuyang Yan, Sami O. Simons, Visara Urovi:

Developing a LeFF Transformer Model for Exacerbated Speech Detection in COPD and Asthma. - Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada:

Towards Pre-training an Effective Respiratory Audio Foundation Model. - Lauren G. Reinders, Loes van Bemmel, Alexander Mackay, David Nobbs, Frits M. E. Franssen, Hester Gietema, Simona Schäfer, Sami O. Simons:

Effect of physical exercise on voice in people living with COPD. - Gaoyang Dong, Zhicheng Zhang, Ping Sun, Minghui Zhang:

Adaptive Differential Denoising for Respiratory Sounds Classification. - Peidong Wei, Shiyu Miao, Lin Li:

Disentangling Dual-Encoder Masked Autoencoder for Respiratory Sound Classification. - Seung Gyu Jeong, Seong Eun Kim:

Patient-Aware Feature Alignment for Robust Lung Sound Classification: Cohesion-Separation and Global Alignment Losses. - Miika Toikkanen, June-Woo Kim:

Improving Respiratory Sound Classification with Architecture-Agnostic Knowledge Distillation from Ensembles.
Advances in Modelling and Imaging
- Mélen Guillaume, Anahita Basirat, Julien Diard:

Theoretical proposal for a unified Bayesian model of adaptation in non-interactive and interactive speech production. - Juraj Simko, Benjamin Elie, Alice Turk:

Self-supervised Optimality-Guided Learning of Speech Articulation. - Zhe-chen Guo, Bharath Chandrasekaran:

Extended High-frequency Cues to Phoneme Recognition: Insights from ASR. - Jia-Xin Chen, Yi-Ming Wang, Ziyu Zhang, Jiayang Han, Yin-Long Liu, Rui Feng, Xiuyuan Liang, Zhen-Hua Ling, Jia-Hong Yuan:

Decoding Speaker-Normalized Pitch from EEG for Mandarin Perception. - Tong Zhu, Xiaoke Yang, Jian Zhou, Lu Li, Zhao Lv, Cunhang Fan:

SSF-DST: A Spectro-Spatial Features Enhanced Deep Spatiotemporal Network for EEG-Based Auditory Attention Detection. - Yujie Yan, Xiran Xu, Haolin Zhu, Songyi Li, Bo Wang, Xihong Wu, Jing Chen:

Overestimated performance of auditory attention decoding caused by experimental design in EEG recordings. - Chetan Sharma, Vaishnavi Chandwanshi, Shreya Shrikant Karkun, Aditya Anand Gupta, Prasanta Kumar Ghosh:

A real-time MRI study on asymmetry in velum dynamics during VCV production with nasal sounds. - Carey Smith, Hu Cheng, Pertti Palo, Daniel Aalto, Steven M. Lulich:

Exploratory Analysis of Brainstem fMRI Data During Sustained Phonation.
Conversation, Communication and Interaction 1
- Seongsil Heo, Christi Miller, Calvin Murdock, Michael J. Proulx:

Gaze-Enhanced Multimodal Turn-Taking Prediction in Triadic Conversations. - Sam O'Connor Russell, Naomi Harte:

Visual Cues Support Robust Turn-taking Prediction in Noise. - Yoshinori Fukunaga, Ryota Nishimura, Kengo Ohta, Norihide Kitaoka:

Backchannel prediction for natural spoken dialog systems using general speaker and listener information. - Muhammad Yeza Baihaqi, Angel F. Garcia Contreras, Seiya Kawano, Koichiro Yoshino:

Rapport-Building Dialogue Strategies for Deeper Connection: Integrating Proactive Behavior, Personalization, and Aizuchi Backchannels. - Lena-Marie Huttner, Jeppe H. Christensen, Gitte Keidser, Tobias May, Torsten Dau, Sergi Rotger-Griful:

Does effortful speech production indicate communication difficulty caused by noise and hearing aid support? - Julio Cesar Cavalcanti, Gabriel Skantze:

"Dyadosyncrasy", Idiosyncrasy and Demographic Factors in Turn-Taking.
Robust Speaker Verification
- Théo Lepage, Réda Dehak:

SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification. - Minu Kim, Kangwook Jang, Hoirin Kim:

ParaNoise-SV: Integrated Approach for Noise-Robust Speaker Verification with Parallel Joint Learning of Speech Enhancement and Noise Extraction. - Zhe Li, Man-Wai Mak, Jen-Tzung Chien, Mert Pilanci, Zezhong Jin, Helen Meng:

Disentangling Speaker and Content in Pre-trained Speech Models with Latent Diffusion for Robust Speaker Verification. - Alexandre Ferro Filho, Diogo Fernandes Costa Silva, Pedro Elias Engelberg Silva Borges, Arlindo Rodrigues Galvão Filho:

Evaluating Deep Speaker Embedding Robustness to Domain, Sampling Rate, and Codec Variations. - Zhiyong Chen, Shuhang Wu, Xinnuo Li, Zhiqi Ai, Shugong Xu:

Towards Robust Speaker Recognition against Intrinsic Variation with Foundation Model Few-shot Tuning and Effective Speech Synthesis. - Jin Li, Man-Wai Mak, Johan Rohdin, Kong Aik Lee, Hynek Hermansky:

Bayesian Learning for Domain-Invariant Speaker Verification and Anti-Spoofing.
Multilingual ASR
- Masato Mimura, Jaeyoung Lee, Tatsuya Kawahara:

Switch Conformer with Universal Phonetic Experts for Multilingual ASR. - Hongli Yang, Sheng Li, Hao Huang, Ayiduosi Tuohan, Yizhou Peng:

Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR. - Jiahong Li, Yiwen Shao, Jianheng Zhuo, Chenda Li, Liliang Tang, Dong Yu, Yanmin Qian:

Efficient Multilingual ASR Finetuning via LoRA Language Experts. - Raphaël Bagat, Irina Illina, Emmanuel Vincent:

Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition. - Zheng Xin Yong, Vineel Pratap, Michael Auli, Jean Maillard:

Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR. - Pouya Mehralian, Hugo Van hamme:

Leveraging Geographic Metadata for Dialect-Aware Speech Recognition. - Ömer Tarik Özyilmaz, Matt Coler, Matias Valdenegro-Toro:

Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning. - Jianheng Zhuo, Yifan Yang, Yiwen Shao, Yong Xu, Dong Yu, Kai Yu, Xie Chen:

VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining. - Yingzhi Wang, Anas Alhmoud, Muhammad Alqurishi:

Open Universal Arabic ASR Leaderboard.
Multi-channel Speech Enhancement
- Yujie Yang, Bing Yang, Xiaofei Li:

Mel-McNet: A Mel-Scale Framework for Online Multichannel Speech Enhancement. - Zheng Wang, Xiaobin Rong, Yu Sun, Tianchi Sun, Zhibin Lin, Jing Lu:

A Lightweight Hybrid Dual Channel Speech Enhancement System under Low-SNR Conditions. - Pengjie Shen, Xueliang Zhang, Zhong-Qiu Wang:

ARiSE: Auto-Regressive Multi-Channel Speech Enhancement. - Lu Han, Junqi Zhao, Renhua Peng:

WTFormer: A Wavelet Conformer Network for MIMO Speech Enhancement with Spatial Cues Preservation. - Nurali Alip, Tianrui Wang, Rui Cao, Meng Ge, Jingru Lin, Longbiao Wang, Jianwu Dang:

A Three-Stage Beamforming with Harmonic Guidance for Multi-Channel Speech Enhancement. - Chengyuan Qin, Wenmeng Xiong, Jing Zhou, Maoshen Jia, Changchun Bao:

Speech Enhancement with Dual-path Multi-Channel Linear Prediction Filter and Multi-norm Beamforming.
Self-supervised Learning
- Mingyu Cui, Yifan Yang, Jiajun Deng, Jiawen Kang, Shujie Hu, Tianzi Wang, Zhaoqing Li, Shiliang Zhang, Xie Chen, Xunying Liu:

Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR. - Nik Vaessen, Roeland Ordelman, David A. van Leeuwen:

Self-supervised learning of speech representations with Dutch archival data. - Aleksandr Kutsakov, Alexandr Maximenko, Georgii Gospodinov, Pavel Bogomolov, Fyodor Minkin:

GigaAM: Efficient Self-Supervised Learner for Speech Recognition. - Hyung-Gun Chi, Zakaria Aldeneh, Tatiana Likhomanenko, Oggi Rudovic, Takuya Higuchi, Li-Wei Chen, Shinji Watanabe, Ahmed Hussen Abdelaziz:

DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective. - Kentaro Onda, Yosuke Kashiwagi, Emiru Tsunoo, Hayato Futami, Shinji Watanabe:

Differentiable K-means for Fully-optimized Discrete Token-based ASR. - Ryan Whetten, Lucas Maison, Titouan Parcollet, Marco Dinarelli, Yannick Estève:

Towards Early Prediction of Self-Supervised Speech Model Performance.
Singing Voice and Audio Synthesis
- Joon-Seung Choi, Dong-Min Byun, Hyung-Seok Oh, Seong-Whan Lee:

VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice Conversion. - Chenyu Yang, Hangting Chen, Shuai Wang, Haina Zhu, Haizhou Li:

TVC-MusicGen: Time-Varying Structure Control for Background Music Generation via Self-Supervised Training. - Mu Yang, Bowen Shi, Matthew Le, Wei-Ning Hsu, Andros Tjandra:

Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation. - Mingda Liu, Jiatong Shi:

Bridging Speech and Singing: Multi-stage Speech-Prompted Singing Voice Conversion with Speaker Embedding Adaptation. - Yicheng Gu, Chaoren Wang, Zhizheng Wu, Lauri Juvela:

Neurodyne: Neural Pitch Manipulation with Representation Learning and Cycle-Consistency GAN. - Zijing Zhao, Kai Wang, Hao Huang, Ying Hu, Liang He, Jichen Yang:

VS-Singer: Vision-Guided Stereo Singing Voice Synthesis with Consistency Schrödinger Bridge. - Wei Chen, Binzhu Sha, Dan Luo, Jing Yang, Zhuo Wang, Fan Fan, Zhiyong Wu:

DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching. - Wangjin Zhou, Tianjiao Du, Chenglin Xu, Sheng Li, Yi Zhao, Tatsuya Kawahara:

Simple and Effective Content Encoder for Singing Voice Conversion via SSL-Embedding Dimension Reduction. - Yunkee Chae, Eunsik Shin, Suntae Hwang, Seungryeol Paik, Kyogu Lee:

Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control.
Acoustic and Articulatory Cues in Speech Perception
- Wenwei Dong, Alif Silpachai, Catia Cucchiarini, Helmer Strik:

Multitalker Babble in English Vowel Perception Training: A Comparison between Humans and Neural Models. - Etienne Gaudrain, Sarah Verhulst, Deniz Baskent:

Speech stimulus design to study the neural coding of speech and the impact of cochlear synaptopathy. - Esther Janse, Chen Shen, Martin Cooke:

Prediction of listening effort ratings for habitual and clear-Lombard speech presented in noise. - Shengyue Xiong, Zhe-chen Guo, Bharath Chandrasekaran:

Language and Accent Familiarity Effects on the Use of Acoustic Cues in Talker Identification. - Laura Rachman, Deniz Baskent:

Characterization of voice cue sensitivity and vocal emotion recognition across the adult lifespan. - Zixia Fan, Ronny Ibrahim, Joshua Penney, Felicity Cox:

Creaky Voice Facilitates More Efficient Phonological Processing of Mandarin Tone 3.
Audio Event Detection and Classification
- Tomoya Yoshinaga, Yoshiaki Bando, Keitaro Tanaka, Keisuke Imoto, Masaki Onishi, Shigeo Morishima:

Training Onset-and-Offset-Aware Sound Event Detection on a Heterogeneous Dataset via Probabilistic Sequential Modeling. - Yulu Fang, Mingyue He, Qisheng Xu, Jianqiao Zhao, Cheng Yang, Kele Xu, Yong Dou:

Multi-view Fusion and Parameter Perturbation for Few-Shot Class-Incremental Audio Classification. - Yongjie Si, Yanxiong Li, Jiaxin Tan, Qianhua He, Il-Youp Kwak:

Fully Few-shot Class-incremental Audio Classification Using Multi-level Embedding Extractor and Ridge Regression Classifier. - Claudia Montero-Ramírez, Alba Martínez-Serrano, Jorge Garcelán-Gómez, Francisco J. Valverde-Albacete, Carmen Peláez-Moreno:

Beyond Conventional Metrics: using Entropic Triangles to Explain Balancing Methods in Acoustic Scene Classification. - Emiliano Acevedo, Martín Rocamora, Magdalena Fuentes:

Domain Adaptation Method and Modality Gap Impact in Audio-Text Models for Prototypical Sound Classification. - Myeonghoon Ryu, Hongseok Oh, Suji Lee, Han Park:

Unified Microphone Conversion: Many-to-Many Device Mapping via Feature-wise Linear Modulation.
Inclusivity
- Chris Emezue, NaijaVoices Community, Busayo Awobade, Abraham Toluwase Owodunni, Handel Emezue, Gloria Monica Tobechukwu Emezue, Nefertiti Nneoma Emezue, Sewade Ogun, Bunmi Akinremi, David Ifeoluwa Adelani, Chris Pal:

The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages. - Maliha Jahan, Yinglun Sun, Priyam Mazumdar, Zsuzsanna Fagyal, Thomas Thebaud, Jesús Villalba, Mark Hasegawa-Johnson, Najim Dehak, Laureano Moro-Velázquez:

FaiST: A Benchmark Dataset for Fairness in Speech Technology. - Kemal Altwlkany, Amar Kuric, Emanuel Lacic:

On the Language and Gender Biases in PSTN, VoIP and Neural Audio Codecs. - José Giraldo, Alex Peiró Lilja, Carme Armentano-Oller, Rodolfo Zevallos, Cristina España-Bonet:

Evaluating Speech Enhancement Performance Across Demographics and Language.
Voice Conversion 1
- Seymanur Akti, Tuan-Nam Nguyen, Alexander Waibel:

Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion. - Hitoshi Suda, Shinnosuke Takamichi, Satoru Fukayama:

Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora. - Ishan D. Biyani, Nirmesh J. Shah, Ashishkumar P. Gudmalwar, Pankaj Wasnik, Rajiv Ratn Shah:

REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion. - Alexander Lobashev, Assel Yermekova, Maria A. Larchenko:

Training-Free Voice Conversion with Factorized Optimal Transport. - Yihan Liu, Zhengyang Chen, Leying Zhang, Yanmin Qian:

E2E-BPVC: End-to-End Background-Preserving Voice Conversion via In-Context Learning. - Kaidi Wang, Wenhao Guan, Ziyue Jiang, Hukai Huang, Peijie Chen, Weijie Wu, Qingyang Hong, Lin Li:

Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion. - Pengyu Ren, Wenhao Guan, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li:

ReFlow-VC: Zero-shot Voice Conversion Based on Rectified Flow and Speaker Feature Optimization. - Jiawei Jin, Zhihan Yang, Yixuan Zhou, Zhiyong Wu:

In This Environment, As That Speaker: A Text-Driven Framework for Multi-Attribute Speech Conversion. - Herman Kamper, Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau:

LinearVC: Linear Transformations of Self-Supervised Features Through the Lens of Voice Conversion. - Desheng Hu, Yang Xiang, Jian Lu, Xinhui Hu, Xinkang Xu:

Speaker Normalization and Content Restoration for Zero-Shot Voice Conversion with Attention-Enhanced Discriminator.
Speech-based Cognitive Assessment 2
- Xiaoquan Ke, Man-Wai Mak, Helen Meng:

Optimizing Pause Context in Fine-Tuning Pre-trained Large Language Models for Dementia Detection. - Emmanuel Akinrintoyo, Nadine Abdelhalim, Nicole Salomons:

WhisperD: Dementia Speech Recognition and Filler Word Detection with Whisper. - Catarina Botelho, David Gimeno-Gómez, Francisco Teixeira, John Mendonça, Patrícia Pereira, Diogo A. P. Nunes, Thomas Rolland, Anna Pompili, Rubén Solera-Ureña, Maria Ponte, David Martins de Matos, Carlos D. Martínez-Hinarejos, Isabel Trancoso, Alberto Abad:

Acoustic and Linguistic Biomarkers for Cognitive Impairment Detection from Speech. - Yao Xiao, Heidi Christensen, Stefan Goetze:

Alzheimer's Dementia Detection Using Perplexity from Paired Large Language Models. - Mansi, Anastasios Lepipas, Dominika C. Woszczyk, Yiying Guan, Soteris Demetriou:

Understanding Dementia Speech Alignment with Diffusion-Based Image Generation. - Dominika C. Woszczyk, Ranya Aloufi, Soteris Demetriou:

ClaritySpeech: Dementia Obfuscation in Speech.
Source Separation 1
- Jihyun Kim, Doyeon Kim, Hyewon Han, Jinyoung Lee, Jonguk Yoo, Chang Woo Han, Jeongook Song, Hoon-Young Cho, Hong-Goo Kang:

Quadruple Path Modeling with Latent Feature Transfer for Permutation-free Continuous Speech Separation. - Kangqi Jing, Wenbin Zhang, Yu Gao:

End-to-End DOA-Guided Speech Extraction in Noisy Multi-Talker Scenarios. - Xue Yang, Guiru Shen, Yu Yang:

Speaker Separation for an Unknown Number of Speakers with Encoder-Decoder-Based Contextual Information Module. - Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen:

Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers. - Hadi Alizadeh, Rahil Mahdian Toroghi, Hassan Zareian:

ReSepNet: A Unified-Light Model for Recursive Speech Separation with Unknown Speaker Count. - Tzlil Avidan, Bracha Laufer-Goldshtein:

Deep-Simplex Multichannel Speech Separation. - Haoxu Wang, Yiheng Jiang, Gang Qiao, Pengteng Shi, Biao Tian:

FLASepformer: Efficient Speech Separation with Gated Focused Linear Attention Transformer. - Liang Tao, Maoshen Jia, Yonggang Hu:

Power Spectral Density Estimation for Acoustic Source Separation Using A Spherical Microphone Array. - Yiheng Jiang, Haoxu Wang, Yafeng Chen, Gang Qiao, Biao Tian:

Exploring Efficient Directional and Distance Cues for Regional Speech Separation.
Language and Accent Identification and Speaker Privacy
- Spandan Dey, Hirak Mondal, Sanjay Kumar Kurmi:

Teacher-Free Knowledge Distillation for Improving Short-Utterance Spoken Language Identification. - Niyati Bafna, Matthew Wiesner:

LID Models are Actually Accent Classifiers: Implications and Solutions for LID on Accented Speech. - Gowtham Premananth, Vinith Kugathasan, Carol Y. Espy-Wilson:

Analyzing the Impact of Accent on English Speech: Acoustic and Articulatory Perspectives. - Eliathamby Ambikairajah, Jingyao Wu, Ting Dang, Vidhyasaharan Sethu:

A Study of Speech Embedding Similarities Between Australian Aboriginal and High-Resource Languages. - Abderrahim Fathan, Jahangir Alam, Xiaolin Zhu:

An Investigative Study on Recent Sharpness- and Flatness-Based Optimizers for Enhanced Self-Supervised Speaker Verification. - Chenguang Hu, Yaqian Hao, Fulin Zhang, Xiaoxue Luo, Yao Shen, Yingying Gao, Chao Deng, Shilei Zhang, Junlan Feng:

Privacy-Preserving Speaker Verification via End-to-End Secure Representation Learning. - Elvir Karimov, Alexander Varlamov, Danil Ivanov, Dmitrii Korzh, Oleg Rogov:

Novel Loss-Enhanced Universal Adversarial Patches for Sustainable Speaker Privacy. - Ying Meng, Zhihua Fang, Liang He:

Federated Learning with Feature Space Separation for Speaker Recognition.
Source Tracing: The Origins of Synthetic or Manipulated Speech
- Pierre Falez, Tony Marteau, Damien Lolive, Arnaud Delhay:

Audio Deepfake Source Tracing using Multi-Attribute Open-Set Identification and Verification. - Ajinkya Kulkarni, Sandipana Dowerah, Tanel Alumäe, Mathew Magimai-Doss:

Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion. - Xuanjun Chen, I-Ming Lin, Lin Zhang, Jiawei Du, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang:

Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy. - Adriana Stan, David Combei, Dan Oneata, Horia Cucu:

TADA: Training-free Attribution and Out-of-Domain Detection of Audio Deepfakes. - Viola Negroni, Davide Salvi, Paolo Bestagini, Stefano Tubaro:

Source Verification for Speech Deepfakes. - Anton Firc, Manasi Chhibber, Jagabandhu Mishra, Vishwanath Pratap Singh, Tomi Kinnunen, Kamil Malinka:

STOPA: A Dataset of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution. - Dimitrios Koutsianos, Stavros Zacharopoulos, Yannis Panagakis, Themos Stafylakis:

Synthetic Speech Source Tracing using Metric Learning. - Yang Xiao, Rohan Kumar Das:

Listen, Analyze, and Adapt to Learn New Attacks: An Exemplar-Free Class Incremental Learning Method for Audio Deepfake Source Tracing. - Thien-Phuc Doan, Kihun Hong, Souhwan Jung:

VIB-based Real Pre-emphasis Audio Deepfake Source Tracing. - Jiankun Zhao, Lingwei Meng, Chengxi Deng, Helen Meng, Xixin Wu:

Defending Unauthorized Voice Cloning with Watermark-Aware Codecs. - Nicholas Klein, Hemlata Tak, Elie Khoury:

Open-Set Source Tracing of Audio Deepfake Systems.
Speaker Diarization 1
- Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Díez, Jan Cernocký, Lukás Burget:

Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization. - Prabhav Singh, Jesús Villalba, Najim Dehak:

Count Your Speakers! Multitask Learning for Multimodal Speaker Diarization. - David Palzer, Matthew Maciejewski, Eric Fosler-Lussier:

End-to-End Diarization utilizing Attractor Deep Clustering. - Berkin Durmus, Blaise Munyampirwa, Eduardo Pacheco, Atila Orhon, Andrey Leonov:

SDBench: A Comprehensive Benchmark Suite for Speaker Diarization. - Fengyun Tan, Tao Wei, Kun Zou, Ning Cheng, Shaojun Wang, Jing Xiao:

Enhancing Serialized Output Training for Multi-Talker ASR with Soft Monotonic Alignment and Utterance-level Timestamp. - Shota Horiguchi, Atsushi Ando, Naohiro Tawara, Marc Delcroix:

Pretraining Multi-Speaker Identification for Neural Speaker Diarization.
Multilingual Speech Synthesis and Special Applications 1
- Ki-Joong Kwon, Jun-Ho So, Sang-Hoon Lee:

Parameter-Efficient Fine-Tuning for Low-Resource Text-to-Speech via Cross-Lingual Continual Learning. - Qibing Bai, Sho Inoue, Shuai Wang, Zhongjie Jiang, Yannan Wang, Haizhou Li:

Accent Normalization Using Self-Supervised Discrete Tokens with Non-Parallel Data. - Chang Liu, Zhen-Hua Ling, Yu Gu:

LIST: Language-Independent Speech Token for Multilingual Speech Synthesis with Language Models. - Fatima Naseem, Maham Sajid, Farah Adeeba, Sahar Rauf, Asad Mustafa, Sarmad Hussain:

Developing High-Quality TTS for Punjabi and Urdu: Benchmarking against MMS Models. - Frederik Rautenberg, Fritz Seebauer, Jana Wiechmann, Michael Kuhlmann, Petra Wagner, Reinhold Haeb-Umbach:

Synthesizing Speech with Selected Perceptual Voice Qualities - A Case Study with Creaky Voice. - Christina Tånnander, David House, Jonas Beskow, Jens Edlund:

Intrasentential English in Swedish TTS: perceived English-accentedness.
Characterization and Multimodal Approaches for Speaker Recognition
- Shengyu Peng, Wu Guo, Jie Zhang, Yu Guan, Lipeng Dai, Zuoliang Li:

Parameter-Efficient Fine-tuning with Instance-Aware Prompt and Parallel Adapters for Speaker Verification. - Nathan Griot, Driss Matrouf, Raphaël Blouet, Jean-François Bonastre, Ana Mantecon:

Unified Text and Speaker Verification using SSL model for Text-Dependent Speaker Verification. - Zhaokai Sun, Li Zhang, Qing Wang, Pan Zhou, Lei Xie:

Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM. - N. Shashaank, Xiao Quan, Andrew Kaluzny, Leonard Varghese, Marko Stamenovic, Chuan-Che Huang:

Towards Secure User Authentication for Headphones via In-Ear or In-Earcup Microphones. - Gwangyeol Yu, Junhyeok Lee, Seoryeong Kim, Jimin Lee, Jehyuk Lee:

Mimic Blocker: Self-Supervised Adversarial Training for Voice Conversion Defense with Pretrained Feature Extractors. - Bhasi K. C., Rajeev Rajan:

A Siamese Network-Based Framework for Voice Mimicry Proficiency Assessment Using X-Vector Embeddings. - Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma:

Towards Source Attribution of Singing Voice Deepfake with Multimodal Foundation Models. - Rishabh Ranjan, Ayinala Likhith, Mayank Vatsa, Richa Singh:

Multimodal Zero-Shot Framework for Deepfake Hate Speech Detection in Low-Resource Languages. - Chikara Maeda, Muhammad Shakeel, Yui Sudo:

Joint Target-Speaker ASR and Activity Detection. - Wooil Kim, Bongsu Jung:

DLF-EEND: Dynamic Layer Fusion for End-to-End Speaker Diarization.
Acoustic Analysis and Bioacoustics
- Noumida A, Rajeev Rajan:

Analysis of Avian Biphonic Vocalization Using Computational Modelling. - Xingyuan Li, Kenny Q. Zhu, Mengyue Wu:

Dog2vec: Self-Supervised Pre-Training for Canine Vocal Representation. - Ezhini Rasendiran R, Chandresh Kumar Maurya:

Improving Bird Classification with Primary Color Additives. - Chenhao Wu, Xiangjun Cai, Haojie Zhang, Tianrui Jia, Yilu Deng, Kun Qian, Björn W. Schuller, Yoshiharu Yamamoto, Jiang Liu:

Exploring the Power of Empirical Mode Decomposition for Sensing the Sound of Silence: A Pilot Study on Mice Autism Detection via Ultrasonic Vocalisation. - Yuchen Song, Yucong Zhang, Ming Li:

Exploring Pre-trained models on Ultrasound Modeling for Mice Autism Detection with Uniform Filter Bank and Attentive Scoring. - Zijiang Yang, Meishu Song, Xin Jing, Haojie Zhang, Kun Qian, Bin Hu, Kota Tamada, Toru Takumi, Björn W. Schuller, Yoshiharu Yamamoto:

MADUV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge. - Szymon Szmajdzinski, Juliusz Wójtowicz-Kruk, Ivan Ryzhankow, Lukasz Lazarski, Jakub Zak, Wladyslaw Sredniawa:

Significance of Time-Frequency preprocessing for automatic Ultrasonic Vocalization classification in Autism Spectrum Disorder model detection. - Quentin Le Tellier, Marc Evrard, Albert Rilliard, Jean-Sylvain Liénard:

Robust Vocal Intensity Prediction: Overcoming Dataset Bias with Pretrained Deep Models. - Ryo Terashima, Yuma Shirahata, Masaya Kawamura:

SLASH: Self-Supervised Speech Pitch Estimation Leveraging DSP-derived Absolute Pitch.
Keynote 2 - Alexander Waibel: From Speech Science to Language Transparence
- Alexander Waibel:

From Speech Science to Language Transparence.
Spoken Dialogue Systems 1
- Truong Do, Phuong Minh Nguyen, Le-Minh Nguyen:

PruneSLU: Efficient On-device Spoken Language Understanding through Vocabulary and Structural Pruning. - Haris Gulzar, Monikka Roslianna Busto, Akiko Masaki, Takeharu Eda, Ryo Masumura:

Leveraging LLMs for Written to Spoken Style Data Transformation to Enhance Spoken Dialog State Tracking. - Simon Sedlácek, Bolaji Yusuf, Jan Svec, Pradyoth Hegde, Santosh Kesiraju, Oldrich Plchot, Jan Cernocký:

Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs. - Kiyotada Mori, Seiya Kawano, Chaoran Liu, Carlos Toshinori Ishi, Angel F. Garcia Contreras, Koichiro Yoshino:

What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems. - Minghan Wang, Ye Bai, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari:

SpeechDialogueFactory: A Framework for Natural Speech Dialogue Generation. - Xiaohan Shi, Xingfeng Li, Tomoki Toda:

Who, When, and What: Leveraging the "Three Ws" Concept for Emotion Recognition in Conversation. - Alkis Koudounas, Claudio Savelli, Flavio Giobergia, Elena Baralis:

"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding. - Ebru Arisoy, Merve Ünlü Menevse, Yusufcan Manav, Arzucan Özgür:

Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering. - Maria Teleki, Lingfeng Shi, Chengkai Liu, James Caverlee:

I want a horror - comedy - movie: Slips-of-the-Tongue Impact Conversational Recommender System Performance. - Atsumoto Ohashi, Shinya Iizuka, Jingjing Jiang, Ryuichiro Higashinaka:

Towards a Japanese Full-duplex Spoken Dialogue System.
Speech Assessment
- Chih-Kai Yang, Neo Ho, Yen-Ting Piao, Hung-yi Lee:

SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information. - Guitao Wang, Jinming Zhao, Hao Yang, Guilin Qi, Tongtong Wu, Gholamreza Haffari:

Continual Speech Learning with Fused Speech Features. - Jiatong Shi, Hye-jin Shim, Shinji Watanabe:

Uni-VERSA: Versatile Speech Assessment with a Unified Network. - John Alderete, Macarious Kin Fung Hui, Aanchan Mohan:

Evaluating ASR Robustness to Spontaneous Speech Errors: A Study of WhisperX Using a Speech Error Database. - Tomoya Mizumoto, Atsushi Kojima, Yusuke Fujita, Lianbo Liu, Yui Sudo:

Is Synthetic Data Truly Effective for Training Speech Language Models? - Francesco Verdini, Pierfrancesco Melucci, Stefano Perna, Francesco Cariaggi, Marco Gaido, Sara Papi, Szymon Mazurek, Marek Kasztelnik, Luisa Bentivogli, Sébastien Bratières, Paolo Merialdo, Simone Scardapane:

How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not.
Audio-Visual ASR and Multimodal System
- Julián Zapata, Lara Hanna:

Text Entry for All: Towards Speech-based Multimodal Interaction for Inclusion, Accessibility and the Preservation of the World's Linguistic Heritage. - Umberto Cappellazzo, Minsu Kim, Stavros Petridis, Daniele Falavigna, Alessio Brutti:

Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach. - Thai-Binh Nguyen, Ngoc-Quan Pham, Alexander Waibel:

Cocktail-Party Audio-Visual Speech Recognition. - Zhengyang Li, Pascal Reichert, Thomas Graave, Patrick Blumenberg, Tim Fingscheidt:

Efficient Noise-Robust Hybrid Audiovisual Encoder with Joint Distillation and Pruning for Audiovisual Speech Recognition. - Naoki Makishima, Naotaka Kawata, Taiga Yamane, Mana Ihori, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura:

Unified Audio-Visual Modeling for Recognizing Which Face Spoke When and What in Multi-Talker Overlapped Speech and Video.
Speech and Voice Disorders 1
- Shangkun Huang, Jing Deng, Jintao Kang, Rong Zheng:

Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection. - Zongli Ye, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Haodong Li, Shuhe Li, Chenxu Guo, Anaisha Das, Peter Park, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli:

Seamless Dysfluent Speech Text Alignment for Disordered Speech Analysis. - Jinming Zhang, Xuanru Zhou, Jiachen Lian, Shuhe Li, William Li, Zoe Ezzes, Rian Bogley, Lisa Wauters, Zachary Miller, Jet Vonk, Brittany Morin, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli:

Analysis and Evaluation of Synthetic Data Generation in Speech Dysfluency Detection. - Yeseul Park, Bowon Lee:

Fine-tuning Strategies for Automatic Speech Recognition of Low-Resource Speech with Autism Spectrum Disorder. - Margot Masson, Isabelle Ferrané, Julie Mauclair:

Identification of Pathological Pronunciation Profiles in ASR Transcription Errors. - Hadrien Titeux, Quang Tuan Rémy Nguyen, Andres Gil-Salcedo, Anne-Catherine Bachoud-Lévi, Emmanuel Dupoux:

A simple method for predicting Clinical Scores in Huntington's Disease by leveraging ASR's uncertainty on spontaneous speech. - Itay Ben-Dom, Catherine I. Watson, Clare M. McCann:

Introducing EMOPARKNZ: the Emotional Speech Database from New Zealand English Speakers with Parkinson's Disease. - Naoki Hojo, Ryoichi Takashima, Chihiro Sugiyama, Nobukazu Tanaka, Kanji Nohara, Kazunori Nozaki, Tetsuya Takiguchi:

Revisiting WFST-based Hybrid Japanese Speech Recognition System for Individuals with Organic Speech Disorders.
Multimodal Information Based Speech Processing (MISP) 2025 Challenge
- Longjie Luo, Shenghui Lu, Lin Li, Qingyang Hong:

Pseudo Labels-based Neural Speech Enhancement for the AVSR Task in the MISP-Meeting Challenge. - Ming Gao, Shilong Wu, Hang Chen, Jun Du, Chin-Hui Lee, Shinji Watanabe, Jingdong Chen, Sabato Marco Siniscalchi, Odette Scharenborg:

The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition. - Zhaoyang Li, Haodong Zhou, Longjie Luo, XiaoXiao Li, Yongxin Chen, Lin Li, Qingyang Hong:

Cross-attention and Self-attention for Audio-visual Speaker Diarization in MISP-Meeting Challenge. - Ming Cheng, Fei Su, Cancan Li, Juan Liu, Ming Li:

Multi-Channel Sequence-to-Sequence Neural Diarization: Experimental Results for The MISP 2025 Challenge. - Zeyan Song, Tianchi Sun, Ronghui Hu, Kai Chen, Jing Lu:

Leveraging Self-Supervised Learning Based Speaker Diarization for MISP 2025 AVSD Challenge. - Shangkun Huang, Yuxuan Du, Jingwen Yang, Dejun Zhang, Xupeng Jia, Jing Deng, Jintao Kang, Rong Zheng:

Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge.
Speaker Extraction 1
- Md Asif Jalal, Luca Remaggi, Vasileios Moschopoulos, Thanasis Kotsiopoulos, Vandana Rajan, Karthikeyan Saravanan, Anastasios Drosou, Junho Heo, Hyuk Oh, Seokyeong Jeong:

Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling. - Wang Dai, Archontis Politis, Tuomas Virtanen:

Inter-Speaker Relative Cues for Text-Guided Target Speech Extraction. - Shaole Li, Shuai Wang, Jiangyu Han, Ke Zhang, Wupeng Wang, Haizhou Li:

REAL-T: Real Conversational Mixtures for Target Speaker Extraction. - Zexu Pan, Wupeng Wang, Shengkui Zhao, Chong Zhang, Kun Zhou, Yukun Ma, Bin Ma:

Online Audio-Visual Autoregressive Speaker Extraction. - Zexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma:

Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction.
Low Resource Speech Recognition
- Salvatore Carta, Alessandro Giuliani, Marco Manolo Manca, Mirko Marras, Leonardo Piano:

SardinianVoxes: A Speech Recognition Dataset for the Sardinian Languages. - Griffin Dietz Smith, Dianna Yee, Jennifer King Chen, Leah Findlater:

Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection. - Bowen Zhang, Nur Afiqah Abdul Latiff, Justin Kan, Rong Tong, Donny Soh, Xiaoxiao Miao, Ian McLoughlin:

Automated evaluation of children's speech fluency for low-resource languages. - King Yiu Suen, Rudolf Chow, Albert Y. S. Lam:

Cantonese Punctuation Restoration using LLM Annotated Data. - David Sasu, Benedict Quartey, Kweku Andoh Yamoah, Natalie Schluter:

Enhancing Speech Instruction Understanding and Disambiguation in Robotics via Speech Prosody. - Abhijit Sinha, Hemant Kumar Kathania, Mikko Kurimo:

Beyond Traditional Speech Modifications: Utilizing Self Supervised Features for Enhanced Zero-Shot Children ASR. - Nicol Visser, Herman Kamper:

Spoken Language Modeling with Duration-Penalized Self-Supervised Units.
Computational Resource Constrained ASR
- Zhaoqing Li, Haoning Xu, Zengrui Jin, Lingwei Meng, Tianzi Wang, Huimeng Wang, Youjun Chen, Mingyu Cui, Shujie Hu, Xunying Liu:

Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision. - Zhaoqing Li, Haoning Xu, Xurong Xie, Zengrui Jin, Tianzi Wang, Xunying Liu:

Unfolding A Few Structures for The Many: Memory-Efficient Compression of Conformer and Speech Foundation Models. - Haoning Xu, Zhaoqing Li, Youjun Chen, Huimeng Wang, Guinan Li, Mengzhe Geng, Chengxi Deng, Xunying Liu:

Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates. - Tianteng Gu, Bei Liu, Haoyu Wang, Yanmin Qian:

Ultra-Low Bit Post-Training Quantization of Large Speech Models via K-Means Clustering and Mixed Precision Allocation. - Masao Someki, Shikhar Bharadwaj, Atharva Anand Joshi, Chyi-Jiunn Lin, Jinchuan Tian, Jee-weon Jung, Markus Müller, Nathan Susanj, Jing Liu, Shinji Watanabe:

Context-Driven Dynamic Pruning for Large Speech Foundation Models. - Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter:

Analyzing the Importance of Blank for CTC-Based Knowledge Distillation. - Seraphina Fong, Marco Matassoni, Alessio Brutti:

Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages.
Speech and Language Technology for Health Applications
- Yue Pan, Liwei Liu, Changxin Li, Xingyao Wang, Yili Xia, Hanyue Zhang, Ming Chu:

A Chinese Heart Failure Status Speech Database with Universal and Personalised Classification. - Harish Battula, Gauri Deshpande, Yagna Gudipalli, Sachin Patel:

Heart Rate as a Proxy Measure to Assess Human Confidence in Spoken Speech. - Jingping Nie, Tien Dung Tran, Karan Thakkar, Vasudha Kowtha, Jon Huang, Carlos Avendaño, Erdrin Azemi, Vikramjit Mitra:

Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation. - Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Santanu Roy, Arun Balaji Buduru, Rajesh Sharma:

Towards Fusion of Neural Audio Codec-based Representations with Spectral for Heart Murmur Classification via Bandit-based Cross-Attention Mechanism. - Yizhou Chen, Xiyu Wu:

Perception of Emotional Speech by Individuals with High Borderline Personality Features. - Agata Sage, Zuzanna Miodonska, Michal Krecichwost, Ewa Kwasniok, Pawel Badura:

Visual features of the oral region in Polish sibilants produced by children with various sibilance patterns. - Roseline Polle, Agnes Norbury, Alexandra Livia Georgescu, Nicholas Cummins, Stefano Goria:

Meta-Learning Approaches for Speaker-Dependent Voice Fatigue Models. - Ning Wang, Bingyang Wen, Minghui Wu, Yang Sun, Zongru Shao, Haojie Zhou, K. P. Subbalakshmi:

Decoding Alzheimer's: Interpretable Visual and Logical Attention in Picture Description Tasks.
Responsible Speech Foundation Models + SUPERB Challenge
- Antonios Alexos, Raghuveer Peri, Sai Muralidhar Jayanthi, Metehan Cekic, Srikanth Vishnubhotla, Kyu J. Han, Srikanth Ronanki:

Defending Speech-enabled LLMs Against Adversarial Jailbreak Threats. - Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee:

Mitigating Subgroup Disparities in Multi-Label Speech Emotion Recognition: A Pseudo-Labeling and Unsupervised Learning Approach. - Dariia Puhach, Amir H. Payberah, Éva Székely:

Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM. - Mengzhe Geng, Patrick Littell, Aidan Pine, Robbie Jimerson, Gilles Boulianne, Vishwa Gupta, Rolando Coto-Solano, Anna Kazantseva, Marc Tessier, Delaney Lothian, Akwiratékha' Martin, Eric Joanis, Samuel Larkin, Roland Kuhn:

Evaluating Speech Foundation Models for Automatic Speech Recognition in the Low-Resource Kanyen'kéha Language. - Debarpan Bhattacharya, Apoorva Kulkarni, Sriram Ganapathy:

Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning. - Chun-Yi Kuan, Hung-yi Lee:

Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples. - Ke-Han Lu, Chun-Yi Kuan, Hung-yi Lee:

Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models. - Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul:

Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models. - Qingzheng Wang, Jiancheng Sun, Yifan Peng, Shinji Watanabe:

Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC. - William Chen, Chutong Meng, Jiatong Shi, Martijn Bartelds, Shih-Heng Wang, Hsiu-Hsuan Wang, Rafael Mosquera, Sara Hincapie, Dan Jurafsky, Antonis Anastasopoulos, Hung-yi Lee, Karen Livescu, Shinji Watanabe:

The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties. - Tanel Alumäe, Artem Fedorchenko:

TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge.
Dysarthric Speech Assessment 1
- Tao Zhong, Mengzhe Geng, Shujie Hu, Guinan Li, Xunying Liu:

Regularized Federated Learning for Privacy-Preserving Dysarthric and Elderly Speech Recognition. - Yejin Jeon, Solee Im, Youngjae Kim, Gary Geunbae Lee:

Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning. - Xueyuan Chen, Dongchao Yang, Wenxuan Wu, Minglin Wu, Jing Xu, Xixin Wu, Zhiyong Wu, Helen Meng:

DiffDSR: Dysarthric Speech Reconstruction Using Latent Diffusion Model. - Shoutrik Das, Nishant Singh, Arjun Gangwar, S. Umesh:

Improved Intelligibility of Dysarthric Speech using Conditional Flow Matching. - Ahmed Aboeitta, Ahmed Sharshar, Youssef Nafea, Shady Shehata:

Bridging ASR and LLMs for Dysarthric Speech Recognition: Benchmarking Self-Supervised and Generative Approaches. - Chin-Jou Li, Eunjung Yeo, Kwanghee Choi, Paula Andrea Pérez-Toro, Masao Someki, Rohan Kumar Das, Zhengjun Yue, Juan Rafael Orozco-Arroyave, Elmar Nöth, David R. Mortensen:

Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages. - Yan Xiong, Visar Berisha, Julie Liss, Chaitali Chakrabarti:

Mitigating Overfitting During Speech Foundation Model Fine-tuning: Applications to Dysarthric Speech Detection. - Seohyun Park, Chitralekha Gupta, Michelle Kah Yian Kwan, Xinhui Fung, Alexander Wenjun Yip, Suranga Nanayakkara:

Towards Temporally Explainable Dysarthric Speech Clarity Assessment.
Show and Tell 2: Speech Synthesis
- Vishal Gourav, Phanindra Mankale:

Code Mix TTS: An Approach to Infer Human Like Speech for Multi-Lingual Input Texts. - Binh Nguyen, Thai Le:

Turing's Echo: Investigating Linguistic Sensitivity of Deepfake Voice Detection via Gamification. - Namhyun Cho, Sunmin Kim, Minsu Kang, Seolhee Lee, Choonghyeon Lee, Yangsun Lee:

Unleashing the Inner Monster: Demonstrating High-Fidelity Human to Non-Human Voice Conversion. - Victor Shepardson, Jonathan Reus, Thor Magnusson:

Tungnaá In Live Performance: An Implementation Of Interactive Artistic Text-To-Voice. - Shree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely:

Hear Me Out: Interactive evaluation and bias discovery platform for speech-to-speech conversational AI. - Takayuki Arai:

Vocal-tract model with two directions: Static design for a dummy head and dynamic design for a speaking machine.
Databases and Progress in Methodology
- Arnav Rustagi, Satvik Bajpai, Nimrat Kaur, Siddharth:

Dhvani: A Weakly-supervised Phonemic Error Detection and Personalized Feedback System for Hindi. - Alexandra Fort, Francis Tyers:

Evaluating Wav2Vec2-Bert for Computer-Assisted Pronunciation Training for isiZulu. - Lubos Marcinek, Jonas Beskow, Joakim Gustafson:

Towards Adaptable and Intelligible Speech Synthesis in Noisy Environments. - Lidea Shahidi, Erdem Baha Topbas, Thu Ngan Dang, Tobias Goehring:

Harnessing Text-to-Speech Voice Cloning Models for Improved Audiological Speech Assessment. - Xuan Shi, Yubin Zhang, Yijing Lu, Marcus Ma, Tiantian Feng, Asterios Toutios, Haley Hsu, Louis Goldstein, Shrikanth Narayanan:

75-Speaker Annot-16: A benchmark dataset for speech articulatory rt-MRI annotation with articulator contours and phonetic alignment. - Greta Tuckute, Klemen Kotar, Evelina Fedorenko, Daniel Yamins:

Representing Speech Through Autoregressive Prediction of Cochlear Tokens. - Chanwoo Park, Anna Seo Gyeong Choi, Sunghye Cho, Chanwoo Kim:

Reasoning-Based Approach with Chain-of-Thought for Alzheimer's Detection Using Speech and Large Language Models. - Linda Bakkouche, Charles McGhee, Emily Lau, Stephanie Cooper, Xinbing Luo, Madeleine Rees, Kai Alter, Brechtje Post, Julia Schwarz:

Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity Ratings. - Kentaro Onda, Keisuke Imoto, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu:

Prosodically Enhanced Foreign Accent Simulation by Discrete Token-based Resynthesis Only with Native Speech Corpora.
Novel Architectures for ASR
- Enes Yavuz Ugan, Ngoc-Quan Pham, Alexander Waibel:

Weight Factorization and Centralization for Continual Learning in Speech Recognition. - Chenxu Guo, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Shuhe Li, Zongli Ye, Peter Park, Anaisha Das, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli:

Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection. - I-Ting Hsieh, Chung-Hsien Wu:

Dysarthric Speech Recognition Using Curriculum Learning and Multi-stream Architecture. - Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, Chyi-Jiunn Lin, Shinji Watanabe:

DYNAC: Dynamic Vocabulary-based Non-Autoregressive Contextualization for Speech Recognition. - Hojun Jin, Eunsoo Hong, Ziwon Hyung, Sungjun Lim, Seungjin Lee, Keunseok Cho:

Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts. - Yifan Peng, Muhammad Shakeel, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, Shinji Watanabe:

OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning.
Deepfake Detection
- Kwok Chin Yuen, Jia Qi Yip, Zhen Qiu, Chi-Hung Chi, Kwok-Yan Lam:

Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems. - Yassine El Kheir, Tim Polzehl, Sebastian Möller:

BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention. - Neta Glazer, David Chernin, Idan Achituve, Sharon Gannot, Ethan Fetaya:

Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes. - Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Adriana Stan, Aditya Tirumala Bukkapatnam, Karla Pizzi, Alexander Wagner, Philip Sperl:

Replay Attacks Against Audio Deepfake Detection. - Seung-bin Kim, Hyun-seo Shin, Jungwoo Heo, Chan-yeong Lim, Kyo-Won Koo, Jisoo Son, Sanghyun Hong, Souhwan Jung, Ha-Jin Yu:

Enhancing Audio Deepfake Detection by Improving Representation Similarity of Bonafide Speech. - Mingru Yang, Yanmei Gu, Qianhua He, Yanxiong Li, Peirong Zhang, Yongqiang Chen, Zhiming Wang, Huijia Zhu, Jian Liu, Weiqiang Wang:

Generalizable Audio Deepfake Detection via Hierarchical Structure Learning and Feature Whitening in Poincaré sphere.
Tools for Speech Analysis
- Kun Jin, Siva Penke, Srinivasa Algubelli:

VoiceNet: Multilingual On-Device Phoneme-To-Audio Alignment. - Maya Dewhurst, Jack Collins, Justin J. H. Lo, Roy Alderton, Sam Kirkham:

Nosey: Open-Source Hardware for Acoustic Nasalance. - James Tanner, Morgan Sonderegger, Jane Stuart-Smith, Jeff Mielke, Tyler Kendall:

Automatic classification of stop realisation with wav2vec2.0.
Text Processing and Evaluation for Speech Synthesis 1
- Siqi Sun, Korin Richmond:

Acquiring Pronunciation from Speech Audio via Multi-task Learning. - Sujoy Roychowdhury, Ranjani H. G., Sumit Soman, Nishtha Paul, Subhadip Bandyopadhyay, Siddhanth Iyengar:

Intelligibility of Text-to-Speech Systems for Mathematical Expressions. - Praveen Srinivasa Varadhan, Sherry Thomas, Sai Teja M. S., Suvrat Bhooshan, Mitesh M. Khapra:

The State Of TTS: A Case Study with Human Fooling Rates. - Jinzuomu Zhong, Suyuan Liu, Dan Wells, Korin Richmond:

Pairwise Evaluation of Accent Similarity in Speech Synthesis. - Harm Lameris, Joakim Gustafsson, Éva Székely:

VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech. - Michael Kuhlmann, Fritz Seebauer, Petra Wagner, Reinhold Haeb-Umbach:

Towards Frame-level Quality Predictions of Synthetic Speech.
Segmental and Tonal Units
- C. T. Justine Hui, Jenice Kuzhikombil, Isabella Shields, Hiraia Haami-Wells, Catherine I. Watson, Peter J. Keegan:

Perception of Long and Short Vowel Contrast in Te Reo Māori in Clean and Everyday Listening Environments. - Patrik Hrabánek, Michaela Watkins, Silke Hamann:

The function of creaky voice in South Korean: A perception study. - Mingxi Lu, Ran Tao, Yujia Tian:

Talker Normalization in Chinese Bilinguals: A Comparative Study. - Terumichi Ariga:

Coping with segmental-prosodic incongruity in spoken word recognition in Japanese. - Saskia Wepner, Lucas Eckert, Gernot Kubin, Barbara Schuppler:

What the Filler? Both ASR Systems and Humans Struggle More With Other Kinds of Disfluencies Than With Filler Particles.
Speech Quality Assessment
- Danilo de Oliveira, Julius Richter, Jean-Marie Lemercier, Simon Welker, Timo Gerkmann:

Non-intrusive Speech Quality Assessment with Diffusion Models Trained on Clean Speech. - Wafaa Wardah, Robert P. Spang, Vincent Barriac, Jan Reimes, Anna Llagostera, Jens Berger, Sebastian Möller:

SQ-AST: A Transformer-Based Model for Speech Quality Prediction. - Imran E. Kibria, Donald S. Williamson:

AttentiveMOS: A Lightweight Attention-Only Model for Speech Quality Prediction. - Yu-Fei Shi, Yang Ai, Zhen-Hua Ling:

Universal Preference-Score-based Pairwise Speech Quality Assessment. - Enjamamul Hoq, Nikhil Gupta, Danielle Omondi, Ifeoma Nwogu:

FUSE-MOS: Fusion of Speech Embeddings for MOS Prediction with Uncertainty Quantification. - Wen-Chin Huang, Erica Cooper, Tomoki Toda:

SHEET: A Multi-purpose Open-source Speech Human Evaluation Estimation Toolkit.
Speech Enhancement
- Haici Yang, Gordon Wichern, Ryo Aihara, Yoshiki Masuyama, Sameer Khurana, François G. Germain, Jonathan Le Roux:

Investigating continuous autoregressive generative speech enhancement. - Venkatesh Parvathala, K. Sri Rama Murty:

Dynamic Layer Gating for Speech Enhancement. - Saisamarth Rajesh Phaye, Milos Cernak, Andrew Harper:

Model as Loss: A Self-Consistent Training Paradigm. - Avishkar Behera, Riya Ann Easow, Venkatesh Parvathala, K. Sri Rama Murty:

Test-Time Training for Speech Enhancement. - Seungu Han, Sungho Lee, Juheon Lee, Kyogu Lee:

Few-step Adversarial Schrödinger Bridge for Generative Speech Enhancement. - Venkatesh Parvathala, Ramesh Gundluru, Sreekanth Sankala, K. Sri Rama Murty:

Exploiting Bispectral Features for Single-Channel Speech Enhancement.
Language Learning and Assessment
- Olli Kuparinen:

Automatic Dialectal Transcription: An Evaluation on Finnish and Norwegian. - Wieke Harmsen, Roeland van Hout, Catia Cucchiarini, Helmer Strik:

Can ASR generate valid measures of child reading fluency? - Chowdam Venkata Thirumala Kumar, Chiranjeevi Yarra:

SGED-Probe: Probing E2E ASR decoder and aligner for spoken grammar error detection under three speaking practice conditions. - Aditya Kamlesh Parikh, Cristian Tejedor García, Catia Cucchiarini, Helmer Strik:

Evaluating Logit-Based GOP Scores for Mispronunciation Detection. - Yassine El Kheir, Omnia Ibrahim, Amit Meghanani, Nada Almarwani, Hawau Olamide Toyin, Sadeen Alharbi, Modar Alfadly, Lamya Alkanhal, Ibrahim Selim, Shehab Elbatal, Salima Mdhaffar, Thomas Hain, Yasser Hifny, Mostafa Shahin, Ahmed Ali:

Towards a Unified Benchmark for Arabic Pronunciation Assessment: Qur'anic Recitation as Case Study. - Wen-Wei Hsieh, Hao-Wei Chi, Kuan-Chen Wang, Ping-Cheng Yeh, Te-Hsin Liu, Chen-Yu Chiang:

OMPAL: Bridging Speech and Learning with an Open-Source Mandarin Pronunciation Assessment Corpus for Global Learners. - Haopeng Geng, Daisuke Saito, Nobuaki Minematsu:

A Perception-Based L2 Speech Intelligibility Indicator: Leveraging a Rater's Shadowing and Sequence-to-sequence Voice Conversion. - Sehyun Oh, Sunhee Kim, Minhwa Chung:

Multimodal and Multitask Learning for Predicting Multiple Scores in L2 English Speech. - Jingran Xie, Xiang Li, Hui Wang, Yue Yu, Yang Xiang, Xixin Wu, Zhiyong Wu:

Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving. - Nhan Phan, Mikko Kuronen, Maria Kautonen, Riikka Ullakonoja, Anna von Zansen, Yaroslav Getman, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo:

Mispronunciation Detection Without L2 Pronunciation Dataset in Low-Resource Setting: A Case Study in Finland Swedish.
Speech Synthesis Paradigms and Methods 1
- Hyun Joon Park, Jeongmin Liu, Jin Sob Kim, Jeong Yeol Yang, Sung Won Han, Eunwoo Song:

RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching. - Qixi Zheng, Yushen Chen, Zhikang Niu, Ziyang Ma, Xiaofei Wang, Kai Yu, Xie Chen:

Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling. - Changfeng Gao, Zhihao Du, Shiliang Zhang:

Differentiable Reward Optimization for LLM based TTS system. - Zhipeng Li, Xiaofen Xing, Jingyuan Xing, Hangrui Hu, Heng Lu, Xiangmin Xu:

Long-Context Speech Synthesis with Context-Aware Memory. - Yike Zhang, Yiming Li, Jie Chen, Qinghua Wu, Songjun Cao, Long Ma:

Monotonic Attention for Robust Text-to-Speech Synthesis in Large Language Model Frameworks. - Ye-Xin Lu, Hui-Peng Du, Fei Liu, Yang Ai, Zhen-Hua Ling:

Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising. - Frank Zalkow, Paolo Sani, Kishor Kayyar Lakshminarayana, Emanuël A. P. Habets, Nicola Pia, Christian Dittmar:

Bridging the Training-Inference Gap in TTS: Training Strategies for Robust Generative Postprocessing for Low-Resource Speakers. - Chunhui Lu, Xue Wen, Liming Song, Junkwang Oh:

Robust Neural Codec Language Modeling with Phoneme Position Prediction for Zero-Shot TTS.
Spatial Audio and Acoustics 2
- Roland Hartanto, Sakriani Sakti, Koichi Shinoda:

SepVAC: Multitask Learning of Speaker Separation, Speaker Localization, Microphone Array Localization, and Room Acoustic Parameter Estimation in Various Acoustic Conditions. - Junhui Zhao, Hang Chen, Qing Wang, Jun Du, Yanhui Tu, Feng Ma:

TA-RIR: Topology-Aware Neural Modeling of Acoustic Propagation for Room Impulse Response Synthesis. - Hyun-Soo Kim, Da-Hee Yang, Joon-Hyuk Chang:

Spatially Weighted Contrastive Learning for Robust Sound Source Localization. - Yiyuan Yang, Shitong Xu, Niki Trigoni, Andrew Markham:

Efficient and Microphone-Fault-Tolerant 3D Sound Source Localization. - De Hu, Shuyao Liu, Yanrong He:

Joint Reference Microphone Selection and Filter Order Determination in Multi-channel Active Noise Control. - Liang Tao, Maoshen Jia, Yonggang Hu:

Direct-path Relative Harmonic Coefficients Detection for Multi-source Direction-of-Arrival Estimation in Reverberant Environments. - Junsheng Hu, Shaojie Li, Qintuya Si, De Hu:

D-GAT: Dual Graph Attention Network for Global HRTF Interpolation. - Mateusz Guzik, Giulio Cengarle, Daniel Arteaga:

Deep learning based spatial aliasing reduction in beamforming for audio capture. - Xiaoming Zhang, Ke-Yue Zhang, Taiping Yao, Songjun Cao, Shouhong Ding, Long Ma:

SonarGuard2: Ultrasonic Face Liveness Detection Based on Adaptive Doppler Effect Feature Extraction.
Text Processing and Evaluation for Speech Synthesis 2
- Hien Ohnaka, Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto:

Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning. - Noe Berger, Siqi Sun, Korin Richmond:

Non-Standard Accent TTS Support via Large Multi-Accent Frontend Pronunciation Knowledge Transfer. - Timothy Shin Heng Mak, King Yiu Suen, Albert Y. S. Lam:

Speech-guided Grapheme-to-Phoneme Conversion for Cantonese Text-to-Speech. - Rui Hu, Xiaolong Lin, Jiawang Liu, Shixi Huang, Zhenpeng Zhan:

Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation. - Sébastien Le Maguer, Gwénolé Lecorvé, Damien Lolive, Naomi Harte, Juraj Simko:

Enabling the replicability of speech synthesis perceptual evaluations. - Natacha Miniconi, Meysam Shamsi, Anthony Larcher:

When The MOS Predictor Asks For Training Annotation In Cross Lingual/Domain Adaptation. - Ryo Setoguchi, Yoshiko Arimoto:

Assessment of the synthetic quality and controllability of laughing onset in speech-laugh synthesis.
General Topics in ASR
- Nick Rossenbach, Benedikt Hilmes, Leon Brackmann, Moritz Gunz, Ralf Schlüter:

Running Conventional Automatic Speech Recognition on Memristor Hardware: A Simulated Approach. - Ke Hu, Krishna C. Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, Boris Ginsburg:

Word Level Timestamp Generation for Automatic Speech Recognition and Translation. - Ju Lin, Yiteng Huang, Ming Sun, Frank Seide, Florian Metze:

Directional Speech Recognition with Full-Duplex Capability. - Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda:

CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models. - Hongfei Xue, Yufeng Tang, Jun Zhang, Xuelong Geng, Lei Xie:

Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty.
Acoustic Event Detection and Classification
- James Taylor, Wolfgang Mack:

Improving Audio Classification by Transitioning from Zero- to Few-Shot. - Kohei Uehara, Ryoichi Takashima, Tetsuya Takiguchi:

Zero-Shot Learning for Acoustic Event Classification Using an Attribute Vector and Conditional GAN. - Lipeng Dai, Qing Wang, Jie Zhang, Shengyu Peng, Yu Guan, Wu Guo:

Leveraging Multi-Level Features of ATST with Conformer-Based Dual-Branch Network for Sound Event Detection. - Tatsuya Komatsu, Hokuto Munakata, Yuchi Ishikawa:

Leveraging Unlabeled Audio for Audio-Text Contrastive Learning via Audio-Composed Text Features. - Yuchi Ishikawa, Shota Nakada, Hokuto Munakata, Kazuhiro Saito, Tatsuya Komatsu, Yoshimitsu Aoki:

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos. - Yusuke Fujita, Tomoya Mizumoto, Atsushi Kojima, Lianbo Liu, Yui Sudo:

AC/DC: LLM-based Audio Comprehension via Dialogue Continuation. - Yawei Wang, Qiaoling Zhang, Yi Zhang, Junyao Hu:

Anomalous Sound Detection Based Feature Fusion and Dual-path Non-linear Independent Components Estimation. - Nan Jiang, Yan Song, Qing Gu, Haoyu Song, Lirong Dai, Ian McLoughlin:

An Effective Anomalous Sound Detection Method Based on Global and Local Attribute Mining. - Long-Vu Hoang, Tuan Nguyen, Huy Dat Tran:

Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment.
Keyword Spotting and Retrieval
- Anup Singh, Kris Demuynck, Vipul Arora:

Language-Agnostic Speech Tokenizer for Spoken Term Detection with Efficient Retrieval. - Akanksha Singh, Yi-Ping Phoebe Chen, Vipul Arora:

H-QuEST: Accelerating Query-by-Example Spoken Term Detection with Hierarchical Indexing. - Ruofan Hu, Yan Xia, Minjie Hong, Jieming Zhu, Bo Chen, Xiaoda Yang, Minghui Fang, Tao Jin:

Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval. - Youngmoon Jung, Yong-Hyeok Lee, Myunghun Jung, Jaeyoung Roh, Chang Woo Han, Hoon-Young Cho:

Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting. - Minghui Fang, Shengpeng Ji, Jialong Zuo, Xize Cheng, Wenrui Liu, Xiaoda Yang, Ruofan Hu, Jieming Zhu, Zhou Zhao:

GTA: Towards Generative Text-To-Audio Retrieval via Multi-Scale Tokenizer. - Changin Choi, Sungjun Lim, Wonjong Rhee:

Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning. - Ruochu Yang, Milind Rao, Harshavardhan Sundar, Anirudh Raju, Aparna Khare, Srinath Tankasala, Di He, Venkatesh Ravichandran:

On Retrieval of Long Audios with Complex Text Queries. - Jin-Gyo Lim, Seong-Eun Kim:

SIDC-KWS: Efficient Spiking Inception-Dilated Conformer with Self-Attention for Keyword Spotting. - Dzmitry Saladukha, Ivan Koriabkin, Kanstantsin Artsiom, Aliaksei Rak, Nikita Ryzhikov:

Multichannel Keyword Spotting for Noisy Conditions. - Pai Zhu, Quan Wang, Dhruuv Agarwal, Kurt Partridge:

LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting. - Harry Zhang, Kurt Partridge, Pai Zhu, Neng Chen, Hyun Jin Park, Dhruuv Agarwal, Quan Wang:

GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples. - Firoj Alam, Md. Arid Hasan, Shammur Absar Chowdhury:

SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs.
Multimodal Systems
- Sun-Kyung Lee, Jong-Hwan Kim:

CAMER: Contribution-Aware Multimodal Emotion Recognition. - Jiajun He, Jinyi Mi, Tomoki Toda:

GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning Constraints. - Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Abu Osama Siddiqui, Sarthak Jain, Priyabrata Mallick, Jaya Sai Kiran Patibandla, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma:

SNIFR : Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer. - Zehua Liu, Xiaolou Li, Chen Chen, Lantian Li, Dong Wang:

CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge. - Abdul Hannan, Muhammad Arslan Manzoor, Shah Nawaz, Muhammad Irzam Liaqat, Markus Schedl, Mubashir Noman:

PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association. - Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Zelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg:

Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model. - Ziqian Wang, Xianjun Xia, Xinfa Zhu, Lei Xie:

U-SAM: An Audio Language Model for Unified Speech, Audio, and Music Understanding. - Yun Tang, Eesung Kim, Vijendra Raj Apsingekar:

Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data. - Yi Wang, Oli Danyi Liu, Peter Bell:

The role of audio-visual integration in the time course of phonetic encoding in self-supervised speech models.
Dysarthric Speech Assessment 2
- Éva Székely, Péter Mihajlik, Máté Soma Kádár, László Tóth:

Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive Communication. - Dimme de Groot, Tanvina Patel, Devendra Kayande, Odette Scharenborg, Zhengjun Yue:

Objective and Subjective Evaluation of Diffusion-Based Speech Enhancement for Dysarthric Speech. - Minseop Kim, Minsu Han, Seokyoung Hong, Myoung-wan Koo:

Data Augmentation using Speech Synthesis for Speaker-Independent Dysarthria Severity Classification. - Anuprabha M, Krishna Gurugubelli, Anil Kumar Vuppala:

Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS. - Jingting Li, Keyi Feng, Xinran Zhao, Yan Wang, Su-Jing Wang:

Synthetic Dysarthric Speech: A Supplement, Not a Substitute for Authentic Data in Dysarthric Speech Recognition. - Karl El Hajal, Enno Hermann, Sevada Hovsepyan, Mathew Magimai-Doss:

Unsupervised Rhythm and Voice Conversion to Improve ASR on Dysarthric Speech.
Dialect Identification in Different Languages
- Lorenz Gutscher, Michael Pucher:

Audio-Based Classification and Geographic Regression of Austrian Dialects. - Saurabh Kumar, Amartyaveer, Prasanta Kumar Ghosh:

Jointly Improving Dialect Identification and ASR in Indian Languages using Multimodal Feature Fusion. - Haroun Elleuch, Salima Mdhaffar, Yannick Estève, Fethi Bougares:

ADI-20: Arabic Dialect Identification dataset and models. - Lea Fischbach, Akbar Karimi, Caroline Kleen, Alfred Lameli, Lucie Flek:

Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion. - Phoebe Parsons, Heming Strømholt Bremnes, Knut Kvale, Torbjørn Svendsen, Giampiero Salvi:

Effects of Prosodic Information on Dialect Classification Using Whisper Features. - Badr M. Abdullah, Matthew Baas, Bernd Möbius, Dietrich Klakow:

Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect Identification.
Connecting Speech Science and Speech Technology for Children's Speech
- Xulin Fan, Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain:

Band-Split Self-supervised Mamba for Infant-centered Audio Analysis. - Nina R. Benway, Saba Tabatabaee, Benjamin Munson, Jonathan Preston, Carol Y. Espy-Wilson:

Subtyping Speech Errors in Childhood Speech Sound Disorders with Acoustic-to-Articulatory Speech Inversion. - Amanda Eads, Heather Kabakoff, Nina Benway, Elaine Hitchcock, Jonathan L. Preston, Tara McAllister:

PERCEPT-US: A Multimodal American English Child Speech Corpus Specialized for Articulatory Feedback. - Ajinkya Kulkarni, Francisco Teixeira, Enno Hermann, Thomas Rolland, Isabel Trancoso, Mathew Magimai-Doss:

Children's Voice Privacy: First Steps and Emerging Challenges. - Saba Tabatabaee, Jing Liu, Carol Y. Espy-Wilson:

FT-Boosted SV: Towards Noise Robust Speaker Verification for English Speaking Classroom Environments. - Zhonghao Shi, Xuan Shi, Anfeng Xu, Tiantian Feng, Harshvardhan Srivastava, Shrikanth Narayanan, Maja J. Mataric:

Examining Test-Time Adaptation for Personalized Child Speech Recognition. - Theo Zhang, Madurya Suresh, Anne Warlaumont, Kasia Hitczenko, Alejandrina Cristià, Margaret Cychosz:

Employing self-supervised learning models for cross-linguistic child speech maturity classification. - Ankita, Shambhavi, Syed Shahnawazuddin:

On Enhancing the Performance of Children's ASR Task in Limited Data Scenario. - Tiantian Feng, Anfeng Xu, Xuan Shi, Somer Bishop, Shrikanth Narayanan:

Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational Modeling. - Anfeng Xu, Tiantian Feng, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth Narayanan:

Large Language Models based ASR Error Correction for Child Conversations. - Tarek Kunze, Marianne Métais, Hadrien Titeux, Lucas Elbert, Joseph Coffey, Emmanuel Dupoux, Alejandrina Cristià, Marvin Lavechin:

Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier. - Lingyun Gao, Cristian Tejedor García, Catia Cucchiarini, Helmer Strik:

Improving Child Speech Recognition and Reading Mistake Detection by Using Prompts. - Jazmín Vidal, Luciana Ferrer, Juan Esteban Kamienkowski, Pablo Riera:

Improving Automatic Speech Recognition for Children's Reading Assessment with Disfluency-aware Language Models. - Sneha Raman, Preeti Rao:

Oral Reading Errors by Grade 3 Children in Indian Schools: A Hindi-English Perspective. - Christopher Gebauer, Lars Rumberg, Lars Köhn, Hanna Ehlert, Edith Beaulac, Jörn Ostermann:

Grammatical Error Detection on Spontaneous Children's Speech Using Iterative Pseudo Labeling. - Koharu Horii, Naohiro Tawara, Atsunori Ogawa, Shoko Araki:

Why is children's ASR so difficult? Analyzing children's phonological error patterns using SSL-based phoneme recognizers. - Darline Monika Marx, Marco Matassoni, Alessio Brutti:

Automatic detection of speech sound disorders in German-speaking children: augmenting the data with typically developed speech. - Edem Ahadzi, Vishwanath Pratap Singh, Tomi Kinnunen, Ville Hautamäki:

Continuous Learning for Children's ASR: Overcoming Catastrophic Forgetting with Elastic Weight Consolidation and Synaptic Intelligence. - Thomas Rolland, Alberto Abad:

Exploring Shared-Weight Mechanisms in Transformer and Conformer Architectures for Automatic Speech Recognition. - Karen Rosero, Ali N. Salman, Shreeram Suresh Chandra, Berrak Sisman, Cortney Van't Slot, Alex A. Kane, Rami R. Hallac, Carlos Busso:

Advancing Pediatric ASR: The Role of Voice Generation in Disordered Speech. - Natarajan Balaji Shankar, Zilai Wang, Kaiyuan Zhang, Mohan Shi, Abeer Alwan:

CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASR. - Vishwanath Pratap Singh, Md. Sahidullah, Tomi Kinnunen:

Causal Structure Discovery for Error Diagnostics of Children's ASR.
Brain and Cognition
- Omer Moussa, Mariya Toneva:

Brain-tuned Speech Models Better Reflect Speech Processing Stages in the Brain. - Rini A. Sharon, Hema A. Murthy:

Enhancing Syllabic Recognition via Speech-EEG Phase Analysis and Non-Activity State Modeling. - Saravanakumar Duraisamy, Maurice Rekrut, Luis A. Leiva:

Functional Connectivity and Hilbert-Based Features for Covert Speech EEG Variability Analysis and Classification. - Siavash Shams, Richard J. Antonello, Gavin Mischler, Stephan Bickel, Ashesh D. Mehta, Nima Mesgarani:

Neuro2Semantic: A Transfer Learning Framework for Semantic Reconstruction of Continuous Language from Human Intracranial EEG. - Gabriel Ivucic, Saurav Pahuja, Dashanka De Silva, Tanja Schultz:

Selective Auditory Attention Decoding in Naturalistic Conversations Using EEG-Based Speech Envelope Tracking in Multi-Speaker Environments. - Mohammed Salah Al-Radhi, Géza Németh, Branislav Gerazov:

MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction.
Regional, Social and Diachronic Variation
- Gustavo Silveira, Aviad Albert, Martine Grice:

Probing Prosodic Differences Between Two Regional Varieties of Brazilian Portuguese. - Gilly Marchini, Jeremy Steffman:

Data-driven approaches to pitch modelling in two Mexican Spanish ethnolects: K-means Clustering & GAMMs. - Anisia Popescu, Lori Lamel, Marc Evrard, Ioana Vasilescu:

Tracking /r/ Deletion: Forced Alignment of Pronunciation Variants and Sociophonetic Insights into Post-Obstruent Final /r/ in French. - Lilian von Bressensdorf, Pia Greca, Jonathan Harrington:

Agent-based modelling, sound change, and metaphony in Southern Italian varieties of Italo-Romance. - John McGahay:

Modeling Vowel System Typology Using Iterated Confusion Minimization. - Bingliang Zhao, Xiyu Wu:

Investigating Glottal Stop Coda Loss During Sound Change of Checked Syllables Based on Speech-EGG Voice Offset Alignment.
Speaker Extraction 2
- Thomas Serre, Mathieu Fontaine, Eric Benhaim, Slim Essid:

MTSE: Multi-Target Speaker Extraction for Conversation Scenarios. - Daniel-José Alcala Padilla, Nils L. Westhausen, Swati Vivekananthan, Bernd T. Meyer:

Location-Aware Target Speaker Extraction for Hearing Aids. - Shengkui Zhao, Zexu Pan, Bin Ma:

ClearerVoice-Studio: Bridging Advanced Speech Processing Research and Practical Deployment. - Cheng Yu, Vahid Ahmadi Kalkhorani, Buye Xu, DeLiang Wang:

Online AV-CrossNet: a Causal and Efficient Audiovisual System for Speech Enhancement and Target Speaker Extraction. - Jakob Kienegger, Timo Gerkmann:

Steering Deep Non-Linear Spatially Selective Filters for Weakly Guided Extraction of Moving Speakers in Dynamic Scenarios.
Multimodal Emotion Recognition
- Haoqin Sun, Jingguang Tian, Jiaming Zhou, Hui Wang, Jiabei He, Shiwan Zhao, Xiangyu Kong, Desheng Hu, Xinkang Xu, Xinhui Hu, Yong Qin:

RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval. - Shreeram Suresh Chandra, Lucas Goncalves, Junchen Lu, Carlos Busso, Berrak Sisman:

EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast. - Georgios Chochlakis, Turab Iqbal, Woo Hyun Kang, Zhaocheng Huang:

Modality-Agnostic Multimodal Emotion Recognition using a Contrastive Masked Autoencoder. - Maxim Markitantov, Elena Ryumina, Heysem Kaya, Alexey Karpov:

Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion.
Conversation, Communication and Interaction 2
- Mikey Elmers, Koji Inoue, Divesh Lala, Tatsuya Kawahara:

Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems. - Michael Paierl, Martin Hagmüller, Barbara Schuppler:

Continuous prediction of backchannel timing for human-robot interaction. - Valeska Slomianka, Tobias May, Torsten Dau:

Impact of Background Noise on Turn-Taking Dynamics in Triadic Conversations. - Delphine Charuau, Naomi Harte:

Multimodal Dynamics of Hand Gestures and Pauses in Multiparty Interactions. - Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Adam Wiacek, Marcin Skorupa, Sebastien Postansque, Jakub Hoscilowicz:

TinyClick: Single-Turn Agent for Empowering GUI Automation. - Shoki Kawanishi, Akinori Ito, Yuya Chiba, Takashi Nose:

Improving User Impression of Spoken Dialogue Systems by Controlling Para-linguistic Expression Based on Intimacy. - Kiyotada Mori, Seiya Kawano, Angel F. Garcia Contreras, Koichiro Yoshino:

Dialogue Response Prefetching Based on Semantic Similarity and Prediction Confidence of Language Model.
Multimodal Speech and Language Processing in Healthcare Settings
- Aditya Kommineni, Digbalay Bose, Tiantian Feng, So Hyun Kim, Helen Tager-Flusberg, Somer Bishop, Catherine Lord, Sudarsana Kadiri, Shrikanth Narayanan:

Can Multimodal Foundation Models Help Analyze Child-Inclusive Autism Diagnostic Videos? - Jihyun Mun, Sunhee Kim, Minhwa Chung:

A Cascaded Multimodal Framework for Automatic Social Communication Severity Assessment in Children with Autism Spectrum Disorder. - Daniel Tisdale, Jackson Liscombe, David Pautler, Michael Neumann, Vikram Ramanarayanan:

Accessible Real-time Eye-gaze Tracking for Neurocognitive Health Assessment: A Multimodal Web-based Approach. - Gowtham Premananth, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Y. Espy-Wilson:

Multimodal Biomarkers for Schizophrenia: Towards Individual Symptom Severity Estimation. - Suhas BN, Han-Chin Shing, Lei Xu, Mitch Strong, Jon Burnsky, Jessica Ofor, Jordan R. Mason, Susan Chen, Sundararajan Srinivasan, Chaitanya Shivade, Jack Moriarty, Joseph Paul Cohen:

Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization. - Hongchen Wu, Yao Du, Zirong Li, Yixin Gu, Disha Thotappala Jayaprakash, Li Sheng:

Evaluating Automatic Speech Recognition Pipelines for Mandarin-English Bilingual Child Language Assessment in Telehealth.
Music and Audio Analysis
- Jiawen Huang, Felipe Sousa, Emir Demirel, Emmanouil Benetos, Igor Gadelha:

Enhancing Lyrics Transcription on Music Mixtures with Consistency Loss. - Pei-Chin Hsieh, Yih-Liang Shen, Ngoc Son Tran, Tai-Shih Chi:

Tonality-Based Accompaniment-Guided Automatic Singing Evaluation. - Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma:

Investigating the Reasonable Effectiveness of Speaker Pre-Trained Models and their Synergistic Power for SingMOS Prediction. - Lekshmi Chandrika Reghunath, Rajeev Rajan:

Focal Modulation Network: A Novel Solution for Polyphonic Music Instrument Recognition without Attention and Aggregation Strategy. - Jiabo Jing, Ying Hu, Hao Huang, Liang He, Zhijian Ou:

A Joint Network for Singing Melody Extraction from Polyphonic Music with Attention Aggregation and Self-Consistency Training. - Yuetonghui Xu, Yiwen Wang, Xihong Wu, Xiaobing Li, Feng Yu:

Position also matters! Separating Same Instruments in String Quartet using Timbral and Positional Cues. - Ruoxuan Liang, Xiangjian Zeng, Zhen Liu, Qingqiang Wu, Ruichen Zhang, Le Ren:

WhisperMSS: A Two-Stage Framework for Mandarin Singing Transcription and Segmentation Using Pretrained Models. - Rishabh Gupta, MLNS Karthik, Yughendaran Palanivel:

Low Complex IIR Adaptive Hear-Through Ambient Filtering for Overcoming Practical Constraints in Earbuds. - Rishabh Gupta, MLNS Karthik, Omsrinath Chelamkuri:

Sub-band based Adaptive IIR Algorithm with Biquad Filter Stability Constraints for Feedforward Hear-Through Equalization.
Audio Analysis, Generation and Assessment
- Jingguang Tian, Haoqin Sun, Xinhui Hu, Xinkang Xu:

Discrete Audio Representations for Automated Audio Captioning. - Daiki Takeuchi, Binh Thien Nguyen, Masahiro Yasuda, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada:

CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer. - Ho-Young Choi, Jae-Heung Cho, Pil Moo Byun, Won-Gook Choi, Joon-Hyuk Chang:

Temp4Cap: Temporally-aligned Automated Audio Captioning. - Seyun Ahn, Pil Moo Byun, Won-Gook Choi, Joon-Hyuk Chang:

Optimizing CLAP Reward with LLM Feedback for Semantically Aligned and Diverse Automated Audio Captioning. - Seung-jae Lee, Paul Hongsuck Seo:

Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models. - Suqi Zhang, Zheqi Dai, Yongyi Zang, Yin Cao, Qiuqiang Kong:

DiffStereo: End-to-End Mono-to-Stereo Audio Generation with Diffusion Transformer. - Yusuke Kanamori, Yuki Okamoto, Taisei Takano, Shinnosuke Takamichi, Yuki Saito, Hiroshi Saruwatari:

RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audio. - Laura Lechler, Chamran Moradi, Ivana Balic:

Crowdsourcing MUSHRA Tests in the Age of Generative Speech Technologies: A Comparative Analysis of Subjective and Objective Testing Methods. - Sivakumar Balasubramanian, Jose Antonio Jimenez Amador, Kaustubh Kalgaonkar, King-Wei Hor, Sriram Srinivasan:

SMARTMOS: Modeling Subjective Audio Quality Evaluation for Real-Time Applications.
Other Topics in Speech Recognition
- Vikram C. M, Sanjoy Pal, Nidhi Mantri, Gopal Kumar Agrawal:

Effect of Loudspeaker Emitted Speech on ASR performance. - Zhennan Lin, Kaixun Huang, Wei Ren, Linju Yang, Lei Xie:

Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation. - Chanho Park, Oscar Saz:

Character Error Rate Estimation for Semi-Supervised Training of Speech Recognition for Arabic Dialects. - Nune Tadevosyan, Nikolay Karpov, Andrei Andrusenko, Vitaly Lavrukhin, Ante Jukic:

Unified Semi-Supervised Pipeline for Automatic Speech Recognition. - Christoph Minixhofer, Ondrej Klejch, Peter Bell:

Scaling Laws for Synthetic Speech for Model Training. - Minh Tran, Debjyoti Paul, Yutong Pang, Laxmi Pandey, Jinxi Guo, Ke Li, Shun Zhang, Xuedong Zhang, Xin Lei:

R2S: Real-to-Synthetic Representation Learning for Training Speech Recognition Models on Synthetic Data. - Julian Linke, Jana Winkler, Barbara Schuppler:

Context is all you need? Low-resource conversational ASR profits from context, coming from the same or from the other speaker. - Dana Serditova, Kevin Tang, Jochen Steffens:

Automatic Speech Recognition Biases in Newcastle English: an Error Analysis.
Privacy and Anonymization
- Jiali Cheng, Hadi Amiri:

Speech Unlearning. - Zhe Liu:

Unlearning LLM-Based Speech Recognition Models. - Jixun Yao, Hexin Liu, Eng Siong Chng, Lei Xie:

EASY: Emotion-aware Speaker Anonymization via Factorized Distillation. - Carlos Franzreb, Arnab Das, Tim Polzehl, Sebastian Möller:

Private kNN-VC: Interpretable Anonymization of Converted Speech. - Nathalie Vauquier, Brij Mohan Lal Srivastava, Seyed Ahmad Hosseini, Emmanuel Vincent:

Legally validated evaluation framework for voice anonymization.
Language Modeling for Conversational Systems
- Chi-Yuan Hsiao, Ke-Han Lu, Kai-Wei Chang, Chih-Kai Yang, Wei-Chih Chen, Hung-yi Lee:

Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models. - Alan Dao, Dinh Bach Vu, Huy Hoang Ha, Tuan Le Duc Anh, Shreyas Gopal, Yue Heng Yeo, Warren Keng Hoong Low, Eng Siong Chng, Jia Qi Yip:

Speechless: Speech Instruction Training Without Speech for Low Resource Languages. - Pooneh Mousavi, Shubham Gupta, Cem Subakan, Mirco Ravanelli:

LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs. - Anton Mitrofanov, Sergey Novoselov, Tatiana Prisyach, Vladislav Marchevskiy, Arseniy Karelin, Nikita Khmelev, Dmitry Dutov, Stepan Malykh, Igor Agafonov, Aleksandr Nikitin, Oleg Petrov:

Cryfish: On deep audio analysis with Large Language Models. - Long Mai, Julie Carson-Berndsen:

Improving Linguistic Diversity of Large Language Models with Possibility Exploration Fine-Tuning. - Jinchuan Tian, William Chen, Yifan Peng, Jiatong Shi, Siddhant Arora, Shikhar Bharadwaj, Takashi Maekaku, Yusuke Shinohara, Keita Goto, Xiang Yue, Huck Yang, Shinji Watanabe:

OpusLM: A Family of Open Unified Speech Language Models. - Jen-Tzung Chien, Po-Chun Huang:

CAPR: Confidence-Aware Prompt Refinement in Large Language Models.
Speech Accessibility Project Challenge
- Xiuwen Zheng, Bornali Phukon, Jonghwan Na, Ed Cutrell, Kyu J. Han, Mark Hasegawa-Johnson, Pan-Pan Jiang, Aadhrik Kuila, Colin Lea, Bob MacDonald, Gautam Varma Mantena, Venkatesh Ravichandran, Leda Sari, Katrin Tomanek, Chang D. Yoo, Chris Zwilling:

The Interspeech 2025 Speech Accessibility Project Challenge. - Nada Gohider, Otman Basir:

Towards Inclusive and Fair ASR: Insights from the SAPC Challenge for Optimizing Disordered Speech Recognition. - Alexandre Ducorroy, Rachid Riad:

Robust fine-tuning of speech recognition models via model merging: application to disordered speech. - Moreno La Quatra, Alkis Koudounas, Valerio Mario Salerno, Sabato Marco Siniscalchi:

Exploring Generative Error Correction for Dysarthric Speech Recognition. - Ilja Baumann, Dominik Wagner, Korbinian Riedhammer, Tobias Bocklet:

Pathology-Aware Speech Encoding and Data Augmentation for Dysarthric Speech Recognition. - Dominik Wagner, Ilja Baumann, Natalie Engert, Seanie Lee, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet:

Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition. - Shiyao Wang, Jiaming Zhou, Shiwan Zhao, Yong Qin:

A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition. - Kaito Takahashi, Keigo Hojo, Toshimitsu Sakai, Yukoh Wakabayashi, Norihide Kitaoka:

Fine-tuning Parakeet-TDT for Dysarthric Speech Recognition in the Speech Accessibility Project Challenge. - Tianyi Tan, Xin'an Chen, Xiaohuai Le, Wenzhi Fan, Xianjun Xia, Chuanzeng Huang, Jing Lu:

CBA-Whisper: Curriculum Learning-Based AdaLoRA Fine-Tuning on Whisper for Low-Resource Dysarthric Speech Recognition.
Neural Network Training Methods 1
- Shunsuke Mitsumori, Sara Kashiwagi, Keitaro Tanaka, Shigeo Morishima:

Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech Recognition. - Jatin Agrawal, Bramhendra Koilakuntla, Srikanth Konjeti:

Spot and Merge: A Hybrid Context Biasing Approach for Rare Word and Out of Vocabulary Recognition. - Martin Ratajczak, Jean-Philippe Robichaud, Jennifer Drexler Fox:

Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR. - Changhan Oh, Kiyoung Park, Jeom-ja Kang, Woo Yong Choi, Hwa Jeon Song:

Improving Cross-Attention based on Positional Alignment during Inference for Robust Long-form Speech Recognition. - Sashi Novitasari, Takashi Fukuda, Gakuto Kurata:

Improving End-to-end Mixed-case ASR with Knowledge Distillation and Integration of Voice Activity Cues. - Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai:

Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR.
Diversity: Age, Sex, Gender, Ethnicity, and More
- Yi Lin, Shumeng Ni, Yangfan Lu:

Age-related changes in multisensory integration of emotions in an audiovisual face-prosody-semantics Stroop task. - Melanie Weirich, Adrian P. Simpson:

Investigating effects of sex hormones, cycle phases and age on female fundamental frequency. - Meike Rommel, Mísa Hejná, Nicole Dehé:

Pre-aspiration in Iceland Is Conditioned by Gender/Sex. - Andreas Weilinghoff:

Transcribing Diverse Voices: Using Whisper for ICE corpora. - Oluwasegun Amoniyan:

Is it all about race?: A Cross-examination of /s/ in a Multilingual (Nigerian) Context. - Aarish Shah Mohsin, Mohammad Nadeem, Shahab Saquib Sohail, Tughrul Arslan, Mandar Gogate, Nasir Saleem, Amir Hussain:

Investigating Gender Bias in Text-to-Audio Generation Models.
Anomalous Sound Detection
- Dong Wang, Jiqing Han, Tieran Zheng, Guibin Zheng, Yongjun He:

Dual Orthogonality Sub-center Loss for Enhanced Anomalous Sound Detection. - Dong Wang, Jiqing Han, Guibin Zheng, Tieran Zheng, Yongjun He:

Adaptive Across-Subcenter Representation Learning for Imbalanced Anomalous Sound Detection. - Ho-Hsiang Wu, Wei-Cheng Lin, Abinaya Kumar, Luca Bondi, Shabnam Ghaffarzadegan, Juan Pablo Bello:

Towards Few-Shot Training-Free Anomaly Sound Detection. - Nan Jiang, Yan Song, Qing Gu, Haoyu Song, Lirong Dai, Ian McLoughlin:

Finetune Large Pre-Trained Model Based on Frequency-Wise Multi-Query Attention Pooling for Anomalous Sound Detection. - Dengjian Zhou, Jianghan Hai, Sijia Liao, Yue Ivan Wu, Kainam Thomas Wong, Xiujuan Zheng:

Acoustic Detection of UAV Abnormality Using One Ground-Based Acoustic Vector Sensor. - Ben Niu, Yangjie Wei, Gang Yang, Yuqiao Wang, Shengling Yu:

StarGAN-Aug: A Cross-domain Fault Audio Generation Method for High-performance Fault Diagnosis of Power Transformers.
Far-field and Robust Speech Recognition
- Longjie Luo, Lin Li, Qingyang Hong:

SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recognition. - Siyi Zhao, Wei Wang, Yanmin Qian:

Lightweight Front-end Enhancement for Robust ASR via Frame Resampling and Sub-Band Pruning. - Yingzhi Wang, Anas Alhmoud, Saad Alsahly, Muhammad Alqurishi, Mirco Ravanelli:

Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down. - Hyebin Ahn, Kangwook Jang, Hoirin Kim:

HuBERT-VIC: Improving Noise-Robust Automatic Speech Recognition of Speech Foundation Model via Variance-Invariance-Covariance Regularization. - Naoyuki Kamo, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani:

MOVER: Combining Multiple Meeting Recognition Systems. - Ashish Panda, Sunil Kumar Kopparapu:

EmbedAug: An Augmentation Scheme for End-to-End Automatic Speech Recognition. - Cathal Ó Faoláin, Andrew Hines:

Attention Models and Auditory Transduction Features for Noise Robustness. - Xiangzhu Kong, Hao Huang, Zhijian Ou:

Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic Transform. - Kunlong Zhao, Gongping Huang, Xudong Zhao, Jingdong Chen, Jacob Benesty, Zoran Cvetkovic:

On the Design of a Robust Superdirective Beamformer and Topology Parameter Optimization with Frustum-Shaped Microphone Arrays Featuring Multiple Rings.
Speech Synthesis Paradigms and Methods 2
- Andres Fernandez, Juan Azcarreta Ortiz, Çagdas Bilen, Jesus Monge-Alvarez:

Efficient Neural and Numerical Methods for High-Quality Online Speech Spectrogram Inversion via Gradient Theorem. - Jingyi Chen, Ju-Seung Byun, Micha Elsner, Pichao Wang, Andrew Perrault:

Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback. - Jeongsoo Choi, Zhikang Niu, Ji-Hoon Kim, Chunhui Wang, Joon Son Chung, Xie Chen:

Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment. - Liming Liang, Dongchao Yang, Xianwei Zhuang, Yuxin Xie, Luo Chen, Yuehan Jin, Yuexian Zou:

SpeechSEC: A Unified Multi-Task Framework for Speech Synthesis, Editing, and Continuation. - Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen, Xuesong Yang, Pin-Jui Ku, Ante Jukic, Huck Yang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee, Szu-Wei Fu:

VoiceNoNG: Robust High-Quality Speech Editing Model without Hallucinations. - Yihan Wu, Ruibo Chen, Georgios Milis, Junfeng Guo, Heng Huang:

A Watermark for Auto-Regressive Speech Generation Models.
Keynote 3 - Carol Y. Espy-Wilson: Speech Kinematic Analysis from Acoustics: Scientific, Clinical and Practical Applications
- Carol Y. Espy-Wilson:

Speech Kinematic Analysis from Acoustics: Scientific, Clinical and Practical Applications.
Articulatory Analyses
- Peter Birkholz, Dominik Schäfer, Patrick Häsner, Jihyeon Yun, Iris Kruppke, Rémi Blandin:

Influence of wall coverings of 3D-printed vocal tract models on measured transfer functions. - Paul McGuire, Kye Shibata, Thanh Viet Cao, Feng-fan Hsieh, Yueh-Chin Chang:

Supralaryngeal Kinematics of Implosives in Central Vietnamese: An EMA Study. - Tünde Szalay, Michael Proctor, Amelia Gully, Tharinda Piyadasa, Craig T. Jin, David Waddington, Naeim Sanaei, Sheryl Foster, Kirrie J. Ballard:

Lateral Channel Formation in Australian English /l/: Insights from Magnetic Resonance Imaging. - Jing Huang, Feng-fan Hsieh, Yueh-Chin Chang:

Articulatory variations in Apical Vowels in Southwestern Mandarin. - Michael Proctor, Tünde Szalay, Tharinda Piyadasa, Craig T. Jin, Naeim Sanaei, Amelia Gully, David Waddington, Sheryl Foster, Kirrie J. Ballard:


:
Articulatory Strategy in Vowel Production as a Basis for Speaker Discrimination.
Speech and Audio Analysis and Representation
- Nadav Har-Tuv, Or Tal, Yossi Adi:

PAST: Phonetic-Acoustic Speech Tokenizer. - Sameer Khurana, Dominik Klement, Antoine Laurent, Dominik Bobos, Juraj Novosad, Peter Gazdik, Ellen Zhang, Zili Huang, Amir Hussein, Ricard Marxer, Yoshiki Masuyama, Ryo Aihara, Chiori Hori, François G. Germain, Gordon Wichern, Jonathan Le Roux:

Factorized RVQ-GAN For Disentangled Speech Tokenization. - Leonardo Pepino, Pablo Riera, Luciana Ferrer:

EnCodecMAE: leveraging neural codecs for universal audio representation learning. - Sarthak Yadav, Sergios Theodoridis, Zheng-Hua Tan:

AxLSTMs: learning self-supervised audio representations with xLSTMs.
Show and Tell 3: Signal Processing / Multimodal processing
- Keigo Wakayama, Tomoko Kawase, Takafumi Moriya, Marc Delcroix, Hiroshi Sato, Tsubasa Ochiai, Masahiro Yasuda, Shoko Araki:

Real-time TSE demonstration via SoundBeam with KD. - Bunlong Lay, Rostilav Makarov, Timo Gerkmann:

Real-Time Diffusion Buffer for Speech Enhancement On A Laptop. - Muhammad Yeza Baihaqi, Angel García Contreras, Seiya Kawano, Koichiro Yoshino:

Co-Speech Motion for Virtual Agents in Dialogue Using LLM-Driven Primitive Action Selection. - Arun Kumar Pallala, Nivedita Chennupati, Balaji Padmanaban, Rakesh Pogula, Uma Subhashini Ravuri, Naveen Ellanki, Harish Rajamani, Naveen Ambati:

TargetVoice: Single Channel Low-Latency Target Speaker Extraction. - Yuni Amaloa Quintero Villalobos, Wafaa Wardah, Sebastian Möller, Robert P. Spang:

Rollback Speech: Smart Feedback Prompts for Lost Utterances in Unstable Online Calls. - Takuma Okamoto, Michiyo Kono:

Simultaneous Speech Translation Integrated Compact Multiple Sound Spot Synthesis System On A Laptop Carried Out With A Backpack. - Santosh V. Patapati, Aashrith Tatineni, Trisanth Srinivasan:

GenECA: A General-Purpose Framework for Real-Time Adaptive Multimodal Embodied Conversational Agents.
Speech and Voice Disorders 2
- Zhou Du, Hang Chen, Huijun Ding, Jun Du, Zhen Chen:

Hybrid Expert Knowledge and Self-Supervised Learning for Diagnostic Modeling of Adductor Spasmodic and Primary Myotonic Dysphonia. - Alkis Koudounas, Moreno La Quatra, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Sabato Marco Siniscalchi, Elena Baralis:

MVP: Multi-source Voice Pathology detection. - Chih-Ning Chen, Yu-Lan Chuang, Ming-Jhang Yang, Wei-Cheng Hsu, Yung-An Tsou, Yi-Wen Liu:

Phonetic Posteriorgram-Based Phoneme Selection for Vocal Cord Disorder Classification in Continuous Mandarin Speech. - Thomas Tienkamp, Fleur van Ast, Roos van der Veen, Teja Rebernik, Raoul Buurke, Nikki Hoekzema, Katharina Polsterer, Hedwig Sekeres, Rob van Son, Martijn Wieling, Max J. H. Witjes, Sebastiaan A. H. J. de Visscher, Defne Abur:

Articulatory clarity and variability before and after surgery for tongue cancer. - Zuzanna Miodonska:

Hybrid HMM-SVM classifier using frication-based features for detection of non-normative sibilant articulation patterns in Polish children's speech. - Dena F. Mujtaba, Nihar R. Mahapatra:

Fine-Tuning ASR for Stuttered Speech: Personalized vs. Generalized Approaches.
Neural Network Training Methods 2
- SooHwan Eom, Mark Hasegawa-Johnson, Chang D. Yoo:

SiamCTC: Learning Speech Representations through Monotonic Temporal Alignment. - Ye-Eun Ko, Mun-Hak Lee, Dong-Hyun Kim, Joon-Hyuk Chang:

Improving Generalization of End-to-End ASR through Diversity and Independence Regularization. - Carlos Carvalho, Jinchuan Tian, William Chen, Yifan Peng, Alberto Abad, Shinji Watanabe:

Exploring Linear Variant Transformers and k-NN Memory Inference for Long-Form ASR. - Takafumi Moriya, Masato Mimura, Kiyoaki Matsui, Hiroshi Sato, Kohei Matsuura:

Attention-Free Dual-Mode ASR with Latency-Controlled Selective State Spaces. - Dianwen Ng, Kun Zhou, Bin Ma, Eng Siong Chng:

Thinking Fast and Slow: Robust Speech Recognition via Deep Filter-Tuning. - Ziyang Zhuang, Tao Wei, Ming Fang, Ning Cheng, Shaojun Wang, Jing Xiao:

Towards Efficiently Whisper Fine-tuning with Monotonic Alignments. - Jingjing Xu, Zijian Yang, Albert Zeyer, Eugen Beck, Ralf Schlüter, Hermann Ney:

Dynamic Acoustic Model Architecture Optimization in Training for ASR. - Xiaocan Zhang, Weiwei Jiang, Guibin Zheng, Chenhao Jing, Jiqing Han, Tieran Zheng:

Knowledge Distillation Method for Pruned RNN-T Models via Pruning Bounds Sharing and Losses Confusion. - Abdul Hannan, Alessio Brutti, Shah Nawaz, Mubashir Noman:

An Effective Training Framework for Light-Weight Automatic Speech Recognition Models. - Andrés Carofilis, Pradeep Rangappa, Srikanth Madikeri, Shashi Kumar, Sergio Burdisso, Jeena Prakash, Esaú Villatoro-Tello, Petr Motlícek, Bidisha Sharma, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke:

Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering.
Disentanglement of Information for Speaker Recognition
- Aditya Srinivas Menon, Raj Prakash Gohil, Kumud Tripathi, Pankaj Wasnik:

LASPA: Language Agnostic Speaker Disentanglement with Prefix-Tuned Cross-Attention. - Shanshan Yao, Dianlong Liu, Tian Li:

SCD-Conformer: Semantic Content Disentanglement for Text-Independent Speaker Verification. - Biel Tura Vecino, Subhadeep Maji, Aravind Varier, Antonio Bonafonte, Ivan Valles, Michael Owen, Constantinos Papayiannis, Leif Rädel, Grant P. Strimel, Oluwaseyi Feyisetan, Roberto Barra-Chicote, Ariya Rastrow, Volker Leutnant, Trevor Wood:

Universal Semantic Disentangled Privacy-preserving Speech Representation Learning. - Carole Millot, Clara Ponchard, Cédric Gendrot, Jean-François Bonastre, Orane Dufour:

Using gender, phonation and age to interpret automatically discovered speech attributes for explainable speaker recognition. - Alicja Martinek, Joanna Gajewska, Ewelina Bartuzi-Trokielewicz:

Do you read me? - flow of speech effect on speaker recognition systems. - Zhiqi Ai, Meixuan Bao, Zhiyong Chen, Zhi Yang, Xinnuo Li, Shugong Xu:

VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin.
Error Correction and Confidence Estimation
- Natsuo Yamashita, Masaaki Yamamoto, Hiroaki Kokubo, Yohei Kawaguchi:

LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context. - Nagarathna Ravi, Thishyan Raj T, Ravi Teja Chaganti, Vipul Arora:

ASR Confidence Estimation using True Class Lexical Similarity Score. - Chanho Park, Thomas Hain:

Semi-Supervised Learning for Automatic Speech Recognition with Word Error Rate Estimation and Targeted Domain Data Selection. - Sashi Novitasari, Takashi Fukuda, Gakuto Kurata:

Voice Activity-based Text Segmentation for ASR Text Denormalization. - Christophe Van Gysel, Maggie Wu, Lyan Verwimp, Caglar Tirkaz, Marco Bertola, Zhihong Lei, Youssef Oualil:

Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction. - Ahmed Adel Attia, Dorottya Demszky, Jing Liu, Carol Y. Espy-Wilson:

From Weak Labels to Strong Results: Utilizing 5,000 Hours of Noisy Classroom Transcripts with Minimal Accurate Data.
Training and Scoring Methods for Speaker Recognition
- Seongkyu Mun, Jubum Han:

Boundary-Conscious Pruning: Hard Set-Aware Model Compression for Efficient Speaker Recognition. - Yafeng Chen, Chong Deng, Hui Wang, Yiheng Jiang, Han Yin, Qian Chen, Wen Wang:

Pushing the Frontiers of Self-Distillation Prototypes Network with Dimension Regularization and Score Normalization. - Qing Gu, Yan Song, Haoyu Song, Nan Jiang, Lirong Dai, Ian McLoughlin:

A Domain Robust Pre-Training Method with Local Prototypes for Speaker Verification. - Piotr Masztalski, Michal Romaniuk, Jakub Zak, Mateusz Matuszewski, Konrad Kowalczyk:

Clustering-based Hard Negative Sampling for Supervised Contrastive Speaker Verification. - Yang Liu, Li Wan, Yiteng Huang, Ming Sun, Xinhao Mei, Xubo Liu, Yangyang Shi, Florian Metze:

MASV: Speaker Verification with Global and Local Context Mamba. - Amro Asali, Yehuda Ben-Shimol, Itshak Lapidot:

ATMM-SAGA: Alternating Training for Multi-Module with Score-Aware Gated Attention SASV system. - Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han:

Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification. - Kihyun Nam, Jungwoo Heo, Jee-weon Jung, Gangin Park, Chaeyoung Jung, Ha-Jin Yu, Joon Son Chung:

SEED: Speaker Embedding Enhancement Diffusion Model. - Sandro Cumani:

A Copula-Based Generative Score-Level Fusion Model for Speaker Verification.
Pathological Speech Analysis 2
- Anne Hermes, Ivana Didirková, Philipp Buech, Gilles Vannuscorps:

Acoustic similarities, articulatory uniqueness: Speech production mechanisms in individuals with congenital lip paralysis. - Bence Mark Halpern, Thomas Tienkamp, Teja Rebernik, Rob J. J. H. van Son, Martijn Wieling, Defne Abur, Tomoki Toda:

Relationship between objective and subjective perceptual measures of speech in individuals with head and neck cancer. - Terry Yi Zhong, Esther Janse, Cristian Tejedor-Garcia, Louis ten Bosch, Martha A. Larson:

Evaluating the Usefulness of Non-Diagnostic Speech Data for Developing Parkinson's Disease Classifiers. - Francesco Pierotti, Andrea Bandini:

Multimodal Assessment of Speech Impairment in Amyotrophic Lateral Sclerosis Using Audio-Visual and Machine Learning Approaches. - Tanya Talkar, Kan Kawabata, Connor Higgins, Sean Tobyne:

Development and Validation of a Wav2Vec 2.0-Based Cross-Language Methodology for Measurement of Articulatory Precision. - Charan Sridhar, Shaomei Wu:

J-j-j-just Stutter: Benchmarking Whisper's Performance Disparities on Different Stuttering Patterns.
Multimodal and Visual Speech Synthesis
- Junjie Zheng, Zihao Chen, Chaofan Ding, Yunming Liang, Yihan Fan, Huan Yang, Lei Xie, Xinhan Di:

MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing. - Hyung Kyu Kim, Hak Gu Kim:

Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation. - Fang Kang, Yin Cao, Haoyu Chen:

Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation. - Minsu Kim, Pingchuan Ma, Honglie Chen, Stavros Petridis, Maja Pantic:

Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis. - Shumin Que, Anton Ragni:

VisualSpeech: Enhancing Prosody Modeling in TTS Using Video. - Yifan Liang, Kang Yang, Fangkun Liu, Andong Li, Xiaodong Li, Chengshi Zheng:

LightL2S: Ultra-Low Complexity Lip-to-Speech Synthesis for Multi-Speaker Scenarios.
Lexicon and Grammar
- Atty Schouwenaars, Esther Ruigendijk:

Processing of grammatical information in cochlear implant simulated speech by German adult listeners. - Chengjia Ye, James M. McQueen, Hans Rutger Bosker:

A Gradient Effect of Hand Beat Timing on Spoken Word Recognition. - Wei Xue, Iuliia Zaitova, Bernd Möbius:

The Effect of Word Predictability on Spoken Cross-Language Intelligibility. - Yizhi Liu, Luyuan Geng, Yan Gu, Mengru Han:

Sentence-Final Particles in Mandarin Child-Directed Speech: Frequency and Impact on Speech Rate. - Ashwin Ram, Marisol Muñoz, Zoi Gkalitsiou, Alexandros G. Dimakis:

Bilingual Speakers Exhibit Cognitive Fatigue: A Speech Disfluencies Case Study on Research Talks.
Noise Reduction and Dereverberation
- Chandra Mohan Sharma, Arnab Kumar Roy, Anupam Mandal, Prasanta Kumar Ghosh, Prasanna Kumar Kr:

Boosting StoRM Convergence with Metric Guidance and Non-uniform State-Sampling for Optimal Dereverberation. - Louis Lalay, Mathieu Fontaine, Roland Badeau:

Unified Variational and Physics-aware Model for Room Impulse Response Estimation. - Kaixuan Luan, Xiaoda Yang, Shile Cai, Ruofan Hu, Minghui Fang, Wenrui Liu, Jialong Zuo, Jiaqi Duan, Yuhang Ma, Junyu Lu:

MelRe: Vision-Based Mel-Spectrogram Restoration. - Sirui Li, Shuai Wang, Zhijun Liu, Zhongjie Jiang, Yannan Wang, Haizhou Li:

SpeechRefiner: Towards Perceptual Quality Refinement for Front-End Algorithms. - Yunsik Kim, Yoonyoung Chung:

Modality-Specific Speech Enhancement and Noise-Adaptive Fusion for Acoustic and Body-Conduction Microphone Framework. - De Hu, Qilong Li:

Joint Rate Allocation and Sensor Selection for Speech Enhancement in Wireless Acoustic Sensor Networks. - Chuan Wen, Sarah Verhulst:

Individualized speech enhancement for hearing-impaired listeners. - Shaoxiang Dang, Li Li, Shogo Seki, Hiroaki Kudo:

First Analyze Then Enhance: A Task-Aware System for Speech Separation, Denoising, and Dereverberation. - Yaqi Zhu, Lei Zhou, Hongqing Liu, Liming Shi, Lu Gan:

A Robust Hybrid ACC-PM Approach for Personal Sound Zones.
Neural Network Training Methods and Architectures
- Fabian Ritter Gutierrez, Yi-Cheng Lin, Jui-Chiang Wei, Jeremy H. M. Wong, Eng Siong Chng, Nancy F. Chen, Hung-yi Lee:

Distilling a speech and music encoder with task arithmetic. - Dimitrios Damianos, Georgios Paraskevopoulos, Alexandros Potamianos:

MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR. - Jie Song, Wang Xiang, Jian Zhou, Cunhang Fan, Zhao Lv:

REB-former: RWKV-enhanced E-branchformer for Speech Recognition. - Jan Schuster, Alexander Wölfel, Fabian Brunner, Christian Bergler:

PredTrAD - Prediction-based Transformer for Anomaly Detection in Multivariate Time Series Data. - Jongsuk Kim, Jaemyung Yu, Minchan Kwon, Junmo Kim:

FairASR: Fair Audio Contrastive Learning for Automatic Speech Recognition. - Hamid Mojarad, Kevin Tang:

Automatic Speech Recognition of African American English: Lexical and Contextual Effects. - Kwok Chin Yuen, Jia Qi Yip, Eng Siong Chng:

Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function. - Naoki Makishima, Naotaka Kawata, Taiga Yamane, Mana Ihori, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura:

SOMSRED-SVC: Sequential Output Modeling with Speaker Vector Constraints for Joint Multi-Talker Overlapped ASR and Speaker Diarization. - Jiamin Xie, Ju Lin, Yiteng Huang, Tyler Vuong, Zhaojiang Lin, Zhaojun Yang, Peng Su, Prashant Rawat, Sangeeta Srivastava, Ming Sun, Florian Metze:

Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition.
Challenges in Speech Data Collection, Curation and Annotation - Part 1
- Hang Chen, Jun Du, Qing Wang, Juan Xie, Shi-Fu Xiong:

A Study of Real-world Audio-Visual Corpus Design and Production: A Perspective from MISP Challenges. - Yuxi Wang, Yikang Wang, Qishan Zhang, Hiromitsu Nishizaki, Ming Li:

VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset. - Sayaka Shiota, Suzuka Horie, Kouta Kanno, Shinnosuke Takamichi:

J-SPAW: Japanese speaker verification and spoofing attacks recorded in-the-wild dataset. - Coralie Serrand, Amira Morsli, Gilles Boulianne:

CommissionsQC: a Québec French Speech Corpus for Automatic Speech Recognition. - Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng, Sara Papi, Marco Gaido, Alessio Brutti

, Boris Ginsburg:
Granary: Speech Recognition and Translation Dataset in 25 European Languages. - Hashim Ali, Surya Subramani, Raksha Varahamurthy, Nithin Sai Adupa, Lekha Bollinani, Hafiz Malik

:
Collecting, Curating, and Annotating Good Quality Speech deepfake dataset for Famous Figures: Process and Challenges. - Miao Zhang, Aref Farhadipour, Annie Baker, Jiachen Ma, Bogdan Pricop, Eleanor Chodroff:

Quantifying and Reducing Speaker Heterogeneity within the Common Voice Corpus for Phonetic Analysis. - Chris Zwilling, Mark Hasegawa-Johnson, Heather Hodges, Lorraine O. Ramig, Adina Bradshaw, Clarion Mendes, Heejin Kim, Alexandria Barkhimer, Laura Mattie, Meg Dickinson, Shawnise Carter, Marie Moore Channell:

The Speech Accessibility Project: Best Practices for Collection and Curation of Disordered Speech. - Zhengjun Yue, Mara Barberis, Tanvina Patel, Judith Dineley, Willemijn Doedens, Lottie Stipdonk, Yuanyuan Zhang, Elke De Witte, Erfan Loweimi, Hugo Van hamme, Djaina Satoer, Marina B. Ruiter, Laureano Moro-Velázquez, Nicholas Cummins, Odette Scharenborg:

Challenges and practical guidelines for atypical speech data collection, annotation, usage and sharing: A multi-project perspective. - Loann Peurey, Marvin Lavechin, Tarek Kunze, Manel Khentout, Lucas Gautheron, Emmanuel Dupoux, Alejandrina Cristià:

Fifteen Years of Child-Centered Long-Form Recordings: Promises, Resources, and Remaining Challenges to Validity. - Qiongqiong Wang, Hardik B. Sailor, Tianchi Liu, Ai Ti Aw:

Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM: Data Condensation and Spoken QA Generation. - Kalle Lahtinen, Einari Vaaras, Liisa Mustanoja, Okko Räsänen:

Investigating Affect Mining Techniques for Annotation Sample Selection in the Creation of Finnish Affective Speech Corpus. - Zaid Sheikh, Shuichiro Shimizu, Siddhant Arora, Jiatong Shi, Samuele Cornell, Xinjian Li, Shinji Watanabe:

Scalable Spontaneous Speech Dataset (SSSD): Crowdsourcing Data Collection to Promote Dialogue Research. - Xiyuan Gao, Bruce Xiao Wang, Meiling Zhang, Shuming Huang, Zhu Li, Shekhar Nayak, Matt Coler:

A Multimodal Chinese Dataset for Cross-lingual Sarcasm Detection. - Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler:

Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection.
Evaluation and Forensic Applications of Speaker Recognition
- Sandro Cumani, Anna Silnova, Sara Barahona, Ladislav Mosner, Oldrich Plchot, Johan Rohdin:

Analysis of the ABC Classification Backends for NIST SRE24. - Stepan Malykh, Alexander Anikin, Nikita Khmelev, Anastasia Korenevskaya, Anastasia Zorkina, Sergey Novoselov, Vladislav Marchevskiy, Vladimir Volokhov, Andrey Shulipa, Alexander Kozlov, Alexander Melnikov, Vasiliy Galyuk, Timur Pekhovskiy:

STCON NIST SRE24 System: Composite Speaker Recognition Solution for Challenging Scenarios. - Jaejun Lee, Kyogu Lee:

Vo-Ve: An Explainable Voice-Vector for Speaker Identity Evaluation. - Lauren Harrington, Vincent Hughes, Philip Harrison, Paul Foulkes, Jessica Wormald, Finnian Kelly, David van der Vloed:

Variability in performance across four generations of automatic speaker recognition systems. - Paul M. Reuter, Michael Jessen:

On the influence of language similarity in non-target speaker verification trials. - Ruichen Zuo, Kong Aik Lee, Zilong Huang, Man-Wai Mak:

The Sub-3Sec Problem: From Text-Independent to Text-Dependent Corpus.
Language Resources
- Yu-Xiang Luo, Yi-Cheng Lin, Ming-To Chuang, Jia-Hung Chen, I-Ning Tsai, Pei Xing Kiew, Yueh-Hsuan Huang, Chien-Feng Liu, Yu-Chen Chen, Bo-Han Feng, Wenze Ren, Hung-yi Lee:

ToxicTone: A Mandarin Audio Dataset Annotated for Toxicity and Toxic Utterance Tonality. - Huy Ba Do, Vy Le-Phuong Huynh, Luan Thanh Nguyen:

ViToSA: Audio-Based Toxic Spans Detection on Vietnamese Speech Utterances. - William N. Havard, Renauld Govain, Benjamin Lecouteux, Emmanuel Schang:

Self-Supervised Models of Speech Processing for Haitian Creole. - Jesujoba O. Alabi, Xuechen Liu, Dietrich Klakow, Junichi Yamagishi:

AfriHuBERT: A self-supervised speech representation model for African languages. - Michael Ong, Sean Robertson, Leo Peckham, Alba Jorquera Jimenez de Aberasturi, Paula Arkhangorodsky, Robin Huo, Aman Sakhardande, Mark Hallap, Naomi Nagy, Ewan Dunbar:

The Faetar Speech Recognition Benchmark. - Jaume Santamaria-Jorda, Pablo Segovia-Martínez, Gonçal V. Garcés Díaz-Munío, Joan Albert Silvestre-Cerdà, Adrià Giménez, Rubén Gaspar Aparicio, René Fernández Sánchez, Jorge Civera, Albert Sanchís, Alfons Juan:

LHCP-ASR: An English Speech Corpus of High-Energy Particle Physics Talks for Narrow-Domain ASR Benchmarking. - Lucia Ormaechea Grijalba, Nikos Tsourakis, Pierrette Bouillon, Benjamin Lecouteux, Didier Schwab:

Towards High-Quality LLM-Based Data for French Spontaneous Speech Simplification: an Exo-Refinement Approach. - Xun Gong, Anqi Lv, Wangyou Zhang, Zhiming Wang, Huijia Zhu, Yanmin Qian:

BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM. - Raymond Grossman, Taejin Park, Kunal Dhawan, Andrew Titus, Sophia Zhi, Yulia Shchadilova, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg:

SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription. - Titouan Parcollet, Yuan Tseng, Shucong Zhang, Rogier C. van Dalen:

Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use. - Lucas Maison, Thomas Soulas, Marie-Jean Meurs:

CEREALES : a new dataset of Quebec French accented speech with applications to speech recognition.
Bandwidth Expansion and Diffusion-based Speech Enhancement
- Yang Xiang, Canan Huang, Desheng Hu, Jingguang Tian, Xinhui Hu, Chao Zhang:

A Semantic Information-based Hierarchical Speech Enhancement Method Using Factorized Codec and Diffusion Model. - Kyungguen Byun, Jason Filos, Erik Visser, Sunkuk Moon:

Voice-ENHANCE: Speech Restoration using a Diffusion-based Voice Conversion Framework. - Yonghyeon Jun, Beomjun Woo, Myeonghun Jeong, Nam Soo Kim:

SNR-Aligned Consistent Diffusion for Adaptive Speech Enhancement. - Nan Xu, Zhaolong Huang, Xiaonan Zhi:

MDDM: A Multi-view Discriminative Enhanced Diffusion-based Model for Speech Enhancement. - Xi Liu, Mu Yang, Szu-Jui Chen, John H. L. Hansen:

A Neural Codec Approach for Noise-Robust Bandwidth Expansion. - Xin Liu, Shulin He, Xueliang Zhang:

HWB-Net: A Novel High-Performance and Efficient Hybrid Waveform Bandwidth Extension Method. - Hongtao Bao, Xueliang Zhang:

Frequency-Domain Enhanced Extreme Bandwidth Extension Network with ICCRN for Superior Speech Quality.
Spoken Language Understanding
- Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam:

QUADS: Quantized Distillation Framework for Efficient Speech Language Understanding. - Neeraj Agrawal, Sriram Ganapathy:

Spoken Language Understanding on Unseen Tasks With In-Context Learning. - Pierre Lepagnol, Sahar Ghannay, Thomas Gerald, Christophe Servan, Sophie Rosset:

Leveraging Information Retrieval to Enhance Spoken Language Understanding Prompts in Few-Shot Learning. - Yi Huang, Si Chen, Jingyu Yao, Junlan Feng:

Modeling Multi-Turn Spoken Language Understanding with Dynamic Graph Convolutional Networks. - Ankit Kumar, Munir Georges:

DRI-GAN: A Novel Dual Real Input GAN with Triplet Loss for Cross-Lingual and Noisy SLU. - Alkis Koudounas, Moreno La Quatra, Eliana Pastor, Sabato Marco Siniscalchi, Elena Baralis:

"KAN you hear me?" Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding.
Multilingual Speech Synthesis and Special Applications 2
- Ashwin Sankar, Yoach Lacombe, Sherry Thomas, Praveen Srinivasa Varadhan, Sanchit Gandhi, Mitesh M. Khapra:

Rasmalai : Resources for Adaptive Speech Modeling in IndiAn Languages with Accents and Intonations. - Utkarsh Pathak, Chandra Sai Krishna Gunda, Anusha Prakash, Keshav Agarwal, Hema A. Murthy:

Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages. - Ariadna Sanchez, Simon King:

Can We Reconstruct a Dysarthric Voice with the Large Speech Model Parler TTS? - Samuel Stucki, Jan Deriu, Mark Cieliebak:

Voice Adaptation for Swiss German. - Thiago Henrique Gomes Lobato, Magnus Schäfer:

Gradual modeling of the Lombard effect by modifying speaker embeddings from a Text-To-Speech model. - Minsu Kang, Seolhee Lee, Choonghyeon Lee, Namhyun Cho:

When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds. - Yizhong Geng, Wenxin Fu, Qihang Lu, Bingsong Bai, Cong Wang, Yingming Gao, Ya Li:

EEG-based Voice Conversion: Hearing the Voice of Your Brain. - Tuan-Nam Nguyen, Ngoc-Quan Pham, Seymanur Akti, Alexander Waibel:

Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement. - Alon Levkovitch, Julian Salazar, Soroosh Mariooryad, R. J. Skerry-Ryan, Nadav Bar, W. Bastiaan Kleijn, Eliya Nachmani:

Zero-Shot Mono-to-Binaural Speech Synthesis.
Prosody and Voice Quality
- Parismita Gogoi, Sishir Kalita, Wendy Lalhminghlui, Viyazonuo Terhiija, Moakala Tzudir, Priyankoo Sarmah, S. R. M. Prasanna:

Tone recognition in low-resource languages of North-East India: peeling the layers of SSL-based speech models. - Jingyi Sun, Nicolas Audibert, Yaru Wu, Martine Adda-Decker:

Corpus-Based Insights into Mandarin Neutral Tone: Effects of Tonal Context and Structural Patterns in Spontaneous Speech. - Yu-Ying Chuang, Sheng-Fu Wang:

Tonal Variation and Word Meaning in Taiwanese. - Sofoklis Kakouros, Haoyu Chen:

Sounding Like a Winner? Prosodic Differences in Post-Match Interviews. - Anna Havras, Carlos Mendes, Helena Moniz, Gueorgui Hristovsky, João Miranda:

Exploratory Study of Filled Pauses in Ukrainian Language: Phonetic Properties of Filled Pauses. - Chloe Patman, Paul Foulkes, Kirsty McDougall:

Evaluating the suitability of acoustic parameters for capturing breathy voice in non-pathological female speakers. - Michaela Watkins, Rasmus Puggaard-Rode, Paul Boersma, Silke Hamann:

Robustness of F0 Ratio as a Diagnostic: Comparing Creaky Voice in Danish and Seoul Korean.
Generative Models for Audio
- Kfir Cohen, Lior Wolf, Bracha Laufer-Goldshtein:

Discovering Directions of Uncertainty in Speech Inpainting. - Chaeyoung Jung, Hojoon Ki, Ji-Hoon Kim, Junmo Kim, Joon Son Chung:

InfiniteAudio: Infinite-Length Audio Generation with Consistency. - Liming Liang, Luo Chen, Yuehan Jin, Xianwei Zhuang, Yuxin Xie, Yongkang Yin, Yuexian Zou:

FoleyMaster: High-Quality Video-to-Audio Synthesis via MLLM-Augmented Prompt Tuning and Joint Semantic-Temporal Adaptation. - Yuchen Hu, Yu Gu, Chenxing Li, Rilin Chen, Dong Yu:

Video-to-Audio Generation with Fine-grained Temporal Semantics. - Yuxuan He, Xiaoran Yang, Ningning Pan, Gongping Huang:

TTMBA: Towards Text To Multiple Sources Binaural Audio Generation. - Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu:

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer.
Challenges in Speech Data Collection, Curation and Annotation - Part 2
- Ünal Ege Gaznepoglu, Anna Leschanowsky, Ahmad Aloradi, Prachi Singh, Daniel Tenbrinck, Emanuël A. P. Habets, Nils Peters:

You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks. - Sakshi Joshi, Eldho Ittan George, Tahir Javed, Kaushal Bhogale, Nikhil Narasimhan, Mitesh M. Khapra:

Recognizing Every Voice: Towards Inclusive ASR for Rural Bhojpuri Women. - Fan Liu, Cheng Gong, Boyu Zhu, Ruihao Jing, Chunyu Qiang, Tianrui Wang, Xiao-Lei Zhang, Xuelong Li:

Augment Mandarin to Cantonese Speech Databases via Retrieval-Augmented Generation and Speech Synthesis. - Alexander Johnson, Harsh Deshpande, Emmy Phung, Ahmad Emami:

An Exploratory Framework for LLM-assisted Human Annotation of Speech Datasets. - Abderrahim Fathan, Jahangir Alam:

Automatic Labeling and Correction of Noisy Labels for Robust Self-Supervised Speaker Verification. - Xiangyu Zhang, Daijiao Liu, Tianyi Xiao, Cihan Xiao, Tünde Szalay, Mostafa Shahin, Beena Ahmed, Julien Epps:

Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark Extraction. - Tünde Szalay, Mostafa Shahin, Tharmakulasingam Sirojan, Zheng Nan, Renata Huang, Kirrie J. Ballard, Beena Ahmed:

AusKidTalk: Using Strategic Data Collection and Out-of-Domain Tools to Semi-Automate Novel Corpora Annotation. - Rui Cai, Titia Benders:

ASR-based segmentation for the analysis of larger child-speech datasets: Performance evaluation on vowels from Australian-English speaking children aged 4 to 11 years. - Polychronia Christodoulidou, James Tanner, Jane Stuart-Smith, Michael McAuliffe, Mridhula Murali, Amy Smith, Lauren Taylor, Joanne Cleland, Anja Kuschmann:

A semi-automatic pipeline for transcribing and segmenting child speech. - Komei Hiruta, Yosuke Yamano, Hideaki Tamori:

Hybrid Data Sampling for ASR: Integrating Acoustic Diversity and Transcription Uncertainty. - William Ravenscroft, George Close, Kit Bower-Morris, Jamie Stacey, Dmitry Sityaev, Kris Y. Hong:

Whilter: A Whisper-based Data Filter for "In-the-Wild" Speech Corpora Using Utterance-level Multi-Task Classification. - Astik Biswas, Oleg Shevelev, Amine Abdaoui, Vivek Tyagi, Abdelmoumene Boumadane:

Adapting Whisper for low-resource Hindi-English Code-Mix speech with on-the-fly Augmentation & LLM-Synthesised Data. - Carlos Mena, Pol Serra, Jacobo Romero, Abir Messaoudi, José Giraldo, Carme Armentano-Oller, Rodolfo Zevallos, Iván Meza, Javier Hernando:

Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies. - Nikolay Karpov, Sofia Kostandian, Nune Tadevosyan, Alexan Ayrapetyan, Andrei Andrusenko, Ara Yeroyan, Mher Yerznkanyan, Vitaly Lavrukhin:

From Scarcity to Sufficiency: Speech Recognition Pipeline for Zero-resource Language. - Yifan Cheng, Ruoyi Zhang, Jiatong Shi:

MIKU-PAL: An Automated and Standardized Multimodal Method for Speech Paralinguistic and Affect Labeling. - Andrew Chang, Chenkai Hu, Ji Qi, Zhuojian Wei, Kexin Zhang, Viswadruth Akkaraju, David Poeppel, Dustin Freeman:

Multimodal Fusion with Semi-Supervised Learning Minimizes Annotation Quantity for Modeling Videoconference Conversation Experience. - Ana Rita Valente, Rufael Marew, Hawau Olamide Toyin, Hamdan Al-Ali, Anelise Bohnen, Inma Becerra, Elsa Marta Soares, Gonçalo Leal, Hanan Aldarmaki:

Clinical Annotations for Automatic Stuttering Severity Assessment.
Speech Emotion Recognition 3
- Wonjune Kang, Junteng Jia, Chunyang Wu, Wei Zhou, Egor Lakomkin, Yashesh Gaur, Leda Sari, Suyoun Kim, Ke Li, Jay Mahadeokar, Ozlem Kalinli:

Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech. - Jialong Mai, Xiaofen Xing, Weidong Chen, Yuanbo Fang, Xiangmin Xu:

AA-SLLM: An Acoustically Augmented Speech Large Language Model for Speech Emotion Recognition. - Xiaohan Shi, Xingfeng Li, Tomoki Toda:

Speaker-Aware Multi-Task Learning for Speech Emotion Recognition. - Ziv Tamir, Thomas Thebaud, Jesús Villalba, Najim Dehak, Oren Kurland:

Multimodal Emotion Diarization: Frame-Wise Integration of Text and Audio Representations. - Pravin Mote, Abinay Reddy Naini, Donita Robinson, Elizabeth Richerson, Carlos Busso:

Analysis of Phonetic Level Similarities Across Languages in Emotional Speech. - Jiaxi Hu, Leyuan Qu, Haoxun Li, Taihao Li:

Label Semantic-Driven Contrastive Learning for Speech Emotion Recognition. - Minji Ryu, Ji-Hyeon Hur, Sung Heuk Kim, Gahgene Gweon:

Pitch Contour Model (PCM) with Transformer Cross-Attention for Speech Emotion Recognition.
Emotion and Expressivity in Speech Synthesis and Voice Conversion
- Jingyuan Xing, Zhipeng Li, Shuaiqi Chen, Xiaofen Xing, Xiangmin Xu:

EATS-Speech: Emotion-Adaptive Transformation and Priority Synthesis for Zero-Shot Text-to-Speech. - Kenichi Fujita, Shota Horiguchi, Yusuke Ijima:

Voice Impression Control in Zero-Shot TTS. - Haoxun Li, Leyuan Qu, Jiaxi Hu, Taihao Li:

EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis. - Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee:

DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech. - Nam-Gyu Kim, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee:

Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech. - Masato Murata, Koichi Miyazaki, Tomoki Koriyama:

Speaker-agnostic Emotion Vector for Cross-speaker Emotion Intensity Control. - Xueru Li, Jingyuan Xing, Xiaofen Xing, Zhipeng Li, Xiangmin Xu:

SA-RAS: Speaker-Aware Style Retrieval Augmented Generation for Expressive Zero-Shot Text-to-Speech Synthesis. - Xiaosu Su, Bowen Yang, Xiaowei Yi, Yun Cao:

DiffEmotionVC: A Dual-Granularity Disentangled Diffusion Framework for Any-to-Any Emotional Voice Conversion. - Hsing-Hang Chou, Yun-Shao Lin, Ching-Chin Sung, Yu Tsao, Chi-Chun Lee:

ZSDEVC: Zero-Shot Diffusion-based Emotional Voice Conversion with Disentangled Mechanism. - Zhichao Wu, Yueteng Kang, Songjun Cao, Long Ma, Qiulin Li, Qun Yang:

MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt.
Streaming ASR
- Yinfeng Xia, Huiyan Li, Chenyang Le, Manhong Wang, Yutao Sun, Xingyang Ma, Yanmin Qian:

MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition. - Longhao Li, Yangze Li, Hongfei Xue, Jie Liu, Shuai Fang, Kai Wang, Lei Xie:

Delayed-KD: Delayed Knowledge Distillation based CTC for Low-Latency Streaming ASR. - Yunjae Nam, Jeong U. Han, Kiyeon Kim, Jaemin Lim:

Parameter-efficient Fine-tuning of Conformer-based Streaming Speech Recognition into Non-streaming Models. - Kwanghee Choi, Masao Someki, Emma Strubell, Shinji Watanabe:

On-device Streaming Discrete Speech Units. - Haoran Zhou, Xingchen Song, Brendan Fahy, Qiaochu Song, Binbin Zhang, Zhendong Peng, Anshul Wadhawan, Denglin Jiang, Apurv Verma, Vinay Ramesh, Srivas Prasad, Michele M. Franceschini:

Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding. - Luong Ho, Khanh Le, Vinh Pham, Bao Nguyen, Tan Tran, Duc Chau:

Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization.
L1 and L2 Acquisition, Perception and Processing
- Wenwei Dong, Catia Cucchiarini, Roeland van Hout, Helmer Strik:

Evaluating Progress of CALL System Users on Accentedness and Comprehensibility: An Acoustic and ASR-Based Approach. - Rory Turnbull, Elisa Kiefer, Sharon Peperkamp:

Does English fish sound like French fiche? Perceptual similarity judgments versus acoustic similarity. - Jinxin Ji, Yiying Hu, Xiaohu Yang, Gang Peng:

Acoustic Features of Mandarin Tone Production in Noise: A Comparison Between Chinese Native Speakers and Korean L2 Learners. - Fengyue Lisa Zhao, Jennifer Kuo:

The Role of Contextual Variation in Learning Cantonese Tones from Naturalistic Speech. - Mengxue Cao, Tianxin Zheng, Jiewen Zheng:

Pitch Target Realization in Putonghua Tone Production of Children from Dialect-Speaking Regions. - Aijun Li, Zhiwei Wang, Jun Gao, Xin Zhou:

The Development of Speech Rhythm in Putonghua-Learning Preschool Children in South Xinjiang Uyghur Autonomous Region of China.
Speech Emotion Recognition 2
- Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Jaya Sai Kiran Patibandla, Arun Balaji Buduru, Rajesh Sharma:

PARROT: Synergizing Mamba and Attention-based SSL Pre-Trained Models via Parallel Branch Hadamard Optimal Transport for Speech Emotion Recognition. - Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Shubham Singh, Swarup Ranjan Behera, Vandana Rajan, Muskaan Singh, Arun Balaji Buduru, Rajesh Sharma:

Towards Machine Unlearning for Paralinguistic Speech Processing. - Junyu Zhou, Yanxiong Li, Haolin Yu:

Infant Cry Emotion Recognition Using Improved ECAPA-TDNN with Multi-scale Feature Fusion and Attention Enhancement. - Shanshan Xiang, Hankiz Yilahun, Askar Hamdulla:

Speech Multi-label Emotion Recognition Using Asymmetric Class Loss Function Based on Effective Samples. - Felix Burkhardt, Oliver Schrüfer, Uwe D. Reichel, Hagen Wierstorf, Anna Derington, Florian Eyben, Björn W. Schuller:

EmoDB 2.0: A Database of Emotional Speech in a World that is not Black or White but Grey. - Zhaohui Zhou, Hui Luo:

Cross-corpus open-set Speech Emotion Recognition Method Based on Spatiotemporal Features with Inverse-Entropy Regularization. - Jiacheng Shi, Yanfu Zhang, Ye Gao:

CLEP-DG: Contrastive Learning for Speech Emotion Domain Generalization via Soft Prompt Tuning. - Varsha Pendyala, Pedro Morgado, William A. Sethares:

Leveraging Unlabeled Audio-Visual Data in Speech Emotion Recognition using Knowledge Distillation.
Speaker Traits Recognition
- Ambika Kirkland, Jens Edlund:

Who knows best? Effects of speech disfluencies on incentivized decision-making. - Phuoc Hoang Ho, Dragos Alexandru Balan, Dirk K. J. Heylen, Khiet P. Truong:

Enhancing Transcripts of Open-Source Automatic Speech Recognition Models Through Fine-Tuning with Laughter and Speech-Laugh. - Pranjal Aggarwal, Ghritachi Mahajani, Pavan Kumar Malasani, Vaibhav Jamadagni, Caroline J. Wendt, Ehsanul Haque Nirjhar, Theodora Chaspari:

Investigating the Reasoning Abilities of Large Language Models for Understanding Spoken Language in Interpersonal Interactions. - Karumannil Mohamed Ismail Yasar Arafath, Mohammed Abeer K. C., Aurobinda Routray:

A Naturally Elicited Multimodal Stress Database and Speech Breathing Based Stress Detection. - Debasmita Bhattacharya, Aanya Tolat, Julia Hirschberg:

From Context to Code-switching: Examining the Interplay of Language Proficiency and Multilingualism in Speech. - D. Fortune Kponou, Salima Mdhaffar, Fréjus A. A. Laleye, Eugène C. Ezin, Yannick Estève:

Extending the Fongbe to French Speech Translation Corpus: resources, models and benchmark. - Kevin Huang, Sean Foley, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan:

On the Relationship between Accent Strength and Articulatory Features. - Katerina Papadimitriou, Gerasimos Potamianos:

A Multi-Stream Framework Utilizing 3D Human Reconstruction for Cued Speech Recognition. - Oliver Niebuhr:

On the cross-modal makeup of charisma: Insights from a field-data analysis.
Spoofing and Adversarial Attacks
- Arnab Das, Yassine El Kheir, Carlos Franzreb, Tim Herzig, Tim Polzehl, Sebastian Möller:

Generalizable Audio Spoofing Detection using Non-Semantic Representations. - Sreekanth Sankala, Venkatesh Parvathala, Ramesh Gundluru, K. Sri Rama Murty:

Adversarial Attacks on Text-dependent Speaker Verification System. - Shilpa Chandra, Akansha Tyagi, Shiven Patel, Padmanabhan Rajan:

Beyond Attacks: Advancing Fake Speech Detection with Attack-Agnostic Methods. - Avishai Weizman, Yehuda Ben-Shimol, Itshak Lapidot:

ASVspoof2019 vs. ASVspoof5: Assessment and Comparison. - Aykut Büker, Oguzhan Kurnaz, Sule Bekiryazici, Selim Can Demirtas, Cemal Hanilçi:

Evaluating Parameter Sharing for Spoofing-Aware Speaker Verification: A Case Study on the ASVspoof 5 Dataset. - Bikash Dutta, Rishabh Ranjan, Shyam Sathvik, Mayank Vatsa, Richa Singh:

Can Quantized Audio Language Models Perform Zero-Shot Spoofing Detection?
Voice Conversion 2
- Yu Pan, Yanni Hu, Yuguang Yang, Jixun Yao, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao:

ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech. - Tianhua Qi, Shiyan Wang, Cheng Lu, Tengfei Song, Hao Yang, Zhanglin Wu, Wenming Zheng:

PromptEVC: Controllable Emotional Voice Conversion with Natural Language Prompts. - Fengjin Li, Jie Wang, Yadong Niu, Yongqing Wang, Meng Meng, Jian Luan, Zhiyong Wu:

StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion. - Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo:

FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation.
Pathological Speech Analysis 3
- Emmy Postma, Cristian Tejedor García:

Evaluating the Effectiveness of Pre-Trained Audio Embeddings for Classification of Parkinson's Disease Speech Data. - Shujie Hu, Xurong Xie, Mengzhe Geng, Jiajun Deng, Huimeng Wang, Guinan Li, Chengxi Deng, Tianzi Wang, Mingyu Cui, Helen Meng, Xunying Liu:

On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition. - Jia-Jyu Su, Yen-Ting Lin, Wu-Hao Li, Chao-Kai Chang, Yan-Zhi Chen, Chen-Yu Chiang:

Lightweight Speech Enhancement for Mandarin Esophageal Speech. - Yubin Kim, Taehan Kim, Wonjune Kang, Eugene Park, Joonsik Yoon, Dongjae Lee, Xin Liu, Daniel McDuff, Hyeonhoon Lee, Cynthia Breazeal, Hae Won Park:

VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation. - Sumaya Ahmed Salihs, Isaac Wiafe, Jamal-Deen Abdulai, Elikem Doe Atsakpo, Gifty Ayoka, Richard Cave, Akon Obu Ekpezu, Catherine Holloway, Katrin Tomanek, Fiifi Baffoe Payin Winful:

A Cookbook for Community-driven Data Collection of Impaired Speech in Low-Resource Languages. - Jaya Narain, Vasudha Kowtha, Colin Lea, Lauren Tooley, Dianna Yee, Vikramjit Mitra, Zifang Huang, Miquel Espi Marques, Jon Huang, Carlos Avendaño, Shirley Ren:

Voice Quality Dimensions as Interpretable Primitives for Speaking Style for Atypical Speech and Affect.
Speech Emotion Recognition in Naturalistic Conditions Challenge
- Youjun Chen, Xurong Xie, Haoning Xu, Mengzhe Geng, Guinan Li, Chengxi Deng, Huimeng Wang, Shujie Hu, Xunying Liu:

Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition. - Drishya Uniyal, Vinayak Abrol:

From Pretraining to Performance: Benchmarking Self-Supervised Speech Models for Interspeech-25 SER Challenge. - Tiantian Feng, Thanathai Lertpetchpun, Dani Byrd, Shrikanth Narayanan:

Developing a Top-tier Framework in Naturalistic Conditions Challenge for Categorized Emotion Prediction: From Speech Foundation Models and Learning Objective to Data Augmentation and Engineering Choices. - Thanathai Lertpetchpun, Tiantian Feng, Dani Byrd, Shrikanth Narayanan:

Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction. - Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee:

EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification. - Tahitoa Leygue, Astrid Sabourin, Christian Bolzmacher, Sylvain Bouchigny, Margarita Anastassova, Quoc-Cuong Pham:

Explainable Speech Emotion Recognition Through Attentive Pooling: Insights from Attention-Based Temporal Localization. - Soumya Dutta, Smruthi Balaji, Varada R, Viveka Salinamakki, Sriram Ganapathy:

ABHINAYA - A System for Speech Emotion Recognition In Naturalistic Conditions Challenge. - Abinay Reddy Naini, Lucas Goncalves, Ali N. Salman, Pravin Mote, Ismail Rasim Ulgen, Thomas Thebaud, Laureano Moro-Velázquez, Leibny Paola García, Najim Dehak, Berrak Sisman, Carlos Busso:

The Interspeech 2025 Challenge on Speech Emotion Recognition in Naturalistic Conditions. - Hyo Jin Jon, Longbin Jin, Hyuntaek Jung, Hyunseo Kim, Donghun Min, Eun Yi Kim:

MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition. - Bartlomiej Zgórzynski, Juliusz Wójtowicz-Kruk, Piotr Masztalski, Wladyslaw Sredniawa:

Multi-task learning for speech emotion recognition in naturalistic conditions. - Georgios Chatzichristodoulou, Despoina Kosmopoulou, Antonios Kritikos, Anastasia Poulopoulou, Efthymios Georgiou, Athanasios Katsamanis, Vassilis Katsouros, Alexandros Potamianos:

Medusa: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions. - Yuyun Liu, Yujia Gu, Jiahao Luo, Wenming Zheng, Cheng Lu, Yuan Zong:

Interactive Fusion of Multi-View Speech Embeddings via Pretrained Large-Scale Speech Models for Speech Emotional Attribute Prediction in Naturalistic Conditions. - Xiaohan Shi, Jinyi Mi, Xingfeng Li, Tomoki Toda:

Advancing Emotion Recognition via Ensemble Learning: Integrating Speech, Context, and Text Representations. - Lucas H. Ueda, João Lima, Leonardo Marques, Paula Dornhofer Paro Costa:

Improving Speech Emotion Recognition Through Cross Modal Attention Alignment and Balanced Stacking Model. - Prabhav Singh, Jesús Villalba:

EmoJudge: LLM Based Post-Hoc Refinement for Multimodal Speech Emotion Recognition. - Jing-Tong Tzeng, Bo-Hao Su, Ya-Tse Wu, Hsing-Hang Chou, Chi-Chun Lee:

Lessons Learnt: Revisit Key Training Strategies for Effective Speech Emotion Recognition in the Wild. - Honghong Wang, Jing Deng, Fanqin Meng, Rong Zheng:

Enhancing Speech Emotion Recognition with Multi-Task Learning and Dynamic Feature Fusion.
Prosody, Phoneme and Stress Modeling in ASR
- Iddo Yosha, Dorin Shteyman, Yossi Adi:

WhiStress: Enriching Transcriptions with Sentence Stress Detection. - Sarenne Wallbridge, Christoph Minixhofer, Catherine Lai, Peter Bell:

Prosodic Structure Beyond Lexical Content: A Study of Self-Supervised Learning. - David Portes, Ales Horák:

Learning Optimal Prosody Embedding Codebook based on F0 and Energy. - David Sasu, Natalie Schluter:

Pitch Accent Detection improves Pretrained Automatic Speech Recognition. - Xuanru Zhou, Jiachen Lian, Cheol Jun Cho, Tejas S. Prabhune, Shuhe Li, William Li, Rodrigo Ortiz, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli:

Towards Accurate Phonetic Error Detection Through Phoneme Similarity Modeling. - Louise Coppieters de Gibson, Philip N. Garner:

Exploring auditory feedback mechanisms in speech recognition.
Segments
- Mathilde Hutin, Mélanie Lancien, Noam Faust:

French schwa is not acoustically distinct from its two lexical neighbors /ø/ and /œ/. - Jingyi Sun, Bowei Shao, Martine Adda-Decker:

Apical vs. Regular Vowel Duration: A Corpus-based Analysis of Contextual Influences in Standard Mandarin. - Xuying Wang, Fang Hu:

On Apical Vowels in Eastern Zhenjiang Mandarin. - Philipp Buech, Anne Hermes, Rachid Ridouane:

Equivalence and differences: Formant patterns of labialization and pharyngealization in Tashlhiyt. - Yifan Yang, Zhiheng Qian:

Temporal organization of prenuclear glides in Hefei Mandarin. - Chloe D. Kwon:

Speaker-specific Patterns of Phonetic Covariation in Korean Word-medial Stops and the Role of Phonological and Morphological Contexts.
Datasets and Tools for Speech Synthesis
- Ryan Langman, Xuesong Yang, Paarth Neekhara, Shehzeen Hussain, Edresson Casanova, Evelina Bakhturina, Jason Li:

HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset. - Yuto Kondo, Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko:

JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles. - Mina Serajian, Saeed Najafzadeh Rahaghi, Hadi Veisi, Saman Haratizadeh:

FaVC: A Validated, Transcribed, Parallel Farsi Speech Dataset for Voice Conversion. - Vasista Sai Lodagala, Lamya Alkanhal, Daniel Izham, Shivam Mehta, Shammur Absar Chowdhury, Aqeelah Makki, Hamdy S. Hussein, Gustav Eje Henter, Ahmed Ali:

SawtArabi: A Benchmark Corpus for Arabic TTS: Standard, Dialectal and Code-Switching. - Jee-weon Jung, Wangyou Zhang, Soumi Maiti, Yihan Wu, Xin Wang, Ji-Hoon Kim, Yuta Matsunaga, Seyun Um, Jinchuan Tian, Hye-jin Shim, Nicholas W. D. Evans, Joon Son Chung, Shinnosuke Takamichi, Shinji Watanabe:

The Text-to-speech in the Wild (TITW) Database. - Rui Liu, Pu Gao, Jiatian Xi, Berrak Sisman, Carlos Busso, Haizhou Li:

Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset. - Hawau Olamide Toyin, Rufael Marew, Humaid Alblooshi, Samar M. Magdy, Hanan Aldarmaki:

ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis. - Alejandro Sosa Welford, Leonardo Pepino:

A Dataset for Automatic Assessment of TTS Quality in Spanish.
Spoken Dialogue Systems 2
- Pradyoth Hegde, Santosh Kesiraju, Jan Svec, Simon Sedlácek, Bolaji Yusuf, Oldrich Plchot, Deepak K. T, Jan Cernocký:

Factors affecting the in-context learning abilities of LLMs for dialogue state tracking. - Nimrod Shabtay, Zvi Kons, Avihu Dekel, Hagai Aronowitz, Ron Hoory, Assaf Arbelle:

Spoken Question Answering for Visual Queries. - Taesoo Kim, Yongsik Jo, Hyunmin Song, Taehwan Kim:

Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech. - Siddhant Arora, Jinchuan Tian, Hayato Futami, Jee-weon Jung, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe:

Chain-of-Thought Training for Open E2E Spoken Dialogue Systems.
Speech Enhancement and Representation Learning
- Xihao Yuan, Siqi Liu, Yan Chen, Hang Zhou, Chang Liu, Hanting Chen, Jie Hu:

SaD: A Scenario-Aware Discriminator for Speech Enhancement. - Soo-Whan Chung, Min-Seok Choi:

Listen through the Sound: Generative Speech Restoration Leveraging Acoustic Context Representation. - Xingwei Sun, Heinrich Dinkel, Yadong Niu, Linzhang Wang, Junbo Zhang, Jian Luan:

Efficient Speech Enhancement via Embeddings from Pre-trained Generative Audioencoders. - Mandar Gogate, Kia Dashtipour, Amir Hussain:

Towards Personalised Audio Visual Speech Enhancement. - Ziqian Wang, Zikai Liu, Xinfa Zhu, Yike Zhu, Mingshuai Liu, Jun Chen, Longshuai Xiao, Chao Weng, Lei Xie:

FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching. - Seonggyu Lee, Sein Cheong, Sangwook Han, Kihyuk Kim, Jong Won Shin:

Speech Enhancement based on cascaded two flows. - Junbo Zhang, Heinrich Dinkel, Yadong Niu, Chenyu Liu, Si Cheng, Anbei Zhao, Jian Luan:

X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance. - Oguzhan Baser, Ahmet Ege Tanriverdi, Kaan Kale, Sandeep Chinchali, Sriram Vishwanath:

WavShape: Information-Theoretic Speech Representation Learning for Fair and Privacy-Aware Audio Processing.
Neural Codecs and Vocoders
- Youqiang Zheng, Weiping Tu, Yueteng Kang, Jie Chen, Yike Zhang, Li Xiao, Yuhong Yang, Long Ma:

FreeCodec: A Disentangled Neural Speech Codec with Fewer Tokens. - Jiaqi Li, Xiaolong Lin, Zhekai Li, Shixi Huang, Yuancheng Wang, Chaoren Wang, Zhenpeng Zhan, Zhizheng Wu:

DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation. - Reo Yoneyama, Masaya Kawamura, Ryo Terashima, Ryuichi Yamamoto, Tomoki Toda:

Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments. - Junchuan Zhao, Xintong Wang, Ye Wang:

Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning. - Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo:

Vocoder-Projected Feature Discriminator. - Zhuangqi Chen, Xianjun Xia, Xiaohuai Le, Siyu Sun, Chuanzeng Huang:

AF-Vocoder: Artifact-Free Neural Vocoder with Global Artifact Filter. - Peijie Chen, Wenhao Guan, Kaidi Wang, Weijie Wu, Hukai Huang, Qingyang Hong, Lin Li:

DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec. - Masato Takagi, Miku Nishihara, Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda:

PeriodCodec: A Pitch-Controllable Neural Audio Codec Using Periodic Signals for Singing Voice Synthesis.
Adaptation and Target-speaker ASR
- Ju-Seok Seong, Jeong-Hwan Choi, Ye-Rin Jeoung, Ilseok Kim, Joon-Hyuk Chang:

Enhancing Target-speaker Automatic Speech Recognition Using Multiple Speaker Embedding Extractors with Virtual Speaker Embedding. - Yuta Hirano, Sakriani Sakti:

SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition. - Pradeep Rangappa, Andrés Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth R. Madikeri, Esaú Villatoro-Tello, Bidisha Sharma, Petr Motlícek, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke:

Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering. - Chengxi Deng, Xurong Xie, Shujie Hu, Mengzhe Geng, Yicong Jiang, Jiankun Zhao, Jiajun Deng, Guinan Li, Youjun Chen, Huimeng Wang, Haoning Xu, Mingyu Cui, Xunying Liu:

MOPSA: Mixture of Prompt-Experts Based Speaker Adaptation for Elderly Speech Recognition. - Zhao Yang, Rui Jiang, Yue Heng Yeo, Xiao Fu, Wei Xi, Jizhong Zhao:

Visually-Adaptive Guided Robust Speech Recognition with Parameter-Efficient Adaptation. - Peter Vieting, Maximilian Kannen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney:

Regularizing Learnable Feature Extraction for Automatic Speech Recognition. - Yuanbo Fang, Xiaofen Xing, Xueru Li, Weibin Zhang, Xiangmin Xu:

MMLoRA: Multitask Memory Parameter-Efficient Fine-Tuning for Multimodal SER. - Rogier C. van Dalen, Shucong Zhang, Titouan Parcollet, Sourav Bhattacharya:

Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes.
Show and Tell 4: Education / Assistive Technology
- Javier Román, Pol Pastells, Mauro Vázquez Chas, Clara Puigventós, Montserrat Nofre, Mariona Taulé, Mireia Farrús:

SCRIBAL: A Digital Transcription Tool in Higher Education. - Juliana Francis, Joakim Gustafsson, Éva Székely:

From Static to Dynamic: Enhancing AAC with Generative Imagery and Zero-Shot TTS. - Dhia Eddine Merzougui, Nilesh Tete, Fabrice Maurel, Gaël Dias, Mohammed Hasanuzzaman, Aurélien Bournonville, Edgar Madelaine, Thomas Berthelin Le Tellier, François Ledoyen, Laure Poutrain-Lejeune, François Rioult, Jérémie Pantin:

Concurrent Speech and Auditory Tag Clouds for Non-Visual Web Interaction. - Alex Peiró Lilja, Rodolfo Zevallos, Carme Armentano-Oller, José Giraldo, Cristina España-Bonet, Mireia Farrús:

Towards Domain-Specific Spoken Language Understanding for a Catalan Voice-Controlled Video Game. - Tara McAllister, Peter Traver, Amanda Eads, William Haack, Helen Carey, Yi Shan, Wendy Liang, Tae Hong Park:

Accessible Delivery of Visual-Acoustic Biofeedback for Speech Sound Disorder. - Giri Raju, Sandeep Konam:

End-to-End Indian Language Dubbing with Zero-Shot Speaker Preservation.
Source Separation 2
- Junqi Yang, Yuhong Yang, Weiping Tu, Xin Zhao, Cedar Lin:

Band-SCNet: A Causal, Lightweight Model for High-Performance Real-Time Music Source Separation. - Runduo Han, Yanxin Hu, Yihui Fu, Zihan Zhang, Yukai Jv, Li Chen, Lei Xie:

CabinSep: IR-Augmented Mask-Based MVDR for Real-Time In-car Speech Separation with Distributed Heterogeneous Arrays. - Geonyoung Lee, Geonhee Han, Paul Hongsuck Seo:

DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization. - Xue Yang, Guiru Shen, Yu Yang:

Cross-Attention-Based Target Sound Extraction by Fully Leveraging Enrollment in a Shared Latent Space. - Takuya Hasumi, Yusuke Fujita:

DnR-nonverbal: Cinematic Audio Source Separation Dataset Containing Non-Verbal Sounds. - Malek Itani, Ashton Graves, Sefik Emre Eskimez, Shyamnath Gollakota:

Neural Speech Extraction with Human Feedback.
Speech Coding
- Hanglei Zhang, Yiwei Guo, Zhihan Li, Xiang Hao, Xie Chen, Kai Yu:

Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate. - Liang Wen, Lizhong Wang, Yuxing Zheng, Weijing Shi, Kwang Pyo Choi:

SPCODEC: Split and Prediction for Neural Speech Codec. - Wei-Cheng Tseng, David Harwath:

Probing the Robustness Properties of Neural Speech Codecs. - Yiwei Guo, Zhihan Li, Chenpeng Du, Hankun Wang, Xie Chen, Kai Yu:

LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec. - Samir Sadok, Julien Hauret, Éric Bavu:

Bringing Interpretability to Neural Audio Codecs. - Edresson Casanova, Paarth Neekhara, Ryan Langman, Shehzeen Hussain, Subhankar Ghosh, Xuesong Yang, Ante Jukic, Jason Li, Boris Ginsburg:

NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference.
Multimodality
- Matteo Maran, Renske Rötjes, Anna R. E. Schreurs, Hans Rutger Bosker:

Beat gestures made by human-like avatars affect speech perception. - Dan Oneata, Leanne Nortje, Yevgen Matusevych, Herman Kamper:

The mutual exclusivity bias of bilingual visually grounded speech models. - Kyeongman Park, Seongho Joo, Kyomin Jung:

MultiActor-Audiobook: Zero-Shot Audiobook Generation with Faces and Voices of Multiple Speakers. - Wenxuan Wu, Shuai Wang, Xixin Wu, Helen Meng, Haizhou Li:

Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction. - João Menezes, Aubin Mouras, Arne-Lukas Fietkau, Dani Kazzy, Peter Birkholz:

Multimodal Silent Recognition of Phonemes Using Radar and Optopalatographic Silent Speech Interfaces.
Speech Assessment and Language Learning
- Meenakshi Sirigiraju, Chiranjeevi Yarra:

GoP2Vec: A few shot learning for pronunciation assessment with goodness of pronunciation (GoP) based representations from an i-vector framework and augmentation. - Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik:

Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge. - Sehyun Oh, Minhwa Chung, Sunhee Kim:

Multilingual Speech Assessment Using Cross-Attention and Multitask Learning. - Rao Ma, Mengjie Qian, Siyuan Tang, Stefano Bannò, Kate M. Knill, Mark J. F. Gales:

Assessment of L2 Oral Proficiency using Speech Large Language Models. - Mengjie Qian, Rao Ma, Stefano Bannò, Kate M. Knill, Mark J. F. Gales:

Scaling and Prompting for Improved End-to-End Spoken Grammatical Error Correction. - Muyeol Choi, HyunJung Choi, Yohan Lim, Jeong-Uk Bang, Minkyu Lee, Seon Hui Kim, Seung Yun, Donghyun Kim, Minsoo Kim, Sanghun Kim:

Bidirectional Spoken-Written Text Conversion with Large Language Models.
Watermarking and Anonymization
- Yaoxun Xu, Jianwei Yu, Hangting Chen, Zhiyong Wu, Xixin Wu, Dong Yu, Rongzhi Gu, Yi Luo:

WAKE: Watermarking Audio with Key Enrichment. - Yu-Sheng Lin, Ching-Yu Yang, Hsing-Hang Chou, Ya-Tse Wu, Bo-Hao Su, Chi-Chun Lee:

Defend for Self-Vocoding: A Novel Enhanced Decoder Network for Watermark Recovery. - Minyoung Kim, Sehwan Park, Sungmin Cha, Paul Hongsuck Seo:

Cross-Modal Watermarking for Authentic Audio Recovery and Tamper Localization in Synthesized Audiovisual Forgeries. - Haiyun Li, Zhiyong Wu, Xiaofeng Xie, Jingran Xie, Yaoxun Xu, Hanyang Peng:

VoiceMark: Zero-Shot Voice Cloning-Resistant Watermarking Approach Leveraging Speaker-Specific Latents. - Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh, Wei-Hsiang Liao, Yuki Mitsufuji:

A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs? - Xijie Zeng, Frank Rudzicz:

How to Recover Long Audio Sequences Through Gradient Inversion Attack With Dynamic Segment-based Reconstruction. - Sarina Meyer, Ekaterina Kolos, Ngoc Thang Vu:

First Steps Towards Voice Anonymization for Code-Switching Speech. - Natalia A. Tomashenko, Emmanuel Vincent, Marc Tommasi:

Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems. - Zhe Zhang, Wen-Chin Huang, Xin Wang, Xiaoxiao Miao, Junichi Yamagishi:

Mitigating Language Mismatch in SSL-Based Speaker Anonymization.
Single-channel Speech Enhancement
- Venkatesh Parvathala, K. Sri Rama Murty:

MSFNet: A Nested Model for Multi-Sampling-Frequency Speech Enhancement. - Zixuan Li, Shulin He, Jinglin Bai, Xueliang Zhang:

TF-SkiMNet: Speech Enhancement Based on Inplace Modeling and Skipping Memory in Time-Frequency Domain. - Nikolai Lund Kühne, Jan Østergaard, Jesper Jensen, Zheng-Hua Tan:

xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement. - Haoyang Li, Yuchen Hu, Chen Chen, Sabato Marco Siniscalchi, Songting Liu, Eng Siong Chng:

From KAN to GR-KAN: Advancing Speech Enhancement with KAN-Based Methodology. - Jangyeon Kim, Ui-Hyeop Shin, Jaehyun Ko, Hyung-Min Park:

Stack Less, Repeat More: A Block Reusing Approach for Progressive Speech Enhancement. - Se-Ha Kim, Tae-Gyeong Kim, Chang-Jae Chun:

Mamba-based Hybrid Model for Speech Enhancement. - Yu Zhao, Zengqiang Shang, Mou Wang, Xin Liu, Pengyuan Zhang:

Restoring Harmonics: Enhancing Speech Quality with Deep Mask and Harmonic Restoration Network.
Contextual Biasing and Adaptation
- Yuxiang Kong, Fan Cui, Liyong Guo, Heinrich Dinkel, Lichun Fan, Junbo Zhang, Jian Luan:

GLCLAP: A Novel Contrastive Learning Pre-trained Model for Contextual Biasing in ASR. - Yu Nakagome, Michael Hentschel:

WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing. - Haoxiang Hou, Xun Gong, Wangyou Zhang, Wei Wang, Yanmin Qian:

Ranking and Selection of Bias Words for Contextual Bias Speech Recognition. - Yui Sudo, Yusuke Fujita, Atsushi Kojima, Tomoya Mizumoto, Lianbo Liu:

OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary. - Zijian Yang, Minh-Nghia Phan, Ralf Schlüter, Hermann Ney:

Label-Context-Dependent Internal Language Model Estimation for CTC. - Rodolfo Zevallos, Martí Cortada Garcia, Sarah Solito, Carlos Mena, Alex Peiró Lilja, Javier Hernando:

Assessing the Performance and Efficiency of Mamba ASR in Low-Resource Scenarios. - Hongli Yang, Yizhou Peng, Hao Huang, Sheng Li:

Adapting Whisper for Parameter-efficient Code-Switching Speech Recognition via Soft Prompt Tuning.
Speaker Diarization 2
- Shota Horiguchi, Takanori Ashihara, Marc Delcroix, Atsushi Ando, Naohiro Tawara:

Mitigating Non-Target Speaker Bias in Guided Speaker Embedding. - Zhaoyang Li, Jie Wang, XiaoXiao Li, Wangjie Li, Longjie Luo, Lin Li, Qingyang Hong:

Speaker Diarization with Overlapping Community Detection Using Graph Attention Networks and Label Propagation Algorithm. - Samuel J. Broughton, Lahiru Samarakoon:

Pushing the Limits of End-to-End Diarization. - Tobias Cord-Landwehr, Tobias Gburrek, Marc Deegen, Reinhold Haeb-Umbach:

Spatio-Spectral Diarization of Meetings by Combining TDOA-based Segmentation and Speaker Embedding-based Clustering. - Hongyu Zhang, Ming Cheng, Jing Feng, Ming Li:

Selective Channel Attention based Target Speaker Voice Activity Detection for Speaker Diarization under AD-HOC Microphone Array Settings. - Joonas Kalda, Clément Pagés, Tanel Alumäe, Hervé Bredin:

Diarization-Guided Multi-Speaker Embeddings. - Ivan Medennikov, Taejin Park, Weiqing Wang, He Huang, Kunal Dhawan, Jinhan Wang, Jagadeesh Balam, Boris Ginsburg:

Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering. - Bongjun Kim, Arindam Ghosh, Mark C. Fuhs, Anurag Chowdhury, Deblin Bagchi, Monika Woszczyna:

A Hybrid Approach to Combining Role Diarization with ASR for Professional Conversations.
Depression Detection and Assessment 2
- Qingkun Deng, Saturnino Luz, Sofia de la Fuente Garcia:

An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech. - Lucía Gómez-Zaragozá, Javier Marín-Morales, Mariano Alcañiz, Mohammad Soleymani:

Speech and Text Foundation Models for Depression Detection: Cross-Task and Cross-Language Evaluation. - Bubai Maji, Monorama Swain, Shazia Nasreen, Debabrata Majumdar, Rajlakshmi Guha, Aurobinda Routray, Anders Søgaard:

A Study on The Impact of Foundation Models on Automatic Depression Detection from Speech Signals. - Nelson Hidalgo Julia, Robert Lewis, Craig Ferguson, Simon Goldberg, Wendy Lau, Caroline Swords, Gabriela Valdivia, Christine D. Wilson-Mendenhall, Raquel Tartar, Rosalind Picard, Richard Davidson:

Identifying Vocal and Facial Biomarkers of Depression in Large-Scale Remote Recordings: A Multimodal Study Using Mixed-Effects Modeling. - Jiajun You, Shuai Wang, Xun Gong, Xiang Wan:

M3L: A Multi-Modal and Multi-Lingual Depression Detection Framework.
Keynote 4 - Judith Holler: Using and comprehending language in face-to-face conversation
- Judith Holler:

Using and comprehending language in face-to-face conversation.
Pathological Speech Analysis 4
- David Gimeno-Gómez, Rubén Solera-Ureña, Anna Pompili, Carlos D. Martínez-Hinarejos, Rita Cardoso, Isabel Guimarães, Joaquim J. Ferreira, Alberto Abad:

On the Relevance of Clinical Assessment Tasks for the Automatic Detection of Parkinson's Disease Medication State from Speech. - Sevada Hovsepyan, Mathew Magimai-Doss:

Speech power spectra: a window into neural oscillations in Parkinson's disease. - Daniel Escobar-Grisales, Cristian David Ríos-Urrego, Sabato Marco Siniscalchi, Adolfo M. García, Yamile Bocanegra, Leonardo Moreno, Elmar Nöth, Juan Rafael Orozco-Arroyave:

Synchronous analysis of abnormal acoustic and linguistic production in Parkinson's speech. - Fritz Peters, W. Richard Bevan-Jones, Grace Threlfall, Jenny M. Harris, Julie S. Snowden, Matthew Jones, Jennifer C. Thompson, Daniel J. Blackburn, Heidi Christensen:

Automatic Detection and Sub-typing of Primary Progressive Aphasia from Speech: Integrating Task-Specific Features and Spatio-Semantic Graphs. - Priyanka Kommagouni, Pragya Khanna, Vamshiraghusimha Narasinga, Anirudh Bocha, Anil Kumar Vuppala:

Towards Classification of Typical and Atypical Disfluencies: A Self Supervised Representation Approach. - Genzo Miyahara, Tsuneo Kato, Akihiro Tamura:

Stuttering Detection Based on Self-Attention Weights of Temporal Acoustic Vector Sequence. - Jihyun Mun, Minhwa Chung, Sunhee Kim:

Speech-Based Automatic Chronic Kidney Disease Diagnosis via Transformer Fusion of Glottal and Spectrogram Features. - Sven Franz, Tanja Grewe, Bernd T. Meyer, Jörg Bitzer:

Influence of Room Acoustics on Objective Voice Assessment Methods in the Context of Speech and Language Therapy. - Hardik Kothare, Michael Neumann, Vikram Ramanarayanan:

Multimodal Speech-Based Biomarkers Outperform the ALS Functional Rating Scale in Predicting Individual Disease Progression in ALS.
Speech Deepfakes
- Taewoo Kim, Guisik Kim, Choongsang Cho, Young Han Lee:

Naturalness-Aware Curriculum Learning with Dynamic Temperature for Speech Deepfake Detection. - Hoan My Tran, Damien Lolive, David Guennec, Aghilas Sini, Arnaud Delhay, Pierre-François Marteau:

Leveraging SSL Speech Features and Mamba for Enhanced DeepFake Detection. - Chia-Hua Wu, Wanying Ge, Xin Wang, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang:

A Comparative Study on Proactive and Passive Detection of Deepfake Speech. - Oguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep Chinchali:

PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection. - Wen Huang, Xuechen Liu, Xin Wang, Junichi Yamagishi, Yanmin Qian:

From Sharpness to Better Generalization for Speech Deepfake Detection. - David Combei, Adriana Stan, Dan Oneata, Nicolas M. Müller, Horia Cucu:

Unmasking real-world audio deepfakes: A data-centric approach. - Petr Grinberg, Ankur Kumar, Surya Koppisetti, Gaurav Bharaj:

A Data-Driven Diffusion-based Approach for Audio Deepfake Explanations. - You Zhang, Baotong Tian, Lin Zhang, Zhiyao Duan:

PartialEdit: Identifying Partial Deepfakes in the Era of Neural Speech Editing. - Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna, Feng Xia:

Rehearsal with Auxiliary-Informed Sampling for Audio Deepfake Detection.
Prosody
- Shuwen Chen, Qingke Sun, Yue Huang, Yingyi Luo:

The Prosodic Characteristics of Standard Chinese Rhetorical Questions in Naturalistic Settings. - Anindita Mondal, Rahul Biju, Anil Kumar Vuppala, Reni K. Cherian, Chiranjeevi Yarra:

ProBiEM: Acoustic and Lexical Correlates of Prosodic Prominence in English-Malayalam Bilingual Speech. - Sophia Fünfgeld, Angelika Braun, Katharina Zahner-Ritter:

Are You Being Sarcastic? Prosodic Cues to Irony Perception in German. - Zilong Wang, Xiaoxue Zhang, Xinyang Jiang, Kaitao Song, Jue Yu:

Can AI Understand Mandarin Speech Prosody? A Framework and Benchmark Showcase. - Ha Eun Shim, Olivia Yung, Paige Tuttösí, Boey Kwan, Angelica Lim, Yue Wang, H. Henny Yeung:

Generating Consistent Prosodic Patterns from Open-Source TTS Systems. - Bogdan Vlasenko, Mathew Magimai-Doss:

Multimodal Prosody Modeling: A Use Case for Multilingual Sentence Mode Prediction.
Speech Analysis and Quality Assessment
- Amir Hussein, Sameer Khurana, Gordon Wichern, François G. Germain, Jonathan Le Roux:

HASRD: Hierarchical Acoustic and Semantic Representation Disentanglement. - Ka Ki SO, Chenzi Xu, Grace Wenling Cao, Peggy Mok:

Performance of Montreal Forced Aligner on Cantonese Spontaneous Speech. - Nicholas Sanders, Yuanchao Li, Korin Richmond, Simon King:

Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information. - Yang Xiao, Tianyi Peng, Yanghao Zhou, Rohan Kumar Das:

AdaKWS: Towards Robust Keyword Spotting with Test-Time Adaptation. - Fredrik Cumlin, Xinyu Liang, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, Saikat Chatterjee:

Multivariate Probabilistic Assessment of Speech Quality. - Shafique Ahmed, Ryandhimas E. Zezario, Nasir Saleem, Amir Hussain, Hsin-Min Wang, Yu Tsao:

A Study on Speech Assessment with Visual Cues. - Mattias Nilsson, Riccardo Miccini, Julian Rossbroich, Clément Laroche, Tobias Piechowiak, Friedemann Zenke:

Efficient Streaming Speech Quality Prediction with Spiking Neural Networks. - Cheng-Hung Hu, Yusuke Yasuda, Akifumi Yoshimoto, Tomoki Toda:

Unifying Listener Scoring Scales: Comparison Learning Framework for Speech Quality Assessment and Continuous Speech Emotion Recognition.
Emotions and Foundational Models
- Hongfei Du, Sidi Lu, Gang Zhou, Ye Gao:

EAA: Emotion-Aware Audio Large Language Models with Dual Cross-Attention and Context-Aware Instruction Tuning. - Jialong Mai, Xiaofen Xing, Yangbiao Li, Xiangmin Xu:

Chain-of-Thought Distillation with Fine-Grained Acoustic Cues for Speech Emotion Recognition. - Edmilson da Silva Morais, Hagai Aronowitz, Aharon Satt, Ron Hoory, Avihu Dekel, Brian Kingsbury, George Saon:

Exploring the Limits of Conformer CTC-Encoder for Speech Emotion Recognition using Large Language Models. - Jule Valendo Halim, Siyi Wang, Hong Jia, Ting Dang:

Token-Level Logits Matter: A Closer Look at Speech Foundation Models for Ambiguous Emotion Recognition. - Ankush Raut, Projna Paromita, Sydney R. Begerowski, Suzanne T. Bell, Theodora Chaspari:

Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missions. - Yi-Wen Chao, Yizhou Peng, Dianwen Ng, Yukun Ma, Chongjia Ni, Eng Siong Chng, Eng Siong Chng:

A-SMiLE: Affective Sparse Mixture-of-Experts Adapter with Multi-Task Learning for Spoken Dialogue Models.
Prediction and Evaluation of Speech Quality and Intelligibility
- Katsuhiko Yamamoto, Koichi Miyazaki:

Non-Intrusive Binaural Speech Intelligibility Prediction Using Mamba for Hearing-Impaired Listeners. - Haoshuai Zhou, Changgeng Mo, Boxuan Cao, Linkai Li, Shan Xiang Wang:

No Audiogram: Leveraging Existing Scores for Personalized Speech Intelligibility Prediction. - Ryandhimas E. Zezario, Sabato Marco Siniscalchi, Fei Chen, Hsin-Min Wang, Yu Tsao:

Feature Importance across Domains for Improving Non-Intrusive Speech Intelligibility Prediction in Hearing Aids. - Aymen Bashir, Haolan Wang, Amin Edraki, Wai-Yip Chan, Jesper Jensen:

Intelligibility Prediction for Time-Modified Speech Signals Using Spectro-Temporal Modulation Features. - Thomas Joubaud, Julien Hauret, Véronique Zimpfer, Éric Bavu:

French Listening Tests for the Assessment of Intelligibility, Quality, and Identity of Body-Conducted Speech Enhancement. - Anna Leschanowsky, Kishor Kayyar Lakshminarayana, Anjana Rajasekhar, Lyonel Behringer, Ibrahim Kilinc, Guillaume Fuchs, Emanuël A. P. Habets:

Benchmarking Neural Speech Codec Intelligibility with SITool.
Multi-Talker ASR
- Yuhang Dai, He Wang, Xingchen Li, Zihan Zhang, Shuiyuan Wang, Lei Xie, Xin Xu, Hongxiao Guo, Shaoji Zhang, Hui Bu, Wei Chen:

AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition. - Weiqing Wang, Taejin Park, Ivan Medennikov, Jinhan Wang, Kunal Dhawan, He Huang, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg:

Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR. - Asahi Sakuma, Hiroaki Sato, Ryuga Sugano, Tadashi Kumano, Yoshihiko Kawai, Tetsuji Ogawa:

Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition. - Aswin Shanmugam Subramanian, Amit Das, Naoyuki Kanda, Jinyu Li, Xiaofei Wang, Yifan Gong:

Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios.
Speech Synthesis Paradigms and Methods 3
- Joun Yeop Lee, Sangjun Park, Byoung Jin Choi, Ji-Hyun Lee, Min-Kyung Kim, Hoon-Young Cho:

Efficient Streaming TTS Acoustic Model with Depthwise RVQ Decoding Strategies in a Mamba Framework. - Hyungchan Yoon, Chanwoo Lee, Hoodong Lee, Stanley Jungkyu Choi:

APTTS: Adversarial Post-training in Latent Flow Matching for Fast and High-fidelity Text-to-Speech. - Masato Murata, Koichi Miyazaki, Tomoki Koriyama, Tomoki Toda:

Eigenvoice Synthesis based on Model Editing for Speaker Generation. - Wanli Sun, Anton Ragni:

Score-Based Training for Energy-Based TTS Models. - Zijian Lin, Yang Zhang, Yougen Yuan, Yuming Yan, Jinjiang Liu, Zhiyong Wu, Pengfei Hu, Qun Yu:

Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding. - Masaya Kawamura, Takuya Hasumi, Yuma Shirahata, Ryuichi Yamamoto:

BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing.
Biosignal-enabled Spoken Communication
- Saurav Pahuja, Gabriel Ivucic, Siqi Cai, Dashanka De Silva, Haizhou Li, Tanja Schultz:

GTAnet: Geometry-Guided Temporal Attention for EEG-Based Sound Source Tracking in Cocktail Party Scenarios. - Zheyuan Lin, Siqi Cai, Haizhou Li:

Decoding Listener's Identity: Person Identification from EEG Signals Using a Lightweight Spiking Transformer. - Owais Mujtaba Khanday, Pablo Rodríguez San Esteban, Zubair Ahmad Lone, Marc Ouellet, Jose A. Gonzalez-Lopez:

Recreating Neural Activity During Speech Production with Language and Speech Model Embeddings. - Sparsh Rastogi, Harsh Dadwal, Khushboo Modi, Jatin Bedi, Jasmeet Singh:

Towards Sentence Level Imagined Speech Generation from EEG signals. - Jingya Huang, Aashish N. Patel, Sowmya Manojna Narasimha, Gal Mishne, Vikash Gilja:

Word-Level Error Analysis in Decoding Systems: From Speech Recognition to Brain-Computer Interfaces. - Dashanka De Silva, Siqi Cai, Saurav Pahuja, Tanja Schultz, Haizhou Li:

NeuroSpex+: Dual-Task Training of Neuro-Guided Speaker Extraction with Speech Envelope and Waveform. - Kevin Scheck, Tom Dombeck, Zhao Ren, Peter Wu, Michael Wand, Tanja Schultz:

DiffMV-ETS: Diffusion-based Multi-Voice Electromyography-to-Speech Conversion using Speaker-Independent Speech Training Targets. - Ibrahim Ibrahimov, Csaba Zainkó, Gábor Gosztolya:

Conformer-based Ultrasound-to-Speech Conversion. - Charles McGhee, Mark J. F. Gales, Kate M. Knill:

Training Articulatory Inversion Models for Interspeaker Consistency. - Jesuraj Bandekar, Prasanta Kumar Ghosh:

Enhancing Acoustic-to-Articulatory Inversion with Multi-Target Pretraining for Low-Resource Settings. - Kristin Teplansky, Emily Rangel, Mimi LaValley, Jinuk Kwon, Beiming Cao, Jun Wang:

Articulatory Vowel Distinctiveness in Spanish. - Peiran Li, Fei Chen, Xixin Wu:

EEG-based Speech Decoding Based on Multi-mode Joint Modeling. - Masakazu Inoue, Motoshige Sato, Kenichi Tomeoka, Nathania Nah, Eri Hatakeyama, Kai Arulkumaran, Ilya Horiguchi, Shuntaro Sasai:

A Silent Speech Decoding System from EEG and EMG with Heterogenous Electrode Configurations. - Neil Shah, Shirish Karande, Vineet Gandhi:

NAM-to-Speech Conversion with Multitask-Enhanced Autoregressive Models. - Long-Khanh Pham, Thanh V. T. Tran, Minh-Tan Pham, Van Nguyen:

RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling.
Speech Deepfakes, Antispoofing and Backdoor Attacks
- Yixuan Xiao, Ngoc Thang Vu:

Layer-Wise Decision Fusion for Fake Audio Detection Using XLS-R. - Rishabh Ranjan, Kishan Pipariya, Mayank Vatsa, Richa Singh:

SynHate: Detecting Hate Speech in Synthetic Deepfake Audio. - Aurosweta Mahapatra, Ismail Rasim Ulgen, Abinay Reddy Naini, Carlos Busso, Berrak Sisman:

Can Emotion Fool Anti-spoofing? - Tuan Dat Phuong, Long-Vu Hoang, Huy Dat Tran:

Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models. - Thanapat Trachu, Thanathai Lertpetchpun, Ekapol Chuangsuwanich:

Amplifying Artifacts with Speech Enhancement in Voice Anti-spoofing. - Ticho Urai, Pachara Boonsarngsuk, Ekapol Chuangsuwanich:

Thai Speech Spoofing Detection Dataset with Variations in Speaking Styles. - Yuheng Huang, Ying Ren, Wenjie Zhang, Diqun Yan:

CBA: Backdoor Attack on Deep Speech Classification via Audio Compression. - Zexin Li, Wenhan Yao, Ye Xiao, Jinsu Yang, Fen Xiao, Weiping Wen:

LRBA: Stealthy Backdoor Attacks on Speech Classification via Latent Rearrangement in VITS. - Nidheesh Gorthi, Kartik Thakral, Rishabh Ranjan, Richa Singh, Mayank Vatsa:

LitMAS: A Lightweight and Generalized Multi-Modal Anti-Spoofing Framework for Biometric Security.
Pathological Speech Analysis 5
- Franziska Braun, Christopher Witzl, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer:

Pitfalls and Limits in Automatic Dementia Assessment. - Jiawen Kang, Dongrui Han, Lingwei Meng, Jingyan Zhou, Jinchao Li, Xixin Wu, Helen Meng:

On the Within-class Variation Issue in Alzheimer's Disease Detection. - Yongqi Shao, Tao Fang:

Alzheimer's Disease Detection Using Co-Attention Mechanism for Acoustic and ASR-Transcribed Text Features. - Yin-Long Liu, Rui Feng, Jia-Xin Chen, Yi-Ming Wang, Jia-Hong Yuan, Zhen-Hua Ling:

Beyond Manual Transcripts: The Potential of Automated Speech Recognition Errors in Improving Alzheimer's Disease Detection. - Injune Hwang, Jung-Min Kim, Ju Seok Ryu, Kyogu Lee:

Voice-Based Dysphagia Detection: Leveraging Self-Supervised Speech Representation. - Kunxiao Gao, Anna Favaro, Najim Dehak, Laureano Moro-Velázquez:

ADCeleb: A Longitudinal Speech Dataset from Public Figures for Early Detection of Alzheimer's Disease. - Johnny Tam, Christine Weaver, Oliver Watts, Siddharthan Chandran, Suvankar Pal, Rowling Speech Consortium:

Anne Rowling Neurological Speech Corpus: clinically annotated longitudinal dataset for developing speech biomarkers in neurodegenerative disorders. - Selina S. Sung, Seunghee Ha, Tae-Jin Yoon, Jungmin So:

Multitask Learning with Fused Attention for Improved ASR and Mispronunciation Detection in Children's Speech Sound Disorders. - Michael Neumann, Hardik Kothare, Beverly Insel, Anzalee Khan, Danyah Nadim, Jean-Pierre Lindenmayer, Vikram Ramanarayanan:

Multimodal Speech, Language and Orofacial Analysis for Remote Assessment of Positive, Negative and Cognitive Symptoms in Schizophrenia.
ASR Assessment and Foundational Models
- Bornali Phukon, Xiuwen Zheng, Mark Hasegawa-Johnson:

Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches. - Yixuan Hou, Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang:

SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant. - Sujith Pulikodan, Sahapthan K, Prasanta Kumar Ghosh, Visruth Sanka, Nihar Desai:

An approach to measuring the performance of Automatic Speech Recognition (ASR) models in the context of Large Language Model (LLM) powered applications. - Heng-Jui Chang, Hongyu Gong, Changhan Wang, James R. Glass, Yu-An Chung:

DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models. - Shunsuke Kando, Yusuke Miyao, Shinnosuke Takamichi:

Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models. - Dirk Eike Hoffner, Simon Weihe, Thomas Brand, Bernd T. Meyer:

Hearing deficits of transformer-based ASR for anechoic and spatial signals.
Speaker Recognition
- Karen Jones, Kevin Walker, Christopher Caruso, Elliot Singer, Trang Nguyen, Robert B. Dunn, Stephanie M. Strassel:

TELVID: A Multilingual Multi-modal Corpus for Speaker Recognition. - Magdalena Golebiowska, Piotr Syga:

EmoSpeechAuth: Emotion-Aware Speaker Verification. - Craig S. Greenberg, Lukas L. Diduch, Audrey Tong, Elliot Singer, Trang Nguyen, Robert Dunn, Lisa P. Mason, Beth Matys:

The 2024 NIST Speaker Recognition Evaluation. - Wenjie Zhong, Jason Naradowsky, Yusuke Miyao:

A Simple-Yet-Effective Data Augmentation Method for Speaker Identification in Novels. - Chong-Xin Gan, Zhe Li, Zezhong Jin, Zilong Huang, Man-Wai Mak, Kong Aik Lee:

IDIR: Identifying and Distilling Informative Relations for Speaker Verification. - Sara Barahona, Anna Silnova, Ladislav Mosner, Junyi Peng, Oldrich Plchot, Johan Rohdin, Lin Zhang, Jiangyu Han, Petr Pálka, Federico Landini, Lukás Burget, Themos Stafylakis, Sandro Cumani, Dominik Bobos, Miroslav Hlavácek, Martin Kodovsky, Tomás Pavlícek:

Analysis of ABC Frontend Audio Systems for the NIST-SRE24.
Speech Analysis, Detection and Classification 2
- Nikola Ljubešić, Ivan Porupski, Peter Rupnik:

Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder Models. - Jhansi Mallela, Upendra Vishwanath Y. S., Sankara Bharadwaj Rangavajjala, Bhaskar Bhatt, Chiranjeevi Yarra:

SupraDoRAL: Automatic Word Prominence Detection Using Suprasegmental Dependencies of Representations with Acoustic and Linguistic Context. - Maxime Jacquelin, Maëva Garnier, Laurent Girin, Rémy Vincent, Olivier Perrotin:

LombardTokenizer: Disentanglement and Control of Vocal Effort in a Neural Speech Codec. - Yuke Lin, Jun Chen, Wenjie Li, Longshuai Xiao, Chao Weng:

Robust Personal Voice Activity Detection for Mitigating Domain Mismatch and False Acceptance Scenarios. - Hyung-Gun Chi, Florian Pesce, Wonil Chang, Oggi Rudovic, Arturo Argueta, Stefan Braun, Vineet Garg, Ahmed Hussen Abdelaziz:

Adaptive Knowledge Distillation for Device-Directed Speech Detection. - En-Lun Yu, Chien-Chun Wang, Jeih-Weih Hung, Shih-Chieh Huang, Berlin Chen:

Flexible VAD-PVAD Transition: A Detachable PVAD Module for Dynamic Encoder RNN VAD. - Matthew Maciejewski:

Speaker Conditioning of Voice Activity Detection via Implicit Separation. - Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang:

ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning. - Prabash Reddy Male, Swayambhu Nath Ray, Harish Arsikere, Akshat Jaiswal, Prakhar Swarup, Prantik Sen, Debmalya Chakrabarty, K. V. Vijay Girish, Nikhil Bhave, Frederick Weber, Sambuddha Bhattacharya, Sri Garimella:

DuRep: Dual-Mode Speech Representation Learning via ASR-Aware Distillation.
