Assistant Professor (RIKEN Scientist)
Institute of Science Tokyo (formerly Tokyo Tech)
School of Engineering, Dept. of Information and Communications Engineering
(faculty DB, faculty page)
E-mail: sheng.li [at] ieee.org / li.s.az [at] m.titech.ac.jp
Address: Suzukakedai campus, Nagatsuta-cho 4259, Midori-ku, Yokohama, Kanagawa 226-8501, Japan (Mail BOX: G2-2)
Self-introduction on YouTube
Short Bio:
Sheng LI received his B.S. and M.E. degrees from Nanjing University, Nanjing, China, in 2006 and 2009, respectively, and his Ph.D. from Kyoto University, Kyoto, Japan, in 2016.
From 2009 to 2012, he worked at the joint laboratory of the Chinese University of Hong Kong and Shenzhen City, researching speech technology-assisted language learning.
From 2016 to 2017, he worked as a researcher at Kyoto University, studying speech recognition systems for humanoid robots.
From 2017 to February 2025, he worked as a researcher at the National Institute of Information and Communications Technology (NICT), Kyoto, Japan, focusing on speech recognition.
In March 2025, he joined the Institute of Science Tokyo as an assistant professor, continuing his work on speech recognition.
He has served as a workshop/special session co-organizer and session chair at INTERSPEECH 2020, COLING 2022, Odyssey 2022, ACM Multimedia Asia 2023/2024, ICASSP 2024, and RO-MAN 2025.
He is a member of the Acoustical Society of Japan (ASJ) and the International Speech Communication Association (ISCA), and a senior member of the IEEE.
He is currently a member of the APSIPA Speech, Language, and Audio (SLA) Technical Committee.
He is also a member of the Applied Signal Processing Systems Technical Committee (ASPS TC) of the IEEE Signal Processing Society (SPS).
Research Interest:
-
multilingual speech recognition/translation/dialogue
(ASEAN languages, Central Asian languages, Tibetan, dialectal Chinese)
-
security-aware speech processing
-
robotic audition/social robotics
Education:
-
2006.7 B.S. in Computer Science, Nanjing University (QS≒Nagoya Univ.)
-
2009.7 M.E. in Software and Embedded Systems, Nanjing University (QS≒Nagoya Univ.)
(Joint Program with Shenzhen City, Chinese University of Hong Kong)
-
2016.3 Ph.D. in Informatics, Kyoto University
(MEXT scholarship from the Japanese Government; full admission/tuition fee exemption during Ph.D. study)
Work Experiences:
-
Jul.2009 - Apr.2012: Chinese University of Hong Kong and Shenzhen City Joint Lab [Guangdong, China]
-
Apr.2016 - Dec.2016: Kyoto University, Speech and Audio Processing Lab. (ERATO researcher)
-
Jan.2017 - Feb.2025: National Institute of Information and Communications Technology (NICT) (researcher with PI start-up fund)
-
Mar.2025 - present: Institute of Science Tokyo (Assistant Professor, non-project-based position)
Awards:
-
Science Tokyo
2025 Awards and Research Grants from the School of Engineering Common Fund
2025 Next Generation Star, IEEE IROS2025 (video at 56 seconds) (university report)
2025 IES SYPA Award, IEEE IROS2025
2025 Best Reviewer, IEEE RO-MAN2025 (university report)
2025 Nominated as an IEEE Senior Member
2025 Nominated as a member of the Applied Signal Processing Systems Technical Committee (ASPS TC) of the IEEE Signal Processing Society (SPS)
-
NICT
2024 Awarded in the SLT2024 grand challenge on LLM-based generative error correction (Task 1: speech recognition error correction using LLMs)
2023 Top 2 in one track of the ICASSP2024 ICMC-ASR (In-Car Multi-Channel Automatic Speech Recognition) Challenge
2023 1st place in one track of the ASRU2023 special session: VoiceMOS Challenge
2022 1st place on 6 of 16 metrics across the Main/OOD tracks of the INTERSPEECH2022 special session: VoiceMOS Challenge
2021 3rd/4th place in the constrained/unconstrained-resource multilingual ASR tracks of the OLR2021 challenge
2021 NICT Award (R3): Outstanding Performance Excellence Award (Group)
-
Kyoto University
2018 IEEE Signal Processing Society Japan Student Journal Paper Award
2016 Paper featured on the cover of IEEE/ACM Trans. Audio, Speech and Language Process.
2012-2016 Kyoto Univ. full admission/tuition fee exemption
2012 MEXT scholarship by Japanese Government (recommended by Kyoto Univ.)
-
Joint Lab Shenzhen city and CUHK
2012 Travel grant from IBM Research for INTERSPEECH2012 in Portland, USA
2011 Best Creative Project Award in Young Entrepreneur Program 2011, HK
2011 Excellent Staff Award
-
Nanjing University and before
2004 Encouragement Scholarship of Nanjing University
2002 Chen Yinchuan Scholarship (Hong Kong) for Excellent University New Students
2002 Chemistry and Biology Olympiad award for high school students (my wife received the Physics Olympiad award).
Teaching/Supervision:
- 2023 Supervised a Ph.D. student at Kyoto Univ. (Wangjin Zhou) who won 1st place in one track of the ASRU2023 special session: VoiceMOS Challenge
- 2023 Supervised a Ph.D. student at Kyoto Univ. (Qianying Liu) who received an IEEE-SPS grant for an oral presentation at IEEE-ICASSP2023
- 2022 Supervised a Ph.D. student at the Univ. of Tokyo (Zhuo Gong) who successfully graduated; supervised every one of his papers
- 2022 Supervised Master's students at Kyoto Univ. (Wangjin Zhou and Zhengdong Yang) who placed 1st on 6 of 16 metrics in the VoiceMOS Challenge 2022
- 2021 Supervised a Ph.D. student at Kyoto Univ. (Soky Kak) who received a best student paper nomination at O-COCOSDA2021
- 2020 Supervised a Master's student at Kyoto Univ. (Yaowei Han) who received a best student paper nomination at IEEE-ICME2020
Funding and Grants:
Multilingual Speech Recognition/Translation/Dialogue
-
Grant-in-Aid for Scientific Research (B) (Co-I): 2023-2028 (ongoing)
Creation of fundamental technologies for speech dialogue translation that accurately conveys speaker intent
-
Grant-in-Aid for Scientific Research (C) (PI): 2023-2026 (ongoing)
M3OLR: Towards Effective Multilingual, Multimodal and Multitask Oriental Low-resourced Language Speech Recognition
-
Grant-in-Aid for Research Activity Start-up (PI): 2019-2021
Next generation multilingual End-to-End speech recognition (from G30 to G200)
-
NICT tenure-track start-up funding (PI): 2020-2022
Advanced Multilingual End-to-End Speech Recognition
Speech Security
-
Grant-in-Aid for Young Scientists (PI): 2021-2023
Phantom in the Opera -- The Vulnerabilities of Speech Interface for Robotic Dialogue System
-
NII Open Collaborative Research (collaborator): 2020-2021
Speaker De-identification with Provable Privacy in Speech Data Release
Publications (Selected)
GoogleScholar | ResearchGate | DBLP | Researchmap | SemanticScholar | ORCID | Scopus | research-er | IEEE | Aminer | Publons | linkedin
Ph.D. Thesis:
-
Sheng Li (supervised by Prof. Tatsuya Kawahara).
Speech Recognition Enhanced by Lightly-supervised and Semi-supervised Acoustic Model Training.
Ph.D. Thesis, Kyoto University, Feb 2016.
Book Chapter:
-
S. Li, Bridging Eurasia: Multilingual Speech Recognition for Silkroad, ISBN: 978-4-904020-29-6, 2023.
-
S. Li, Voices of the Himalayas: Investigation of Speech Recognition Technology for the Tibetan Language, ISBN: 978-4-904020-28-9, 2022.
-
S. Li, Phantom in the Opera: The Vulnerabilities of Speech-based Artificial Intelligence Systems, ISBN: 978-4-904020-26-5, 2022.
-
X. Lu, S. Li, M. Fujimoto, Speech-to-Speech Translation, pp. 21-38, Springer Singapore, 2020.
S. Li, Chapter: From Shallow to Deep and Very Deep.
S. Li, Chapter: End-to-End and CTC models.
Invited Talks:
-
Phoneme-level articulatory animation in pronunciation training using EMA data,
2012, Speech Synthesis Lab., Tsinghua University, host: Prof. Zhiyong Wu.
-
Lightly-supervised training and confidence estimation using CRF classifiers,
2014, Speech and Cognition Lab., Tianjin University, host: Prof. Jianwu Dang and Prof. Kiyoshi Honda.
-
End-to-End Speech Recognition,
2019, University of Tokyo.
-
Towards Security-aware Speech Recognition System,
2023, NECTEC-NICT joint seminar, Thailand.
-
Self-Supervised Learning MOS Prediction with Listener Enhancement,
2023, VoiceMOS mini workshop, NII, Tokyo.
Journals (Peer reviewed):
-
Haiyan Yang, Jun Wang, S. Li, Juncheng Li, Jun Shi.
Unified Multi-prototype Network with Pretrained Swin Transformer for Visual and Audio Open Set Recognition.
Signal, Image and Video Processing, Springer Nature, (accepted for publication), 2025.
-
Haiyan Yang, Jun Wang, S. Li, Di Zhou, Xingwei Chen, Juncheng Li, Yufeng Hua, Jun Shi.
Collaborative Transformer Prototype Network with Pretrained Contrastive Language-Audio Encoder for Open Set Audio Recognition.
IEEE Trans. Signal Process. (T-SP), (accepted for publication), 2025.
-
Zhengdong Yang, Qianying Liu, S. Li, Fei Cheng, Chenhui Chu.
Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition.
IEEE Trans. Audio, Speech & Language Process. (TASLP), (accepted for publication), 2025.
-
Longfei Yang, Jiyi Li, S. Li, Takahiro Shinozaki.
Multi-Domain Dialogue State Tracking with Large Language Model Rationale and Disentangled Domain-Slot Attention.
IEEE Trans. Audio, Speech & Language Process. (TASLP), (accepted for publication), 2025.
-
Kai Wang, Lili Yin, S. Li, Madina Mansurova, Hao Huang.
Neural TTS-based Dynamic Data Augmentation for Improved Speech Separation.
IEEE Trans. Audio, Speech & Language Process. (TASLP), (accepted for publication), 2025.
-
Chin Yuen Kwok, He Xin Liu, Jia Qi Yip, S. Li, Eng Siong Chng.
A Two-Stage LoRA Strategy for Expanding Language Capabilities in Multilingual ASR Models.
IEEE Trans. Audio, Speech & Language Process. (TASLP), (accepted for publication), 2025.
-
S. Li, J. Li, C. Chu.
Voices of the Himalayas: Benchmarking Speech Recognition Systems for the Tibetan Language.
International Journal of Asian Language Processing, Vol. 34, No. 1, article 2450001, 2024.
-
S. Li, J. Li and Y. Cao,
Phantom in the Opera: Adversarial Music Attack for Robot Dialogue System.
Frontiers in Computer Science, section Human-Media Interaction, Vol. 6, 2024. (invited paper)
-
Z. Yang, S. Shimizu, C. Chu, S. Li, S. Kurohashi,
End-to-end Japanese-English Speech-to-text Translation with Spoken-to-Written Style Conversion.
Journal of Natural Language Processing, Vol.31, No. 3, 2024.
-
N. Li, L. Wang, M. Ge, M. Unoki, S. Li and J. Dang,
Robust Voice Activity Detection Using an Auditory-Inspired Masked Modulation Encoder Based Convolutional Attention Network.
Speech Communication (SPEECH COMMUN), Vol. 157, No. 103024, 2024.
-
Y. Lin, L. Wang, J. Dang, S. Li and C. Ding,
Disordered Speech Recognition Considering Low Resources and Abnormal Articulation.
Speech Communication (SPEECH COMMUN), Vol. 155, No. 103002, 2023.
-
K. Soky, S. Li, C. Chu, T. Kawahara.
Finetuning Pretrained Model with Embedding of Domain and Language Information for ASR of Very Low-Resource Settings.
International Journal of Asian Language Processing, Vol. 33, No. 4, article 2350024, 2023.
-
K. Soky, M. Mimura, C. Chu, T. Kawahara, S. Li, C. Ding, S. Sam.
TriECCC: Trilingual Corpus of the Extraordinary Chambers in the Courts of Cambodia for Speech Recognition and Translation Studies.
International Journal of Asian Language Processing, Vol. 31, No. 03&04, article 2250007, 2022. (invited paper; nominated for best student paper at O-COCOSDA-2021)
-
C. Fan, H. Zhang, J. Yi, Z. Lv, J. Tao, T. Li, G. Pei, X. Wu, S. Li.
SpecMNet: Spectrum Mend Network for Monaural Speech Enhancement.
Applied Acoustics, Vol. 194, pp. 108792, 2022.
-
S. Shimizu, C. Chu, S. Li, S. Kurohashi.
Cross-Lingual Transfer Learning for End-to-End Speech Translation.
Journal of Natural Language Processing (JNLP), Vol.29, No.2, 2022.
-
S. Qin, L. Wang, S. Li (corresponding), J. Dang and L. Pan.
Improving Low-resource Tibetan End-to-end ASR by Multilingual and Multi-level Unit Modeling.
EURASIP Journal on Audio, Speech and Music Processing. (EURASIP JASMP), No.2, 2022.
-
X. Chen, H. Huang, and S. Li,
Adversarial Attack and Defense on Deep Neural Network-based Voice Processing Systems: An Overview.
Applied Sciences, Special Issue on Machine Speech Communication, Vol. 11, No. 18, article 8450, 2021. (invited survey paper)
-
P. Shen, X. Lu, S. Li (patent co-inventor), H. Kawai.
Knowledge Distillation-based Representation Learning for Short-Utterance Spoken Language Identification.
IEEE Trans. Audio, Speech & Language Process. (TASLP), Vol. 28, pp. 2674--2683, 2020.
-
S. Li, Y. Akita, and T. Kawahara.
Semi-supervised acoustic model training by discriminative data selection from multiple ASR systems' hypotheses.
IEEE Trans. Audio, Speech & Language Process. (TASLP), Vol. 24, No. 9, pp. 1520--1530, 2016.
(Cover Paper, IEEE Signal Processing Society Japan Student Journal Paper Award)
-
S. Li, Y. Akita, and T. Kawahara.
Automatic lecture transcription based on discriminative data selection for lightly supervised acoustic model training.
IEICE Trans., Vol. E98-D, No. 8, pp. 1545--1552, 2015.
-
L. Wang, H. Chen, S. Li, and H. Meng.
Phoneme-level articulatory animation in pronunciation training,
Speech Communication (SPEECH COMMUN), Vol. 54, Issue 7, pp. 845--856, 2012.
International Conferences (Peer reviewed):
2025@Science Tokyo
-
Chengxi Lei, S. Li, Satwinder Singh, Feng Hou, Huia Jahnke, Ruili Wang,
Empowering Māori Automatic Speech Recognition through EMD-Based Augmentation,
in Proc. Pacific Rim International Conference on Artificial Intelligence (PRICAI), pp. (accepted for presentation), 2025.
-
Jianing Yang, S. Li, Takahiro Shinozaki, Yuki Saito, Hiroshi Saruwatari,
Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement,
in Proc. APSIPA ASC, pp. (accepted for presentation), 2025.
-
Pengcheng Wang, S. Li, Takahiro Shinozaki,
RAG-Boost: Retrieval-Augmented Generation Enhanced LLM-based Speech Recognition,
in Proc. Interspeech2025 MLC-SLM Challenge workshop, pp. (accepted for presentation), 2025.
-
Zhengdong Yang, Zhen Wan, S. Li, Chao-Han Huck Yang, Chenhui Chu,
CoVoGER: A Multilingual Multitask Benchmark for Speech-to-text Generative Error Correction with Large Language Models,
in Proc. EMNLP (main long), pp. (accepted for presentation), 2025.
-
Jing Li, S. Li, Emilia I. Barakova, Felix Schijve, Jun Hu,
Designing an LLM-powered Social Robot for Supporting Emotion Regulation in Parent-Child Dyads,
in Proc. RO-MAN (late breaking), pp. (accepted for presentation), 2025.
-
Jing Li, Felix Schijve, S. Li, Yuye Yang, Jun Hu, Emilia Barakova,
Towards Emotion Co-regulation with LLM-powered Socially Assistive Robots: Integrating LLM Prompts and Robotic Behaviors to Support Parent-Neurodivergent Child Dyads,
in Proc. IROS, pp. (accepted for presentation), 2025.
-
Haowei Lou, Hye young Paik, Pari Delir Haghighi, S. Li, Wen Hu, Lina Yao,
LatentSpeech: Latent Diffusion for Text-To-Speech Generation,
in Proc. RO-MAN, pp. (accepted for presentation), 2025.
-
Wangjin Zhou, Tianjiao Du, Chenglin Xu, S. Li, Yi Zhao, Tatsuya Kawahara,
Simple and Effective Content Encoder for Singing Voice Conversion via Dimension Reduction,
in Proc. INTERSPEECH, pp. (accepted for presentation), 2025.
-
Hongli Yang, Yizhou Peng, Hao Huang, S. Li,
Adapting Whisper for Parameter-efficient Code-Switching Speech Recognition via Soft Prompt Tuning,
in Proc. INTERSPEECH, pp. (accepted for presentation), 2025.
-
Hongli Yang, S. Li, Hao Huang, Ayiduosi Tuohan, Yizhou Peng,
Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR,
in Proc. INTERSPEECH, pp. (accepted for presentation), 2025.
-
Zhengdong Yang, S. Li, Chenhui Chu,
Generative Error Correction for Emotion-aware Speech-to-text Translation,
in Proc. ACL (findings), pp. (accepted for presentation), 2025.
-
Zhen Wan, Chao-Han Huck Yang, Yahan Yu, Jinchuan Tian, S. Li, Ke Hu, Zhehuai Chen, Shinji Watanabe, Fei Cheng, Chenhui Chu, Sadao Kurohashi,
SIQ: Exterminating Speech Intelligence Quotient Cross Cognitive Levels in Voice Understanding Large Language Models,
in Proc. ACL (main long), pp. (accepted for presentation), 2025.
-
Yu Xu, Xiaokai Qin, Tianyu Fan, Eng Siong Chng, S. Li, Nobuaki Minematsu, Daisuke Saito,
Bandwidth Extension System for Throat Microphone Speech Reconstruction,
in Proc. IEEE-ICME, pp. (accepted for presentation), 2025.
-
Z. Ren, R. Rammohan, K. Scheck, S. Li, T. Schultz,
End-to-end Acoustic-linguistic Emotion and Intent Recognition Enhanced by Semi-supervised Learning,
in Proc. International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. (accepted for presentation), 2025.
-
C. Kwok, S. Li, J. Yip, C. Chu, T. Kawahara, E. Chng,
Extending Whisper for Emotion Prediction Using Word-level Pseudo Labels,
in Proc. IEEE-ICASSP, pp. (accepted for presentation), 2025.
-
J. Wang, S. Li, L. Lu, S. Kao, J. Jang,
Similarity-based accent recognition with continuous and discrete self-supervised speech representations,
in Proc. IEEE-ICASSP, pp. (accepted for presentation), 2025.
-
J. Hu, Z. Li, M. Shen, H. Ai, S. Li, J. Zhang,
Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding,
in Proc. IEEE-ICASSP, pp. (accepted for presentation), 2025.
2024@NICT
-
J. Chen, C. Chu, S. Li, T. Kawahara,
Data Selection using Spoken Language Identification for Low-Resource and Zero-Resource Speech Recognition,
in Proc. APSIPA ASC, pp. 1--6, 2024.
-
S. Li, Y. Ko, A. Ito,
LLM as decoder: Investigating Lattice-based Speech Recognition Hypotheses Rescoring Using LLM,
in Proc. APSIPA ASC, pp. 1--5, 2024.
-
C. Kwok, S. Li, J. Yip, E. Chng,
Low-resource Language Adaptation with Ensemble of PEFT Approaches,
in Proc. APSIPA ASC, pp. 1--6, 2024.
-
C. Tan, S. Li, Y. Cao, Z. Ren, T. Schultz,
Investigating Effective Speaker Property Privacy Protection in Federated Learning for Speech Emotion Recognition,
in Proc. ACM Multimedia Asia, pp. 1--7, 2024.
-
S. Li, C. Chen, C. Kwok, C. Chu, E. Chng, H. Kawai,
Investigating ASR Error Correction with Large Language Model and Multilingual 1-best Hypotheses.
in Proc. INTERSPEECH, pp. 1315--1319, 2024.
-
S. Li, J. Li, Y. Cao,
Automatic Post-Editing of Speech Recognition System Output Using Large Language Models.
in Proc. International Conference on Database Systems for Advanced Applications (DASFAA) Workshop, pp. 178--186, 2024.
-
Y. Wu, Y. Nakashima, N. Garcia, S. Li, Z. Zeng,
Reproducibility Companion Paper: Stable Diffusion for Content-Style Disentanglement in Art Analysis.
in Proc. ACM ICMR (reproducibility paper), pp. 1228--1231, 2024.
-
L. Zheng, Y. Cao, R. Jiang, K. Taura, Y. Shen, S. Li, M. Yoshikawa,
Enhancing Privacy of Spatiotemporal Federated Learning against Gradient Inversion Attacks.
in Proc. International Conference on Database Systems for Advanced Applications (DASFAA), pp. 457--473, 2024.
-
Y. Zhao, C. Qiang, H. Li, Y. Hu, W. Zhou, S. Li,
Enhancing Realism in 3D Facial Animation Using Conformer-Based Generation and Automated Post-Processing.
in Proc. IEEE-ICASSP, pp. 8341--8345, 2024.
-
W. Zhou, Z. Yang, C. Chu, S. Li, R. Dabre, Y. Zhao, T. Kawahara,
MOS-FAD: Improving fake audio detection via automatic mean opinion score prediction,
in Proc. IEEE-ICASSP, pp. 876--880, 2024.
2023@NICT
-
W. Zhou, Z. Yang, S. Li, C. Chu,
KyotoMOS: An Automatic MOS Scoring System for Speech Synthesis.
in Proc. ACM Multimedia Asia Workshop, pp. 7:1-7:3, 2023.
-
X. Chen, S. Li, J. Li, H. Huang, Y. Cao, L. He,
Reprogramming Self-supervised Learning-based Speech Representations for Speaker Anonymization.
in Proc. ACM Multimedia Asia, pp. 93:1-93:5, 2023.
-
X. Chen, S. Li, J. Li, Y. Cao, H. Huang, L. He,
GhostVec: A New Threat to Speaker Privacy of End-to-End Speech Recognition System.
in Proc. ACM Multimedia Asia, pp. 94:1-94:5, 2023.
-
W. Wei, Z. Yang, G. Yuan, J. Li, C. Chu, S. Okada, S. Li (corresponding),
FedCPC: an Effective Federated Contrastive Learning Method for Privacy Preserving Early-Stage Alzheimer’s Speech Detection.
in Proc. IEEE Workshop on Automatic Speech Recognition & Understanding (IEEE-ASRU), pp. 1--6, 2023.
-
Z. Qi, X. Hu, W. Zhou, S. Li, H. Wu, J. Lu, X. Xu,
LE-SSL-MOS: Self-Supervised Learning MOS Prediction with Listener Enhancement.
in Proc. IEEE Workshop on Automatic Speech Recognition & Understanding (IEEE-ASRU), pp. 1--6, 2023.
-
S. Li, J. Li,
Correction while Recognition: Combining Pretrained Language Model for Taiwan-accented Speech Recognition.
in Proc. International Conference on Artificial Neural Networks (ICANN), pp. 389--400, 2023.
-
Z. Yang, S. Shimizu, W. Zhou, S. Li, C. Chu,
The Kyoto Speech-to-Speech Translation System for IWSLT 2023.
in Proc. International Conference on Spoken Language Translation (IWSLT), pp. 357--362, 2023.
-
L. Yang, J. Li, S. Li, T. Shinozaki,
Dialogue State Tracking with Sparse Local Slot Attention.
in Proc. ACL, Workshop on NLP for Conversational AI, pp. 39--46, 2023.
-
L. Yang, J. Li, S. Li, T. Shinozaki,
Multi-Domain Dialogue State Tracking with Disentangled Domain-Slot Attention.
in Proc. ACL, (findings), pp. 4928--4938, 2023.
-
S. Shimizu, C. Chu, S. Li, S. Kurohashi,
Towards Speech Dialogue Translation Mediating Speakers of Different Languages.
in Proc. ACL, (findings), pp. 1122--1134, 2023.
-
C. Tan, Y. Cao, S. Li, M. Yoshikawa,
General or Specific? Investigating Effective Privacy Protection in Federated Learning for Speech Emotion Recognition.
in Proc. IEEE-ICASSP, pp. 1-5, 2023.
-
K. Wang, Y. Yang, H. Huang, Y. Hu, S. Li,
SpeakerAugment: Data Augmentation for Generalizable Source Separation via Speaker Parameter Manipulation.
in Proc. IEEE-ICASSP, pp. 1-5, 2023.
-
Y. Yang, H. Xu, H. Huang, E.S. Chng, S. Li,
Speech-Text Based Multi-Modal Training with Bidirectional Attention for Improved Speech Recognition.
in Proc. IEEE-ICASSP, pp. 1-5, 2023.
-
K. Soky, S. Li, C. Chu, T. Kawahara,
Domain and Language Adaptation Using Heterogeneous Datasets for Wav2vec2.0-Based Speech Recognition of Low-Resource Language.
in Proc. IEEE-ICASSP, pp. 1-5, 2023.
-
Q. Liu, Z. Gong, Z. Yang, Y. Yang, S. Li, C. Ding, N. Minematsu, H. Huang, F. Cheng, C. Chu, S. Kurohashi,
Hierarchical Softmax for End-To-End Low-Resource Multilingual Speech Recognition.
in Proc. IEEE-ICASSP, pp. 1-5, 2023. (Travel Granted by IEEE-SPS)
2022@NICT
-
L. Yang, J. Li, S. Li and T. Shinozaki,
Multi-Domain Dialogue State Tracking with Top-k Slot Self Attention.
in Proc. SIGdial Meeting on Discourse & Dialogue, pp. 231--236, 2022.
-
K. Soky, S. Li, M. Mimura, C. Chu and T. Kawahara,
Leveraging Simultaneous Translation for Enhancing Transcription of Low-resource Language via Cross Attention Mechanism.
in Proc. INTERSPEECH, pp. 1362--1366, 2022.
-
L. Yang, W. Wei, S. Li, J. Li and T. Shinozaki,
Augmented Adversarial Self-Supervised Learning for Early-Stage Alzheimer's Speech Detection.
in Proc. INTERSPEECH, pp. 541--545, 2022.
-
K. Li, S. Li, X. Lu, M. Akagi, M. Liu, L. Zhang, C. Zeng, L. Wang, J. Dang and M. Unoki,
Data Augmentation Using McAdams-Coefficient-Based Speaker Anonymization for Fake Audio Detection.
in Proc. INTERSPEECH, pp. 664--668, 2022.
-
Z. Yang, W. Zhou, C. Chu, S. Li, R. Dabre, R. Rubino and Y. Zhao,
Fusion of Self-supervised Learned Models for MOS Prediction.
in Proc. INTERSPEECH, pp. 5443--5447, 2022.
-
S. Qin, L. Wang, S. Li, Y. Lin and J. Dang,
Finer-grained Modeling units-based Meta-Learning for Low-resource Tibetan Speech Recognition.
in Proc. INTERSPEECH, pp. 2133--2137, 2022.
-
H. Shi, L. Wang, S. Li, J. Dang and T. Kawahara,
Monaural Speech Enhancement Based on Spectrogram Decomposition for Convolutional Neural Network-sensitive Feature Extraction.
in Proc. INTERSPEECH, pp. 221--225, 2022.
-
N. Li, M. Ge, L. Wang, M. Unoki, S. Li and J. Dang,
Global Signal-to-noise Ratio Estimation Based on Multi-subband Processing Using Convolutional Neural Network.
in Proc. INTERSPEECH, pp. 361--365, 2022.
-
Z. Gong, D. Saito, L. Yang, T. Shinozaki, S. Li, H. Kawai and N. Minematsu,
Self-Adaptive Multilingual ASR Rescoring with Language Identification and Unified Language Model.
In Proc. ISCA-Odyssey (The Speaker and Language Recognition Workshop), pp. 415--420, 2022.
-
S. Li, J. Li, Q. Liu, Z. Gong,
Adversarial Speech Generation and Natural Speech Recovery for Speech Content Protection.
in Proc. LREC (Language Resources and Evaluation Conference), pp. 7291--7297, 2022.
-
Y. Lv, L. Wang, M. Ge, S. Li (corresponding), C. Ding, L. Pan, Y. Wang, J. Dang, K. Honda,
Compressing Transformer-based ASR Model by Task-driven Loss and Attention-based Multi-level Feature Distillation.
in Proc. IEEE-ICASSP, pp. 7992--7996, 2022.
-
K. Wang, Y. Peng, H. Huang, Y. Hu, and S. Li,
Mining Hard Samples Locally and Globally for Improved Speech Separation.
in Proc. IEEE-ICASSP, pp. 6037--6041, 2022.
2021@NICT
-
K. Soky, M. Mimura, T. Kawahara, S. Li, C. Ding, C. Chu, and S. Sam.
Khmer Speech Translation Corpus of the Extraordinary Chambers in the Courts of Cambodia (ECCC).
In Proc. O-COCOSDA, pp. 122--127, 2021. (Best student paper nomination; invited as a fast-track journal paper)
-
D. Wang, S. Ye, X. Hu, S. Li, and X. Xu,
An End-to-End Dialect Identification System with Transfer Learning from a Multilingual Automatic Speech Recognition Model.
in Proc. INTERSPEECH, pp. 3266--3270, 2021.
-
K. Wang, H. Huang, Y. Hu, Z. Huang, and S. Li,
End-to-End Speech Separation Using Orthogonal Representation in Complex and Real Time-Frequency Domain.
in Proc. INTERSPEECH, pp. 3046--3050, 2021.
-
N. Li, L. Wang, M. Unoki, S. Li, R. Wang, M. Ge, and J. Dang,
Robust voice activity detection using a masked auditory encoder based convolutional neural network.
in Proc. IEEE-ICASSP, pp. 6828--6832, 2021.
-
S. Chen, X. Hu, S. Li, and X. Xu,
An investigation of using hybrid modeling units for improving End-to-End speech recognition systems.
in Proc. IEEE-ICASSP, pp. 6743--6747, 2021.
-
H. Huang, K. Wang, Y. Hu, and S. Li,
Encoder-Decoder based pitch tracking and joint model training for Mandarin tone classification.
in Proc. IEEE-ICASSP, pp. 6943--6947, 2021.
2020@NICT
-
Y. Lin, L. Wang, S. Li, J. Dang, and C. Ding.
Staged Knowledge Distillation for End-to-End Dysarthric Speech Recognition and Speech Attribute Transcription.
In Proc. INTERSPEECH, pp. 4791--4795, 2020 (Travel Granted by ISCA).
-
H. Shi, L. Wang, S. Li, C. Ding, M. Ge, N. Li, J. Dang, and H. Seki.
Singing Voice Extraction with Attention based Spectrograms Fusion.
In Proc. INTERSPEECH, pp. 2412--2416, 2020 (Travel Granted by ISCA).
-
S. Li, X. Lu, R. Dabre, P. Shen, and H. Kawai.
Joint Training End-to-End Speech Recognition Systems with Speaker Attributes.
In Proc. ISCA-Odyssey (The Speaker and Language Recognition Workshop), pp. 385--390, 2020.
-
P. Shen, X. Lu, K. Sugiura, S. Li, and H. Kawai.
Compensation on x-vector for short utterance spoken language identification.
In Proc. ISCA-Odyssey (The Speaker and Language Recognition Workshop), pp. 47--52, 2020.
-
Y. Han, S. Li, Y. Cao, Q. Ma, and M. Yoshikawa.
Voice-Indistinguishability: Protecting Voiceprint in Privacy Preserving Speech Data Release.
In Proc. IEEE-ICME, pp. 1--6, 2020. (Best student paper nomination; selected as a fast-track journal paper for IEEE Trans. Multimedia (TMM))
-
Y. Lin, L. Wang, J. Dang, S. Li, and C. Ding.
End-To-End Articulatory Modeling for Dysarthria Articulatory Attribute Detection.
In Proc. IEEE-ICASSP, pp. 7349--7353, 2020.
-
H. Shi, L. Wang, M. Ge, S. Li, and J. Dang.
Spectrograms Fusion with Minimum Difference Masks Estimation for Monaural Speech Dereverberation.
In Proc. IEEE-ICASSP, pp. 7544--7548, 2020.
2019@NICT
-
X. Lu, P. Shen, S. Li, Y. Tsao, and H. Kawai.
Class-wise Centroid Distance Metric Learning for Acoustic Event Detection.
In Proc. INTERSPEECH, pp. 3614--3618, 2019.
-
S. Li, X. Lu, C. Ding, P. Shen, T. Kawahara, and H. Kawai.
Investigating Radical-based End-to-End Speech Recognition Systems for Chinese Dialects and Japanese.
In Proc. INTERSPEECH, pp. 2200--2204, 2019.
-
S. Li, C. Ding, X. Lu, P. Shen, T. Kawahara, and H. Kawai.
End-to-End Articulatory Attribute Modeling for Low-resource Multilingual Speech Recognition.
In Proc. INTERSPEECH, pp. 2145--2149, 2019.
-
S. Li, R. Dabre, X. Lu, P. Shen, T. Kawahara, and H. Kawai.
Improving Transformer-based Speech Recognition Systems with Compressed Structure and Speech Attributes Augmentation.
In Proc. INTERSPEECH, pp. 4400--4404, 2019.
-
P. Shen, X. Lu, S. Li, and H. Kawai.
Interactive learning of teacher-student model for short utterance spoken language identification.
In Proc. IEEE-ICASSP, pp. 5981--5985, 2019.
-
R. Takashima, S. Li, and H. Kawai.
Investigation of Sequence-level Knowledge Distillation Methods for CTC Acoustic Models.
In Proc. IEEE-ICASSP, pp. 6156--6160, 2019.
2018@NICT
-
S. Li, X. Lu, R. Takashima, P. Shen, T. Kawahara, and H. Kawai.
Improving very deep time-delay neural network with vertical-attention for effectively training CTC-based ASR systems.
In Proc. IEEE Spoken Language Technology Workshop (IEEE-SLT), pp. 77--83, 2018.
-
S. Li, X. Lu, R. Takashima, P. Shen, T. Kawahara, and H. Kawai.
Improving CTC-based Acoustic Model with Very Deep Residual Time-delay Neural Networks.
In Proc. INTERSPEECH, pp. 3708--3712, 2018.
-
P. Shen, X. Lu, S. Li, and H. Kawai.
Feature Representation of Short Utterances based on Knowledge Distillation for Spoken Language Identification.
In Proc. INTERSPEECH, pp. 1813--1817, 2018.
-
X. Lu, P. Shen, S. Li, Y. Tsao, and H. Kawai.
Temporal Attentive Pooling for Acoustic Event Detection.
In Proc. INTERSPEECH, pp. 1354--1357, 2018.
-
R. Takashima, S. Li, and H. Kawai.
An Investigation of a Knowledge Distillation Method for CTC Acoustic Models.
In Proc. IEEE-ICASSP, pp. 5809--5813, 2018.
-
R. Takashima, S. Li, and H. Kawai.
CTC Loss Function with a Unit-level Ambiguity Penalty.
In Proc. IEEE-ICASSP, pp. 5909--5913, 2018.
2017@NICT
-
S. Li, X. Lu, P. Shen, R. Takashima, T. Kawahara, and H. Kawai.
Incremental training and constructing the very deep convolutional residual network acoustic models.
In Proc. IEEE Workshop on Automatic Speech Recognition & Understanding (IEEE-ASRU), pp. 222--227, 2017.
-
P. Shen, X. Lu, S. Li, and H. Kawai.
Conditional Generative Adversarial Nets Classifier for Spoken Language Identification.
In Proc. INTERSPEECH, pp. 2814--2818, 2017.
before 2017 (Kyoto Univ.)
-
S. Li, X. Lu, S. Sakai, M. Mimura, and T. Kawahara.
Semi-supervised ensemble DNN acoustic model training.
In Proc. IEEE-ICASSP, pp. 5270--5274, 2017.
-
S. Li, Y. Akita, and T. Kawahara.
Data selection from multiple ASR systems' hypotheses for unsupervised acoustic model training.
In Proc. IEEE-ICASSP, pp. 5875--5879, 2016.
-
S. Li, Y. Akita, and T. Kawahara.
Discriminative data selection for lightly supervised training of acoustic model using closed caption texts.
In Proc. INTERSPEECH, pp. 3526--3530, 2015. (oral)
-
S. Li, X. Lu, Y. Akita, and T. Kawahara.
Ensemble speaker modeling using speaker adaptive training deep neural network for speaker adaptation.
In Proc. INTERSPEECH, pp. 2892--2896, 2015.
before 2013 (Joint Lab. CAS & CUHK)
-
S. Li and L. Wang.
Cross Linguistic Comparison of Mandarin and English EMA Articulatory Data,
In Proc. INTERSPEECH, pp. 903--906, 2012. (Travel granted by IBM research)
Challenges and Demos (selected):
-
ICASSP2024 ICMC-ASR (In-Car Multi-Channel Automatic Speech Recognition) Challenge. (top 2 in one track)
-
The System Description for VoiceMOS Challenge 2023. (1st place in one track)
-
S. Li, R. Dabre, R. Raphael, W. Zhou, Z. Yang, C. Chu, Y. Zhao.
The System Description for VoiceMOS Challenge 2022 (KK team, main/OOD tasks). (1st place on 6 metrics)
-
D. Wang, S. Ye, X. Hu, S. Li.
The RoyalFlush-NICT System Description for AP21-OLR Challenge (Silk-road team, full tasks).
In OLR2021 (Oriental Language Recognition Challenge), 2021. (top 3)
-
Y. Han, Y. Cao, S. Li, Q. Ma, and M. Yoshikawa.
Voice-Indistinguishability: Protecting Voiceprint with Differential Privacy under an Untrusted Server.
ACM conference on Computer and Communications Security (CCS), demo, pp. 2125--2127, 2020.
-
H. Zhang, S. Li, X. Ma, Y. Zhao, Y. Cao, and T. Kawahara,
Phantom in the Opera: Effective Adversarial Music Attack on Keyword Spotting Systems.
in Proc. IEEE-SLT, 2021 (demo session, introduction).
Patents:
-
Inventors: Sheng Li, Xugang Lu, Ryoichi Takashima, Peng Shen, Hisashi Kawai. Applicant/Patentee: National Institute of Information and Communications Technology.
Learning Method. Patent Application No. 2017-236626; Publication No. 2019-105899; Patent No. 6979203.
Filed Dec. 11, 2017; published Jun. 21, 2019; granted Nov. 17, 2021.
-
Inventors: Ryoichi Takashima, Sheng Li, Hisashi Kawai. Applicant/Patentee: National Institute of Information and Communications Technology.
Learning System, Method, and Neural Network Model for Time-Series Information. Patent Application No. 2018-044134; Publication No. 2019-159654; Patent No. 7070894.
Filed Mar. 12, 2018; published Mar. 12, 2018; granted May 10, 2022.
-
Inventors: Sheng Li, Xugang Lu, Ryoichi Takashima, Peng Shen, Hisashi Kawai. Applicant/Patentee: National Institute of Information and Communications Technology.
Speech Recognition System, Speech Recognition Method, and Trained Model. Patent Application No. 2018-044491; Publication No. 2019-159058; Patent No. 7070894.
Filed Mar. 12, 2018; published Mar. 12, 2018; granted Jul. 22, 2022.
-
Inventors: Sheng Li, Xugang Lu, Ryoichi Takashima, Peng Shen, Hisashi Kawai. Applicant/Patentee: National Institute of Information and Communications Technology.
Discriminator, Trained Model, and Learning Method. Patent Application No. 2018-142418; Publication No. 2020-020872; Patent No. 7209330.
Filed Jul. 30, 2018; published Jul. 30, 2018; granted Jan. 12, 2023.
-
Inventors: Peng Shen, Xugang Lu, Sheng Li, Hisashi Kawai. Applicant/Patentee: National Institute of Information and Communications Technology.
Method and Apparatus for Training a Language Identification Model, and Computer Program Therefor. Patent Application No. 2019-086005; Publication No. 2020-038343; Patent No. 7282363.
Filed Apr. 26, 2019; published Apr. 26, 2019; granted May 19, 2023.
-
Inventors: Sheng Li, Xugang Lu, Chenchen Ding, Tatsuya Kawahara, Hisashi Kawai. Applicant/Patentee: National Institute of Information and Communications Technology.
Inference Device, Inference Program, and Learning Method. Patent Application No. 2019-163555; Publication No. 2021-043272; Patent No. 7385900.
Filed Sep. 9, 2019; published Mar. 18, 2021; granted Nov. 15, 2023.
-
Inventors: Sheng Li, Xugang Lu, Hisashi Kawai. Applicant/Patentee: National Institute of Information and Communications Technology.
Inference Device and Method for Training an Inference Device. Patent Application No. 2020-059962; Publication No. 2021-157145; Patent No. 7423056.
Filed Mar. 30, 2020; published Oct. 7, 2021; granted Jan. 19, 2024.
Technical Reports:
Software and Recipes:
Data Releases:
Academic Services:
Academic Membership:
IEEE senior member
IEEE-SPS (Signal Processing Society), IEEE-SPS Applied Signal Processing Systems Technical Committee (ASPS TC) (till 2026)
IEEE-RAS (Robotics and Automation Society)
ISCA (International Speech Communication Association),
ASJ (Acoustical Society of Japan),
SIG-CSLP (Chinese Spoken Language Processing),
APSIPA (Asia Pacific Signal and Information Processing Association), APSIPA Speech, Language, and Audio (SLA) Technical Committee (till 2026)
ACM (Association for Computing Machinery)
APNNS (Asia Pacific Neural Network Society)
Chairing and organizing:
[1] Session Chair of INTERSPEECH2020 session: Topics of ASR I
[2] Co-organizer of the INTERSPEECH2020 workshop: Spoken Language Interaction for Mobile Transportation System (SLIMTS)
[3] Session Chair of Speaker Odyssey2022 session: Evaluation and Benchmarking (EB)
[4] Co-organizer of the COLING2022 workshop: When Creative AI Meets Conversational AI (CAI + CAI = CAI^2)
[5] Co-organizer of the ACM Multimedia Asia 2023 workshop: M3Oriental (https://sites.google.com/view/m3oriental)
[6] Area Chair of APSIPA 2023
[7] Area Chair of EMNLP 2023
[8] Session Chair of ICANN 2023
[9] Session Chair of ICASSP 2024
[10] Publicity Chair of ACM Multimedia Asia 2024
[11] Session Chair of DASFAA 2024
[12] Co-organizer of a special session at RO-MAN2025
Reviewer/Program committee:
Journal:
[1] IEEE/ACM Trans. Audio, Speech & Language Process.
[2] Computer Speech and Language
[3] Speech Communication
[4] IEICE transactions, letters
[5] APSIPA transactions
[6] Applied Acoustics
[7] Transactions on Asian and Low-Resource Language Information Processing (TALLIP)
[8] Digital Signal Processing
[9] Behaviour & Information Technology
[10] EURASIP Journal on Audio, Speech, and Music Processing
Conferences:
[1] ICASSP-2021/2022/2023/2024, INTERSPEECH-2015/2018/2019/2020/2021/2022/2023/2024, SLT-2022/2024, ASRU-2023
[2] APSIPA-2019/2020/2021/2022/2023, IJCNN-2023/2024, ICONIP2023
[3] BC_VCC-2020 (Blizzard Challenge and Voice Conversion Challenge 2020)
[4] ACL-2017/2018/2020, EACL-2020/2022, NAACL-HLT-2016/2018/2019/2021
[5] IJCNLP-2017, EMNLP-IJCNLP-2019, EMNLP-2020/2021/2022, AACL-IJCNLP-2020/2022/2023, COLING-2018/2022, SIGDIAL-2024
[6] NLP-2022/2023/2024, IALP-2023/2024
[7] AAAI-2019, ICLR-2021/2024, NeurIPS-2022/2023, ICML-2023/2024
[8] IROS-2019, Ubiquitous Robots (UR)-2020, IEEE-ROMAN 2023
[9] ICME-2020/2021/2022/2023(main+workshop)/2024, ACM Multimedia 2021/2022/2023, ACM Multimedia Asia 2023, MMM 2023
[10] PAKDD-2023, DASFAA-2024, ACM ICMR 2024
Last update: 2025-09-01