
Deep-Learning-Driven Speech Recognition: An Analysis of End-to-End Models

1.93 MB | Updated 2024-07-16
"The ASR open-course materials focus on deep-learning-based speech recognition, covering everything from the fundamentals to the latest advances, including the application of end-to-end models."

In 公开课材料_ASR.pdf, Yang Xuerui (杨学锐), a senior algorithm researcher at CloudWalk Technology (云从科技), shares an in-depth look at speech recognition, focusing on recent deep-learning progress. The course is organized into four parts: an overview of speech recognition; recognition approaches and common frameworks; CloudWalk's own speech recognition model; and a Q&A session.

Speech recognition is defined here as Large Vocabulary Continuous Speech Recognition (LVCSR): converting speech or audio files into a sequence of text, operating on continuous speech rather than isolated words. The field has advanced rapidly in recent years, with error rates falling from 16% to 4%, which the materials describe as even surpassing human recognition ability. This progress is driven by deep learning, from the DNN-HMM models introduced around 2010 to the end-to-end models that followed after 2012.

Despite these gains, significant challenges remain. Large vocabularies (250,000+ words in English, 400,000+ in Chinese) demand broad lexical coverage from the model. Speaker accents and dialects, environmental noise (car horns, hall reverberation, street noise), and differences between microphones (handheld vs. headset, near-field vs. far-field) all degrade recognition accuracy.

Building a recognition system involves model construction, data collection, supervised training, and decoding (search). At the core is a statistical model mapping speech to text, and data quality is critical to training it. For Chinese speech recognition, initials and finals (声韵母) are typically used as the subword units.

In the feature-extraction stage, the audio first passes through preprocessing steps such as A/D conversion, pre-emphasis, and windowing; features are then extracted with a mel filterbank and MFCCs (mel-frequency cepstral coefficients), and dynamic features (the first- and second-order differences of the MFCCs) add further information.

For acoustic modeling, the traditional approach is GMM-HMM, with the hidden Markov model capturing the temporal structure of the speech signal. With the rise of deep learning, RNNs and end-to-end architectures such as LSTMs and Transformers have become mainstream; they predict text directly from the audio sequence, simplifying the traditional split between acoustic and language models.

Overall, the materials offer an accessible introduction to speech recognition technology, covering its history, challenges, solutions, and the role of deep learning, and are valuable for understanding and researching the state of the art in the field.
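The "error rates falling from 16% to 4%" figure refers to word error rate (WER). As an illustrative sketch (not part of the course materials), WER is the word-level edit distance between a reference transcript and the recognizer's hypothesis, normalized by the reference length:

```python
def wer(ref_words, hyp_words):
    """Word error rate: (substitutions + deletions + insertions) / len(ref),
    computed with the classic Levenshtein dynamic program over words."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i          # i deletions to reach an empty hypothesis
    for j in range(len(hyp_words) + 1):
        d[0][j] = j          # j insertions from an empty reference
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match / substitution
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[-1][-1] / len(ref_words)

print(wer("the cat sat".split(), "the cat sad".split()))  # 1 substitution / 3 words
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why a drop from 16% to 4% represents a fourfold reduction in errors, not a 12-point change on a bounded scale.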
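The initial/final (声韵母) subword units for Chinese can be illustrated with a toy splitter. This is a hypothetical sketch: the initials list below is the standard pinyin inventory minus digraph order quirks, and is not taken from CloudWalk's actual unit set.

```python
# Longer initials come first so "zh" matches before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(pinyin: str):
    """Split a toneless pinyin syllable into (initial, final)."""
    for ini in INITIALS:
        if pinyin.startswith(ini):
            return ini, pinyin[len(ini):]
    return "", pinyin  # zero-initial syllables such as "an" or "er"

print(split_syllable("zhong"))  # ('zh', 'ong')
print(split_syllable("an"))     # ('', 'an')
```

Using a few dozen such units instead of thousands of whole syllables keeps the acoustic model's output space small while still covering all of Mandarin.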
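The feature-extraction pipeline described above (pre-emphasis, windowing, mel filterbank, cepstral coefficients) can be sketched in plain numpy. Parameter values here (0.97 pre-emphasis, 25 ms frames with 10 ms hop, 26 filters, 13 coefficients) are common defaults, not values taken from the slides:

```python
import numpy as np

def mfcc_like(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
              n_mels=26, n_ceps=13, preemph=0.97):
    # 1) pre-emphasis: boost high frequencies
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # 2) framing + Hamming window
    n_frames = 1 + max(0, (len(sig) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # 3) power spectrum per frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4) triangular mel filterbank (linear below ~1 kHz, log above)
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # 5) DCT-II to decorrelate the filterbank energies -> cepstral coeffs
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_energy @ dct.T

feats = mfcc_like(np.random.randn(16000))  # 1 s of audio -> (frames, 13)
```

The dynamic features mentioned in the materials would then be the differences along the time axis, e.g. `np.diff(feats, axis=0)` for the first-order deltas, stacked with the static coefficients.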
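End-to-end models of the kind surveyed in the materials are often trained with CTC, whose decoding step shows why they can skip the separate HMM alignment. A minimal sketch of CTC greedy decoding (an assumed example, not taken from the slides): take the most likely label per frame, collapse consecutive repeats, then drop the blank symbol.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse repeated per-frame labels, then remove blanks."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# frame-wise argmax labels: -, a, a, -, b, b, -   (0 = blank)
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 0]))  # [1, 2]
```

The blank symbol lets the model emit the same label twice in a row when a blank separates them (`[1, 0, 1]` decodes to `[1, 1]`), which is how CTC distinguishes a long phone from two identical phones.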

Related


root@autodl-container-c85144bc1a-7cf0bfa7:~/autodl-tmp/Open-LLM-VTuber# uv run run_server.py  # the first run may download some models, so it can take a while
2025-07-17 19:14:19.094 | INFO | __main__:<module>:86 - Running in standard mode. For detailed debug logs, use: uv run run_server.py --verbose
2025-07-17 19:14:19 | INFO | __main__:run:57 | Open-LLM-VTuber, version v1.1.4
2025-07-17 19:14:19 | INFO | upgrade:sync_user_config:350 | [DEBUG] User configuration is up-to-date.
2025-07-17 19:14:19 | INFO | src.open_llm_vtuber.service_context:init_live2d:156 | Initializing Live2D: shizuku-local
2025-07-17 19:14:19 | INFO | src.open_llm_vtuber.live2d_model:_lookup_model_info:142 | Model Information Loaded.
2025-07-17 19:14:19 | INFO | src.open_llm_vtuber.service_context:init_asr:166 | Initializing ASR: sherpa_onnx_asr
2025-07-17 19:14:19 | INFO | src.open_llm_vtuber.asr.sherpa_onnx_asr:__init__:81 | Sherpa-Onnx-ASR: Using cpu for inference
2025-07-17 19:14:19 | WARNING | src.open_llm_vtuber.asr.sherpa_onnx_asr:_create_recognizer:166 | SenseVoice model not found. Downloading the model...
2025-07-17 19:14:19 | INFO | src.open_llm_vtuber.asr.utils:check_and_extract_local_file:141 | ✅ Extracted directory exists: models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17, no operation needed.
2025-07-17 19:14:19 | INFO | src.open_llm_vtuber.asr.sherpa_onnx_asr:_create_recognizer:179 | Local file found. Using existing file.
2025-07-17 19:14:19 | ERROR | __main__:<module>:91 | An error has been caught in function '<module>', process 'MainProcess' (186339), thread 'MainThread' (140161628456768):
Traceback (most recent call last):
  File "/root/autodl-tmp/Open-LLM-VTuber/run_server.py", line 91, in <module>
    run(console_log_level=console_log_level)
  File "/root/autodl-tmp/Open-LLM-VTuber/run_server.py", line 71, in run
    server = WebSocketServer(config=config)
  File "/root/autodl-tmp/Open-LLM-VTuber/src/open_llm_vtuber/server.py", line 45, in __init__
    default_context_cache.load_from_config(config)
  File "/root/autodl-tmp/Open-LLM-VTuber/src/open_llm_vtuber/service_context.py", line 132, in load_from_config
    self.init_asr(config.character_config.asr_config)
  File "/root/autodl-tmp/Open-LLM-VTuber/src/open_llm_vtuber/service_context.py", line 167, in init_asr
    self.asr_engine = ASRFactory.get_asr_system(
  File "/root/autodl-tmp/Open-LLM-VTuber/src/open_llm_vtuber/asr/asr_factory.py", line 58, in get_asr_system
    return SherpaOnnxASR(**kwargs)
  File "/root/autodl-tmp/Open-LLM-VTuber/src/open_llm_vtuber/asr/sherpa_onnx_asr.py", line 83, in __init__
    self.recognizer = self._create_recognizer()
  File "/root/autodl-tmp/Open-LLM-VTuber/src/open_llm_vtuber/asr/sherpa_onnx_asr.py", line 188, in _create_recognizer
    recognizer = sherpa_onnx.OfflineRecognizer.from_sense_voice(
  File "/root/autodl-tmp/Open-LLM-VTuber/.venv/lib/python3.10/site-packages/sherpa_onnx/offline_recognizer.py", line 259, in from_sense_voice
    self.recognizer = _Recognizer(recognizer_config)
RuntimeError: No graph was found in the protobuf.


A second run of `uv run run_server.py` produces the identical log and traceback, again ending in:
RuntimeError: No graph was found in the protobuf.
This was run on AutoDL. What is the problem, and how can it be fixed?
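A likely cause, inferred from the error text rather than confirmed by the project docs: ONNX Runtime raises "No graph was found in the protobuf" when the .onnx file on disk parses but is empty or truncated, which typically means the first model download or extraction was interrupted. The log even shows the loader skipping the download because the extracted directory already exists. A sketch of the fix (the helper name is made up; the path is taken from the log above): delete the extracted directory so the next `uv run run_server.py` re-downloads the model.

```python
import shutil
from pathlib import Path

def reset_model_dir(model_dir: Path) -> bool:
    """Delete a possibly-corrupt extracted model directory so the
    application re-downloads it on the next start.
    Returns True if a directory was actually removed."""
    if model_dir.is_dir():
        shutil.rmtree(model_dir)
        return True
    return False

# Run from the Open-LLM-VTuber project root, then restart the server:
# reset_model_dir(Path("models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17"))
```

Also remove any leftover downloaded archive in `models/` if one is present, since a truncated archive would just be re-extracted into the same broken state.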

Comments

陌陌的日记 · 2025.08.02
For deep-learning researchers interested in speech recognition, these open-course materials offer cutting-edge knowledge. Not to be missed.

Friday永不为奴 · 2025.06.08
These materials cover recent deep-learning advances in speech recognition, especially end-to-end methods. High-quality content. 🦁

Msura · 2025.05.31
The discussion of end-to-end speech recognition is thorough and very helpful for understanding the latest AI voice applications.
Uploader: weixin_44220177