Speech Recognition Using Attention-Based Sequence-to-Sequence Methods


Abstract—Speech is one of the most important and prominent ways for human beings to communicate. It also has the capacity to serve as a medium for human-computer interaction. Speech recognition has become a popular research area in research institutes and Internet-related companies. This paper presents a brief overview of the two main steps of speech recognition, which are feature extraction and training a model using deep learning. In particular, five state-of-the-art methods that use attention-based sequence-to-sequence models for the speech recognition training process are discussed.

Keywords-speech recognition; attention mechanism; sequence to sequence; neural transducer; Mel-frequency cepstrum coefficient

I. INTRODUCTION
Natural language refers to a kind of language that evolves naturally with culture, and it is also the primary tool of human communication and thought. Speech recognition, as the name suggests, takes natural-language speech as the input to a model and outputs the text of that speech. In other words, it converts speech signals into text sequences. It is simple for humans to convert speech audio into text manually. Still, when facing large amounts of data, it takes plenty of time, and it is, to some extent, very difficult or impossible for humans to convert in real time. Moreover, there are hundreds of languages in the world, and few experts can master multiple languages simultaneously. As a result, people expect machine learning to help accomplish this task.

At present, the typical steps of speech recognition include preprocessing, feature extraction, training, and recognition. Feature extraction is challenging because the speech signal is volatile: even if a person tries hard to say the same sentence twice, the two signals always show some differences. Consequently, feature extraction from speech is difficult for computer scientists.

In this paper, we introduce the main process of speech recognition. For feature extraction, we introduce one of the most popular approaches, the Mel-Frequency Cepstrum Coefficient (MFCC) [1]. For the training part, it is evident that the lengths of the input (a sequence of speech vectors) and the output (a sequence of text tokens) generally differ. The input length is determined by human choices (e.g., selecting 25 ms speech frames), while the output length is determined by the model itself. Thus, Sequence-To-Sequence (Seq2Seq) based models are the most widely used nowadays.
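To make the length mismatch concrete, the following minimal sketch (in Python; the 25 ms frame length and 10 ms hop are common choices assumed here for illustration, not values prescribed by this paper) counts the acoustic input frames of a short utterance against the characters of its transcript:

```python
# Minimal sketch of the input/output length mismatch in speech recognition.
# Frame length (25 ms) and hop (10 ms) are illustrative assumptions.

def num_input_frames(audio_seconds, frame_ms=25, hop_ms=10):
    """Number of acoustic feature vectors produced for an utterance."""
    total_ms = int(audio_seconds * 1000)
    return 1 + max(0, total_ms - frame_ms) // hop_ms

utterance_seconds = 3.0               # a 3-second utterance ...
transcript = "speech recognition"     # ... whose transcript is much shorter

print(num_input_frames(utterance_seconds))  # 298 input frames
print(len(transcript))                      # 18 output characters
```

A Seq2Seq model therefore cannot rely on a fixed frame-to-token alignment; it must decide by itself how many output tokens to emit.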

The remainder of this article is organized as follows. In Section II, the MFCC feature extraction approach is illustrated. In Section III, we describe the basic attention mechanism, as well as five training methods based on it: Listen, Attend, and Spell; Connectionist Temporal Classification; the RNN Transducer; the Neural Transducer; and Monotonic Chunkwise Attention. Finally, concluding remarks are given in Section IV.

II. FEATURE EXTRACTION

Because of the instability of speech signals, feature extraction from the speech signal is very difficult. Different words have different features; for the same word, there are differences among speakers, such as between adults and children or between male and female voices. Even for the same speaker saying the same word, the signal still varies from one utterance to another [2]. MFCC was proposed based on the auditory characteristics of the human ear. It uses a nonlinear frequency unit, the Mel frequency [4], to simulate the human auditory system [8,10,11,17,18]. The calculation method is shown in Formula (1):
\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \quad (1)
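As an illustration of Formula (1), the sketch below converts between Hz and the Mel scale in Python (the function names are ours, not from the paper); the inverse mapping will be needed when placing the triangular filters described later:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Formula (1): map a frequency in Hz onto the Mel scale."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of Formula (1), used later to position the Mel filters."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

print(hz_to_mel(1000.0))   # ~1000 Mel: the scale is roughly linear below 1 kHz
print(hz_to_mel(8000.0))   # ~2840 Mel: high frequencies are compressed
```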
Figure 1 shows the construction of the MFCC model.
[Figure 1. Construction of the MFCC feature extraction model.]
The original acoustic waveform is passed through windowing and other pre-processing steps, from which we obtain the frame-level signal.
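A minimal sketch of this pre-processing stage is given below; the pre-emphasis coefficient, frame length, hop length, and Hamming window are common defaults assumed for illustration, not values specified in this paper:

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10, preemph=0.97):
    """Pre-emphasize the waveform, cut it into overlapping frames, and window each frame."""
    # Pre-emphasis boosts high frequencies before framing.
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    # Assumes the utterance is at least one frame long.
    num_frames = 1 + max(0, len(emphasized) - frame_len) // hop_len
    frames = np.stack([emphasized[i * hop_len : i * hop_len + frame_len]
                       for i in range(num_frames)])
    return frames * np.hamming(frame_len)   # apply the window to every frame
```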

Because it is difficult to observe the characteristics of the signal in the time domain, the signal is transformed into an energy distribution in the frequency domain. This energy distribution over the spectrum, which characterizes different sounds, is obtained by the fast Fourier transform.
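The sketch below applies this step to the windowed frames from the previous sketch (the FFT size of 512 points is an assumption for illustration):

```python
import numpy as np

def power_spectrum(frames, nfft=512):
    """Per-frame energy distribution in the frequency domain."""
    spectrum = np.fft.rfft(frames, n=nfft)      # real FFT of each windowed frame
    return (np.abs(spectrum) ** 2) / nfft       # shape: (num_frames, nfft // 2 + 1)
```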

After the fast Fourier transform of the speech signal is completed, Mel-frequency filtering is performed [3]. The specific step is to define a filter bank composed of triangular band-pass filters. Assume that the center frequency of each filter is f(m), that f(m-1) is the lower cut-off frequency of the region covered after adjacent filters cross and overlap, and that f(m+1) is the corresponding upper cut-off frequency. Then the calculation method is shown in Formula (2):
H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\ 0, & k > f(m+1) \end{cases} \quad (2)
We can then obtain the output spectrum energy generated by each filter.
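A hedged sketch of Formula (2) and of the resulting filter energies follows; the number of filters, FFT size, and sampling rate are illustrative assumptions, and the Hz/Mel conversions restate Formula (1):

```python
import numpy as np

def hz_to_mel(f_hz):                       # Formula (1)
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):                        # inverse of Formula (1)
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(num_filters=26, nfft=512, sample_rate=16000):
    """Triangular band-pass filters spaced evenly on the Mel scale (Formula (2))."""
    low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = np.linspace(low_mel, high_mel, num_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                      # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):                     # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

def filter_bank_energies(power_spec, fbank, eps=1e-10):
    """Output spectrum energy of each filter (log-compressed)."""
    return np.log(power_spec @ fbank.T + eps)
```

These log filter-bank energies are the quantities on which the remaining MFCC steps operate.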
