3. 1. Introduction to Honda Research Institute Japan (HRI-JP)
HRI-JP: Wako, Saitama
HRI-US: San Jose, California
HRI-EU: Offenbach, Germany
Purpose of founding the Honda Research Institutes
"Create high-value technologies at top speed with the most advanced technology of the 21st century, and contribute to the society of the future"
Established in 2003
at three sites: Japan, the US, and Europe (Germany)
9.
Cascade methods based on array signal processing
● Cascade methods based on array signal processing [1][2]
- Cascade of individual array-signal-processing function blocks
- Performance degradation due to accumulation of the errors produced in each block
y* = argmax_y f(X, θ)
Y_{ω,t} = Σ_{m=1}^{M} F_{m,ω} X_{m,ω,t}
P(θ) = (H^H(θ) H(θ)) / (Σ_m |H^H(θ) e_m|²)   (MUSIC pseudospectrum)
[1] K. Nakadai, G. Ince, K. Nakamura, and H. Nakajima, "Robot audition for dynamic environments," IEEE International Conference on Signal Processing, Communication and Computing (ICSPCC), 2012, pp. 125–130.
[2] K. Nakamura, K. Nakadai, and H. G. Okuno, "A real-time super-resolution robot audition system that improves the robustness of simultaneous speech recognition," Advanced Robotics, 2013, Vol. 27, No. 12, pp. 933–945.
[Block diagram] Sound source localization (MUSIC: Multiple Signal Classification) → sound source separation (beamformer) → recognition (GMM: Gaussian Mixture Model) → recognition result.
- Risk of mislocalizing noise as the target sound (MUSIC spectrum)
- Sources arriving from the same direction cannot be separated
- Accuracy degrades because of errors from the preceding blocks
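The MUSIC pseudospectrum on this slide can be sketched in a few lines of NumPy. This is a toy illustration only: the circular-array steering model, array size, source direction, and noise floor below are all assumptions for the demo, not the system described in [1][2].

```python
import numpy as np

# MUSIC sketch: P(theta) = H^H(theta)H(theta) / sum_m |H^H(theta) e_m|^2,
# where the e_m span the noise subspace of the spatial correlation matrix R.
def music_spectrum(R, steering, n_sources):
    M = R.shape[0]
    _, eigvec = np.linalg.eigh(R)              # eigenvalues in ascending order
    E_n = eigvec[:, : M - n_sources]           # noise subspace (smallest eigenvalues)
    num = np.real(np.sum(steering.conj() * steering, axis=1))
    den = np.sum(np.abs(steering.conj() @ E_n) ** 2, axis=1)
    return num / np.maximum(den, 1e-12)

M, n_dirs = 4, 72                              # toy setup: 4 mics, 72 candidate directions
angles = np.linspace(0, 2 * np.pi, n_dirs, endpoint=False)
mic_angles = 2 * np.pi * np.arange(M) / M      # mics placed on a circle (assumption)
steering = np.exp(1j * np.pi * np.cos(angles[:, None] - mic_angles[None, :]))
true_idx = 20                                  # hypothetical source direction
s = steering[true_idx]
R = np.outer(s, s.conj()) + 1e-3 * np.eye(M)   # one source plus a small noise floor
P = music_spectrum(R, steering, n_sources=1)   # sharp peak at the source direction
```

Because the noise subspace is orthogonal to the source's steering vector, the denominator collapses at the true direction and P(θ) peaks there.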
11.
Single-channel deep-learning methods
● Universal Sound Separation [1]
- Performs sound source separation and class classification simultaneously from the mixed time-domain waveform using Conv-TasNet
- Being a single-channel method, its performance degrades on overlapping sounds
[1] Kavalerov, Ilya, et al. "Universal sound separation." 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2019.
Universal Sound Separation using Conv-TasNet
12.
Single-channel deep-learning methods
● Learning to separate sounds from weakly labeled scenes [2]
- Performs sound source separation and class classification simultaneously from the mixture spectrogram using a CRNN
- Weakly supervised training using a sound event detection dataset
[2] Pishdadian, Fatemeh, Gordon Wichern, and Jonathan Le Roux. "Finding strength in weakness: Learning to separate sounds with weak supervision." IEEE/ACM Transactions on Audio, Speech, and
Language Processing 28 (2020): 2386-2399.
Spectral features such as the amplitude spectrum
13.
Multi-channel deep-learning methods
[3] Adavanne, Sharath, et al. "Sound event localization and detection of overlapping sources using convolutional recurrent neural networks." IEEE Journal of Selected Topics in Signal Processing 13.1 (2019): 34-48.
● Sound Event Localization and Detection (SELD) [3]
- Performs sound source localization, interval detection, and class classification simultaneously
- Achieves localization through spatial features such as IPD in addition to spectral features
- Jointly learning two uncorrelated quantities, sound class and source direction, causes them to become excessively entangled
[Figure] Sound event detection and source direction estimation, driven by spectral features (e.g., the amplitude spectrum) and spatial features (e.g., IPD).
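The IPD spatial feature mentioned in the SELD bullet can be sketched as follows. The (channels, freq_bins, time_frames) STFT layout, the reference-channel choice, and the sin/cos encoding are assumptions for illustration.

```python
import numpy as np

# Inter-channel phase difference (IPD) between each microphone and a reference
# channel, encoded as sin/cos so the feature is continuous across the ±pi wrap-around.
def ipd_features(X, ref=0):
    phase_diff = np.angle(X) - np.angle(X[ref])
    return np.sin(phase_diff), np.cos(phase_diff)

rng = np.random.default_rng(0)
# Stand-in multi-channel STFT: (channels, freq_bins, time_frames)
X = rng.standard_normal((2, 257, 50)) + 1j * rng.standard_normal((2, 257, 50))
sin_ipd, cos_ipd = ipd_features(X)             # each of shape (2, 257, 50)
```

In a model like SELD, these two maps are stacked channel-wise with the amplitude spectrogram and fed to the network.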
26.
References
[1] Kavalerov, Ilya, et al. "Universal sound separation." 2019 IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics (WASPAA). IEEE, 2019.
[2] Pishdadian, Fatemeh, Gordon Wichern, and Jonathan Le Roux. "Learning to separate sounds from weakly
labeled scenes." ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2020.
[3] Adavanne, Sharath, et al. "Sound event localization and detection of overlapping sources using convolutional recurrent neural networks." IEEE Journal of Selected Topics in Signal Processing 13.1 (2019): 34-48.
Single-channel environmental sound segmentation
- Y. Sudo, K. Itoyama, K. Nishida and K. Nakadai, Sound event aware environmental sound segmentation with Mask U-Net, Advanced Robotics, 2020, Vol. 34, No. 20, pp. 1280-1290.
- Y. Sudo, K. Itoyama, K. Nishida and K. Nakadai, Environmental sound segmentation utilizing Mask U-Net,
IEEE/RSJ International Conference on Intelligent Robots and Systems, Macau, 2019, pp. 5340–5345.
Multi-channel environmental sound segmentation
- Y. Sudo, K. Itoyama, K. Nishida and K. Nakadai, Multi-channel environmental sound segmentation, Applied Intelligence, 2021, doi:10.1007/s10489-021-02314-5.
Editor's Notes
#2: My name is Yui Sudo from Tokyo Institute of Technology.
I am going to talk about "Environmental sound segmentation utilizing Mask U-Net".
10s
#3: Robots in real environments must recognize many kinds of sounds, such as speech in a noisy environment, and sometimes not only speech but also music, birdsong, and so on.
Many methods have therefore been developed, for example noise reduction, sound source localization, and sound source separation.
However, these conventional methods are combined in a cascade, as in this block diagram.
The biggest drawback of a cascade system is that the errors occurring at each function block accumulate.
Therefore, to realize an overall-optimized and more general method, it is necessary to develop an end-to-end system
that performs section detection, classification, and separation simultaneously.
1’00
#10: I'll review some related work on sound event detection.
This slide shows one popular approach to sound event detection, a CNN-based method.
It applies a CRNN to the spectrogram and detects the onset and offset of each class of sound event.
However, this SED method cannot retain frequency information because of frequency pooling, so it cannot separate the individual sound events.
30s
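The frequency-pooling limitation in this note can be illustrated with shapes alone. The pooling factor and feature-map sizes below are arbitrary stand-ins, and no actual CRNN is involved.

```python
import numpy as np

# Max-pooling along the frequency axis shrinks the frequency dimension, so the
# pooled representation can no longer serve as a per-frequency-bin separation mask.
feat = np.random.default_rng(0).standard_normal((256, 100))   # (freq_bins, frames)
pool = 4
pooled = feat.reshape(256 // pool, pool, 100).max(axis=1)     # 4x frequency pooling
# 256 frequency bins collapse to 64 coarse bands; per-bin detail is gone.
```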
#12: Next, let me introduce some more related work on sound source separation.
One traditional approach is based on NMF. Its computational cost is low, but its performance is also low compared with DNN-based approaches, and it is difficult for it to handle many classes, as environmental sounds require.
The second is the deep-learning-based approach. U-Net, originally proposed for image semantic segmentation, has been applied to vocal separation. It predicts mask spectrograms for separating the singing voice and can be trained end-to-end.
However, the number of classes is again small (vocals and instruments). In addition, previous work on image semantic segmentation has pointed out that performance drops when object sizes differ greatly, for example a bed versus a pillow.
So it is difficult to simply apply this method to environmental sound segmentation, which involves many classes.
1’15
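The NMF-based separation approach mentioned in this note can be sketched with the classic multiplicative-update rules for the Euclidean cost. The matrix sizes and rank are arbitrary stand-ins, and the random non-negative matrix is only a placeholder for a magnitude spectrogram.

```python
import numpy as np

# Minimal NMF: factor a non-negative matrix V into W @ H with W, H >= 0,
# using Lee-Seung multiplicative updates for the Frobenius-norm objective.
def nmf(V, k, n_iter=200, eps=1e-9):
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis spectra
    return W, H

V = np.abs(np.random.default_rng(1).standard_normal((64, 40)))  # stand-in spectrogram
W, H = nmf(V, k=8)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)             # relative fit error
```

For separation, each column of W acts as a basis spectrum assigned to a source, which is exactly where the class-scalability problem in the note appears.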
#18: This slide shows the complete architecture of environmental sound segmentation, which consists of three blocks: feature extraction, segmentation, and reconstruction.
In the feature extraction block, an STFT is applied to the mixed waveforms, and the result is divided into spectral and spatial features.
These features are input to the segmentation block, which predicts a mask spectrogram for separating each class from the input spectrogram.
Then an inverse STFT reconstructs the time-domain signal, using the predicted amplitude spectrogram and the phase spectrogram obtained from the mixed waveform.
The differences between the conventional method and our model are that
the input features are extended to multi-channel input, and
DeepLabv3+ is applied instead of a U-Net-based method.
These differences are expected to improve performance on overlapping sounds and robustness to large variation in sound event length.
1’10
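The reconstruction step described in this note (mask the mixture's amplitude spectrogram, reuse the mixture's phase, then inverse STFT) can be sketched as follows. The hard frequency mask is a hypothetical stand-in for a network's output, not the model's actual prediction, and the signal, sample rate, and frame size are demo assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

fs, nperseg = 16000, 512
t_axis = np.arange(fs) / fs
# Stand-in mixture: a 440 Hz "target" plus a 2 kHz "interferer".
mix = np.sin(2 * np.pi * 440 * t_axis) + 0.5 * np.sin(2 * np.pi * 2000 * t_axis)

f, t, Z = stft(mix, fs=fs, nperseg=nperseg)
mask = (f < 1000).astype(float)[:, None]       # stand-in mask: keep only low frequencies
# Masked amplitude + mixture phase, as in the slide's reconstruction block.
Z_sep = mask * np.abs(Z) * np.exp(1j * np.angle(Z))
_, sep = istft(Z_sep, fs=fs, nperseg=nperseg)  # back to the time domain

S = np.abs(np.fft.rfft(sep[:fs]))              # 440 Hz survives, 2 kHz is suppressed
```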
#20: We evaluate our method through simulation experiments using three custom datasets.
We created these datasets from 10 corpora containing many classes of dry sources.
Segmentation results are evaluated by computing the RMSE.
Then I will show a few examples and discuss the effect of DeepLabv3+ and the spatial features.
30s
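The RMSE evaluation mentioned here, taken over all time-frequency bins, is simply the following; the spectrogram shape and the constant-offset "estimate" are stand-ins for illustration.

```python
import numpy as np

# Root-mean-square error between an estimated spectrogram and the ground truth,
# averaged over every time-frequency bin.
def rmse(est, ref):
    return float(np.sqrt(np.mean((est - ref) ** 2)))

ref = np.abs(np.random.default_rng(0).standard_normal((257, 100)))
est = ref + 0.1                                # hypothetical estimate, constant 0.1 error
```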
#22: This figure and this table show the experimental settings for the numerical simulations.
Three dry sources are randomly selected from these 10 corpora
and convolved with the impulse responses, giving a mixed spectrogram like this one.
Then diffuse noise is added to all time frames.
We created a training set of 10,000 mixtures and an evaluation set of 1,000.
30s
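The mixing procedure in this note can be sketched as follows. The toy impulse responses, signal lengths, and SNR are assumptions for the demo, and white noise stands in for the diffuse noise field.

```python
import numpy as np

# Convolve each randomly chosen dry source with a (hypothetical) room impulse
# response, sum the reverberant sources, then add noise at a target SNR.
def simulate_mixture(dry_sources, rirs, snr_db=20.0, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    mix = None
    for s, h in zip(dry_sources, rirs):
        x = np.convolve(s, h)                  # reverberant source
        mix = x if mix is None else mix[: len(x)] + x[: len(mix)]
    noise = rng.standard_normal(len(mix))
    # Scale the noise so that 10*log10(P_mix / P_noise) == snr_db.
    noise *= np.sqrt(np.mean(mix ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return mix + noise

rng = np.random.default_rng(1)
dry = [rng.standard_normal(8000) for _ in range(3)]               # stand-in dry sources
rirs = [np.r_[1.0, 0.3 * rng.standard_normal(255)] for _ in range(3)]  # toy RIRs
mix = simulate_mixture(dry, rirs)
```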
#25: This table shows the summary of the simulation results. First, let's look at the results on dataset 1, which contains three classes of sound.
Regarding the deep-learning models, DeepLabv3+ showed higher performance than the conventional models.
As for the input features, using sinIPD and cosIPD clearly improved the RMSE, especially for the CRNN and U-Net.
These figures show an example.
This example contains overlapping sounds, shown as the blue spectra behind the green and yellow spectra.
As you can see from these colored spectrograms, every model looks good. However
1’00