Non-Parallel Many-to-Many Voice Conversion with PSR-StarGAN.
Y Li, D Xu, Y Zhang, Y Wang, B Chen - Interspeech, 2020 - isca-archive.org
Abstract
Voice conversion (VC) aims to modify a source speaker’s speech so that it sounds like that of a target speaker while preserving the linguistic information of the given speech. StarGAN-VC, a recently proposed method, uses a variant of Generative Adversarial Networks (GANs) to perform non-parallel many-to-many VC. However, the quality of the generated speech is not yet satisfactory. This paper proposes an improved method, “PSR-StarGAN-VC”, that incorporates three improvements. First, perceptual loss functions are introduced to optimize the StarGAN-VC generator so that it learns high-level spectral features. Second, because Switchable Normalization (SN) can learn different operations in different normalization layers of a model, it replaces Batch Normalization (BN) in StarGAN-VC. Third, a Residual Network (ResNet) is applied to establish mappings between different layers of the generator’s encoder and decoder, which aims to retain more semantic features during conversion and to make training easier. Experimental results on the VCC 2018 datasets demonstrate the superiority of the proposed method in terms of naturalness and speaker similarity.
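The abstract does not spell out how the normalization layer is built. As a rough illustration of the general Switchable Normalization idea it refers to, the sketch below mixes instance-, layer-, and batch-level statistics with softmax weights and normalizes with the blended mean and variance. This is an assumption about the form of the layer, not the authors' implementation: the function name `switchable_norm`, the parameters `mean_logits` and `var_logits`, and the `[batch][channel][frame]` layout are hypothetical, and the learnable scale and shift are left out for brevity.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a short list of logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_var(values):
    # Population mean and variance of a flat list of floats.
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var


def switchable_norm(x, mean_logits, var_logits, eps=1e-5):
    """Normalize x, a nested list indexed as [batch][channel][frame].

    mean_logits and var_logits each hold three values, ordered as
    (instance, layer, batch). In a trained model they would be learned;
    here they are plain inputs (hypothetical parameter names).
    """
    n_batch, n_chan = len(x), len(x[0])
    w_mean = softmax(mean_logits)
    w_var = softmax(var_logits)

    # Instance statistics: one (mean, var) per sample and channel.
    in_stats = [[mean_var(x[n][c]) for c in range(n_chan)]
                for n in range(n_batch)]
    # Layer statistics: one (mean, var) per sample, over channels and frames.
    ln_stats = [mean_var([v for c in range(n_chan) for v in x[n][c]])
                for n in range(n_batch)]
    # Batch statistics: one (mean, var) per channel, over samples and frames.
    bn_stats = [mean_var([v for n in range(n_batch) for v in x[n][c]])
                for c in range(n_chan)]

    out = []
    for n in range(n_batch):
        sample = []
        for c in range(n_chan):
            stats = (in_stats[n][c], ln_stats[n], bn_stats[c])
            # Blend the three means and the three variances separately.
            mu = sum(w * s[0] for w, s in zip(w_mean, stats))
            var = sum(w * s[1] for w, s in zip(w_var, stats))
            sample.append([(v - mu) / math.sqrt(var + eps) for v in x[n][c]])
        out.append(sample)
    return out


# Toy input: 2 samples, 2 channels, 4 frames.
x = [[[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]],
     [[5.0, 7.0, 9.0, 11.0], [1.0, 3.0, 1.0, 3.0]]]
y = switchable_norm(x, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

With equal logits the three normalizers contribute equally; pushing one logit far above the others makes the layer behave like that single normalizer, which is how different layers can end up with different operations.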