GF-VAE
A. A. Nugraha, K. Sekiguchi, and K. Yoshii, "A flow-based deep latent variable model for speech spectrogram modeling and enhancement," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 28, pp. 1104–1117, 2020.
Abstract
This paper describes a deep latent variable model of speech power spectrograms and its application to semi-supervised speech enhancement with a deep speech prior. Integrating two major deep generative models, a variational autoencoder (VAE) and a normalizing flow (NF), in a mutually-beneficial manner, we can formulate a flexible latent variable model called the NF-VAE that can extract low-dimensional latent representations from high-dimensional observations, akin to the VAE, and does not need to explicitly represent the distribution of the observations, akin to the NF. In this paper, we consider a variant of NF called the generative flow (GF a.k.a. Glow) and formulate a latent variable model called the GF-VAE. We experimentally show that the proposed GF-VAE is better than the standard VAE at capturing fine-structured harmonics of speech spectrograms, especially in a higher frequency range. A similar finding is also obtained when the GF-VAE and the VAE are used to generate speech spectrograms from latent variables randomly sampled from the standard Gaussian distribution. Lastly, when these models are used as speech priors for statistical multichannel speech enhancement, the GF-VAE outperforms the VAE and the GF.
Reference
A. A. Nugraha, K. Sekiguchi, and K. Yoshii, “A flow-based deep latent variable model for speech spectrogram modeling and enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1104–1117, 2020, doi: 10.1109/TASLP.2020.2979603.
Audio Samples
Clean Speech Reconstruction
The following time-domain speech signals are obtained by combining the power spectrograms reconstructed by the different models with the phase spectrograms of the original clean speech.
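The resynthesis step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `reconstruct_with_clean_phase`, the sampling rate, and the STFT settings (`nperseg`, `noverlap`) are assumptions chosen for the example, and a random signal stands in for clean speech.

```python
import numpy as np
from scipy.signal import stft, istft


def reconstruct_with_clean_phase(power_spec, clean_stft, fs=16000,
                                 nperseg=1024, noverlap=768):
    """Combine a (reconstructed) power spectrogram with clean-speech phase.

    power_spec : real array (freq, frames), e.g. a model's reconstruction.
    clean_stft : complex array (freq, frames), STFT of the clean speech.
    The STFT settings are hypothetical; they must match those used to
    compute clean_stft.
    """
    # Power -> magnitude (clip tiny negative values for safety).
    magnitude = np.sqrt(np.maximum(power_spec, 0.0))
    # Borrow the phase from the clean speech STFT.
    phase = np.angle(clean_stft)
    complex_spec = magnitude * np.exp(1j * phase)
    # Inverse STFT back to the time domain.
    _, signal = istft(complex_spec, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return signal


# Stand-in for clean speech: one second of noise at an assumed 16 kHz.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
_, _, clean_stft = stft(clean, fs=16000, nperseg=1024, noverlap=768)

# With the true power spectrogram, resynthesis recovers the input signal;
# a model's reconstructed power spectrogram would be passed in its place.
true_power = np.abs(clean_stft) ** 2
resynth = reconstruct_with_clean_phase(true_power, clean_stft)
```

Because the phase is taken from the clean speech, any audible difference between models in the table below comes from the reconstructed power spectrogram alone.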
Model | z_dim | M05_445C020A_BUS | M06_443C020K_CAF | F05_442C020T_PED | F06_445C0202_STR |
---|---|---|---|---|---|
Clean | – | ||||
VAE-2L | 8 | ||||
VAE-3L | 8 | ||||
GF-VAE-1 | 8 | ||||
GF-VAE-2 | 8 | ||||
VAE-2L | 16 | ||||
VAE-3L | 16 | ||||
GF-VAE-1 | 16 | ||||
GF-VAE-2 | 16 | ||||
VAE-2L | 32 | ||||
VAE-3L | 32 | ||||
GF-VAE-1 | 32 | ||||
GF-VAE-2 | 32 | ||||
Random Speech Generation
The following time-domain speech signals are obtained from power spectrograms randomly generated by the different models. The initial phase values are sampled from a uniform distribution and then refined by the Griffin-Lim method (20 iterations).
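The phase recovery described above can be sketched as a plain Griffin-Lim loop. This is an illustrative sketch, not the authors' implementation: the function name `griffin_lim`, the STFT settings, and the use of a sine tone's spectrogram in place of a model-generated one are all assumptions, and the `seed` argument here only controls the phase initialization.

```python
import numpy as np
from scipy.signal import stft, istft


def griffin_lim(power_spec, n_iter=20, nperseg=1024, noverlap=768, seed=123):
    """Estimate a time-domain signal from a power spectrogram.

    Starts from uniformly sampled phase and alternates between the time
    domain and the STFT domain, keeping the target magnitude each time.
    The STFT settings are hypothetical defaults for this sketch.
    """
    rng = np.random.default_rng(seed)
    magnitude = np.sqrt(np.maximum(power_spec, 0.0))
    # Random initial phase, uniform in [-pi, pi).
    phase = rng.uniform(-np.pi, np.pi, size=magnitude.shape)
    spec = magnitude * np.exp(1j * phase)
    for _ in range(n_iter):
        # Project onto the set of consistent spectrograms.
        _, signal = istft(spec, nperseg=nperseg, noverlap=noverlap)
        _, _, reest = stft(signal, nperseg=nperseg, noverlap=noverlap)
        # Keep the re-estimated phase, restore the target magnitude.
        spec = magnitude * np.exp(1j * np.angle(reest))
    _, signal = istft(spec, nperseg=nperseg, noverlap=noverlap)
    return signal


# Stand-in target: power spectrogram of a 440 Hz tone (assumed 16 kHz).
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)
_, _, tone_stft = stft(tone, nperseg=512, noverlap=384)
target_power = np.abs(tone_stft) ** 2

audio = griffin_lim(target_power, n_iter=20, nperseg=512, noverlap=384)
```

Each iteration cannot increase the mismatch between the target magnitude and the magnitude of the re-analyzed signal, so a fixed budget such as 20 iterations trades quality against compute.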
Model | z_dim | random seed = 123 | random seed = 456 | random seed = 789 |
---|---|---|---|---|
VAE-2L | 8 | |||
VAE-3L | 8 | |||
GF-VAE-1 | 8 | |||
GF-VAE-2 | 8 | |||
VAE-2L | 16 | |||
VAE-3L | 16 | |||
GF-VAE-1 | 16 | |||
GF-VAE-2 | 16 | |||
VAE-2L | 32 | |||
VAE-3L | 32 | |||
GF-VAE-1 | 32 | |||
GF-VAE-2 | 32 | |||
GF | – | |||
Multichannel Speech Enhancement
Model | z_dim | M05_445C020A_BUS | M06_443C020K_CAF | F05_442C020T_PED | F06_445C0202_STR |
---|---|---|---|---|---|
Noisy | – | ||||
Clean | – | ||||
VAE-2L | 8 | ||||
VAE-3L | 8 | ||||
GF-VAE-1 | 8 | ||||
GF-VAE-2 | 8 | ||||
VAE-2L | 16 | ||||
VAE-3L | 16 | ||||
GF-VAE-1 | 16 | ||||
GF-VAE-2 | 16 | ||||
VAE-2L | 32 | ||||
VAE-3L | 32 | ||||
GF-VAE-1 | 32 | ||||
GF-VAE-2 | 32 | ||||
GF | – | ||||