This page lists publications by category in reverse chronological order. Asterisks (*) indicate authors who contributed equally to an article. An up-to-date list is available on Google Scholar.
2023
WASPAA
Time-Domain Audio Source Separation Based on Gaussian Processes with Deep Kernel Learning
This paper revisits single-channel audio source separation based on a probabilistic generative model of a mixture signal defined in the continuous time domain. We assume that each source signal follows a non-stationary Gaussian process (GP), i.e., any finite set of sampled points follows a zero-mean multivariate Gaussian distribution whose covariance matrix is governed by a kernel function over time-varying latent variables. The mixture signal composed of such source signals thus follows a GP whose covariance matrix is given by the sum of the source covariance matrices. To estimate the latent variables from the mixture signal, we use a deep neural network with an encoder-separator-decoder architecture (e.g., Conv-TasNet) that separates the latent variables in a pseudo-time-frequency space. The key feature of our method is to feed the latent variables into the kernel function for estimating the source covariance matrices, instead of using the decoder for directly estimating the time-domain source signals. This enables the decomposition of a mixture signal into the source signals with a classical yet powerful Wiener filter that considers the full covariance structure over all samples. The kernel function and the network are trained jointly in the maximum likelihood framework. Comparative experiments using two-speech mixtures under clean, noisy, and noisy-reverberant conditions from the WSJ0-2mix, WHAM!, and WHAMR! benchmark datasets demonstrated that the proposed method performed well and outperformed the baseline method under noisy and noisy-reverberant conditions.
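To make the separation step concrete, here is a minimal NumPy sketch of the classical Wiener filtering over all time-domain samples that the abstract refers to, assuming two hypothetical full source covariance matrices K1 and K2 (in the paper these would be produced by the learned kernel over DNN-estimated latent variables):

```python
import numpy as np

def wiener_separate(x, K1, K2, noise_var=1e-6):
    """Posterior-mean (Wiener) estimates of two Gaussian-process sources from
    their sum x, given full covariance matrices K1 and K2 over all time-domain
    samples. A small diagonal term keeps the matrix inversion stable."""
    Kx = K1 + K2 + noise_var * np.eye(len(x))   # mixture covariance
    Kx_inv_x = np.linalg.solve(Kx, x)           # (K1 + K2)^{-1} x
    return K1 @ Kx_inv_x, K2 @ Kx_inv_x         # E[s1 | x], E[s2 | x]

# Toy usage with random positive-definite covariances, stand-ins for the
# kernel-derived source covariances in the paper.
rng = np.random.default_rng(0)
T = 64
A1, A2 = rng.standard_normal((T, T)), rng.standard_normal((T, T))
K1, K2 = A1 @ A1.T / T, A2 @ A2.T / T
x = rng.multivariate_normal(np.zeros(T), K1 + K2)
s1_hat, s2_hat = wiener_separate(x, K1, K2)
```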
EUSIPCO
Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation
This paper describes an efficient unsupervised learning method for a neural source separation model that utilizes a probabilistic generative model of observed multichannel mixtures proposed for blind source separation (BSS). For this purpose, amortized variational inference (AVI) has been used for directly solving the inverse problem of BSS with full-rank spatial covariance analysis (FCA). Although this unsupervised technique called neural FCA is in principle free from the domain mismatch problem, it is computationally demanding due to the full rankness of the spatial model in exchange for robustness against relatively short reverberations. To reduce the model complexity without sacrificing performance, we propose neural FastFCA based on the jointly-diagonalizable yet full-rank spatial model. Our neural separation model introduced for AVI alternates between neural network blocks and single steps of an efficient iterative algorithm called iterative source steering. This alternating architecture enables the separation model to quickly separate the mixture spectrogram by leveraging both the deep neural network and the multichannel optimization algorithm. The training objective with AVI is derived to maximize the marginalized likelihood of the observed mixtures. The experiment using mixture signals of two to four sound sources shows that neural FastFCA outperforms conventional BSS methods and reduces the computational time to about 2% of that for neural FCA.
ICASSP
Exploiting Sparse Recovery Algorithms for Semi-Supervised Training of Deep Neural Networks for Direction-of-Arrival Estimation
Murtiza Ali,
Aditya Arie Nugraha,
and Karan Nathwani
In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2023
This paper proposes a semi-supervised training approach for a direction-of-arrival (DoA) estimation based on a convolutional neural network (CNN). We apply a sparse recovery algorithm called optMGD-ℓ1-SVD on the training dataset consisting of only unlabeled observed data to obtain binarized pseudo-spectra regarded as the CNN training targets (labels). The estimated DoAs are obtained at test time by performing peak picking on the CNN outputs. optMGD-ℓ1-SVD has been shown to perform well with a few sensors under low signal-to-noise ratio (SNR) conditions (up to −6 dB) by optimally reweighting the pseudo-spectra of ℓ1-SVD based on the application of group delay function on the pseudo-spectra of MUSIC. Since its hyperparameters are noise-sensitive, we assume that the SNR levels of the training dataset are known such that we can use the optimal ones. We also consider multi-condition training using data of multiple SNR levels to improve the robustness towards different noisy environments. We evaluated the trained networks, named optMGD-ℓ1-SVD-CNN and MGD-ℓ1-SVD-CNN, in terms of the average root-mean-square error and the resolution probability under low SNR conditions (up to −20 dB). We demonstrated that it performed well with a few sensors and snapshots, including at SNR levels unseen in the training data.
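As an illustration of the test-time peak-picking step described above, a small sketch; the angular grid, threshold, and synthetic pseudo-spectrum are placeholders rather than the paper's settings:

```python
import numpy as np
from scipy.signal import find_peaks

def pick_doas(pseudo_spectrum, angle_grid_deg, height=0.5):
    """Pick DoA estimates as the peaks of a CNN-output pseudo-spectrum
    defined over a uniform angular grid."""
    peaks, _ = find_peaks(pseudo_spectrum, height=height)
    return angle_grid_deg[peaks]

# Toy usage: a 1-degree grid and a synthetic two-peak pseudo-spectrum.
angles = np.arange(-90, 91)
spec = (np.exp(-0.5 * ((angles + 30) / 3) ** 2)
        + np.exp(-0.5 * ((angles - 10) / 3) ** 2))
print(pick_doas(spec, angles))  # approximately [-30, 10]
```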
2022
IROS
Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments
This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset that helps a user understand conversations made in real noisy echoic environments (e.g., cocktail party). One may use a state-of-the-art blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF) that works well in various environments thanks to its unsupervised nature. Its heavy computational cost, however, prevents its application to real-time processing. In contrast, a supervised beamforming method that uses a deep neural network (DNN) for estimating spatial information of speech and noise readily fits real-time processing, but suffers from drastic performance degradation in mismatched conditions. Given such complementary characteristics, we propose a dual-process robust online speech enhancement method based on DNN-based beamforming with FastMNMF-guided adaptation. FastMNMF (back end) is performed in a mini-batch style and the noisy and enhanced speech pairs are used together with the original parallel training data for updating the direction-aware DNN (front end) with backpropagation at a computationally-allowable interval. This method is used with a blind dereverberation method called weighted prediction error (WPE) for transcribing the noisy reverberant speech of a speaker, which can be detected from video or selected by a user’s hand gesture or eye gaze, in a streaming manner and spatially showing the transcriptions with an AR technique. Our experiment showed that the word error rate was improved by more than 10 points with the run-time adaptation using only twelve minutes of observation.
Interspeech
Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments
This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication in real multiparty conversational environments. A major approach that has actively been studied in simulated environments is to sequentially perform speech enhancement and automatic speech recognition (ASR) based on deep neural networks (DNNs) trained in a supervised manner. In our task, however, such a pretrained system fails to work due to the mismatch between the training and test conditions and the head movements of the user. To enhance only the utterances of a target speaker, we use beamforming based on a DNN-based speech mask estimator that can adaptively extract the speech components corresponding to a particular head-relative direction. We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean speech signals with ground-truth transcriptions and noisy speech signals with highly-confident estimated transcriptions. Comparative experiments using the state-of-the-art distant speech recognition system show that the proposed method significantly improves the ASR performance.
IWAENC
DNN-Free Low-Latency Adaptive Speech Enhancement Based on Frame-Online Beamforming Powered by Block-Online FastMNMF
This paper describes a practical dual-process speech enhancement system that adapts environment-sensitive frame-online beamforming (front-end) with help from environment-free block-online source separation (back-end). To use minimum variance distortionless response (MVDR) beamforming, one may train a deep neural network (DNN) that estimates time-frequency masks used for computing the covariance matrices of sources (speech and noise). Backpropagation-based run-time adaptation of the DNN was proposed for dealing with the mismatched training-test conditions. Instead, one may try to directly estimate the source covariance matrices with a state-of-the-art blind source separation method called fast multichannel non-negative matrix factorization (FastMNMF). In practice, however, neither the DNN nor the FastMNMF can be updated in a frame-online manner due to its computationally-expensive iterative nature. Our DNN-free system leverages the posteriors of the latest source spectrograms given by block-online FastMNMF to derive the current source covariance matrices for frame-online beamforming. The evaluation shows that our frame-online system can quickly respond to scene changes caused by interfering speaker movements and outperformed an existing block-online system with DNN-based beamforming by 5.0 points in terms of the word error rate.
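For reference, a hedged NumPy sketch of how per-frame source posteriors (here a generic speech/noise mask `m_speech`, standing in for the FastMNMF posteriors) can yield covariance matrices for a Souden-style MVDR beamformer at a single frequency bin:

```python
import numpy as np

def mvdr_from_masks(X, m_speech, ref=0, eps=1e-6):
    """MVDR beamforming for one frequency bin.
    X: (M, T) multichannel STFT frames; m_speech: (T,) speech presence weights."""
    M, T = X.shape
    m_noise = 1.0 - m_speech
    Phi_s = (m_speech * X) @ X.conj().T / (m_speech.sum() + eps)  # speech covariance
    Phi_n = (m_noise * X) @ X.conj().T / (m_noise.sum() + eps)    # noise covariance
    Phi_n += eps * np.eye(M)                                      # regularization
    A = np.linalg.solve(Phi_n, Phi_s)                             # Phi_n^{-1} Phi_s
    w = A[:, ref] / (np.trace(A) + eps)                           # Souden-style weights
    y = w.conj() @ X                                              # enhanced reference signal
    return y, w
```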
IWAENC
Joint Localization and Synchronization of Distributed Camera-Attached Microphone Arrays for Indoor Scene Analysis
This paper describes an automatic calibration method that localizes and synchronizes distributed camera-attached microphone arrays (e.g., Microsoft Azure Kinect) used for audiovisual indoor scene analysis. Operating multiple audio-visual sensors as a large-scale array is a key to resolving object occlusions and sound overlaps by integrating audio-visual information obtained from multiple angles. A naive solution to the calibration problem is to synchronize microphone arrays after localizing them using only visual information. This cascading approach, however, would suffer from the error propagation problem. We thus propose a principled statistical method that fully uses audio-visual information at once. Our method only asks a user to make handclaps and jointly estimates the sensor positions and time offsets and the time-varying source position with the GraphSLAM algorithm based on a unified state-space model associating all the latent calibration targets with the audio-visual observations. The experiment using real recordings shows the stable behavior of the proposed method.
EUSIPCO
Elliptically Contoured Alpha-Stable Representation for MUSIC-Based Sound Source Localization
This paper introduces a theoretically-rigorous sound source localization (SSL) method based on a robust extension of the classical multiple signal classification (MUSIC) algorithm. The original SSL method estimates the noise eigenvectors and the MUSIC spectrum by computing the spatial covariance matrix of the observed multichannel signal and then detects the peaks from the spectrum. In this work, the covariance matrix is replaced with the positive definite shape matrix originating from the elliptically contoured α-stable model, which is more suitable under real noisy high-reverberant conditions. Evaluation on synthetic data shows that the proposed method outperforms baseline methods under such adverse conditions, while it is comparable on real data recorded in a mild acoustic condition.
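As background, a compact NumPy sketch of the classical MUSIC pseudo-spectrum the paper builds on; the proposed method would replace the sample covariance `R` below with the shape matrix estimated under the elliptically contoured α-stable model, and the array geometry and steering model here are illustrative assumptions:

```python
import numpy as np

def music_spectrum(X, steering, n_sources):
    """Classical MUSIC pseudo-spectrum.
    X: (M, T) multichannel snapshots; steering: (M, A) steering vectors for
    A candidate directions; n_sources: assumed number of sources."""
    M, T = X.shape
    R = X @ X.conj().T / T                      # sample spatial covariance
    eigval, eigvec = np.linalg.eigh(R)          # eigenvalues in ascending order
    En = eigvec[:, : M - n_sources]             # noise-subspace eigenvectors
    proj = En.conj().T @ steering               # projection onto the noise subspace
    return 1.0 / np.sum(np.abs(proj) ** 2, axis=0)

# Example steering vectors for a uniform linear array with half-wavelength spacing.
M, angles = 8, np.deg2rad(np.arange(-90, 91))
steering = np.exp(1j * np.pi * np.outer(np.arange(M), np.sin(angles)))
```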
TASLP
Autoregressive Moving Average Jointly-Diagonalizable Spatial Covariance Analysis for Joint Source Separation and Dereverberation
This article describes a computationally-efficient statistical approach to joint (semi-)blind source separation and dereverberation for multichannel noisy reverberant mixture signals. A standard approach to source separation is to formulate a generative model of a multichannel mixture spectrogram that consists of source and spatial models representing the time-frequency power spectral densities (PSDs) and spatial covariance matrices (SCMs) of source images, respectively, and find the maximum-likelihood estimates of these parameters. A state-of-the-art blind source separation method in this thread of research is fast multichannel nonnegative matrix factorization (FastMNMF) based on the low-rank PSDs and jointly-diagonalizable full-rank SCMs. To perform mutually-dependent separation and dereverberation jointly, in this paper we integrate both moving average (MA) and autoregressive (AR) models that represent the early reflections and late reverberations of sources, respectively, into the FastMNMF formalism. Using a pretrained deep generative model of speech PSDs as a source model, we realize semi-blind joint speech separation and dereverberation. We derive an iterative optimization algorithm based on iterative projection or iterative source steering for jointly and efficiently updating the AR parameters and the SCMs. Our experimental results showed the superiority of the proposed ARMA extension over its AR- or MA-ablated version in a speech separation and/or dereverberation task.
TASLP
Generalized Fast Multichannel Nonnegative Matrix Factorization Based on Gaussian Scale Mixtures for Blind Source Separation
This paper describes heavy-tailed extensions of a state-of-the-art versatile blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF) from a unified point of view. The common way of deriving such an extension is to replace the multivariate complex Gaussian distribution in the likelihood function with its heavy-tailed generalization, e.g., the multivariate complex Student’s t and leptokurtic generalized Gaussian distributions, and tailor-make the corresponding parameter optimization algorithm. Using a wider class of heavy-tailed distributions called a Gaussian scale mixture (GSM), i.e., a mixture of Gaussian distributions whose variances are perturbed by positive random scalars called impulse variables, we propose GSM-FastMNMF and develop an expectation-maximization algorithm that works even when the probability density function of the impulse variables has no analytical expression. We show that existing heavy-tailed FastMNMF extensions are instances of GSM-FastMNMF and derive a new instance based on the generalized hyperbolic distribution that includes the normal-inverse Gaussian, Student’s t, and Gaussian distributions as special cases. Our experiments show that the normal-inverse Gaussian FastMNMF outperforms the state-of-the-art FastMNMF extensions and ILRMA model in speech enhancement and separation in terms of the signal-to-distortion ratio.
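To make the GSM idea concrete, a small sampling sketch, assuming a unit-variance circular complex Gaussian and inverse-gamma impulse variables (which yields a Student's t marginal; other impulse distributions give other members of the family):

```python
import numpy as np

def sample_gsm(n, nu=2.5, seed=0):
    """Draw n samples from a Gaussian scale mixture: x = sqrt(phi) * z, where
    z is circular complex Gaussian and phi is a positive impulse variable.
    With inverse-gamma phi, the marginal of x is a heavy-tailed Student's t."""
    rng = np.random.default_rng(seed)
    phi = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / nu, size=n)   # inverse-gamma
    z = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
    return np.sqrt(phi) * z

x = sample_gsm(10000)   # heavy-tailed compared with a plain complex Gaussian
```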
ICASSP
Flow-Based Fast Multichannel Nonnegative Matrix Factorization for Blind Source Separation
This paper describes a blind source separation method for multichannel audio signals, called NF-FastMNMF, based on the integration of the normalizing flow (NF) into the multichannel nonnegative matrix factorization with jointly-diagonalizable spatial covariance matrices, a.k.a. FastMNMF. Whereas the NF of flow-based independent vector analysis, called NF-IVA, acts as the demixing matrices to transform an M-channel mixture into M independent sources, the NF of NF-FastMNMF acts as the diagonalization matrices to transform an M-channel mixture into a spatially-independent M-channel mixture represented as a weighted sum of N source images. This diagonalization enables the NF, which has been used only for determined separation because of its bijective nature, to be applicable to non-determined separation. NF-FastMNMF has time-varying diagonalization matrices that are potentially better at handling dynamical data variation than the time-invariant ones in FastMNMF. To have an NF with richer expression capability, the dimension-wise scalings using diagonal matrices originally used in NF-IVA are replaced with linear transformations using upper triangular matrices; in both cases, the diagonal and upper triangular matrices are estimated by neural networks. The evaluation shows that NF-FastMNMF performs well for both determined and non-determined separations of multiple speech utterances by stationary or non-stationary speakers from a noisy reverberant mixture.
2021
SPL
Neural Full-Rank Spatial Covariance Analysis for Blind Source Separation
This paper describes a neural blind source separation (BSS) method based on amortized variational inference (AVI) of a non-linear generative model of mixture signals. A classical statistical approach to BSS is to fit a linear generative model that consists of spatial and source models representing the inter-channel covariances and power spectral densities of sources, respectively. Although the variational autoencoder (VAE) has successfully been used as a non-linear source model with latent features, it should be pretrained from a sufficient amount of isolated signals. Our method, in contrast, enables the VAE-based source model to be trained only from mixture signals. Specifically, we introduce a neural mixture-to-feature inference model that directly infers the latent features from the observed mixture and integrate it with a neural feature-to-mixture generative model consisting of a full-rank spatial model and a VAE-based source model. All the models are optimized jointly such that the likelihood for the training mixtures is maximized in the framework of AVI. Once the inference model is optimized, it can be used for estimating the latent features of sources included in unseen mixture signals. The experimental results show that the proposed method outperformed the state-of-the-art BSS methods based on linear generative models and was comparable to a method based on supervised learning of the VAE-based source model.
Interspeech
Alpha-Stable Autoregressive Fast Multichannel Nonnegative Matrix Factorization for Joint Speech Enhancement and Dereverberation
This paper proposes α-stable autoregressive fast multichannel nonnegative matrix factorization (α-AR-FastMNMF), a robust joint blind speech enhancement and dereverberation method for improved automatic speech recognition in a realistic adverse environment. The state-of-the-art versatile blind source separation method called FastMNMF that assumes the short-time Fourier transform (STFT) coefficients of a direct sound to follow a circular complex Gaussian distribution with jointly-diagonalizable full-rank spatial covariance matrices was extended to AR-FastMNMF with an autoregressive reverberation model. Instead of the light-tailed Gaussian distribution, we use the heavy-tailed α-stable distribution, which also has the reproductive property useful for the additive source modeling, to better deal with the large dynamic range of the direct sound. The experimental results demonstrate that the proposed α-AR-FastMNMF works well as a front-end of an automatic speech recognition system. It outperforms α-AR-ILRMA, which is a special case of α-AR-FastMNMF, and their Gaussian counterparts, i.e., AR-FastMNMF and AR-ILRMA, in terms of the speech signal quality metrics and word error rate.
ICASSP
Autoregressive Fast Multichannel Nonnegative Matrix Factorization For Joint Blind Source Separation And Dereverberation
This paper describes a joint blind source separation and dereverberation method that works adaptively and efficiently in a reverberant noisy environment. The modern approach to blind source separation (BSS) is to formulate a probabilistic model of multichannel mixture signals that consists of a source model representing the time-frequency structures of source spectrograms and a spatial model representing the inter-channel covariance structures of source images. The cutting-edge BSS method in this thread of research is fast multi-channel nonnegative matrix factorization (FastMNMF) that consists of a low-rank source model based on nonnegative matrix factorization (NMF) and a full-rank spatial model based on jointly-diagonalizable spatial covariance matrices. Although FastMNMF is computationally efficient and can deal with both directional sources and diffuse noise simultaneously, its performance is severely degraded in a reverberant environment. To solve this problem, we propose autoregressive FastMNMF (AR-FastMNMF) based on a unified probabilistic model that combines FastMNMF with a blind dereverberation method called weighted prediction error (WPE), where all the parameters are optimized jointly such that the likelihood for observed reverberant mixture signals is maximized. Experimental results showed the superiority of AR-FastMNMF over conventional methods that perform blind dereverberation and BSS jointly or sequentially.
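As a rough single-frequency sketch of the WPE-style autoregressive dereverberation that AR-FastMNMF builds in (the filter length, prediction delay, and PSD weighting here are illustrative; the paper optimizes the AR parameters jointly with the FastMNMF model rather than in this stand-alone fashion):

```python
import numpy as np

def wpe_dereverb_1bin(X, taps=8, delay=3, eps=1e-6):
    """Weighted-prediction-error-style dereverberation for one frequency bin.
    X: (M, T) reverberant STFT frames. Returns dereverberated frames (M, T)."""
    M, T = X.shape
    lam = np.mean(np.abs(X) ** 2, axis=0) + eps        # time-varying PSD weights
    # Stack delayed past frames into a (M * taps, T) matrix.
    Xt = np.zeros((M * taps, T), dtype=X.dtype)
    for k in range(taps):
        shift = delay + k
        Xt[k * M:(k + 1) * M, shift:] = X[:, : T - shift]
    # Weighted least squares for the AR (prediction) filter G.
    Xw = Xt / lam                                      # weight past frames by 1/lambda
    R = Xw @ Xt.conj().T + eps * np.eye(M * taps)
    P = Xw @ X.conj().T
    G = np.linalg.solve(R, P)                          # (M * taps, M) filter
    return X - G.conj().T @ Xt                         # subtract the predicted late reverb
```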
2020
Interspeech
Unsupervised Robust Speech Enhancement Based on Alpha-Stable Fast Multichannel Nonnegative Matrix Factorization
This paper describes multichannel speech enhancement based on a probabilistic model of complex source spectrograms for improving the intelligibility of speech corrupted by undesired noise. The univariate complex Gaussian model with the reproductive property supports the additivity of source complex spectrograms and forms the theoretical basis of nonnegative matrix factorization (NMF). Multichannel NMF (MNMF) is an extension of NMF based on the multivariate complex Gaussian model with spatial covariance matrices (SCMs), and its state-of-the-art variant called FastMNMF with jointly-diagonalizable SCMs achieves faster decomposition based on the univariate Gaussian model in the transformed domain where all time-frequency-channel elements are independent. Although a heavy-tailed extension of FastMNMF has been proposed to improve the robustness against impulsive noise, the source additivity has never been considered. The multivariate α-stable distribution does not have the reproductive property for the shape matrix parameter. This paper, therefore, proposes a heavy-tailed extension called α-stable FastMNMF which works in the transformed domain to use a univariate complex α-stable model, satisfying the reproductive property for any tail lightness parameter α and allowing the α-fractional Wiener filtering based on the element-wise source additivity. The experimental results show that α-stable FastMNMF with α = 1.8 significantly outperforms Gaussian FastMNMF (α=2).
TASLP
15th IEEE Signal Processing Society (SPS) Japan Student Journal Paper Award
Fast Multichannel Nonnegative Matrix Factorization with Directivity-Aware Jointly-Diagonalizable Spatial Covariance Matrices for Blind Source Separation
This paper describes a computationally-efficient blind source separation (BSS) method based on the independence, low-rankness, and directivity of the sources. A typical approach to BSS is unsupervised learning of a probabilistic model that consists of a source model representing the time-frequency structure of source images and a spatial model representing their inter-channel covariance structure. Building upon the low-rank source model based on nonnegative matrix factorization (NMF), which has been considered to be effective for inter-frequency source alignment, multichannel NMF (MNMF) assumes source images to follow multivariate complex Gaussian distributions with unconstrained full-rank spatial covariance matrices (SCMs). An effective way of reducing the computational cost and initialization sensitivity of MNMF is to restrict the degree of freedom of SCMs. While a variant of MNMF called independent low-rank matrix analysis (ILRMA) severely restricts SCMs to rank-1 matrices under an idealized condition that only directional and less-echoic sources exist, we restrict SCMs to jointly-diagonalizable yet full-rank matrices in a frequency-wise manner, resulting in FastMNMF1. To help inter-frequency source alignment, we then propose FastMNMF2 that shares the directional feature of each source over all frequency bins. To explicitly consider the directivity or diffuseness of each source, we also propose rank-constrained FastMNMF that enables us to individually specify the ranks of SCMs. Our experiments showed the superiority of FastMNMF over MNMF and ILRMA in speech separation and the effectiveness of the rank constraint in speech enhancement.
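A brief NumPy illustration of the joint-diagonalization idea at one time-frequency bin: each source's SCM has the form Q_f^{-1} diag(g_n) Q_f^{-H}, so transforming the mixture by Q_f makes the channels independent and reduces source estimation to element-wise Wiener masking. The matrices below are random placeholders, not estimates produced by FastMNMF:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 3                               # channels, sources
Q = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))  # diagonalizer
g = rng.random((N, M)) + 0.1              # per-source diagonal SCM weights
lam = rng.random(N) + 0.1                 # per-source power at this TF bin

# Jointly-diagonalizable full-rank SCMs: Sigma_n = Q^{-1} diag(g_n) Q^{-H}
Q_inv = np.linalg.inv(Q)
Sigma = np.array([Q_inv @ np.diag(g_n) @ Q_inv.conj().T for g_n in g])

# In the transformed domain y = Q x, separation is an element-wise Wiener mask.
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # observed mixture
y = Q @ x
den = (lam[:, None] * g).sum(axis=0)                      # total variance per channel
y_sources = (lam[:, None] * g) / den * y                  # masked transformed components
x_sources = (Q_inv @ y_sources.T).T                       # back to the microphone domain
```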
TASLP
A Flow-Based Deep Latent Variable Model for Speech Spectrogram Modeling and Enhancement
This paper describes a deep latent variable model of speech power spectrograms and its application to semi-supervised speech enhancement with a deep speech prior. By integrating two major deep generative models, a variational autoencoder (VAE) and a normalizing flow (NF), in a mutually-beneficial manner, we formulate a flexible latent variable model called the NF-VAE that can extract low-dimensional latent representations from high-dimensional observations, akin to the VAE, and does not need to explicitly represent the distribution of the observations, akin to the NF. In this paper, we consider a variant of NF called the generative flow (GF a.k.a. Glow) and formulate a latent variable model called the GF-VAE. We experimentally show that the proposed GF-VAE is better than the standard VAE at capturing fine-structured harmonics of speech spectrograms, especially in the high-frequency range. A similar finding is also obtained when the GF-VAE and the VAE are used to generate speech spectrograms from latent variables randomly sampled from the standard Gaussian distribution. Lastly, when these models are used as speech priors for statistical multichannel speech enhancement, the GF-VAE outperforms the VAE and the GF.
SPL
Flow-Based Independent Vector Analysis for Blind Source Separation
This paper describes a time-varying extension of independent vector analysis (IVA) based on the normalizing flow (NF), called NF-IVA, for determined blind source separation of multichannel audio signals. As in IVA, NF-IVA estimates demixing matrices that transform mixture spectra to source spectra in the complex-valued spatial domain such that the likelihood of those matrices for the mixture spectra is maximized under some non-Gaussian source model. While IVA performs a time-invariant bijective linear transformation, NF-IVA performs a series of time-varying bijective linear transformations (flow blocks) adaptively predicted by neural networks. To regularize such transformations, we introduce a soft volume-preserving (VP) constraint. Given mixture spectra, the parameters of NF-IVA are optimized by gradient descent with backpropagation in an unsupervised manner. Experimental results show that NF-IVA successfully performs speech separation in reverberant environments with different numbers of speakers and microphones and that NF-IVA with the VP constraint outperforms NF-IVA without it, standard IVA with iterative projection, and improved IVA with gradient descent.
EUSIPCO
Semi-supervised Multichannel Speech Separation Based on a Phone- and Speaker-Aware Deep Generative Model of Speech Spectrograms
This paper describes a semi-supervised multichannel speech separation method that uses clean speech signals with frame-wise phonetic labels and sample-level speaker labels for pre-training. A standard approach to statistical source separation is to formulate a probabilistic model of multichannel mixture spectrograms that combines source models representing the time-frequency characteristics of sources with spatial models representing the covariance structure between channels. For speech separation and enhancement, deep generative models with latent variables have successfully been used as source models. The parameters of such a speech model can be trained beforehand from clean speech signals with a variational autoencoder (VAE) or its conditional variant (CVAE) that takes speaker labels as auxiliary inputs. Because human speech is characterized by both phonetic features and speaker identities, we propose a probabilistic model that combines a phone- and speaker-aware deep speech model with a full-rank spatial model. Our speech model is trained with a CVAE taking both phone and speaker labels as conditions. Given speech mixtures, the spatial covariance matrices, latent variables of sources, and phone and speaker labels of sources are jointly estimated. Comparative experimental results showed that the performance of speech separation can be improved by explicitly considering phonetic features and/or speaker identities.
EUSIPCO
Fast Multichannel Correlated Tensor Factorization for Blind Source Separation
This paper describes an ultimate covariance-aware multichannel extension of nonnegative matrix factorization (NMF) for blind source separation (BSS). A typical approach to BSS is to integrate a low-rank source model with a full-rank spatial model, as done in multichannel NMF (MNMF) based on full-rank spatial covariance matrices (CMs) or in its efficient version named FastMNMF based on jointly-diagonalizable spatial CMs. The NMF-based phase-unaware source model, however, can deal with only the positive cooccurrence relations between time-frequency bins. To overcome this limitation, we propose an efficient multichannel extension of correlated tensor factorization (CTF) named FastMCTF based on jointly-diagonalizable temporal, frequency, and spatial CMs. Integration of the jointly-diagonalizable full-rank source model proposed by FastCTF with the jointly-diagonalizable full-rank spatial model proposed by FastMNMF enables us to completely consider the positive and negative covariance relations between frequency bins, time frames, and channels. We derive a convergence-guaranteed parameter estimation algorithm based on the multiplicative update and iterative projection and experimentally show the potential of the proposed method.
2019
TASLP
17th IEEE Kansai Section Student Paper Award
Semi-supervised Multichannel Speech Enhancement with a Deep Speech Prior
This paper describes a semi-supervised multichannel speech enhancement method that only uses clean speech data for prior training. Although multichannel nonnegative matrix factorization (MNMF) and its constrained variant called independent low-rank matrix analysis (ILRMA) have successfully been used for unsupervised speech enhancement, the low-rank assumption on the power spectral densities (PSDs) of all sources (speech and noise) does not hold in reality. To solve this problem, we replace a low-rank model of speech with a deep generative model in the framework of MNMF or ILRMA, i.e., formulate a probabilistic model of noisy speech by integrating a deep speech model, a low-rank noise model, and a full-rank or rank-1 model of spatial characteristics of speech and noise. The deep speech model is trained from clean speech data in an unsupervised auto-encoding variational Bayesian manner. Given multichannel noisy speech spectra, the full-rank or rank-1 spatial covariance matrices and PSDs of speech and noise are estimated in an unsupervised maximum-likelihood manner. Experimental results showed that the full-rank version of the proposed method was significantly better than MNMF, ILRMA, and the rank-1 version. We confirmed that the initialization-sensitivity and local-optimum problems of MNMF with many spatial parameters can be solved by incorporating the precise speech model.
RO-MAN
Best Conference Paper Award
Audio-Visual SLAM towards Human Tracking and Human-Robot Interaction in Indoor Environments
We propose a novel audio-visual simultaneous localization and mapping (SLAM) framework that exploits human pose and acoustic speech of human partners to allow a robot equipped with a microphone array and a monocular camera to track, map, and interact with human sound sources in an indoor environment. Since human interaction is characterized by features perceived not only in the visual modality but also in the acoustic modality, SLAM systems must utilize information from both modalities. Using a state-of-the-art beamforming technique, we obtain sound components corresponding to speech and noise, and obtain Direction-of-Arrival (DoA) estimates of active sound sources as useful representations of observed features in the acoustic modality. Through human pose estimated by a monocular camera, we obtain the relative positions of humans as useful representations of observed features in the visual modality. Using these techniques, we attempt to eliminate restrictions imposed by intermittent speech, noisy and reverberant periods, triangulation of sound-source range, and limited visual fields of view, and subsequently perform early fusion on these representations. We develop a system that allows for complementary action between audio-visual sensor modalities in the simultaneous mapping of multiple human sound sources and the localization of the observer position.
EUSIPCO
Cauchy Multichannel Speech Enhancement with a Deep Speech Prior
We propose a semi-supervised multichannel speech enhancement system based on a probabilistic model which assumes that both speech and noise follow the heavy-tailed multivariate complex Cauchy distribution. As we advocate, this allows handling strong and adverse noisy conditions. Consequently, the model is parameterized by the source magnitude spectrograms and the source spatial scatter matrices. To deal with the non-additivity of scatter matrices, our first contribution is to perform the enhancement on a projected space. Then, our second contribution is to combine a latent variable model for speech, which is trained by following the variational autoencoder framework, with a low-rank model for the noise source. At test time, an iterative inference algorithm is applied, which produces estimated parameters to use for separation. The speech latent variables are estimated first from the noisy speech and then updated by a gradient descent method, while a majorization-equalization strategy is used to update both the noise and the spatial parameters of both sources. Our experimental results show that the Cauchy model outperforms the state-of-the-art methods. The standard deviation scores also reveal that the proposed method is more robust against non-stationary noise.
EUSIPCO
Fast Multichannel Source Separation Based on Jointly Diagonalizable Spatial Covariance Matrices
This paper describes a versatile method that accelerates multichannel source separation methods based on full-rank spatial modeling. A popular approach to multichannel source separation is to integrate a spatial model with a source model for estimating the spatial covariance matrices (SCMs) and power spectral densities (PSDs) of each sound source in the time-frequency domain. One of the most successful examples of this approach is multichannel nonnegative matrix factorization (MNMF) based on a full-rank spatial model and a low-rank source model. MNMF, however, is computationally expensive and often works poorly due to the difficulty of estimating the unconstrained full-rank SCMs. Instead of restricting the SCMs to rank-1 matrices with the severe loss of the spatial modeling ability as in independent low-rank matrix analysis (ILRMA), we restrict the SCMs of each frequency bin to jointly-diagonalizable but still full-rank matrices. For such a fast version of MNMF, we propose a computationally-efficient and convergence-guaranteed algorithm that is similar in form to that of ILRMA. Similarly, we propose a fast version of a state-of-the-art speech enhancement method based on a deep speech model and a low-rank noise model. Experimental results showed that the fast versions of MNMF and the deep speech enhancement method were several times faster and performed even better than the original versions of those methods, respectively.
ICASSP
A Deep Generative Model of Speech Complex Spectrograms
This paper proposes an approach to the joint modeling of the short-time Fourier transform magnitude and phase spectrograms with a deep generative model. We assume that the magnitude follows a Gaussian distribution and the phase follows a von Mises distribution. To improve the consistency of the phase values in the time-frequency domain, we also apply the von Mises distribution to the phase derivatives, i.e., the group delay and the instantaneous frequency. Based on these assumptions, we explore and compare several combinations of loss functions for training our models. Built upon the variational autoencoder framework, our model consists of three convolutional neural networks acting as an encoder, a magnitude decoder, and a phase decoder. In addition to the latent variables, we propose to also condition the phase estimation on the estimated magnitude. Evaluated for a time-domain speech reconstruction task, our models could generate speech with a high perceptual quality and a high intelligibility.
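As a sketch of the kind of phase loss the abstract describes, here is a von Mises negative log-likelihood in NumPy; the fixed concentration used here is a simplification, since the paper's convolutional decoders predict the distribution parameters:

```python
import numpy as np
from scipy.special import i0

def von_mises_nll(theta, mu, kappa):
    """Negative log-likelihood of angles theta (e.g., phase, group delay, or
    instantaneous frequency) under a von Mises distribution with mean mu and
    concentration kappa. All quantities are in radians and broadcastable."""
    return -kappa * np.cos(theta - mu) + np.log(2 * np.pi * i0(kappa))

# Toy usage: the loss grows as the predicted mean drifts from the true phase.
theta = np.array([0.1, 1.5, -2.0])
print(von_mises_nll(theta, mu=theta, kappa=5.0).mean())        # low
print(von_mises_nll(theta, mu=theta + 1.0, kappa=5.0).mean())  # higher
```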
2018
Deep Neural Network Based Multichannel Audio Source Separation
This chapter presents a multichannel audio source separation framework where deep neural networks (DNNs) are used to model the source spectra and combined with the classical multichannel Gaussian model to exploit the spatial information. The parameters are estimated in an iterative expectation-maximization (EM) fashion and used to derive a multichannel Wiener filter. Different design choices and their impact on the performance are discussed. They include the cost functions for DNN training, the number of parameter updates, the use of multiple DNNs, and the use of weighted parameter updates. Finally, we present its application to a speech enhancement task and a music separation task. The experimental results show the benefit of the multichannel DNN-based approach over a single-channel DNN-based approach and the multichannel nonnegative matrix factorization based iterative EM framework.
2017
CSL
ISCA Award for the Best Review Paper published in Computer Speech and Language (2016-2020)
An analysis of environment, microphone and data simulation mismatches in robust speech recognition
Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings where the acoustic conditions of the training data match (or cover) those of the test data. Few studies have systematically assessed the impact of acoustic mismatches between training and test data, especially concerning recent speech enhancement and state-of-the-art ASR techniques. In this article, we study this issue in the context of the CHiME-3 dataset, which consists of sentences spoken by talkers situated in challenging noisy environments recorded using a 6-channel tablet-based microphone array. We provide a critical analysis of the results published on this dataset for various signal enhancement, feature extraction, and ASR backend techniques and perform a number of new experiments in order to separately assess the impact of different noise environments, different numbers and positions of microphones, or simulated vs. real data on speech enhancement and ASR performance. We show that, with the exception of minimum variance distortionless response (MVDR) beamforming, most algorithms perform consistently on real and simulated data and can benefit from training on simulated data. We also find that training on different noise environments and different microphones barely affects the ASR performance, especially when several environments are present in the training data: only the number of microphones has a significant impact. Based on these results, we introduce the CHiME-4 Speech Separation and Recognition Challenge, which revisits the CHiME-3 dataset and makes it more challenging by reducing the number of microphones available for testing.
2016
TASLP
6th IEEE Signal Processing Society (SPS) Japan Young Author Best Paper Award
Multichannel audio source separation with deep neural networks
This article addresses the problem of multichannel audio source separation. We propose a framework where deep neural networks (DNNs) are used to model the source spectra and combined with the classical multichannel Gaussian model to exploit the spatial information. The parameters are estimated in an iterative expectation-maximization (EM) fashion and used to derive a multichannel Wiener filter. We present an extensive experimental study to show the impact of different design choices on the performance of the proposed technique. We consider different cost functions for the training of DNNs, namely the probabilistically motivated Itakura-Saito divergence, and also Kullback-Leibler, Cauchy, mean squared error, and phase-sensitive cost functions. We also study the number of EM iterations and the use of multiple DNNs, where each DNN aims to improve the spectra estimated by the preceding EM iteration. Finally, we present its application to a speech enhancement problem. The experimental results show the benefit of the proposed multichannel approach over a single-channel DNN-based approach and the conventional multichannel nonnegative matrix factorization based iterative EM algorithm.
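A compact NumPy sketch of the multichannel Wiener filtering step for a single time-frequency bin, assuming the per-source power spectra `v` come from the DNNs and the spatial covariance matrices `R` come from the EM updates (both are random placeholders below):

```python
import numpy as np

def multichannel_wiener(x, v, R, eps=1e-6):
    """Estimate source images for one TF bin.
    x: (M,) mixture; v: (N,) source power spectra; R: (N, M, M) spatial covariances."""
    Sigma = v[:, None, None] * R                          # per-source image covariances
    Sigma_x = Sigma.sum(axis=0) + eps * np.eye(len(x))    # mixture covariance
    W = Sigma @ np.linalg.inv(Sigma_x)                    # (N, M, M) Wiener filters
    return W @ x                                          # (N, M) source image estimates

rng = np.random.default_rng(0)
M, N = 2, 3
A = rng.standard_normal((N, M, M)) + 1j * rng.standard_normal((N, M, M))
R = A @ A.conj().transpose(0, 2, 1)                       # random PSD spatial matrices
v = rng.random(N) + 0.1
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)
print(multichannel_wiener(x, v, R))
```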
EUSIPCO
Multichannel music separation with deep neural networks
This article addresses the problem of multichannel music separation. We propose a framework where the source spectra are estimated using deep neural networks and combined with spatial covariance matrices to encode the source spatial characteristics. The parameters are estimated in an iterative expectation-maximization fashion and used to derive a multichannel Wiener filter. We evaluate the proposed framework for the task of music separation on a large dataset. Experimental results show that the method we describe performs consistently well in separating singing voice and other instruments from realistic musical mixtures.
2015
ASRU
Robust ASR using neural network based speech enhancement and feature simulation
Sunit Sivasankaran,
Aditya Arie Nugraha,
Emmanuel Vincent,
Juan Andrés Morales Cordovilla,
Siddharth Dalmia,
Irina Illina,
and Antoine Liutkus
In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU),
2015
We consider the problem of robust automatic speech recognition (ASR) in the context of the CHiME-3 Challenge. The proposed system combines three contributions. First, we propose a deep neural network (DNN) based multichannel speech enhancement technique, where the speech and noise spectra are estimated using a DNN based regressor and the spatial parameters are derived in an expectation-maximization (EM) like fashion. Second, a conditional restricted Boltzmann machine (CRBM) model is trained using the obtained enhanced speech and used to generate simulated training and development datasets. The goal is to increase the similarity between simulated and real data, so as to increase the benefit of multicondition training. Finally, we make some changes to the ASR backend. Our system ranked 4th among 25 entries.
2014
ASMP
Single-channel dereverberation by feature mapping using cascade neural networks for robust distant speaker identification and speech recognition
Aditya Arie Nugraha,
Kazumasa Yamamoto,
and Seiichi Nakagawa
EURASIP Journal on Audio, Speech, and Music Processing,
2014
We present a feature enhancement method that uses neural networks (NNs) to map reverberant features in the log-melspectral domain to their corresponding anechoic features. The mapping is done by cascade NNs trained using the Cascade2 algorithm with an implementation of segment-based normalization. Experiments using speaker identification (SID) and automatic speech recognition (ASR) systems were conducted to evaluate the method. The SID experiments were conducted using our own simulated and real reverberant datasets, while the CENSREC-4 evaluation framework was used to evaluate the ASR system. The proposed method could remarkably improve the performance of both systems using limited stereo data and low speaker-variant data as the training data. In the SID evaluation, we achieved 26.0% and 34.8% error rate reduction (ERR) relative to the baseline on simulated and real data, respectively, using only one pair of utterances for the matched-condition cases. Then, using a combined dataset containing 15 pairs of utterances by one speaker from three positions in a room, we reached a 93.7% average identification rate (three known and two unknown positions), a 42.2% ERR relative to the use of cepstral mean normalization (CMN). In the ASR evaluation, using 40 pairs of utterances as the NN training data, we achieved a 78.4% ERR relative to the baseline with simulated utterances by five speakers. Moreover, we achieved 75.4% and 71.6% ERR relative to the baseline with real utterances by five speakers and one speaker, respectively.
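A minimal sketch of the feature-mapping idea with segment-based normalization, using an off-the-shelf MLP regressor as a stand-in for the cascade NNs trained with the Cascade2 algorithm in the paper (the feature dimensions, toy data, and segment statistic are illustrative assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def segment_normalize(feats):
    """Subtract the per-utterance (segment) mean from log-mel features (T, D)."""
    return feats - feats.mean(axis=0, keepdims=True)

# Stereo training data: reverberant features and their anechoic counterparts.
rng = np.random.default_rng(0)
T, D = 500, 24
clean = rng.standard_normal((T, D))
reverb = clean + 0.5 * rng.standard_normal((T, D))       # toy stand-in for reverberation

mapper = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mapper.fit(segment_normalize(reverb), segment_normalize(clean))

# At test time, map normalized reverberant features toward anechoic ones.
enhanced = mapper.predict(segment_normalize(reverb))
```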
2013
APSIPA
Single channel dereverberation method in logmelspectral domain using limited stereo data for distant speaker identification
Aditya Arie Nugraha,
Kazumasa Yamamoto,
and Seiichi Nakagawa
In Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA),
2013
In this paper, we present a feature enhancement method that uses neural networks (NNs) to map reverberant features in the log-melspectral domain to their corresponding anechoic features. The mapping is done by cascade NNs trained using the Cascade2 algorithm with an implementation of segment-based normalization. We assumed that the feature dimensions were independent of each other and experimented with several assumptions on the room transfer function for each dimension. A speaker identification system was used to evaluate the method. Using limited stereo data, we could improve the identification rate on simulated and real datasets. On the simulated dataset, we showed that the proposed method is effective for both noiseless and noisy reverberant environments with various noise and reverberation characteristics. On the real dataset, we showed that by using a configuration of 6 independent NNs for 24-dimensional features and only 1 pair of utterances, we could obtain a 35% average error reduction relative to the baseline, which employed cepstral mean normalization (CMN).
SP/IPSJ-SLP
Single Channel Dereverberation Method by Feature Mapping Using Limited Stereo Data
Aditya Arie Nugraha,
Kazumasa Yamamoto,
and Seiichi Nakagawa
Technical Report of Institute of Electronics, Information and Communication Engineers (IEICE),
2013
In this paper, we present a feature enhancement method that uses neural networks (NNs) to map reverberant features in the log-melspectral domain to their corresponding anechoic features. The mapping is done by cascade NNs trained using the Cascade2 algorithm with an implementation of segment-based normalization. Experiments using speaker identification (SID) and automatic speech recognition (ASR) systems were conducted to evaluate the method. The SID experiments were conducted using real noisy reverberant datasets, while the CENSREC-4 evaluation framework was used to evaluate the ASR system. Using limited stereo data consisting of simultaneously recorded clean speech and reverberant speech, the proposed method could remarkably improve the performance of both systems.
2012
ASJ
Improving distant speaker identification robustness using a nonlinear regression based dereverberation method in feature domain
Aditya Arie Nugraha
and Seiichi Nakagawa
In Proceedings of the Autumn Meeting of Acoustical Society of Japan,
2012
Using a reverberant speech signal captured by a distant-talking microphone as the input of a speaker identification system degrades its performance. In this paper, we present a single-channel nonlinear-regression-based dereverberation method that works in the feature domain. Artificial neural networks were trained using the Cascade2 algorithm on stereo data to compensate for the reverberation effect by mapping the reverberant signal to the clean signal on 24-dimensional log-melspectral features. We also employ segment-level normalization to compensate for the power difference between the clean signal and the reverberant signal. Using the proposed method, we could enhance the signal and improve the identification rate of a distant speaker identification system.
2011
TSSA
Performance evaluation of audio-video streaming service in Keerom, Papua using integrated audio-video performance test tool
Yudi Satria Gondokaryono,
Yoanes Bandung,
Joko Ari Wibowo,
Aditya Arie Nugraha,
Bryan Yonathan,
and Dwi Ramadhianto
In Proceedings of International Conference on Telecommunication Systems, Services, and Applications (TSSA),
2011
This study compared several video codecs, audio codecs, audio bit rates, and video bit rates to determine the quality of an audio-video streaming service on the Keerom, Papua network, whose average capacity is 1.5 Mbps. MPEG audio and AC-3 were chosen as the audio codecs because of their characteristics, while MPEG-4 and H.264 were used as the video codecs. Audio bit rates of 64 and 128 kbps and video bit rates of 64, 128, and 256 kbps were tested. The experimental results show that the quality of the audio-video streaming service was best when MPEG audio at 64 kbps was combined with MPEG-4 video at 256 kbps. The test results will serve as a reference for the later implementation of the audio-video streaming service on the Keerom, Papua network.
2010
AEEI
Web based multimedia conference system for digital learning in rural elementary school
Aska Narendra,
Aditya Arie Nugraha,
Yoanes Bandung,
Armein Z. R. Langi,
and Bambang Pharmasetiawan
Advances in Electrical Engineering and Informatics,
2010
This paper describes the process of designing a web-based multimedia conferencing system that will be used to support digital learning for elementary school in rural areas and implementing them in some network testbeds in Bandung, Subang, and Cianjur. The system must be able to send each of the constituent media, namely video, audio, and other materials (e.g. slide presentations) independently so that the learning process between student and teacher could still be running even if one of the media is absent. In addition, the multimedia conferencing system must also be easily operated independently by an elementary school teacher in rural areas with a minimum computer mastery level. The result is a product that is expected to be useful for improving the quality of primary education especially in rural areas through ICT applications.