publications | Aditya Arie Nugraha

2026

ICASSP
Sampling-Rate-Agnostic Speech Super-Resolution Based on Gaussian Process Dynamical Systems with Deep Kernel Learning

Aditya Arie Nugraha, Diego Di Carlo, Yoshiaki Bando, Mathieu Fontaine, and Kazuyoshi Yoshii

In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2026

This paper describes genuine audio super-resolution (SR) that aims to estimate a continuous signal from a discrete signal in a sampling-rate-agnostic manner based on deep kernel learning (DKL). We assume that the discrete signal is obtained by observing the continuous signal at arbitrary, possibly irregular time points. From a statistical point of view, audio SR can thus be tackled as curve fitting, an inverse problem based on a probabilistic model that represents the generation of a discrete signal from a continuous signal (curve). To deal with speech signals with complicated temporal dynamics, we propose a continuous-time-domain speech SR method that uses a nonlinear state-space model called a Gaussian process dynamical system (GPDS) as a prior on the speech signal. Specifically, we assume the speech signal to follow a GP conditioned by a latent signal following another GP. Given discrete observations, we approximate the latent posterior via variational inference with neural-based DKL and then sample the speech signal on an arbitrarily finer time grid. We empirically confirmed that the proposed method, GPDS-SR, remains robust to missing and irregular input samples and supports prediction on even nonstandard output rates while estimating plausible high-frequency content consistent with the observed context.
@inproceedings{nugraha2026gpdssr, author = {Nugraha, Aditya Arie and Di Carlo, Diego and Bando, Yoshiaki and Fontaine, Mathieu and Yoshii, Kazuyoshi}, title = {Sampling-Rate-Agnostic Speech Super-Resolution Based on Gaussian Process Dynamical Systems with Deep Kernel Learning}, booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, year = {2026}, month = may, pages = {15442--15446}, address = {Barcelona, Spain}, url = {https://ieeexplore.ieee.org/document/11462432}, doi = {10.1109/ICASSP55912.2026.11462432} }
ICASSP
SIRUP: A Diffusion-Based Virtual Upmixer of Steering Vectors for Highly-Directive Spatialization with First-Order Ambisonics

Emilio Picard, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, and Kazuyoshi Yoshii

In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2026

This paper presents virtual upmixing of steering vectors captured by a fewer-channel spherical microphone array. This challenge has conventionally been addressed by recovering the directions and signals of sound sources from first-order ambisonics (FOA) data, and then rendering the higher-order ambisonics (HOA) data using a physics-based acoustic simulator. This approach, however, struggles to handle the mutual dependency between the spatial directivity of source estimation and the spatial resolution of FOA data. Our method, named SIRUP, employs a latent diffusion model architecture. Specifically, a variational autoencoder (VAE) is used to learn a compact encoding of the HOA data in a latent space and a diffusion model is then trained to generate the HOA embeddings, conditioned by the FOA data. Experimental results showed that SIRUP achieved a significant improvement compared to FOA systems for steering vector upmixing, source localization, and speech denoising.
@inproceedings{picard2026sirup, author = {Picard, Emilio and Di Carlo, Diego and Nugraha, Aditya Arie and Fontaine, Mathieu and Yoshii, Kazuyoshi}, title = {{SIRUP}: A Diffusion-Based Virtual Upmixer of Steering Vectors for Highly-Directive Spatialization with First-Order Ambisonics}, booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, year = {2026}, month = may, pages = {14707--14711}, address = {Barcelona, Spain}, url = {https://ieeexplore.ieee.org/document/11464234}, doi = {10.1109/ICASSP55912.2026.11464234} }
ICASSP
Physics-Informed Learning of Neural Scattering Fields Towards Measurement-Free Mesh-To-HRTF Estimation

Tancrède Martinez, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, and Kazuyoshi Yoshii

In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2026

This paper describes neural simulation of the scattered pressure field from a plane wave around a scattering object in both continuous 2D and 3D domains. This task has typically been treated as a regression problem that aims to train a physics-informed neural network (PINN) using pressure measurements at discrete positions. This approach, however, needs to train the whole network for each incident wave direction. To address this, we propose a measurement-free simulator based on a PINN purely driven by the Helmholtz equation with the Robin boundary condition and the Sommerfeld radiation condition with the aid of the perfectly matched layer (PML) framework. More specifically, we design a physics-informed scattering hypernetwork (PHISK) that can generalize to incident waves from any direction via low-rank adaptation (LoRA) of a PINN trained for a specific configuration. The experiment shows that the proposed method accurately simulated sound scattering around various objects, adapting to unseen incident wave directions with minimal performance loss, and realized reasonable simulation of head-related transfer functions (HRTFs) from complex mesh data of a human head.
@inproceedings{tancrede2026phisk, author = {Martinez, Tancrède and Di Carlo, Diego and Nugraha, Aditya Arie and Fontaine, Mathieu and Yoshii, Kazuyoshi}, title = {Physics-Informed Learning of Neural Scattering Fields Towards Measurement-Free Mesh-To-{HRTF} Estimation}, booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, year = {2026}, month = may, pages = {22577--22581}, address = {Barcelona, Spain}, url = {https://ieeexplore.ieee.org/document/11462698}, doi = {10.1109/ICASSP55912.2026.11462698} }

2025

APSIPA
Visually-Informed Multichannel Sound Source Separation Based on 3D Gaussian Primitives

Haruaki Asano, Ryunosuke Nihei, Yoshiaki Bando, Aditya Arie Nugraha, Diego Di Carlo, Hiroyuki Ueda, Yosuke Ito, and Kazuyoshi Yoshii

In Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Oct 2025

This paper proposes visually-informed sound source separation for audio-visual understanding of indoor scenes captured by distributed microphone arrays and cameras. Our approach leverages the 3D information of sound-emitting objects, reconstructed via 3D Gaussian splatting (3DGS), to overcome a limitation of modern blind source separation methods like multichannel nonnegative matrix factorization (MNMF). While adaptable and potentially performant, the iterative optimization of MNMF often converges to poor local minima due to the highly-expressive full-rank spatial covariance matrices (SCMs) of sources. Our key idea is to treat the set of 3D Gaussians representing a sizable sound source object as a collection of sub-sources that share an audio signal but have unique emission weights, both of which are to be estimated jointly from an observed mixture. To enforce this structure, we guide MNMF by regularizing the SCM of each source object at each frequency. Specifically, we use a prior that centers the SCM estimate around a weighted sum of theoretical SCMs, which are analytically derived from the 3D Gaussian positions. Experiments with simulated data, featuring two 3D human models, demonstrated the effectiveness of the proposed method. To our knowledge, this is the first work to use 3D Gaussians as a common primitive for joint audio-visual analysis.
@inproceedings{asano2025visuallyinformed, author = {Asano, Haruaki and Nihei, Ryunosuke and Bando, Yoshiaki and Nugraha, Aditya Arie and Di Carlo, Diego and Ueda, Hiroyuki and Ito, Yosuke and Yoshii, Kazuyoshi}, title = {Visually-Informed Multichannel Sound Source Separation Based on 3D Gaussian Primitives}, booktitle = {Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)}, year = {2025}, month = oct, pages = {36--41}, address = {Singapore}, url = {https://ieeexplore.ieee.org/document/11249412}, doi = {10.1109/APSIPAASC65261.2025.11249412} }
EUSIPCO
SHAMaNS: Sound Localization with Hybrid Alpha-Stable Spatial Measure and Neural Steerer

Diego Di Carlo, Mathieu Fontaine, Aditya Arie Nugraha, Yoshiaki Bando, and Kazuyoshi Yoshii

In Proceedings of European Signal Processing Conference (EUSIPCO), Sep 2025

This paper describes a sound source localization (SSL) technique that combines an α-stable model for the observed signal with a neural network-based approach for modeling steering vectors. Specifically, a physics-informed neural network, referred to as Neural Steerer, is used to interpolate measured steering vectors (SVs) on a fixed microphone array. This allows for a more robust estimation of the so-called α-stable spatial measure, which represents the most plausible direction of arrival (DOA) of a target signal. As an α-stable model for the non-Gaussian case (α ∈ (0, 2)) theoretically defines a unique spatial measure, we choose to leverage it to account for residual reconstruction error of the Neural Steerer in the downstream tasks. The objective scores indicate that our proposed technique outperforms state-of-the-art methods in the case of multiple sound sources.
@inproceedings{dicarlo2025shamans, author = {Di Carlo, Diego and Fontaine, Mathieu and Nugraha, Aditya Arie and Bando, Yoshiaki and Yoshii, Kazuyoshi}, title = {{SHAMaNS}: Sound Localization with Hybrid Alpha-Stable Spatial Measure and Neural Steerer}, booktitle = {Proceedings of European Signal Processing Conference (EUSIPCO)}, year = {2025}, month = sep, pages = {216--220}, address = {Palermo, Italy}, preprint = {https://arxiv.org/abs/2506.18954}, }
WASPAA
Physically Informed Spatial Regularization for Sound Event Localization and Detection

Haocheng Liu, Diego Di Carlo, Aditya Arie Nugraha, Kazuyoshi Yoshii, Gaël Richard, and Mathieu Fontaine

In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct 2025

Building Sound Event Localization and Detection (SELD) models that are robust to diverse acoustic environments remains one of the major challenges in multichannel signal processing, as reflections and reverberation can significantly confuse both the source direction and event detection. Introducing priors such as microphone geometry or room impulse response (RIR) into the model has proven effective in addressing this issue. Existing methods typically incorporate such priors in a deterministic way, often through data augmentation to enlarge data diversity. However, the uncertainty arising from the complex nature of audio acoustics remains largely underexplored in the SELD literature and naturally calls for stochastic modeling of acoustic priors. In this paper, we propose regularizing deep-learning-based SELD models with a physically constructed spatial covariance matrix (SCM) based on the estimated direction of arrival (DOA) and sound event detection (SED).
@inproceedings{liu2025physicallyinformed, author = {Liu, Haocheng and Di Carlo, Diego and Nugraha, Aditya Arie and Yoshii, Kazuyoshi and Richard, Gaël and Fontaine, Mathieu}, title = {Physically Informed Spatial Regularization for Sound Event Localization and Detection}, booktitle = {Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}, year = {2025}, month = oct, pages = {1--5}, address = {Tahoe City, CA, USA}, url = {https://ieeexplore.ieee.org/document/11230919}, preprint = {https://hal.science/hal-05244860v1}, doi = {10.1109/WASPAA66052.2025.11230919} }
APSIPA
Joint Separation and Tracking of Moving Sources with Distributed Microphone Arrays Based on Time-Varying Inertial Spatial Models

Ryunosuke Nihei, Yoshiaki Bando, Aditya Arie Nugraha, Diego Di Carlo, Hiroyuki Ueda, Yosuke Ito, and Kazuyoshi Yoshii

In Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Oct 2025

This paper describes the first attempt at separation and tracking (3D localization) of multiple moving sound sources using multiple microphone arrays fixed at known locations in an indoor environment. As for static sources, location-dependent priors have been incorporated on the time-invariant spatial covariance matrices (SCMs) of sources in the statistical framework of blind source separation based on multichannel nonnegative matrix factorization (MNMF), achieving the maximum likelihood estimation of source locations. One may thus make both the SCMs and their priors vary over time to deal with source movements. This naive extension, however, fails to localize sources when the sources are inactive, yielding non-smooth, non-continuous trajectory estimates. To solve this problem, we formulate a hierarchical probabilistic model for multichannel mixture signals that consists of inertial Markov models for source locations, location-aware moving-average models for source SCMs, and NMF-based low-rank models for the power spectral densities (PSDs) of sources. All the time-varying attributes of sources are jointly estimated under a maximum-a-posteriori (MAP) principle, and the source images are then estimated with a multichannel Wiener filter. The experiment using simulated data with two moving sources and four four-channel arrays showed that the proposed method achieved better separation and smoother localization.
@inproceedings{nihei2025jointseptrack, author = {Nihei, Ryunosuke and Bando, Yoshiaki and Nugraha, Aditya Arie and Di Carlo, Diego and Ueda, Hiroyuki and Ito, Yosuke and Yoshii, Kazuyoshi}, title = {Joint Separation and Tracking of Moving Sources with Distributed Microphone Arrays Based on Time-Varying Inertial Spatial Models}, booktitle = {Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)}, year = {2025}, month = oct, pages = {30--35}, address = {Singapore}, url = {https://ieeexplore.ieee.org/document/11249306}, doi = {10.1109/APSIPAASC65261.2025.11249306} }

2024

ICASSPW
Neural Steerer: Novel Steering Vector Synthesis with a Causal Neural Field over Frequency and Direction

Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Yoshiaki Bando, and Kazuyoshi Yoshii

In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing Workshops (ICASSPW), Apr 2024

We address the problem of accurately interpolating measured anechoic steering vectors with a deep learning framework called the neural field. This task plays a pivotal role in reducing the resource-intensive measurements required for precise sound source separation and localization, essential as the front-end of speech recognition. Classical approaches to interpolation rely on linear weighting of nearby measurements in space on a fixed, discrete set of frequencies. Drawing inspiration from the success of neural fields for novel view synthesis in computer vision, we introduce the neural steerer, a continuous complex-valued function that takes both frequency and direction as input and produces the corresponding steering vector. Importantly, it incorporates inter-channel phase difference information and a regularization term enforcing filter causality, essential for accurate steering vector modeling. Our experiments, conducted using a dataset of real measured steering vectors, demonstrate the effectiveness of our resolution-free model in interpolating such measurements.
@inproceedings{dicarlo2024neuralsteerer, author = {Di Carlo, Diego and Nugraha, Aditya Arie and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi}, title = {Neural Steerer: Novel Steering Vector Synthesis with a Causal Neural Field over Frequency and Direction}, booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing Workshops (ICASSPW)}, month = apr, year = {2024}, pages = {740--744}, address = {Seoul, South Korea}, url = {https://ieeexplore.ieee.org/document/10626510}, preprint = {https://arxiv.org/abs/2305.04447}, doi = {10.1109/ICASSPW62465.2024.10626510}, }
APSIPA
Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising

Yoto Fujita, Aditya Arie Nugraha, Diego Di Carlo, Yoshiaki Bando, Mathieu Fontaine, and Kazuyoshi Yoshii

In Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Dec 2024

This paper describes speech enhancement for real-time automatic speech recognition (ASR) in real environments. A standard approach to this task is to use neural beamforming that can work efficiently in an online manner. It estimates the masks of clean dry speech from a noisy echoic mixture spectrogram with a deep neural network (DNN) and then computes an enhancement filter used for beamforming. The performance of such a supervised approach, however, is drastically degraded under mismatched conditions. This calls for run-time adaptation of the DNN. Although the ground-truth speech spectrogram required for adaptation is not available at run time, blind dereverberation and separation methods such as weighted prediction error (WPE) and fast multichannel nonnegative matrix factorization (FastMNMF) can be used for generating pseudo ground-truth data from a mixture. Based on this idea, a prior work proposed a dual-process system based on a cascade of WPE and minimum variance distortionless response (MVDR) beamforming asynchronously fine-tuned by block-online FastMNMF. To integrate the dereverberation capability into neural beamforming and make it fine-tunable at run time, we propose to use weighted power minimization distortionless response (WPD) beamforming, a unified version of WPE and minimum power distortionless response (MPDR), whose joint dereverberation and denoising filter is estimated using a DNN. We evaluated the impact of run-time adaptation under various conditions with different numbers of speakers, reverberation times, and signal-to-noise ratios (SNRs).
@inproceedings{fujita2024runtimeadaptation, author = {Fujita, Yoto and Nugraha, Aditya Arie and Di Carlo, Diego and Bando, Yoshiaki and Fontaine, Mathieu and Yoshii, Kazuyoshi}, title = {Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising}, booktitle = {Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)}, year = {2024}, month = dec, pages = {1--6}, address = {Macau, China}, url = {https://ieeexplore.ieee.org/document/10849318}, preprint = {https://arxiv.org/abs/2410.22805}, doi = {10.1109/APSIPAASC63619.2025.10849318} }
Interspeech
RIR-in-a-Box: Estimating Room Acoustics from 3D Mesh Data through Shoebox Approximation

Liam Kelley, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Yoshiaki Bando, and Kazuyoshi Yoshii

In Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), Sep 2024

This paper describes a method for estimating the room impulse response (RIR) for a microphone and a sound source located at arbitrary positions from the 3D mesh data of the room. Simulating realistic RIRs with pure physics-driven methods often fails to balance physical consistency and computational efficiency, hindering application to real-time speech processing. Alternatively, one can use MESH2IR, a fast black-box estimator that consists of an encoder extracting latent code from mesh data with a graph convolutional network (GCN) and a decoder generating the RIR from the latent code. Combining these two approaches, we propose a fast yet physically coherent estimator with interpretable latent code based on differentiable digital signal processing (DDSP). Specifically, the encoder estimates a virtual shoebox room scene that acoustically approximates the real scene, accelerating physical simulation with the differentiable image-source model in the decoder. Our experiments showed that our method outperformed MESH2IR for real mesh data obtained with the depth scanner of Microsoft HoloLens 2, and can provide correct spatial consistency for binaural RIRs.
@inproceedings{kelley2024ririnabox, author = {Kelley, Liam and Di Carlo, Diego and Nugraha, Aditya Arie and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi}, title = {RIR-in-a-Box: Estimating Room Acoustics from 3D Mesh Data through Shoebox Approximation}, booktitle = {Proceedings of Annual Conference of the International Speech Communication Association (Interspeech)}, year = {2024}, month = sep, pages = {3255--3259}, address = {Kos Island, Greece}, url = {https://www.isca-archive.org/interspeech_2024/kelley24_interspeech.html}, preprint = {https://telecom-paris.hal.science/hal-04632526}, doi = {10.21437/Interspeech.2024-2053}, }
IWAENC
Joint Audio Source Localization and Separation with Distributed Microphone Arrays Based on Spatially-Regularized Multichannel NMF

Yoshiaki Sumura, Diego Di Carlo, Aditya Arie Nugraha, Yoshiaki Bando, and Kazuyoshi Yoshii

In Proceedings of International Workshop on Acoustic Signal Enhancement (IWAENC), Sep 2024

This paper describes a statistically principled method that simultaneously localizes and separates multiple sound sources using multiple calibrated microphone arrays distributed in a room. Given the extensive research on direction of arrival (DOA) estimation with a single microphone array, for 3D source localization, one may attempt triangulation based on DOAs separately and egocentrically estimated by multiple arrays. However, in multi-source scenarios, this cascading approach faces both the inter-array DOA association problem and the error accumulation problem. To solve these problems, we propose a spatially regularized extension of a versatile blind source separation method called multichannel nonnegative matrix factorization (MNMF). Our method treats multiple microphone arrays as a single big array and puts priors on the frequency-wise spatial covariance matrices (SCMs) of each source. These priors are defined using the source DOA computed from the 3D positions of the source and arrays. The power spectral densities (PSDs), SCMs, and positions of multiple sources are jointly estimated under the unified maximum-a-posteriori (MAP) principle. We show the effectiveness of the joint statistical estimation for real data recorded by four five-channel microphone arrays of Microsoft Azure Kinect.
@inproceedings{sumura2024jointlocalsep, author = {Sumura, Yoshiaki and Di Carlo, Diego and Nugraha, Aditya Arie and Bando, Yoshiaki and Yoshii, Kazuyoshi}, title = {Joint Audio Source Localization and Separation with Distributed Microphone Arrays Based on Spatially-Regularized Multichannel NMF}, booktitle = {Proceedings of International Workshop on Acoustic Signal Enhancement (IWAENC)}, year = {2024}, month = sep, pages = {145--149}, address = {Aalborg, Denmark}, url = {https://ieeexplore.ieee.org/document/10694042}, doi = {10.1109/IWAENC61483.2024.10694042} }

2023

ICASSP
Exploiting Sparse Recovery Algorithms for Semi-Supervised Training of Deep Neural Networks for Direction-of-Arrival Estimation

Murtiza Ali, Aditya Arie Nugraha, and Karan Nathwani

In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2023

This paper proposes a semi-supervised training approach for direction-of-arrival (DoA) estimation based on a convolutional neural network (CNN). We apply a sparse recovery algorithm called optMGD-ℓ1-SVD on the training dataset consisting of only unlabeled observed data to obtain binarized pseudo-spectra regarded as the CNN training targets (labels). The estimated DoAs are obtained at test time by performing peak picking on the CNN outputs. optMGD-ℓ1-SVD has been shown to perform well with a few sensors under low signal-to-noise ratio (SNR) conditions (up to −6 dB) by optimally reweighting the pseudo-spectra of ℓ1-SVD based on the application of group delay function on the pseudo-spectra of MUSIC. Since its hyperparameters are noise-sensitive, we assume that the SNR levels of the training dataset are known such that we can use the optimal ones. We also consider multi-condition training using data of multiple SNR levels to improve the robustness towards different noisy environments. We evaluated the trained networks, named optMGD-ℓ1-SVD-CNN and MGD-ℓ1-SVD-CNN, in terms of the average root-mean-square error and the resolution probability under low SNR conditions (up to −20 dB). We demonstrated that they performed well with a few sensors and snapshots, including at SNR levels unseen in the training data.
@inproceedings{ali2023semisupdoa, author = {Ali, Murtiza and Nugraha, Aditya Arie and Nathwani, Karan}, title = {Exploiting Sparse Recovery Algorithms for Semi-Supervised Training of Deep Neural Networks for Direction-of-Arrival Estimation}, booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, month = jun, year = {2023}, pages = {1--5}, address = {Rhodes Island, Greece}, url = {https://ieeexplore.ieee.org/document/10095717}, doi = {10.1109/ICASSP49357.2023.10095717} }
EUSIPCO
Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation

Yoshiaki Bando, Yoshiki Masuyama, Aditya Arie Nugraha, and Kazuyoshi Yoshii

In Proceedings of European Signal Processing Conference (EUSIPCO), Sep 2023

This paper describes an efficient unsupervised learning method for a neural source separation model that utilizes a probabilistic generative model of observed multichannel mixtures proposed for blind source separation (BSS). For this purpose, amortized variational inference (AVI) has been used for directly solving the inverse problem of BSS with full-rank spatial covariance analysis (FCA). Although this unsupervised technique called neural FCA is in principle free from the domain mismatch problem, it is computationally demanding due to the full rankness of the spatial model in exchange for robustness against relatively short reverberations. To reduce the model complexity without sacrificing performance, we propose neural FastFCA based on the jointly-diagonalizable yet full-rank spatial model. Our neural separation model introduced for AVI alternately performs neural network blocks and single steps of an efficient iterative algorithm called iterative source steering. This alternating architecture enables the separation model to quickly separate the mixture spectrogram by leveraging both the deep neural network and the multichannel optimization algorithm. The training objective with AVI is derived to maximize the marginalized likelihood of the observed mixtures. The experiment using mixture signals of two to four sound sources shows that neural FastFCA outperforms conventional BSS methods and reduces the computational time to about 2% of that for neural FCA.
@inproceedings{bando2023neuralfastfca, author = {Bando, Yoshiaki and Masuyama, Yoshiki and Nugraha, Aditya Arie and Yoshii, Kazuyoshi}, title = {Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation}, booktitle = {Proceedings of European Signal Processing Conference (EUSIPCO)}, year = {2023}, month = sep, pages = {51--55}, address = {Helsinki, Finland}, url = {https://ieeexplore.ieee.org/document/10289974}, preprint = {https://arxiv.org/abs/2306.10240}, doi = {10.23919/EUSIPCO58844.2023.10289974} }
WASPAA
Time-Domain Audio Source Separation Based on Gaussian Processes with Deep Kernel Learning

Aditya Arie Nugraha, Diego Di Carlo, Yoshiaki Bando, Mathieu Fontaine, and Kazuyoshi Yoshii

In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct 2023

This paper revisits single-channel audio source separation based on a probabilistic generative model of a mixture signal defined in the continuous time domain. We assume that each source signal follows a non-stationary Gaussian process (GP), i.e., any finite set of sampled points follows a zero-mean multivariate Gaussian distribution whose covariance matrix is governed by a kernel function over time-varying latent variables. The mixture signal composed of such source signals thus follows a GP whose covariance matrix is given by the sum of the source covariance matrices. To estimate the latent variables from the mixture signal, we use a deep neural network with an encoder-separator-decoder architecture (e.g., Conv-TasNet) that separates the latent variables in a pseudo-time-frequency space. The key feature of our method is to feed the latent variables into the kernel function for estimating the source covariance matrices, instead of using the decoder for directly estimating the time-domain source signals. This enables the decomposition of a mixture signal into the source signals with a classical yet powerful Wiener filter that considers the full covariance structure over all samples. The kernel function and the network are trained jointly in the maximum likelihood framework. Comparative experiments using two-speech mixtures under clean, noisy, and noisy-reverberant conditions from the WSJ0-2mix, WHAM!, and WHAMR! benchmark datasets demonstrated that the proposed method performed well and outperformed the baseline method under noisy and noisy-reverberant conditions.
@inproceedings{nugraha2023gpdkl, author = {Nugraha, Aditya Arie and Di Carlo, Diego and Bando, Yoshiaki and Fontaine, Mathieu and Yoshii, Kazuyoshi}, title = {Time-Domain Audio Source Separation Based on Gaussian Processes with Deep Kernel Learning}, booktitle = {Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}, year = {2023}, month = oct, pages = {1--5}, address = {New Paltz, NY, USA}, url = {https://ieeexplore.ieee.org/document/10248168}, preprint = {https://hal.science/hal-04172863}, doi = {10.1109/WASPAA58266.2023.10248168} }

2022

TASLP
Generalized Fast Multichannel Nonnegative Matrix Factorization Based on Gaussian Scale Mixtures for Blind Source Separation

Mathieu Fontaine, Kouhei Sekiguchi, Aditya Arie Nugraha, Yoshiaki Bando, and Kazuyoshi Yoshii

IEEE/ACM Transactions on Audio, Speech, and Language Processing, May 2022

This paper describes heavy-tailed extensions of a state-of-the-art versatile blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF) from a unified point of view. The common way of deriving such an extension is to replace the multivariate complex Gaussian distribution in the likelihood function with its heavy-tailed generalization, e.g., the multivariate complex Student’s t and leptokurtic generalized Gaussian distributions, and tailor-make the corresponding parameter optimization algorithm. Using a wider class of heavy-tailed distributions called a Gaussian scale mixture (GSM), i.e., a mixture of Gaussian distributions whose variances are perturbed by positive random scalars called impulse variables, we propose GSM-FastMNMF and develop an expectation-maximization algorithm that works even when the probability density function of the impulse variables has no analytical expression. We show that existing heavy-tailed FastMNMF extensions are instances of GSM-FastMNMF and derive a new instance based on the generalized hyperbolic distribution that includes the normal-inverse Gaussian, Student’s t, and Gaussian distributions as special cases. Our experiments show that the normal-inverse Gaussian FastMNMF outperforms the state-of-the-art FastMNMF extensions and ILRMA model in speech enhancement and separation in terms of the signal-to-distortion ratio.
@article{fontaine2022gsmfastmnmf, author = {Fontaine, Mathieu and Sekiguchi, Kouhei and Nugraha, Aditya Arie and Bando, Yoshiaki and Yoshii, Kazuyoshi}, journal = {{IEEE/ACM} Transactions on Audio, Speech, and Language Processing}, title = {Generalized Fast Multichannel Nonnegative Matrix Factorization Based on Gaussian Scale Mixtures for Blind Source Separation}, year = {2022}, month = may, volume = {30}, number = {}, pages = {1734--1748}, url = {https://ieeexplore.ieee.org/abstract/document/9769993}, doi = {10.1109/TASLP.2022.3172631} }
TASLP
Autoregressive Moving Average Jointly-Diagonalizable Spatial Covariance Analysis for Joint Source Separation and Dereverberation

Kouhei Sekiguchi, Yoshiaki Bando, Aditya Arie Nugraha, Mathieu Fontaine, Kazuyoshi Yoshii, and Tatsuya Kawahara

IEEE/ACM Transactions on Audio, Speech, and Language Processing, Jul 2022

Abs DOI Bib HTML PDF

This article describes a computationally-efficient statistical approach to joint (semi-)blind source separation and dereverberation for multichannel noisy reverberant mixture signals. A standard approach to source separation is to formulate a generative model of a multichannel mixture spectrogram that consists of source and spatial models representing the time-frequency power spectral densities (PSDs) and spatial covariance matrices (SCMs) of source images, respectively, and find the maximum-likelihood estimates of these parameters. A state-of-the-art blind source separation method in this thread of research is fast multichannel nonnegative matrix factorization (FastMNMF) based on the low-rank PSDs and jointly-diagonalizable full-rank SCMs. To perform mutually-dependent separation and dereverberation jointly, in this paper we integrate both moving average (MA) and autoregressive (AR) models that represent the early reflections and late reverberations of sources, respectively, into the FastMNMF formalism. Using a pretrained deep generative model of speech PSDs as a source model, we realize semi-blind joint speech separation and dereverberation. We derive an iterative optimization algorithm based on iterative projection or iterative source steering for jointly and efficiently updating the AR parameters and the SCMs. Our experimental results showed the superiority of the proposed ARMA extension over its AR- or MA-ablated version in a speech separation and/or dereverberation task.
@article{sekiguchi2022armafastmnmf, author = {Sekiguchi, Kouhei and Bando, Yoshiaki and Nugraha, Aditya Arie and Fontaine, Mathieu and Yoshii, Kazuyoshi and Kawahara, Tatsuya}, journal = {{IEEE/ACM} Transactions on Audio, Speech, and Language Processing}, title = {Autoregressive Moving Average Jointly-Diagonalizable Spatial Covariance Analysis for Joint Source Separation and Dereverberation}, year = {2022}, month = jul, volume = {30}, number = {}, pages = {2368--2382}, url = {https://ieeexplore.ieee.org/document/9829286}, doi = {10.1109/TASLP.2022.3190734} }
Interspeech
Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments

Yicheng Du*, Aditya Arie Nugraha*, Kouhei Sekiguchi*, Yoshiaki Bando, Mathieu Fontaine, and Kazuyoshi Yoshii

In Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), Sep 2022

Abs DOI Bib HTML PDF

This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication in real multiparty conversational environments. A major approach that has actively been studied in simulated environments is to sequentially perform speech enhancement and automatic speech recognition (ASR) based on deep neural networks (DNNs) trained in a supervised manner. In our task, however, such a pretrained system fails to work due to the mismatch between the training and test conditions and the head movements of the user. To enhance only the utterances of a target speaker, we use beamforming based on a DNN-based speech mask estimator that can adaptively extract the speech components corresponding to a particular head-relative direction. We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean speech signals with ground-truth transcriptions and noisy speech signals with highly-confident estimated transcriptions. Comparative experiments using the state-of-the-art distant speech recognition system show that the proposed method significantly improves the ASR performance.
@inproceedings{du2022directionaware, author = {Du, Yicheng and Nugraha, Aditya Arie and Sekiguchi, Kouhei and Bando, Yoshiaki and Fontaine, Mathieu and Yoshii, Kazuyoshi}, title = {Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments}, booktitle = {Proceedings of Annual Conference of the International Speech Communication Association (Interspeech)}, year = {2022}, month = sep, pages = {2918--2922}, address = {Incheon, South Korea}, url = {https://www.isca-speech.org/archive/interspeech_2022/du22d_interspeech.html}, doi = {10.21437/Interspeech.2022-10508} }
EUSIPCO
Elliptically Contoured Alpha-Stable Representation for MUSIC-Based Sound Source Localization

Mathieu Fontaine, Diego Di Carlo, Kouhei Sekiguchi, Aditya Arie Nugraha, Yoshiaki Bando, and Kazuyoshi Yoshii

In Proceedings of European Signal Processing Conference (EUSIPCO), Aug 2022

Abs DOI Bib HTML PDF

This paper introduces a theoretically-rigorous sound source localization (SSL) method based on a robust extension of the classical multiple signal classification (MUSIC) algorithm. The original SSL method estimates the noise eigenvectors and the MUSIC spectrum by computing the spatial covariance matrix of the observed multichannel signal and then detects the peaks from the spectrum. In this work, the covariance matrix is replaced with the positive definite shape matrix originating from the elliptically contoured α-stable model, which is more suitable under real noisy, highly reverberant conditions. Evaluation on synthetic data shows that the proposed method outperforms baseline methods under such adverse conditions, while it is comparable on real data recorded in a mild acoustic condition.
@inproceedings{fontaine2022alphamusic, author = {Fontaine, Mathieu and Di Carlo, Diego and Sekiguchi, Kouhei and Nugraha, Aditya Arie and Bando, Yoshiaki and Yoshii, Kazuyoshi}, title = {Elliptically Contoured Alpha-Stable Representation for MUSIC-Based Sound Source Localization}, booktitle = {Proceedings of European Signal Processing Conference (EUSIPCO)}, year = {2022}, month = aug, pages = {26--30}, address = {Belgrade, Serbia}, url = {https://ieeexplore.ieee.org/document/9909944}, doi = {10.23919/EUSIPCO55093.2022.9909944} }
ICASSP
Flow-Based Fast Multichannel Nonnegative Matrix Factorization for Blind Source Separation

Aditya Arie Nugraha, Kouhei Sekiguchi, Mathieu Fontaine, Yoshiaki Bando, and Kazuyoshi Yoshii

In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022

Abs DOI Bib HTML Poster

This paper describes a blind source separation method for multichannel audio signals, called NF-FastMNMF, based on the integration of the normalizing flow (NF) into the multichannel nonnegative matrix factorization with jointly-diagonalizable spatial covariance matrices, a.k.a. FastMNMF. Whereas the NF of flow-based independent vector analysis, called NF-IVA, acts as the demixing matrices to transform an M-channel mixture into M independent sources, the NF of NF-FastMNMF acts as the diagonalization matrices to transform an M-channel mixture into a spatially-independent M-channel mixture represented as a weighted sum of N source images. This diagonalization enables the NF, which has been used only for determined separation because of its bijective nature, to be applicable to non-determined separation. NF-FastMNMF has time-varying diagonalization matrices that are potentially better at handling dynamical data variation than the time-invariant ones in FastMNMF. To have an NF with richer expression capability, the dimension-wise scalings using diagonal matrices originally used in NF-IVA are replaced with linear transformations using upper triangular matrices; in both cases, the diagonal and upper triangular matrices are estimated by neural networks. The evaluation shows that NF-FastMNMF performs well for both determined and non-determined separations of multiple speech utterances by stationary or non-stationary speakers from a noisy reverberant mixture.
@inproceedings{nugraha2022nffastmnmf, author = {Nugraha, Aditya Arie and Sekiguchi, Kouhei and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi}, title = {Flow-Based Fast Multichannel Nonnegative Matrix Factorization for Blind Source Separation}, booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, year = {2022}, month = may, pages = {501--505}, address = {Singapore}, url = {https://ieeexplore.ieee.org/document/9747718}, preprint = {https://hal.archives-ouvertes.fr/hal-03637425/}, doi = {10.1109/ICASSP43922.2022.9747718} }
IWAENC
DNN-Free Low-Latency Adaptive Speech Enhancement Based on Frame-Online Beamforming Powered by Block-Online FastMNMF

Aditya Arie Nugraha, Kouhei Sekiguchi, Mathieu Fontaine, Yoshiaki Bando, and Kazuyoshi Yoshii

In Proceedings of International Workshop on Acoustic Signal Enhancement (IWAENC), Sep 2022

Abs DOI Bib HTML

This paper describes a practical dual-process speech enhancement system that adapts environment-sensitive frame-online beamforming (front-end) with help from environment-free block-online source separation (back-end). To use minimum variance distortionless response (MVDR) beamforming, one may train a deep neural network (DNN) that estimates time-frequency masks used for computing the covariance matrices of sources (speech and noise). Backpropagation-based run-time adaptation of the DNN was proposed for dealing with the mismatched training-test conditions. Instead, one may try to directly estimate the source covariance matrices with a state-of-the-art blind source separation method called fast multichannel non-negative matrix factorization (FastMNMF). In practice, however, neither the DNN nor the FastMNMF can be updated in a frame-online manner due to its computationally-expensive iterative nature. Our DNN-free system leverages the posteriors of the latest source spectrograms given by block-online FastMNMF to derive the current source covariance matrices for frame-online beamforming. The evaluation shows that our frame-online system can quickly respond to scene changes caused by interfering speaker movements and outperformed an existing block-online system with DNN-based beamforming by 5.0 points in terms of the word error rate.
@inproceedings{nugraha2022dnnfree, author = {Nugraha, Aditya Arie and Sekiguchi, Kouhei and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi}, title = {DNN-Free Low-Latency Adaptive Speech Enhancement Based on Frame-Online Beamforming Powered by Block-Online FastMNMF}, booktitle = {Proceedings of International Workshop on Acoustic Signal Enhancement (IWAENC)}, year = {2022}, month = sep, pages = {1--5}, address = {Bamberg, Germany}, url = {https://ieeexplore.ieee.org/document/9914729}, preprint = {https://arxiv.org/abs/2207.10934}, doi = {10.1109/IWAENC53105.2022.9914729} }
IROS
Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments

Kouhei Sekiguchi*, Aditya Arie Nugraha*, Yicheng Du, Yoshiaki Bando, Mathieu Fontaine, and Kazuyoshi Yoshii

In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2022

Abs DOI Bib HTML

This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset that helps a user understand conversations made in real noisy echoic environments (e.g., cocktail party). One may use a state-of-the-art blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF) that works well in various environments thanks to its unsupervised nature. Its heavy computational cost, however, prevents its application to real-time processing. In contrast, a supervised beamforming method that uses a deep neural network (DNN) for estimating spatial information of speech and noise readily fits real-time processing, but suffers from drastic performance degradation in mismatched conditions. Given such complementary characteristics, we propose a dual-process robust online speech enhancement method based on DNN-based beamforming with FastMNMF-guided adaptation. FastMNMF (back end) is performed in a mini-batch style and the noisy and enhanced speech pairs are used together with the original parallel training data for updating the direction-aware DNN (front end) with backpropagation at a computationally-allowable interval. This method is used with a blind dereverberation method called weighted prediction error (WPE) for transcribing the noisy reverberant speech of a speaker, which can be detected from video or selected by a user’s hand gesture or eye gaze, in a streaming manner and spatially showing the transcriptions with an AR technique. Our experiment showed that the word error rate was improved by more than 10 points with the run-time adaptation using only twelve minutes of observation.
@inproceedings{sekiguchi2022directionaware, author = {Sekiguchi, Kouhei and Nugraha, Aditya Arie and Du, Yicheng and Bando, Yoshiaki and Fontaine, Mathieu and Yoshii, Kazuyoshi}, title = {Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments}, booktitle = {Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, year = {2022}, month = oct, pages = {9266--9273}, address = {Kyoto, Japan}, url = {https://ieeexplore.ieee.org/document/9981659}, preprint = {https://arxiv.org/abs/2207.07296}, doi = {10.1109/IROS47612.2022.9981659} }
IWAENC
Joint Localization and Synchronization of Distributed Camera-Attached Microphone Arrays for Indoor Scene Analysis

Yoshiaki Sumura, Kouhei Sekiguchi, Yoshiaki Bando, Aditya Arie Nugraha, and Kazuyoshi Yoshii

In Proceedings of International Workshop on Acoustic Signal Enhancement (IWAENC), Sep 2022

Abs DOI Bib HTML

This paper describes an automatic calibration method that localizes and synchronizes distributed camera-attached microphone arrays (e.g., Microsoft Azure Kinect) used for audio-visual indoor scene analysis. Operating multiple audio-visual sensors as a large-scale array is a key to resolving object occlusions and sound overlaps by integrating audio-visual information obtained from multiple angles. A naive solution to the calibration problem is to synchronize microphone arrays after localizing them using only visual information. This cascading approach, however, would suffer from the error propagation problem. We thus propose a principled statistical method that fully uses audio-visual information at once. Our method only asks a user to make handclaps and jointly estimates the sensor positions and time offsets and the time-varying source position with the GraphSLAM algorithm based on a unified state-space model associating all the latent calibration targets with the audio-visual observations. The experiment using real recordings shows the stable behavior of the proposed method.
@inproceedings{sumura2022jointlocalsync, author = {Sumura, Yoshiaki and Sekiguchi, Kouhei and Bando, Yoshiaki and Nugraha, Aditya Arie and Yoshii, Kazuyoshi}, title = {Joint Localization and Synchronization of Distributed Camera-Attached Microphone Arrays for Indoor Scene Analysis}, booktitle = {Proceedings of International Workshop on Acoustic Signal Enhancement (IWAENC)}, year = {2022}, month = sep, pages = {1--5}, address = {Bamberg, Germany}, url = {https://ieeexplore.ieee.org/document/9914786}, doi = {10.1109/IWAENC53105.2022.9914786} }

2021

SPL
Neural Full-Rank Spatial Covariance Analysis for Blind Source Separation

Yoshiaki Bando, Kouhei Sekiguchi, Yoshiki Masuyama, Aditya Arie Nugraha, Mathieu Fontaine, and Kazuyoshi Yoshii

IEEE Signal Processing Letters, Aug 2021

Abs DOI Bib HTML PDF

This paper describes a neural blind source separation (BSS) method based on amortized variational inference (AVI) of a non-linear generative model of mixture signals. A classical statistical approach to BSS is to fit a linear generative model that consists of spatial and source models representing the inter-channel covariances and power spectral densities of sources, respectively. Although the variational autoencoder (VAE) has successfully been used as a non-linear source model with latent features, it should be pretrained from a sufficient amount of isolated signals. Our method, in contrast, enables the VAE-based source model to be trained only from mixture signals. Specifically, we introduce a neural mixture-to-feature inference model that directly infers the latent features from the observed mixture and integrate it with a neural feature-to-mixture generative model consisting of a full-rank spatial model and a VAE-based source model. All the models are optimized jointly such that the likelihood for the training mixtures is maximized in the framework of AVI. Once the inference model is optimized, it can be used for estimating the latent features of sources included in unseen mixture signals. The experimental results show that the proposed method outperformed the state-of-the-art BSS methods based on linear generative models and was comparable to a method based on supervised learning of the VAE-based source model.
@article{bando2021neuralfca, author = {Bando, Yoshiaki and Sekiguchi, Kouhei and Masuyama, Yoshiki and Nugraha, Aditya Arie and Fontaine, Mathieu and Yoshii, Kazuyoshi}, journal = {{IEEE} Signal Processing Letters}, title = {Neural Full-Rank Spatial Covariance Analysis for Blind Source Separation}, year = {2021}, month = aug, volume = {28}, number = {}, pages = {1670--1674}, url = {https://ieeexplore.ieee.org/document/9506855}, doi = {10.1109/LSP.2021.3101699} }
Interspeech
Alpha-Stable Autoregressive Fast Multichannel Nonnegative Matrix Factorization for Joint Speech Enhancement and Dereverberation

Mathieu Fontaine, Kouhei Sekiguchi, Aditya Arie Nugraha, Yoshiaki Bando, and Kazuyoshi Yoshii

In Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), Aug 2021

Abs DOI Bib HTML PDF

This paper proposes α-stable autoregressive fast multichannel nonnegative matrix factorization (α-AR-FastMNMF), a robust joint blind speech enhancement and dereverberation method for improved automatic speech recognition in a realistic adverse environment. The state-of-the-art versatile blind source separation method called FastMNMF that assumes the short-time Fourier transform (STFT) coefficients of a direct sound to follow a circular complex Gaussian distribution with jointly-diagonalizable full-rank spatial covariance matrices was extended to AR-FastMNMF with an autoregressive reverberation model. Instead of the light-tailed Gaussian distribution, we use the heavy-tailed α-stable distribution, which also has the reproductive property useful for the additive source modeling, to better deal with the large dynamic range of the direct sound. The experimental results demonstrate that the proposed α-AR-FastMNMF works well as a front-end of an automatic speech recognition system. It outperforms α-AR-ILRMA, which is a special case of α-AR-FastMNMF, and their Gaussian counterparts, i.e., AR-FastMNMF and AR-ILRMA, in terms of the speech signal quality metrics and word error rate.
@inproceedings{fontaine2021alphaarfastmnmf, author = {Fontaine, Mathieu and Sekiguchi, Kouhei and Nugraha, Aditya Arie and Bando, Yoshiaki and Yoshii, Kazuyoshi}, title = {Alpha-Stable Autoregressive Fast Multichannel Nonnegative Matrix Factorization for Joint Speech Enhancement and Dereverberation}, booktitle = {Proceedings of Annual Conference of the International Speech Communication Association (Interspeech)}, year = {2021}, month = aug, pages = {661--665}, address = {Brno, Czechia}, url = {https://www.isca-speech.org/archive/interspeech_2021/fontaine21_interspeech.html}, doi = {10.21437/Interspeech.2021-742} }
ICASSP
Autoregressive Fast Multichannel Nonnegative Matrix Factorization For Joint Blind Source Separation And Dereverberation

Kouhei Sekiguchi, Yoshiaki Bando, Aditya Arie Nugraha, Mathieu Fontaine, and Kazuyoshi Yoshii

In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2021

Abs DOI Bib HTML

This paper describes a joint blind source separation and dereverberation method that works adaptively and efficiently in a reverberant noisy environment. The modern approach to blind source separation (BSS) is to formulate a probabilistic model of multichannel mixture signals that consists of a source model representing the time-frequency structures of source spectrograms and a spatial model representing the inter-channel covariance structures of source images. The cutting-edge BSS method in this thread of research is fast multi-channel nonnegative matrix factorization (FastMNMF) that consists of a low-rank source model based on nonnegative matrix factorization (NMF) and a full-rank spatial model based on jointly-diagonalizable spatial covariance matrices. Although FastMNMF is computationally efficient and can deal with both directional sources and diffuse noise simultaneously, its performance is severely degraded in a reverberant environment. To solve this problem, we propose autoregressive FastMNMF (AR-FastMNMF) based on a unified probabilistic model that combines FastMNMF with a blind dereverberation method called weighted prediction error (WPE), where all the parameters are optimized jointly such that the likelihood for observed reverberant mixture signals is maximized. Experimental results showed the superiority of AR-FastMNMF over conventional methods that perform blind dereverberation and BSS jointly or sequentially.
@inproceedings{sekiguchi2021arfastmnmf, author = {Sekiguchi, Kouhei and Bando, Yoshiaki and Nugraha, Aditya Arie and Fontaine, Mathieu and Yoshii, Kazuyoshi}, title = {Autoregressive Fast Multichannel Nonnegative Matrix Factorization For Joint Blind Source Separation And Dereverberation}, booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, year = {2021}, month = jun, pages = {511--515}, address = {Toronto, Canada}, url = {https://ieeexplore.ieee.org/document/9414857}, doi = {10.1109/ICASSP39728.2021.9414857} }

2020

TASLP
A Flow-Based Deep Latent Variable Model for Speech Spectrogram Modeling and Enhancement

Aditya Arie Nugraha, Kouhei Sekiguchi, and Kazuyoshi Yoshii

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020

Abs DOI Bib HTML

This paper describes a deep latent variable model of speech power spectrograms and its application to semi-supervised speech enhancement with a deep speech prior. By integrating two major deep generative models, a variational autoencoder (VAE) and a normalizing flow (NF), in a mutually-beneficial manner, we formulate a flexible latent variable model called the NF-VAE that can extract low-dimensional latent representations from high-dimensional observations, akin to the VAE, and does not need to explicitly represent the distribution of the observations, akin to the NF. In this paper, we consider a variant of NF called the generative flow (GF, a.k.a. Glow) and formulate a latent variable model called the GF-VAE. We experimentally show that the proposed GF-VAE is better than the standard VAE at capturing fine-structured harmonics of speech spectrograms, especially in the high-frequency range. A similar finding is also obtained when the GF-VAE and the VAE are used to generate speech spectrograms from latent variables randomly sampled from the standard Gaussian distribution. Lastly, when these models are used as speech priors for statistical multichannel speech enhancement, the GF-VAE outperforms the VAE and the GF.
@article{nugraha2020gfvae, title = {A Flow-Based Deep Latent Variable Model for Speech Spectrogram Modeling and Enhancement}, author = {Nugraha, Aditya Arie and Sekiguchi, Kouhei and Yoshii, Kazuyoshi}, journal = {{IEEE/ACM} Transactions on Audio, Speech, and Language Processing}, year = {2020}, month = {}, volume = {28}, number = {}, pages = {1104--1117}, url = {https://ieeexplore.ieee.org/document/9028147}, preprint = {https://www.techrxiv.org/articles/A_Flow-Based_Deep_Latent_Variable_Model_for_Speech_Spectrogram_Modeling_and_Enhancement/12375284}, doi = {10.1109/TASLP.2020.2979603} }
SPL
Flow-Based Independent Vector Analysis for Blind Source Separation

Aditya Arie Nugraha, Kouhei Sekiguchi, Mathieu Fontaine, Yoshiaki Bando, and Kazuyoshi Yoshii

IEEE Signal Processing Letters, 2020

Abs DOI Bib HTML PDF

This paper describes a time-varying extension of independent vector analysis (IVA) based on the normalizing flow (NF), called NF-IVA, for determined blind source separation of multichannel audio signals. As in IVA, NF-IVA estimates demixing matrices that transform mixture spectra to source spectra in the complex-valued spatial domain such that the likelihood of those matrices for the mixture spectra is maximized under some non-Gaussian source model. While IVA performs a time-invariant bijective linear transformation, NF-IVA performs a series of time-varying bijective linear transformations (flow blocks) adaptively predicted by neural networks. To regularize such transformations, we introduce a soft volume-preserving (VP) constraint. Given mixture spectra, the parameters of NF-IVA are optimized by gradient descent with backpropagation in an unsupervised manner. Experimental results show that NF-IVA successfully performs speech separation in reverberant environments with different numbers of speakers and microphones and that NF-IVA with the VP constraint outperforms NF-IVA without it, standard IVA with iterative projection, and improved IVA with gradient descent.
@article{nugraha2020nfiva, title = {Flow-Based Independent Vector Analysis for Blind Source Separation}, author = {Nugraha, Aditya Arie and Sekiguchi, Kouhei and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi}, journal = {{IEEE} Signal Processing Letters}, year = {2020}, month = {}, volume = {27}, number = {}, pages = {2173--2177}, url = {https://ieeexplore.ieee.org/document/9269436}, doi = {10.1109/LSP.2020.3039944} }
TASLP
Fast Multichannel Nonnegative Matrix Factorization with Directivity-Aware Jointly-Diagonalizable Spatial Covariance Matrices for Blind Source Separation

Kouhei Sekiguchi, Yoshiaki Bando, Aditya Arie Nugraha, Kazuyoshi Yoshii, and Tatsuya Kawahara

IEEE/ACM Transactions on Audio, Speech, and Language Processing, Aug 2020

Awarded Abs DOI Bib HTML PDF Code

15th IEEE Signal Processing Society (SPS) Japan Student Journal Paper Award

This paper describes a computationally-efficient blind source separation (BSS) method based on the independence, low-rankness, and directivity of the sources. A typical approach to BSS is unsupervised learning of a probabilistic model that consists of a source model representing the time-frequency structure of source images and a spatial model representing their inter-channel covariance structure. Building upon the low-rank source model based on nonnegative matrix factorization (NMF), which has been considered to be effective for inter-frequency source alignment, multichannel NMF (MNMF) assumes source images to follow multivariate complex Gaussian distributions with unconstrained full-rank spatial covariance matrices (SCMs). An effective way of reducing the computational cost and initialization sensitivity of MNMF is to restrict the degree of freedom of SCMs. While a variant of MNMF called independent low-rank matrix analysis (ILRMA) severely restricts SCMs to rank-1 matrices under an idealized condition that only directional and less-echoic sources exist, we restrict SCMs to jointly-diagonalizable yet full-rank matrices in a frequency-wise manner, resulting in FastMNMF1. To help inter-frequency source alignment, we then propose FastMNMF2 that shares the directional feature of each source over all frequency bins. To explicitly consider the directivity or diffuseness of each source, we also propose rank-constrained FastMNMF that enables us to individually specify the ranks of SCMs. Our experiments showed the superiority of FastMNMF over MNMF and ILRMA in speech separation and the effectiveness of the rank constraint in speech enhancement.
@article{sekiguchi2020fastmnmf, title = {Fast Multichannel Nonnegative Matrix Factorization with Directivity-Aware Jointly-Diagonalizable Spatial Covariance Matrices for Blind Source Separation}, author = {Sekiguchi, Kouhei and Bando, Yoshiaki and Nugraha, Aditya Arie and Yoshii, Kazuyoshi and Kawahara, Tatsuya}, journal = {{IEEE/ACM} Transactions on Audio, Speech, and Language Processing}, year = {2020}, month = aug, volume = {28}, number = {}, pages = {2610--2625}, url = {https://ieeexplore.ieee.org/document/9177266}, doi = {10.1109/TASLP.2020.3019181} }
EUSIPCO
Semi-supervised Multichannel Speech Separation Based on a Phone- and Speaker-Aware Deep Generative Model of Speech Spectrograms

Yicheng Du, Kouhei Sekiguchi, Yoshiaki Bando, Aditya Arie Nugraha, Mathieu Fontaine, Kazuyoshi Yoshii, and Tatsuya Kawahara

In Proceedings of European Signal Processing Conference (EUSIPCO), 2020

Abs DOI Bib HTML PDF

This paper describes a semi-supervised multichannel speech separation method that uses clean speech signals with frame-wise phonetic labels and sample-level speaker labels for pre-training. A standard approach to statistical source separation is to formulate a probabilistic model of multichannel mixture spectrograms that combines source models representing the time-frequency characteristics of sources with spatial models representing the covariance structure between channels. For speech separation and enhancement, deep generative models with latent variables have successfully been used as source models. The parameters of such a speech model can be trained beforehand from clean speech signals with a variational autoencoder (VAE) or its conditional variant (CVAE) that takes speaker labels as auxiliary inputs. Because human speech is characterized by both phonetic features and speaker identities, we propose a probabilistic model that combines a phone- and speaker-aware deep speech model with a full-rank spatial model. Our speech model is trained with a CVAE taking both phone and speaker labels as conditions. Given speech mixtures, the spatial covariance matrices, latent variables of sources, and phone and speaker labels of sources are jointly estimated. Comparative experimental results showed that the performance of speech separation can be improved by explicitly considering phonetic features and/or speaker identities.
@inproceedings{du2020phonespeakeraware, author = {Du, Yicheng and Sekiguchi, Kouhei and Bando, Yoshiaki and Nugraha, Aditya Arie and Fontaine, Mathieu and Yoshii, Kazuyoshi and Kawahara, Tatsuya}, title = {Semi-supervised Multichannel Speech Separation Based on a Phone- and Speaker-Aware Deep Generative Model of Speech Spectrograms}, booktitle = {Proceedings of European Signal Processing Conference (EUSIPCO)}, year = {2020}, optmonth = jan, pages = {870--874}, address = {Amsterdam, Netherlands}, url = {https://ieeexplore.ieee.org/document/9287464}, doi = {10.23919/Eusipco47968.2020.9287464} }
Interspeech
Unsupervised Robust Speech Enhancement Based on Alpha-Stable Fast Multichannel Nonnegative Matrix Factorization

Mathieu Fontaine, Kouhei Sekiguchi, Aditya Arie Nugraha, and Kazuyoshi Yoshii

In Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), Oct 2020

This paper describes multichannel speech enhancement based on a probabilistic model of complex source spectrograms for improving the intelligibility of speech corrupted by undesired noise. The univariate complex Gaussian model with the reproductive property supports the additivity of source complex spectrograms and forms the theoretical basis of nonnegative matrix factorization (NMF). Multichannel NMF (MNMF) is an extension of NMF based on the multivariate complex Gaussian model with spatial covariance matrices (SCMs), and its state-of-the-art variant called FastMNMF with jointly-diagonalizable SCMs achieves faster decomposition based on the univariate Gaussian model in the transformed domain where all time-frequency-channel elements are independent. Although a heavy-tailed extension of FastMNMF has been proposed to improve the robustness against impulsive noise, the source additivity has never been considered. The multivariate α-stable distribution does not have the reproductive property for the shape matrix parameter. This paper, therefore, proposes a heavy-tailed extension called α-stable FastMNMF which works in the transformed domain to use a univariate complex α-stable model, satisfying the reproductive property for any tail lightness parameter α and allowing the α-fractional Wiener filtering based on the element-wise source additivity. The experimental results show that α-stable FastMNMF with α = 1.8 significantly outperforms Gaussian FastMNMF (α=2).
@inproceedings{fontaine2020alphastablefastmnmf, author = {Fontaine, Mathieu and Sekiguchi, Kouhei and Nugraha, Aditya Arie and Yoshii, Kazuyoshi}, title = {Unsupervised Robust Speech Enhancement Based on Alpha-Stable Fast Multichannel Nonnegative Matrix Factorization}, booktitle = {Proceedings of Annual Conference of the International Speech Communication Association (Interspeech)}, year = {2020}, month = oct, pages = {4541--4545}, address = {Shanghai, China}, url = {https://www.isca-speech.org/archive/interspeech_2020/fontaine20_interspeech.html}, doi = {10.21437/Interspeech.2020-3202} }
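As a rough illustration of the α-fractional Wiener filtering mentioned in the abstract above, the sketch below computes per-source gains for a single scalar time-frequency element. It assumes the commonly used form in which each source has a scale parameter σ_j and the gain is σ_j^α divided by the sum of all σ_k^α. The function names and the toy values are hypothetical, and this is not the paper's implementation, which operates on multichannel data in a transformed domain.

```python
def fractional_wiener_gains(scales, alpha):
    """Per-source gains sigma_j**alpha / sum_k sigma_k**alpha for one element.

    `scales` holds the nonnegative scale parameters of the sources
    (hypothetical inputs); `alpha` is the tail parameter in (0, 2].
    """
    if not 0 < alpha <= 2:
        raise ValueError("alpha must lie in (0, 2]")
    powered = [s ** alpha for s in scales]
    total = sum(powered)
    if total <= 0:
        raise ValueError("at least one scale must be positive")
    return [p / total for p in powered]


def separate_element(mixture, scales, alpha):
    """Split one complex mixture coefficient into per-source estimates."""
    return [g * mixture for g in fractional_wiener_gains(scales, alpha)]


# Toy example: two sources, one complex mixture coefficient.
mixture = 1.0 + 2.0j
estimates = separate_element(mixture, scales=[3.0, 4.0], alpha=1.8)
gaussian_gains = fractional_wiener_gains([3.0, 4.0], alpha=2.0)
```

Because the gains sum to one, the source estimates add back up to the mixture, which is the element-wise additivity the abstract refers to. With α = 2 the gains reduce to the usual power-ratio Wiener gains.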
EUSIPCO
Fast Multichannel Correlated Tensor Factorization for Blind Source Separation

Kazuyoshi Yoshii, Kouhei Sekiguchi, Yoshiaki Bando, Mathieu Fontaine, and Aditya Arie Nugraha

In Proceedings of European Signal Processing Conference (EUSIPCO), 2020

This paper describes an ultimate covariance-aware multichannel extension of nonnegative matrix factorization (NMF) for blind source separation (BSS). A typical approach to BSS is to integrate a low-rank source model with a full-rank spatial model, as in multichannel NMF (MNMF) based on full-rank spatial covariance matrices (CMs) or its efficient version, FastMNMF, based on jointly-diagonalizable spatial CMs. The NMF-based phase-unaware source model, however, can deal with only the positive cooccurrence relations between time-frequency bins. To overcome this limitation, we propose an efficient multichannel extension of correlated tensor factorization (CTF) named FastMCTF based on jointly-diagonalizable temporal, frequency, and spatial CMs. Integration of the jointly-diagonalizable full-rank source model proposed by FastCTF with the jointly-diagonalizable full-rank spatial model proposed by FastMNMF enables us to completely consider the positive and negative covariance relations between frequency bins, time frames, and channels. We derive a convergence-guaranteed parameter estimation algorithm based on the multiplicative update and iterative projection and experimentally show the potential of the proposed method.
@inproceedings{yoshii2020fastmctf, author = {Yoshii, Kazuyoshi and Sekiguchi, Kouhei and Bando, Yoshiaki and Fontaine, Mathieu and Nugraha, Aditya Arie}, title = {Fast Multichannel Correlated Tensor Factorization for Blind Source Separation}, booktitle = {Proceedings of European Signal Processing Conference (EUSIPCO)}, year = {2020}, optmonth = jan, pages = {306--310}, address = {Amsterdam, Netherlands}, url = {https://ieeexplore.ieee.org/document/9287530}, doi = {10.23919/Eusipco47968.2020.9287530} }

2019

TASLP
Semi-supervised Multichannel Speech Enhancement with a Deep Speech Prior

Kouhei Sekiguchi, Yoshiaki Bando, Aditya Arie Nugraha, Kazuyoshi Yoshii, and Tatsuya Kawahara

IEEE/ACM Transactions on Audio, Speech, and Language Processing, Dec 2019

17th IEEE Kansai Section Student Paper Award

This paper describes a semi-supervised multichannel speech enhancement method that only uses clean speech data for prior training. Although multichannel nonnegative matrix factorization (MNMF) and its constrained variant called independent low-rank matrix analysis (ILRMA) have successfully been used for unsupervised speech enhancement, the low-rank assumption on the power spectral densities (PSDs) of all sources (speech and noise) does not hold in reality. To solve this problem, we replace a low-rank model of speech with a deep generative model in the framework of MNMF or ILRMA, i.e., formulate a probabilistic model of noisy speech by integrating a deep speech model, a low-rank noise model, and a full-rank or rank-1 model of spatial characteristics of speech and noise. The deep speech model is trained from clean speech data in an unsupervised auto-encoding variational Bayesian manner. Given multichannel noisy speech spectra, the full-rank or rank-1 spatial covariance matrices and PSDs of speech and noise are estimated in an unsupervised maximum-likelihood manner. Experimental results showed that the full-rank version of the proposed method was significantly better than MNMF, ILRMA, and the rank-1 version. We confirmed that the initialization-sensitivity and local-optimum problems of MNMF with many spatial parameters can be solved by incorporating the precise speech model.
@article{sekiguchi2019massvae, title = {Semi-supervised Multichannel Speech Enhancement with a Deep Speech Prior}, author = {Sekiguchi, Kouhei and Bando, Yoshiaki and Nugraha, Aditya Arie and Yoshii, Kazuyoshi and Kawahara, Tatsuya}, journal = {{IEEE/ACM} Transactions on Audio, Speech, and Language Processing}, year = {2019}, month = dec, volume = {27}, number = {12}, pages = {2197--2212}, url = {http://ieeexplore.ieee.org/document/8861142}, doi = {10.1109/TASLP.2019.2944348}, }
RO-MAN
Audio-Visual SLAM towards Human Tracking and Human-Robot Interaction in Indoor Environments

Aaron Chau, Kouhei Sekiguchi, Aditya Arie Nugraha, Kazuyoshi Yoshii, and Kotaro Funakoshi

In Proceedings of IEEE International Conference on Robot & Human Interactive Communication (RO-MAN), Oct 2019

Best Conference Paper Award

We propose a novel audio-visual simultaneous localization and mapping (SLAM) framework that exploits human pose and acoustic speech of human partners to allow a robot equipped with a microphone array and a monocular camera to track, map, and interact with human sound sources in an indoor environment. Since human interaction is characterized by features perceived not only in the visual modality but also in the acoustic modality, SLAM systems must utilize information from both modalities. Using a state-of-the-art beamforming technique, we obtain sound components corresponding to speech and noise, and estimate the directions of arrival (DoAs) of active sound sources as useful representations of observed features in the acoustic modality. Through human pose estimation with a monocular camera, we obtain the relative positions of humans as useful representations of observed features in the visual modality. Using these techniques, we attempt to eliminate the restrictions imposed by intermittent speech, noisy and reverberant periods, triangulation of sound-source range, and limited visual fields of view, and subsequently perform early fusion on these representations. We develop a system that allows for complementary action between the audio and visual sensor modalities in the simultaneous mapping of multiple human sound sources and the localization of the observer position.
@inproceedings{chau2019roman, author = {Chau, Aaron and Sekiguchi, Kouhei and Nugraha, Aditya Arie and Yoshii, Kazuyoshi and Funakoshi, Kotaro}, title = {Audio-Visual SLAM towards Human Tracking and Human-Robot Interaction in Indoor Environments}, booktitle = {Proceedings of IEEE International Conference on Robot \& Human Interactive Communication (RO-MAN)}, month = oct, year = {2019}, pages = {1--8}, address = {New Delhi, India}, url = {https://ieeexplore.ieee.org/document/8956321}, preprint = {http://sap.ist.i.kyoto-u.ac.jp/members/yoshii/papers/roman-2019-chau.pdf}, doi = {10.1109/RO-MAN46459.2019.8956321}, }
EUSIPCO
Cauchy Multichannel Speech Enhancement with a Deep Speech Prior

Mathieu Fontaine, Aditya Arie Nugraha, Roland Badeau, Kazuyoshi Yoshii, and Antoine Liutkus

In Proceedings of European Signal Processing Conference (EUSIPCO), Sep 2019

We propose a semi-supervised multichannel speech enhancement system based on a probabilistic model which assumes that both speech and noise follow the heavy-tailed multivariate complex Cauchy distribution. As we advocate, this allows handling strong and adverse noisy conditions. Consequently, the model is parameterized by the source magnitude spectrograms and the source spatial scatter matrices. To deal with the non-additivity of scatter matrices, our first contribution is to perform the enhancement on a projected space. Then, our second contribution is to combine a latent variable model for speech, which is trained by following the variational autoencoder framework, with a low-rank model for the noise source. At test time, an iterative inference algorithm is applied, which produces estimated parameters to use for separation. The speech latent variables are estimated first from the noisy speech and then updated by a gradient descent method, while a majorization-equalization strategy is used to update both the noise and the spatial parameters of both sources. Our experimental results show that the Cauchy model outperforms the state-of-the-art methods. The standard deviation scores also reveal that the proposed method is more robust against non-stationary noise.
@inproceedings{fontaine2019eusipco, author = {Fontaine, Mathieu and Nugraha, Aditya Arie and Badeau, Roland and Yoshii, Kazuyoshi and Liutkus, Antoine}, title = {Cauchy Multichannel Speech Enhancement with a Deep Speech Prior}, booktitle = {Proceedings of European Signal Processing Conference (EUSIPCO)}, month = sep, year = {2019}, pages = {1--5}, address = {A Coru\~{n}a, Spain}, url = {https://ieeexplore.ieee.org/document/8903091}, preprint = {https://hal.telecom-paristech.fr/hal-02288063}, doi = {10.23919/EUSIPCO.2019.8903091} }
ICASSP
A Deep Generative Model of Speech Complex Spectrograms

Aditya Arie Nugraha, Kouhei Sekiguchi, and Kazuyoshi Yoshii

In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019

This paper proposes an approach to the joint modeling of the short-time Fourier transform magnitude and phase spectrograms with a deep generative model. We assume that the magnitude follows a Gaussian distribution and the phase follows a von Mises distribution. To improve the consistency of the phase values in the time-frequency domain, we also apply the von Mises distribution to the phase derivatives, i.e., the group delay and the instantaneous frequency. Based on these assumptions, we explore and compare several combinations of loss functions for training our models. Built upon the variational autoencoder framework, our model consists of three convolutional neural networks acting as an encoder, a magnitude decoder, and a phase decoder. In addition to the latent variables, we propose to also condition the phase estimation on the estimated magnitude. Evaluated for a time-domain speech reconstruction task, our models could generate speech with a high perceptual quality and a high intelligibility.
@inproceedings{nugraha2019icassp, author = {Nugraha, Aditya Arie and Sekiguchi, Kouhei and Yoshii, Kazuyoshi}, title = {A Deep Generative Model of Speech Complex Spectrograms}, booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, month = may, year = {2019}, pages = {905--909}, address = {Brighton, UK}, url = {https://ieeexplore.ieee.org/document/8682797}, preprint = {https://arxiv.org/abs/1903.03269}, doi = {10.1109/ICASSP.2019.8682797} }
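To make the von Mises assumption in the abstract above more concrete, the sketch below evaluates a von Mises negative log-likelihood on phase values and on their differences along time. It is a minimal illustration under stated assumptions: the concentration `kappa` is held fixed so the normalizing constant is dropped, the helper names are hypothetical, and the paper's actual loss combinations are not reproduced here.

```python
import math


def von_mises_loss(target, predicted, kappa=1.0):
    """Mean von Mises negative log-likelihood with the normalizer dropped.

    With a fixed concentration kappa (an assumption of this sketch), the
    loss reduces to -kappa * cos(target - predicted), which is periodic
    and therefore insensitive to 2*pi phase wrapping.
    """
    if len(target) != len(predicted) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(math.cos(t - p) for t, p in zip(target, predicted))
    return -kappa * total / len(target)


def time_differences(phases):
    """First-order phase differences along time, a discrete stand-in for
    the instantaneous frequency."""
    return [b - a for a, b in zip(phases, phases[1:])]


# Toy phase track (radians) and a copy shifted by a full turn.
target = [0.1, 1.2, 2.9, -3.0]
wrapped = [p + 2.0 * math.pi for p in target]

phase_loss = von_mises_loss(target, wrapped)
derivative_loss = von_mises_loss(time_differences(target),
                                 time_differences(wrapped))
```

The same cosine-based loss is applied to the raw phase and to its differences, which mirrors the idea of constraining the phase derivatives as well as the phase itself.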
EUSIPCO
Fast Multichannel Source Separation Based on Jointly Diagonalizable Spatial Covariance Matrices

Kouhei Sekiguchi, Aditya Arie Nugraha, Yoshiaki Bando, and Kazuyoshi Yoshii

In Proceedings of European Signal Processing Conference (EUSIPCO), Sep 2019

This paper describes a versatile method that accelerates multichannel source separation methods based on full-rank spatial modeling. A popular approach to multichannel source separation is to integrate a spatial model with a source model for estimating the spatial covariance matrices (SCMs) and power spectral densities (PSDs) of each sound source in the time-frequency domain. One of the most successful examples of this approach is multichannel nonnegative matrix factorization (MNMF) based on a full-rank spatial model and a low-rank source model. MNMF, however, is computationally expensive and often works poorly due to the difficulty of estimating the unconstrained full-rank SCMs. Instead of restricting the SCMs to rank-1 matrices with the severe loss of the spatial modeling ability as in independent low-rank matrix analysis (ILRMA), we restrict the SCMs of each frequency bin to jointly-diagonalizable but still full-rank matrices. For such a fast version of MNMF, we propose a computationally-efficient and convergence-guaranteed algorithm that is similar in form to that of ILRMA. Similarly, we propose a fast version of a state-of-the-art speech enhancement method based on a deep speech model and a low-rank noise model. Experimental results showed that the fast versions of MNMF and the deep speech enhancement method were several times faster and performed even better than the original versions of those methods, respectively.
@inproceedings{sekiguchi2019eusipco, author = {Sekiguchi, Kouhei and Nugraha, Aditya Arie and Bando, Yoshiaki and Yoshii, Kazuyoshi}, title = {Fast Multichannel Source Separation Based on Jointly Diagonalizable Spatial Covariance Matrices}, booktitle = {Proceedings of European Signal Processing Conference (EUSIPCO)}, month = sep, year = {2019}, pages = {1--5}, address = {A Coru\~{n}a, Spain}, url = {https://ieeexplore.ieee.org/document/8902557}, preprint = {https://arxiv.org/abs/1903.03237}, doi = {10.23919/EUSIPCO.2019.8902557} }

2018

Deep Neural Network Based Multichannel Audio Source Separation

Aditya Arie Nugraha, Antoine Liutkus, and Emmanuel Vincent

2018

This chapter presents a multichannel audio source separation framework where deep neural networks (DNNs) are used to model the source spectra and combined with the classical multichannel Gaussian model to exploit the spatial information. The parameters are estimated in an iterative expectation-maximization (EM) fashion and used to derive a multichannel Wiener filter. Different design choices and their impact on the performance are discussed. They include the cost functions for DNN training, the number of parameter updates, the use of multiple DNNs, and the use of weighted parameter updates. Finally, we present its application to a speech enhancement task and a music separation task. The experimental results show the benefit of the multichannel DNN-based approach over a single-channel DNN-based approach and the multichannel nonnegative matrix factorization based iterative EM framework.
@inbook{nugraha2018ass, author = {Nugraha, Aditya Arie and Liutkus, Antoine and Vincent, Emmanuel}, title = {Deep Neural Network Based Multichannel Audio Source Separation}, editor = {Makino, Shoji}, booktitle = {Audio Source Separation}, year = {2018}, publisher = {Springer}, address = {Cham}, pages = {157--185}, doi = {10.1007/978-3-319-73031-8_7}, url = {https://doi.org/10.1007/978-3-319-73031-8_7}, preprint = {https://hal.inria.fr/hal-01633858} }

2017

CSL
An analysis of environment, microphone and data simulation mismatches in robust speech recognition

Emmanuel Vincent, Shinji Watanabe, Aditya Arie Nugraha, Jon Barker, and Ricard Marxer

Computer Speech & Language, Nov 2017

ISCA Award for the Best Review Paper published in Computer Speech and Language (2016-2020)

Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings where the acoustic conditions of the training data match (or cover) those of the test data. Few studies have systematically assessed the impact of acoustic mismatches between training and test data, especially concerning recent speech enhancement and state-of-the-art ASR techniques. In this article, we study this issue in the context of the CHiME-3 dataset, which consists of sentences spoken by talkers situated in challenging noisy environments recorded using a 6-channel tablet based microphone array. We provide a critical analysis of the results published on this dataset for various signal enhancement, feature extraction, and ASR backend techniques and perform a number of new experiments in order to separately assess the impact of different noise environments, different numbers and positions of microphones, or simulated vs. real data on speech enhancement and ASR performance. We show that, with the exception of minimum variance distortionless response (MVDR) beamforming, most algorithms perform consistently on real and simulated data and can benefit from training on simulated data. We also find that training on different noise environments and different microphones barely affects the ASR performance, especially when several environments are present in the training data: only the number of microphones has a significant impact. Based on these results, we introduce the CHiME-4 Speech Separation and Recognition Challenge, which revisits the CHiME-3 dataset and makes it more challenging by reducing the number of microphones available for testing.
@article{vincent2017csl, title = {An analysis of environment, microphone and data simulation mismatches in robust speech recognition}, author = {Vincent, Emmanuel and Watanabe, Shinji and Nugraha, Aditya Arie and Barker, Jon and Marxer, Ricard}, journal = {Computer Speech \& Language}, year = {2017}, month = nov, volume = {46}, number = {9}, pages = {535--557}, url = {http://www.sciencedirect.com/science/article/pii/S0885230816301231}, preprint = {https://hal.inria.fr/hal-01399180}, doi = {10.1016/j.csl.2016.11.005}, }

2016

TASLP
Multichannel audio source separation with deep neural networks

Aditya Arie Nugraha, Antoine Liutkus, and Emmanuel Vincent

IEEE/ACM Transactions on Audio, Speech, and Language Processing, Sep 2016

6th IEEE Signal Processing Society (SPS) Japan Young Author Best Paper Award

This article addresses the problem of multichannel audio source separation. We propose a framework where deep neural networks (DNNs) are used to model the source spectra and combined with the classical multichannel Gaussian model to exploit the spatial information. The parameters are estimated in an iterative expectation-maximization (EM) fashion and used to derive a multichannel Wiener filter. We present an extensive experimental study to show the impact of different design choices on the performance of the proposed technique. We consider different cost functions for the training of DNNs, namely the probabilistically motivated Itakura-Saito divergence, and also Kullback-Leibler, Cauchy, mean squared error, and phase-sensitive cost functions. We also study the number of EM iterations and the use of multiple DNNs, where each DNN aims to improve the spectra estimated by the preceding EM iteration. Finally, we present its application to a speech enhancement problem. The experimental results show the benefit of the proposed multichannel approach over a single-channel DNN-based approach and the conventional multichannel nonnegative matrix factorization based iterative EM algorithm.
@article{nugraha2016massdnn, title = {Multichannel audio source separation with deep neural networks}, author = {Nugraha, Aditya Arie and Liutkus, Antoine and Vincent, Emmanuel}, journal = {{IEEE/ACM} Transactions on Audio, Speech, and Language Processing}, year = {2016}, month = sep, volume = {24}, number = {9}, pages = {1652--1664}, url = {http://ieeexplore.ieee.org/document/7492604}, preprint = {https://hal.inria.fr/hal-01163369}, doi = {10.1109/TASLP.2016.2580946}, }
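For readers unfamiliar with the multichannel Wiener filter mentioned in the abstract above, its usual form under the multichannel Gaussian model can be written as follows. The notation is an assumption of this note rather than taken from the listing: v_j(f,n) is the power spectral density of source j at frequency bin f and time frame n, R_j(f) is its spatial covariance matrix, and x(f,n) is the vector of mixture STFT coefficients.

```latex
\widehat{\mathbf{c}}_j(f,n)
  = v_j(f,n)\,\mathbf{R}_j(f)
    \Bigl(\sum_{j'} v_{j'}(f,n)\,\mathbf{R}_{j'}(f)\Bigr)^{-1}
    \mathbf{x}(f,n)
```

In a framework of this kind, the DNNs supply or refine the spectral parameters v_j(f,n), while the spatial covariance matrices R_j(f) are re-estimated in the EM iterations.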
EUSIPCO
Multichannel music separation with deep neural networks

Aditya Arie Nugraha, Antoine Liutkus, and Emmanuel Vincent

In Proceedings of European Signal Processing Conference (EUSIPCO), Aug 2016

This article addresses the problem of multichannel music separation. We propose a framework where the source spectra are estimated using deep neural networks and combined with spatial covariance matrices to encode the source spatial characteristics. The parameters are estimated in an iterative expectation-maximization fashion and used to derive a multichannel Wiener filter. We evaluate the proposed framework for the task of music separation on a large dataset. Experimental results show that the method we describe performs consistently well in separating singing voice and other instruments from realistic musical mixtures.
@inproceedings{nugraha2016eusipco, author = {Nugraha, Aditya Arie and Liutkus, Antoine and Vincent, Emmanuel}, title = {Multichannel music separation with deep neural networks}, booktitle = {Proceedings of European Signal Processing Conference (EUSIPCO)}, month = aug, year = {2016}, pages = {1748--1752}, address = {Budapest, Hungary}, url = {https://ieeexplore.ieee.org/document/7760548}, preprint = {https://hal.inria.fr/hal-01334614}, doi = {10.1109/EUSIPCO.2016.7760548} }

2015

ASRU
Robust ASR using neural network based speech enhancement and feature simulation

Sunit Sivasankaran, Aditya Arie Nugraha, Emmanuel Vincent, Juan Andrés Morales Cordovilla, Siddharth Dalmia, Irina Illina, and Antoine Liutkus

In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec 2015

We consider the problem of robust automatic speech recognition (ASR) in the context of the CHiME-3 Challenge. The proposed system combines three contributions. First, we propose a deep neural network (DNN) based multichannel speech enhancement technique, where the speech and noise spectra are estimated using a DNN based regressor and the spatial parameters are derived in an expectation-maximization (EM) like fashion. Second, a conditional restricted Boltzmann machine (CRBM) model is trained using the obtained enhanced speech and used to generate simulated training and development datasets. The goal is to increase the similarity between simulated and real data, so as to increase the benefit of multicondition training. Finally, we make some changes to the ASR backend. Our system ranked 4th among 25 entries.
@inproceedings{sivasankaran2015asru, author = {Sivasankaran, Sunit and Nugraha, Aditya Arie and Vincent, Emmanuel and Cordovilla, Juan Andrés Morales and Dalmia, Siddharth and Illina, Irina and Liutkus, Antoine}, title = {Robust {ASR} using neural network based speech enhancement and feature simulation}, booktitle = {Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)}, year = {2015}, month = dec, pages = {482--489}, address = {Scottsdale, USA}, url = {https://ieeexplore.ieee.org/document/7404834}, preprint = {https://hal.inria.fr/hal-01204553}, doi = {10.1109/ASRU.2015.7404834} }

2014

ASMP
Single-channel dereverberation by feature mapping using cascade neural networks for robust distant speaker identification and speech recognition

Aditya Arie Nugraha, Kazumasa Yamamoto, and Seiichi Nakagawa

EURASIP Journal on Audio, Speech, and Music Processing, Apr 2014

We present a feature enhancement method that uses neural networks (NNs) to map the reverberant feature in a log-melspectral domain to its corresponding anechoic feature. The mapping is done by cascade NNs trained using the Cascade2 algorithm with an implementation of segment-based normalization. Experiments using speaker identification (SID) and automatic speech recognition (ASR) systems were conducted to evaluate the method. The SID experiments were conducted using our own simulated and real reverberant datasets, while the CENSREC-4 evaluation framework was used to evaluate the ASR system. The proposed method could remarkably improve the performance of both systems by using limited stereo data and low speaker-variant data as the training data. In the SID evaluation, we achieved error rate reductions (ERRs) of 26.0% and 34.8% relative to the baseline on simulated and real data, respectively, using only one pair of utterances for matched condition cases. Then, using a combined dataset containing 15 pairs of utterances by one speaker from three positions in a room, we could reach an average identification rate of 93.7% (three known and two unknown positions), which corresponds to an ERR of 42.2% relative to the use of cepstral mean normalization (CMN). In the ASR evaluation, using 40 pairs of utterances as the NN training data, we could reach an ERR of 78.4% relative to the baseline on simulated utterances by five speakers. Moreover, we could reach ERRs of 75.4% and 71.6% relative to the baseline on real utterances by five speakers and one speaker, respectively.
@article{nugraha2014dereverb, title = {Single-channel dereverberation by feature mapping using cascade neural networks for robust distant speaker identification and speech recognition}, author = {Nugraha, Aditya Arie and Yamamoto, Kazumasa and Nakagawa, Seiichi}, journal = {{EURASIP} Journal on Audio, Speech, and Music Processing}, year = {2014}, month = apr, volume = {2014}, number = {13}, pages = {1--31}, url = {https://link.springer.com/article/10.1186/1687-4722-2014-13}, doi = {10.1186/1687-4722-2014-13} }
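The error rate reduction (ERR) figures in the abstract above can be related to identification rates with a short calculation. The sketch below assumes that ERR is the relative reduction of the identification error with respect to a baseline, which the abstract does not spell out. The function names are hypothetical, and the implied CMN error rate is an inference from the quoted 93.7% and 42.2%, not a number reported in the listing.

```python
def error_rate_reduction(baseline_error, proposed_error):
    """Relative error rate reduction, as a fraction of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - proposed_error) / baseline_error


def implied_baseline_error(proposed_error, err):
    """Baseline error implied by the proposed system's error and an ERR."""
    if not 0 <= err < 1:
        raise ValueError("err must lie in [0, 1)")
    return proposed_error / (1.0 - err)


# Figures quoted in the abstract: 93.7% identification rate, 42.2% ERR vs. CMN.
proposed_error = 100.0 - 93.7  # identification error in percent
baseline_error = implied_baseline_error(proposed_error, 0.422)
print(round(baseline_error, 1))  # about 10.9 (percent)
```

Under this assumed definition, the CMN baseline would have an identification error of roughly 10.9%, that is, an identification rate of roughly 89.1%.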

2013

APSIPA
Single channel dereverberation method in logmelspectral domain using limited stereo data for distant speaker identification

Aditya Arie Nugraha, Kazumasa Yamamoto, and Seiichi Nakagawa

In Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Oct 2013

In this paper, we present a feature enhancement method that uses neural networks (NNs) to map the reverberant feature in a log-melspectral domain to its corresponding anechoic feature. The mapping is done by cascade NNs trained using the Cascade2 algorithm with an implementation of segment-based normalization. We assumed that the feature dimensions were independent of each other and experimented with several assumptions about the room transfer function for each dimension. A speaker identification system was used to evaluate the method. Using limited stereo data, we could improve the identification rate for simulated and real datasets. On the simulated dataset, we could show that the proposed method is effective for both noiseless and noisy reverberant environments, with various noise and reverberation characteristics. On the real dataset, we could show that, by using a configuration of 6 independent NNs for the 24-dimensional feature and only 1 pair of utterances, we could get a 35% average error reduction relative to the baseline, which employed cepstral mean normalization (CMN).
@inproceedings{nugraha2013dereverb, author = {Nugraha, Aditya Arie and Yamamoto, Kazumasa and Nakagawa, Seiichi}, title = {Single channel dereverberation method in logmelspectral domain using limited stereo data for distant speaker identification}, booktitle = {Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)}, year = {2013}, month = oct, pages = {1--4}, address = {Kaohsiung, Taiwan}, }
SP/IPSJ-SLP
Single Channel Dereverberation Method by Feature Mapping Using Limited Stereo Data

Aditya Arie Nugraha, Kazumasa Yamamoto, and Seiichi Nakagawa

Jul 2013

In this paper, we present a feature enhancement method that uses neural networks (NNs) to map the reverberant feature in a log-melspectral domain to its corresponding anechoic feature. The mapping is done by cascade NNs trained using the Cascade2 algorithm with an implementation of segment-based normalization. Experiments using speaker identification (SID) and automatic speech recognition (ASR) systems were conducted to evaluate the method. The SID experiments were conducted using real noisy reverberant datasets, while the CENSREC-4 evaluation framework was used to evaluate the ASR system. Using limited stereo data consisting of simultaneously recorded clean speech and reverberant speech, the proposed method could remarkably improve the performance of both systems.
@techreport{nugraha2013dereverc, author = {Nugraha, Aditya Arie and Yamamoto, Kazumasa and Nakagawa, Seiichi}, title = {Single Channel Dereverberation Method by Feature Mapping Using Limited Stereo Data}, institution = {Institute of Electronics, Information and Communication Engineers (IEICE)}, year = {2013}, month = jul, key = {S13-54}, volume = {113}, number = {161}, pages = {7--12}, url = {https://www.ieice.org/ken/paper/20130725FB4S/eng/}, }

2012

ASJ
Improving distant speaker identification robustness using a nonlinear regression based dereverberation method in feature domain

Aditya Arie Nugraha and Seiichi Nakagawa

In Proceedings of the Autumn Meeting of Acoustical Society of Japan, Sep 2012

Using reverberated speech captured by a distant-talking microphone as the input to a speaker identification system degrades its performance. In this paper, we present a single-channel nonlinear-regression-based dereverberation method that works in the feature domain. Artificial neural networks were trained using the Cascade 2 algorithm on stereo data to compensate for the reverberation effect by mapping the reverberated signal to the clean signal on 24-dimensional log-melspectral features. We also employ segment-level normalization to compensate for the power difference between the clean signal and the reverberated signal. Using the proposed method, we could enhance the signal and improve the identification rate of a distant speaker identification system.
@inproceedings{nugraha2012speaker, author = {Nugraha, Aditya Arie and Nakagawa, Seiichi}, title = {Improving distant speaker identification robustness using a nonlinear regression based dereverberation method in feature domain}, booktitle = {Proceedings of the Autumn Meeting of Acoustical Society of Japan}, year = {2012}, month = sep, pages = {163--166} }

2011

TSSA
Performance evaluation of audio-video streaming service in Keerom, Papua using integrated audio-video performance test tool

Yudi Satria Gondokaryono, Yoanes Bandung, Joko Ari Wibowo, Aditya Arie Nugraha, Bryan Yonathan, and Dwi Ramadhianto

In Proceedings of International Conference on Telecommunication Systems, Services, and Applications (TSSA), Oct 2011

This study compared several video codecs, audio codecs, audio bit rates, and video bit rates to determine the quality of the audio-video streaming service on the network in Keerom, Papua. The average capacity of this network is 1.5 Mbps. MPEG audio and AC3 were chosen as the audio codecs because of their characteristics, while MPEG-4 and H.264 were used as the video codecs. The audio bit rates were 64 and 128 kbps, while the video bit rates were 64, 128, and 256 kbps. The experimental results show that the quality of the audio-video streaming service was better with the combination of MPEG audio at 64 kbps and MPEG-4 video at 256 kbps. The test results will be used as a reference for the later implementation of the audio-video streaming service on the network in Keerom, Papua.
@inproceedings{gondokaryono2011avstreaming, author = {Gondokaryono, Yudi Satria and Bandung, Yoanes and Wibowo, Joko Ari and Nugraha, Aditya Arie and Yonathan, Bryan and Ramadhianto, Dwi}, title = {Performance evaluation of audio-video streaming service in Keerom, Papua using integrated audio-video performance test tool}, booktitle = {Proceedings of International Conference on Telecommunication Systems, Services, and Applications (TSSA)}, year = {2011}, month = oct, pages = {145--148}, address = {Denpasar, Indonesia}, url = {https://ieeexplore.ieee.org/document/6095423}, doi = {10.1109/TSSA.2011.6095423} }

2010

AEEI
Web based multimedia conference system for digital learning in rural elementary school

Aska Narendra, Aditya Arie Nugraha, Yoanes Bandung, Armein Z. R. Langi, and Bambang Pharmasetiawan

Advances in Electrical Engineering and Informatics, 2010

This paper describes the process of designing a web-based multimedia conferencing system to support digital learning for elementary schools in rural areas and of implementing it in several network testbeds in Bandung, Subang, and Cianjur. The system must be able to send each of the constituent media, namely video, audio, and other materials (e.g., slide presentations), independently so that the learning process between student and teacher can continue even if one of the media is absent. In addition, the multimedia conferencing system must be easy to operate independently by an elementary school teacher in a rural area who has only minimal computer skills. The result is a product that is expected to be useful for improving the quality of primary education, especially in rural areas, through ICT applications.
@article{narendra2010multimedia, author = {Narendra, Aska and Nugraha, Aditya Arie and Bandung, Yoanes and Langi, Armein Z. R. and Pharmasetiawan, Bambang}, title = {Web based multimedia conference system for digital learning in rural elementary school}, journal = {Advances in Electrical Engineering and Informatics}, year = {2010}, volume = {III}, pages = {97--104}, url = {https://www.researchgate.net/publication/261439213_Web_Based_Multimedia_Conference_System_for_Digital_Learning_in_Rural_Elementary_School} }