Aditya Arie Nugraha

Kyoto University Artificial Intelligence Research Unit,

Dr. Ichikawa Commemorative Laboratory, Room #202,

Yoshida-honmachi, Sakyo-ku, Kyoto, 606-8501 JAPAN

I am a research scientist in the Sound Scene Understanding Team, Center for Advanced Intelligence Project (AIP), RIKEN, and a visiting researcher in the Speech and Audio Processing Group, Kyoto University.

I received a Doctorate in Informatics from the University of Lorraine, France, for doctoral research on multichannel audio source separation based on deep neural networks, conducted at Inria Nancy – Grand-Est, France, under the supervision of Dr. Antoine Liutkus and Dr. Emmanuel Vincent. The doctoral thesis covers applications of our separation methods to various tasks, including speech enhancement, singing voice separation, and musical instrument separation.

My current research interests include audio source separation, audio-visual scene understanding, and machine learning.

news

Jun 13, 2025	We proudly showcased our AV-SUARA system during a live demonstration at the IPSJ Otogaku Symposium 2025 at Waseda University.
Oct 24, 2023	Our paper “Time-Domain Audio Source Separation Based on Gaussian Processes with Deep Kernel Learning” was presented at IEEE WASPAA 2023.
Aug 20, 2023	Our team provided a tutorial entitled “Foundations, Extensions and Applications of Statistical Multichannel Speech Separation Models” at Interspeech 2023.
Jun 8, 2023	Our paper “Exploiting Sparse Recovery Algorithms for Semi-Supervised Training of Deep Neural Networks for Direction-of-Arrival Estimation” was presented at IEEE ICASSP 2023.
Oct 26, 2022	Our paper “Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments” was presented at IEEE/RSJ IROS 2022.
Sep 21, 2022	Our paper “Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments” was presented at Interspeech 2022.
Sep 7, 2022	We presented two papers at IWAENC 2022: ① “DNN-Free Low-Latency Adaptive Speech Enhancement Based on Frame-Online Beamforming Powered by Block-Online FastMNMF” and ② “Joint Localization and Synchronization of Distributed Camera-Attached Microphone Arrays for Indoor Scene Analysis”.
May 7, 2022	Our Sound Scene Understanding Team presented two papers at IEEE ICASSP 2022: ① “Flow-Based Fast Multichannel Nonnegative Matrix Factorization for Blind Source Separation” and ② “Neural Full-Rank Spatial Covariance Analysis for Blind Source Separation”.
Apr 28, 2022	Our article “Generalized Fast Multichannel Nonnegative Matrix Factorization Based on Gaussian Scale Mixtures for Blind Source Separation” has been accepted for publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing. It is now available on IEEE Xplore.
Apr 1, 2022	I’m happy to share that I’m starting a new position as Research Scientist (研究員) at RIKEN!

selected publications

WASPAA
Time-Domain Audio Source Separation Based on Gaussian Processes with Deep Kernel Learning

Aditya Arie Nugraha, Diego Di Carlo, Yoshiaki Bando, Mathieu Fontaine, and Kazuyoshi Yoshii

In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023

This paper revisits single-channel audio source separation based on a probabilistic generative model of a mixture signal defined in the continuous time domain. We assume that each source signal follows a non-stationary Gaussian process (GP), i.e., any finite set of sampled points follows a zero-mean multivariate Gaussian distribution whose covariance matrix is governed by a kernel function over time-varying latent variables. The mixture signal composed of such source signals thus follows a GP whose covariance matrix is given by the sum of the source covariance matrices. To estimate the latent variables from the mixture signal, we use a deep neural network with an encoder-separator-decoder architecture (e.g., Conv-TasNet) that separates the latent variables in a pseudo-time-frequency space. The key feature of our method is to feed the latent variables into the kernel function for estimating the source covariance matrices, instead of using the decoder for directly estimating the time-domain source signals. This enables the decomposition of a mixture signal into the source signals with a classical yet powerful Wiener filter that considers the full covariance structure over all samples. The kernel function and the network are trained jointly in the maximum likelihood framework. Comparative experiments using two-speech mixtures under clean, noisy, and noisy-reverberant conditions from the WSJ0-2mix, WHAM!, and WHAMR! benchmark datasets demonstrated that the proposed method performed well and outperformed the baseline method under noisy and noisy-reverberant conditions.
@inproceedings{nugraha2023gpdkl, selected = {true}, abbr = {WASPAA}, bibtex_show = {true}, author = {Nugraha, Aditya Arie and Di Carlo, Diego and Bando, Yoshiaki and Fontaine, Mathieu and Yoshii, Kazuyoshi}, title = {Time-Domain Audio Source Separation Based on Gaussian Processes with Deep Kernel Learning}, booktitle = {Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}, year = {2023}, month = oct, pages = {1--5}, address = {New Paltz, NY, USA}, url = {https://ieeexplore.ieee.org/document/10248168}, html = {https://ieeexplore.ieee.org/document/10248168}, preprint = {https://hal.science/hal-04172863}, doi = {10.1109/WASPAA58266.2023.10248168} }
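The Wiener filtering step mentioned in the abstract above can be sketched in a few lines of NumPy. This is only a toy illustration, not the code from the paper: the stationary RBF kernels, their hand-picked length scales, and the names `rbf_kernel` and `gp_wiener_separate` are all assumptions made for this example, whereas the paper uses non-stationary kernels driven by latent variables estimated with a DNN. When the mixture is a sum of zero-mean Gaussian sources with covariances K_i, the posterior mean of each source is K_i (sum_j K_j)^{-1} x.

```python
import numpy as np

def rbf_kernel(t, lengthscale, variance=1.0):
    """Stationary RBF kernel matrix over time stamps t (illustrative choice)."""
    diff = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_wiener_separate(mixture, source_covs):
    """Posterior means E[s_i | x] = K_i (sum_j K_j)^{-1} x for x = sum_j s_j."""
    mixture_cov = sum(source_covs)                   # covariance of the mixture
    weights = np.linalg.solve(mixture_cov, mixture)  # (sum_j K_j)^{-1} x
    return [cov @ weights for cov in source_covs]

# Toy mixture: a slow component, a fast component, and white noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
slow = np.sin(2 * np.pi * 1.0 * t)
fast = 0.5 * np.sin(2 * np.pi * 40.0 * t)
noise = 0.1 * rng.standard_normal(t.size)
x = slow + fast + noise

# Hand-picked covariances (assumptions); the noise is treated as a third source.
covs = [
    rbf_kernel(t, lengthscale=0.2),                  # slowly varying source
    rbf_kernel(t, lengthscale=0.01, variance=0.25),  # rapidly varying source
    0.01 * np.eye(t.size),                           # white noise
]
slow_hat, fast_hat, noise_hat = gp_wiener_separate(x, covs)
```

Because the filter uses the full covariance over all samples, the separated components always add back up to the observed mixture.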
IROS
Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments

Kouhei Sekiguchi*, Aditya Arie Nugraha*, Yicheng Du, Yoshiaki Bando, Mathieu Fontaine, and Kazuyoshi Yoshii

In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022

This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset that helps a user understand conversations made in real noisy echoic environments (e.g., cocktail party). One may use a state-of-the-art blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF) that works well in various environments thanks to its unsupervised nature. Its heavy computational cost, however, prevents its application to real-time processing. In contrast, a supervised beamforming method that uses a deep neural network (DNN) for estimating spatial information of speech and noise readily fits real-time processing, but suffers from drastic performance degradation in mismatched conditions. Given such complementary characteristics, we propose a dual-process robust online speech enhancement method based on DNN-based beamforming with FastMNMF-guided adaptation. FastMNMF (back end) is performed in a mini-batch style and the noisy and enhanced speech pairs are used together with the original parallel training data for updating the direction-aware DNN (front end) with backpropagation at a computationally-allowable interval. This method is used with a blind dereverberation method called weighted prediction error (WPE) for transcribing the noisy reverberant speech of a speaker, which can be detected from video or selected by a user’s hand gesture or eye gaze, in a streaming manner and spatially showing the transcriptions with an AR technique. Our experiment showed that the word error rate was improved by more than 10 points with the run-time adaptation using only twelve minutes of observation.
@inproceedings{sekiguchi2022directionaware, selected = {true}, abbr = {IROS}, bibtex_show = {true}, author = {Sekiguchi*, Kouhei and Nugraha*, Aditya Arie and Du, Yicheng and Bando, Yoshiaki and Fontaine, Mathieu and Yoshii, Kazuyoshi}, title = {Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments}, booktitle = {Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, year = {2022}, month = oct, pages = {9266--9273}, address = {Kyoto, Japan}, url = {https://ieeexplore.ieee.org/document/9981659}, html = {https://ieeexplore.ieee.org/document/9981659}, preprint = {https://arxiv.org/abs/2207.07296}, doi = {10.1109/IROS47612.2022.9981659} }
IWAENC
DNN-Free Low-Latency Adaptive Speech Enhancement Based on Frame-Online Beamforming Powered by Block-Online FastMNMF

Aditya Arie Nugraha, Kouhei Sekiguchi, Mathieu Fontaine, Yoshiaki Bando, and Kazuyoshi Yoshii

In Proceedings of International Workshop on Acoustic Signal Enhancement (IWAENC), 2022

This paper describes a practical dual-process speech enhancement system that adapts environment-sensitive frame-online beamforming (front-end) with help from environment-free block-online source separation (back-end). To use minimum variance distortionless response (MVDR) beamforming, one may train a deep neural network (DNN) that estimates time-frequency masks used for computing the covariance matrices of sources (speech and noise). Backpropagation-based run-time adaptation of the DNN was proposed for dealing with the mismatched training-test conditions. Instead, one may try to directly estimate the source covariance matrices with a state-of-the-art blind source separation method called fast multichannel non-negative matrix factorization (FastMNMF). In practice, however, neither the DNN nor the FastMNMF can be updated in a frame-online manner due to their computationally-expensive iterative nature. Our DNN-free system leverages the posteriors of the latest source spectrograms given by block-online FastMNMF to derive the current source covariance matrices for frame-online beamforming. The evaluation shows that our frame-online system can quickly respond to scene changes caused by interfering speaker movements and outperformed an existing block-online system with DNN-based beamforming by 5.0 points in terms of the word error rate.
@inproceedings{nugraha2022dnnfree, selected = {true}, abbr = {IWAENC}, bibtex_show = {true}, author = {Nugraha, Aditya Arie and Sekiguchi, Kouhei and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi}, title = {DNN-Free Low-Latency Adaptive Speech Enhancement Based on Frame-Online Beamforming Powered by Block-Online FastMNMF}, booktitle = {Proceedings of International Workshop on Acoustic Signal Enhancement (IWAENC)}, year = {2022}, month = sep, pages = {1--5}, address = {Bamberg, Germany}, url = {https://ieeexplore.ieee.org/document/9914729}, html = {https://ieeexplore.ieee.org/document/9914729}, preprint = {https://arxiv.org/abs/2207.10934}, doi = {10.1109/IWAENC53105.2022.9914729} }
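For readers unfamiliar with MVDR beamforming, the step of turning speech and noise covariance matrices into beamformer weights can be sketched as follows. This is the textbook formulation for a single frequency bin, not the system described in the abstract above: taking the steering vector as the principal eigenvector of the speech covariance is one common choice and an assumption here, as are the function name `mvdr_weights` and the random toy covariances.

```python
import numpy as np

def mvdr_weights(speech_cov, noise_cov):
    """Return MVDR weights w = R_n^{-1} h / (h^H R_n^{-1} h) and the steering vector h."""
    # Assumption: steering vector = principal eigenvector of the speech covariance.
    _, eigvecs = np.linalg.eigh(speech_cov)  # eigenvalues in ascending order
    steering = eigvecs[:, -1]
    rn_inv_h = np.linalg.solve(noise_cov, steering)
    weights = rn_inv_h / (steering.conj() @ rn_inv_h)
    return weights, steering

# Toy single-bin example with 4 microphones (all values are synthetic).
rng = np.random.default_rng(0)
n_mics = 4
h_true = rng.standard_normal(n_mics) + 1j * rng.standard_normal(n_mics)
speech_cov = 2.0 * np.outer(h_true, h_true.conj())  # rank-1 speech covariance
noise_frames = rng.standard_normal((n_mics, 100)) + 1j * rng.standard_normal((n_mics, 100))
noise_cov = noise_frames @ noise_frames.conj().T / 100 + 0.1 * np.eye(n_mics)

w, h = mvdr_weights(speech_cov, noise_cov)
response = np.vdot(w, h)  # w^H h, the beamformer response toward the target
```

The distortionless constraint means `response` equals 1 up to rounding, so the target direction passes unchanged while noise power is minimized.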
ICASSP
Flow-Based Fast Multichannel Nonnegative Matrix Factorization for Blind Source Separation

Aditya Arie Nugraha, Kouhei Sekiguchi, Mathieu Fontaine, Yoshiaki Bando, and Kazuyoshi Yoshii

In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

This paper describes a blind source separation method for multichannel audio signals, called NF-FastMNMF, based on the integration of the normalizing flow (NF) into the multichannel nonnegative matrix factorization with jointly-diagonalizable spatial covariance matrices, a.k.a. FastMNMF. Whereas the NF of flow-based independent vector analysis, called NF-IVA, acts as the demixing matrices to transform an M-channel mixture into M independent sources, the NF of NF-FastMNMF acts as the diagonalization matrices to transform an M-channel mixture into a spatially-independent M-channel mixture represented as a weighted sum of N source images. This diagonalization enables the NF, which has been used only for determined separation because of its bijective nature, to be applicable to non-determined separation. NF-FastMNMF has time-varying diagonalization matrices that are potentially better at handling dynamical data variation than the time-invariant ones in FastMNMF. To have an NF with richer expression capability, the dimension-wise scalings using diagonal matrices originally used in NF-IVA are replaced with linear transformations using upper triangular matrices; in both cases, the diagonal and upper triangular matrices are estimated by neural networks. The evaluation shows that NF-FastMNMF performs well for both determined and non-determined separations of multiple speech utterances by stationary or non-stationary speakers from a noisy reverberant mixture.
@inproceedings{nugraha2022nffastmnmf, selected = {true}, abbr = {ICASSP}, bibtex_show = {true}, author = {Nugraha, Aditya Arie and Sekiguchi, Kouhei and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi}, title = {Flow-Based Fast Multichannel Nonnegative Matrix Factorization for Blind Source Separation}, booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, year = {2022}, month = may, pages = {501--505}, address = {Singapore}, url = {https://ieeexplore.ieee.org/document/9747718}, html = {https://ieeexplore.ieee.org/document/9747718}, preprint = {https://hal.archives-ouvertes.fr/hal-03637425/}, poster = {https://sigport.org/documents/flow-based-fast-multichannel-nonnegative-matrix-factorization-blind-source-separation}, doi = {10.1109/ICASSP43922.2022.9747718} }
SPL
Neural Full-Rank Spatial Covariance Analysis for Blind Source Separation

Yoshiaki Bando, Kouhei Sekiguchi, Yoshiki Masuyama, Aditya Arie Nugraha, Mathieu Fontaine, and Kazuyoshi Yoshii

IEEE Signal Processing Letters, 2021

This paper describes a neural blind source separation (BSS) method based on amortized variational inference (AVI) of a non-linear generative model of mixture signals. A classical statistical approach to BSS is to fit a linear generative model that consists of spatial and source models representing the inter-channel covariances and power spectral densities of sources, respectively. Although the variational autoencoder (VAE) has successfully been used as a non-linear source model with latent features, it should be pretrained from a sufficient amount of isolated signals. Our method, in contrast, enables the VAE-based source model to be trained only from mixture signals. Specifically, we introduce a neural mixture-to-feature inference model that directly infers the latent features from the observed mixture and integrate it with a neural feature-to-mixture generative model consisting of a full-rank spatial model and a VAE-based source model. All the models are optimized jointly such that the likelihood for the training mixtures is maximized in the framework of AVI. Once the inference model is optimized, it can be used for estimating the latent features of sources included in unseen mixture signals. The experimental results show that the proposed method outperformed the state-of-the-art BSS methods based on linear generative models and was comparable to a method based on supervised learning of the VAE-based source model.
@article{bando2021neuralfca, selected = {true}, abbr = {SPL}, bibtex_show = {true}, author = {Bando, Yoshiaki and Sekiguchi, Kouhei and Masuyama, Yoshiki and Nugraha, Aditya Arie and Fontaine, Mathieu and Yoshii, Kazuyoshi}, journal = {{IEEE} Signal Processing Letters}, title = {Neural Full-Rank Spatial Covariance Analysis for Blind Source Separation}, year = {2021}, month = aug, volume = {28}, number = {}, pages = {1670--1674}, url = {https://ieeexplore.ieee.org/document/9506855}, html = {https://ieeexplore.ieee.org/document/9506855}, pdf = {https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9506855}, doi = {10.1109/LSP.2021.3101699} }
TASLP
15th IEEE Signal Processing Society (SPS) Japan Student Journal Paper Award
Fast Multichannel Nonnegative Matrix Factorization with Directivity-Aware Jointly-Diagonalizable Spatial Covariance Matrices for Blind Source Separation

Kouhei Sekiguchi, Yoshiaki Bando, Aditya Arie Nugraha, Kazuyoshi Yoshii, and Tatsuya Kawahara

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020

This paper describes a computationally-efficient blind source separation (BSS) method based on the independence, low-rankness, and directivity of the sources. A typical approach to BSS is unsupervised learning of a probabilistic model that consists of a source model representing the time-frequency structure of source images and a spatial model representing their inter-channel covariance structure. Building upon the low-rank source model based on nonnegative matrix factorization (NMF), which has been considered to be effective for inter-frequency source alignment, multichannel NMF (MNMF) assumes source images to follow multivariate complex Gaussian distributions with unconstrained full-rank spatial covariance matrices (SCMs). An effective way of reducing the computational cost and initialization sensitivity of MNMF is to restrict the degree of freedom of SCMs. While a variant of MNMF called independent low-rank matrix analysis (ILRMA) severely restricts SCMs to rank-1 matrices under an idealized condition that only directional and less-echoic sources exist, we restrict SCMs to jointly-diagonalizable yet full-rank matrices in a frequency-wise manner, resulting in FastMNMF1. To help inter-frequency source alignment, we then propose FastMNMF2 that shares the directional feature of each source over all frequency bins. To explicitly consider the directivity or diffuseness of each source, we also propose rank-constrained FastMNMF that enables us to individually specify the ranks of SCMs. Our experiments showed the superiority of FastMNMF over MNMF and ILRMA in speech separation and the effectiveness of the rank constraint in speech enhancement.
@article{sekiguchi2020fastmnmf, selected = {true}, abbr = {TASLP}, bibtex_show = {true}, title = {Fast Multichannel Nonnegative Matrix Factorization with Directivity-Aware Jointly-Diagonalizable Spatial Covariance Matrices for Blind Source Separation}, author = {Sekiguchi, Kouhei and Bando, Yoshiaki and Nugraha, Aditya Arie and Yoshii, Kazuyoshi and Kawahara, Tatsuya}, journal = {{IEEE/ACM} Transactions on Audio, Speech, and Language Processing}, year = {2020}, month = aug, volume = {28}, number = {}, pages = {2610--2625}, url = {https://ieeexplore.ieee.org/document/9177266}, html = {https://ieeexplore.ieee.org/document/9177266}, pdf = {https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9177266}, code = {https://github.com/sekiguchi92/SoundSourceSeparation}, doi = {10.1109/TASLP.2020.3019181}, award = {15th IEEE Signal Processing Society (SPS) Japan Student Journal Paper Award} }
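The idea of "jointly-diagonalizable yet full-rank" spatial covariance matrices (SCMs) in the FastMNMF abstract above can be illustrated numerically. The sketch below assumes the parameterization R_n = Q^{-1} diag(g_n) Q^{-H} for a single frequency bin, with one shared matrix Q and a nonnegative weight vector g_n per source; the symbol names, the helper `jd_spatial_covariances`, and the random values are assumptions for illustration only and are not taken from the released code.

```python
import numpy as np

def jd_spatial_covariances(diagonalizer, source_weights):
    """Build R_n = Q^{-1} diag(g_n) Q^{-H} for every source n (one frequency bin)."""
    q_inv = np.linalg.inv(diagonalizer)
    # (q_inv * g) scales the columns of Q^{-1}, i.e. computes Q^{-1} diag(g).
    return np.stack([(q_inv * g) @ q_inv.conj().T for g in source_weights])

# Toy setup: 3 microphones, 4 sources (more sources than microphones is allowed).
rng = np.random.default_rng(0)
n_mics, n_sources = 3, 4
Q = (rng.standard_normal((n_mics, n_mics))
     + 1j * rng.standard_normal((n_mics, n_mics))
     + 3.0 * np.eye(n_mics))                          # shared diagonalizer
g = rng.uniform(0.1, 1.0, size=(n_sources, n_mics))   # positive weights per source

scms = jd_spatial_covariances(Q, g)

# Applying the same Q to every SCM yields a diagonal matrix.
diagonalized = np.stack([Q @ R @ Q.conj().T for R in scms])
```

Each SCM is full-rank because all weights are positive, yet a single Q diagonalizes all of them at once, which is what keeps the per-bin computations cheap compared with unconstrained full-rank SCMs.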
SPL
Flow-Based Independent Vector Analysis for Blind Source Separation

Aditya Arie Nugraha, Kouhei Sekiguchi, Mathieu Fontaine, Yoshiaki Bando, and Kazuyoshi Yoshii

IEEE Signal Processing Letters, 2020

This paper describes a time-varying extension of independent vector analysis (IVA) based on the normalizing flow (NF), called NF-IVA, for determined blind source separation of multichannel audio signals. As in IVA, NF-IVA estimates demixing matrices that transform mixture spectra to source spectra in the complex-valued spatial domain such that the likelihood of those matrices for the mixture spectra is maximized under some non-Gaussian source model. While IVA performs a time-invariant bijective linear transformation, NF-IVA performs a series of time-varying bijective linear transformations (flow blocks) adaptively predicted by neural networks. To regularize such transformations, we introduce a soft volume-preserving (VP) constraint. Given mixture spectra, the parameters of NF-IVA are optimized by gradient descent with backpropagation in an unsupervised manner. Experimental results show that NF-IVA successfully performs speech separation in reverberant environments with different numbers of speakers and microphones and that NF-IVA with the VP constraint outperforms NF-IVA without it, standard IVA with iterative projection, and improved IVA with gradient descent.
@article{nugraha2020nfiva, selected = {true}, abbr = {SPL}, bibtex_show = {true}, title = {Flow-Based Independent Vector Analysis for Blind Source Separation}, author = {Nugraha, Aditya Arie and Sekiguchi, Kouhei and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi}, journal = {{IEEE} Signal Processing Letters}, year = {2020}, month = {}, volume = {27}, number = {}, pages = {2173--2177}, url = {https://ieeexplore.ieee.org/document/9269436}, html = {https://ieeexplore.ieee.org/document/9269436}, pdf = {https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9269436}, doi = {10.1109/LSP.2020.3039944} }
CSL
ISCA Award for the Best Review Paper published in Computer Speech and Language (2016-2020)
An analysis of environment, microphone and data simulation mismatches in robust speech recognition

Emmanuel Vincent, Shinji Watanabe, Aditya Arie Nugraha, Jon Barker, and Ricard Marxer

Computer Speech & Language, 2017

Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings where the acoustic conditions of the training data match (or cover) those of the test data. Few studies have systematically assessed the impact of acoustic mismatches between training and test data, especially concerning recent speech enhancement and state-of-the-art ASR techniques. In this article, we study this issue in the context of the CHiME-3 dataset, which consists of sentences spoken by talkers situated in challenging noisy environments recorded using a 6-channel tablet based microphone array. We provide a critical analysis of the results published on this dataset for various signal enhancement, feature extraction, and ASR backend techniques and perform a number of new experiments in order to separately assess the impact of different noise environments, different numbers and positions of microphones, or simulated vs. real data on speech enhancement and ASR performance. We show that, with the exception of minimum variance distortionless response (MVDR) beamforming, most algorithms perform consistently on real and simulated data and can benefit from training on simulated data. We also find that training on different noise environments and different microphones barely affects the ASR performance, especially when several environments are present in the training data: only the number of microphones has a significant impact. Based on these results, we introduce the CHiME-4 Speech Separation and Recognition Challenge, which revisits the CHiME-3 dataset and makes it more challenging by reducing the number of microphones available for testing.
@article{vincent2017csl, selected = {true}, abbr = {CSL}, bibtex_show = {true}, title = {An analysis of environment, microphone and data simulation mismatches in robust speech recognition}, author = {Vincent, Emmanuel and Watanabe, Shinji and Nugraha, Aditya Arie and Barker, Jon and Marxer, Ricard}, journal = {Computer Speech \& Language}, year = {2017}, month = nov, volume = {46}, number = {9}, pages = {535--557}, url = {http://www.sciencedirect.com/science/article/pii/S0885230816301231}, html = {http://www.sciencedirect.com/science/article/pii/S0885230816301231}, preprint = {https://hal.inria.fr/hal-01399180}, doi = {10.1016/j.csl.2016.11.005}, award = {ISCA Award for the Best Review Paper published in Computer Speech and Language (2016-2020)} }
TASLP
6th IEEE Signal Processing Society (SPS) Japan Young Author Best Paper Award
Multichannel audio source separation with deep neural networks

Aditya Arie Nugraha, Antoine Liutkus, and Emmanuel Vincent

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016

This article addresses the problem of multichannel audio source separation. We propose a framework where deep neural networks (DNNs) are used to model the source spectra and combined with the classical multichannel Gaussian model to exploit the spatial information. The parameters are estimated in an iterative expectation-maximization (EM) fashion and used to derive a multichannel Wiener filter. We present an extensive experimental study to show the impact of different design choices on the performance of the proposed technique. We consider different cost functions for the training of DNNs, namely the probabilistically motivated Itakura-Saito divergence, and also Kullback-Leibler, Cauchy, mean squared error, and phase-sensitive cost functions. We also study the number of EM iterations and the use of multiple DNNs, where each DNN aims to improve the spectra estimated by the preceding EM iteration. Finally, we present its application to a speech enhancement problem. The experimental results show the benefit of the proposed multichannel approach over a single-channel DNN-based approach and the conventional multichannel nonnegative matrix factorization based iterative EM algorithm.
@article{nugraha2016massdnn, selected = {true}, abbr = {TASLP}, bibtex_show = {true}, title = {Multichannel audio source separation with deep neural networks}, author = {Nugraha, Aditya Arie and Liutkus, Antoine and Vincent, Emmanuel}, journal = {{IEEE/ACM} Transactions on Audio, Speech, and Language Processing}, year = {2016}, month = sep, volume = {24}, number = {9}, pages = {1652--1664}, url = {http://ieeexplore.ieee.org/document/7492604}, html = {http://ieeexplore.ieee.org/document/7492604}, preprint = {https://hal.inria.fr/hal-01163369}, doi = {10.1109/TASLP.2016.2580946}, award = {6th IEEE Signal Processing Society (SPS) Japan Young Author Best Paper Award} }
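Two of the training cost functions named in the abstract above, the Itakura-Saito (IS) and generalized Kullback-Leibler (KL) divergences, have simple closed forms for nonnegative spectra. The sketch below uses the standard textbook definitions summed over all time-frequency bins; the function names, the small `eps` floor, and the random toy spectra are assumptions for illustration and do not reproduce the exact training setup of the article.

```python
import numpy as np

def itakura_saito(target, estimate, eps=1e-12):
    """IS divergence: sum of x/y - log(x/y) - 1 over all bins."""
    ratio = (target + eps) / (estimate + eps)
    return float(np.sum(ratio - np.log(ratio) - 1.0))

def generalized_kl(target, estimate, eps=1e-12):
    """Generalized KL divergence: sum of x log(x/y) - x + y over all bins."""
    t = target + eps
    e = estimate + eps
    return float(np.sum(t * np.log(t / e) - t + e))

# Toy power spectrograms (frequency bins x frames), strictly positive.
rng = np.random.default_rng(0)
true_spec = rng.uniform(0.1, 2.0, size=(8, 5))
est_spec = rng.uniform(0.1, 2.0, size=(8, 5))

is_cost = itakura_saito(true_spec, est_spec)
kl_cost = generalized_kl(true_spec, est_spec)

# Same spectra rescaled by a factor of 10.
is_cost_scaled = itakura_saito(10.0 * true_spec, 10.0 * est_spec)
kl_cost_scaled = generalized_kl(10.0 * true_spec, 10.0 * est_spec)
```

The IS divergence is scale-invariant, so `is_cost_scaled` matches `is_cost`, while the KL divergence grows linearly with the scale. This is one reason the IS divergence treats low-energy and high-energy bins more evenly.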