Aditya Arie Nugraha | Sampling-Rate-Agnostic Speech Super-Resolution Based on Gaussian Process Dynamical Systems With Deep Kernel Learning

Abstract

This paper describes genuine audio super-resolution (SR) that aims to estimate a continuous signal from a discrete signal in a sampling-rate-agnostic manner based on deep kernel learning (DKL). We assume that the discrete signal is obtained by observing the continuous signal at arbitrary, possibly irregular time points. From a statistical point of view, audio SR can thus be tackled as curve fitting, an inverse problem based on a probabilistic model that represents the generation of a discrete signal from a continuous signal (curve). To deal with speech signals with complicated temporal dynamics, we propose a continuous-time-domain speech SR method that uses a nonlinear state-space model called a Gaussian process dynamical system (GPDS) as a prior on the speech signal. Specifically, we assume the speech signal to follow a GP conditioned on a latent signal that follows another GP. Given discrete observations, we approximate the latent posterior via variational inference with neural-based DKL and then sample the speech signal on an arbitrarily finer time grid. We empirically confirmed that the proposed method, GPDS-SR, remains robust to missing and irregular input samples and supports prediction even at nonstandard output rates while estimating plausible high-frequency content consistent with the observed context.
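The curve-fitting view in the abstract can be illustrated with a minimal sketch of ordinary GP regression. This is not the GPDS-SR method: it has no latent GP, no variational inference, and no deep kernel. It assumes a fixed squared-exponential kernel with hand-picked hyperparameters, uses a synthetic sinusoid in place of speech, and the function names `rbf_kernel` and `gp_posterior_mean` are hypothetical. It only shows how observations at irregular time points can be mapped to predictions on an arbitrarily finer grid.

```python
import numpy as np

def rbf_kernel(t1, t2, lengthscale=0.1, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of time points.
    The kernel choice and its hyperparameters are illustrative assumptions."""
    d = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior_mean(t_obs, y_obs, t_new, noise=1e-6):
    """Posterior mean of a zero-mean GP at t_new, given observations
    y_obs taken at the (possibly irregular) time points t_obs."""
    K = rbf_kernel(t_obs, t_obs) + noise * np.eye(len(t_obs))
    K_star = rbf_kernel(t_new, t_obs)
    alpha = np.linalg.solve(K, y_obs)  # solves (K + noise * I) alpha = y
    return K_star @ alpha

# Synthetic stand-in for a discrete signal observed at irregular times.
rng = np.random.default_rng(0)
t_obs = np.sort(rng.uniform(0.0, 1.0, size=40))
y_obs = np.sin(2.0 * np.pi * 3.0 * t_obs)

# Predict on a finer time grid; the grid does not need to be related
# to the input time points in any way.
t_new = np.linspace(0.0, 1.0, 400)
y_new = gp_posterior_mean(t_obs, y_obs, t_new)
```

Because the model is defined over continuous time, `t_new` can be any set of time points, which is the sense in which such a formulation is sampling-rate agnostic. A fixed smooth kernel like this one only interpolates the observations; the paper relies on the GPDS prior with DKL to estimate plausible high-frequency content.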

Reference

A. A. Nugraha, D. Di Carlo, Y. Bando, M. Fontaine, and K. Yoshii, “Sampling-Rate-Agnostic Speech Super-Resolution Based on Gaussian Process Dynamical Systems With Deep Kernel Learning,” 2025, under review.