Sampling-Rate-Agnostic Speech Super-Resolution Based on Gaussian Process Dynamical Systems With Deep Kernel Learning

A. A. Nugraha, D. Di Carlo, Y. Bando, M. Fontaine, and K. Yoshii, "Sampling-Rate-Agnostic Speech Super-Resolution Based on Gaussian Process Dynamical Systems With Deep Kernel Learning," 2025, under review.

Abstract

This paper describes genuine audio super-resolution (SR) that aims to estimate a continuous signal from a discrete signal in a sampling-rate-agnostic manner based on deep kernel learning (DKL). We assume that the discrete signal is obtained by observing the continuous signal at arbitrary, possibly irregular time points. From a statistical point of view, audio SR can thus be tackled as curve fitting, an inverse problem based on a probabilistic model that represents the generation of a discrete signal from a continuous signal (curve). To deal with speech signals with complicated temporal dynamics, we propose a continuous-time-domain speech SR method that uses a nonlinear state-space model called a Gaussian process dynamical system (GPDS) as a prior on the speech signal. Specifically, we assume that the speech signal follows a Gaussian process (GP) conditioned on a latent signal that follows another GP. Given discrete observations, we approximate the latent posterior via variational inference with neural-based DKL and then sample the speech signal on an arbitrarily finer time grid. We empirically confirmed that the proposed method, GPDS-SR, remains robust to missing and irregular input samples and supports prediction even at nonstandard output rates while estimating plausible high-frequency content consistent with the observed context.
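The curve-fitting view in the abstract can be illustrated with a much simpler model than the one proposed. The following is a minimal sketch, assuming a single GP with a fixed squared-exponential kernel in place of the GPDS prior and the learned deep kernel; it is not the GPDS-SR method. It fits irregularly timed samples of a toy 50 Hz tone (with two samples missing) and evaluates the posterior mean on a finer, freely chosen time grid. All function names, the toy signal, and the hyperparameters (`length_scale`, `noise_var`, the rates) are hypothetical choices for illustration, and the code uses only the Python standard library.

```python
import math

def rbf_kernel(t1, t2, length_scale, variance=1.0):
    """Squared-exponential covariance between two time points (in seconds)."""
    d = (t1 - t2) / length_scale
    return variance * math.exp(-0.5 * d * d)

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l

def cho_solve(l, b):
    """Solve (L L^T) x = b by forward and then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x

def gp_posterior_mean(t_obs, y_obs, t_query, length_scale, noise_var=1e-6):
    """Posterior mean of a zero-mean GP at arbitrary query times."""
    # Gram matrix of the observation times plus observation-noise variance.
    gram = [[rbf_kernel(a, b, length_scale) + (noise_var if i == j else 0.0)
             for j, b in enumerate(t_obs)]
            for i, a in enumerate(t_obs)]
    alpha = cho_solve(cholesky(gram), y_obs)
    # Each query time only needs its cross-covariances with the observations.
    return [sum(rbf_kernel(tq, a, length_scale) * w for a, w in zip(t_obs, alpha))
            for tq in t_query]

# Toy "continuous" signal: a 50 Hz tone (assumption, stands in for speech).
f0 = 50.0
def signal(t):
    return math.sin(2.0 * math.pi * f0 * t)

# Irregular observations: a nominal 400 Hz grid with deterministic jitter,
# and two samples dropped to mimic missing input.
h = 1.0 / 400.0
dropped = {5, 11}
t_obs = [(i + 0.2 * math.sin(1.7 * i)) * h for i in range(17) if i not in dropped]
y_obs = [signal(t) for t in t_obs]

# Output grid at a rate unrelated to the input timing (1600 Hz here).
fs_out = 1600.0
t_query = [n / fs_out for n in range(int(t_obs[-1] * fs_out) + 1)]

estimate = gp_posterior_mean(t_obs, y_obs, t_query, length_scale=0.004)
max_err = max(abs(e - signal(t)) for e, t in zip(estimate, t_query))
print(f"{len(t_obs)} irregular samples -> {len(t_query)} output samples")
print(f"max abs error on the output grid: {max_err:.4f}")
```

Because the model is defined over continuous time, the output grid is just a list of query times: changing `fs_out` or passing non-uniform `t_query` values requires no other change. A fixed stationary kernel like this one can only smooth between observations; it has no mechanism for generating high-frequency content beyond what the samples support, which is the gap the learned GPDS prior is meant to address.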


Audio Samples

Utterance: “These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.”


Utterance identifier: p361_008_mic1
 
Target (16 kHz) Input (2 kHz) Input (4 kHz) Input (8 kHz)
2 kHz WAV playback may fail in some browsers. Download raw 2 kHz WAV.
 
Methods Estimate (2->16 kHz) Estimate (4->16 kHz) Estimate (8->16 kHz)
Cubic spline interpolation
Polyphase resampling
NU-Wave2
UDM+
GPDS-SR

Utterance identifier: p374_008_mic1
 
Target (16 kHz) Input (2 kHz) Input (4 kHz) Input (8 kHz)
2 kHz WAV playback may fail in some browsers. Download raw 2 kHz WAV.
 
Methods Estimate (2->16 kHz) Estimate (4->16 kHz) Estimate (8->16 kHz)
Cubic spline interpolation
Polyphase resampling
NU-Wave2
UDM+
GPDS-SR

Utterance identifier: p376_008_mic1
 
Target (16 kHz) Input (2 kHz) Input (4 kHz) Input (8 kHz)
2 kHz WAV playback may fail in some browsers. Download raw 2 kHz WAV.
 
Methods Estimate (2->16 kHz) Estimate (4->16 kHz) Estimate (8->16 kHz)
Cubic spline interpolation
Polyphase resampling
NU-Wave2
UDM+
GPDS-SR

Utterance identifier: s5_008_mic1
 
Target (16 kHz) Input (2 kHz) Input (4 kHz) Input (8 kHz)
2 kHz WAV playback may fail in some browsers. Download raw 2 kHz WAV.
 
Methods Estimate (2->16 kHz) Estimate (4->16 kHz) Estimate (8->16 kHz)
Cubic spline interpolation
Polyphase resampling
NU-Wave2
UDM+
GPDS-SR