Sampling-Rate-Agnostic Speech Super-Resolution Based on Gaussian Process Dynamical Systems With Deep Kernel Learning
A. A. Nugraha, D. Di Carlo, Y. Bando, M. Fontaine, and K. Yoshii, "Sampling-Rate-Agnostic Speech Super-Resolution Based on Gaussian Process Dynamical Systems With Deep Kernel Learning," 2025, under review.
Abstract
This paper describes genuine audio super-resolution (SR) that aims to estimate a continuous signal from a discrete signal in a sampling-rate-agnostic manner based on deep kernel learning (DKL). We assume that the discrete signal is obtained by observing the continuous signal at arbitrary, possibly irregular time points. From a statistical point of view, audio SR can thus be tackled as curve fitting, an inverse problem based on a probabilistic model that represents the generation of a discrete signal from a continuous signal (curve). To deal with speech signals with complicated temporal dynamics, we propose a continuous-time-domain speech SR method that uses a nonlinear state-space model called a Gaussian process dynamical system (GPDS) as a prior on the speech signal. Specifically, we assume that the speech signal follows a Gaussian process (GP) conditioned on a latent signal that follows another GP. Given discrete observations, we approximate the latent posterior via variational inference with neural-based DKL and then sample the speech signal on an arbitrarily finer time grid. We empirically confirmed that the proposed method, GPDS-SR, remains robust to missing and irregular input samples and supports prediction even at nonstandard output rates while estimating plausible high-frequency content consistent with the observed context.
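The curve-fitting view in the abstract can be illustrated with a much simpler model than the one proposed. The sketch below is not GPDS-SR: it assumes a single zero-mean GP with a fixed squared-exponential kernel (no latent GP, no deep kernel, no variational inference) and returns the posterior mean instead of drawing samples. The function names, the synthetic test signal, and all hyperparameter values are hypothetical choices made only for this illustration.

```python
import numpy as np


def rbf_kernel(t1, t2, lengthscale, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of time points."""
    diff = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)


def gp_posterior_mean(t_obs, y_obs, t_new, lengthscale, noise_var=1e-4):
    """Posterior mean of a zero-mean GP at t_new given samples (t_obs, y_obs)."""
    K = rbf_kernel(t_obs, t_obs, lengthscale) + noise_var * np.eye(len(t_obs))
    K_star = rbf_kernel(t_new, t_obs, lengthscale)
    # Solve K @ alpha = y_obs through a Cholesky factorization for stability.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    return K_star @ alpha


def signal(t):
    """Synthetic stand-in for a speech waveform (assumption, not real data)."""
    return np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 450 * t)


rng = np.random.default_rng(0)

# Irregular observations: a 4 kHz grid with timing jitter and ~10% of samples dropped.
base = np.arange(80) / 4000.0
jitter = rng.uniform(-0.3, 0.3, size=base.size) / 4000.0
keep = rng.random(base.size) > 0.1
t_obs = (base + jitter)[keep]
y_obs = signal(t_obs)

# Query the fitted curve on a regular 16 kHz grid; any other grid would work too.
t_new = np.arange(320) / 16000.0
y_new = gp_posterior_mean(t_obs, y_obs, t_new, lengthscale=5e-4)

rmse = np.sqrt(np.mean((y_new - signal(t_new)) ** 2))
print(f"RMSE on the 16 kHz grid: {rmse:.4f}")
```

Because the model is defined in continuous time, `t_obs` and `t_new` are just arrays of time points, so neither the input nor the output has to lie on a standard sampling grid. A fixed kernel like this one can only smooth between observations; estimating plausible content above the input bandwidth is what the learned GPDS prior is meant to provide.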
Audio Samples
Utterance: “These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.”
Utterance identifier: p361_008_mic1
Target (16 kHz) | Input (2 kHz) | Input (4 kHz) | Input (8 kHz) |
---|---|---|---|
| | 2 kHz WAV playback may fail in some browsers. Download raw 2 kHz WAV. | | |
Methods | Estimate (2->16 kHz) | Estimate (4->16 kHz) | Estimate (8->16 kHz) |
---|---|---|---|
Cubic spline interpolation | |||
Polyphase resampling | |||
NU-Wave2 | |||
UDM+ | |||
GPDS-SR | |||
Utterance identifier: p374_008_mic1
Target (16 kHz) | Input (2 kHz) | Input (4 kHz) | Input (8 kHz) |
---|---|---|---|
| | 2 kHz WAV playback may fail in some browsers. Download raw 2 kHz WAV. | | |
Methods | Estimate (2->16 kHz) | Estimate (4->16 kHz) | Estimate (8->16 kHz) |
---|---|---|---|
Cubic spline interpolation | |||
Polyphase resampling | |||
NU-Wave2 | |||
UDM+ | |||
GPDS-SR | |||
Utterance identifier: p376_008_mic1
Target (16 kHz) | Input (2 kHz) | Input (4 kHz) | Input (8 kHz) |
---|---|---|---|
| | 2 kHz WAV playback may fail in some browsers. Download raw 2 kHz WAV. | | |
Methods | Estimate (2->16 kHz) | Estimate (4->16 kHz) | Estimate (8->16 kHz) |
---|---|---|---|
Cubic spline interpolation | |||
Polyphase resampling | |||
NU-Wave2 | |||
UDM+ | |||
GPDS-SR | |||
Utterance identifier: s5_008_mic1
Target (16 kHz) | Input (2 kHz) | Input (4 kHz) | Input (8 kHz) |
---|---|---|---|
| | 2 kHz WAV playback may fail in some browsers. Download raw 2 kHz WAV. | | |
Methods | Estimate (2->16 kHz) | Estimate (4->16 kHz) | Estimate (8->16 kHz) |
---|---|---|---|
Cubic spline interpolation | |||
Polyphase resampling | |||
NU-Wave2 | |||
UDM+ | |||
GPDS-SR | |||
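For readers who want a feel for the two classical baselines listed in the tables above, the following is a minimal sketch of cubic spline interpolation and polyphase resampling for the 4->16 kHz case. It assumes SciPy's `CubicSpline` and `resample_poly` with default settings and a synthetic sine input; this page does not state which implementations or parameters were used to produce the audio samples, so those choices are assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import resample_poly

fs_in, fs_out = 4000, 16000          # input and output sampling rates in Hz
factor = fs_out // fs_in             # integer upsampling factor (4)

# Synthetic 0.1 s input: a 300 Hz sine sampled at 4 kHz (assumption, not real speech).
t_in = np.arange(400) / fs_in
x_in = np.sin(2 * np.pi * 300 * t_in)

# Output time grid at 16 kHz covering the same duration.
t_out = np.arange(len(t_in) * factor) / fs_out

# Baseline 1: fit a cubic spline through the input samples and evaluate it on
# the finer grid. The last few output points lie past the final input sample,
# so CubicSpline extrapolates there (its default behavior).
x_spline = CubicSpline(t_in, x_in)(t_out)

# Baseline 2: polyphase resampling, i.e. zero-stuffing by `factor` followed by
# SciPy's default Kaiser-windowed FIR low-pass filter.
x_poly = resample_poly(x_in, up=factor, down=1)

reference = np.sin(2 * np.pi * 300 * t_out)
print("spline max error (interior):", np.max(np.abs(x_spline - reference)[200:-200]))
print("polyphase max error (interior):", np.max(np.abs(x_poly - reference)[200:-200]))
```

Both baselines only interpolate within the input bandwidth: the 16 kHz outputs contain essentially no new content above the input Nyquist frequency (2 kHz here), which is the gap that generative methods such as those in the tables aim to fill.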