Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder

| Preprint | Code (integrated as a part of Amphion) |

Yicheng Gu, Xueyao Zhang, Liumeng Xue, Zhizheng Wu

School of Data Science, The Chinese University of Hong Kong, Shenzhen

Overview

Generative Adversarial Network (GAN) based vocoders offer fast inference and high synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to further advance GAN-based vocoders. Most existing time-frequency-representation-based discriminators are rooted in the Short-Time Fourier Transform (STFT), whose time-frequency resolution in a spectrogram is fixed, making it ill-suited to signals like singing voices that require flexible attention for different frequency bands. Motivated by this, our study utilizes the Constant-Q Transform (CQT), whose resolution varies across frequencies, contributing to a better modeling ability in pitch variation and harmonic tracking. Specifically, we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates on the CQT spectrogram at multiple scales and performs sub-band processing according to different octaves. Experiments conducted on both speech and singing voices confirm the effectiveness of our proposed method. Moreover, we also verify that the CQT-based and the STFT-based discriminators can be complementary under joint training.
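
For reference, the resolution contrast described above follows from the standard textbook definitions of the two transforms, sketched below. The symbols are notational assumptions introduced here (sampling rate $f_s$, STFT window length $N$, lowest CQT frequency $f_{\min}$, $B$ bins per octave); they are not the configuration used in our experiments.

```latex
% STFT: every bin shares one window length N, so the bin spacing is constant.
\Delta f_{\mathrm{STFT}} = \frac{f_s}{N}

% CQT: center frequencies are geometrically spaced, with B bins per octave.
% The ratio Q of center frequency to bandwidth is the same for every bin,
% so the window length N_k shrinks as the center frequency f_k grows.
f_k = f_{\min} \cdot 2^{k/B}, \qquad
Q = \frac{f_k}{\Delta f_k} = \frac{1}{2^{1/B} - 1}, \qquad
N_k = \left\lceil \frac{Q \, f_s}{f_k} \right\rceil
```

Because $Q$ is constant, low-frequency bins use long windows (fine frequency resolution, coarse time resolution), while high-frequency bins use short windows (the reverse).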

Architecture of the proposed Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT) Discriminator, which can be integrated with any GAN-based vocoder. Operator "C" denotes concatenation. SBP denotes our proposed Sub-Band Processing module.
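
The data flow of the SBP module in the figure (split by octave, process each sub-band, concatenate) can be sketched roughly as follows. This is a minimal illustration and not our implementation: the function names are hypothetical, and a simple per-octave gain stands in for whatever learned layers process each sub-band in the actual discriminator.

```python
from typing import Callable, List

Spectrogram = List[List[float]]  # shape: [num_bins][num_frames]


def split_into_octaves(cqt: Spectrogram, bins_per_octave: int) -> List[Spectrogram]:
    """Split CQT bins (ordered from low to high frequency) into one sub-band per octave."""
    if len(cqt) % bins_per_octave != 0:
        raise ValueError("number of bins must be a multiple of bins_per_octave")
    return [cqt[i:i + bins_per_octave] for i in range(0, len(cqt), bins_per_octave)]


def sub_band_processing(
    cqt: Spectrogram,
    bins_per_octave: int,
    per_octave_fns: List[Callable[[Spectrogram], Spectrogram]],
) -> Spectrogram:
    """Apply one function per octave, then concatenate along the frequency axis."""
    sub_bands = split_into_octaves(cqt, bins_per_octave)
    if len(sub_bands) != len(per_octave_fns):
        raise ValueError("need exactly one processing function per octave")
    processed = [fn(band) for fn, band in zip(per_octave_fns, sub_bands)]
    # The "C" operator: stack the processed sub-bands back together.
    return [row for band in processed for row in band]


def make_gain(gain: float) -> Callable[[Spectrogram], Spectrogram]:
    """Hypothetical placeholder for a per-octave learned layer."""
    def apply(band: Spectrogram) -> Spectrogram:
        return [[gain * value for value in row] for row in band]
    return apply


# Toy input: 2 octaves x 3 bins per octave, 4 frames; bin b is filled with the value b.
toy_cqt = [[float(b)] * 4 for b in range(6)]
out = sub_band_processing(toy_cqt, 3, [make_gain(1.0), make_gain(2.0)])
```

In this toy run, the second octave (bins 3 to 5) is scaled independently of the first, and the output keeps the original bin order and frame count.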

Effectiveness of MS-SB-CQT Discriminator

Table 1: Results of different discriminators when integrated into HiFi-GAN. The best and the second-best results in every column (except those from Ground Truth) in each domain (speech and singing voice) are shown in bold and italic, respectively. "S" and "C" represent the MS-STFT and MS-SB-CQT Discriminators, respectively. The MOS scores are reported with a 95% Confidence Interval (CI).

As illustrated in Table 1, regarding singing voice, we can observe that:
(1) both HiFi-GAN (+C) and HiFi-GAN (+S) perform better than HiFi-GAN, showing the importance of time-frequency-representation-based discriminators;
(2) HiFi-GAN (+C) performs better than HiFi-GAN (+S) with a significant boost in MOS, showing the superiority of our proposed MS-SB-CQT Discriminator;
(3) HiFi-GAN (+S+C) performs best both objectively and subjectively, which shows that the two discriminators provide complementary information, confirming the effectiveness of joint training.
A similar conclusion can be drawn for the unseen speaker evaluation of speech data.
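
As a reading aid for the MOS values with 95% CI in Table 1, the sketch below shows one common way to compute such an interval, using a normal approximation. This page does not state how the intervals were computed, so the method is an assumption, and the ratings are hypothetical placeholders rather than data from our listening test.

```python
import math
import statistics
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean opinion score, half-width of the CI) under a normal approximation.

    z = 1.96 corresponds to a 95% two-sided interval.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate a confidence interval")
    mean = statistics.fmean(ratings)
    # Standard error of the mean, based on the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem


# Hypothetical 1-to-5 ratings for a single system (not real results).
hypothetical_ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, half_width = mos_with_ci(hypothetical_ratings)
print(f"MOS = {mos:.2f} ± {half_width:.2f}")
```

With small rater pools, a Student's t quantile would give a slightly wider and more conservative interval than the fixed z value used here.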

Here, we show some representative samples to reveal the effectiveness of our MS-SB-CQT Discriminator and the complementary role of the STFT-based and CQT-based Discriminators. A case study is attached at the bottom of this section to explore the effectiveness of joint training.

Representative Samples

Seen Singers

	Ground Truth	HiFi-GAN	HiFi-GAN (+S)	HiFi-GAN (+C)	HiFi-GAN (+S+C)
#1
#2
#3

Unseen Singers

	Ground Truth	HiFi-GAN	HiFi-GAN (+S)	HiFi-GAN (+C)	HiFi-GAN (+S+C)
#1
#2
#3

Seen Speakers

	Ground Truth	HiFi-GAN	HiFi-GAN (+S)	HiFi-GAN (+C)	HiFi-GAN (+S+C)
#1
#2
#3

Unseen Speakers

	Ground Truth	HiFi-GAN	HiFi-GAN (+S)	HiFi-GAN (+C)	HiFi-GAN (+S+C)
#1
#2
#3

Case Study

We visualized the same singing voice utterance synthesized by generators trained with different discriminator combinations. It can be observed that:

(1) With only the time-domain-based discriminators, it is hard for our generator to model the high-frequency parts. Adding a time-frequency-representation-based discriminator, either the MS-SB-CQT Discriminator (HiFi-GAN (+C)) or the MS-STFT Discriminator (HiFi-GAN (+S)), significantly boosts the quality of high-frequency reconstruction.

(2) STFT has a fixed time-frequency resolution across all frequency bands. In the low-frequency parts, its lack of frequency resolution brings frequency distortions, resulting in phonemes with artifacts. In the high-frequency parts, its lack of time resolution limits it from reconstructing harmonic components. CQT has a dynamic resolution trade-off, thus alleviating these artifacts. However, its lack of time resolution in the low-frequency parts and the lack of frequency resolution in the high-frequency parts still bring problems like glitches and hissing noises.

(3) Combining STFT-based and CQT-based Discriminators integrates their strengths, thus attaining a significantly better synthesis quality.
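
To make point (2) concrete, the sketch below compares the fixed STFT bin spacing with the gap between adjacent semitones at several pitches. The sampling rate and FFT size are illustrative assumptions, not the settings of our experiments, and the helper names are hypothetical.

```python
def stft_bin_spacing_hz(sample_rate: int, n_fft: int) -> float:
    """Frequency distance between adjacent STFT bins (identical for every bin)."""
    return sample_rate / n_fft


def semitone_gap_hz(freq_hz: float) -> float:
    """Distance in Hz from freq_hz to the note one semitone above it."""
    return freq_hz * (2 ** (1 / 12) - 1)


# Assumed analysis settings, for illustration only.
sample_rate, n_fft = 24000, 1024
spacing = stft_bin_spacing_hz(sample_rate, n_fft)

for freq in (110.0, 440.0, 1760.0, 7040.0):
    bins_per_semitone = semitone_gap_hz(freq) / spacing
    print(f"{freq:7.1f} Hz: one semitone spans {bins_per_semitone:6.2f} STFT bins")
```

Under these assumptions, a semitone near 110 Hz is narrower than a single STFT bin, whereas a semitone near 7040 Hz spans many bins. A CQT with 12 bins per octave instead assigns one bin to each semitone at every pitch, at the cost of longer analysis windows at low frequencies.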

Ground Truth	HiFi-GAN	HiFi-GAN (+S)	HiFi-GAN (+C)	HiFi-GAN (+S+C)


能不能给我一首歌的时间	能不能给我一首歌的时间	能不能给我一首歌的时间	能不能给我一首歌的时间	能不能给我一首歌的时间
néng bù néng gěi wǒ yī shǒu gē de shí jiān	*néng* bù néng gěi wǒ yī shǒu gē de shí jiān	néng bù néng gěi wǒ yī shǒu gē de *shí* jiān	néng bù néng *gěi* wǒ yī shǒu gē de shí jiān	néng bù néng gěi wǒ yī shǒu gē de shí jiān

Generalization Ability of MS-SB-CQT Discriminator

Table 2: Results of our proposed MS-SB-CQT Discriminator when integrated into MelGAN and NSF-HiFiGAN on singing voice datasets. The improvements are shown in bold. "S" and "C" represent the MS-STFT and MS-SB-CQT Discriminators, respectively. All the improvements in MCD, PESQ, and Preference are significant (p-value < 0.01).

As illustrated in Table 2, the performance of MelGAN and NSF-HiFiGAN can be improved significantly by joint training with the MS-SB-CQT and MS-STFT Discriminators, as confirmed by both the objective metrics and the subjective preference tests.

Here, we show some representative samples to reveal the effectiveness of joint training with our proposed MS-SB-CQT and the existing MS-STFT Discriminators on MelGAN and NSF-HiFiGAN. Case studies are attached at the bottom of this section to explore the improvements in detail.

Representative Samples

MelGAN

Seen Singers

	Ground Truth	MelGAN	MelGAN (+S+C)
#1
#2
#3

Unseen Singers

	Ground Truth	MelGAN	MelGAN (+S+C)
#1
#2
#3

NSF-HiFiGAN

Seen Singers

	Ground Truth	NSF-HiFiGAN	NSF-HiFiGAN (+S+C)
#1
#2
#3

Unseen Singers

	Ground Truth	NSF-HiFiGAN	NSF-HiFiGAN (+S+C)
#1
#2
#3

Case Study

MelGAN

MelGAN tends to overfit the low-frequency part and ignore mid- and high-frequency components, resulting in audible metallic noise. After adding the MS-STFT and MS-SB-CQT Discriminators, it models the global information of the spectrogram better, which remarkably improves the synthesis quality.

Ground Truth	MelGAN	MelGAN (+S+C)

NSF-HiFiGAN

NSF-HiFiGAN can synthesize high-fidelity singing voices. However, it still lacks frequency details. Adding the MS-STFT and MS-SB-CQT Discriminators tackles that problem, making synthesized samples closer to the ground truth.

Ground Truth	NSF-HiFiGAN	NSF-HiFiGAN (+S+C)

Necessity of Sub-Band Processing

Table 3: Results of HiFi-GAN enhanced by different CQT-based discriminators. The MS-CQT Discriminator is our proposed MS-SB-CQT Discriminator with only the Sub-Band Processing module removed.

As illustrated in Table 3, HiFi-GAN can be enhanced successfully by our proposed MS-SB-CQT Discriminator. However, applying the raw CQT directly to the discriminator (MS-CQT) even harms the quality of HiFi-GAN. We speculate that this is because the temporal desynchronization between octaves in the raw CQT burdens model learning. Therefore, it is necessary to adopt the proposed SBP module when designing a CQT-based discriminator.
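
One way to picture the temporal mismatch between octaves is to look at how long the CQT analysis window is in each octave. The sketch below uses the textbook constant-Q window length; the sampling rate, lowest frequency, number of octaves, and bins per octave are illustrative assumptions rather than our experimental settings, and practical CQT implementations may differ in detail.

```python
import math
from typing import List


def cqt_window_length(freq_hz: float, sample_rate: int, bins_per_octave: int) -> int:
    """Textbook constant-Q window length in samples for a bin centered at freq_hz."""
    q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1.0)
    return math.ceil(q * sample_rate / freq_hz)


def octave_window_ms(
    f_min: float, n_octaves: int, sample_rate: int, bins_per_octave: int
) -> List[float]:
    """Window duration in milliseconds of the lowest bin in each octave."""
    return [
        1000.0
        * cqt_window_length(f_min * 2 ** octave, sample_rate, bins_per_octave)
        / sample_rate
        for octave in range(n_octaves)
    ]


# Assumed settings: 24 kHz audio, f_min = 32.7 Hz (about C1), 8 octaves, 12 bins per octave.
durations = octave_window_ms(32.7, 8, 24000, 12)
for octave, duration in enumerate(durations):
    print(f"octave {octave}: window of about {duration:.1f} ms")
```

Under these assumptions, the window roughly halves from one octave to the next, so a single spectrogram frame summarizes very different time spans in different octaves. This is consistent with handling each octave separately before concatenation, although it is only an illustration of our speculation and not a verified explanation.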

Here, we show some representative cases to reveal the effectiveness of our proposed SBP module.

	Ground Truth	HiFi-GAN	HiFi-GAN (+MS-CQT)	HiFi-GAN (+MS-SB-CQT)
#1
#2
#3