
Speech Synthesis | Curated Paper Collection (197 Papers)

2021-04-16 18:04 | Author: 深蓝学院

This post compiles 197 papers on speech synthesis, organized into the 12 categories listed below:

(See the link at the end of this article for free access to the source code links and PDF versions of the papers.)

Original source: the 【低调奋进】 WeChat official account

Journal and Conference Papers on Speech

Alignment

1.Online and Linear-Time Attention by Enforcing Monotonic Alignments

Code: https://github.com/craffel/mad 


2.Forward Attention in Sequence-to-Sequence Acoustic Modeling for Speech Synthesis


3.Monotonic Chunkwise Attention

Code: https://github.com/j-min/MoChA-pytorch 


4.Initial Investigation of An Encoder-Decoder End-to-End TTS Framework Using Marginalization of Monotonic Hard Latent Alignments


5.Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS

Code: https://gist.github.com/mutiann/38a7638f75c21479582d7391490df37c 


6.Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding


7.Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis

Code: https://github.com/bshall/Tacotron 

https://github.com/anandaswarup/TTS 


8.Peking Opera Synthesis via Duration Informed Attention Network


9.Understanding Self-Attention of Self-Supervised Audio Transformers


Dual Learning

1.Listening While Speaking: Speech Chain by Deep Learning


2.Machine Speech Chain with One-Shot Speaker Adaptation


3.Almost Unsupervised Text to Speech and Automatic Speech Recognition

Code:https://github.com/RayeRen/unsuper_tts_asr 


4.LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition

EEG

1.Advancing Speech Synthesis Using EEG


2.Predicting Different Acoustic Features From EEG and towards Direct Synthesis of Audio Waveform From EEG


3.Speech Synthesis Using EEG

Expressive TTS

1.Hierarchical Generative Modeling for Controllable Speech Synthesis

Code:https://github.com/rarefin/TTS_VAE 

https://github.com/lturing/Tools 


2.Predicting Expressive Speaking Style From Text in End-to-End Speech Synthesis


3.Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

Code:https://github.com/syang1993/gst-tacotron 


4.Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Demo: https://google.github.io/tacotron/publications/end_to_end_prosody_transfer/ 


5.Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning On Rhythm, Pitch and Global Style Tokens

Code: https://github.com/NVIDIA/mellotron 


6.Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency

Code: https://github.com/entn-at/acc-tacotron2 


7.Multi-Reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis


8.Controllable Emotion Transfer for End-to-End Speech Synthesis


9.Controllable Neural Prosody Synthesis


10.Enhancing Speech Intelligibility in Text-to-Speech Synthesis Using Speaking Style Conversion


11.Fine-Grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis


12.Flowtron: An Autoregressive Flow-Based Generative Network for Text-to-Speech Synthesis

Code: https://github.com/Sebidev/flowtron 


13.Fully-Hierarchical Fine-Grained Prosody Modeling for Interpretable Speech Synthesis

Demo: https://google.github.io/tacotron/publications/hierarchical_prosody/index.html 


14.Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis


15.Whispered and Lombard Neural Speech Synthesis

Front End

1.Automatic Prosody Prediction for Chinese Speech Synthesis Using BLSTM-RNN and Embedding Features


2.Improving Prosodic Boundaries Prediction for Mandarin Speech Synthesis by Using Enhanced Embedding Feature and Model Fusion Approach


3.Mandarin Prosody Prediction Based On Attention Mechanism and Multimodel Ensemble


4.A Mandarin Prosodic Boundary Prediction Model Based On Multi-Task Learning


5.Pre-Trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis


6.Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion

Code:https://github.com/sigmeta/g2p-kd 


7.A Hybrid Text Normalization System Using Multi-Head Self-Attention for Mandarin


8.A Mask-Based Model for Mandarin Chinese Polyphone Disambiguation


9.A Unified Sequence-to-Sequence Front-End Model for Mandarin Text-to-Speech Synthesis


10.Unified Mandarin TTS Front-End Based On Distilled BERT Model

General TTS

1.Statistical Parametric Speech Synthesis Using Deep Neural Networks


2.TTS Synthesis with Bidirectional LSTM-Based Recurrent Neural Networks


3.A Study of Speaker Adaptation for DNN-Based Speech Synthesis


4.Acoustic Modeling in Statistical Parametric Speech Synthesis – From HMM to LSTM-RNN


5.Effective Approaches to Attention-Based Neural Machine Translation

Code:https://github.com/lingyongyan/Neural-Machine-Translation 


6.The HTK Book


7.Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices


8.Merlin: An Open Source Neural Network Speech Synthesis System

Code: https://github.com/speechdnn/merlin 


9.Attention Is All You Need

Code:https://github.com/jadore801120/attention-is-all-you-need-pytorch 

https://github.com/Lsdefine/attention-is-all-you-need-keras 

https://github.com/soskek/attention_is_all_you_need 


10.Char2Wav: End-to-End Speech Synthesis

Code:https://github.com/sotelo/parrot 

Demo:http://www.josesotelo.com/speechsynthesis/


11.Deep Voice 2: Multi-Speaker Neural Text-to-Speech


12.Deep Voice: Real-Time Neural Text-to-Speech

Code:https://github.com/israelg99/deepvoice


13.Tacotron: Towards End-to-End Speech Synthesis

Demo:https://google.github.io/tacotron/publications/tacotron/index.html 


14.VoiceLoop: Voice Fitting and Synthesis Via A Phonological Loop


15.ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

Demo:https://clarinet-demo.github.io/ 


16.Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

Code: https://github.com/r9y9/deepvoice3_pytorch 


17.A 2019 Guide to Speech Synthesis with Deep Learning


18.Deep Text-to-Speech System with Seq2seq Model


19.DurIAN: Duration Informed Attention Network for Multimodal Synthesis

Code:https://github.com/entn-at/DurIAN-1 


20.Exploiting Syntactic Features in A Parsed Tree to Improve End-to-End TTS


21.FastSpeech: Fast, Robust and Controllable Text to Speech

Code:https://github.com/Deepest-Project/FastSpeech 


22.Forward-Backward Decoding for Regularizing End-to-End TTS


23.LibriTTS: A Corpus Derived From LibriSpeech for Text-to-Speech


24.Maximizing Mutual Information for Tacotron

Code: https://github.com/makman09/tacotron2 


25.Neural Speech Synthesis with Transformer Network

Code:https://github.com/lfchener/Transformer-TTS 


26.Non-Autoregressive Neural Text-to-Speech

Code: https://github.com/ksw0306/WaveVAE 


27.Parallel Neural Text-to-Speech

Demo:https://github.com/parallel-neural-tts-demo/parallel-neural-tts-demo.github.io 

Code: https://github.com/ksw0306/WaveVAE 


28.Self-Attention Based Prosodic Boundary Prediction for Chinese Speech Synthesis


29.Tacotron-Based Acoustic Model Using Phoneme Alignment for Practical Neural Text-to-Speech Systems


30.Tutorial On End-to-End Text-to-Speech Synthesis


31.Controllable Neural Prosody Synthesis


32.Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling


33.Deep Representation Learning in Speech Processing: Challenges, Recent Advances and Future Trends


34.DeviceTTS: A Small-Footprint, Fast, Stable Network for On-Device Text-to-Speech


35.End-to-End Adversarial Text-to-Speech

Code: https://github.com/yanggeng1995/EATS 


36.Fast and Lightweight On-Device TTS with Tacotron2 and LPCNet


37.FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Code: https://github.com/ming024/FastSpeech2

https://github.com/rishikksh20/FastSpeech2 

https://github.com/ga642381/FastSpeech2 

https://github.com/dathudeptrai/FastSpeech2 


38.FeatherTTS: Robust and Efficient Attention-Based Neural TTS


39.Flowtron: An Autoregressive Flow-Based Generative Network for Text-to-Speech Synthesis

Code: https://github.com/NVIDIA/flowtron 

Demo: https://nv-adlr.github.io/Flowtron 


40.From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint

Code: https://github.com/caizexin/tf_multispeakerTTS_fc 


41.Glow-TTS: A Generative Flow for Text-to-Speech Via Monotonic Alignment Search

Code: https://github.com/ntzzc/glow-tts 


42.GraphSpeech: Syntax-Aware Graph Attention Network for Neural Speech Synthesis

Code: https://github.com/ttslr/GraphSpeech 


43.Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis


44.Incremental Text-to-Speech for Neural Sequence-to-Sequence Models Using Reinforcement Learning


45.Interactive Text-to-Speech Via Semi-Supervised Style Transfer Learning


46.JDI-T: Jointly Trained Duration Informed Transformer for Text-to-Speech without Explicit Alignment


47.Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis

Code: https://github.com/anandaswarup/TTS 


48.Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

Demo: https://google.github.io/tacotron/publications/nat/index.html 


49.Parallel Tacotron: Non-Autoregressive and Controllable TTS

Demo: https://google.github.io/tacotron/publications/parallel_tacotron/index.html 


50.Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis


51.Prosody Learning Mechanism for Speech Synthesis System without Text Length Limit


52.Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning


53.Speaking Speed Control of End-to-End Speech Synthesis Using Sentence-Level Conditioning


54.Speech Synthesis and Control Using Differentiable DSP


55.SpeedySpeech: Efficient Neural Speech Synthesis

Code: https://github.com/janvainer/speedyspeech 


56.SqueezeWave: Extremely Lightweight Vocoders for On-Device Speech Synthesis

Code:https://github.com/tianrengao/squeezewave


57.TTS-by-TTS: TTS-Driven Data Augmentation for Fast and High-Quality Speech Synthesis


58.Unsupervised Learning for Sequence-to-Sequence Text-to-Speech for Low-Resource Languages


59.AdaSpeech: Adaptive Text to Speech for Custom Voice

Code: https://github.com/rishikksh20/AdaSpeech 


60.Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech


61.Building Multilingual TTS Using Cross-Lingual Voice Conversion


62.LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search

Code: https://github.com/rishikksh20/LightSpeech 


63.Triple M: A Practical Text-to-Speech Synthesis System with Multi-Guidance Attention and Multi-Band Multi-Time LPCNet


64.VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis Based On Very Deep VAE with Residual Attention

Demo: https://github.com/vara-tts/VARA-TTS 

Multispeaker & Multilingual

1.Multi-Speaker Modeling and Speaker Adaptation for DNN-Based TTS Synthesis


2.Speaker Representations for Speaker Adaptation in Multiple Speakers’ BLSTM-RNN-Based Speech Synthesis


3.Cross-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers


4.Cross-Lingual, Multi-Speaker Text-to-Speech Synthesis Using Neural Speaker Embedding


5.Learning to Speak Fluently in A Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning


6.Master Thesis: Automatic Multispeaker Voice Cloning


7.Training Multi-Speaker Neural Text-to-Speech Systems Using Speaker-Imbalanced Speech Corpora


8.Transfer Learning From Speaker Verification to Multispeaker Text-to-Speech Synthesis

Code:https://github.com/smoke-trees/Voice-synthesis 


9.A Study on Different Ways of Embedding Speaker Characteristics in Personalized Speech Synthesis (个性化语音合成中说话人特征不同嵌入方式的研究)


10.Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS


11.Cross-Lingual Multispeaker Text-to-Speech Under Limited-Data Scenario

Demo:https://caizexin.github.io/mlms-syn-samples/index.html 


12.Domain-Adversarial Training of Multi-Speaker TTS


13.Efficient Neural Speech Synthesis for Low Resource Languages Through Multilingual Modeling


14.End-to-End Code-Switching TTS with Cross-Lingual Language Model


15.Focusing On Attention: Prosody Transfer and Adaptative Optimization Strategy for Multi-Speaker End-to-End Speech Synthesis


16.Generating Multilingual Voices Using Speaker Space Translation Based On Bilingual Speaker Data


17.Multi-Speaker Text-to-Speech Synthesis Using Deep Gaussian Processes


18.Multilingual Speech Synthesis


19.One Model, Many Languages: Meta Learning for Multilingual Text to Speech

Code: https://github.com/Tomiinek/Multilingual_Text_to_Speech 


20.Phonological Features for 0-Shot Multilingual Speech Synthesis

Code:https://github.com/papercup-open-source/phonological-features 


21.Semi-Supervised Learning for Multi-Speaker Text-to-Speech Synthesis Using Discrete Speech Representation

Code: https://github.com/ttaoREtw/semi-tts 


22.Speaker Adaptation of A Multilingual Acoustic Model for Cross-Language Synthesis


23.Towards Natural Bilingual and Code-Switched Speech Synthesis Based On Mix of Monolingual Recordings and Cross-Lingual Voice Conversion

Code: https://github.com/espnet/espnet 


24.Using IPA-Based Tacotron for Data Efficient Cross-Lingual Speaker Adaptation and Pronunciation Enhancement


25.Zero-Shot Multi-Speaker Text-to-Speech with State-of-the-Art Neural Speaker Embeddings


26.AdaSpeech: Adaptive Text to Speech for Custom Voice

Code: https://github.com/rishikksh20/AdaSpeech 


27.Building Multilingual TTS Using Cross-Lingual Voice Conversion


28.Investigating On Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech

Robust TTS

1.Disentangling Correlated Speaker and Noise for Speech Synthesis Via Data Augmentation and Adversarial Factorization

Code:https://github.com/meelement/noise_adversarial_tacotron 


2.Neural Text-to-Speech Adaptation From Low Quality Public Recordings


3.Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS

Code:https://github.com/nii-yamagishilab/multi-speaker-tacotron 


4.Data Efficient Voice Cloning From Noisy Samples with Domain Adversarial Training


5.Noise Robust TTS for Low Resource Speakers Using Pre-Trained Model and Speech Enhancement

Singing Synthesis

1.Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning On Rhythm, Pitch and Global Style Tokens

Code: https://github.com/NVIDIA/mellotron 


2.A Comprehensive Survey On Deep Music Generation: Multi-Level Representations, Algorithms, Evaluations, and Future Directions


3.ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders


4.DurIAN-SC: Duration Informed Attention Network Based Singing Voice Conversion System

Code:https://github.com/tencent-ailab/learning_singing_from_speech 


5.HiFiSinger:Towards High Fidelity Neural Singing Voice Synthesis


6.Jukebox:A Generative Model for Music

Code: https://github.com/openai/jukebox 


7.Speech-to-Singing Conversion Based On Boundary Equilibrium GAN


8.XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System

Demo:https://github.com/xiaoicesing/xiaoicesing.github.io 

Talking Head

1.Talking Face Generation by Adversarially Disentangled Audio-Visual Representation

Code:https://github.com/Hangz-nju-cuhk/Talking-Face-Generation-DAVS


2.Text-Based Editing of Talking-Head Video

Project:http://zollhoefer.com/papers/SG2019_TalkingHead/page.html 


3.A Novel Face-Tracking Mouth Controller and Its Application to Interacting with Bioacoustic Models


4.Large-Scale Multilingual Audio Visual Dubbing

Vocoder

1.Fast WaveNet Generation Algorithm

Code:https://github.com/tomlepaine/fast-wavenet 


2.WaveNet: A Generative Model for Raw Audio

Demo:https://deepmind.com/blog/article/wavenet-generative-model-raw-audio 


3.Parallel WaveNet: Fast High-Fidelity Speech Synthesis


4.Efficient Neural Audio Synthesis

Code: https://github.com/ys10/WaveRNN 


5.Improving FFTNet Vocoder with Noise Shaping and Subband Approaches


6.Natural TTS Synthesis by Conditioning WaveNet On Mel Spectrogram Predictions

Code: https://github.com/sooftware/tacotron2 


7.A Neural Vocoder with Hierarchical Generation of Amplitude and Phase Spectra for Statistical Parametric Speech Synthesis


8.A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet


9.An Investigation of Subband WaveNet Vocoder Covering Entire Audible Frequency Range with Limited Acoustic Features


10.High Quality, Lightweight and Adaptable TTS Using LPCNet


11.MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Code: https://github.com/erogol/melgan-neurips 


12.RawNet: Fast End-to-End Neural Vocoder

Code: https://github.com/candlewill/RawNet 


13.WaveGlow: A Flow-Based Generative Network for Speech Synthesis

Code: https://github.com/yanggeng1995/WaveGlow 

https://github.com/npuichigo/waveglow 


14.A Cyclical Post-Filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-Speech Systems


15.Bunched LPCNet: Vocoder for Low-Cost Neural Text-to-Speech Systems


16.FeatherWave: An Efficient High-Fidelity Neural Vocoder with Multi-Band Linear Prediction

Demo: https://github.com/wavecoder/FeatherWave 


17.Gaussian LPCNet for Multisample Speech Synthesis


18.HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Code: https://github.com/rishikksh20/HiFi-GAN 


19.Improving LPCNet-Based Text-to-Speech with Linear Prediction-Structured Mixture Density Network


20.Improving Opus Low Bit Rate Quality with Neural Speech Synthesis


21.Investigating The Impact of Lookahead for Incremental Neural TTS


22.Multi-Band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech


23.Neural Text-to-Speech with A Modeling-by-Generation Excitation Vocoder

Demo: https://github.com/sewplay/demos 


24.Parallel WaveGAN: A Fast Waveform Generation Model Based On Generative Adversarial Networks with Multi-Resolution Spectrogram

Code:https://github.com/kan-bayashi/ParallelWaveGAN 


25.Quasi-Periodic Parallel WaveGAN Vocoder: A Non-Autoregressive Pitch-Dependent Dilated Convolution Model for Parametric Speech Generation

Demo:https://github.com/bigpon/QuasiPeriodicParallelWaveGAN_demo 


26.Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions

Code:https://github.com/dipjyoti92/SC-WaveRNN 


27.Ultrasound-Based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis

Code:https://github.com/BME-SmartLab/UTI-to-STFT 


28.Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains

Code: https://github.com/avi33/universalmelgan 


29.VocGAN: A High-Fidelity Real-Time Vocoder with A Hierarchically Nested Adversarial Network

Code: https://github.com/rishikksh20/VocGAN 


30.Vocoder-Based Speech Synthesis From Silent Videos


31.WaveGrad: Estimating Gradients for Waveform Generation

Code: https://github.com/ivanvovk/WaveGrad 


32.WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU

Code:https://github.com/BogiHsu/WG-WaveNet 


33.GAN Vocoder: Multi-Resolution Discriminator Is All You Need

Voice Conversion

1.An Overview of Voice Conversion Systems


2.AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

Code: https://github.com/auspicious3000/autovc 


3.Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations

Code: https://github.com/jxzhanggg/nonparaSeq2seqVC_code 


4.Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion

Code: https://github.com/andi611/ZeroSpeech-TTS-without-T 


5.Accent and Speaker Disentanglement in Many-to-Many Voice Conversion


6.An Overview of Voice Conversion and Its Challenges:From Statistical Modeling to Deep Learning


7.Any-to-One Sequence-to-Sequence Voice Conversion Using Self-Supervised Discrete Speech Representations


8.Converting Anyone’s Emotion: Towards Speaker-Independent Emotional Voice Conversion

Code:https://github.com/KunZhou9646/Speaker-independent-emotional-voice-conversion-based-on-conditional-VAW-GAN-and-CWT 


9.CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-Spectrogram Conversion

Code: https://github.com/jackaduma/CycleGAN-VC3 


10.GAZEV: GAN-Based Zero-Shot Voice Conversion Over Non-Parallel Speech Corpus


11.Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset

Code:https://github.com/HLTSingapore/Emotional-Speech-Data 


12.Towards Low-Resource StarGAN Voice Conversion Using Weight Adaptive Instance Normalization

Code: https://github.com/MingjieChen/LowResourceVC 


13.Building Multilingual TTS Using Cross-Lingual Voice Conversion


14.EmoCat: Language-Agnostic Emotional Voice Conversion


Follow this link for the source code links and PDF versions of the papers: https://www.shenlanxueyuan.com/page/57

