
Journal articles on the topic 'Tacotron-2'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 29 journal articles for your research on the topic 'Tacotron-2.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles across a wide variety of disciplines and organise your bibliography correctly.

1

Liu, Yifan, and Jin Zheng. "Es-Tacotron2: Multi-Task Tacotron 2 with Pre-Trained Estimated Network for Reducing the Over-Smoothness Problem." Information 10, no. 4 (2019): 131. http://dx.doi.org/10.3390/info10040131.

Full text
Abstract:
Text-to-speech synthesis is a computational technique for producing synthetic, human-like speech. In recent years, speech synthesis techniques have developed and have been employed in many applications, such as automatic translation applications and car navigation systems. End-to-end text-to-speech synthesis has gained considerable research interest because, compared to traditional models, the end-to-end model is easier to design and more robust. Tacotron 2 is an integrated state-of-the-art end-to-end speech synthesis system that can directly predict close-to-natural human speech…
APA, Harvard, Vancouver, ISO, and other styles
2

Tran, Duc Chung. "The first FOSD-tacotron-2-based text-to-speech application for Vietnamese." Bulletin of Electrical Engineering and Informatics 10, no. 2 (2021): 898–903. http://dx.doi.org/10.11591/eei.v10i2.2539.

Full text
Abstract:
Recently, with the development and deployment of voicebots, which help to minimize personnel at call centers, text-to-speech (TTS) systems supporting English and Chinese have attracted the attention of researchers and corporations worldwide. However, there are very few published works on TTS for Vietnamese. Thus, this paper presents in detail the development of the first Tacotron-2-based TTS application for Vietnamese, which utilizes the publicly available FPT open speech dataset (FOSD) containing approximately 30 hours of labeled audio files together with their transcripts. The dataset was made…
3

Tran, Duc Chung. "The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset." Data in Brief 31 (August 2020): 105775. http://dx.doi.org/10.1016/j.dib.2020.105775.

Full text
4

Tran, Duc Chung. "The first FOSD-tacotron-2-based text-to-speech application for Vietnamese." Bulletin of Electrical Engineering and Informatics 10, no. 2 (2021): 898–903. https://doi.org/10.11591/eei.v10i2.2539.

Full text
Abstract:
Recently, with the development and deployment of voicebots, which help to minimize personnel at call centers, text-to-speech (TTS) systems supporting English and Chinese have attracted the attention of researchers and corporations worldwide. However, there are very few published works on TTS for Vietnamese. Thus, this paper presents in detail the development of the first Tacotron-2-based TTS application for Vietnamese, which utilizes the publicly available FPT open speech dataset (FOSD) containing approximately 30 hours of labeled audio files together with their transcripts. The dataset was made…
5

Rono, Kelvin Kiptoo, Ciira Wa Maina, and Elijah Mwangi. "Development of a Kiswahili Text-to-Speech System based on Tacotron 2 and Wave Net Vocoder." International Journal of Electrical and Electronics Engineering 10, no. 2 (2023): 75–83. http://dx.doi.org/10.14445/23488379/ijeee-v10i2p107.

Full text
6

García, Víctor, Inma Hernáez, and Eva Navas. "Evaluation of Tacotron Based Synthesizers for Spanish and Basque." Applied Sciences 12, no. 3 (2022): 1686. http://dx.doi.org/10.3390/app12031686.

Full text
Abstract:
In this paper, we describe the implementation and evaluation of text-to-speech synthesizers based on neural networks for Spanish and Basque. Several voices were built, all of them using a limited amount of data. The system applies Tacotron 2 to compute mel-spectrograms from the input sequence, followed by WaveGlow as a neural vocoder to obtain the audio signals from the spectrograms. The limited amount of data used for training the models leads to synthesis errors in some sentences. To automatically detect those errors, we developed a new method that is able to find the sentences that have lost…
7

Sirohi, Anant. "Research Paper on Text to Audio Converter using NLP." International Journal for Research in Applied Science and Engineering Technology 13, no. 5 (2025): 1313–16. https://doi.org/10.22214/ijraset.2025.70467.

Full text
Abstract:
The development of text-to-speech (TTS) systems has advanced significantly with the introduction of deep learning-based models. This paper investigates the impact of various deep learning architectures, such as WaveNet and Tacotron 2, on the naturalness of synthesized speech. By leveraging convolutional neural networks (CNNs) and recurrent neural networks (RNNs), we explore techniques for improving prosody, intonation, and speech quality. Our experiments show that the integration of attention mechanisms and vocoder models leads to more accurate and human-like speech output, particularly in complex sentences…
8

Savkova, Tatiana, Ivan Opirskyy, and Dmytro Sabodashko. "STUDYING THE RESISTANCE OF BIOMETRIC AUTHENTICATION SYSTEMS TO ATTACKS USING VOICE CLONING TECHNOLOGY BASED ON DEEP NEURAL NETWORKS." Cybersecurity: Education, Science, Technique 2, no. 26 (2024): 27–43. https://doi.org/10.28925/2663-4023.2024.26.670.

Full text
Abstract:
With the development of voice synthesis technologies based on deep neural networks, the security threats to biometric authentication systems that use voice recognition have increased. These systems, which were considered reliable, can be easily compromised by fake voices created using advanced models such as WaveNet, Tacotron 2, and other modern algorithms. In today's cybersecurity environment, such attacks jeopardize the confidentiality of personal data, which necessitates the improvement of protection methods. The purpose of this article is to study the resilience of biometric authentication…
9

Abuali, Batool, and Mohamad-Bassam Kurdy. "Full Diacritization of the Arabic Text to Improve Screen Readers for the Visually Impaired." Advances in Human-Computer Interaction 2022 (July 18, 2022): 1–9. http://dx.doi.org/10.1155/2022/1186678.

Full text
Abstract:
This paper aims to find the relationship between the full diacritization of the Arabic text and the quality of the speech synthesized in screen readers and presents a new methodology to develop screen readers for the visually impaired, focusing on preprocessing and diacritization of the text before converting it to audio. First, the actual need for our proposal was measured by conducting a MOS (Mean Opinion Score) questionnaire to evaluate the quality of the speech synthesized before and after full diacritization in the NVDA (https://www.nvda-ar.org/) screen reader. Then, an e-reader was built…
10

Oyucu, Saadin. "A Novel End-to-End Turkish Text-to-Speech (TTS) System via Deep Learning." Electronics 12, no. 8 (2023): 1900. http://dx.doi.org/10.3390/electronics12081900.

Full text
Abstract:
Text-to-Speech (TTS) systems have made strides, but creating natural-sounding human voices remains challenging. Existing methods rely on noncomprehensive models with only one-layer nonlinear transformations, which are less effective for processing complex data such as speech, images, and video. To overcome this, deep learning (DL)-based solutions have been proposed for TTS but require a large amount of training data. Unfortunately, there is no available corpus for Turkish TTS, unlike English, which has ample resources. To address this, our study focused on developing a Turkish speech synthesis…
11

Zhang, Jing-Xuan, Korin Richmond, Zhen-Hua Ling, and Lirong Dai. "TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 16 (2021): 14402–10. http://dx.doi.org/10.1609/aaai.v35i16.17693.

Full text
Abstract:
This paper presents TaLNet, a model for voice reconstruction with ultrasound tongue and optical lip videos as inputs. TaLNet is based on an encoder-decoder architecture. Separate encoders are dedicated to processing the tongue and lip data streams, respectively. The decoder predicts acoustic features conditioned on the encoder outputs and speaker codes. To mitigate having only relatively small amounts of dual articulatory-acoustic data available for training, and since our task here shares with text-to-speech (TTS) the common goal of speech generation, we propose a novel transfer learning strategy…
12

Tiwari, Kartik. "Deep Learning Based TTS-STT Model with Transliteration for Indic Languages." International Journal for Research in Applied Science and Engineering Technology 9, no. 12 (2021): 2207–13. http://dx.doi.org/10.22214/ijraset.2021.39689.

Full text
Abstract:
This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit called ESPnet-TTS, an open-source extension of the ESPnet speech processing toolkit. ESPnet-TTS provides various models, including Tacotron 2, Transformer TTS, and FastSpeech. It also provides recipes in the style recommended by the Kaldi speech recognition toolkit (ASR), designed to be combined with the ESPnet ASR recipes, which provides high performance. The toolkit also provides pre-trained models and samples for all recipes for users to use as a baseline. It works on TTS-STT and transl…
13

Torres Núñez del Prado, Paola. "AIELSON: A neural spoken-word poetry generator with a distinct South American voice." Journal of Interdisciplinary Voice Studies 7, no. 1 (2022): 11–33. http://dx.doi.org/10.1386/jivs_00052_1.

Full text
Abstract:
Human-computer interaction will soon be framed as a dialogue between two agents, rather than the imposition of the needs and desires of the human entity on an inert machine. As the latter becomes seemingly more intelligent, we will witness how machines reshape art, knowledge and society in general even more in the not-so-distant future. In this framework, decolonization of their algorithms becomes imperative so as not to reproduce the ethnic and cultural biases that prevail in contemporary human society. By using a pre-trained transformer-based language model (GPT-2) (Radford et al. 2019a), r…
14

González-Docasal, Ander, and Aitor Álvarez. "Enhancing Voice Cloning Quality through Data Selection and Alignment-Based Metrics." Applied Sciences 13, no. 14 (2023): 8049. http://dx.doi.org/10.3390/app13148049.

Full text
Abstract:
Voice cloning, an emerging field in the speech-processing area, aims to generate synthetic utterances that closely resemble the voices of specific individuals. In this study, we investigated the impact of various techniques on improving the quality of voice cloning, specifically focusing on a low-quality dataset. To contrast our findings, we also used two high-quality corpora for comparative analysis. We conducted exhaustive evaluations of the quality of the gathered corpora in order to select the most-suitable data for the training of a voice-cloning system. Following these measurements, we c…
15

P. Jayanth, K. Lakshmi Sree, K. Karthik Kumar Reddy, G. Om Prakash, and G. Reddy Prasad. "Vision-to-Voice: AI for generating Description & Audio of Visual Content." International Research Journal of Innovations in Engineering and Technology 09, Special Issue ICCIS (2025): 206–13. https://doi.org/10.47001/irjiet/2025.iccis-202533.

Full text
Abstract:
The seamless transformation of visual content into descriptive text and naturalistic speech, termed Vision-to-Voice, represents a significant interdisciplinary advancement at the intersection of computer vision, natural language processing (NLP), and speech synthesis. This paper explores the development of an end-to-end Vision-to-Voice pipeline, encompassing visual scene understanding, semantic description generation, and high-quality speech synthesis, thereby enabling AI systems to narrate visual content for human users. The proposed methodology integrates Transformer-based image ca…
16

Zhang, Lusheng, Shie Wu, and Zhongxun Wang. "Phoneme-Aware Hierarchical Augmentation and Semantic-Aware SpecAugment for Low-Resource Cantonese Speech Recognition." Sensors 25, no. 14 (2025): 4288. https://doi.org/10.3390/s25144288.

Full text
Abstract:
Cantonese Automatic Speech Recognition (ASR) is hindered by tonal complexity, acoustic diversity, and a lack of labelled data. This study proposes a phoneme-aware hierarchical augmentation framework that enhances performance without additional annotation. A Phoneme Substitution Matrix (PSM), built from Montreal Forced Aligner alignments and Tacotron-2 synthesis, injects adversarial phoneme variants into both transcripts and their aligned audio segments, enlarging pronunciation diversity. Concurrently, a semantic-aware SpecAugment scheme exploits wav2vec 2.0 attention heat maps and keyword boundaries…
17

Sang, Songzhen, and Wanlin Li. "P‐3.10: Research on Key Technologies of Virtual Digital Human." SID Symposium Digest of Technical Papers 56, S1 (2025): 901–4. https://doi.org/10.1002/sdtp.18960.

Full text
Abstract:
Virtual digital humans can not only express images and sounds, but also simulate the emotions and reactions of real humans through interaction with users. Their applications have penetrated widely into many fields, such as education, medical care, entertainment, and customer service. This paper focuses on two core technologies in digital human technology: speech generation (TTS, Text-to-Speech) and image generation and processing, and explores their development history, technical challenges, and future development trends. First, this paper analyzes the evolution…
18

Qiu, Zeyu, Jun Tang, Yaxin Zhang, Jiaxin Li, and Xishan Bai. "A Voice Cloning Method Based on the Improved HiFi-GAN Model." Computational Intelligence and Neuroscience 2022 (October 11, 2022): 1–12. http://dx.doi.org/10.1155/2022/6707304.

Full text
Abstract:
With the aim of adapting a source text-to-speech (TTS) model to synthesize a personal voice using a few speech samples from the target speaker, voice cloning provides a specific TTS service. Although the Tacotron 2-based multi-speaker TTS system can implement voice cloning by introducing a d-vector into the speaker encoder, the speaker characteristics described by the d-vector cannot account for the voice information of the entire utterance. This affects the similarity of voice cloning. As a vocoder, WaveNet sacrifices speech generation speed. To balance the relationship between model parameters…
19

Galdino, Julio Cesar, Ariadne Nascimento Matos, Flaviane Romani Fernandes Svartman, and Sandra Maria Aluisio. "The evaluation of prosody in speech synthesis: a systematic review." Journal of the Brazilian Computer Society 31, no. 1 (2025): 466–87. https://doi.org/10.5753/jbcs.2025.5468.

Full text
Abstract:
This paper presents a systematic review of the relationship between prosody and speech synthesis, focusing on the evaluation of prosodic parameters of synthesized speech. The relevance of the topic lies in the fact that the task of speech synthesis has not yet been solved; therefore, the information obtained in this review can contribute to knowledge and to the improvement of the methodologies used in evaluating the prosody of synthesized speech. To select studies, we used the Parsifal platform, including 100 studies published between 2020 and 2024, with the purpose of answering eight previou…
20

Li, Naihan, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. "Neural Speech Synthesis with Transformer Network." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6706–13. http://dx.doi.org/10.1609/aaai.v33i01.33016706.

Full text
Abstract:
Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long dependencies using current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden state…
21

Aziz, Azrul Fahmi Abdul, Sabrina Tiun, and Noraini Ruslan. "End to End Text to Speech Synthesis for Malay Language using Tacotron and Tacotron 2." International Journal of Advanced Computer Science and Applications 14, no. 6 (2023). http://dx.doi.org/10.14569/ijacsa.2023.0140644.

Full text
22

Patole, Mrunalinee, Akhilesh Pandey, Kaustubh Bhagwat, Mukesh Vaishnav, and Salikram Chadar. "A Survey on “Text-to-Speech Systems for Real-Time Audio Synthesis”." International Journal of Advanced Research in Science, Communication and Technology, June 10, 2021, 375–79. http://dx.doi.org/10.48175/ijarsct-1400.

Full text
Abstract:
Text-to-speech (TTS) is a form of speech synthesis wherein text is converted into spoken, human-like voice output. The state-of-the-art approaches to TTS employ a neural-network-based method. This work aims to examine some of the problems and limitations present in current systems, especially Tacotron-2, and attempts to further improve its performance by modifying its architecture. Many papers have been published on these topics, presenting various TTS systems and new TTS products. The aim is to have a…
23

Chen, Lijiang, Jie Ren, Pengfei Chen, Xia Mao, and Qi Zhao. "Limited text speech synthesis with electroglottograph based on Bi-LSTM and modified Tacotron-2." Applied Intelligence, March 12, 2022. http://dx.doi.org/10.1007/s10489-021-03075-x.

Full text
Abstract:
This paper proposes a framework for applying only the EGG signal to speech synthesis in the limited-categories-of-contents scenario. EGG is a kind of physiological signal that can reflect the trends of vocal cord movement. Noting EGG's different acquisition method compared with speech signals, we explore its application to speech synthesis under the following two scenarios: (1) to synthesize speech under high-noise circumstances, where clean speech signals are unavailable; (2) to enable people who cannot speak but retain vocal cord vibration to speak again. Our study consists of two st…
24

Rono, Kelvin Kiptoo, Ciira wa Maina, and Elijah Mwangi. "Development of a Kiswahili Text-to-Speech System Based on Tacotron 2 and WaveNet Vocoder." SSRN Electronic Journal, 2022. http://dx.doi.org/10.2139/ssrn.4027431.

Full text
25

Sarasola, Xabier, Ander Corral, Igor Leturia, and Iñigo Morcillo. "Hizlari-bektore manipulazioaren bidezko genero-anbiguoko hizketaren sintesia euskaraz." EKAIA Euskal Herriko Unibertsitateko Zientzia eta Teknologia Aldizkaria, September 24, 2024. http://dx.doi.org/10.1387/ekaia.26334.

Full text
Abstract:
There is growing interest in text-to-speech (TTS) systems with gender-ambiguous voices, among other reasons because of their potential to avoid gender biases and stereotypes in virtual assistants and smart speakers. In this article, we apply new voice conversion techniques to speaker vectors in order to obtain neural-network-based gender-ambiguous TTS systems for Basque. The speaker vectors are obtained by training a multi-speaker Tacotron 2. We have compared systems that use speaker-vector normalization and a scale parameter with systems that do not, both…
26

"Cross-Language Speech Synthesis using Transfer Learning." REST Journal on Data Analytics and Artificial Intelligence 4, no. 1 (March 2025): 631–35. https://doi.org/10.46632/jdaai/4/1/80.

Full text
Abstract:
The advancement of Text-to-Speech (TTS) technology has opened new possibilities for personalized voice applications, particularly in multilingual contexts. This project focuses on developing a cross-language speech synthesis system, converting English text into high-quality Telugu audio. The primary challenge lies in retaining the unique vocal characteristics of a specific speaker during this language transformation. By leveraging advanced deep learning models, such as Tacotron 2 and WaveGlow, along with transfer learning techniques, the system addresses linguistic and phonetic differences to…
27

Satija, Ishita, Vina Lomte, Yash Wani, Digisha Kaneria, and Shubham Yadav. "Text-To-Speech Synthesis Using Transfer Learning." International Journal of Advanced Research in Science, Communication and Technology, April 9, 2021, 139–44. http://dx.doi.org/10.48175/ijarsct-956.

Full text
Abstract:
We describe a neural-network-based system for text-to-speech (TTS) synthesis that can generate speech audio in the voices of different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network; (2) a sequence-to-sequence synthesis network based on Tacotron 2; (3) an autoregressive WaveNet-based vocoder network. We demonstrate that the proposed model can transfer the knowledge of speaker variability learned by the discriminatively trained speaker encoder to the multi-speaker TTS task, and can incor…
28

Bhuvan Shridhar and Barath M. "Autoregressive Speech-To-Text Alignment is a Critical Component of Neural Text-To-Speech (TTS) Models." International Journal of Scientific Research in Science, Engineering and Technology, December 5, 2022, 310–16. http://dx.doi.org/10.32628/ijsrset229643.

Full text
Abstract:
Autoregressive speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Commonly, autoregressive TTS models rely on an attention mechanism to train these alignments online, but these alignments are often brittle and fail to generalize in long utterances or out-of-domain text, leading to missing or repeated words. Non-autoregressive end-to-end TTS models usually rely on durations extracted from external sources. Our work exploits the alignment mechanism proposed in RAD-, which can be applied to various neural TTS architectures. In our experiments, the proposed alignment lear…
29

Nikoghosyan, K. H. "LEVERAGING PAUSE DETECTION FOR ENHANCED TTS DATASET GENERATION." Proceedings of National Polytechnic University of Armenia. INFORMATION TECHNOLOGIES, ELECTRONICS, RADIO ENGINEERING, 2024. https://doi.org/10.53297/18293336-2024.2-45.

Full text
Abstract:
This paper presents a novel tool designed to facilitate the creation of datasets for Text-to-Speech (TTS) systems. Current methods for generating these datasets often involve labor-intensive manual segmentation and transcription of audio data. The proposed tool addresses this challenge by utilizing pause detection to automatically segment the audio input, such as audiobooks, into manageable segments, typically around 15 seconds in length. The average pause duration is computed to guide the segmentation process. The tool also provides a user-friendly interface where the segmented audio and its…