Dissertations / Theses on the topic 'Speech-to-text systems'
Consult the top 50 dissertations / theses for your research on the topic 'Speech-to-text systems.'
Chan, Ngor-chi. "Text-to-speech conversion for Putonghua." Hong Kong: University of Hong Kong, 1990. http://sunzi.lib.hku.hk/hkuto/record.jsp?B12929475.
陳我智 and Ngor-chi Chan. "Text-to-speech conversion for Putonghua." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1990. http://hub.hku.hk/bib/B31209580.
Breitenbücher, Mark. "Textvorverarbeitung zur deutschen Version des Festival Text-to-Speech Synthese Systems." [S.l.]: Universität Stuttgart, Fakultät Philosophie, 1997. http://www.bsz-bw.de/cgi-bin/xvms.cgi?SWB6783514.
Baloyi, Ntsako. "A text-to-speech synthesis system for Xitsonga using hidden Markov models." Thesis, University of Limpopo (Turfloop Campus), 2012. http://hdl.handle.net/10386/1021.
This research study focuses on building a general-purpose working Xitsonga speech synthesis system that is as intelligible, natural sounding, and flexible as reasonably possible. The system has to be able to model some of the desirable speaker characteristics and speaking styles. This research project forms part of the broader national speech technology project that aims at developing spoken language systems for human-machine interaction in the eleven official languages of South Africa (SA). Speech synthesis is the reverse of automatic speech recognition (which receives speech as input and converts it to text) in that it receives text as input and produces synthesized speech as output. It is generally accepted that most people find listening to spoken utterances easier than reading the equivalent text. The Xitsonga speech synthesis system has been developed using a hidden Markov model (HMM) speech synthesis method. The HMM-based speech synthesis (HTS) system synthesizes speech that is intelligible and natural sounding, and it can do so from a footprint of only a few megabytes of training speech data. The HTS toolkit is applied as a patch to the HTK toolkit, a hidden Markov model toolkit primarily designed for building and manipulating HMMs in speech recognition.
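As a rough illustration of the parametric idea behind HMM-based synthesis described in this abstract, here is a minimal Python sketch: each state contributes a mean acoustic-parameter vector for a predicted number of frames, and the trajectory is then smoothed. All values are toy numbers; a real HTS voice models spectrum, F0 and duration jointly with decision-tree-clustered, context-dependent states.

```python
# Minimal sketch of the core idea behind HMM-based parametric synthesis:
# each state holds a mean acoustic-parameter vector, a duration model decides
# how many frames to stay there, and the output trajectory is smoothed.
import numpy as np

def generate_trajectory(state_means, state_durations, smooth=5):
    """Repeat each state's mean vector for its duration, then smooth."""
    frames = np.concatenate([np.tile(m, (d, 1))
                             for m, d in zip(state_means, state_durations)])
    # moving-average smoothing as a stand-in for the maximum-likelihood
    # parameter generation with delta features used in real HTS systems
    kernel = np.ones(smooth) / smooth
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"),
                               0, frames)

# toy example: 3 states, 2-dimensional acoustic parameters
means = [np.array([1.0, 0.0]), np.array([0.5, 0.8]), np.array([0.0, 0.2])]
durations = [4, 6, 5]            # frames predicted by a duration model
print(generate_trajectory(means, durations).shape)   # (15, 2)
```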
Engell, Trond Bøe. "TaleTUC: Text-to-Speech and Other Enhancements to Existing Bus Route Information Systems." Thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap, 2012. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-18920.
Lambert, Tanya. "Databases for concatenative text-to-speech synthesis systems : unit selection and knowledge-based approach." Thesis, University of East Anglia, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.421192.
Levefeldt, Christer. "Evaluation of NETtalk as a means to extract phonetic features from text for synchronization with speech." Thesis, University of Skövde, Department of Computer Science, 1998. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-173.
The background for this project is a wish to automate synchronization of text and speech. The idea is to present speech through speakers synchronized word-for-word with text appearing on a monitor.
The solution decided upon is to use artificial neural networks, ANNs, to convert both text and speech into streams made up of sets of phonetic features and then to match these two streams against each other. Several text-to-feature ANN designs based on the NETtalk system are implemented and evaluated. The extraction of phonetic features from speech and the synchronization itself are not implemented, but some assessments are made regarding their possible performance. The performance of a finished system cannot yet be determined, but a NETtalk-based ANN is believed to be suitable for such a system using phonetic features for synchronization.
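To make the NETtalk-style text-to-feature idea concrete, here is a minimal Python sketch of the architecture: a sliding window of letters is one-hot encoded and mapped by a small MLP to a phonetic-feature vector for the centre letter. The weights are random (untrained), and the letter set, window size and feature count are illustrative assumptions, not the configurations evaluated in the thesis.

```python
# NETtalk-style sketch: one-hot letter window -> MLP -> phonetic features.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz _"      # '_' pads the window edges
WINDOW, N_FEATURES = 7, 20                      # e.g. voicing, place, manner...

def one_hot_window(text, centre):
    padded = "_" * (WINDOW // 2) + text.lower() + "_" * (WINDOW // 2)
    window = padded[centre:centre + WINDOW]
    vec = np.zeros(WINDOW * len(ALPHABET))
    for i, ch in enumerate(window):
        vec[i * len(ALPHABET) + ALPHABET.index(ch if ch in ALPHABET else " ")] = 1.0
    return vec

rng = np.random.default_rng(0)
W1 = rng.normal(size=(80, WINDOW * len(ALPHABET)))   # hidden layer weights
W2 = rng.normal(size=(N_FEATURES, 80))               # phonetic-feature outputs

def letter_to_features(text, centre):
    h = np.tanh(W1 @ one_hot_window(text, centre))
    return 1 / (1 + np.exp(-(W2 @ h)))               # feature activations in [0, 1]

print(letter_to_features("synchronize", 3).round(2))
```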
Yoon, Kyuchul. "Building a prosodically sensitive diphone database for a Korean text-to-speech synthesis system." Connect to this title online, 2005. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1119010941.
Title from first page of PDF file. Document formatted into pages; contains xxii, 291 p.; also includes graphics (some col.). Includes bibliographical references (p. 210-216). Available online via OhioLINK's ETD Center.
Thorstensson, Niklas. "A knowledge-based grapheme-to-phoneme conversion for Swedish." Thesis, University of Skövde, Department of Computer Science, 2002. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-731.
A text-to-speech system is a complex system consisting of several different modules such as grapheme-to-phoneme conversion, articulatory and prosodic modelling, voice modelling etc.
This dissertation is aimed at the creation of the initial part of a text-to-speech system, i.e. the grapheme-to-phoneme conversion, designed for Swedish. The problem area at hand is the conversion of orthographic text into a phonetic representation that can be used as a basis for a future complete text-to-speech system.
The central issue of the dissertation is the grapheme-to-phoneme conversion and the elaboration of rules and algorithms required to achieve this task. The dissertation aims to prove that it is possible to make such a conversion by a rule-based algorithm with reasonable performance. Another goal is to find a way to represent phonotactic rules in a form suitable for parsing. It also aims to find and analyze problematic structures in written text compared to phonetic realization.
This work proposes a knowledge-based grapheme-to-phoneme conversion system for Swedish. The system suggested here is implemented, tested, evaluated and compared to other existing systems. The results achieved are promising, and show that the system is fast, with a high degree of accuracy.
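As a toy illustration of the kind of rule-based grapheme-to-phoneme conversion this abstract describes, here is a minimal Python sketch with an ordered rule list and longest-match-first application. The rules and phoneme symbols below are invented examples for a handful of Swedish-like spellings, not the thesis's actual rule set.

```python
# Toy ordered-rule grapheme-to-phoneme converter (longest match first).
RULES = [
    # (grapheme, right-context letters that must follow, phoneme string)
    ("stj", "",    "x"),     # 'stjärna' -> sje-sound, illustrative
    ("sk",  "eiy", "x"),     # 'sk' before a front vowel
    ("ck",  "",    "k"),
    ("k",   "eiy", "C"),     # tje-sound before a front vowel, illustrative
    ("o",   "",    "u"),
]

def g2p(word):
    word, out, i = word.lower(), [], 0
    while i < len(word):
        for grapheme, context, phoneme in RULES:
            if word.startswith(grapheme, i) and (
                    not context or (i + len(grapheme) < len(word)
                                    and word[i + len(grapheme)] in context)):
                out.append(phoneme)
                i += len(grapheme)
                break
        else:                      # default: a letter maps to itself
            out.append(word[i])
            i += 1
    return " ".join(out)

print(g2p("stjärna"), "|", g2p("kyrka"), "|", g2p("skepp"))
```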
Mhlana, Siphe. "Development of isiXhosa text-to-speech modules to support e-Services in marginalized rural areas." Thesis, University of Fort Hare, 2011. http://hdl.handle.net/10353/495.
Eksvärd, Siri, and Julia Falk. "Evaluating Speech-to-Text Systems and AR-glasses : A study to develop a potential assistive device for people with hearing impairments." Thesis, Uppsala universitet, Avdelningen för visuell information och interaktion, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-437608.
Speech-to-Text System using Augmented Reality for People with Hearing Deficits
Micallef, Paul. "A text to speech synthesis system for Maltese." Thesis, University of Surrey, 1997. http://epubs.surrey.ac.uk/842702/.
Monaghan, Alexander Ian Campbell. "Intonation in a text-to-speech conversion system." Thesis, University of Edinburgh, 1991. http://hdl.handle.net/1842/20023.
Reynolds, Douglas A. "A Gaussian mixture modeling approach to text-independent speaker identification." Diss., Georgia Institute of Technology, 1992. http://hdl.handle.net/1853/16903.
Rousseau, Francois. "Design of an advanced Text-To-Speech system for Afrikaans." Master's thesis, University of Cape Town, 2006. http://hdl.handle.net/11427/5112.
Includes bibliographical references (leaves 87-92).
Afrikaans is the home language of approximately six million people in South Africa. The need for an Afrikaans TTS system comes with the growing interest in integrating speech technology in all eleven languages of the country. The ultimate goal here is to enable communication between man and machine using speech. This can be achieved by implementing multilingual speech technology systems that all the people in South Africa can understand and relate to. Understandability, flexibility, naturalness and pleasantness are the requirements of an advanced TTS system. The technique of concatenative speech synthesis has been the most successful in meeting all these requirements. The Festival speech synthesis system uses two popular concatenative techniques to design new TTS systems in different languages: diphone concatenative synthesis (DCS) and unit selection synthesis (USS).
Cohen, Andrew Dight. "The use of learnable phonetic representations in connectionist text-to-speech system." Thesis, University of Reading, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360787.
Mohasi, Lehlohonolo. "Prosody modelling for a Sesotho text-to-speech system using the Fujisaki model." Thesis, Stellenbosch : Stellenbosch University, 2015. http://hdl.handle.net/10019.1/97050.
Swart, Philippa H. "Prosodic features of imperatives in Xhosa : implications for a text-to-speech system." Thesis, Stellenbosch : Stellenbosch University, 2000. http://hdl.handle.net/10019.1/51891.
This study focuses on the prosodic features of imperatives and the role of prosody in the development of a text-to-speech (TTS) system for Xhosa, an African tone language. The perception of prosody is manifested in suprasegmental features such as fundamental frequency (pitch), intensity (loudness) and duration (length). Very little experimental research has been done on the prosodic features of any grammatical structures (moods and tenses) in Xhosa; it has therefore not yet been determined how and to what degree the different prosodic features are combined and utilized in the production and perception of Xhosa speech. One such grammatical structure, for which no explicit descriptive phonetic information exists, is the imperative mood expressing commands. In this study it was shown how the relationship between duration, pitch and loudness, as manifested in the production and perception of Xhosa imperatives, could be determined through acoustic analyses and perceptual experiments. An experimental phonetic approach proved to be essential for the acquisition of substantial and reliable prosodic information. An extensive acoustic analysis was conducted to acquire prosodic information on the production of imperatives by Xhosa mother-tongue speakers. Subsequently, various statistical parameters were calculated on the raw acoustic data (i) to establish patterns of significance and (ii) to represent the large amount of numeric data generated in a compact manner. A perceptual experiment was conducted to investigate the perception of imperatives. The prosodic parameters that were extracted from the acoustic analysis were applied to synthesize imperatives in different contexts. A novel approach to Xhosa speech synthesis was adopted: verbs uttered on a monotone were recorded by one speaker, and the pitch and duration of these words were then manipulated with the TD-PSOLA technique. Combining the results of the acoustic analysis and the perceptual experiment made it possible to present a prosodic model for the generation of perceptually acceptable imperatives in a practical Xhosa TTS system. Prosody generation in a natural language processing (NLP) module and its place within the larger framework of text-to-speech synthesis was discussed. It was shown that existing architectures for TTS synthesis would not be appropriate for Xhosa without some adaptation. Hence, a unique architecture was suggested and its possible application subsequently illustrated. Of particular importance was the development of an alternative algorithm for grapheme-to-phoneme conversion. Keywords: prosody, speech synthesis, speech perception, acoustic analysis, Xhosa
Mohasi, Lehlohonolo. "Design of an advanced and fluent Sesotho text-to-speech system through intonation." Master's thesis, University of Cape Town, 2006. http://hdl.handle.net/11427/5155.
Full textNguyen, Thi Thu Trang. "HMM-based Vietnamese Text-To-Speech : Prosodic Phrasing Modeling, Corpus Design System Design, and Evaluation." Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112201/document.
The thesis objective is to design and build a high-quality Hidden Markov Model (HMM) based Text-To-Speech (TTS) system for Vietnamese, a tonal language. The system is called VTED (Vietnamese TExt-to-speech Development system). In view of the great importance of lexical tones, a "tonophone", an allophone in tonal context, was proposed as a new speech unit in our TTS system. A new training corpus, VDTS (Vietnamese Di-Tonophone Speech corpus), was designed for 100% coverage of di-phones in tonal contexts (i.e. di-tonophones) using a greedy algorithm over a huge raw text. A total of about 4,000 sentences of VDTS were recorded and pre-processed as a training corpus for VTED. In HMM-based speech synthesis, although pause duration can be modeled as a phoneme, the appearance of pauses cannot be predicted by HMMs, and phrasing levels above the word may not be completely modeled with basic features. This research aimed at automatic prosodic phrasing for Vietnamese TTS using durational clues alone, as it appeared too difficult to disentangle intonation from lexical tones. Syntactic blocks, i.e. syntactic phrases with a bounded number of syllables (n), were proposed for predicting final lengthening (n = 6) and pause appearance (n = 10). Final lengthening was further improved by several strategies for grouping single syntactic blocks. The predictive J48-decision-tree model for pause appearance using syntactic blocks combined with syntactic-link and POS (Part-Of-Speech) features reached an F-score of 81.4% (Precision = 87.6%, Recall = 75.9%), much better than the model with only POS (F-score = 43.6%) or only syntactic-link (F-score = 52.6%) features. The architecture of the system was based on the core architecture of HTS, extended with a Natural Language Processing part for Vietnamese. Pause appearance was predicted by the proposed model. The contextual feature set included phone identity features, locational features, tone-related features, and prosodic features (POS, final lengthening, break levels). Mary TTS was chosen as the platform for implementing VTED. In the MOS (Mean Opinion Score) test, the first VTED, trained with the old corpus and basic features, was rather good, 0.81 points (on a 5-point MOS scale) higher than the previous system, HoaSung (non-uniform unit selection with the same training corpus), but still 1.2-1.5 points lower than natural speech. The quality of the final VTED, trained with the new corpus and the prosodic phrasing model, improved by about 1.04 points compared to the first VTED, and its gap with natural speech was much reduced. In the tone intelligibility test, the final VTED reached a high correct rate of 95.4%, only 2.6% lower than natural speech and 18% higher than the initial system. In the intelligibility test with the Latin square design, the error rate of the first VTED was about 6-12% higher than natural speech, depending on the syllable, tone or phone level, while the final system diverged from natural speech by only about 0.4-1.4%.
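As a small illustration of the greedy corpus-selection idea mentioned in this abstract, here is a minimal Python sketch that repeatedly picks the sentence adding the most not-yet-covered units. Plain character bigrams stand in for di-tonophones, and the sentences are toy data, not the thesis's Vietnamese corpus.

```python
# Greedy sentence selection for unit coverage (bigrams stand in for units).
def units(sentence):
    s = sentence.replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def greedy_select(sentences):
    covered, selected = set(), []
    remaining = list(sentences)
    while remaining:
        best = max(remaining, key=lambda s: len(units(s) - covered))
        gain = units(best) - covered
        if not gain:                      # nothing new can be added
            break
        selected.append(best)
        covered |= gain
        remaining.remove(best)
    return selected, covered

corpus = ["ba ba den", "den nha ba", "nha den ba", "ba den nha den"]
chosen, cov = greedy_select(corpus)
print(chosen)
print(sorted(cov))
```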
Gong, XiangQi. "Election markup language (EML) based tele-voting system." Thesis, University of the Western Cape, 2009. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_5841_1350999620.
voting machines, voting via the Internet, telephone, SMS and digital interactive television. This thesis concerns voting by telephone, or televoting. It starts by giving a brief overview and evaluation of various models and technologies that are implemented within such systems. The aspects of televoting that have been investigated are the technologies that provide a voice interface to the voter and conduct the voting process, namely the Election Markup Language (EML), Automated Speech Recognition (ASR) and Text-to-Speech (TTS).
Beněk, Tomáš. "Implementing and Improving a Speech Synthesis System." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236079.
Malatji, Promise Tshepiso. "The development of accented English synthetic voices." Thesis, University of Limpopo, 2019. http://hdl.handle.net/10386/2917.
A Text-to-Speech (TTS) synthesis system is a software system that receives text as input and produces speech as output. A TTS synthesis system can be used for, amongst others, language learning and reading out text for people living with different disabilities (physically challenged, visually impaired, etc.), by native and non-native speakers of the target language. Most people relate easily to a second language spoken by a non-native speaker with whom they share a native language. Most online English TTS synthesis systems are usually developed using native speakers of English. This research study focuses on developing accented English synthetic voices as spoken by non-native speakers in the Limpopo province of South Africa. The Modular Architecture for Research on speech sYnthesis (MARY) TTS engine is used in developing the synthetic voices, and the Hidden Markov Model (HMM) method is used to train them. A secondary text corpus is used to develop the training speech corpus by recording six speakers reading the text. The quality of the developed synthetic voices is measured in terms of their intelligibility, similarity and naturalness using a listening test. The results are reported overall and classified by the evaluators' occupation and gender. The subjective listening test indicates that the developed synthetic voices have a high level of acceptance in terms of similarity and intelligibility. Speech analysis software is used to compare the synthesized speech with the human recordings; there is no significant difference in voice pitch between the speakers and the synthetic voices, except for one synthetic voice.
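As a rough sketch of the kind of pitch comparison described above, the following Python example estimates the fundamental frequency of two signals by autocorrelation and compares their means. Real measurements would read the recorded and synthesized wave files; here two synthetic vowel-like signals stand in so the example is runnable.

```python
# Compare mean F0 of a "human" and a "synthetic" frame via autocorrelation.
import numpy as np

def estimate_f0(signal, sr, fmin=60.0, fmax=400.0):
    """Very basic autocorrelation pitch estimate for one voiced frame."""
    signal = signal - signal.mean()
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(corr[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(0, 0.04, 1 / sr)                      # one 40 ms frame
human = np.sin(2 * np.pi * 118 * t) + 0.3 * np.sin(2 * np.pi * 236 * t)
synth = np.sin(2 * np.pi * 123 * t) + 0.3 * np.sin(2 * np.pi * 246 * t)

f0_human, f0_synth = estimate_f0(human, sr), estimate_f0(synth, sr)
print(f"human ~{f0_human:.1f} Hz, synthetic ~{f0_synth:.1f} Hz, "
      f"difference {abs(f0_human - f0_synth):.1f} Hz")
```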
Uggerud, Nils. "AnnotEasy: A gesture and speech-to-text based video annotation tool for note taking in pre-recorded lectures in higher education." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-105962.
Full textLindgren, Viktor. "Evaluating Multi-Uav System with Text to Spech for Sitational Awarness and Workload." Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-53343.
Full textXiao, He. "An affective personality for an embodied conversational agent." Thesis, Curtin University, 2006. http://hdl.handle.net/20.500.11937/167.
Full textXiao, He. "An affective personality for an embodied conversational agent." Curtin University of Technology, Department of Computer Engineering, 2006. http://espace.library.curtin.edu.au:80/R/?func=dbin-jump-full&object_id=16139.
Full textTsai, Zong-Mou, and 蔡宗謀. "A Priliminary Study on Mandarin to Taiwanese Text-to-Speech Systems." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/82303213184115638004.
Full text國立中興大學
資訊科學與工程學系
96
There are many articles, books and magazines in Taiwan that may contain valuable information, yet they are mostly written in Mandarin. We use Mandarin-to-Taiwanese methods to turn them into Taiwanese, including speech, so that such documents become more comprehensible for Taiwanese speakers and learners. The focus of this thesis is on how to find the correct Taiwanese pronunciation. If we cannot find the exact word in a dictionary, we partition it into smaller parts and search for the pronunciation of each part, making the parts progressively smaller; finally, we consult the one-character word dictionary for the pronunciation.
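The back-off lookup described in this abstract can be sketched in a few lines of Python: try the longest dictionary match first and fall back to single characters. The tiny dictionaries below are placeholders, not the actual Mandarin-to-Taiwanese lexicon.

```python
# Longest-match dictionary lookup with single-character fallback.
WORD_DICT = {"台灣話": "tai5-uan5-ue7", "台灣": "tai5-uan5"}
CHAR_DICT = {"台": "tai5", "灣": "uan5", "話": "ue7", "講": "kong2"}

def lookup(text):
    prons, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):          # longest match first
            if text[i:j] in WORD_DICT:
                prons.append(WORD_DICT[text[i:j]])
                i = j
                break
        else:                                       # single-character fallback
            prons.append(CHAR_DICT.get(text[i], "<unk>"))
            i += 1
    return prons

print(lookup("台灣話"))    # whole word found in the word dictionary
print(lookup("講台灣話"))  # falls back to '講' plus the word '台灣話'
```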
"Cantonese text-to-speech synethesis using sub-syllable units." 2001. http://library.cuhk.edu.hk/record=b5890790.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references.
Text in English; abstracts in English and Chinese.
Law Ka Man = Li yong zi yin jie de Yue yu wen yu zhuan huan xi tong / Luo Jiawen.
Chapter 1. --- INTRODUCTION --- p.1
Chapter 1.1 --- Text analysis --- p.2
Chapter 1.2 --- Prosody prediction --- p.3
Chapter 1.3 --- Speech generation --- p.3
Chapter 1.4 --- The trend of TTS technology --- p.5
Chapter 1.5 --- TTS systems for different languages --- p.6
Chapter 1.6 --- Objectives of the thesis --- p.8
Chapter 1.7 --- Thesis outline --- p.8
References --- p.10
Chapter 2. --- BACKGROUND --- p.11
Chapter 2.1 --- Cantonese phonology --- p.11
Chapter 2.2 --- Cantonese TTS - a baseline system --- p.16
Chapter 2.3 --- Time-Domain Pitch-Synchronous-OverLap-Add --- p.17
Chapter 2.3.1 --- From speech signal to short-time analysis signals --- p.18
Chapter 2.3.2 --- From short-time analysis signals to short-time synthesis signals --- p.19
Chapter 2.3.3 --- From short-time synthesis signals to synthetic speech --- p.20
Chapter 2.4 --- Time-scale and Pitch-scale modifications --- p.20
Chapter 2.4.1 --- Voiced speech --- p.20
Chapter 2.4.2 --- Unvoiced speech --- p.21
Chapter 2.5 --- Summary --- p.22
References --- p.23
Chapter 3. --- SUB-SYLLABLE BASED TTS SYSTEM --- p.24
Chapter 3.1 --- Motivations --- p.24
Chapter 3.2 --- Choices of synthesis units --- p.27
Chapter 3.2.1 --- Sub-syllable unit --- p.29
Chapter 3.2.2 --- Diphones, demi-syllables and sub-syllable units --- p.31
Chapter 3.3 --- Proposed TTS system --- p.32
Chapter 3.3.1 --- Text analysis module --- p.33
Chapter 3.3.2 --- Synthesis module --- p.36
Chapter 3.3.3 --- Prosody module --- p.37
Chapter 3.4 --- Summary --- p.38
References --- p.39
Chapter 4. --- ACOUSTIC INVENTORY --- p.40
Chapter 4.1 --- The full set of Cantonese sub-syllable units --- p.40
Chapter 4.2 --- A reduced set of sub-syllable units --- p.42
Chapter 4.3 --- Corpus design --- p.44
Chapter 4.4 --- Recording --- p.46
Chapter 4.5 --- Post-processing of speech data --- p.47
Chapter 4.6 --- Summary --- p.51
References --- p.51
Chapter 5. --- CONCATENATION TECHNIQUES --- p.52
Chapter 5.1 --- Concatenation of sub-syllable units --- p.52
Chapter 5.1.1 --- Concatenation of plosives and affricates --- p.54
Chapter 5.1.2 --- Concatenation of fricatives --- p.55
Chapter 5.1.3 --- Concatenation of vowels, semi-vowels and nasals --- p.55
Chapter 5.1.4 --- Spectral distance measure --- p.57
Chapter 5.2 --- Waveform concatenation method --- p.58
Chapter 5.3 --- Selected examples of waveform concatenation --- p.59
Chapter 5.3.1 --- I-I concatenation --- p.60
Chapter 5.3.2 --- F-F concatenation --- p.66
Chapter 5.4 --- Summary --- p.71
References --- p.72
Chapter 6. --- PERFORMANCE EVALUATION --- p.73
Chapter 6.1 --- Listening test --- p.73
Chapter 6.2 --- Test results: --- p.74
Chapter 6.3 --- Discussions --- p.75
References --- p.78
Chapter 7. --- CONCLUSIONS & FUTURE WORKS --- p.79
Chapter 7.1 --- Conclusions --- p.79
Chapter 7.2 --- Suggested future work --- p.81
APPENDIX 1 SYLLABLE DURATION --- p.82
APPENDIX 2 PERCEPTUAL TEST PARAGRAPHS --- p.86
"Prosody analysis and modeling for Cantonese text-to-speech." 2003. http://library.cuhk.edu.hk/record=b5891678.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references.
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1. --- TTS Technology --- p.1
Chapter 1.2. --- Prosody --- p.2
Chapter 1.2.1. --- What is Prosody --- p.2
Chapter 1.2.2. --- Prosody from Different Perspectives --- p.3
Chapter 1.2.3. --- Acoustical Parameters of Prosody --- p.3
Chapter 1.2.4. --- Prosody in TTS --- p.5
Chapter 1.2.4.1 --- Analysis --- p.5
Chapter 1.2.4.2 --- Modeling --- p.6
Chapter 1.2.4.3 --- Evaluation --- p.6
Chapter 1.3. --- Thesis Objectives --- p.7
Chapter 1.4. --- Thesis Outline --- p.7
Reference --- p.8
Chapter 2 --- Cantonese --- p.9
Chapter 2.1. --- The Cantonese Dialect --- p.9
Chapter 2.1.1. --- Phonology --- p.10
Chapter 2.1.1.1 --- Initial --- p.11
Chapter 2.1.1.2 --- Final --- p.12
Chapter 2.1.1.3 --- Tone --- p.13
Chapter 2.1.2. --- Phonological Constraints --- p.14
Chapter 2.2. --- Tones in Cantonese --- p.15
Chapter 2.2.1. --- Tone System --- p.15
Chapter 2.2.2. --- Linguistic Significance --- p.18
Chapter 2.2.3. --- Acoustical Realization --- p.18
Chapter 2.3. --- Prosodic Variation in Continuous Cantonese Speech --- p.20
Chapter 2.4. --- Cantonese Speech Corpus - CUProsody --- p.21
Reference --- p.23
Chapter 3 --- F0 Normalization --- p.25
Chapter 3.1. --- F0 in Speech Production --- p.25
Chapter 3.2. --- F0 Extraction --- p.27
Chapter 3.3. --- Duration-normalized Tone Contour --- p.29
Chapter 3.4. --- F0 Normalization --- p.30
Chapter 3.4.1. --- Necessity and Motivation --- p.30
Chapter 3.4.2. --- F0 Normalization --- p.33
Chapter 3.4.2.1 --- Methodology --- p.33
Chapter 3.4.2.2 --- Assumptions --- p.34
Chapter 3.4.2.3 --- Estimation of Relative Tone Ratios --- p.35
Chapter 3.4.2.4 --- Derivation of Phrase Curve --- p.37
Chapter 3.4.2.5 --- Normalization of Absolute F0 Values --- p.39
Chapter 3.4.3. --- Experiments and Discussion --- p.39
Chapter 3.5. --- Conclusions --- p.44
Reference --- p.45
Chapter 4 --- Acoustical F0 Analysis --- p.48
Chapter 4.1. --- Methodology of F0 Analysis --- p.48
Chapter 4.1.1. --- Analysis-by-Synthesis --- p.48
Chapter 4.1.2. --- Acoustical Analysis --- p.51
Chapter 4.2. --- Acoustical F0 Analysis for Cantonese --- p.52
Chapter 4.2.1. --- Analysis of Phrase Curves --- p.52
Chapter 4.2.2. --- Analysis of Tone Contours --- p.55
Chapter 4.2.2.1 --- Context-independent Single-tone Contours --- p.56
Chapter 4.2.2.2 --- Contextual Variation --- p.58
Chapter 4.2.2.3 --- Co-articulated Tone Contours of Disyllabic Word --- p.59
Chapter 4.2.2.4 --- Cross-word Contours --- p.62
Chapter 4.2.2.5 --- Phrase-initial Tone Contours --- p.65
Chapter 4.3. --- Summary --- p.66
Reference --- p.67
Chapter 5 --- Prosody Modeling for Cantonese Text-to-Speech --- p.70
Chapter 5.1. --- Parametric Model and Non-parametric Model --- p.70
Chapter 5.2. --- Cantonese Text-to-Speech: Baseline System --- p.72
Chapter 5.2.1. --- Sub-syllable Unit --- p.72
Chapter 5.2.2. --- Text Analysis Module --- p.73
Chapter 5.2.3. --- Acoustical Synthesis --- p.74
Chapter 5.2.4. --- Prosody Module --- p.74
Chapter 5.3. --- Enhanced Prosody Model --- p.74
Chapter 5.3.1. --- Modeling Tone Contours --- p.75
Chapter 5.3.1.1 --- Word-level F0 Contours --- p.76
Chapter 5.3.1.2 --- Phrase-initial Tone Contours --- p.77
Chapter 5.3.1.3 --- Tone Contours at Word Boundary --- p.78
Chapter 5.3.2. --- Modeling Phrase Curves --- p.79
Chapter 5.3.3. --- Generation of Continuous F0 Contours --- p.81
Chapter 5.4. --- Summary --- p.81
Reference --- p.82
Chapter 6 --- Performance Evaluation --- p.83
Chapter 6.1. --- Introduction to Perceptual Test --- p.83
Chapter 6.1.1. --- Aspects of Evaluation --- p.84
Chapter 6.1.2. --- Methods of Judgment Test --- p.84
Chapter 6.1.3. --- Problems in Perceptual Test --- p.85
Chapter 6.2. --- Perceptual Tests for Cantonese TTS --- p.86
Chapter 6.2.1. --- Intelligibility Tests --- p.86
Chapter 6.2.1.1 --- Method --- p.86
Chapter 6.2.1.2 --- Results --- p.88
Chapter 6.2.1.3 --- Analysis --- p.89
Chapter 6.2.2. --- Naturalness Tests --- p.90
Chapter 6.2.2.1 --- Word-level --- p.90
Chapter 6.2.2.1.1 --- Method --- p.90
Chapter 6.2.2.1.2 --- Results --- p.91
Chapter 6.2.3.1.3 --- Analysis --- p.91
Chapter 6.2.2.2 --- Sentence-level --- p.92
Chapter 6.2.2.2.1 --- Method --- p.92
Chapter 6.2.2.2.2 --- Results --- p.93
Chapter 6.2.2.2.3 --- Analysis --- p.94
Chapter 6.3. --- Conclusions --- p.95
Chapter 6.4. --- Summary --- p.95
Reference --- p.96
Chapter 7 --- Conclusions and Future Work --- p.97
Chapter 7.1. --- Conclusions --- p.97
Chapter 7.2. --- Suggested Future Work --- p.99
Appendix --- p.100
Appendix 1 Linear Regression --- p.100
Appendix 2 36 Templates of Cross-word Contours --- p.101
Appendix 3 Word List for Word-level Tests --- p.102
Appendix 4 Syllable Occurrence in Word List of Intelligibility Test --- p.108
Appendix 5 Wrongly Identified Word List --- p.112
Appendix 6 Confusion Matrix --- p.115
Appendix 7 Unintelligible Word List --- p.117
Appendix 8 Noisy Word List --- p.119
Appendix 9 Sentence List for Naturalness Test --- p.120
"Unit selection and waveform concatenation strategies in Cantonese text-to-speech." 2005. http://library.cuhk.edu.hk/record=b5892349.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 2005.
Includes bibliographical references.
Abstracts in English and Chinese.
Chapter 1. --- Introduction --- p.1
Chapter 1.1 --- An overview of Text-to-Speech technology --- p.2
Chapter 1.1.1 --- Text processing --- p.2
Chapter 1.1.2 --- Acoustic synthesis --- p.3
Chapter 1.1.3 --- Prosody modification --- p.4
Chapter 1.2 --- Trends in Text-to-Speech technologies --- p.5
Chapter 1.3 --- Objectives of this thesis --- p.7
Chapter 1.4 --- Outline of the thesis --- p.9
References --- p.11
Chapter 2. --- Cantonese Speech --- p.13
Chapter 2.1 --- The Cantonese dialect --- p.13
Chapter 2.2 --- Phonology of Cantonese --- p.14
Chapter 2.2.1 --- Initials --- p.15
Chapter 2.2.2 --- Finals --- p.16
Chapter 2.2.3 --- Tones --- p.18
Chapter 2.3 --- Acoustic-phonetic properties of Cantonese syllables --- p.19
References --- p.24
Chapter 3. --- Cantonese Text-to-Speech --- p.25
Chapter 3.1 --- General overview --- p.25
Chapter 3.1.1 --- Text processing --- p.25
Chapter 3.1.2 --- Corpus based acoustic synthesis --- p.26
Chapter 3.1.3 --- Prosodic control --- p.27
Chapter 3.2 --- Syllable based Cantonese Text-to-Speech system --- p.28
Chapter 3.3 --- Sub-syllable based Cantonese Text-to-Speech system --- p.29
Chapter 3.3.1 --- Definition of sub-syllable units --- p.29
Chapter 3.3.2 --- Acoustic inventory --- p.31
Chapter 3.3.3 --- Determination of the concatenation points --- p.33
Chapter 3.4 --- Problems --- p.34
References --- p.36
Chapter 4. --- Waveform Concatenation for Sub-syllable Units --- p.37
Chapter 4.1 --- Previous work in concatenation methods --- p.37
Chapter 4.1.1 --- Determination of concatenation point --- p.38
Chapter 4.1.2 --- Waveform concatenation --- p.38
Chapter 4.2 --- Problems and difficulties in concatenating sub-syllable units --- p.39
Chapter 4.2.1 --- Mismatch of acoustic properties --- p.40
Chapter 4.2.2 --- Allophone problem of Initials /z/, /c/ and /s/ --- p.42
Chapter 4.3 --- General procedures in concatenation strategies --- p.44
Chapter 4.3.1 --- Concatenation of unvoiced segments --- p.45
Chapter 4.3.2 --- Concatenation of voiced segments --- p.45
Chapter 4.3.3 --- Measurement of spectral distance --- p.48
Chapter 4.4 --- Detailed procedures in concatenation points determination --- p.50
Chapter 4.4.1 --- Unvoiced segments --- p.50
Chapter 4.4.2 --- Voiced segments --- p.53
Chapter 4.5 --- Selected examples in concatenation strategies --- p.58
Chapter 4.5.1 --- Concatenation at Initial segments --- p.58
Chapter 4.5.1.1 --- Plosives --- p.58
Chapter 4.5.1.2 --- Fricatives --- p.59
Chapter 4.5.2 --- Concatenation at Final segments --- p.60
Chapter 4.5.2.1 --- V group (long vowel) --- p.60
Chapter 4.5.2.2 --- D group (diphthong) --- p.61
References --- p.63
Chapter 5. --- Unit Selection for Sub-syllable Units --- p.65
Chapter 5.1 --- Basic requirements in unit selection process --- p.65
Chapter 5.1.1 --- Availability of multiple copies of sub-syllable units --- p.65
Chapter 5.1.1.1 --- Levels of "identical" --- p.66
Chapter 5.1.1.2 --- Statistics on the availability --- p.67
Chapter 5.1.2 --- Variations in acoustic parameters --- p.70
Chapter 5.1.2.1 --- Pitch level --- p.71
Chapter 5.1.2.2 --- Duration --- p.74
Chapter 5.1.2.3 --- Intensity level --- p.75
Chapter 5.2 --- Selection process: availability check on sub-syllable units --- p.77
Chapter 5.2.1 --- Multiple copies found --- p.79
Chapter 5.2.2 --- Unique copy found --- p.79
Chapter 5.2.3 --- No matched copy found --- p.80
Chapter 5.2.4 --- Illustrative examples --- p.80
Chapter 5.3 --- Selection process: acoustic analysis on candidate units --- p.81
References --- p.88
Chapter 6. --- Performance Evaluation --- p.89
Chapter 6.1 --- General information --- p.90
Chapter 6.1.1 --- Objective test --- p.90
Chapter 6.1.2 --- Subjective test --- p.90
Chapter 6.1.3 --- Test materials --- p.91
Chapter 6.2 --- Details of the objective test --- p.92
Chapter 6.2.1 --- Testing method --- p.92
Chapter 6.2.2 --- Results --- p.93
Chapter 6.2.3 --- Analysis --- p.96
Chapter 6.3 --- Details of the subjective test --- p.98
Chapter 6.3.1 --- Testing method --- p.98
Chapter 6.3.2 --- Results --- p.99
Chapter 6.3.3 --- Analysis --- p.101
Chapter 6.4 --- Summary --- p.107
References --- p.108
Chapter 7. --- Conclusions and Future Works --- p.109
Chapter 7.1 --- Conclusions --- p.109
Chapter 7.2 --- Suggested future works --- p.111
References --- p.113
Appendix 1 Mean pitch level of Initials and Finals stored in the inventory --- p.114
Appendix 2 Mean durations of Initials and Finals stored in the inventory --- p.121
Appendix 3 Mean intensity level of Initials and Finals stored in the inventory --- p.124
Appendix 4 Test word used in performance evaluation --- p.127
Appendix 5 Test paragraph used in performance evaluation --- p.128
Appendix 6 Pitch profile used in the Text-to-Speech system --- p.131
Appendix 7 Duration model used in Text-to-Speech system --- p.132
Rato, João Pedro Cordeiro. "Conversação homem-máquina. Caracterização e avaliação do estado actual das soluções de speech recognition, speech synthesis e sistemas de conversação homem-máquina." Master's thesis, 2016. http://hdl.handle.net/10400.8/2375.
Full textTso, Chin-Heng, and 左晉恆. "HMM-Based Chinese Text-To-Speech System." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/81700635040061571703.
Full textHuang, Yi-chin, and 黃奕欽. "Emotional Text-to-Speech System of Baseball Broadcast." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/7595f3.
Full text國立中山大學
資訊工程學系研究所
96
In this study, we implement an emotional text-to-speech system for the limited domain of on-line play-by-play baseball game summaries, with the Chinese Professional Baseball League (CPBL) as the target domain. Our goal is for the output synthesized speech to be fluent and to carry appropriate emotion. The system first parses the input text and keeps the on-court information, e.g., the number of runners and which bases are occupied, the number of outs, the score of each team, and the batter's performance in the game, and then adds additional sentences to the input text. The system next produces neutral synthesized speech from the augmented text and subsequently converts it to emotional speech. Our approach to this speech conversion is to simulate a baseball broadcaster; specifically, the system learns and uses the prosody of a broadcaster. To learn the prosody, we record two baseball games and analyze the prosodic features of emotional utterances. These observations are used to derive prosodic rules for emotion conversion. A subjective evaluation is used to study the subjects' preferences regarding the inserted sentences and the emotion conversion.
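To illustrate the flavour of rule-based emotion conversion outlined in this abstract, here is a minimal Python sketch that derives an excitement level from the game situation and turns it into global prosody scaling factors. The rules and numbers are invented for illustration, not those measured from the broadcaster recordings in the thesis.

```python
# Toy rule-based mapping: game state -> excitement -> prosody scaling factors.
from dataclasses import dataclass

@dataclass
class GameState:
    runners_on: int        # 0-3
    outs: int              # 0-2
    score_diff: int        # batting team minus fielding team
    late_inning: bool

def excitement(state: GameState) -> float:
    level = 0.2 + 0.15 * state.runners_on
    if state.late_inning and abs(state.score_diff) <= 2:
        level += 0.3                       # close game in the late innings
    if state.outs == 2:
        level += 0.1                       # two-out pressure
    return min(level, 1.0)

def prosody_factors(level: float):
    """Map excitement in [0, 1] to multiplicative prosody adjustments."""
    return {"pitch": 1.0 + 0.25 * level,   # raise F0 by up to 25%
            "rate":  1.0 + 0.20 * level,   # speak up to 20% faster
            "energy": 1.0 + 0.30 * level}  # up to 30% louder

state = GameState(runners_on=2, outs=2, score_diff=-1, late_inning=True)
print(prosody_factors(excitement(state)))
```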
蔡依玲. "An HMM-based Hakka Text-to-Speech System." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/65432807762850521130.
Full text國立交通大學
電信工程研究所
98
In this thesis, a Hakka Text-to-Speech (TTS) system is implemented. It consists of four main parts: a parser, a pause predictor, a context analyzer and an HMM-based synthesizer. The input text is first tagged into a word sequence by the parser. Due to the lack of a large text corpus for training a robust Hakka parser, we adopt a new approach to constructing one by extending an existing CRF-based Chinese parser with a Hakka dictionary and some Hakka word-construction rules. The pause predictor then estimates the inter-syllable locations at which to insert pauses, and the context analyzer generates the synthesis units and some linguistic parameters. Lastly, the HMM-based synthesizer produces duration, pitch and spectral parameters to generate the output synthesized speech. Experiments are designed to evaluate the performance of the parser and the pause predictor, as well as the quality of the synthesized speech. A good MOS score in the subjective quality test confirms that the Hakka TTS system is a promising one.
Lin, Dong-Yi, and 林東毅. "An Implementation of Hakka Text-to-Speech System." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/39336830137118858666.
Full text國立交通大學
電信工程系所
95
In this thesis, a Hakka Text-to-Speech (TTS) system is implemented. It consists of four main parts: a text analyzer, an RNN prosody generator, a waveform inventory of synthesis units, and a PSOLA synthesizer. The input text is first tagged in the text analyzer into a word sequence. The RNN prosody generator then generates the prosodic information using linguistic features extracted from the word sequence. The waveform corresponding to the word sequence is extracted from the waveform inventory and prosodically adjusted to generate the output speech. The basic implementation of the system follows the Mandarin TTS system developed previously at NCTU. A demo system operating on the Windows platform, using an SDI (Single Document Interface) text editor with the synthesis kernel, was finally realized. Informal listening tests show that most synthesized utterances sound fair.
Yang, Yu-Ching, and 楊鈺清. "An Implementation of Taiwanese Text-to-Speech System." Thesis, 1999. http://ndltd.ncl.edu.tw/handle/85718453317105013745.
Full text國立交通大學
電信工程系
87
In this thesis, a Taiwanese TTS system is implemented. It consists of four main parts: a text analyzer, an RNN prosody generator, a waveform inventory of synthesis units, and a PSOLA synthesizer. The input text is first tagged in the text analyzer into a word sequence. The RNN prosody generator then generates the prosodic information using linguistic features extracted from the word sequence. The waveform sequence corresponding to the word sequence is extracted from the waveform inventory and prosodically adjusted to generate the output speech. The basic implementation of the system follows the Mandarin TTS system developed previously at NCTU, with the following improvements. First, sample-based duration information is used rather than frame-based information. Second, the syllable energy contour is treated as prosodic information to be generated, instead of using static patterns given by the corresponding basic waveform. Third, both duration and energy features are normalized up to the utterance level. A demo system operating on the Windows 95/NT platform, using an SDI (Single Document Interface) text editor with the synthesis kernel, was finally realized. Informal listening tests show that most synthesized utterances sound fair.
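The utterance-level normalization mentioned in this abstract can be sketched very simply: duration and energy values of the syllables in one utterance are converted to z-scores against that utterance's own mean and standard deviation, so the prosody model works with relative rather than absolute values. The numbers below are toy data.

```python
# Utterance-level z-score normalization of prosodic features.
import numpy as np

def normalize_utterance(values):
    values = np.asarray(values, dtype=float)
    std = values.std()
    return (values - values.mean()) / std if std > 0 else values * 0.0

syllable_durations_ms = [180, 220, 160, 300, 240]   # one utterance
syllable_energies_db  = [62.0, 65.5, 60.1, 68.2, 64.0]

print(normalize_utterance(syllable_durations_ms).round(2))
print(normalize_utterance(syllable_energies_db).round(2))
```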
Whang, Bau-Jang, and 黃保章. "A Study for Mandarin Text to Taiwanese Speech System." Thesis, 1999. http://ndltd.ncl.edu.tw/handle/87699301605849503933.
Full text國立成功大學
電機工程學系
87
In this thesis, a Mandarin-text-to-Taiwanese-speech conversion system is described. The user can conveniently obtain the corresponding Taiwanese speech after entering Chinese sentences. We discuss the four main problems that occur in implementing the Taiwanese speech synthesis system: (1) one Chinese word may map to different Taiwanese syllables, (2) ambiguity in segmenting a sentence, (3) tonal operations, and (4) processing of synthesis units. First, to solve problem (1), we collect all lexical words that may map to different syllables into the lexical corpus. Second, we deal with the ambiguity in segmenting a sentence using the Viterbi algorithm, and establish a way to handle morphology and literary pronunciation. Third, to handle the tonal changes of Taiwanese, this work builds rules for the relationship between the inherent tone and the derived tone. It is also difficult to determine whether a morpheme takes its inherent tone or its derived tone for every word in a sentence, and we build a rule to handle this as well. Finally, for the recording of the synthesis units, we propose a method called "recording by fixed melody". This method improves the naturalness of the synthetic speech so that it approaches the prosodic properties of real speech.
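As a small illustration of resolving segmentation ambiguity with dynamic programming, in the spirit of the Viterbi-based approach mentioned in this abstract, here is a Python sketch that picks the segmentation with the highest product of word probabilities. The mini-lexicon and its probabilities are invented for illustration.

```python
# Dynamic-programming word segmentation over a toy lexicon.
import math

LEXICON = {"研究": 0.02, "研": 0.001, "究": 0.001, "生": 0.01,
           "研究生": 0.005, "命": 0.002, "生命": 0.01, "起源": 0.008}

def segment(sentence):
    n = len(sentence)
    best = [(-math.inf, 0)] * (n + 1)      # (log-probability, previous cut)
    best[0] = (0.0, 0)
    for j in range(1, n + 1):
        for i in range(max(0, j - 4), j):  # words up to 4 characters long
            word = sentence[i:j]
            if word in LEXICON and best[i][0] > -math.inf:
                score = best[i][0] + math.log(LEXICON[word])
                if score > best[j][0]:
                    best[j] = (score, i)
    words, j = [], n                       # backtrack through the best cuts
    while j > 0:
        i = best[j][1]
        words.append(sentence[i:j])
        j = i
    return list(reversed(words))

print(segment("研究生命起源"))   # -> ['研究', '生命', '起源'] with these toy numbers
```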
Chu, Kuo Hua, and 朱國華. "A Language Model for Chinese Speech-to-Text System." Thesis, 1993. http://ndltd.ncl.edu.tw/handle/11480419945514931288.
Full textLin, Yih-Jeng, and 林義証. "Developing A Chinese Text-To-Speech System For CAI." Thesis, 1994. http://ndltd.ncl.edu.tw/handle/88531281586485611562.
Full textLu, Peng-Ren, and 盧鵬任. "An Improvement on the Mandarin Text-to-Speech System." Thesis, 1997. http://ndltd.ncl.edu.tw/handle/15791269009752805058.
Full text國立交通大學
電信研究所
85
In this thesis, improvements to a Mandarin TTS system developed previously in the Speech Processing Lab of NCTU are carried out. The system consists of four main parts: a text analyzer, an RNN-based prosodic information generator, a waveform table of 417 base syllables, and a PSOLA synthesizer. Input texts are first analyzed in the text analyzer. The RNN prosody generator then generates the prosodic information using linguistic features extracted from the outputs of text analysis. Meanwhile, the corresponding waveform template sequence is extracted from the waveform table. Lastly, the PSOLA synthesizer generates the output synthesized speech by adjusting the prosody of the waveform template sequence. In this study, the system is improved in several respects. We first extend the lexicon of the text analyzer from 80,000 words to 110,000 words, which greatly increases its coverage. Then, a word pronunciation tree is constructed to speed up the text-analysis process, and some simple phonological rules are incorporated into the text analyzer. The number of POS types used in the RNN prosody generator is reduced from 44 to 22 to lower its computational complexity while keeping the naturalness of the synthesized speech undegraded. Then, a new method of producing the waveform table of 417 base syllables from utterances of isolated syllables is proposed. This not only increases the quality of the synthesized speech but also greatly simplifies the process of adding a new speaker's voice to the system. Lastly, we change the operating environment from DOS to Windows 95 and repackage the software as a dynamic library, which makes the development of new applications easier.
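A word pronunciation tree of the kind described in this abstract is essentially a trie: dictionary words are stored character by character, so the text analyzer can find all lexicon words starting at a given position in one left-to-right walk. The sketch below uses toy entries, not the system's actual 110,000-word lexicon.

```python
# Pronunciation trie: find all lexicon words starting at a given position.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.pronunciation = None          # set if a word ends at this node

class PronunciationTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, pron):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.pronunciation = pron

    def prefix_matches(self, text, start=0):
        """All (word, pronunciation) pairs in the lexicon starting at `start`."""
        node, matches = self.root, []
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break
            if node.pronunciation is not None:
                matches.append((text[start:end + 1], node.pronunciation))
        return matches

trie = PronunciationTrie()
trie.insert("中", "zhong1")
trie.insert("中文", "zhong1 wen2")
trie.insert("中文系", "zhong1 wen2 xi4")
print(trie.prefix_matches("中文系學生", 0))
```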
XIE, GING-JIANG, and 謝清江. "A Chinese text-to-speech system based on formant synthesis." Thesis, 1987. http://ndltd.ncl.edu.tw/handle/68840754016731337307.
Full textZheng, Yuan-Jie, and 鄭元傑. "A Telephone Number Text-to-Speech System With Speaker Adaptation." Thesis, 2001. http://ndltd.ncl.edu.tw/handle/14289661377522120035.
Full text國立中興大學
應用數學系
89
In this thesis, we develop a Mandarin telephone-number text-to-speech system with speaker adaptation. We use several parameters to predict prosody in a hierarchical way. The prosody parameters include the numbers before and after the target number, segment information, and the number of syllables; from these we predict duration, volume, and pause. For the duration model, the average errors of the inside test and the outside test are 24 ms and 45 ms, respectively. For the volume model, the average errors of the inside test and the outside test are 1.83 dB and 2.22 dB, respectively. In addition, we test speaker adaptation in our text-to-speech system: we try to use one speaker's prosody to predict that of another speaker for whom only a few training data are available. In our tests, the average error in duration is 44 ms and the average error in volume is 2.26 dB with 5 training sentences (52 syllables); 35 ms and 2.03 dB with 10 sentences (107 syllables); and 29 ms and 1.99 dB with 20 sentences (253 syllables).
Wu, Chao-Hsiang, and 吳兆祥. "A Chinese Text-to-Speech System Based on Word Units." Thesis, 1994. http://ndltd.ncl.edu.tw/handle/74068357773888875038.
Full textLi, Jie, and 李杰. "HMM-Based Chinese Text-To-Speech System with Support Speakers." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/69884613145394916907.
Full text國立臺灣大學
資訊工程學研究所
100
Nowadays people can use speech technology to make their lives better, and speech synthesis has recently come to be regarded as an important part of it. Two speech synthesis techniques are commonly used: unit selection and HMM-based synthesis. In the unit selection technique, the voice recordings in the corpus are divided into small pieces that are concatenated to generate the synthesized voice. In the HMM-based technique, acoustic models are estimated from acoustic features, and the synthesized voice is generated from these models. In this thesis, I used the HMM-based technique to implement a Chinese Text-to-Speech (TTS) system. The system extracts spectral features, fundamental frequency features and context-dependent labels to train the models; after the training stage, it analyzes the text and uses the corresponding models to generate the voice. Acoustic model training needs a large amount of data to produce high-quality models, and enough data for a single speaker is difficult to obtain, so conventionally an average acoustic model plus speaker adaptation is used to make training with less data possible. However, it is difficult for an average acoustic model to be close to the model of the target speaker, so the performance of speaker adaptation is limited. In this thesis, I proposed several methods to find acoustically similar speakers as support speakers of the target speaker and to use their data to train support speaker models. Objective and subjective experiments showed that the support speaker model technique outperforms the average acoustic model technique and results in better synthesis quality.
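One plausible way to pick "support speakers", sketched below in Python, is to represent every speaker by the mean of some acoustic feature vectors (e.g. MFCC means) and select the k speakers closest to the target. The feature values are random toy data; the thesis compares several selection methods, and this is only one illustrative variant.

```python
# Select the k speakers whose mean feature vector is closest to the target's.
import numpy as np

def speaker_embedding(frames):
    """Average the per-frame feature vectors of one speaker's data."""
    return np.mean(frames, axis=0)

def select_support_speakers(target_frames, candidate_frames, k=2):
    target = speaker_embedding(target_frames)
    distances = {name: np.linalg.norm(speaker_embedding(frames) - target)
                 for name, frames in candidate_frames.items()}
    return sorted(distances, key=distances.get)[:k]

rng = np.random.default_rng(1)
target = rng.normal(0.0, 1.0, size=(200, 13))          # target speaker frames
candidates = {f"spk{i}": rng.normal(i * 0.3, 1.0, size=(500, 13))
              for i in range(5)}
print(select_support_speakers(target, candidates, k=2))   # nearest speakers
```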
Shih, Shan-Shu, and 施善舒. "Processing for Generating Continuous Speech from Syllables in a Mandarin Text-to-Speech System." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/k288um.
Full text國立中興大學
資訊科學與工程學系所
102
This thesis investigates how to generate continuous speech from syllables in a Mandarin text-to-speech (TTS) system. We want to find more appropriate pitches for the syllables used in synthesis, in order to obtain smooth synthesized speech. We use short-time speech processing methods to divide speech into frames and extract speech features, and we then use these features to search for the closest pitch. We use the following four methods to obtain the pitches for synthesis: searching with individual pitch, nearest-distance ascending search, three-point averaging, and dynamic time warping (DTW). Combined with adjustment of duration, pitch, and volume, we hope the synthesized speech can be as natural as possible. We then run experiments using sentences and synthesis units recorded by real speakers, and compare the results and efficacy of these methods using the mean opinion score (MOS). The best method, combined with a classification of continuous speech, is then used to train curves, so that continuous speech can be synthesized from syllables by curve fitting. Finally, we build a user interface that lets the user input the relevant parameters according to the prosodic structure, making the speech synthesis system more intuitive and approachable.
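The dynamic time warping (DTW) distance used as one of the pitch-matching methods above can be sketched in a few lines of Python: align two pitch contours of different lengths and return the cumulative alignment cost. The contours below are toy data.

```python
# Classic DTW distance between two pitch contours.
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

contour_unit   = [210, 215, 220, 218, 212, 205]   # pitch track of a stored unit (Hz)
contour_target = [208, 214, 221, 216, 206]        # desired pitch contour (Hz)
print(dtw_distance(contour_unit, contour_target))
```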
Fu, Zhen-Hong, and 傅振宏. "Automatic Generation of Synthesis Units for Taiwanese Text-to-Speech System." Thesis, 2000. http://ndltd.ncl.edu.tw/handle/46706238089789082381.
Full text長庚大學
電機工程研究所
88
In this thesis, we demonstrate a Taiwanese (Min-nan) text-to-speech (TTS) system based on automatically generated synthesis units. It can read out modern Taiwanese articles rather naturally. The TTS system is composed of three functional modules, namely a text analysis module, a prosody module, and a waveform synthesis module. Modern Taiwanese texts contain Chinese characters and English letters simultaneously, so the text analysis module must first be able to deal with mixed Chinese-English text. In this module, text normalization, word segmentation, letter-to-phoneme conversion and word frequency are used to deal with multiple pronunciations. The prosody module handles tone sandhi and phonetic variation in Taiwanese. The synthesis units in the waveform synthesis module come from two sources: (1) isolated-uttered tonal syllables covering all possible tonal variations in Taiwanese, about 4,521 in total, and (2) units automatically generated from a designated speech corpus. We employ an HMM-based large-vocabulary Taiwanese speech recognition system to perform forced alignment on the speech corpus, with short-pause recognition incorporated into the recognizer. After the synthesis unit string has been extracted, inter-syllable coarticulation information is applied to decide how to concatenate these units, and the output speech is generated after energy normalization. We evaluate the system on automatically segmented speech: compared with human segmentation, a correct rate of about 85% is achieved. The system has been implemented on a PC running MS Windows 9x/NT/2000.
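The energy normalization step mentioned in this abstract can be sketched as follows: each selected unit is rescaled to a common RMS level before the waveforms are concatenated, so loudness does not jump at unit boundaries. The "units" here are synthetic sine bursts standing in for real speech segments.

```python
# Rescale each unit to a common RMS level before concatenation.
import numpy as np

def normalize_energy(units, target_rms=0.1):
    """Rescale each unit so its RMS energy matches target_rms."""
    out = []
    for u in units:
        rms = np.sqrt(np.mean(u ** 2))
        out.append(u * (target_rms / rms) if rms > 0 else u)
    return out

sr = 16000
t = np.arange(0, 0.1, 1 / sr)
unit_a = 0.50 * np.sin(2 * np.pi * 150 * t)     # loud unit
unit_b = 0.05 * np.sin(2 * np.pi * 180 * t)     # quiet unit
normalized = normalize_energy([unit_a, unit_b])
speech = np.concatenate(normalized)             # units now join at similar loudness
print([round(float(np.sqrt(np.mean(u ** 2))), 3) for u in normalized])
```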
蘇子安. "A distributed text-to-speech service system and its scheduling algorithms." Thesis, 2003. http://ndltd.ncl.edu.tw/handle/21919220997966758710.
Full text國立海洋大學
資訊科學學系
91
Rapid technological advances in office automation equipment, information appliances, and interconnection networks have brought about a brand-new integrated office environment. In this environment, most devices have compact form factors, limited resources, limited communication bandwidth, and limited computation power; they provide their individual functions by integrating the public services available over the local interconnection equipment. A framework that provides services on the network has to overcome difficulties such as network traffic congestion, unbalanced server loads, and dynamic configuration of service components. In this thesis, a distributed network service for intelligent text-to-speech is constructed. In order to design an efficient service framework that provides better quality of service with limited resources, a supervised network service architecture and scheduling algorithms are proposed. Computer simulations are conducted to analyze the performance of each decision rule of the scheduling algorithms in different environments (the power of each server in the system, the distribution of job sizes, the distribution of job inter-arrival times, etc.). These results can be used to dynamically adjust the decision rules in the service framework.
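As a sketch of one scheduling decision rule of the kind this abstract evaluates, the Python example below dispatches each incoming TTS job to the server with the smallest expected completion time, given each server's speed and current queue. The server speeds and job sizes are toy values, not the thesis's simulation parameters.

```python
# Dispatch each job to the server with the smallest expected completion time.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    speed: float                 # work units processed per second
    queued_work: float = 0.0     # work already waiting on this server

    def expected_completion(self, job_size: float) -> float:
        return (self.queued_work + job_size) / self.speed

def dispatch(job_size, servers):
    best = min(servers, key=lambda s: s.expected_completion(job_size))
    best.queued_work += job_size
    return best.name

servers = [Server("fast", 4.0), Server("medium", 2.0), Server("slow", 1.0)]
jobs = [8.0, 3.0, 6.0, 1.0, 5.0]       # e.g. length of text to synthesize
print([dispatch(j, servers) for j in jobs])
```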
HUANG, SHAO-HUA, and 黃紹華. "A synthesis of prosodic information in mandarin text-to-speech system." Thesis, 1991. http://ndltd.ncl.edu.tw/handle/08240353472600497334.
Full textYang, Chi-Yu, and 楊棋宇. "Performance Improvement of Neural Network based End-to-end Text-to-Speech System." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/56575t.