Dissertations / Theses on the topic 'Speech-to-text systems'
Consult the top 50 dissertations / theses for your research on the topic 'Speech-to-text systems.'
Chan, Ngor-chi. "Text-to-speech conversion for Putonghua." Hong Kong: University of Hong Kong, 1990. http://sunzi.lib.hku.hk/hkuto/record.jsp?B12929475.
陳我智 and Ngor-chi Chan. "Text-to-speech conversion for Putonghua." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1990. http://hub.hku.hk/bib/B31209580.
Breitenbücher, Mark. "Textvorverarbeitung zur deutschen Version des Festival Text-to-Speech Synthese Systems." [S.l.]: Universität Stuttgart, Fakultät Philosophie, 1997. http://www.bsz-bw.de/cgi-bin/xvms.cgi?SWB6783514.
Baloyi, Ntsako. "A text-to-speech synthesis system for Xitsonga using hidden Markov models." Thesis, University of Limpopo (Turfloop Campus), 2012. http://hdl.handle.net/10386/1021.
This research study focuses on building a general-purpose working Xitsonga speech synthesis system that is as intelligible, natural sounding, and flexible as reasonably possible. The system has to be able to model some of the desirable speaker characteristics and speaking styles. This research project forms part of the broader national speech technology project that aims at developing spoken language systems for human-machine interaction in the eleven official languages of South Africa (SA). Speech synthesis is the reverse of automatic speech recognition (which receives speech as input and converts it to text) in that it receives text as input and produces synthesized speech as output. It is generally accepted that most people find listening to spoken utterances easier than reading the equivalent text. The Xitsonga speech synthesis system has been developed using a hidden Markov model (HMM) speech synthesis method. The HMM-based speech synthesis (HTS) system synthesizes speech that is intelligible and natural sounding, and it can do so from a footprint of only a few megabytes of training speech data. The HTS toolkit is applied as a patch to the HTK toolkit, a hidden Markov model toolkit primarily designed for building and manipulating HMMs in speech recognition.
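As a rough illustration of the parametric idea behind HMM-based synthesis described in this abstract, here is a minimal Python sketch: each state contributes a mean acoustic-parameter vector for a predicted number of frames, and the trajectory is then smoothed. All values are toy numbers; a real HTS voice models spectrum, F0 and duration jointly with decision-tree-clustered, context-dependent states.

```python
# Minimal sketch of the core idea behind HMM-based parametric synthesis:
# each state holds a mean acoustic-parameter vector, a duration model decides
# how many frames to stay there, and the output trajectory is smoothed.
import numpy as np

def generate_trajectory(state_means, state_durations, smooth=5):
    """Repeat each state's mean vector for its duration, then smooth."""
    frames = np.concatenate([np.tile(m, (d, 1))
                             for m, d in zip(state_means, state_durations)])
    # moving-average smoothing as a stand-in for the maximum-likelihood
    # parameter generation with delta features used in real HTS systems
    kernel = np.ones(smooth) / smooth
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"),
                               0, frames)

# toy example: 3 states, 2-dimensional acoustic parameters
means = [np.array([1.0, 0.0]), np.array([0.5, 0.8]), np.array([0.0, 0.2])]
durations = [4, 6, 5]            # frames predicted by a duration model
print(generate_trajectory(means, durations).shape)   # (15, 2)
```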
Engell, Trond Bøe. "TaleTUC: Text-to-Speech and Other Enhancements to Existing Bus Route Information Systems." Thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap, 2012. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-18920.
Lambert, Tanya. "Databases for concatenative text-to-speech synthesis systems : unit selection and knowledge-based approach." Thesis, University of East Anglia, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.421192.
Levefeldt, Christer. "Evaluation of NETtalk as a means to extract phonetic features from text for synchronization with speech." Thesis, University of Skövde, Department of Computer Science, 1998. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-173.
The background for this project is a wish to automate synchronization of text and speech. The idea is to present speech through speakers synchronized word-for-word with text appearing on a monitor.
The solution decided upon is to use artificial neural networks, ANNs, to convert both text and speech into streams made up of sets of phonetic features and then to match these two streams against each other. Several text-to-feature ANN designs based on the NETtalk system are implemented and evaluated. The extraction of phonetic features from speech and the synchronization itself are not implemented, but some assessments are made regarding their possible performance. The performance of a finished system cannot yet be determined, but a NETtalk-based ANN is believed to be suitable for such a system using phonetic features for synchronization.
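To make the NETtalk-style text-to-feature idea concrete, here is a minimal Python sketch of the architecture: a sliding window of letters is one-hot encoded and mapped by a small MLP to a phonetic-feature vector for the centre letter. The weights are random (untrained), and the letter set, window size and feature count are illustrative assumptions, not the configurations evaluated in the thesis.

```python
# NETtalk-style sketch: one-hot letter window -> MLP -> phonetic features.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz _"      # '_' pads the window edges
WINDOW, N_FEATURES = 7, 20                      # e.g. voicing, place, manner...

def one_hot_window(text, centre):
    padded = "_" * (WINDOW // 2) + text.lower() + "_" * (WINDOW // 2)
    window = padded[centre:centre + WINDOW]
    vec = np.zeros(WINDOW * len(ALPHABET))
    for i, ch in enumerate(window):
        vec[i * len(ALPHABET) + ALPHABET.index(ch if ch in ALPHABET else " ")] = 1.0
    return vec

rng = np.random.default_rng(0)
W1 = rng.normal(size=(80, WINDOW * len(ALPHABET)))   # hidden layer weights
W2 = rng.normal(size=(N_FEATURES, 80))               # phonetic-feature outputs

def letter_to_features(text, centre):
    h = np.tanh(W1 @ one_hot_window(text, centre))
    return 1 / (1 + np.exp(-(W2 @ h)))               # feature activations in [0, 1]

print(letter_to_features("synchronize", 3).round(2))
```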
Yoon, Kyuchul. "Building a prosodically sensitive diphone database for a Korean text-to-speech synthesis system." Connect to this title online, 2005. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1119010941.
Title from first page of PDF file. Document formatted into pages; contains xxii, 291 p.; also includes graphics (some col.). Includes bibliographical references (p. 210-216). Available online via OhioLINK's ETD Center.
Thorstensson, Niklas. "A knowledge-based grapheme-to-phoneme conversion for Swedish." Thesis, University of Skövde, Department of Computer Science, 2002. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-731.
A text-to-speech system is a complex system consisting of several different modules such as grapheme-to-phoneme conversion, articulatory and prosodic modelling, voice modelling etc.
This dissertation is aimed at the creation of the initial part of a text-to-speech system, i.e. the grapheme-to-phoneme conversion, designed for Swedish. The problem area at hand is the conversion of orthographic text into a phonetic representation that can be used as a basis for a future complete text-to-speech system.
The central issue of the dissertation is the grapheme-to-phoneme conversion and the elaboration of rules and algorithms required to achieve this task. The dissertation aims to prove that it is possible to make such a conversion by a rule-based algorithm with reasonable performance. Another goal is to find a way to represent phonotactic rules in a form suitable for parsing. It also aims to find and analyze problematic structures in written text compared to phonetic realization.
This work proposes a knowledge-based grapheme-to-phoneme conversion system for Swedish. The system suggested here is implemented, tested, evaluated and compared to other existing systems. The results achieved are promising, and show that the system is fast, with a high degree of accuracy.
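As a toy illustration of the kind of rule-based grapheme-to-phoneme conversion this abstract describes, here is a minimal Python sketch with an ordered rule list and longest-match-first application. The rules and phoneme symbols below are invented examples for a handful of Swedish-like spellings, not the thesis's actual rule set.

```python
# Toy ordered-rule grapheme-to-phoneme converter (longest match first).
RULES = [
    # (grapheme, right-context letters that must follow, phoneme string)
    ("stj", "",    "x"),     # 'stjärna' -> sje-sound, illustrative
    ("sk",  "eiy", "x"),     # 'sk' before a front vowel
    ("ck",  "",    "k"),
    ("k",   "eiy", "C"),     # tje-sound before a front vowel, illustrative
    ("o",   "",    "u"),
]

def g2p(word):
    word, out, i = word.lower(), [], 0
    while i < len(word):
        for grapheme, context, phoneme in RULES:
            if word.startswith(grapheme, i) and (
                    not context or (i + len(grapheme) < len(word)
                                    and word[i + len(grapheme)] in context)):
                out.append(phoneme)
                i += len(grapheme)
                break
        else:                      # default: a letter maps to itself
            out.append(word[i])
            i += 1
    return " ".join(out)

print(g2p("stjärna"), "|", g2p("kyrka"), "|", g2p("skepp"))
```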
Mhlana, Siphe. "Development of isiXhosa text-to-speech modules to support e-Services in marginalized rural areas." Thesis, University of Fort Hare, 2011. http://hdl.handle.net/10353/495.
Eksvärd, Siri, and Julia Falk. "Evaluating Speech-to-Text Systems and AR-glasses : A study to develop a potential assistive device for people with hearing impairments." Thesis, Uppsala universitet, Avdelningen för visuell information och interaktion, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-437608.
Speech-to-Text System using Augmented Reality for People with Hearing Deficits
Micallef, Paul. "A text to speech synthesis system for Maltese." Thesis, University of Surrey, 1997. http://epubs.surrey.ac.uk/842702/.
Monaghan, Alexander Ian Campbell. "Intonation in a text-to-speech conversion system." Thesis, University of Edinburgh, 1991. http://hdl.handle.net/1842/20023.
Reynolds, Douglas A. "A Gaussian mixture modeling approach to text-independent speaker identification." Diss., Georgia Institute of Technology, 1992. http://hdl.handle.net/1853/16903.
Rousseau, Francois. "Design of an advanced Text-To-Speech system for Afrikaans." Master's thesis, University of Cape Town, 2006. http://hdl.handle.net/11427/5112.
Includes bibliographical references (leaves 87-92).
Afrikaans is the home language of approximately six million people in South Africa. The need for an Afrikaans TTS system comes with the growing interest in integrating speech technology in all eleven languages of the country. The ultimate goal here is to enable communication between man and machine using speech. This can be achieved by implementing multilingual speech technology systems that all the people in South Africa can understand and relate to. Understandability, flexibility, naturalness and pleasantness are the requirements of an advanced TTS system. The technique of concatenative speech synthesis has been the most successful in meeting all these requirements. The Festival speech synthesis system uses two popular concatenative techniques to design new TTS systems in different languages: diphone concatenative synthesis (DCS) and unit selection synthesis (USS).
Cohen, Andrew Dight. "The use of learnable phonetic representations in connectionist text-to-speech system." Thesis, University of Reading, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360787.
Mohasi, Lehlohonolo. "Prosody modelling for a Sesotho text-to-speech system using the Fujisaki model." Thesis, Stellenbosch : Stellenbosch University, 2015. http://hdl.handle.net/10019.1/97050.
Swart, Philippa H. "Prosodic features of imperatives in Xhosa : implications for a text-to-speech system." Thesis, Stellenbosch : Stellenbosch University, 2000. http://hdl.handle.net/10019.1/51891.
This study focuses on the prosodic features of imperatives and the role of prosody in the development of a text-to-speech (TTS) system for Xhosa, an African tone language. The perception of prosody is manifested in suprasegmental features such as fundamental frequency (pitch), intensity (loudness) and duration (length). Very little experimental research has been done on the prosodic features of any grammatical structures (moods and tenses) in Xhosa; it has therefore not yet been determined how and to what degree the different prosodic features are combined and utilized in the production and perception of Xhosa speech. One such grammatical structure, for which no explicit descriptive phonetic information exists, is the imperative mood expressing commands. In this study it was shown how the relationship between duration, pitch and loudness, as manifested in the production and perception of Xhosa imperatives, could be determined through acoustic analyses and perceptual experiments. An experimental phonetic approach proved to be essential for the acquisition of substantial and reliable prosodic information. An extensive acoustic analysis was conducted to acquire prosodic information on the production of imperatives by Xhosa mother-tongue speakers. Subsequently, various statistical parameters were calculated on the raw acoustic data (i) to establish patterns of significance and (ii) to represent the large amount of numeric data generated in a compact manner. A perceptual experiment was conducted to investigate the perception of imperatives. The prosodic parameters that were extracted from the acoustic analysis were applied to synthesize imperatives in different contexts. A novel approach to Xhosa speech synthesis was adopted: verbs uttered on a monotone were recorded by one speaker, and the pitch and duration of these words were then manipulated with the TD-PSOLA technique. Combining the results of the acoustic analysis and the perceptual experiment made it possible to present a prosodic model for the generation of perceptually acceptable imperatives in a practical Xhosa TTS system. Prosody generation in a natural language processing (NLP) module and its place within the larger framework of text-to-speech synthesis was discussed. It was shown that existing architectures for TTS synthesis would not be appropriate for Xhosa without some adaptation. Hence, a unique architecture was suggested and its possible application subsequently illustrated. Of particular importance was the development of an alternative algorithm for grapheme-to-phoneme conversion. Keywords: prosody, speech synthesis, speech perception, acoustic analysis, Xhosa
Mohasi, Lehlohonolo. "Design of an advanced and fluent Sesotho text-to-speech system through intonation." Master's thesis, University of Cape Town, 2006. http://hdl.handle.net/11427/5155.
Full textNguyen, Thi Thu Trang. "HMM-based Vietnamese Text-To-Speech : Prosodic Phrasing Modeling, Corpus Design System Design, and Evaluation." Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112201/document.
The thesis objective is to design and build a high-quality Hidden Markov Model (HMM) based Text-To-Speech (TTS) system for Vietnamese, a tonal language. The system is called VTED (Vietnamese TExt-to-speech Development system). In view of the great importance of lexical tones, a "tonophone", an allophone in tonal context, was proposed as a new speech unit in our TTS system. A new training corpus, VDTS (Vietnamese Di-Tonophone Speech corpus), was designed for 100% coverage of di-phones in tonal contexts (i.e. di-tonophones) using a greedy algorithm over a huge raw text. A total of about 4,000 sentences of VDTS were recorded and pre-processed as a training corpus for VTED. In HMM-based speech synthesis, although pause duration can be modeled as a phoneme, the appearance of pauses cannot be predicted by HMMs, and phrasing levels above the word may not be completely modeled with basic features. This research aimed at automatic prosodic phrasing for Vietnamese TTS using durational clues alone, as it appeared too difficult to disentangle intonation from lexical tones. Syntactic blocks, i.e. syntactic phrases with a bounded number of syllables (n), were proposed for predicting final lengthening (n = 6) and pause appearance (n = 10). Final lengthening was further improved by several strategies for grouping single syntactic blocks. The predictive J48-decision-tree model for pause appearance using syntactic blocks combined with syntactic-link and POS (Part-Of-Speech) features reached an F-score of 81.4% (Precision = 87.6%, Recall = 75.9%), much better than the model with only POS (F-score = 43.6%) or only syntactic-link (F-score = 52.6%) features. The architecture of the system was based on the core architecture of HTS, extended with a Natural Language Processing part for Vietnamese. Pause appearance was predicted by the proposed model. The contextual feature set included phone identity features, locational features, tone-related features, and prosodic features (POS, final lengthening, break levels). Mary TTS was chosen as the platform for implementing VTED. In the MOS (Mean Opinion Score) test, the first VTED, trained with the old corpus and basic features, was rather good, 0.81 points (on a 5-point MOS scale) higher than the previous system, HoaSung (non-uniform unit selection with the same training corpus), but still 1.2-1.5 points lower than natural speech. The quality of the final VTED, trained with the new corpus and the prosodic phrasing model, improved by about 1.04 points compared to the first VTED, and its gap with natural speech was much reduced. In the tone intelligibility test, the final VTED reached a high correct rate of 95.4%, only 2.6% lower than natural speech and 18% higher than the initial system. In the intelligibility test with the Latin square design, the error rate of the first VTED was about 6-12% higher than natural speech, depending on the syllable, tone or phone level, while the final system diverged from natural speech by only about 0.4-1.4%.
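As a small illustration of the greedy corpus-selection idea mentioned in this abstract, here is a minimal Python sketch that repeatedly picks the sentence adding the most not-yet-covered units. Plain character bigrams stand in for di-tonophones, and the sentences are toy data, not the thesis's Vietnamese corpus.

```python
# Greedy sentence selection for unit coverage (bigrams stand in for units).
def units(sentence):
    s = sentence.replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def greedy_select(sentences):
    covered, selected = set(), []
    remaining = list(sentences)
    while remaining:
        best = max(remaining, key=lambda s: len(units(s) - covered))
        gain = units(best) - covered
        if not gain:                      # nothing new can be added
            break
        selected.append(best)
        covered |= gain
        remaining.remove(best)
    return selected, covered

corpus = ["ba ba den", "den nha ba", "nha den ba", "ba den nha den"]
chosen, cov = greedy_select(corpus)
print(chosen)
print(sorted(cov))
```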
Gong, XiangQi. "Election markup language (EML) based tele-voting system." Thesis, University of the Western Cape, 2009. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_5841_1350999620.
voting machines, voting via the Internet, telephone, SMS and digital interactive television. This thesis concerns voting by telephone, or televoting. It starts by giving a brief overview and evaluation of various models and technologies that are implemented within such systems. The aspects of televoting that have been investigated are the technologies that provide a voice interface to the voter and conduct the voting process, namely the Election Markup Language (EML), Automated Speech Recognition (ASR) and Text-to-Speech (TTS).
Beněk, Tomáš. "Implementing and Improving a Speech Synthesis System." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236079.
Malatji, Promise Tshepiso. "The development of accented English synthetic voices." Thesis, University of Limpopo, 2019. http://hdl.handle.net/10386/2917.
A Text-to-Speech (TTS) synthesis system is a software system that receives text as input and produces speech as output. A TTS synthesis system can be used for, amongst others, language learning and reading out text for people living with different disabilities (physically challenged, visually impaired, etc.), by native and non-native speakers of the target language. Most people relate easily to a second language spoken by a non-native speaker with whom they share a native language. Most online English TTS synthesis systems are usually developed using native speakers of English. This research study focuses on developing accented English synthetic voices as spoken by non-native speakers in the Limpopo province of South Africa. The Modular Architecture for Research on speech sYnthesis (MARY) TTS engine is used in developing the synthetic voices, and the Hidden Markov Model (HMM) method is used to train them. A secondary text corpus is used to develop the training speech corpus by recording six speakers reading the text. The quality of the developed synthetic voices is measured in terms of their intelligibility, similarity and naturalness using a listening test. The results are reported overall and classified by the evaluators' occupation and gender. The subjective listening test indicates that the developed synthetic voices have a high level of acceptance in terms of similarity and intelligibility. Speech analysis software is used to compare the synthesized speech with the human recordings; there is no significant difference in voice pitch between the speakers and the synthetic voices, except for one synthetic voice.
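As a rough sketch of the kind of pitch comparison described above, the following Python example estimates the fundamental frequency of two signals by autocorrelation and compares their means. Real measurements would read the recorded and synthesized wave files; here two synthetic vowel-like signals stand in so the example is runnable.

```python
# Compare mean F0 of a "human" and a "synthetic" frame via autocorrelation.
import numpy as np

def estimate_f0(signal, sr, fmin=60.0, fmax=400.0):
    """Very basic autocorrelation pitch estimate for one voiced frame."""
    signal = signal - signal.mean()
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(corr[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(0, 0.04, 1 / sr)                      # one 40 ms frame
human = np.sin(2 * np.pi * 118 * t) + 0.3 * np.sin(2 * np.pi * 236 * t)
synth = np.sin(2 * np.pi * 123 * t) + 0.3 * np.sin(2 * np.pi * 246 * t)

f0_human, f0_synth = estimate_f0(human, sr), estimate_f0(synth, sr)
print(f"human ~{f0_human:.1f} Hz, synthetic ~{f0_synth:.1f} Hz, "
      f"difference {abs(f0_human - f0_synth):.1f} Hz")
```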
Uggerud, Nils. "AnnotEasy: A gesture and speech-to-text based video annotation tool for note taking in pre-recorded lectures in higher education." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-105962.
Full textLindgren, Viktor. "Evaluating Multi-Uav System with Text to Spech for Sitational Awarness and Workload." Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-53343.
Full textXiao, He. "An affective personality for an embodied conversational agent." Thesis, Curtin University, 2006. http://hdl.handle.net/20.500.11937/167.
Full textXiao, He. "An affective personality for an embodied conversational agent." Curtin University of Technology, Department of Computer Engineering, 2006. http://espace.library.curtin.edu.au:80/R/?func=dbin-jump-full&object_id=16139.
Full textTsai, Zong-Mou, and 蔡宗謀. "A Priliminary Study on Mandarin to Taiwanese Text-to-Speech Systems." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/82303213184115638004.
Full text國立中興大學
資訊科學與工程學系
96
There are many articles, books and magazines in Taiwan that may contain valuable information, yet they are mostly written in Mandarin. We use Mandarin-to-Taiwanese methods to turn them into Taiwanese, including speech, so that such documents become more comprehensible for Taiwanese speakers and learners. The focus of this thesis is on how to find the correct Taiwanese pronunciation. If we cannot find the exact word in a dictionary, we partition it into smaller parts and search for the pronunciation of each part, making the parts progressively smaller; finally, we consult the one-character word dictionary for the pronunciation.
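The back-off lookup described in this abstract can be sketched in a few lines of Python: try the longest dictionary match first and fall back to single characters. The tiny dictionaries below are placeholders, not the actual Mandarin-to-Taiwanese lexicon.

```python
# Longest-match dictionary lookup with single-character fallback.
WORD_DICT = {"台灣話": "tai5-uan5-ue7", "台灣": "tai5-uan5"}
CHAR_DICT = {"台": "tai5", "灣": "uan5", "話": "ue7", "講": "kong2"}

def lookup(text):
    prons, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):          # longest match first
            if text[i:j] in WORD_DICT:
                prons.append(WORD_DICT[text[i:j]])
                i = j
                break
        else:                                       # single-character fallback
            prons.append(CHAR_DICT.get(text[i], "<unk>"))
            i += 1
    return prons

print(lookup("台灣話"))    # whole word found in the word dictionary
print(lookup("講台灣話"))  # falls back to '講' plus the word '台灣話'
```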
"Cantonese text-to-speech synethesis using sub-syllable units." 2001. http://library.cuhk.edu.hk/record=b5890790.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references.
Text in English; abstracts in English and Chinese.
Law Ka Man = Li yong zi yin jie de Yue yu wen yu zhuan huan xi tong / Luo Jiawen.
Chapter 1. --- INTRODUCTION --- p.1
Chapter 1.1 --- Text analysis --- p.2
Chapter 1.2 --- Prosody prediction --- p.3
Chapter 1.3 --- Speech generation --- p.3
Chapter 1.4 --- The trend of TTS technology --- p.5
Chapter 1.5 --- TTS systems for different languages --- p.6
Chapter 1.6 --- Objectives of the thesis --- p.8
Chapter 1.7 --- Thesis outline --- p.8
References --- p.10
Chapter 2. --- BACKGROUND --- p.11
Chapter 2.1 --- Cantonese phonology --- p.11
Chapter 2.2 --- Cantonese TTS - a baseline system --- p.16
Chapter 2.3 --- Time-Domain Pitch-Synchronous-OverLap-Add --- p.17
Chapter 2.3.1 --- From speech signal to short-time analysis signals --- p.18
Chapter 2.3.2 --- From short-time analysis signals to short-time synthesis signals --- p.19
Chapter 2.3.3 --- From short-time synthesis signals to synthetic speech --- p.20
Chapter 2.4 --- Time-scale and Pitch-scale modifications --- p.20
Chapter 2.4.1 --- Voiced speech --- p.20
Chapter 2.4.2 --- Unvoiced speech --- p.21
Chapter 2.5 --- Summary --- p.22
References --- p.23
Chapter 3. --- SUB-SYLLABLE BASED TTS SYSTEM --- p.24
Chapter 3.1 --- Motivations --- p.24
Chapter 3.2 --- Choices of synthesis units --- p.27
Chapter 3.2.1 --- Sub-syllable unit --- p.29
Chapter 3.2.2 --- Diphones, demi-syllables and sub-syllable units --- p.31
Chapter 3.3 --- Proposed TTS system --- p.32
Chapter 3.3.1 --- Text analysis module --- p.33
Chapter 3.3.2 --- Synthesis module --- p.36
Chapter 3.3.3 --- Prosody module --- p.37
Chapter 3.4 --- Summary --- p.38
References --- p.39
Chapter 4. --- ACOUSTIC INVENTORY --- p.40
Chapter 4.1 --- The full set of Cantonese sub-syllable units --- p.40
Chapter 4.2 --- A reduced set of sub-syllable units --- p.42
Chapter 4.3 --- Corpus design --- p.44
Chapter 4.4 --- Recording --- p.46
Chapter 4.5 --- Post-processing of speech data --- p.47
Chapter 4.6 --- Summary --- p.51
References --- p.51
Chapter 5. --- CONCATENATION TECHNIQUES --- p.52
Chapter 5.1 --- Concatenation of sub-syllable units --- p.52
Chapter 5.1.1 --- Concatenation of plosives and affricates --- p.54
Chapter 5.1.2 --- Concatenation of fricatives --- p.55
Chapter 5.1.3 --- Concatenation of vowels, semi-vowels and nasals --- p.55
Chapter 5.1.4 --- Spectral distance measure --- p.57
Chapter 5.2 --- Waveform concatenation method --- p.58
Chapter 5.3 --- Selected examples of waveform concatenation --- p.59
Chapter 5.3.1 --- I-I concatenation --- p.60
Chapter 5.3.2 --- F-F concatenation --- p.66
Chapter 5.4 --- Summary --- p.71
References --- p.72
Chapter 6. --- PERFORMANCE EVALUATION --- p.73
Chapter 6.1 --- Listening test --- p.73
Chapter 6.2 --- Test results: --- p.74
Chapter 6.3 --- Discussions --- p.75
References --- p.78
Chapter 7. --- CONCLUSIONS & FUTURE WORKS --- p.79
Chapter 7.1 --- Conclusions --- p.79
Chapter 7.2 --- Suggested future work --- p.81
APPENDIX 1 SYLLABLE DURATION --- p.82
APPENDIX 2 PERCEPTUAL TEST PARAGRAPHS --- p.86
"Prosody analysis and modeling for Cantonese text-to-speech." 2003. http://library.cuhk.edu.hk/record=b5891678.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references.
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1. --- TTS Technology --- p.1
Chapter 1.2. --- Prosody --- p.2
Chapter 1.2.1. --- What is Prosody --- p.2
Chapter 1.2.2. --- Prosody from Different Perspectives --- p.3
Chapter 1.2.3. --- Acoustical Parameters of Prosody --- p.3
Chapter 1.2.4. --- Prosody in TTS --- p.5
Chapter 1.2.4.1 --- Analysis --- p.5
Chapter 1.2.4.2 --- Modeling --- p.6
Chapter 1.2.4.3 --- Evaluation --- p.6
Chapter 1.3. --- Thesis Objectives --- p.7
Chapter 1.4. --- Thesis Outline --- p.7
Reference --- p.8
Chapter 2 --- Cantonese --- p.9
Chapter 2.1. --- The Cantonese Dialect --- p.9
Chapter 2.1.1. --- Phonology --- p.10
Chapter 2.1.1.1 --- Initial --- p.11
Chapter 2.1.1.2 --- Final --- p.12
Chapter 2.1.1.3 --- Tone --- p.13
Chapter 2.1.2. --- Phonological Constraints --- p.14
Chapter 2.2. --- Tones in Cantonese --- p.15
Chapter 2.2.1. --- Tone System --- p.15
Chapter 2.2.2. --- Linguistic Significance --- p.18
Chapter 2.2.3. --- Acoustical Realization --- p.18
Chapter 2.3. --- Prosodic Variation in Continuous Cantonese Speech --- p.20
Chapter 2.4. --- Cantonese Speech Corpus - CUProsody --- p.21
Reference --- p.23
Chapter 3 --- F0 Normalization --- p.25
Chapter 3.1. --- F0 in Speech Production --- p.25
Chapter 3.2. --- F0 Extraction --- p.27
Chapter 3.3. --- Duration-normalized Tone Contour --- p.29
Chapter 3.4. --- F0 Normalization --- p.30
Chapter 3.4.1. --- Necessity and Motivation --- p.30
Chapter 3.4.2. --- F0 Normalization --- p.33
Chapter 3.4.2.1 --- Methodology --- p.33
Chapter 3.4.2.2 --- Assumptions --- p.34
Chapter 3.4.2.3 --- Estimation of Relative Tone Ratios --- p.35
Chapter 3.4.2.4 --- Derivation of Phrase Curve --- p.37
Chapter 3.4.2.5 --- Normalization of Absolute F0 Values --- p.39
Chapter 3.4.3. --- Experiments and Discussion --- p.39
Chapter 3.5. --- Conclusions --- p.44
Reference --- p.45
Chapter 4 --- Acoustical F0 Analysis --- p.48
Chapter 4.1. --- Methodology of F0 Analysis --- p.48
Chapter 4.1.1. --- Analysis-by-Synthesis --- p.48
Chapter 4.1.2. --- Acoustical Analysis --- p.51
Chapter 4.2. --- Acoustical F0 Analysis for Cantonese --- p.52
Chapter 4.2.1. --- Analysis of Phrase Curves --- p.52
Chapter 4.2.2. --- Analysis of Tone Contours --- p.55
Chapter 4.2.2.1 --- Context-independent Single-tone Contours --- p.56
Chapter 4.2.2.2 --- Contextual Variation --- p.58
Chapter 4.2.2.3 --- Co-articulated Tone Contours of Disyllabic Word --- p.59
Chapter 4.2.2.4 --- Cross-word Contours --- p.62
Chapter 4.2.2.5 --- Phrase-initial Tone Contours --- p.65
Chapter 4.3. --- Summary --- p.66
Reference --- p.67
Chapter 5 --- Prosody Modeling for Cantonese Text-to-Speech --- p.70
Chapter 5.1. --- Parametric Model and Non-parametric Model --- p.70
Chapter 5.2. --- Cantonese Text-to-Speech: Baseline System --- p.72
Chapter 5.2.1. --- Sub-syllable Unit --- p.72
Chapter 5.2.2. --- Text Analysis Module --- p.73
Chapter 5.2.3. --- Acoustical Synthesis --- p.74
Chapter 5.2.4. --- Prosody Module --- p.74
Chapter 5.3. --- Enhanced Prosody Model --- p.74
Chapter 5.3.1. --- Modeling Tone Contours --- p.75
Chapter 5.3.1.1 --- Word-level F0 Contours --- p.76
Chapter 5.3.1.2 --- Phrase-initial Tone Contours --- p.77
Chapter 5.3.1.3 --- Tone Contours at Word Boundary --- p.78
Chapter 5.3.2. --- Modeling Phrase Curves --- p.79
Chapter 5.3.3. --- Generation of Continuous F0 Contours --- p.81
Chapter 5.4. --- Summary --- p.81
Reference --- p.82
Chapter 6 --- Performance Evaluation --- p.83
Chapter 6.1. --- Introduction to Perceptual Test --- p.83
Chapter 6.1.1. --- Aspects of Evaluation --- p.84
Chapter 6.1.2. --- Methods of Judgment Test --- p.84
Chapter 6.1.3. --- Problems in Perceptual Test --- p.85
Chapter 6.2. --- Perceptual Tests for Cantonese TTS --- p.86
Chapter 6.2.1. --- Intelligibility Tests --- p.86
Chapter 6.2.1.1 --- Method --- p.86
Chapter 6.2.1.2 --- Results --- p.88
Chapter 6.2.1.3 --- Analysis --- p.89
Chapter 6.2.2. --- Naturalness Tests --- p.90
Chapter 6.2.2.1 --- Word-level --- p.90
Chapter 6.2.2.1.1 --- Method --- p.90
Chapter 6.2.2.1.2 --- Results --- p.91
Chapter 6.2.3.1.3 --- Analysis --- p.91
Chapter 6.2.2.2 --- Sentence-level --- p.92
Chapter 6.2.2.2.1 --- Method --- p.92
Chapter 6.2.2.2.2 --- Results --- p.93
Chapter 6.2.2.2.3 --- Analysis --- p.94
Chapter 6.3. --- Conclusions --- p.95
Chapter 6.4. --- Summary --- p.95
Reference --- p.96
Chapter 7 --- Conclusions and Future Work --- p.97
Chapter 7.1. --- Conclusions --- p.97
Chapter 7.2. --- Suggested Future Work --- p.99
Appendix --- p.100
Appendix 1 Linear Regression --- p.100
Appendix 2 36 Templates of Cross-word Contours --- p.101
Appendix 3 Word List for Word-level Tests --- p.102
Appendix 4 Syllable Occurrence in Word List of Intelligibility Test --- p.108
Appendix 5 Wrongly Identified Word List --- p.112
Appendix 6 Confusion Matrix --- p.115
Appendix 7 Unintelligible Word List --- p.117
Appendix 8 Noisy Word List --- p.119
Appendix 9 Sentence List for Naturalness Test --- p.120
"Unit selection and waveform concatenation strategies in Cantonese text-to-speech." 2005. http://library.cuhk.edu.hk/record=b5892349.
Full textThesis (M.Phil.)--Chinese University of Hong Kong, 2005.
Includes bibliographical references.
Abstracts in English and Chinese.
Chapter 1. --- Introduction --- p.1
Chapter 1.1 --- An overview of Text-to-Speech technology --- p.2
Chapter 1.1.1 --- Text processing --- p.2
Chapter 1.1.2 --- Acoustic synthesis --- p.3
Chapter 1.1.3 --- Prosody modification --- p.4
Chapter 1.2 --- Trends in Text-to-Speech technologies --- p.5
Chapter 1.3 --- Objectives of this thesis --- p.7
Chapter 1.4 --- Outline of the thesis --- p.9
References --- p.11
Chapter 2. --- Cantonese Speech --- p.13
Chapter 2.1 --- The Cantonese dialect --- p.13
Chapter 2.2 --- Phonology of Cantonese --- p.14
Chapter 2.2.1 --- Initials --- p.15
Chapter 2.2.2 --- Finals --- p.16
Chapter 2.2.3 --- Tones --- p.18
Chapter 2.3 --- Acoustic-phonetic properties of Cantonese syllables --- p.19
References --- p.24
Chapter 3. --- Cantonese Text-to-Speech --- p.25
Chapter 3.1 --- General overview --- p.25
Chapter 3.1.1 --- Text processing --- p.25
Chapter 3.1.2 --- Corpus based acoustic synthesis --- p.26
Chapter 3.1.3 --- Prosodic control --- p.27
Chapter 3.2 --- Syllable based Cantonese Text-to-Speech system --- p.28
Chapter 3.3 --- Sub-syllable based Cantonese Text-to-Speech system --- p.29
Chapter 3.3.1 --- Definition of sub-syllable units --- p.29
Chapter 3.3.2 --- Acoustic inventory --- p.31
Chapter 3.3.3 --- Determination of the concatenation points --- p.33
Chapter 3.4 --- Problems --- p.34
References --- p.36
Chapter 4. --- Waveform Concatenation for Sub-syllable Units --- p.37
Chapter 4.1 --- Previous work in concatenation methods --- p.37
Chapter 4.1.1 --- Determination of concatenation point --- p.38
Chapter 4.1.2 --- Waveform concatenation --- p.38
Chapter 4.2 --- Problems and difficulties in concatenating sub-syllable units --- p.39
Chapter 4.2.1 --- Mismatch of acoustic properties --- p.40
Chapter 4.2.2 --- Allophone problem of Initials /z/, /c/ and /s/ --- p.42
Chapter 4.3 --- General procedures in concatenation strategies --- p.44
Chapter 4.3.1 --- Concatenation of unvoiced segments --- p.45
Chapter 4.3.2 --- Concatenation of voiced segments --- p.45
Chapter 4.3.3 --- Measurement of spectral distance --- p.48
Chapter 4.4 --- Detailed procedures in concatenation points determination --- p.50
Chapter 4.4.1 --- Unvoiced segments --- p.50
Chapter 4.4.2 --- Voiced segments --- p.53
Chapter 4.5 --- Selected examples in concatenation strategies --- p.58
Chapter 4.5.1 --- Concatenation at Initial segments --- p.58
Chapter 4.5.1.1 --- Plosives --- p.58
Chapter 4.5.1.2 --- Fricatives --- p.59
Chapter 4.5.2 --- Concatenation at Final segments --- p.60
Chapter 4.5.2.1 --- V group (long vowel) --- p.60
Chapter 4.5.2.2 --- D group (diphthong) --- p.61
References --- p.63
Chapter 5. --- Unit Selection for Sub-syllable Units --- p.65
Chapter 5.1 --- Basic requirements in unit selection process --- p.65
Chapter 5.1.1 --- Availability of multiple copies of sub-syllable units --- p.65
Chapter 5.1.1.1 --- Levels of "identical" --- p.66
Chapter 5.1.1.2 --- Statistics on the availability --- p.67
Chapter 5.1.2 --- Variations in acoustic parameters --- p.70
Chapter 5.1.2.1 --- Pitch level --- p.71
Chapter 5.1.2.2 --- Duration --- p.74
Chapter 5.1.2.3 --- Intensity level --- p.75
Chapter 5.2 --- Selection process: availability check on sub-syllable units --- p.77
Chapter 5.2.1 --- Multiple copies found --- p.79
Chapter 5.2.2 --- Unique copy found --- p.79
Chapter 5.2.3 --- No matched copy found --- p.80
Chapter 5.2.4 --- Illustrative examples --- p.80
Chapter 5.3 --- Selection process: acoustic analysis on candidate units --- p.81
References --- p.88
Chapter 6. --- Performance Evaluation --- p.89
Chapter 6.1 --- General information --- p.90
Chapter 6.1.1 --- Objective test --- p.90
Chapter 6.1.2 --- Subjective test --- p.90
Chapter 6.1.3 --- Test materials --- p.91
Chapter 6.2 --- Details of the objective test --- p.92
Chapter 6.2.1 --- Testing method --- p.92
Chapter 6.2.2 --- Results --- p.93
Chapter 6.2.3 --- Analysis --- p.96
Chapter 6.3 --- Details of the subjective test --- p.98
Chapter 6.3.1 --- Testing method --- p.98
Chapter 6.3.2 --- Results --- p.99
Chapter 6.3.3 --- Analysis --- p.101
Chapter 6.4 --- Summary --- p.107
References --- p.108
Chapter 7. --- Conclusions and Future Works --- p.109
Chapter 7.1 --- Conclusions --- p.109
Chapter 7.2 --- Suggested future works --- p.111
References --- p.113
Appendix 1 Mean pitch level of Initials and Finals stored in the inventory --- p.114
Appendix 2 Mean durations of Initials and Finals stored in the inventory --- p.121
Appendix 3 Mean intensity level of Initials and Finals stored in the inventory --- p.124
Appendix 4 Test word used in performance evaluation --- p.127
Appendix 5 Test paragraph used in performance evaluation --- p.128
Appendix 6 Pitch profile used in the Text-to-Speech system --- p.131
Appendix 7 Duration model used in Text-to-Speech system --- p.132
Rato, João Pedro Cordeiro. "Conversação homem-máquina. Caracterização e avaliação do estado actual das soluções de speech recognition, speech synthesis e sistemas de conversação homem-máquina." Master's thesis, 2016. http://hdl.handle.net/10400.8/2375.
Full textTso, Chin-Heng, and 左晉恆. "HMM-Based Chinese Text-To-Speech System." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/81700635040061571703.
Full textHuang, Yi-chin, and 黃奕欽. "Emotional Text-to-Speech System of Baseball Broadcast." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/7595f3.
Full text國立中山大學
資訊工程學系研究所
96
In this study, we implement an emotional text-to-speech system for the limited domain of on-line play-by-play baseball game summaries, with the Chinese Professional Baseball League (CPBL) as the target domain. Our goal is for the output synthesized speech to be fluent and to carry appropriate emotion. The system first parses the input text and keeps the on-court information, e.g., the number of runners and which bases are occupied, the number of outs, the score of each team, and the batter's performance in the game, and then adds additional sentences to the input text. The system next produces neutral synthesized speech from the augmented text and subsequently converts it to emotional speech. Our approach to this speech conversion is to simulate a baseball broadcaster; specifically, the system learns and uses the prosody of a broadcaster. To learn the prosody, we record two baseball games and analyze the prosodic features of emotional utterances. These observations are used to derive prosodic rules for emotion conversion. A subjective evaluation is used to study the subjects' preferences regarding the inserted sentences and the emotion conversion.
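To illustrate the flavour of rule-based emotion conversion outlined in this abstract, here is a minimal Python sketch that derives an excitement level from the game situation and turns it into global prosody scaling factors. The rules and numbers are invented for illustration, not those measured from the broadcaster recordings in the thesis.

```python
# Toy rule-based mapping: game state -> excitement -> prosody scaling factors.
from dataclasses import dataclass

@dataclass
class GameState:
    runners_on: int        # 0-3
    outs: int              # 0-2
    score_diff: int        # batting team minus fielding team
    late_inning: bool

def excitement(state: GameState) -> float:
    level = 0.2 + 0.15 * state.runners_on
    if state.late_inning and abs(state.score_diff) <= 2:
        level += 0.3                       # close game in the late innings
    if state.outs == 2:
        level += 0.1                       # two-out pressure
    return min(level, 1.0)

def prosody_factors(level: float):
    """Map excitement in [0, 1] to multiplicative prosody adjustments."""
    return {"pitch": 1.0 + 0.25 * level,   # raise F0 by up to 25%
            "rate":  1.0 + 0.20 * level,   # speak up to 20% faster
            "energy": 1.0 + 0.30 * level}  # up to 30% louder

state = GameState(runners_on=2, outs=2, score_diff=-1, late_inning=True)
print(prosody_factors(excitement(state)))
```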
蔡依玲. "An HMM-based Hakka Text-to-Speech System." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/65432807762850521130.
Full text國立交通大學
電信工程研究所
98
In this thesis, a Hakka Text-to-Speech (TTS) system is implemented. It consists of four main parts: a parser, a pause predictor, a context analyzer and an HMM-based synthesizer. The input text is first tagged into a word sequence by the parser. Due to the lack of a large text corpus for training a robust Hakka parser, we adopt a new approach to constructing one by extending an existing CRF-based Chinese parser with a Hakka dictionary and some Hakka word-construction rules. The pause predictor then estimates the inter-syllable locations at which to insert pauses, and the context analyzer generates the synthesis units and some linguistic parameters. Lastly, the HMM-based synthesizer produces duration, pitch and spectral parameters to generate the output synthesized speech. Experiments are designed to evaluate the performance of the parser and the pause predictor, as well as the quality of the synthesized speech. A good MOS score in the subjective quality test confirms that the Hakka TTS system is a promising one.
Lin, Dong-Yi, and 林東毅. "An Implementation of Hakka Text-to-Speech System." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/39336830137118858666.
Full text國立交通大學
電信工程系所
95
In this thesis, a Hakka Text-to-Speech (TTS) system is implemented. It consists of four main parts: a text analyzer, an RNN prosody generator, a waveform inventory of synthesis units, and a PSOLA synthesizer. The input text is first tagged in the text analyzer into a word sequence. The RNN prosody generator then generates the prosodic information using linguistic features extracted from the word sequence. The waveform corresponding to the word sequence is extracted from the waveform inventory and prosodically adjusted to generate the output speech. The basic implementation of the system follows the Mandarin TTS system developed previously at NCTU. A demo system operating on the Windows platform, using an SDI (Single Document Interface) text editor with the synthesis kernel, was finally realized. Informal listening tests show that most synthesized utterances sound fair.
Yang, Yu-Ching, and 楊鈺清. "An Implementation of Taiwanese Text-to-Speech System." Thesis, 1999. http://ndltd.ncl.edu.tw/handle/85718453317105013745.
Full text國立交通大學
電信工程系
87
In this thesis, a Taiwanese TTS system is implemented. It consists of four main parts: a text analyzer, an RNN prosody generator, a waveform inventory of synthesis units, and a PSOLA synthesizer. The input text is first tagged in the text analyzer into a word sequence. The RNN prosody generator then generates the prosodic information using linguistic features extracted from the word sequence. The waveform sequence corresponding to the word sequence is extracted from the waveform inventory and prosodically adjusted to generate the output speech. The basic implementation of the system follows the Mandarin TTS system developed previously at NCTU, with the following improvements. First, sample-based duration information is used rather than frame-based information. Second, the syllable energy contour is treated as prosodic information to be generated, instead of using static patterns given by the corresponding basic waveform. Third, both duration and energy features are normalized up to the utterance level. A demo system operating on the Windows 95/NT platform, using an SDI (Single Document Interface) text editor with the synthesis kernel, was finally realized. Informal listening tests show that most synthesized utterances sound fair.
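The utterance-level normalization mentioned in this abstract can be sketched very simply: duration and energy values of the syllables in one utterance are converted to z-scores against that utterance's own mean and standard deviation, so the prosody model works with relative rather than absolute values. The numbers below are toy data.

```python
# Utterance-level z-score normalization of prosodic features.
import numpy as np

def normalize_utterance(values):
    values = np.asarray(values, dtype=float)
    std = values.std()
    return (values - values.mean()) / std if std > 0 else values * 0.0

syllable_durations_ms = [180, 220, 160, 300, 240]   # one utterance
syllable_energies_db  = [62.0, 65.5, 60.1, 68.2, 64.0]

print(normalize_utterance(syllable_durations_ms).round(2))
print(normalize_utterance(syllable_energies_db).round(2))
```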
Whang, Bau-Jang, and 黃保章. "A Study for Mandarin Text to Taiwanese Speech System." Thesis, 1999. http://ndltd.ncl.edu.tw/handle/87699301605849503933.
Full text國立成功大學
電機工程學系
87
In this thesis, a Mandarin-text-to-Taiwanese-speech conversion system is described. The user can conveniently obtain the corresponding Taiwanese speech after entering Chinese sentences. We discuss the four main problems that occur in implementing the Taiwanese speech synthesis system: (1) one Chinese word may map to different Taiwanese syllables, (2) ambiguity in segmenting a sentence, (3) tonal operations, and (4) processing of synthesis units. First, to solve problem (1), we collect all lexical words that may map to different syllables into the lexical corpus. Second, we deal with the ambiguity in segmenting a sentence using the Viterbi algorithm, and establish a way to handle morphology and literary pronunciation. Third, to handle the tonal changes of Taiwanese, this work builds rules for the relationship between the inherent tone and the derived tone. It is also difficult to determine whether a morpheme takes its inherent tone or its derived tone for every word in a sentence, and we build a rule to handle this as well. Finally, for the recording of the synthesis units, we propose a method called "recording by fixed melody". This method improves the naturalness of the synthetic speech so that it approaches the prosodic properties of real speech.
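As a small illustration of resolving segmentation ambiguity with dynamic programming, in the spirit of the Viterbi-based approach mentioned in this abstract, here is a Python sketch that picks the segmentation with the highest product of word probabilities. The mini-lexicon and its probabilities are invented for illustration.

```python
# Dynamic-programming word segmentation over a toy lexicon.
import math

LEXICON = {"研究": 0.02, "研": 0.001, "究": 0.001, "生": 0.01,
           "研究生": 0.005, "命": 0.002, "生命": 0.01, "起源": 0.008}

def segment(sentence):
    n = len(sentence)
    best = [(-math.inf, 0)] * (n + 1)      # (log-probability, previous cut)
    best[0] = (0.0, 0)
    for j in range(1, n + 1):
        for i in range(max(0, j - 4), j):  # words up to 4 characters long
            word = sentence[i:j]
            if word in LEXICON and best[i][0] > -math.inf:
                score = best[i][0] + math.log(LEXICON[word])
                if score > best[j][0]:
                    best[j] = (score, i)
    words, j = [], n                       # backtrack through the best cuts
    while j > 0:
        i = best[j][1]
        words.append(sentence[i:j])
        j = i
    return list(reversed(words))

print(segment("研究生命起源"))   # -> ['研究', '生命', '起源'] with these toy numbers
```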
Chu, Kuo Hua, and 朱國華. "A Language Model for Chinese Speech-to-Text System." Thesis, 1993. http://ndltd.ncl.edu.tw/handle/11480419945514931288.
Full textLin, Yih-Jeng, and 林義証. "Developing A Chinese Text-To-Speech System For CAI." Thesis, 1994. http://ndltd.ncl.edu.tw/handle/88531281586485611562.
Full textLu, Peng-Ren, and 盧鵬任. "An Improvement on the Mandarin Text-to-Speech System." Thesis, 1997. http://ndltd.ncl.edu.tw/handle/15791269009752805058.
Full text國立交通大學
電信研究所
85
In this thesis, improvements to a Mandarin TTS system developed previously in the Speech Processing Lab of NCTU are carried out. The system consists of four main parts: a text analyzer, an RNN-based prosodic information generator, a waveform table of 417 base syllables, and a PSOLA synthesizer. Input texts are first analyzed in the text analyzer. The RNN prosody generator then generates the prosodic information using linguistic features extracted from the outputs of text analysis. Meanwhile, the corresponding waveform template sequence is extracted from the waveform table. Lastly, the PSOLA synthesizer generates the output synthesized speech by adjusting the prosody of the waveform template sequence. In this study, the system is improved in several respects. We first extend the lexicon of the text analyzer from 80,000 words to 110,000 words, which greatly increases its coverage. Then, a word pronunciation tree is constructed to speed up the text-analysis process, and some simple phonological rules are incorporated into the text analyzer. The number of POS types used in the RNN prosody generator is reduced from 44 to 22 to lower its computational complexity while keeping the naturalness of the synthesized speech undegraded. Then, a new method of producing the waveform table of 417 base syllables from utterances of isolated syllables is proposed. This not only increases the quality of the synthesized speech but also greatly simplifies the process of adding a new speaker's voice to the system. Lastly, we change the operating environment from DOS to Windows 95 and repackage the software as a dynamic library, which makes the development of new applications easier.
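A word pronunciation tree of the kind described in this abstract is essentially a trie: dictionary words are stored character by character, so the text analyzer can find all lexicon words starting at a given position in one left-to-right walk. The sketch below uses toy entries, not the system's actual 110,000-word lexicon.

```python
# Pronunciation trie: find all lexicon words starting at a given position.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.pronunciation = None          # set if a word ends at this node

class PronunciationTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, pron):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.pronunciation = pron

    def prefix_matches(self, text, start=0):
        """All (word, pronunciation) pairs in the lexicon starting at `start`."""
        node, matches = self.root, []
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break
            if node.pronunciation is not None:
                matches.append((text[start:end + 1], node.pronunciation))
        return matches

trie = PronunciationTrie()
trie.insert("中", "zhong1")
trie.insert("中文", "zhong1 wen2")
trie.insert("中文系", "zhong1 wen2 xi4")
print(trie.prefix_matches("中文系學生", 0))
```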
XIE, GING-JIANG, and 謝清江. "A Chinese text-to-speech system based on formant synthesis." Thesis, 1987. http://ndltd.ncl.edu.tw/handle/68840754016731337307.
Full textZheng, Yuan-Jie, and 鄭元傑. "A Telephone Number Text-to-Speech System With Speaker Adaptation." Thesis, 2001. http://ndltd.ncl.edu.tw/handle/14289661377522120035.
Full text國立中興大學
應用數學系
89
In this thesis, we develop a Mandarin telephone-number text-to-speech system with speaker adaptation. We use several parameters to predict prosody in a hierarchical way. The prosody parameters include the numbers before and after the target number, segment information, and the number of syllables; from these we predict duration, volume, and pause. For the duration model, the average errors of the inside test and the outside test are 24 ms and 45 ms, respectively. For the volume model, the average errors of the inside test and the outside test are 1.83 dB and 2.22 dB, respectively. In addition, we test speaker adaptation in our text-to-speech system: we try to use one speaker's prosody to predict that of another speaker for whom only a few training data are available. In our tests, the average error in duration is 44 ms and the average error in volume is 2.26 dB with 5 training sentences (52 syllables); 35 ms and 2.03 dB with 10 sentences (107 syllables); and 29 ms and 1.99 dB with 20 sentences (253 syllables).
Wu, Chao-Hsiang, and 吳兆祥. "A Chinese Text-to-Speech System Based on Word Units." Thesis, 1994. http://ndltd.ncl.edu.tw/handle/74068357773888875038.
Full textLi, Jie, and 李杰. "HMM-Based Chinese Text-To-Speech System with Support Speakers." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/69884613145394916907.
Full text國立臺灣大學
資訊工程學研究所
100
Nowadays people can use speech technology to make their lives better, and speech synthesis has recently come to be regarded as an important part of it. Two speech synthesis techniques are commonly used: unit selection and HMM-based synthesis. In the unit selection technique, the voice recordings in the corpus are divided into small pieces that are concatenated to generate the synthesized voice. In the HMM-based technique, acoustic models are estimated from acoustic features, and the synthesized voice is generated from these models. In this thesis, I used the HMM-based technique to implement a Chinese Text-to-Speech (TTS) system. The system extracts spectral features, fundamental frequency features and context-dependent labels to train the models; after the training stage, it analyzes the text and uses the corresponding models to generate the voice. Acoustic model training needs a large amount of data to produce high-quality models, and enough data for a single speaker is difficult to obtain, so conventionally an average acoustic model plus speaker adaptation is used to make training with less data possible. However, it is difficult for an average acoustic model to be close to the model of the target speaker, so the performance of speaker adaptation is limited. In this thesis, I proposed several methods to find acoustically similar speakers as support speakers of the target speaker and to use their data to train support speaker models. Objective and subjective experiments showed that the support speaker model technique outperforms the average acoustic model technique and results in better synthesis quality.
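One plausible way to pick "support speakers", sketched below in Python, is to represent every speaker by the mean of some acoustic feature vectors (e.g. MFCC means) and select the k speakers closest to the target. The feature values are random toy data; the thesis compares several selection methods, and this is only one illustrative variant.

```python
# Select the k speakers whose mean feature vector is closest to the target's.
import numpy as np

def speaker_embedding(frames):
    """Average the per-frame feature vectors of one speaker's data."""
    return np.mean(frames, axis=0)

def select_support_speakers(target_frames, candidate_frames, k=2):
    target = speaker_embedding(target_frames)
    distances = {name: np.linalg.norm(speaker_embedding(frames) - target)
                 for name, frames in candidate_frames.items()}
    return sorted(distances, key=distances.get)[:k]

rng = np.random.default_rng(1)
target = rng.normal(0.0, 1.0, size=(200, 13))          # target speaker frames
candidates = {f"spk{i}": rng.normal(i * 0.3, 1.0, size=(500, 13))
              for i in range(5)}
print(select_support_speakers(target, candidates, k=2))   # nearest speakers
```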
Shih, Shan-Shu, and 施善舒. "Processing for Generating Continuous Speech from Syllables in a Mandarin Text-to-Speech System." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/k288um.
Full text國立中興大學
資訊科學與工程學系所
102
This thesis investigates how to generate continuous speech from syllables in a Mandarin text-to-speech (TTS) system. We want to find more appropriate pitches for the syllables used in synthesis, in order to obtain smooth synthesized speech. We use short-time speech processing methods to divide speech into frames and extract speech features, and we then use these features to search for the closest pitch. We use the following four methods to obtain the pitches for synthesis: searching with individual pitch, nearest-distance ascending search, three-point averaging, and dynamic time warping (DTW). Combined with adjustment of duration, pitch, and volume, we hope the synthesized speech can be as natural as possible. We then run experiments using sentences and synthesis units recorded by real speakers, and compare the results and efficacy of these methods using the mean opinion score (MOS). The best method, combined with a classification of continuous speech, is then used to train curves, so that continuous speech can be synthesized from syllables by curve fitting. Finally, we build a user interface that lets the user input the relevant parameters according to the prosodic structure, making the speech synthesis system more intuitive and approachable.
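The dynamic time warping (DTW) distance used as one of the pitch-matching methods above can be sketched in a few lines of Python: align two pitch contours of different lengths and return the cumulative alignment cost. The contours below are toy data.

```python
# Classic DTW distance between two pitch contours.
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

contour_unit   = [210, 215, 220, 218, 212, 205]   # pitch track of a stored unit (Hz)
contour_target = [208, 214, 221, 216, 206]        # desired pitch contour (Hz)
print(dtw_distance(contour_unit, contour_target))
```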
Fu, Zhen-Hong, and 傅振宏. "Automatic Generation of Synthesis Units for Taiwanese Text-to-Speech System." Thesis, 2000. http://ndltd.ncl.edu.tw/handle/46706238089789082381.
Full text長庚大學
電機工程研究所
88
In this thesis, we demonstrate a Taiwanese (Min-nan) text-to-speech (TTS) system based on automatically generated synthesis units. It can read out modern Taiwanese articles rather naturally. The TTS system is composed of three functional modules, namely a text analysis module, a prosody module, and a waveform synthesis module. Modern Taiwanese texts contain Chinese characters and English letters simultaneously, so the text analysis module must first be able to deal with mixed Chinese-English text. In this module, text normalization, word segmentation, letter-to-phoneme conversion and word frequency are used to deal with multiple pronunciations. The prosody module handles tone sandhi and phonetic variation in Taiwanese. The synthesis units in the waveform synthesis module come from two sources: (1) isolated-uttered tonal syllables covering all possible tonal variations in Taiwanese, about 4,521 in total, and (2) units automatically generated from a designated speech corpus. We employ an HMM-based large-vocabulary Taiwanese speech recognition system to perform forced alignment on the speech corpus, with short-pause recognition incorporated into the recognizer. After the synthesis unit string has been extracted, inter-syllable coarticulation information is applied to decide how to concatenate these units, and the output speech is generated after energy normalization. We evaluate the system on automatically segmented speech: compared with human segmentation, a correct rate of about 85% is achieved. The system has been implemented on a PC running MS Windows 9x/NT/2000.
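The energy normalization step mentioned in this abstract can be sketched as follows: each selected unit is rescaled to a common RMS level before the waveforms are concatenated, so loudness does not jump at unit boundaries. The "units" here are synthetic sine bursts standing in for real speech segments.

```python
# Rescale each unit to a common RMS level before concatenation.
import numpy as np

def normalize_energy(units, target_rms=0.1):
    """Rescale each unit so its RMS energy matches target_rms."""
    out = []
    for u in units:
        rms = np.sqrt(np.mean(u ** 2))
        out.append(u * (target_rms / rms) if rms > 0 else u)
    return out

sr = 16000
t = np.arange(0, 0.1, 1 / sr)
unit_a = 0.50 * np.sin(2 * np.pi * 150 * t)     # loud unit
unit_b = 0.05 * np.sin(2 * np.pi * 180 * t)     # quiet unit
normalized = normalize_energy([unit_a, unit_b])
speech = np.concatenate(normalized)             # units now join at similar loudness
print([round(float(np.sqrt(np.mean(u ** 2))), 3) for u in normalized])
```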
蘇子安. "A distributed text-to-speech service system and its scheduling algorithms." Thesis, 2003. http://ndltd.ncl.edu.tw/handle/21919220997966758710.
Full text國立海洋大學
資訊科學學系
91
Rapid technological advances in office automation equipment, information appliances, and interconnection networks have brought about a brand-new integrated office environment. In this environment, most devices have compact form factors, limited resources, limited communication bandwidth, and limited computation power; they provide their individual functions by integrating the public services available over the local interconnection equipment. A framework that provides services on the network has to overcome difficulties such as network traffic congestion, unbalanced server loads, and dynamic configuration of service components. In this thesis, a distributed network service for intelligent text-to-speech is constructed. In order to design an efficient service framework that provides better quality of service with limited resources, a supervised network service architecture and scheduling algorithms are proposed. Computer simulations are conducted to analyze the performance of each decision rule of the scheduling algorithms in different environments (the power of each server in the system, the distribution of job sizes, the distribution of job inter-arrival times, etc.). These results can be used to dynamically adjust the decision rules in the service framework.
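As a sketch of one scheduling decision rule of the kind this abstract evaluates, the Python example below dispatches each incoming TTS job to the server with the smallest expected completion time, given each server's speed and current queue. The server speeds and job sizes are toy values, not the thesis's simulation parameters.

```python
# Dispatch each job to the server with the smallest expected completion time.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    speed: float                 # work units processed per second
    queued_work: float = 0.0     # work already waiting on this server

    def expected_completion(self, job_size: float) -> float:
        return (self.queued_work + job_size) / self.speed

def dispatch(job_size, servers):
    best = min(servers, key=lambda s: s.expected_completion(job_size))
    best.queued_work += job_size
    return best.name

servers = [Server("fast", 4.0), Server("medium", 2.0), Server("slow", 1.0)]
jobs = [8.0, 3.0, 6.0, 1.0, 5.0]       # e.g. length of text to synthesize
print([dispatch(j, servers) for j in jobs])
```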
HUANG, SHAO-HUA, and 黃紹華. "A synthesis of prosodic information in mandarin text-to-speech system." Thesis, 1991. http://ndltd.ncl.edu.tw/handle/08240353472600497334.
Full textYang, Chi-Yu, and 楊棋宇. "Performance Improvement of Neural Network based End-to-end Text-to-Speech System." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/56575t.