To see the other types of publications on this topic, follow the link: Synthesised speech.

Dissertations / Theses on the topic 'Synthesised speech'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 dissertations / theses for your research on the topic 'Synthesised speech.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Stibbard, Richard. "Vocal expressions of emotions in non-laboratory speech : an investigation of the Reading/Leeds Emotion in Speech Project annotation data." Thesis, University of Reading, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.343222.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Jin, Yi-Xuan. "A HIGH SPEED DIGITAL IMPLEMENTATION OF LPC SPEECH SYNTHESIZER USING THE TMS320." Thesis, The University of Arizona, 1985. http://hdl.handle.net/10150/275309.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Donovan, R. E. "Trainable speech synthesis." Thesis, University of Cambridge, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.598598.

Full text
Abstract:
This thesis is concerned with the synthesis of speech using trainable systems. The research it describes was conducted with two principal aims: to build a hidden Markov model (HMM) based speech synthesis system which could synthesise very high quality speech; and to ensure that all the parameters used by the system were obtained through training. The motivation behind the first of these aims was to determine if the HMM techniques which have been applied so successfully in recent years to the problem of automatic speech recognition could achieve a similar level of success in the field of speech synthesis. The motivation behind the second aim was to construct a system that would be very flexible with respect to changing voices, or even languages. A synthesis system was developed which used the clustered states of a set of decision-tree state-clustered HMMs as its synthesis units. The synthesis parameters for each clustered state were obtained completely automatically through training on a one hour single-speaker continuous-speech database. During synthesis the required utterance, specified as a string of words of known phonetic pronunciation, was generated as a sequence of these clustered states. Initially, each clustered state was associated with a single linear prediction (LP) vector, and LP synthesis used to generate the sequence of vectors corresponding to the state sequence required. Numerous shortcomings were identified in this system, and these were addressed through improvements to its transcription, clustering, and segmentation capabilities. The LP synthesis scheme was replaced by a TD-PSOLA scheme which synthesised speech by concatenating waveform segments selected to represent each clustered state.
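For readers less familiar with the synthesis stage described above, here is a minimal, illustrative sketch of LP synthesis driven by a sequence of clustered states, assuming each state already has an LP coefficient vector and a gain. All function and variable names are invented for the example and are not taken from the thesis.

```python
import numpy as np
from scipy.signal import lfilter

def lp_synthesise(state_sequence, lp_table, gain_table,
                  frame_len=80, fs=16000, f0=120.0):
    """Very rough LP synthesis: one fixed-length frame per clustered state,
    excited by an impulse train (voiced only) -- a toy stand-in for the real system."""
    period = int(fs / f0)                       # samples between glottal pulses
    out = []
    phase = 0
    for state in state_sequence:
        a = np.asarray(lp_table[state])         # LP coefficients a_1..a_p for this state
        g = gain_table[state]
        exc = np.zeros(frame_len)               # impulse-train excitation for this frame
        while phase < frame_len:
            exc[phase] = 1.0
            phase += period
        phase -= frame_len                      # carry pulse phase into the next frame
        # all-pole synthesis filter g / (1 - sum_k a_k z^-k)
        frame = lfilter([g], np.concatenate(([1.0], -a)), exc)
        out.append(frame)
    return np.concatenate(out)
```

The actual system additionally modelled the excitation and, as the abstract notes, later replaced this LP stage with TD-PSOLA concatenation of waveform segments.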
APA, Harvard, Vancouver, ISO, and other styles
4

Greenwood, Andrew Richard. "Articulatory speech synthesis." Thesis, University of Liverpool, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.386773.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Tsukanova, Anastasiia. "Articulatory speech synthesis." Electronic Thesis or Diss., Université de Lorraine, 2019. http://www.theses.fr/2019LORR0166.

Full text
Abstract:
The thesis is set in the domain of articulatory speech synthesis and consists of three major parts: the first two are dedicated to the development of two articulatory speech synthesizers, and the third addresses how the two can be related to each other. The first synthesizer results from a rule-based approach that aimed at comprehensive control over the articulators (the jaw, the tongue, the lips, the velum, the larynx and the epiglottis). This approach used a dataset of static mid-sagittal magnetic resonance imaging (MRI) captures showing blocked articulations of French vowels and a set of consonant-vowel syllables; that dataset was encoded with a PCA-based vocal tract model. The system then comprised several components: using the recorded articulatory configurations as a source of target positions to drive the rule-based articulatory synthesizer (the main contribution of this first part); adjusting the obtained vocal tract shapes from a phonetic perspective; and running an acoustic simulation unit to obtain the sound. The results of this synthesis were evaluated visually, acoustically and perceptually, and the problems encountered were broken down by their origin: the dataset, its modeling, the algorithm controlling the vocal tract shape, its translation to area functions, and the acoustic simulation. We concluded that, among our test examples, the articulatory strategies for vowels and stops are the most correct, followed by those for nasals and fricatives. The second approach started from a baseline deep feed-forward neural network speech synthesizer trained with the standard Merlin recipe on audio recorded during real-time MRI (RT-MRI) acquisitions: denoised French speech (still containing a considerable amount of MRI machine noise) and force-aligned state labels encoding phonetic and linguistic information. This synthesizer was augmented with eight parameters representing articulatory information (lip opening and protrusion, and the distances between the tongue and the velum, between the velum and the pharyngeal wall, and between the tongue and the pharyngeal wall) that were automatically extracted from the captures and aligned with the audio signal and the linguistic specification. The jointly synthesized speech and articulatory sequences were evaluated objectively with dynamic time warping (DTW) distance, mean mel-cepstral distortion (MCD), band aperiodicity prediction error (BAP), and three measures for F0: root mean square error (RMSE), correlation coefficient (CORR) and frame-level voiced/unvoiced error (V/UV). The consistency of the articulatory parameters with the phonetic labels was analyzed as well. I concluded that the generated articulatory parameter sequences matched the original ones acceptably closely, despite struggling more to attain contact between the articulators, and that the addition of articulatory parameters did not hinder the original acoustic model. The two approaches are linked through their use of two different kinds of MRI speech data.
This motivated a search in the real-time data for the kind of coarticulation-aware targets available in the static case. To compare static and real-time MRI captures, the measures of structural similarity, Earth mover's distance, and SIFT were used; having analyzed these measures for validity and consistency, I qualitatively and quantitatively studied their temporal behavior, interpreted it and analyzed the identified similarities. I concluded that SIFT and structural similarity did capture some articulatory information and that their behavior, overall, validated the static MRI dataset. [...]
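Of the objective measures listed above, mel-cepstral distortion (MCD) is straightforward to reproduce once aligned mel-cepstral sequences are available. The sketch below assumes the frames are already time-aligned and follows the common convention of excluding the energy coefficient c0; these are illustrative choices, not details confirmed by the thesis.

```python
import numpy as np

def mel_cepstral_distortion(ref, gen):
    """MCD in dB between two aligned mel-cepstrum sequences of shape
    (frames, order); c0 (energy) is conventionally excluded."""
    diff = ref[:, 1:] - gen[:, 1:]
    frame_mcd = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(frame_mcd))
```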
APA, Harvard, Vancouver, ISO, and other styles
6

Sun, Felix (Felix W. ). "Speech Representation Models for Speech Synthesis and Multimodal Speech Recognition." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/106378.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. This electronic version was submitted by the student author; the certified thesis is available in the Institute Archives and Special Collections. Cataloged from the student-submitted PDF version of the thesis. Includes bibliographical references (pages 59-63).
The field of speech recognition has seen steady advances over the last two decades, leading to the accurate, real-time recognition systems available on mobile phones today. In this thesis, I apply speech modeling techniques developed for recognition to two other speech problems: speech synthesis and multimodal speech recognition with images. In both problems, there is a need to learn a relationship between speech sounds and another source of information. For speech synthesis, I show that using a neural network acoustic model results in a synthesizer that is more tolerant of noisy training data than previous work. For multimodal recognition, I show how information from images can be effectively integrated into the recognition search framework, resulting in improved accuracy when image data is available.
By Felix Sun. M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
7

Cahn, Janet E. (Janet Elizabeth). "Generating expression in synthesized speech." Thesis, Massachusetts Institute of Technology, 1989. http://hdl.handle.net/1721.1/14218.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Wong, Chun-ho Eddy. "Reliability of rating synthesized hypernasal speech signals in connected speech and vowels." Click to view the E-thesis via HKU Scholars Hub, 2007. http://lookup.lib.hku.hk/lookup/bib/B4200617X.

Full text
Abstract:
Thesis (B.Sc.)--University of Hong Kong, 2007. "A dissertation submitted in partial fulfilment of the requirements for the Bachelor of Science (Speech and Hearing Sciences), The University of Hong Kong, June 30, 2007." Includes bibliographical references (p. 28-30). Also available in print.
APA, Harvard, Vancouver, ISO, and other styles
9

Morton, K. "Speech production and synthesis." Thesis, University of Essex, 1987. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.377930.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Klompje, Gideon. "A parametric monophone speech synthesis system." Thesis, Link to online version, 2006. http://hdl.handle.net/10019/561.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Smith, Lloyd A. (Lloyd Allen). "Speech Recognition Using a Synthesized Codebook." Thesis, University of North Texas, 1988. https://digital.library.unt.edu/ark:/67531/metadc332203/.

Full text
Abstract:
Speech sounds generated by a simple waveform synthesizer were used to create a vector quantization codebook for use in speech recognition. Recognition was tested over the TI-20 isolated word database using a conventional DTW matching algorithm. Input speech was band limited to 300 - 3300 Hz, then passed through the Scott Instruments Corp. Coretechs process, implemented on a VET3 speech terminal, to create the speech representation for matching. Synthesized sounds were processed in software by a VET3 signal processing emulation program. Emulation and recognition were performed on a DEC VAX 11/750. The experiments were organized in two series. A preliminary experiment, using no vector quantization, provided a baseline for comparison. The original codebook contained 109 vectors, all derived from two-formant synthesized sounds. This codebook was decimated through the course of the first series of experiments, based on the number of times each vector was used in quantizing the training data for the previous experiment, in order to determine the smallest subset of vectors suitable for coding the speech database. The second series of experiments altered several test conditions in order to evaluate the applicability of the minimal synthesized codebook to conventional codebook training. The baseline recognition rate was 97%. The recognition rate for synthesized codebooks was approximately 92% for sizes ranging from 109 to 16 vectors. Accuracy for smaller codebooks was slightly less than 90%. Error analysis showed that the primary loss in dropping below 16 vectors was in coding of voiced sounds with high frequency second formants. The 16-vector synthesized codebook was chosen as the seed for the second series of experiments. After one training iteration, and using a normalized distortion score, trained codebooks performed with an accuracy of 95.1%. When codebooks were trained and tested on different sets of speakers, accuracy was 94.9%, indicating that very little speaker dependence was introduced by the training.
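The two core operations described above, quantising frames against a codebook and decimating the codebook by usage counts, can be sketched as follows. This is a toy illustration with invented names; the thesis's actual VET3-based processing is not reproduced here.

```python
import numpy as np

def quantise(frames, codebook):
    """Return, for each frame, the index of the nearest codebook vector."""
    # (n_frames, 1, dim) - (1, n_codes, dim) -> squared distances (n_frames, n_codes)
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def decimate(codebook, usage_counts, keep):
    """Keep only the `keep` most frequently used codebook vectors."""
    order = np.argsort(usage_counts)[::-1][:keep]
    return codebook[np.sort(order)]

# Hypothetical usage: count how often each vector codes the training data,
# then shrink the codebook to, say, 16 entries.
# codes = quantise(train_frames, codebook)
# counts = np.bincount(codes, minlength=len(codebook))
# codebook16 = decimate(codebook, counts, keep=16)
```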
APA, Harvard, Vancouver, ISO, and other styles
12

Campbell, Wilhelm. "Multi-level speech timing control." Thesis, University of Sussex, 1992. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.283832.

Full text
Abstract:
This thesis describes a model of speech timing, predicting at the syllable level, with sensitivity to rhythmic factors at the foot level, that predicts segmental durations by a process of accommodation into the higher-level timing framework. The model is based on analyses of two large databases of British English speech; one illustrating the range of prosodic variation in the language, the other illustrating segmental duration characteristics in various phonetic environments. Designed for a speech synthesis application, the model also has relevance to linguistic and phonetic theory, and shows that phonological specification of prosodic variation is independent of the phonetic realisation of segmental duration. It also shows, using normalisation of phone-specific timing characteristics, that lengthening of segments within the syllable is of three kinds: prominence-related, applying more to onset segments; boundary-related, applying more to coda segments; and rhythm/rate-related, being more uniform across all component segments. In this model, durations are first predicted at the level of the syllable from consideration of the number of component segments, the nature of the rhyme, and the three types of lengthening. The segmental durations are then constrained to sum to this value by determining an appropriate uniform quantile of their individual distributions. Segmental distributions define the range of likely durations each might show under a given set of conditions; their parameters are predicted from broad-class features of place and manner of articulation, factored for position in the syllable, clustering, stress, and finality. Two parameters determine the segmental duration pdfs, assuming a Gamma distribution, and one parameter determines the quantile within that pdf to predict the duration of any segment in a given prosodic context. In experimental tests, each level produced durations that closely fitted the data of four speakers of British English, and showed performance rates higher than a comparable model predicting exclusively at the level of the segment.
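The accommodation step described above, choosing a single uniform quantile so that the segments' Gamma duration distributions sum to the predicted syllable duration, can be illustrated with a small numerical sketch. The function names and the root-finding approach are assumptions made for illustration; the thesis does not specify this implementation.

```python
import numpy as np
from scipy.stats import gamma
from scipy.optimize import brentq

def allocate_segment_durations(syllable_dur, segment_params):
    """Find the single quantile q at which the per-segment Gamma duration
    distributions sum to the predicted syllable duration, then return the
    per-segment durations at that quantile.

    segment_params: list of (shape, scale) pairs, one per segment.
    Assumes syllable_dur lies between the combined near-zero and near-one quantiles.
    """
    def total_at(q):
        return sum(gamma.ppf(q, a, scale=s) for a, s in segment_params)
    q = brentq(lambda q: total_at(q) - syllable_dur, 1e-6, 1 - 1e-6)
    return [gamma.ppf(q, a, scale=s) for a, s in segment_params]

# Hypothetical usage: a 250 ms syllable split over three segments.
# durs = allocate_segment_durations(0.250, [(4.0, 0.02), (6.0, 0.015), (5.0, 0.025)])
```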
APA, Harvard, Vancouver, ISO, and other styles
13

Moers-Prinz, Donata [Verfasser]. "Fast Speech in Unit Selection Speech Synthesis / Donata Moers-Prinz." Bielefeld : Universitätsbibliothek Bielefeld, 2020. http://d-nb.info/1219215201/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Peng, Antai. "Speech expression modeling and synthesis." Diss., Georgia Institute of Technology, 1996. http://hdl.handle.net/1853/13560.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Brierton, Richard A. "Variable frame-rate speech synthesis." Thesis, University of Liverpool, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.357363.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Moakes, Paul Alan. "On-line adaptive nonlinear modelling of speech." Thesis, University of Sheffield, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.364189.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Mazel, David S. "Sinusoidal modeling of speech." Thesis, Georgia Institute of Technology, 1986. http://hdl.handle.net/1853/13873.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Gordon, Jane S. "Use of synthetic speech in tests of speech discrimination." PDXScholar, 1985. https://pdxscholar.library.pdx.edu/open_access_etds/3443.

Full text
Abstract:
The purpose of this study was to develop two tape-recorded synthetic speech discrimination test tapes and assess their intelligibility in order to determine whether or not synthetic speech was intelligible and if it would prove useful in speech discrimination testing. Four scramblings of the second MU-6 monosyllable word list were generated by the ECHO l C speech synthesizer using two methods of generating synthetic speech called TEXTALKER and SPEAKEASY. These stimuli were presented in one ear to forty normal-hearing adult subjects, 36 females and 4 males, at 60 dB HL under headphones. Each subject listened to two different scramblings of the 50 monosyllable word list, one scrambling generated by TEXTALKER and the other scrambling generated by SPEAKEASY. The order in which the TEXTALKER and SPEAKEASY mode of presentation occurred as well as which ear to test per subject was randomly determined.
APA, Harvard, Vancouver, ISO, and other styles
19

Liu, Zhu Lin. "Speech synthesis via adaptive Fourier decomposition." Thesis, University of Macau, 2011. http://umaclib3.umac.mo/record=b2493215.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Jauk, Igor. "Unsupervised learning for expressive speech synthesis." Doctoral thesis, Universitat Politècnica de Catalunya, 2017. http://hdl.handle.net/10803/460814.

Full text
Abstract:
Nowadays, especially with the upswing of neural networks, speech synthesis is almost totally data driven. The goal of this thesis is to provide methods for automatic and unsupervised learning from data for expressive speech synthesis. In comparison to "ordinary" synthesis systems, it is more difficult to find reliable expressive training data, despite the huge availability of sources such as the Internet. The main difficulty lies in the highly speaker- and situation-dependent nature of expressiveness, which causes many and acoustically substantial variations. The consequences are, first, that it is very difficult to define labels which reliably identify expressive speech with all its nuances. The typical definition of six basic emotions, or the like, is a simplification that has inexcusable consequences when dealing with data outside the lab. Second, even if a label set were defined, apart from the enormous manual effort, it would be difficult to obtain sufficient training data for models respecting all the nuances and variations. The goal of this thesis is therefore to study automatic training methods for expressive speech synthesis that avoid labeling, and to develop applications from these proposals. The focus lies on the acoustic and the semantic domains. In the acoustic domain, the goal is to find suitable acoustic features to represent expressive speech, especially in the multi-speaker domain, moving closer to real-life uncontrolled data. For this, the perspective shifts away from traditional, mainly prosody-based, features towards features obtained with factor analysis, trying to identify the principal components of expressiveness, namely using i-vectors. Results show that a combination of traditional and i-vector based features performs better in unsupervised clustering of expressive speech than traditional features, and even better than large state-of-the-art feature sets in the multi-speaker domain. Once the feature set is defined, it is used for unsupervised clustering of an audiobook, and from each cluster a voice is trained. The method is then evaluated in an audiobook-editing application, where users can use the synthetic voices to create their own dialogues. The results obtained validate the proposal. In this editing application, users choose synthetic voices and assign them to sentences, considering the speaking characters and the expressiveness. Involving the semantic domain, this assignment can be achieved automatically, at least in part. Words and sentences are represented numerically in trainable semantic vector spaces, called embeddings, and these can be used to predict expressiveness to some extent. This method not only permits fully automatic reading of longer text passages, considering the local context, but can also be used as a semantic search engine for training data. Both applications are evaluated in a perceptual test showing the potential of the proposed method. Finally, following recent tendencies in the speech synthesis world, a deep neural network based expressive speech synthesis system is designed and tested. Emotionally motivated semantic representations of text, sentiment embeddings, trained on the positiveness and negativeness of movie reviews, are used as an additional input to the system. The neural network now learns not only from segmental and contextual information, but also from the sentiment embeddings, affecting especially prosody.
The system is evaluated in two perceptual experiments, which show a preference for the inclusion of sentiment embeddings as an additional input.
APA, Harvard, Vancouver, ISO, and other styles
21

Macon, Michael W. "Speech synthesis based on sinusoidal modeling." Diss., Georgia Institute of Technology, 1996. http://hdl.handle.net/1853/13904.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Wang, Min 1961. "Formant-based synthesis of Chinese speech." Thesis, McGill University, 1986. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=66001.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Rahim, Mazin. "Neural networks in articulatory speech synthesis." Thesis, University of Liverpool, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.317191.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Fekkai, Souhila. "Fractal based speech recognition and synthesis." Thesis, De Montfort University, 2002. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.269246.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Wright, Richard Douglas. "An investigation of speech synthesis parameters." Thesis, University of Southampton, 1988. https://eprints.soton.ac.uk/52279/.

Full text
Abstract:
The model of speech production generally used in speech synthesis is that of a source modified by a digital filter. The major difference between a number of models is the form of the digital filter. The purpose of this research is to compare the properties of these filters when used for speech synthesis. Six models were investigated: (1) series resonance; (2) direct form; (3) reflection coefficients; (4) area function; (5) parallel resonance; and (6) a simple articulatory model. Types (2,3,4) are three varieties of linear predictive coding (LPC) parameters. There are five parts to the investigation: (1) an historical survey of models for speech synthesis and their problems; (2) a formal description of the models and their analytical relationships; (3) an objective assessment of the behaviour of the models during interpolation; (4) measurement of intelligibility (using a FAAF test); and (5) measurement of naturalness. Principal results are: synthesizer types (1) to (4) are all-pole models, formally equivalent in the steady state. But when the parameters of any of the models are interpolated, consequences for motion of vocal tract resonances (formants) differ. These differences exceed the discrimination limen for formant frequency, and make a small but statistically significant difference to intelligibility, but not to naturalness. Simple linear interpolation was found to be as good as cosine or piecewise-linear interpolation. Complete lack of interpolation reduced intelligibility by 30%. Finally, the synthesis studied achieved as few place-of-articulation errors as did LPC speech, indicating that intelligibility was limited not by parameter and transition type, but by other factors such as the excitation signal, phoneme target values, and durations.
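As a small illustration of the interpolation schemes compared above, the sketch below interpolates a synthesizer parameter vector between two targets using either a linear or a cosine trajectory. The names and shapes are illustrative only and do not reflect the parameter sets used in the thesis.

```python
import numpy as np

def interpolate_params(p_start, p_end, n_frames, method="linear"):
    """Interpolate between two synthesizer parameter vectors over n_frames.
    'cosine' eases in and out; 'linear' moves at a constant rate."""
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    if method == "cosine":
        t = 0.5 * (1.0 - np.cos(np.pi * t))
    return (1.0 - t) * np.asarray(p_start) + t * np.asarray(p_end)
```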
APA, Harvard, Vancouver, ISO, and other styles
26

Andersson, Johan Sebastian. "Synthesis and evaluation of conversational characteristics in speech synthesis." Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/8891.

Full text
Abstract:
Conventional synthetic voices can synthesise neutral read aloud speech well. But, to make synthetic speech more suitable for a wider range of applications, the voices need to express more than just the word identity. We need to develop voices that can partake in a conversation and express, e.g. agreement, disagreement, hesitation, in a natural and believable manner. In speech synthesis there are currently two dominating frameworks: unit selection and HMM-based speech synthesis. Both frameworks utilise recordings of human speech to build synthetic voices. Despite the fact that the content of the recordings determines the segmental and prosodic phenomena that can be synthesised, surprisingly little research has been made on utilising the corpus to extend the limited behaviour of conventional synthetic voices. In this thesis we will show how natural sounding conversational characteristics can be added to both unit selection and HMM-based synthetic voices, by adding speech from a spontaneous conversation to the voices. We recorded a spontaneous conversation, and by manually transcribing and selecting utterances we obtained approximately two thousand utterances from it. These conversational utterances were rich in conversational speech phenomena, but they lacked the general coverage that allows unit selection and HMM-based synthesis techniques to synthesise high quality speech. Therefore we investigated a number of blending approaches in the synthetic voices, where the conversational utterances were augmented with conventional read aloud speech. The synthetic voices that contained conversational speech were contrasted with conventional voices without conversational speech. The perceptual evaluations showed that the conversational voices were generally perceived by listeners as having a more conversational style than the conventional voices. This conversational style was largely due to the conversational voices’ ability to synthesise utterances that contained conversational speech phenomena in a more natural manner than the conventional voices. Additionally, we conducted an experiment that showed that natural sounding conversational characteristics in synthetic speech can convey pragmatic information, in our case an impression of certainty or uncertainty, about a topic to a listener. The conclusion drawn is that the limited behaviour of conventional synthetic voices can be enriched by utilising conversational speech in both unit selection and HMM-based speech synthesis.
APA, Harvard, Vancouver, ISO, and other styles
27

Merva, Monica Ann. "The effects of speech rate, message repetition, and information placement on synthesized speech intelligibility." Thesis, Virginia Tech, 1987. http://hdl.handle.net/10919/41554.

Full text
Abstract:
Recent improvements in speech technology have made synthetic speech a viable I/O alternative. However, little research has focused on optimizing the various speech parameters which influence system performance. This study examined the effects of speech rate, message repetition, and the placement of information in a message. Briefly, subjects heard messages generated by a speech synthesizer and were asked to transcribe what they had heard. After entering each transcription, subjects rated the perceived difficulty of the preceding message and how confident they were of their response. The accuracy of their response, system response time, and response latency were recorded.
Transcription accuracy was best for messages spoken at 150 or 180 wpm and for messages repeated either twice or three times. Words at the end of messages were transcribed more accurately than words at the beginning of messages. Response latencies were fastest at 180 wpm with 3 repetitions and rose as the number of repetitions decreased. System response times were shortest when a message was repeated only once. The subjective certainty and difficulty ratings indicated that subjects were aware of errors when incorrectly transcribing a message. These results suggest that a) message rates should lie below 210 wpm, b) a repeat feature should be included in speech interface designs, and c) important information should be contained at the end of messages.
Master of Science
APA, Harvard, Vancouver, ISO, and other styles
28

Varga, A. P. "Multipulse excited linear predictive analysis in speech coding and constructive speech synthesis." Thesis, University of Cambridge, 1985. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.372909.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Merritt, Thomas. "Overcoming the limitations of statistical parametric speech synthesis." Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/22071.

Full text
Abstract:
At the time of beginning this thesis, statistical parametric speech synthesis (SPSS) using hidden Markov models (HMMs) was the dominant synthesis paradigm within the research community. SPSS systems are effective at generalising across the linguistic contexts present in training data to account for inevitable unseen linguistic contexts at synthesis-time, making these systems flexible and their performance stable. However HMM synthesis suffers from a ‘ceiling effect’ in the naturalness achieved, meaning that, despite great progress, the speech output is rarely confused for natural speech. There are many hypotheses for the causes of reduced synthesis quality, and subsequent required improvements, for HMM speech synthesis in literature. However, until this thesis, these hypothesised causes were rarely tested. This thesis makes two types of contributions to the field of speech synthesis; each of these appears in a separate part of the thesis. Part I introduces a methodology for testing hypothesised causes of limited quality within HMM speech synthesis systems. This investigation aims to identify what causes these systems to fall short of natural speech. Part II uses the findings from Part I of the thesis to make informed improvements to speech synthesis. The usual approach taken to improve synthesis systems is to attribute reduced synthesis quality to a hypothesised cause. A new system is then constructed with the aim of removing that hypothesised cause. However this is typically done without prior testing to verify the hypothesised cause of reduced quality. As such, even if improvements in synthesis quality are observed, there is no knowledge of whether a real underlying issue has been fixed or if a more minor issue has been fixed. In contrast, I perform a wide range of perceptual tests in Part I of the thesis to discover what the real underlying causes of reduced quality in HMM synthesis are and the level to which they contribute. Using the knowledge gained in Part I of the thesis, Part II then looks to make improvements to synthesis quality. Two well-motivated improvements to standard HMM synthesis are investigated. The first of these improvements follows on from averaging across differing linguistic contexts being identified as a major contributing factor to reduced synthesis quality. This is a practice typically performed during decision tree regression in HMM synthesis. Therefore a system which removes averaging across differing linguistic contexts and instead performs averaging only across matching linguistic contexts (called rich-context synthesis) is investigated. The second of the motivated improvements follows the finding that the parametrisation (i.e., vocoding) of speech, standard practice in SPSS, introduces a noticeable drop in quality before any modelling is even performed. Therefore the hybrid synthesis paradigm is investigated. These systems aim to remove the effect of vocoding by using SPSS to inform the selection of units in a unit selection system. Both of the motivated improvements applied in Part II are found to make significant gains in synthesis quality, demonstrating the benefit of performing the style of perceptual testing conducted in the thesis.
APA, Harvard, Vancouver, ISO, and other styles
30

Shannon, Sean Matthew. "Probabilistic acoustic modelling for parametric speech synthesis." Thesis, University of Cambridge, 2014. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.708415.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Strömbergsson, Sofia. "The /k/s, the /t/s, and the inbetweens : Novel approaches to examining the perceptual consequences of misarticulated speech." Doctoral thesis, KTH, Tal-kommunikation, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-143102.

Full text
Abstract:
This thesis comprises investigations of the perceptual consequences of children’s misarticulated speech – as perceived by clinicians, by everyday listeners, and by the children themselves. By inviting methods from other areas to the study of speech disorders, this work demonstrates some successful cases of cross-fertilization. The population in focus is children with a phonological disorder (PD), who misarticulate /t/ and /k/. A theoretical assumption underlying this work is that errors in speech production are often paralleled in perception, e.g. that children base their decision on whether a speech sound is a /t/ or a /k/ on other acoustic-phonetic criteria than those employed by proficient language users. This assumption, together with an aim at stimulating self-monitoring in these children, motivated two of the included studies. Through these studies, new insights into children’s perception of their own speech were achieved – insights entailing both clinical and psycholinguistic implications. For example, the finding that children with PD generally recognize themselves as the speaker in recordings of their own utterances lends support to the use of recordings in therapy, to attract children’s attention to their own speech production. Furthermore, through the introduction of a novel method for automatic correction of children’s speech errors, these findings were extended with the observation that children with PD tend to evaluate misarticulated utterances as correct when just having produced them, and to perceive inaccuracies better when time has passed. Another theme in this thesis is the gradual nature of speech perception related to phonological categories, and a concern that perceptual sensitivity is obscured in descriptions based solely on discrete categorical labels. This concern is substantiated by the finding that listeners rate “substitutions” of [t] for /k/ as less /t/-like than correct productions of [t] for intended /t/. Finally, a novel method of registering listener reactions during the continuous playback of misarticulated speech is introduced, demonstrating a viable approach to exploring how different speech errors influence intelligibility and/or acceptability. By integrating such information in the prioritizing of therapeutic targets, intervention may be better directed at those patterns that cause the most problems for the child in his or her everyday life.
APA, Harvard, Vancouver, ISO, and other styles
32

Alissali, Mamoun. "Architecture logicielle pour la synthèse multilingue de la parole." Grenoble INPG, 1993. http://www.theses.fr/1993INPG0037.

Full text
Abstract:
This thesis presents a study of the software specifications for multilingual speech synthesis. The objective is the design and implementation of a software architecture defining a collection of tools suited to the development and use of multilingual speech synthesis systems. The resulting environment, called COMPOST, allows reconfigurable systems to be built from collections of modules written in two programming languages: a specialised rewriting language, also called COMPOST, and a traditional programming language (C). A standardised interface allows co-programming in these two languages. Examples of developing and running speech synthesis systems under COMPOST are then presented, followed by its multilingual capabilities and its distributed architecture. In conclusion, the broad lines of the future evolution of this environment are outlined.
APA, Harvard, Vancouver, ISO, and other styles
33

Kain, Alexander Blouke. "High resolution voice transformation /." Full text open access at:, 2001. http://content.ohsu.edu/u?/etd,189.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Chung, Jae H. "A new homomorphic vocoder framework using analysis-by-synthesis excitation analysis." Diss., Georgia Institute of Technology, 1991. http://hdl.handle.net/1853/15471.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Crosmer, Joel R. "Very low bit rate speech coding using the line spectrum pair transformation of the LPC coefficients." Diss., Georgia Institute of Technology, 1985. http://hdl.handle.net/1853/15739.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Cummings, Kathleen E. "Analysis, synthesis, and recognition of stressed speech." Diss., Georgia Institute of Technology, 1992. http://hdl.handle.net/1853/15673.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Hagrot, Joel. "A Data-Driven Approach For Automatic Visual Speech In Swedish Speech Synthesis Applications." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-246393.

Full text
Abstract:
This project investigates the use of artificial neural networks for visual speech synthesis. The objective was to produce a framework for animated chat bots in Swedish. A survey of the literature on the topic revealed that the state-of-the-art approach was to use ANNs with either audio or phoneme sequences as input. Three subjective surveys were conducted, both in the context of the final product and in a more neutral context with less post-processing. They compared the ground truth, captured using the depth-sensing camera of the iPhone X, against both the ANN model and a baseline model. The statistical analysis used mixed effects models to find any statistically significant differences. The temporal dynamics and the error were also analyzed. The results show that a relatively simple ANN was capable of learning a mapping from phoneme sequences to blend shape weight sequences with satisfactory results, except for the fact that certain consonant requirements were unfulfilled. The issues with certain consonants were also observed, to some extent, in the ground truth. Post-processing with consonant-specific overlays made the ANN's animations indistinguishable from the ground truth, and the subjects perceived them as more realistic than the baseline model's animations. The ANN model proved useful in learning the temporal dynamics and coarticulation effects for vowels, but may have needed more data to properly satisfy the requirements of certain consonants. For the purposes of the intended product, these requirements can be satisfied using consonant-specific overlays.
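As a rough illustration of the kind of mapping described above, from phoneme input to blend shape weights, here is a toy feed-forward model in PyTorch. The layer sizes, window length and framework choice are assumptions made for the example and are not taken from the thesis.

```python
import torch
import torch.nn as nn

class PhonemeToBlendshapes(nn.Module):
    """Toy feed-forward mapping from a window of one-hot phoneme frames to
    blend shape weights in [0, 1]; all sizes are placeholders."""
    def __init__(self, n_phonemes=50, window=5, n_blendshapes=20, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                          # (batch, window, n_phonemes) -> (batch, window*n_phonemes)
            nn.Linear(window * n_phonemes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_blendshapes),
            nn.Sigmoid(),                          # constrain weights to [0, 1]
        )

    def forward(self, phoneme_window):
        return self.net(phoneme_window)

# Hypothetical usage: one dummy window of 5 phoneme frames -> (1, 20) blend shape weights.
# model = PhonemeToBlendshapes()
# weights = model(torch.zeros(1, 5, 50))
```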
APA, Harvard, Vancouver, ISO, and other styles
38

Manukyan, Karen. "MULTI-PLATFORM IMPLEMENTATION OF SPEECH APIS." The Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=osu1211350344.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Makashay, Matthew Joel. "Individual Differences in Speech and Non-Speech Perception of Frequency and Duration." The Ohio State University, 2003. http://rave.ohiolink.edu/etdc/view?acc_num=osu1047489733.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Engwall, Olov. "Tongue Talking : Studies in Intraoral Speech Synthesis." Doctoral thesis, KTH, Tal, musik och hörsel, 2002. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-3380.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Sakai, Shinsuke. "A Probabilistic Approach to Concatenative Speech Synthesis." 京都大学 (Kyoto University), 2012. http://hdl.handle.net/2433/152508.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Hassanain, Elham. "Novel cepstral techniques applied to speech synthesis." Thesis, University of Surrey, 2006. http://epubs.surrey.ac.uk/842745/.

Full text
Abstract:
The aim of this research was to develop an improved analysis and synthesis model for utilization in speech synthesis. Conventionally, linear prediction has been used in speech synthesis but is restricted by the requirement of an all-pole, minimum phase model. Here, cepstral homomorphic deconvolution techniques were used to approach the problem, since there are fewer constraints on the model and some evidence in the literature that shows that cepstral homomorphic deconvolution can give improved performance. Specifically the spectral root cepstrum was developed in an attempt to separate the magnitude and phase spectra. Analysis and synthesis filters were developed on these two data streams independently in an attempt to improve the process. It is shown that independent analysis of the magnitude and phase spectra is preferable to a combined analysis, and so the concept of a phase cepstrum is introduced, and a number of different phase cepstra are defined. Although extremely difficult for many types of signals, phase analysis via a root cepstrum and the Hartley phase cepstrum give encouraging results for a wide range of both minimum and maximum phase signals. Overall, this research has shown that improved synthesis can be achieved with these techniques.
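For orientation, the ordinary real cepstrum and the spectral root cepstrum mentioned above can be sketched as follows. This shows only the magnitude-spectrum side, and the exponent value is a free illustrative choice rather than one taken from the thesis.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(frame)
    return np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)).real

def spectral_root_cepstrum(frame, gamma=0.5):
    """Spectral root cepstrum: the log is replaced by raising the magnitude
    spectrum to a power 0 < gamma < 1 (gamma is an arbitrary choice here)."""
    spectrum = np.fft.fft(frame)
    return np.fft.ifft(np.abs(spectrum) ** gamma).real
```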
APA, Harvard, Vancouver, ISO, and other styles
43

Vepa, Jithendra. "Join cost for unit selection speech synthesis." Thesis, University of Edinburgh, 2004. http://hdl.handle.net/1842/1452.

Full text
Abstract:
Undoubtedly, state-of-the-art unit selection-based concatenative speech systems produce very high quality synthetic speech. This is due to a large speech database containing many instances of each speech unit, with a varied and natural distribution of prosodic and spectral characteristics. The join cost, which measures how well two units can be joined together, is one of the main criteria for selecting appropriate units from this large speech database. The ideal join cost is one that measures perceived discontinuity based on easily measurable spectral properties of the units being joined, in order to ensure smooth and natural sounding synthetic speech. During the first part of my research, I investigated various spectrally based distance measures for use in computation of the join cost by designing a perceptual listening experiment. A variation to the usual perceptual test paradigm is proposed in this thesis by deliberately including a wide range of qualities of join in polysyllabic words. The test stimuli are obtained using a state-of-the-art unit-selection text-to-speech system: rVoice from Rhetorical Systems Ltd. Three spectral feature types are considered, Mel-frequency cepstral coefficients (MFCC), line spectral frequencies (LSF) and multiple centroid analysis (MCA) parameters; together with various statistical distances (Euclidean, Kullback-Leibler, Mahalanobis), these are used to obtain distance measures. Based on the correlations between perceptual scores and these spectral distances, I proposed new spectral distance measures, which correlate well with human perception of concatenation discontinuities. The second part of my research concentrates on combining join cost computation and the smoothing operation, which is required to disguise joins, by learning an underlying representation from the acoustic signal. In order to accomplish this task, I have chosen linear dynamic models (LDM), sometimes known as Kalman filters. Three different initialisation schemes are used prior to Expectation-Maximisation (EM) in LDM training. Once the models are trained, the join cost is computed based on the error between model predictions and actual observations. Analytical measures are derived based on the shape of this error plot. These measures and initialisation schemes are compared by computing correlations using the perceptual data. The LDMs are also able to smooth the observations, which are then used to synthesise speech. To evaluate the LDM smoothing operation, another listening test is performed where it is compared with a standard method (simple linear interpolation). In the third part of my research, I compared the best three join cost functions, chosen from the first and second parts, subjectively using a listening test. In this test, I also evaluated different smoothing methods: no smoothing, linear smoothing and smoothing achieved using LDMs.
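A join cost of the spectrally based kind investigated in the first part of the thesis can be sketched, in its simplest form, as a distance between the feature frames on either side of a candidate join. The function below is an illustrative stand-in (Euclidean or Mahalanobis over MFCC frames), not one of the specific measures proposed in the thesis.

```python
import numpy as np

def join_cost(mfcc_left, mfcc_right, inv_cov=None):
    """Distance between the MFCC frames on either side of a candidate join:
    Euclidean by default, Mahalanobis if an inverse covariance is supplied."""
    d = np.asarray(mfcc_left) - np.asarray(mfcc_right)
    if inv_cov is None:
        return float(np.sqrt(np.dot(d, d)))
    return float(np.sqrt(d @ inv_cov @ d))
```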
APA, Harvard, Vancouver, ISO, and other styles
44

Watts, Oliver Samuel. "Unsupervised learning for text-to-speech synthesis." Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/7982.

Full text
Abstract:
This thesis introduces a general method for incorporating the distributional analysis of textual and linguistic objects into text-to-speech (TTS) conversion systems. Conventional TTS conversion uses intermediate layers of representation to bridge the gap between text and speech. Collecting the annotated data needed to produce these intermediate layers is a far from trivial task, possibly prohibitively so for languages in which no such resources are in existence. Distributional analysis, in contrast, proceeds in an unsupervised manner, and so enables the creation of systems using textual data that are not annotated. The method therefore aids the building of systems for languages in which conventional linguistic resources are scarce, but is not restricted to these languages. The distributional analysis proposed here places the textual objects analysed in a continuous-valued space, rather than specifying a hard categorisation of those objects. This space is then partitioned during the training of acoustic models for synthesis, so that the models generalise over objects' surface forms in a way that is acoustically relevant. The method is applied to three levels of textual analysis: to the characterisation of sub-syllabic units, word units and utterances. Entire systems for three languages (English, Finnish and Romanian) are built with no reliance on manually labelled data or language-specific expertise. Results of a subjective evaluation are presented.
APA, Harvard, Vancouver, ISO, and other styles
45

Vine, Daniel Samuel Gordon. "Time-domain concatenative text-to-speech synthesis." Thesis, Bournemouth University, 1998. http://eprints.bournemouth.ac.uk/351/.

Full text
Abstract:
A concatenation framework for time-domain concatenative speech synthesis (TDCSS) is presented and evaluated. In this framework, speech segments are extracted from CV, VC, CVC and CC waveforms, and abutted. Speech rhythm is controlled via a single duration parameter, which specifies the initial portion of each stored waveform to be output. An appropriate choice of segmental durations reduces spectral discontinuity problems at points of concatenation, thus reducing reliance upon smoothing procedures. For text-to-speech considerations, a segmental timing system is described, which predicts segmental durations at the word level, using a timing database and a pattern matching look-up algorithm. The timing database contains segmented words with associated duration values, and is specific to an actual inventory of concatenative units. Segmental duration prediction accuracy improves as the timing database size increases. The problem of incomplete timing data has been addressed by using 'default duration' entries in the database, which are created by re-categorising existing timing data according to articulation manner. If segmental duration data are incomplete, a default duration procedure automatically categorises the missing speech segments according to segment class. The look-up algorithm then searches the timing database for duration data corresponding to these re-categorised segments. The timing database is constructed using an iterative synthesis/adjustment technique, in which a 'judge' listens to synthetic speech and adjusts segmental durations to improve naturalness. This manual technique for constructing the timing database has been evaluated. Since the timing data is linked to an expert judge's perception, an investigation examined whether the expert judge's perception of speech naturalness is representative of people in general. Listening experiments revealed marked similarities between an expert judge's perception of naturalness and that of the experimental subjects. It was also found that the expert judge's perception remains stable over time. A synthesis/adjustment experiment found a positive linear correlation between segmental durations chosen by an experienced expert judge and duration values chosen by subjects acting as expert judges. A listening test confirmed that between 70% and 100% intelligibility can be achieved with words synthesised using TDCSS. In a further test, a TDCSS synthesiser was compared with five well-known text-to-speech synthesisers, and was ranked fifth most natural out of six. An alternative concatenation framework (TDCSS2) was also evaluated, in which duration parameters specify both the start point and the end point of the speech to be extracted from a stored waveform and concatenated. In a similar listening experiment, TDCSS2 stimuli were compared with five well-known text-to-speech synthesisers, and were ranked fifth most natural out of six.
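The default-duration fallback described above can be pictured as a two-level lookup: try the segment-specific entry first, then fall back to an entry keyed by the segment's manner class. The sketch below uses invented keys and values purely for illustration.

```python
# Toy manner classes; the real database groups segments by articulation manner.
MANNER_CLASS = {"p": "stop", "t": "stop", "k": "stop",
                "s": "fricative", "f": "fricative",
                "a": "vowel", "i": "vowel"}

def segment_duration(segment, context, timing_db, default_db):
    """Look up a duration (ms) for `segment` in `context`; if the timing
    database lacks an entry, fall back to a default keyed by manner class."""
    if (segment, context) in timing_db:
        return timing_db[(segment, context)]
    manner = MANNER_CLASS.get(segment, "other")
    return default_db[(manner, context)]

# Hypothetical usage with made-up entries:
# timing_db = {("t", "word-final"): 85}
# default_db = {("stop", "word-final"): 80, ("fricative", "word-final"): 95}
# segment_duration("k", "word-final", timing_db, default_db)  # -> 80 (fallback)
```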
APA, Harvard, Vancouver, ISO, and other styles
46

Edge, James D. "Techniques for the synthesis of visual speech." Thesis, University of Sheffield, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.419276.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Solewicz, Jose Alberto. "Text-to-speech synthesis for Brazilian Portuguese." Pontifícia Universidade Católica do Rio de Janeiro, 1993. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=8690@1.

Full text
Abstract:
This work presents an unrestricted text-to-speech synthesis system for Brazilian Portuguese. The system is based on the concatenation, by rules, of previously coded speech units. An extremely reduced inventory of synthesis units (149) is proposed, comprised mostly of consonant-vowel (CV) transitions, which represent crucial acoustic segments in the speech production process. Production of highly intelligible speech is shown to be possible through concatenation of these units. A CELP model is also proposed as a compression and synthesis structure for the unit inventory, including the adaptations necessary to modify the prosody of the signal during its decoding phase. Subjective tests showed that speech synthesised through the proposed CELP model is judged superior to that obtained with an LPC vocoder (mono-pulse/noise excitation), which is traditionally used in text-to-speech synthesis systems.
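As a rough illustration of "concatenation by rules" over a small CV-transition inventory, the Python sketch below maps a phoneme string onto stored units, preferring a CV transition where one exists and falling back to single phonemes otherwise. The inventory, the phoneme notation and the fallback rule are assumptions made for the example; the thesis's actual system additionally codes the units with a CELP model, which is omitted here.

```python
# Minimal sketch of rule-based unit selection over a tiny CV-unit inventory.
INVENTORY = {"ka", "sa", "za", "fa", "la"}   # hypothetical CV units

def select_units(phonemes: list[str]) -> list[str]:
    """Map a phoneme string onto CV units, falling back to single phonemes."""
    units, i = [], 0
    while i < len(phonemes):
        pair = "".join(phonemes[i:i + 2])
        if pair in INVENTORY:        # prefer a stored CV transition
            units.append(pair)
            i += 2
        else:                        # otherwise emit the phoneme on its own
            units.append(phonemes[i])
            i += 1
    return units

print(select_units(["k", "a", "z", "a", "t"]))   # ['ka', 'za', 't']
```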
APA, Harvard, Vancouver, ISO, and other styles
48

Hardwick, John C. (John Clark). "A high quality speech analysis/synthesis system." Thesis, Massachusetts Institute of Technology, 1986. http://hdl.handle.net/1721.1/14901.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Halabi, Nawar. "Modern standard Arabic phonetics for speech synthesis." Thesis, University of Southampton, 2016. https://eprints.soton.ac.uk/409695/.

Full text
Abstract:
Arabic phonetics and phonology have not been adequately studied for the purposes of speech synthesis and speech synthesis corpus design. The only sources of knowledge available are either archaic or targeted towards other disciplines such as education. This research conducted a three-stage study. First, Arabic phonology research was reviewed in general, and the results of this review were triangulated with expert opinions – gathered throughout the project – to create a novel formalisation of Arabic phonology for speech synthesis. Secondly, this formalisation was used to create a speech corpus in Modern Standard Arabic and this corpus was used to produce a speech synthesiser. This corpus was the first to be constructed and published for this dialect of Arabic using scientifically-supported phonological formalisms. The corpus was semi-automatically annotated with phoneme boundaries and stress marks; it is word-aligned with the orthographical transcript. The accuracy of these alignments was compared with previous published work, which showed that even slightly less accurate alignments are sufficient for producing high quality synthesis. Finally, objective and subjective evaluations were conducted to assess the quality of this corpus. The objective evaluation showed that the corpus based on the proposed phonological formalism had sufficient phonetic coverage compared with previous work. The subjective evaluation showed that this corpus can be used to produce high quality parametric and unit selection speech synthesisers. In addition, it showed that the use of orthographically extracted stress marks can improve the quality of the generated speech for general purpose synthesis. These stress marks are the first to be tested for Modern Standard Arabic, which thus opens this subject for future research.
APA, Harvard, Vancouver, ISO, and other styles
50

Beněk, Tomáš. "Implementing and Improving a Speech Synthesis System." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236079.

Full text
Abstract:
This thesis deals with text-to-speech synthesis. It gives a basic theoretical introduction to text-to-speech synthesis. The work is built on the MARY TTS system, which makes it possible to reuse existing modules to create a custom text-to-speech system, and on speech synthesis with hidden Markov models trained on a purpose-built speech database. Several simple programs were created to ease the construction of the database, and the addition of a new language and voice to the MARY TTS system was demonstrated. A module and a voice for the Czech language were created and published. An algorithm for grapheme-to-phoneme conversion was described and implemented.
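A grapheme-to-phoneme step of the kind mentioned in the abstract can be sketched with a handful of substitution rules, shown below in Python. The rule set is a toy: it only handles the "ch" digraph and word-final devoicing of a few obstruents, and the phoneme notation is improvised for the example rather than taken from the thesis.

```python
# Minimal sketch of a rule-based grapheme-to-phoneme conversion step.
DIGRAPHS = {"ch": "x"}
FINAL_DEVOICING = {"b": "p", "d": "t", "v": "f", "z": "s", "h": "x"}

def g2p(word: str) -> list[str]:
    """Convert an orthographic word into a phoneme list by simple rules."""
    phonemes, i = [], 0
    word = word.lower()
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:          # handle two-letter graphemes first
            phonemes.append(DIGRAPHS[word[i:i + 2]])
            i += 2
        else:
            phonemes.append(word[i])
            i += 1
    if phonemes and phonemes[-1] in FINAL_DEVOICING:  # word-final devoicing
        phonemes[-1] = FINAL_DEVOICING[phonemes[-1]]
    return phonemes

print(g2p("chleb"))   # ['x', 'l', 'e', 'p']
```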
APA, Harvard, Vancouver, ISO, and other styles