
Dissertations / Theses on the topic 'Speech synthesis/recognition'

Consult the top 39 dissertations / theses for your research on the topic 'Speech synthesis/recognition.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Sun, Felix (Felix W.). "Speech Representation Models for Speech Synthesis and Multimodal Speech Recognition." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/106378.

Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. Cataloged from the student-submitted PDF version; the certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (pages 59-63). The field of speech recognition has seen steady advances over the last two decades, leading to the accurate, real-time recognition systems available on mobile phones today. In this thesis, I apply speech modeling techniques developed for recognition to two other speech problems: speech synthesis and multimodal speech recognition with images. In both problems, there is a need to learn a relationship between speech sounds and another source of information. For speech synthesis, I show that using a neural network acoustic model results in a synthesizer that is more tolerant of noisy training data than previous work. For multimodal recognition, I show how information from images can be effectively integrated into the recognition search framework, resulting in improved accuracy when image data is available.
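The multimodal integration described here can be pictured as log-linear rescoring of recognition hypotheses with an image-grounded score. The sketch below illustrates only that general scheme; the image_match_score function and the weight lam are hypothetical stand-ins, not the thesis's actual models.

def rescore_with_image(nbest, image, image_match_score, lam=0.5):
    # nbest: list of (transcript, acoustic_logprob) pairs from a recogniser.
    # image_match_score: hypothetical model scoring how well a transcript
    # matches the image, on a log-probability scale.
    rescored = [(text, acoustic + lam * image_match_score(text, image))
                for text, acoustic in nbest]
    return max(rescored, key=lambda pair: pair[1])  # best fused hypothesis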
2

Fekkai, Souhila. "Fractal based speech recognition and synthesis." Thesis, De Montfort University, 2002. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.269246.

3

Cummings, Kathleen E. "Analysis, synthesis, and recognition of stressed speech." Diss., Georgia Institute of Technology, 1992. http://hdl.handle.net/1853/15673.

4

McCulloch, Neil Andrew. "Neural network approaches to speech recognition and synthesis." Thesis, Keele University, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.387255.

5

Scott, Simon David. "A data-driven approach to visual speech synthesis." Thesis, University of Bath, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.307116.

6

Devaney, Jason Wayne. "A study of articulatory gestures for speech synthesis." Thesis, University of Liverpool, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.284254.

7

Haque, Serajul. "Perceptual features for speech recognition." University of Western Australia. School of Electrical, Electronic and Computer Engineering, 2008. http://theses.library.uwa.edu.au/adt-WU2008.0187.

Abstract:
Automatic speech recognition (ASR) is one of the most important research areas in the field of speech technology and research; it is also known as the recognition of speech by a machine or by some artificial intelligence. However, in spite of focused research in this field for the past several decades, robust speech recognition with high reliability has not been achieved, as it degrades in the presence of speaker variabilities, channel mismatch conditions, and noisy environments. The superb ability of the human auditory system has motivated researchers to include features of human perception in the speech recognition process. This dissertation investigates the roles of perceptual features of human hearing in automatic speech recognition in clean and noisy environments. Methods of simplified synaptic adaptation and two-tone suppression by companding are introduced by temporal processing of speech using a zero-crossing algorithm. It is observed that a high-frequency enhancement technique such as synaptic adaptation performs better in stationary Gaussian white noise, whereas a low-frequency enhancement technique such as two-tone suppression performs better in non-Gaussian, non-stationary noise types. The effects of static compression on ASR parametrization are investigated as observed in the psychoacoustic input/output (I/O) perception curves. A method of frequency-dependent asymmetric compression, that is, higher compression in the higher frequency regions than in the lower frequency regions, is proposed; by asymmetric compression, degradation of the spectral contrast of the low-frequency formants due to the added compression is avoided. A novel feature extraction method for ASR based on the auditory processing in the cochlear nucleus is presented. The processing for synchrony detection, average discharge (mean rate), and two-tone suppression is segregated and performed separately at the feature extraction level, according to the differential processing scheme observed in the AVCN, PVCN, and DCN, respectively, of the cochlear nucleus. It is further observed that improved ASR performance can be achieved by separating synchrony detection from synaptic processing. A time-frequency perceptual spectral subtraction method based on several psychoacoustic properties of human audition is developed and evaluated with an ASR front-end; an auditory masking threshold is determined based on these psychoacoustic effects. It is observed that in speech recognition applications, spectral subtraction utilizing psychoacoustics may be used for improved performance in noisy conditions, and performance may be further improved if masking of noise by the tonal components is augmented by spectral subtraction in the masked region.
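As a concrete reference point for the spectral-subtraction front end mentioned above, a minimal magnitude spectral subtraction might look like the sketch below, assuming the first frames of the recording contain noise only. The psychoacoustic masking threshold and time-frequency refinements of the thesis are not reproduced, and the over-subtraction and flooring parameters are illustrative.

import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(noisy, fs, noise_frames=10, alpha=2.0, beta=0.01):
    # Short-time analysis of the noisy signal
    _, _, X = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    # Noise magnitude estimated from leading frames assumed noise-only
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Over-subtract the noise estimate; floor the result so magnitudes
    # stay positive (unchecked flooring causes "musical noise")
    clean_mag = np.maximum(mag - alpha * noise_mag, beta * mag)
    _, clean = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return clean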
8

Peters, Richard Alan II. "A LINEAR PREDICTION CODING MODEL OF SPEECH (SYNTHESIS, LPC, COMPUTER, ELECTRONIC)." Thesis, The University of Arizona, 1985. http://hdl.handle.net/10150/291240.

9

Benkrid, A. "Real time TLM vocal tract modelling." Thesis, University of Nottingham, 1989. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.352958.

10

Hu, Hongwei. "Towards an improved model of dynamics for speech recognition and synthesis." Thesis, University of Birmingham, 2012. http://etheses.bham.ac.uk//id/eprint/3704/.

Abstract:
This thesis describes the research on the use of non-linear formant trajectories to model speech dynamics under the framework of a multiple-level segmental hidden Markov model (MSHMM). The particular type of intermediate-layer model investigated in this study is based on the 12-dimensional parallel formant synthesiser (PFS) control parameters, which can be directly used to synthesise speech with a formant synthesiser. The non-linear formant trajectories are generated by using the speech parameter generation algorithm proposed by Tokuda and colleagues. The performance of the newly developed non-linear trajectory model of dynamics is tested against the piecewise linear trajectory model in both speech recognition and speech synthesis. In speech synthesis experiments, the 12 PFS control parameters and their time derivatives are used as the feature vectors in the HMM-based text-to-speech system. The human listening test and objective test results show that, despite the low overall quality of the synthetic speech, the non-linear trajectory model of dynamics can significantly improve the intelligibility and naturalness of the synthetic speech. Moreover, the generated non-linear formant trajectories match actual formant trajectories in real human speech fairly well. The N-best list rescoring paradigm is employed for the speech recognition experiments. Both context-independent and context-dependent MSHMMs, based on different formant-to-acoustic mapping schemes, are used to rescore an N-best list. The rescoring results show that the introduction of the non-linear trajectory model of formant dynamics results in statistically significant improvement under certain mapping schemes. In addition, the smoothing in the non-linear formant trajectories has been shown to be able to account for contextual effects such as coarticulation.
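For readers unfamiliar with the Tokuda-style parameter generation algorithm mentioned above, the sketch below shows its core computation in the one-dimensional case, assuming a simple [-0.5, 0, 0.5] delta window; it is a schematic of the published algorithm, not the thesis's implementation.

import numpy as np

def mlpg(mu, var):
    # mu, var: (T, 2) arrays holding per-frame means/variances of a
    # static feature (column 0) and its delta (column 1). Returns the
    # maximum-likelihood static trajectory c.
    T = mu.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                    # static window
        if 0 < t < T - 1:                    # delta window [-0.5, 0, 0.5]
            W[2 * t + 1, t - 1] = -0.5
            W[2 * t + 1, t + 1] = 0.5
    P = np.diag(1.0 / var.reshape(-1))       # diagonal precision matrix
    m = mu.reshape(-1)
    # Normal equations of the weighted least-squares problem:
    # (W' P W) c = W' P m
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ m)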
11

Tuerk, Christine M. "Automatic speech synthesis using auditory transforms and artificial neural networks." Thesis, University of Cambridge, 1992. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.385362.

12

Ho, Ching-Hsiang. "Speaker modelling for voice conversion." Thesis, Brunel University, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.365076.

13

Moore, John Humphrey. "Digitizing human faces for the analysis and synthesis of visible speech." Thesis, Leeds Beckett University, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.277886.

14

Richards, Elizabeth A. "Automatic formant labeling in continuous speech /." Online version of thesis, 1989. http://hdl.handle.net/1850/10543.

15

Bivainis, Robertas. "Balso atpažinimo programų lietuvinimo galimybių tyrimas." Master's thesis, Lithuanian Academic Libraries Network (LABT), 2013. http://vddb.laba.lt/obj/LT-eLABa-0001:E.02~2013~D_20130930_090749-15587.

Abstract:
This work analyses how the HTK speech recognition system operates and what steps must be taken in order to successfully recognize spoken Lithuanian words. It also reviews which speech technology concepts are needed to create a speech recognition program. Speech signal recognition models and hidden Markov models are central to speech recognition, so the analysis covers their operating principles and algorithms.
16

Berg, Brian LaRoy. "Investigating Speaker Features From Very Short Speech Records." Diss., Virginia Tech, 2001. http://hdl.handle.net/10919/28691.

Abstract:
A procedure is presented that is capable of extracting various speaker features, and is of particular value for analyzing records containing single words and shorter segments of speech. By taking advantage of the fast convergence properties of adaptive filtering, the approach is capable of modeling the nonstationarities due to both the vocal tract and vocal cord dynamics. Specifically, the procedure extracts the vocal tract estimate from within the closed glottis interval and uses it to obtain a time-domain glottal signal. This procedure is quite simple, requires minimal manual intervention (in cases of inadequate pitch detection), and is particularly unique because it derives both the vocal tract and glottal signal estimates directly from the time-varying filter coefficients rather than from the prediction error signal. Using this procedure, several glottal signals are derived from human and synthesized speech and are analyzed to demonstrate the glottal waveform modeling performance and kind of glottal characteristics obtained therewith. Finally, the procedure is evaluated using automatic speaker identity verification.
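Inverse filtering of this general kind can be sketched as follows; note that this frame-based LPC stand-in is far cruder than the thesis's adaptive, closed-glottis-interval procedure, and the model order is an illustrative choice.

import librosa
from scipy.signal import lfilter

def glottal_residual(frame, order=18):
    # Fit an all-pole vocal-tract model to a voiced frame, then pass the
    # speech through the inverse (prediction-error) filter so that what
    # remains approximates the glottal source.
    a = librosa.lpc(frame.astype(float), order=order)  # [1, a1, ..., ap]
    return lfilter(a, [1.0], frame)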
17

Matthews, Brett Alexander. "Probabilistic modeling of neural data for analysis and synthesis of speech." Diss., Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/50116.

Abstract:
This research consists of probabilistic modeling of speech audio signals and deep-brain neurological signals in brain-computer interfaces. A significant portion of this research consists of a collaborative effort with Neural Signals Inc., Duluth, GA, and Boston University to develop an intracortical neural prosthetic system for speech restoration in a human subject living with Locked-In Syndrome, i.e., he is paralyzed and unable to speak. The work is carried out in three major phases. We first use kernel-based classifiers to detect evidence of articulation gestures and phonological attributes in speech audio signals. We demonstrate that articulatory information can be used to decode speech content in speech audio signals. In the second phase of the research, we use neurological signals collected from a human subject with Locked-In Syndrome to predict intended speech content. The neural data were collected with a microwire electrode surgically implanted in speech motor cortex of the subject's brain, with the implant location chosen to capture extracellular electric potentials related to speech motor activity. The data include extracellular traces, and firing occurrence times for neural clusters in the vicinity of the electrode identified by an expert. We compute continuous firing rate estimates for the ensemble of neural clusters using several rate estimation methods and apply statistical classifiers to the rate estimates to predict intended speech content. We use Gaussian mixture models to classify short frames of data into 5 vowel classes and to discriminate intended speech activity in the data from non-speech. We then perform a series of data collection experiments with the subject designed to test explicitly for several speech articulation gestures, and decode the data offline. Finally, in the third phase of the research we develop an original probabilistic method for the task of spike-sorting in intracortical brain-computer interfaces, i.e., identifying and distinguishing action potential waveforms in extracellular traces. Our method uses both action potential waveforms and their occurrence times to cluster the data. We apply the method to semi-artificial data and partially labeled real data. We then classify neural spike waveforms, modeled with single multivariate Gaussians, using the method of minimum classification error for parameter estimation. Finally, we apply our joint waveforms and occurrence times spike-sorting method to neurological data in the context of a neural prosthesis for speech.
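The vowel-classification step can be illustrated with off-the-shelf Gaussian mixture models; the scikit-learn usage below is a sketch, and the class labels, component count, and feature layout are placeholders rather than the study's actual configuration.

import numpy as np
from sklearn.mixture import GaussianMixture

VOWELS = ["a", "e", "i", "o", "u"]  # placeholder labels for 5 classes

def train_models(frames_by_class, n_components=4):
    # One GMM per vowel, trained on (n_frames, n_features) firing-rate data
    models = {}
    for v in VOWELS:
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag")
        gmm.fit(frames_by_class[v])
        models[v] = gmm
    return models

def classify(models, frame):
    # Pick the class whose GMM assigns the highest log-likelihood
    scores = {v: m.score(frame.reshape(1, -1)) for v, m in models.items()}
    return max(scores, key=scores.get)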
18

Alsabaan, Majed Soliman K. "Pronunciation support for Arabic learners." Thesis, University of Manchester, 2015. https://www.research.manchester.ac.uk/portal/en/theses/pronunciation-support-for-arabic-learners(3db28816-90ed-4e8b-b64c-4bbd35f98be7).html.

Abstract:
The aim of the thesis is to find out whether providing feedback to Arabic language learners will help them improve their pronunciation, particularly of words involving sounds that are not distinguished in their native languages, and, if possible, what type of feedback is most helpful. To achieve this aim, we developed a computational tool with a number of component sub-tools, involving several substantial pieces of software. The first task was to ensure the system could distinguish between the more challenging sounds when produced by a native speaker, since without that it would not be possible to classify learners' attempts at these sounds. To this end, a number of experiments were carried out with the hidden Markov model toolkit (HTK), a well-known speech recognition toolkit, to ensure that it can distinguish between the confusable sounds, i.e. the ones that people have difficulty with. The tool analyses the differences between the user's pronunciation and that of a native speaker by using a grammar of minimal pairs, where each utterance is treated as coming from a family of similar words. This provides the ability to categorise learners' errors: if someone is trying to say cat and the recogniser thinks they have said cad, then it is likely that they are voicing the final consonant when it should be unvoiced (a worked sketch of this idea follows below). Extensive testing shows that the system can reliably distinguish such minimal pairs when they are produced by a native speaker, and that this approach does provide effective diagnostic information about errors. The tool provides feedback through three different sub-tools: an animation of the vocal tract, a synthesised version of the target utterance, and a set of written instructions. It was evaluated in a classroom setting by asking 50 Arabic students to use the different versions of the tool; each student had a thirty-minute session, working through a set of pronunciation exercises at their own pace. The results showed that pronunciation does improve over the course of a session, though it was not possible to determine whether the improvement is sustained over an extended period. The evaluation was carried out from three points of view: quantitative analysis, qualitative analysis, and a questionnaire. The quantitative analysis gives raw numbers indicating whether a learner improved their pronunciation; the qualitative analysis shows behaviour patterns of what a learner did and how they used the tool; and the questionnaire gathers learners' feedback and comments about the tool. We found that providing feedback does appear to help Arabic language learners, but we did not have enough data to see which form of feedback is most helpful. However, we provide an informative analysis of behaviour patterns showing how Arabic students used and interacted with the tool, which could be useful for further data analysis.
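The minimal-pair diagnosis can be reduced to a small lookup from the recognised word to an error description, sketched here with the abstract's own cat/cad example; the table entries are illustrative, not the tool's actual grammar.

# Hypothetical minimal-pair families keyed by target word; the value maps
# each confusable alternative to the phonological error it implies.
MINIMAL_PAIRS = {
    "cat": {"cad": "final consonant voiced (should be unvoiced)"},
}

def diagnose(target, recognised):
    if recognised == target:
        return "pronunciation matched the target"
    return MINIMAL_PAIRS.get(target, {}).get(recognised,
                                             "error type not in grammar")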
19

Šimkus, Ramūnas, and Tomas Stumbras. "Lietuvių kalbos priebalsių spektro analizė." Master's thesis, Lithuanian Academic Libraries Network (LABT), 2010. http://vddb.laba.lt/obj/LT-eLABa-0001:E.02~2010~D_20100903_125555-95238.

Abstract:
In the second half of the 20th century, research on speaker recognition and speech synthesis intensified greatly, and systems for speech recognition and synthesis now exist for widely spoken European languages such as English, French, and the Germanic languages; Lithuanian, however, still needs research because of its uniqueness. Voice technologies are particularly valuable in applications for disabled people (blind and partially sighted, or with limited mobility), where they are often the essential or even the only means of integration into society. Other application areas include call centres that automatically handle telephone conversations, automatic transport timetable enquiry systems, voice control of car subsystems, and continuous speech recognition for work with text editors. The subject of this research is the spectrum of Lithuanian consonants. The main spectral analysis methods for speech signals are linear prediction, the Fourier transform, and cepstral analysis; here, linear prediction with the Burg algorithm was used to find formants. Recordings of words were annotated and analysed with the PRAAT software, formant movements were obtained with the same program, and the resulting data were processed in MATLAB 6.5. The consonants were divided into groups, such as voiced and unvoiced, semivowels, plosives, and fricatives, and the influence of the vowel following each consonant was analysed. The data obtained are useful for improving quality in speech recognition and synthesis. The paper includes: 1. an analysis of speech generation; 2. spectral analysis methods; 3. the experimental methodology... [see full text]
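The formant-finding step (linear prediction via the Burg algorithm, then reading formant frequencies from the pole angles) can be sketched as follows. librosa's LPC routine uses Burg's method, but the model order and the 90 Hz floor below are illustrative choices, not the study's settings.

import numpy as np
import librosa

def formants(frame, sr, order=12):
    a = librosa.lpc(frame.astype(float), order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # one of each conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)   # pole angle -> Hz
    return np.sort(freqs[freqs > 90])            # drop near-DC poles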
20

Svensson, Cecilia. "Alternativa metoder för att kontrollera ett användargränsnitt i en browser för teknisk dokumentation." Thesis, Linköping University, Department of Science and Technology, 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-1775.

Abstract:
When searching for better and more practical interfaces between users and their computers, additional or alternative modes of communication between the two parties would be of great use. This thesis examines the possibilities of using eye and head movements as well as voice input as these alternative modes of communication. One part of the project is devoted to finding possible interaction techniques for navigating a computer interface with movements of the eye or the head; the result is four different interface controls, adapted to suit this kind of navigation and combined in a demo application. Another part of the project is devoted to the development of an application with voice control as the primary input method. The application developed is a simplified version of ActiViewer, developed by AerotechTelub Information & Media AB.
21

Mohamadi, Tayeb. "Synthèse à partir du texte de visages parlants : réalisation d'un prototype et mesures d'intelligibilité bimodale." Grenoble INPG, 1993. http://www.theses.fr/1993INPG0010.

Abstract:
The aim of this study is the geometric analysis of the different lip shapes of French, their audiovisual intelligibility, and the construction of a prototype French talking-face synthesizer. We first review the role of the lips in speech production and the contribution of seeing them to the intelligibility of degraded speech (a phonetic analysis of confusions among selected vowels and consonants was carried out in parallel), and we present the results of a study of lip geometry and movement that identified some twenty basic lip shapes, called visemes. We then present a prototype audiovisual text-to-speech synthesizer built from this set of visemes, together with its intelligibility evaluation. Finally, we evaluate the intelligibility gain in degraded natural speech provided by two synthetic lip models built at the ICP, with a comparison to the natural case.
22

Holmes, William Paul. "Voice input for the disabled /." Title page, contents and summary only, 1987. http://web4.library.adelaide.edu.au/theses/09ENS/09ensh749.pdf.

Abstract:
Thesis (M. Eng. Sc.)--University of Adelaide, 1987. Typescript. Includes a copy of a paper presented at TADSEM '85, the Australian Seminar on Devices for Expressive Communication and Environmental Control, co-authored by the author. Includes bibliographical references (leaves [115-121]).
23

Lirussi, Igor. "Human-Robot interaction with low computational-power humanoids." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/19120/.

Abstract:
This work investigates the possibilities of human-humanoid interaction with robots whose computational power is limited. The project was carried out during a year of work at the Computer and Robot Vision Laboratory (VisLab), part of the Institute for Systems and Robotics in Lisbon, Portugal. Communication, the basis of interaction, is simultaneously visual, verbal, and gestural. The robot's algorithm provides users with natural-language communication, catching and understanding the person's needs and feelings; the design of the system should, consequently, give it the capability to dialogue with people in a way that makes the understanding of their needs possible. To be natural, the whole experience is independent of the GUI, which is used only as an auxiliary instrument. Furthermore, the humanoid can communicate through gestures, touch, and visual perception and feedback. This creates a totally new type of interaction in which the robot is not just a machine to use, but a figure to interact and talk with: a social robot.
24

Hain, Horst-Udo. "Phonetische Transkription für ein multilinguales Sprachsynthesesystem." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2012. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-81777.

Abstract:
This thesis deals with a data-driven method for grapheme-to-phoneme conversion in a speech synthesis system. The task is to determine the pronunciation of arbitrary words, including words not contained in the system's lexicon. The architecture itself is language-independent; only the knowledge sources loaded at run time depend on the language, and creating knowledge sources for further languages should be possible largely automatically and without expert knowledge. Expert knowledge may be used to improve the results, but it must not be a prerequisite. Two neural networks determine the transcription: the first generates the phones to be realised, including syllable boundaries, from the word's letter sequence, and the second then determines the position of the word stress. This separation has the advantage that knowledge of the whole phone sequence can be used when placing the word accent; methods that determine the transcription in a single step must decide on the accent at the beginning of the word, before the pronunciation has even been established. The separation also makes it possible to train two networks, each tailored to its particular task. A special feature of the neural networks used here is a scaling layer introduced between the actual input and the hidden layer. The input and scaling layer are connected by a diagonal matrix whose weights are subject to weight decay, which amounts to rating the input information during training: input nodes with high information content are amplified, while less interesting nodes are attenuated, possibly to the point of being cut off entirely. The purpose of this connection is to reduce the influence of noise in the training data; by masking out unimportant input values, the network can concentrate on the important data, which speeds up training and improves the results. Combined with stepwise pruning of the weights, disturbing or unimportant connections within the network architecture are also deleted, further increasing the generalisation ability. The preparation of the lexica for generating the networks' training patterns is likewise automatic: dynamic time warping (DTW) is used to find the optimal path in a plane spanned by the letters of the word on one axis and its phone sequence on the other, yielding an assignment of phones to letters from which the training patterns are generated. To improve the transcription results further, a hybrid method using both the lexica and the networks was developed. Unknown words are first decomposed into parts found in the lexicon, and the phone sequences of these sub-words are assembled into the overall transcription, with gaps between the sub-words filled in by the neural networks. This is not straightforward, since errors can occur at the joints between the partial transcriptions; the problem is solved using the lexicon prepared for training-pattern generation, which contains an unambiguous assignment of phones to the letters that generate them, so the phones at the joints can be re-evaluated and transcription errors avoided. The publisher's edition of this dissertation appeared in 2005 from w.e.b.-Universitätsverlag Dresden (ISBN 3-937672-76-1).
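The scaling-layer idea translates naturally into a modern framework. The sketch below uses PyTorch (an assumption; it is not the original implementation) with illustrative layer sizes, applying weight decay only to the diagonal input gains so that uninformative inputs are driven toward zero.

import torch
import torch.nn as nn

class ScaledMLP(nn.Module):
    # A per-input gain (diagonal matrix) between input and hidden layer;
    # penalising these gains rates the inputs during training.
    def __init__(self, n_in, n_hidden, n_out):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(n_in))
        self.net = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid(),
                                 nn.Linear(n_hidden, n_out))

    def forward(self, x):
        return self.net(x * self.scale)

model = ScaledMLP(n_in=30, n_hidden=64, n_out=50)
optimiser = torch.optim.SGD([
    {"params": [model.scale], "weight_decay": 1e-3},  # decay the gains only
    {"params": model.net.parameters()},
], lr=0.1)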
25

Lelong, Amélie. "Convergence phonétique en interaction = Phonetic convergence in interaction." Thesis, Grenoble, 2012. http://www.theses.fr/2012GRENT079/document.

Abstract:
The work presented in this manuscript is based on the study of a phenomenon called phonetic convergence, which postulates that two people in interaction will tend to adapt how they talk to their partner for a communicative purpose. We developed a paradigm called "Verbal Dominoes" to collect a large corpus with which to characterize this phenomenon, the ultimate goal being to endow an embodied conversational agent with this adaptability in order to improve the quality of human-machine interactions. We carried out several studies investigating the phenomenon between pairs of strangers, good friends, and people from the same family, expecting the amplitude of convergence to be proportional to the social distance between the two speakers, and we found this result. We then studied the impact of knowledge of the linguistic target on adaptation. To characterize the phonetic convergence, we developed two methods: the first is based on a linear discriminant analysis between the MFCC coefficients of each speaker, and the second uses speech recognition techniques; the latter will allow us to study the phenomenon in less controlled conditions. Finally, we characterized the phonetic convergence with a subjective measurement, using a new perceptual test based on on-line detection of a change of speaker. The test was performed using signals coming from real interactions but also with synthetic data obtained with harmonic-plus-noise model (HNM) based adaptive synthesis; we obtained comparable results, demonstrating the quality of our adaptive synthesis.
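The first characterization method, a linear discriminant analysis between the two speakers' MFCC frames, might be prototyped as below. Reading the classifier's separability as a convergence proxy is my gloss on the abstract, and the feature settings are illustrative.

import numpy as np
import librosa
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def speaker_separability(wav_a, wav_b, sr=16000):
    # MFCC frames for each speaker, labelled 0 and 1
    mfcc = lambda y: librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
    Xa, Xb = mfcc(wav_a), mfcc(wav_b)
    X = np.vstack([Xa, Xb])
    y = np.concatenate([np.zeros(len(Xa)), np.ones(len(Xb))])
    lda = LinearDiscriminantAnalysis().fit(X, y)
    # High accuracy = speakers far apart; a fall over the course of an
    # interaction would suggest convergence.
    return lda.score(X, y)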
26

Paleari, Marco. "Informatique Affective : Affichage, Reconnaissance, et Synthèse par Ordinateur des Émotions." Phd thesis, Télécom ParisTech, 2009. http://pastel.archives-ouvertes.fr/pastel-00005615.

Abstract:
Affective computing concerns computation that relates to, arises from, or deliberately influences emotions, and it finds its natural application domain in high-level human-machine interaction. Affective computing can be divided into three main topics, namely display, recognition, and synthesis, and building an intelligent machine able to interact naturally with its user necessarily passes through all three phases. In this thesis we propose an architecture based mainly on Lisetti's Multimodal Affective User Interface model and on Scherer's Component Process Theory of emotions. We investigated techniques for the automatic, real-time extraction of emotions from facial expressions and vocal prosody, and we addressed the problems inherent in generating expressions on different platforms, whether virtual or robotic agents. Finally, we proposed and developed an architecture for intelligent agents capable of simulating the human emotion-appraisal process as described by Scherer.
27

Gourinda, Ahmed. "Codage et reconnaissance de la parole par quantification vectorielle." Grenoble 2 : ANRT, 1988. http://catalogue.bnf.fr/ark:/12148/cb37613884c.

28

Cosgrove, Paul. "Detection of frequency and intensity changes using synthetic vowels and other sounds." Thesis, Keele University, 1988. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.329556.

29

Monzo, Sánchez Carlos Manuel. "Modelado de la cualidad de la voz para la síntesis del habla expresiva." Doctoral thesis, Universitat Ramon Llull, 2010. http://hdl.handle.net/10803/9145.

Abstract:
This thesis is conducted within the existing working framework of the Grup de Recerca en Tecnologies Mèdia (GTM) research group of Enginyeria i Arquitectura La Salle, with the aim of providing man-machine interaction with more naturalness. To do this, we start from the limitations of the technology used up to now, detecting points of improvement where we could contribute solutions. Given that the naturalness of speech is closely linked with the expressivity it communicates, these improvement points are focused on the ability to work with emotions, or expressive speech styles in general. The final goal of this thesis is the generation of expressive speech styles in the field of Text-to-Speech (TTS) systems aimed at Expressive Speech Synthesis (ESS), making it possible to communicate an oral message with a certain expressivity that the listener is able to correctly perceive and interpret. Nevertheless, this goal involves different intermediate aims: to know the existing parameterization options, to understand each of the parameters, to detect the pros and cons of their use, to find out the relations between them and the expressive speech styles, and finally to carry out the expressive speech synthesis. All things considered, the synthesis process involves previous work in emotion recognition, which could be a complete research field in itself, since it shows the feasibility of using the selected parameters for discriminating the styles and provides the knowledge needed for the models used during the synthesis process. The search for improved naturalness has implied a better characterization of emotional or expressive speech, so we have researched parameterizations that could perform this task. These are the Voice Quality (VoQ) parameters, whose main feature is that they are able to characterize speech individually, identifying each factor that makes it unique. The potential benefits that this kind of parameterization can bring to natural interaction are twofold: expressive speech style recognition and synthesis. The VoQ parameterization proposal is not trying to replace prosody, but to work together with it to improve the results obtained so far. Once the parameter selection is conducted, the VoQ modelling is addressed (i.e. the analysis and modification methodology), so that each parameter can be extracted from the voice signal and later modified during synthesis. Variations are also proposed for the traditionally used parameters involved, adjusting their definitions to the expressive speech context. From here, we work on the relations with the expressive speech styles and eventually present the transformation methodology for these, by means of the joint modification of VoQ and prosody, for ESS in a TTS system.
30

Gous, Georgina. "Effects of manipulating fundamental frequency and speech rate on synthetic voice recognition performance and perceived speaker identity, sex, and age." Thesis, Nottingham Trent University, 2017. http://irep.ntu.ac.uk/id/eprint/33899/.

Abstract:
Vocal fundamental frequency (F0) and speech rate provide the listener with important information relating to the identity, sex, and age of the speaker. Furthermore, it has been demonstrated that manipulations in F0 or speech rate can lead to accentuation effects in voice memory: listeners appear to exaggerate the representation of a target voice in terms of F0 or speech rate, and mistakenly remember it as being higher or lower in F0, or faster or slower in speech rate, than the voice originally heard. The aim of this thesis was to understand the effect of manipulations/shifts in F0 or speech rate on voice matching performance and on perceived speaker identity, sex, and age. Synthesised male and female voices speaking prescribed sentences were generated and shifted in either F0 or speech rate. In the first set of experiments (Experiments 2, 3, and 4), male and female listeners made judgements about the perceived identity, sex, or age of the speaker. In the second set of experiments (Experiments 5, 6, and 7), male and female listeners made target-matching responses for voices presented with and without a delay, and with different spoken sentences. The results of Experiments 2, 3, and 4 indicated the following: (1) Shifts in either F0 or speech rate increased uncertainty about the identity of the speaker, though identity judgements were more robust to shifts in speech rate than to shifts in F0. (2) Shifts in F0 also increased uncertainty about speaker sex, but shifts in speech rate did not. Male voices were accurately perceived as male irrespective of the direction of manipulation in F0; for female voices, however, decreasing F0 increased uncertainty about speaker sex (i.e., the voices were more likely to be perceived as male rather than female). (3) Increasing either F0 or speech rate resulted in both male and female voices sounding younger, whereas decreasing either F0 or speech rate led to listeners perceiving the voices as older. The results of Experiments 5, 6, and 7 indicated the following: (4) Shifts in either F0 or speech rate did increase matching errors for the target voice; however, there was no evidence of an accentuation effect. Specifically, for voices shifted in F0, there was an increase in the selection of voices higher in F0 compared to voices lower in F0; for voices shifted in speech rate, there was an increase in the selection of voices faster in speech rate compared to voices slower in speech rate, but only for slow-speech-rate target voices. (5) Accentuation errors were no more likely to occur when the inter-stimulus interval was increased, or (6) when a different sentence was spoken in the sequential voice pair from the one previously spoken by the target voice. The findings have theoretical and applied relevance. The work has provided a clearer understanding of how shifts in F0 or speech rate are likely to affect perceptions of the identity, sex, and age of the speaker than was possible to establish from previous studies. It has also contributed further to our understanding of the effect of shifts in F0 or speech rate on voice matching performance, and of their importance in accurate recognition. This information might be insightful to the police and help to determine the accuracy of descriptions made about a voice and decisions made during a voice lineup, particularly if a suspect of a crime was disguising their voice.
31

Guerrero, Razuri Javier Francisco. "Decisional-Emotional Support System for a Synthetic Agent : Influence of Emotions in Decision-Making Toward the Participation of Automata in Society." Doctoral thesis, Stockholms universitet, Institutionen för data- och systemvetenskap, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-122084.

Abstract:
Emotion influences our actions, and this means that emotion has subjective decision value. Emotions, properly interpreted and understood, of those affected by decisions provide feedback to actions and, as such, serve as a basis for decisions. Accordingly, "affective computing" represents a wide range of technological opportunities for implementing emotions to improve human-computer interaction, including insights, across a range of contexts in the computational sciences, into how we can design computer systems to communicate with humans and recognize the emotional states they provide. Today, emotional systems such as software-only agents and embodied robots seem to improve every day at managing large volumes of information, yet they remain emotionally incapable of reading our feelings and reacting to them. From a computational viewpoint, technology has made significant steps in determining how an emotional behavior model could be built; such a model is intended to be used for intelligent assistance and support to humans. Human emotions are engines that allow people to generate useful responses to the current situation, taking into account the emotional states of others. Recovering the emotional cues emanating from the natural behavior of humans, such as facial expressions and bodily kinetics, could help to develop systems that allow recognition, interpretation, processing, simulation, and basing decisions on human emotions. Currently, there is a need to create emotional systems able to develop an emotional bond with users, reacting emotionally to encountered situations and assisting users to make their daily life easier. Handling emotions and their influence on decisions can improve human-machine communication with a wider vision. The present thesis strives to provide an emotional architecture applicable to an agent, based on a group of decision-making models influenced by external emotional information provided by humans and acquired through a group of classification techniques from machine learning algorithms. The system can form positive bonds with the people it encounters when proceeding according to their emotional behavior. The agent embodied in the emotional architecture will interact with a user, facilitating adoption in application areas such as caregiving, to provide emotional support to the elderly. The agent's architecture uses an adversarial structure based on an Adversarial Risk Analysis framework with a decision-analytic flavor that includes models forecasting a human's behavior and its impact on the surrounding environment. The agent perceives its environment and the actions performed by an individual, which constitute the resources needed to execute the agent's decision during the interaction. The agent's decision carried out from the adversarial structure is also affected by the emotional-state information provided by a classifier-ensemble system, giving rise to a "decision with emotional connotation" included in the group of affective decisions. The performance of different well-known classifiers was compared in order to select the best result and build the ensemble system, based on feature selection methods introduced to predict the emotion. These methods are based on facial expression, bodily gestures, and speech, and achieved satisfactory accuracy well before the final system.
32

Darbandi, Hossein B. "Speech recognition & diphone extraction for natural speech synthesis." Thesis, 2002. http://hdl.handle.net/2429/12055.

Abstract:
Modern speech synthesizers use concatenated words and sub-word segments, such as diphones, to synthesize natural speech. Synthesizers available today can synthesize speech with only a limited selection of voices provided by the vendors, and the voice segments (e.g. words and diphones) are often created using semi-manual processes that are prone to human error and make the segments non-uniform. The main goal of this thesis is to develop an automatic method for segmenting and labelling natural speech into words, diphones, and phonemes. To segment speech into words and sub-words, I use a speech recognition engine; however, the commercially available speech recognition engines do not provide all the functionality necessary to segment speech into diphones accurately. As a result, I have developed a segmentation engine of my own, built with the freely available HTK tools from Cambridge University.
33

Gritzman, Ashley Daniel. "Adaptive threshold optimisation for colour-based lip segmentation in automatic lip-reading systems." Thesis, 2016. http://hdl.handle.net/10539/22664.

Abstract:
A thesis submitted to the Faculty of Engineering and the Built Environment, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Doctor of Philosophy. Johannesburg, September 2016. Having survived the ordeal of a laryngectomy, the patient must come to terms with the resulting loss of speech. With recent advances in portable computing power, automatic lip-reading (ALR) may become a viable approach to voice restoration. This thesis addresses the image processing aspect of ALR and focuses on three contributions to colour-based lip segmentation. The first contribution concerns the colour transform used to enhance the contrast between the lips and skin. This thesis presents the most comprehensive study to date, measuring the overlap between lip and skin histograms for 33 different colour transforms. The hue component of HSV obtains the lowest overlap of 6.15%, and results show that selecting the correct transform can increase the segmentation accuracy by up to three times. The second contribution is the development of a new lip segmentation algorithm that utilises the best colour transforms from the comparative study; the algorithm is tested on 895 images and achieves a percentage overlap (OL) of 92.23% and a segmentation error (SE) of 7.39%. The third contribution focuses on the impact of the histogram threshold on segmentation accuracy and introduces a novel technique called Adaptive Threshold Optimisation (ATO) to select a better threshold value. The first stage of ATO incorporates ε-SVR to train the lip shape model; ATO then uses feedback of shape information to validate and optimise the threshold. After applying ATO, the SE decreases from 7.65% to 6.50%, corresponding to an absolute improvement of 1.15 pp or a relative improvement of 15.1%. While this thesis concerns lip segmentation in particular, ATO is a threshold selection technique that can be used in various segmentation applications.
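A colour-threshold segmentation of the kind studied here can be sketched with OpenCV; the fixed hue band below merely stands in for the adaptive threshold that ATO would select, and its width is an illustrative assumption.

import cv2
import numpy as np

def segment_lips(bgr_face, hue_band=0.04):
    hsv = cv2.cvtColor(bgr_face, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0].astype(np.float32) / 179.0  # OpenCV hue is 0-179
    # Lip pixels cluster near the red end of the hue circle, i.e. hue
    # values close to 0 or close to 1 after normalisation
    mask = (hue < hue_band) | (hue > 1.0 - hue_band)
    return mask.astype(np.uint8) * 255             # binary lip mask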
APA, Harvard, Vancouver, ISO, and other styles
34

Chang, Tsung-Chuan, and 張聰泉. "The Interactive Voice Response System with Functions of Speech Recognition and Synthesis based on VoiceXML." Thesis, 2003. http://ndltd.ncl.edu.tw/handle/01251913527754000130.

Full text
Abstract:
Master's thesis<br>National Kaohsiung First University of Science and Technology<br>Institute of Computer and Communication Engineering<br>91<br>The telephone network is today's most widely used communication tool, yet using the telephone for information access usually requires interaction between human and machine. The most traditional method is through keystrokes that tell the information supplier what information to retrieve. This primitive style of interaction is inconvenient: keypad entry is cumbersome, especially given the current rapid development of mobile communication. A human-machine interface capable of accepting the user's spoken dialog would fulfil the optimal dialog model desired by information suppliers. This dialog model includes features to automatically recognize the user's spoken language and to synthesize voice, transforming alphanumeric data into voice information. The objective of this work is to integrate speech recognition and synthesis. The integration in this research is based on the VoiceXML language specification, a technology jointly developed by IBM, AT&T, Lucent, Motorola, and other companies to allow consumers to surf the web by means of voice interaction. The most notable advantages of the specification published by the W3C are the easy integration of automatic speech recognition and synthesis and the simpler control and arrangement of dialog flow, which make it suitable for developing voice applications. To make voice applications more flexible, JSP is used to dynamically generate VoiceXML documents; this combination enables more effective and more flexible development of voice applications. In this work we discuss the VoiceXML grammar rules and articulate the interrelations of the dialog control elements. An in-depth discussion of the related technologies supported by J2EE, especially the combined application of JSP and JavaBeans, enables a more dynamic use of these languages in developing voice application systems. Finally, we discuss the design concepts of the voice user interface and analyze the structure of dialog flow. We compile a popular standard dialog control model and integrate it into a VoiceXML-aided design network, which allows designers to conveniently query the needed design information, apply the standard model, and use a real test environment to facilitate the development of voice application systems. In our experiments we apply this research by constructing an automatic voice response central system for computer training courses to verify our results, building an interactive voice response system with speech recognition and synthesis functions using VoiceXML as the development language.
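To make the dynamic-generation idea concrete, the sketch below emits a minimal VoiceXML menu from a server-side program, analogous in spirit to the thesis' JSP approach though written in Python here. The course names, prompt wording, and dialog targets are placeholders; a real document would define the dialogs the choices point to.

```python
# Sketch: dynamic generation of a minimal VoiceXML dialog. The "#info_i"
# targets are placeholders for dialogs defined elsewhere in the document.
from xml.sax.saxutils import escape

def course_menu_vxml(courses):
    """Return a VoiceXML <menu> offering one <choice> per course."""
    choices = "\n".join(
        f'    <choice next="#info_{i}">{escape(name)}</choice>'
        for i, name in enumerate(courses)
    )
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu>
    <prompt>Which course would you like to hear about?</prompt>
{choices}
  </menu>
</vxml>"""

print(course_menu_vxml(["Word processing", "Spreadsheets", "Web design"]))
```

Because the document is generated at request time, the menu can reflect whatever course list the back end currently holds, which is exactly the flexibility the JSP-plus-VoiceXML combination is meant to provide.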
APA, Harvard, Vancouver, ISO, and other styles
35

Rato, João Pedro Cordeiro. "Conversação homem-máquina. Caracterização e avaliação do estado actual das soluções de speech recognition, speech synthesis e sistemas de conversação homem-máquina." Master's thesis, 2016. http://hdl.handle.net/10400.8/2375.

Full text
Abstract:
Human verbal communication runs in both directions, with mutual understanding on both sides leading to certain conclusions. This kind of communication, also called dialogue, can take place not only between human agents but also between humans and machines. Interaction between humans and machines through natural language plays an important role in improving communication between the two. With the aim of better understanding human-machine communication, this document presents a broad overview of human-machine conversation systems, including their modules and operation, dialogue strategies, and the challenges to consider in their implementation. In addition, several speech recognition and speech synthesis systems are presented, along with systems that use human-machine conversation. Finally, performance tests are carried out on several speech recognition systems and, to put some of the concepts presented in this work into practice, the implementation of a human-machine conversation system is described. Several conclusions were drawn from this work, among them the high complexity of human-machine conversation systems, the low performance of speech recognition in noisy environments, and the barriers that can be encountered when implementing these systems.
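The performance tests on speech recognition systems mentioned above are conventionally reported as word error rate (WER). As a reference point, the sketch below computes WER with the classic dynamic-programming edit distance; this is the standard metric, not code from the thesis.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with the classic dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion ("the") and one substitution ("on" -> "off") over 4 words.
print(word_error_rate("turn the lights on", "turn lights off"))  # 0.5
```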
APA, Harvard, Vancouver, ISO, and other styles
36

Lee, Margaret A. "Development of a model which provides a total system approach to integrating voice recognition and speech synthesis into the cockpit of US Navy Aircraft." Thesis, 1988. http://hdl.handle.net/10945/23174.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Hudeček, Vojtěch. "Využití uživatelské odezvy pro zvýšení kvality řečové syntézy." Master's thesis, 2017. http://www.nusl.cz/ntk/nusl-365179.

Full text
Abstract:
Although spoken dialogue systems have greatly improved, they still cannot handle conversations involving unknown topics. One of the problems is that they have difficulty pronouncing unknown words. We investigate methods that can improve spoken dialogue systems by correcting the pronunciation of unknown words. This is a crucial step towards a better user experience, since mispronounced proper nouns, for example, are highly undesirable. Incorrect pronunciation is caused by an imperfect phonetic representation of the word. We aim to detect incorrectly pronounced words, use knowledge about the pronunciation together with the user's feedback, and correct the transcriptions accordingly. Furthermore, the learned phonetic transcriptions can be added to the speech recognition module's vocabulary; extracting correct pronunciations thus benefits both the speech recognition and text-to-speech components of a dialogue system.
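A minimal sketch of the feedback loop described above may help: candidate transcriptions for an unknown word are scored by user reactions, and the winning pronunciation is promoted into the lexicon shared by the recogniser and the synthesiser. The class, the additive scoring scheme, and the phone strings below are all invented for illustration, not the thesis' actual method.

```python
from collections import defaultdict

class PronunciationLearner:
    """Toy feedback loop: candidate phonetic transcriptions for unknown
    words are scored by user reactions; the best-scoring one is promoted
    into the lexicon shared by the recogniser and the synthesiser."""

    def __init__(self):
        self.scores = defaultdict(lambda: defaultdict(float))
        self.lexicon = {}

    def add_candidate(self, word, phones):
        self.scores[word].setdefault(phones, 0.0)

    def feedback(self, word, phones, accepted):
        # Simple additive update: +1 when the user accepts the
        # pronunciation, -1 when they reject or correct it.
        self.scores[word][phones] += 1.0 if accepted else -1.0
        best = max(self.scores[word], key=self.scores[word].get)
        if self.scores[word][best] > 0:
            self.lexicon[word] = best

learner = PronunciationLearner()
learner.add_candidate("Hudecek", "HH UW D EH CH EH K")
learner.add_candidate("Hudecek", "HH UH D ZH EH K")
learner.feedback("Hudecek", "HH UW D EH CH EH K", accepted=False)
learner.feedback("Hudecek", "HH UH D ZH EH K", accepted=True)
print(learner.lexicon)  # {'Hudecek': 'HH UH D ZH EH K'}
```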
APA, Harvard, Vancouver, ISO, and other styles
38

Μπόρας, Ιωσήφ. "Αυτόματος τεμαχισμός ψηφιακών σημάτων ομιλίας και εφαρμογή στη σύνθεση ομιλίας, αναγνώριση ομιλίας και αναγνώριση γλώσσας". Thesis, 2009. http://nemertes.lis.upatras.gr/jspui/handle/10889/2068.

Full text
Abstract:
This dissertation introduces methods for the automatic segmentation of speech signals. Four new segmentation methods are presented, covering both linguistically constrained and unconstrained segmentation. The first method uses pitch-mark points to extract pseudo-phonetic boundaries with the dynamic time warping algorithm. The second introduces a new hybrid method for training hidden Markov models that makes them more effective in the speech segmentation task. The third uses regression algorithms to fuse independent segmentation engines. The fourth extends the Viterbi algorithm with multiple speech parameterization techniques for segmentation. Finally, the proposed methods are used to improve systems for speech synthesis, speech recognition, and language recognition.
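The first of the four methods rests on dynamic time warping. For reference, the sketch below is textbook DTW over two 1-D feature sequences, returning the accumulated cost and the optimal alignment path; it is a generic illustration, not the thesis' exact formulation over pitch-mark features.

```python
import numpy as np

def dtw(x, y):
    """Textbook dynamic time warping between two 1-D feature sequences.
    Returns the accumulated cost and the optimal alignment path."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal path from the end to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

cost, path = dtw([1.0, 2.0, 3.0, 3.0], [1.0, 3.0, 3.0])
print(cost, path)
```

In the segmentation setting, the recovered path is what maps reference boundary positions onto the target signal, yielding the pseudo-phonetic boundaries.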
APA, Harvard, Vancouver, ISO, and other styles
39

Hain, Horst-Udo. "Phonetische Transkription für ein multilinguales Sprachsynthesesystem." Doctoral thesis, 2004. https://tud.qucosa.de/id/qucosa%3A25872.

Full text
Abstract:
This thesis deals with a data-driven method for grapheme-to-phoneme conversion in a speech synthesis system. The task is to determine the pronunciation of arbitrary words, including words not contained in the system's lexicon. The architecture itself is language-independent; only the knowledge bases loaded at run time depend on the language. Creating knowledge bases for further languages should be possible largely automatically and without expert knowledge. Expert knowledge may be used to improve the results, but it must not be a prerequisite. Two neural networks are used to determine the transcription. The first network generates the phones to be realised, including syllable boundaries, from the word's letter sequence; the second then determines the position of the word stress. This separation has the advantage that knowledge of the complete phone sequence can be used when determining the word accent. Other approaches that determine the transcription in a single step must decide on the accent at the beginning of the word, even though the word's pronunciation is not yet settled. The separation also makes it possible to train two networks tailored specifically to their respective tasks. A special feature of the neural networks used here is the introduction of a scaling layer between the actual input and the hidden layer. Input and scaling layer are connected by a diagonal matrix, and weight decay is applied to the weights of this connection. This yields a weighting of the input information during training: input nodes carrying much information are strengthened, while less informative nodes are attenuated, to the point that individual nodes can be cut off entirely. The purpose of this connection is to reduce the influence of noise in the training data. By suppressing unimportant input values, the network can concentrate on the important data, which speeds up training and improves the results. Combined with stepwise pruning of the weights, disturbing or unimportant connections within the network architecture are deleted as well, increasing the generalisation ability further. The preparation of the lexica for generating the training patterns for the neural networks is likewise performed automatically. Using dynamic time warping (DTW), the optimal path is sought in a plane spanned by the letters of the word on one axis and the phone sequence on the other, yielding an assignment of phones to letters from which the training patterns are generated. To improve the transcription results further, a hybrid method using both the lexica and the networks was developed. Unknown words are first decomposed into parts found in the lexicon, and the phone sequences of these sub-words are assembled into the overall transcription, with gaps between the sub-words filled in by the neural networks. This is not straightforward, however, since errors can occur at the joins between the partial transcriptions. This problem is solved with the help of the lexicon prepared for generating the training patterns, which contains an unambiguous assignment of phones to the letters that generate them; the phones at the joins can thus be re-evaluated and transcription errors avoided. The published edition of this dissertation appeared in 2005 with w.e.b.-Universitätsverlag Dresden (ISBN 3-937672-76-1).<br>The topic of this thesis is a system which is able to perform grapheme-to-phoneme conversion for several languages without changes in its architecture. This is achieved by separating the language-dependent knowledge bases from the run-time system. The main focus is an automated adaptation to new languages by generating new knowledge bases without manual effort and with a minimal requirement for additional information. The only source is a lexicon containing all the words together with their appropriate phonetic transcription. Additional knowledge can be used to improve or accelerate the adaptation process, but it must not be a prerequisite. Another requirement is a fully automatic process without manual interference or post-editing. This allows for the adaptation to a new language without even having a command of that language; the only precondition is the pronunciation dictionary, which should be enough for the data-driven approach to learn the new language. The automatic adaptation process is divided into two parts. In the first step the lexicon is pre-processed to determine which grapheme sequence belongs to which phoneme. This is the basis for the generation of the training patterns for the data-driven learning algorithm. In the second part, mapping rules are derived automatically and finally used to create the phonetic transcription of any word, even if it is not contained in the dictionary. The task is to achieve a generalisation process that can handle all words in a text that has to be read out by a text-to-speech system.
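The lexicon pre-processing step, which assigns phones to the letters that generate them, can be illustrated with a small dynamic-programming alignment. The sketch below aligns a letter sequence to a phone sequence, allowing silent letters and phones without a letter of their own; the similarity function and gap costs are crude stand-ins for the thesis' DTW cost model.

```python
# Sketch: grapheme-to-phoneme alignment by dynamic programming, in the
# spirit of the lexicon pre-processing step described above.

def align(graphemes, phones, similar):
    """Align a letter sequence to a phone sequence; one letter may map
    to one phone or to no phone (and vice versa)."""
    n, m = len(graphemes), len(phones)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if D[i][j] == INF:
                continue
            if i < n and j < m:   # letter emits phone
                c = D[i][j] + (0.0 if similar(graphemes[i], phones[j]) else 1.0)
                if c < D[i + 1][j + 1]:
                    D[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n and D[i][j] + 0.5 < D[i + 1][j]:      # silent letter
                D[i + 1][j], back[i + 1][j] = D[i][j] + 0.5, (i, j)
            if j < m and D[i][j] + 0.5 < D[i][j + 1]:      # letter-less phone
                D[i][j + 1], back[i][j + 1] = D[i][j] + 0.5, (i, j)
    # Recover letter-phone pairs along the best path.
    pairs, ij = [], (n, m)
    while back[ij[0]][ij[1]] is not None:
        pi, pj = back[ij[0]][ij[1]]
        g = graphemes[pi] if ij[0] > pi else "-"
        p = phones[pj] if ij[1] > pj else "-"
        pairs.append((g, p))
        ij = (pi, pj)
    return list(reversed(pairs))

# Hypothetical similarity: a letter "matches" a phone starting with it.
sim = lambda g, p: p.lower().startswith(g.lower())
print(align(list("thought"), ["TH", "AO", "T"], sim))
```

The resulting letter-phone pairs are exactly the kind of training patterns the data-driven learner needs, and the unambiguous assignment is what later allows phones at the joins between partial transcriptions to be re-evaluated.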
APA, Harvard, Vancouver, ISO, and other styles
