
Dissertations / Theses on the topic 'Computer recognition of speech'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Computer recognition of speech.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses across a wide variety of disciplines and organise your bibliography correctly.

1

Wang, Peidong. "Robust Automatic Speech Recognition By Integrating Speech Separation." The Ohio State University, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=osu1619099401042668.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Tyler, J. E. M. "Speech recognition by computer : algorithms and architectures." Thesis, University of Greenwich, 1988. http://gala.gre.ac.uk/8707/.

Full text
Abstract:
This work is concerned with the investigation of algorithms and architectures for computer recognition of human speech. Three speech recognition algorithms have been implemented, using (a) Walsh Analysis, (b) Fourier Analysis and (c) Linear Predictive Coding. The Fourier Analysis algorithm made use of the Prime-number Fourier Transform technique. The Linear Predictive Coding algorithm made use of LeRoux and Gueguen's method for calculating the coefficients. The system was organised so that the speech samples could be input to a PC/XT microcomputer in a typical office environment. The PC/XT was linked via Ethernet to a Sun 2/180s computer system which allowed the data to be stored on a Winchester disk so that the data used for testing each algorithm was identical. The recognition algorithms were implemented entirely in Pascal, to allow evaluation to take place on several different machines. The effectiveness of the algorithms was tested with a group of five naive speakers, results being in the form of recognition scores. The results showed the superiority of the Linear Predictive Coding algorithm, which achieved a mean recognition score of 93.3%. The software was implemented on three different computer systems. These were an 8-bit microprocessor, a 16-bit microcomputer based on the IBM PC/XT, and a Motorola 68020 based Sun Workstation. The effectiveness of the implementations was measured in terms of speed of execution of the recognition software. By limiting the vocabulary to ten words, it has been shown that it would be possible to achieve recognition of isolated utterances in real time using a single 68020 microprocessor. The definition of real time in this context is understood to mean that the recognition task will, on average, be completed within the duration of the utterance, for all the utterances in the recogniser's vocabulary. A speech recogniser architecture is proposed which would achieve real time speech recognition without any limitation being placed upon (a) the order of the transform, and (b) the size of the recogniser's vocabulary. This is achieved by utilising a pipeline of four processors, with the pattern matching process performed in parallel on groups of words in the vocabulary.
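For readers unfamiliar with LPC-based front ends such as the one described above, the sketch below shows how predictor coefficients can be estimated from a single speech frame. It is a minimal illustration only: it uses the autocorrelation method with the Levinson-Durbin recursion (a close relative of the LeRoux-Gueguen formulation used in the thesis), assumes NumPy is available, and the frame length and model order are arbitrary placeholders.

    import numpy as np

    def lpc_coefficients(frame, order=10):
        # Estimate LPC coefficients for one speech frame using the
        # autocorrelation method and the Levinson-Durbin recursion.
        frame = frame * np.hamming(len(frame))
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i]
            for j in range(1, i):
                acc += a[j] * r[i - j]
            k = -acc / err                      # reflection coefficient
            new_a = a.copy()
            for j in range(1, i):
                new_a[j] = a[j] + k * a[i - j]  # update predictor coefficients
            new_a[i] = k
            a = new_a
            err *= (1.0 - k * k)                # remaining prediction error
        return a, err

    # Example with a synthetic 240-sample frame (e.g. 30 ms at 8 kHz)
    coeffs, error = lpc_coefficients(np.random.randn(240), order=10)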
APA, Harvard, Vancouver, ISO, and other styles
3

Sun, Felix (Felix W. ). "Speech Representation Models for Speech Synthesis and Multimodal Speech Recognition." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/106378.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 59-63). The field of speech recognition has seen steady advances over the last two decades, leading to the accurate, real-time recognition systems available on mobile phones today. In this thesis, I apply speech modeling techniques developed for recognition to two other speech problems: speech synthesis and multimodal speech recognition with images. In both problems, there is a need to learn a relationship between speech sounds and another source of information. For speech synthesis, I show that using a neural network acoustic model results in a synthesizer that is more tolerant of noisy training data than previous work. For multimodal recognition, I show how information from images can be effectively integrated into the recognition search framework, resulting in improved accuracy when image data is available. By Felix Sun. M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
4

Eriksson, Mattias. "Speech recognition availability." Thesis, Linköping University, Department of Computer and Information Science, 2004. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-2651.

Full text
Abstract:
This project investigates the importance of availability in the scope of dictation programs. Using speech recognition technology for dictating has not reached the public, and that may very well be a result of poor availability in today’s technical solutions. I have constructed a persona character, Johanna, who personalizes the target user. I have also developed a solution that streams audio into a speech recognition server and sends back interpreted text. Johanna affirmed that the solution was successful in theory. I then recruited test users who tried out the solution in practice. Half of them do indeed claim that their usage has been, and will continue to be, increased thanks to the new level of availability.
APA, Harvard, Vancouver, ISO, and other styles
5

Melnikoff, Stephen Jonathan. "Speech recognition in programmable logic." Thesis, University of Birmingham, 2003. http://etheses.bham.ac.uk//id/eprint/16/.

Full text
Abstract:
Speech recognition is a computationally demanding task, especially the decoding part, which converts pre-processed speech data into words or sub-word units, and which incorporates Viterbi decoding and Gaussian distribution calculations. In this thesis, this part of the recognition process is implemented in programmable logic, specifically, on a field-programmable gate array (FPGA). Relevant background material about speech recognition is presented, along with a critical review of previous hardware implementations. Designs for a decoder suitable for implementation in hardware are then described. These include details of how multiple speech files can be processed in parallel, and an original implementation of an algorithm for summing Gaussian mixture components in the log domain. These designs are then implemented on an FPGA. An assessment is made as to how appropriate it is to use hardware for speech recognition. It is concluded that while certain parts of the recognition algorithm are not well suited to this medium, much of it is, and so an efficient implementation is possible. Also presented is an original analysis of the requirements of speech recognition for hardware and software, which relates the parameters that dictate the complexity of the system to processing speed and bandwidth. The FPGA implementations are compared to equivalent software, written for that purpose. For a contemporary FPGA and processor, the FPGA outperforms the software by an order of magnitude.
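The log-domain summation of Gaussian mixture components mentioned above is, in software terms, the familiar log-sum-exp operation. The snippet below is a generic illustration of that operation (not the hardware-oriented formulation developed in the thesis) and assumes diagonal-covariance components and NumPy.

    import numpy as np

    def log_gaussian(x, mean, var):
        # Log-density of a diagonal-covariance Gaussian
        return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

    def log_mixture_score(x, weights, means, variances):
        # log( sum_i w_i * N_i(x) ) computed entirely in the log domain.
        # Subtracting the maximum term m keeps every exp() from underflowing:
        #   log sum_i exp(t_i) = m + log sum_i exp(t_i - m)
        log_terms = np.array([np.log(w) + log_gaussian(x, mu, v)
                              for w, mu, v in zip(weights, means, variances)])
        m = np.max(log_terms)
        return m + np.log(np.sum(np.exp(log_terms - m)))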
APA, Harvard, Vancouver, ISO, and other styles
6

Nilsson, Tobias. "Speech Recognition Software and Vidispine." Thesis, Umeå universitet, Institutionen för datavetenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-71428.

Full text
Abstract:
To evaluate libraries for continuous speech recognition, a test based on TED-talk videos was created. The speech recognition libraries PocketSphinx, Dragon NaturallySpeaking and Microsoft Speech API were part of the evaluation. From the words that the libraries recognized, Word Error Rate (WER) was calculated, and the results show that Microsoft SAPI performed worst with a WER of 60.8%, PocketSphinx came second with 59.9%, and Dragon NaturallySpeaking was the best with 42.6%. These results were all achieved with a Real Time Factor (RTF) of less than 1.0. PocketSphinx was chosen as the best candidate for the intended system on the basis that it is open-source, free and would be a better match for the system. By modifying the language model and dictionary to more closely resemble typical TED-talk contents, it was also possible to improve the WER for PocketSphinx to a value of 39.5%, however at the cost of an RTF that passed the 1.0 limit, making it less useful for live video.
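The two figures of merit used in this evaluation, Word Error Rate and Real Time Factor, are simple to compute once a hypothesis transcript and the processing time are available. Below is a small, self-contained sketch of both; the example strings and timings are made up for illustration and are not taken from the thesis.

    def word_error_rate(reference, hypothesis):
        # WER = (substitutions + deletions + insertions) / reference length,
        # computed with the usual Levenshtein dynamic programme over words.
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)

    def real_time_factor(processing_seconds, audio_seconds):
        # RTF below 1.0 means the recogniser keeps up with real time.
        return processing_seconds / audio_seconds

    # Hypothetical example values, not figures from the thesis
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.17
    print(real_time_factor(processing_seconds=95.0, audio_seconds=120.0))   # ~0.79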
APA, Harvard, Vancouver, ISO, and other styles
7

Price, Michael Ph D. (Michael R. ). Massachusetts Institute of Technology. "Energy-scalable speech recognition circuits." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/106090.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. Cataloged from PDF version of thesis. Includes bibliographical references (pages 135-141). As people become more comfortable with speaking to machines, the applications of speech interfaces will diversify and include a wider range of devices, such as wearables, appliances, and robots. Automatic speech recognition (ASR) is a key component of these interfaces that is computationally intensive. This thesis shows how we designed special-purpose integrated circuits to bring local ASR capabilities to electronic devices with a small size and power footprint. This thesis adopts a holistic, system-driven approach to ASR hardware design. We identify external memory bandwidth as the main driver in system power consumption and select algorithms and architectures to minimize it. We evaluate three acoustic modeling approaches: Gaussian mixture models (GMMs), subspace GMMs (SGMMs), and deep neural networks (DNNs), and identify tradeoffs between memory bandwidth and recognition accuracy. DNNs offer the best tradeoffs for our application; we describe a SIMD DNN architecture using parameter quantization and sparse weight matrices to save bandwidth. We also present a hidden Markov model (HMM) search architecture using a weighted finite-state transducer (WFST) representation. Enhancements to the search architecture, including WFST compression and caching, predictive beam width control, and a word lattice, reduce memory bandwidth to 10 MB/s or less, despite having just 414 kB of on-chip SRAM. The resulting system runs in real-time with accuracy comparable to a software recognizer using the same models. We provide infrastructure for deploying recognizers trained with open-source tools (Kaldi) on the hardware platform. We investigate voice activity detection (VAD) as a wake-up mechanism and conclude that an accurate and robust algorithm is necessary to minimize system power, even if it results in larger area and power for the VAD itself. We design fixed-point digital implementations of three VAD algorithms and explore their performance on two synthetic tasks with SNRs from -5 to 30 dB. The best algorithm uses modulation frequency features with an NN classifier, requiring just 8.9 kB of parameters. Throughout this work we emphasize energy scalability, or the ability to save energy when high accuracy or complex models are not required. Our architecture exploits scalability from many sources: model hyperparameters, runtime parameters such as beam width, and voltage/frequency scaling. We demonstrate these concepts with results from five ASR tasks, with vocabularies ranging from 11 words to 145,000 words. By Michael Price. Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
8

Yoder, Benjamin W. (Benjamin Wesley) 1977. "Spontaneous speech recognition using HMMs." Thesis, Massachusetts Institute of Technology, 2001. http://hdl.handle.net/1721.1/36108.

Full text
Abstract:
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2003. Includes bibliographical references (leaf 63). This thesis describes a speech recognition system that was built to support spontaneous speech understanding. The system is composed of (1) a front-end acoustic analyzer which computes Mel-frequency cepstral coefficients, (2) acoustic models of context-dependent phonemes (triphones), (3) a back-off bigram statistical language model, and (4) a beam search decoder based on the Viterbi algorithm. The context-dependent acoustic models resulted in 67.9% phoneme recognition accuracy on the standard TIMIT speech database. Spontaneous speech was collected using a "Wizard of Oz" simulation of a simple spatial manipulation game. Naive subjects were instructed to manipulate blocks on a computer screen in order to solve a series of geometric puzzles using only spoken commands. A hidden human operator performed actions in response to each spoken command. The speech from thirteen subjects formed the corpus for the speech recognition results reported here. Using a task-specific bigram statistical language model and context-dependent acoustic models, the system achieved a word recognition accuracy of 67.6%. The recognizer operated using a vocabulary of 523 words. The recognition task had a word perplexity of 36. By Benjamin W. Yoder. M.Eng.
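The decoder component described above is a beam-pruned Viterbi search. The toy sketch below shows the core idea on an abstract HMM: at every frame, hypotheses are extended along all transitions, and any hypothesis scoring more than a fixed beam below the best one is discarded. All probabilities and model sizes here are invented placeholders, not values from the thesis.

    def viterbi_beam(observation_loglikes, log_trans, log_init, beam=10.0):
        # Beam-pruned Viterbi search for the best HMM state sequence.
        # observation_loglikes[t][s]: log P(o_t | state s) from the acoustic model
        # log_trans[s][s2]: log transition probability, log_init[s]: log initial prob.
        n_states = len(log_init)
        # Active hypotheses: state -> (score, state sequence so far)
        active = {s: (log_init[s] + observation_loglikes[0][s], [s])
                  for s in range(n_states)}
        for t in range(1, len(observation_loglikes)):
            new_active = {}
            for s, (score, path) in active.items():
                for s2 in range(n_states):
                    cand = score + log_trans[s][s2] + observation_loglikes[t][s2]
                    if s2 not in new_active or cand > new_active[s2][0]:
                        new_active[s2] = (cand, path + [s2])
            # Prune anything more than `beam` below the best hypothesis
            best = max(v[0] for v in new_active.values())
            active = {s: v for s, v in new_active.items() if v[0] >= best - beam}
        best_state = max(active, key=lambda s: active[s][0])
        return active[best_state][1], active[best_state][0]

    # Tiny two-state example with made-up log-probabilities
    obs = [[-1.0, -2.0], [-2.0, -0.5], [-1.5, -0.7]]
    trans = [[-0.3, -1.4], [-1.4, -0.3]]
    init = [-0.7, -0.7]
    print(viterbi_beam(obs, trans, init))  # best path [1, 1, 1] with score -4.5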
APA, Harvard, Vancouver, ISO, and other styles
9

Ganapathiraju, Aravind. "Support Vector Machines for Speech Recognition." MSSTATE, 2002. http://sun.library.msstate.edu/ETD-db/theses/available/etd-02202002-111027/.

Full text
Abstract:
Hidden Markov models (HMM) with Gaussian mixture observation densities are the dominant approach in speech recognition. These systems typically use a representational model for acoustic modeling which can often be prone to overfitting and does not translate to improved discrimination. We propose a new paradigm centered on principles of structural risk minimization using a discriminative framework for speech recognition based on support vector machines (SVMs). SVMs have the ability to simultaneously optimize the representational and discriminative ability of the acoustic classifiers. We have developed the first SVM-based large vocabulary speech recognition system that improves performance over traditional HMM-based systems. This hybrid system achieves a state-of-the-art word error rate of 10.6% on a continuous alphadigit task, a 10% improvement relative to an HMM system. On SWITCHBOARD, a large vocabulary task, the system improves performance over a traditional HMM system from a 41.6% word error rate to 40.6%. This dissertation discusses several practical issues that arise when SVMs are incorporated into the hybrid system.
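As a rough intuition for the discriminative classifiers used in such hybrid systems, the sketch below fits a kernel SVM to labelled acoustic feature vectors with scikit-learn. The data, dimensionality and class labels are entirely synthetic placeholders; a real hybrid system would couple the classifier outputs to an HMM decoder, as the dissertation describes.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Hypothetical data: MFCC-like frame vectors with toy phone-class labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 39))      # 39-dimensional acoustic feature frames
    y = rng.integers(0, 10, size=1000)   # 10 phone classes (random toy labels)

    # An RBF-kernel SVM as a discriminative acoustic classifier; in a hybrid
    # system its probability-like outputs would rescore or replace GMM scores.
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, probability=True))
    clf.fit(X[:800], y[:800])
    print("held-out accuracy:", clf.score(X[800:], y[800:]))  # near chance on random data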
APA, Harvard, Vancouver, ISO, and other styles
10

Lebart, Katia. "Speech dereverberation applied to automatic speech recognition and hearing aids." Thesis, University of Sussex, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.285064.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Mwanyoha, Sadiki Pili 1974. "A speech recognition module for speech-to-text language translation." Thesis, Massachusetts Institute of Technology, 1998. http://hdl.handle.net/1721.1/9862.

Full text
Abstract:
Thesis (S.B. and M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (leaves 47-48). By Sadiki Pili Mwanyoha. S.B. and M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
12

Zhu, Bo Ph D. Massachusetts Institute of Technology. "Multimodal speech recognition with ultrasonic sensors." Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/46530.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 95-96). Ultrasonic sensing of articulator movement is an area of multimodal speech recognition that has not been researched extensively. The widely-researched audio-visual speech recognition (AVSR), which relies upon video data, is awkwardly high-maintenance in its setup and data collection process, as well as computationally expensive because of image processing. In this thesis we explore the effectiveness of ultrasound as a more lightweight secondary source of information in speech recognition. We first describe our hardware systems that made simultaneous audio and ultrasound capture possible. We then discuss the new types of features that needed to be extracted; traditional Mel-Frequency Cepstral Coefficients (MFCCs) were not effective in this narrowband domain. Spectral analysis pointed to frequency-band energy averages, energy-band frequency midpoints, and spectrogram peak location vs. acoustic event timing as convenient features. Next, we devised ultrasonic-only phonetic classification experiments to investigate the ultrasound's abilities and weaknesses in classifying phones. We found that several acoustically-similar phone pairs were distinguishable through ultrasonic classification. Additionally, several same-place consonants were also distinguishable. We also compared classification metrics across phonetic contexts and speakers. Finally, we performed multimodal continuous digit recognition in the presence of acoustic noise. We found that the addition of ultrasonic information reduced word error rates by 24-29% over a wide range of acoustic signal-to-noise ratios (SNR) (clean to 0 dB). This research indicates that ultrasound has the potential to be a financially and computationally cheap noise-robust modality for speech recognition systems. By Bo Zhu. M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
13

Lau, Raymond 1971. "Subword lexical modelling for speech recognition." Thesis, Massachusetts Institute of Technology, 1998. http://hdl.handle.net/1721.1/46181.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Badr, Ibrahim. "Pronunciation learning for automatic speech recognition." Thesis, Massachusetts Institute of Technology, 2011. http://hdl.handle.net/1721.1/66022.

Full text
Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 99-101). In many ways, the lexicon remains the Achilles heel of modern automatic speech recognizers (ASRs). Unlike stochastic acoustic and language models that learn the values of their parameters from training data, the baseform pronunciations of words in an ASR vocabulary are typically specified manually, and do not change unless they are edited by an expert. Our work presents a novel generative framework that uses speech data to learn stochastic lexicons, thereby taking a step towards alleviating the need for manual intervention and automatically learning high-quality baseform pronunciations for words. We test our model on a variety of domains: an isolated-word telephone speech corpus, a weather query corpus and an academic lecture corpus. We show significant improvements of 25%, 15% and 2% over expert-pronunciation lexicons, respectively. We also show that further improvements can be made by combining our pronunciation learning framework with acoustic model training. By Ibrahim Badr. S.M.
APA, Harvard, Vancouver, ISO, and other styles
15

Park, Chi-youn 1981. "Consonant landmark detection for speech recognition." Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/44905.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 191-197). This thesis focuses on the detection of abrupt acoustic discontinuities in the speech signal, which constitute landmarks for consonant sounds. Because a large amount of phonetic information is concentrated near acoustic discontinuities, more focused speech analysis and recognition can be performed based on the landmarks. Three types of consonant landmarks are defined according to their characteristics (glottal vibration, turbulence noise, and sonorant consonant) so that the appropriate analysis method for each landmark point can be determined. A probabilistic knowledge-based algorithm is developed in three steps. First, landmark candidates are detected and their landmark types are classified based on changes in spectral amplitude. Next, a bigram model describing the physiologically-feasible sequences of consonant landmarks is proposed, so that the most likely landmark sequence among the candidates can be found. Finally, it has been observed that certain landmarks are ambiguous in certain sets of phonetic and prosodic contexts, while they can be reliably detected in other contexts. A method to represent the regions where the landmarks are reliably detected versus where they are ambiguous is presented. On the TIMIT test set, 91% of all the consonant landmarks and 95% of obstruent landmarks are located as landmark candidates. The bigram-based process for determining the most likely landmark sequences yields 12% deletion and substitution rates and a 15% insertion rate. An alternative representation that distinguishes reliable and ambiguous regions can detect 92% of the landmarks, and 40% of the landmarks are judged to be reliable. The deletion rate within reliable regions is as low as 5%. The resulting landmark sequences form a basis for a knowledge-based speech recognition system, since the landmarks imply broad phonetic classes of the speech signal and indicate the points of focus for estimating detailed phonetic information. In addition, because the reliable regions generally correspond to lexical stresses and word boundaries, it is expected that the landmarks can guide the focus of attention not only at the phoneme level, but at the phrase level as well. By Chiyoun Park. Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
16

Kent, Christopher Grant. "Personalized Computer Architecture as Contextual Partitioning for Speech Recognition." Thesis, Virginia Tech, 2009. http://hdl.handle.net/10919/35957.

Full text
Abstract:
Computing is entering an era of hundreds to thousands of processing elements per chip, yet no known form of parallelism scales to that degree. To address this problem, we investigate the foundation of a computer architecture where processing elements and memory are contextually partitioned based upon facets of a user's life. Such Contextual Partitioning (CP), the situational handling of inputs, employs a method for allocating resources that is novel compared with the approaches used in today's architectures. Instead of focusing components on mutually exclusive parts of a task, as in Thread Level Parallelism, CP assigns different physical components to different versions of the same task, defining versions by contextual distinctions in device usage. Thus, application data is processed differently based on the situation of the user. Further, partitions may be user specific, leading to personalized architectures. Our focus is mobile devices, which are, or can be, personalized to one owner. Our investigation is centered on leveraging CP for accurate and real-time speech recognition on mobile devices, scalable to large vocabularies, a highly desired application for future user interfaces. By contextually partitioning a vocabulary, training partitions as separate acoustic models with SPHINX, we demonstrate a maximum error reduction of 61% compared to a unified approach. CP also allows for systems robust to changes in vocabulary, requiring up to 97% less training when updating old vocabulary entries with new words, and incurring fewer errors from the replacement. Finally, CP has the potential to scale nearly linearly with increasing core counts, offering architectures effective with future processor designs. Master of Science.
APA, Harvard, Vancouver, ISO, and other styles
17

Witt, Silke Maren. "Use of speech recognition in computer-assisted language learning." Thesis, University of Cambridge, 2000. https://www.repository.cam.ac.uk/handle/1810/251707.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Zhang, Li. "A syllable-based, pseudo-articulatory approach to speech recognition." Thesis, University of Birmingham, 2004. http://etheses.bham.ac.uk//id/eprint/4905/.

Full text
Abstract:
The prevailing approach to speech recognition is Hidden Markov Modelling, which yields good performance. However, it ignores phonetics, which has the potential for going beyond the acoustic variance to provide a more abstract underlying representation. The novel approach pursued in this thesis is motivated by phonetic and phonological considerations. It is based on the notion of pseudo-articulatory representations, which are abstract and idealized accounts of articulatory activity. The original work presented here demonstrates the recovery of syllable structure information from pseudo-articulatory representations directly without resorting to statistical models of phone sequences. The work is also original in its use of syllable structures to recover phonemes. This thesis presents the three-stage syllable based, pseudo-articulatory approach in detail. Though it still has problems, this research leads to a more plausible style of automatic speech recognition and will contribute to modelling and understanding speech behaviour. Additionally, it also permits a 'multithreaded' approach combining information from different processes.
APA, Harvard, Vancouver, ISO, and other styles
19

Najafian, Maryam. "Acoustic model selection for recognition of regional accented speech." Thesis, University of Birmingham, 2016. http://etheses.bham.ac.uk//id/eprint/6461/.

Full text
Abstract:
Accent is cited as an issue for speech recognition systems. Our experiments showed that the ASR word error rate is up to seven times greater for accented speech compared with standard British English. The main objective of this research is to develop Automatic Speech Recognition (ASR) techniques that are robust to accent variation. We applied different acoustic modelling techniques to compensate for the effects of regional accents on ASR performance. For conventional GMM-HMM based ASR systems, we showed that using a small amount of data from a test speaker to choose an accent-dependent model using an accent identification (AID) system, or building a model using the data from N neighbouring speakers in AID space, will result in superior performance compared to that obtained with unsupervised or supervised speaker adaptation. In addition we showed that using a DNN-HMM rather than a GMM-HMM based acoustic model would improve the recognition accuracy considerably. Even if we apply two stages of adaptation (accent followed by speaker adaptation) to the GMM-HMM baseline system, the GMM-HMM based system will not outperform the baseline DNN-HMM based system. For more contemporary DNN-HMM based ASR systems we investigated how adding different types of accented data to the training set can provide better recognition accuracy on accented speech. Finally, we proposed a new approach for visualisation of the AID feature space. This is helpful in analysing the AID recognition accuracies and the AID confusion matrices.
APA, Harvard, Vancouver, ISO, and other styles
20

Cardinal, Patrick. "Finite-state transducers and speech recognition." Thesis, McGill University, 2003. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=78335.

Full text
Abstract:
Finite-state automata and finite-state transducers have been extensively studied over the years. Recently, the theory of transducers has been generalized by Mohri for the weighted case. This generalization has allowed the use of finite-state transducers in a large variety of applications such as speech recognition. In this work, most of the algorithms for performing operations on weighted finite-state transducers are described in detail and analyzed. Then, an example of their use is given via a description of a speech recognition system based on them.
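As a flavour of what using weighted finite-state transducers for recognition means in practice, the sketch below represents a tiny weighted transducer as a Python dictionary and finds its single best path in the tropical semiring, where weights behave like negative log probabilities and the best path minimises the summed weight. Real systems use optimised toolkits built on the generalized algorithms described here; this dictionary-based version is only illustrative, and the toy arcs and weights are invented.

    import heapq
    import itertools

    def best_path(arcs, start, finals):
        # Single best path through a weighted automaton in the tropical semiring:
        # a path's weight is the sum of its arc weights, and the best path has the
        # minimum sum. Dijkstra's algorithm applies since weights are non-negative.
        # arcs:   {state: [(next_state, output_label, weight), ...]}
        # finals: {state: final_weight}
        SUPER = "<final>"            # virtual super-final state, so final weights count
        tie = itertools.count()      # tie-breaker so the heap never compares states
        best = {start: 0.0}
        queue = [(0.0, next(tie), start, ())]
        while queue:
            dist, _, state, labels = heapq.heappop(queue)
            if state == SUPER:
                return dist, list(labels)
            if dist > best.get(state, float("inf")):
                continue
            out_arcs = list(arcs.get(state, []))
            if state in finals:
                out_arcs.append((SUPER, None, finals[state]))
            for nxt, olabel, weight in out_arcs:
                nd = dist + weight
                if nd < best.get(nxt, float("inf")):
                    best[nxt] = nd
                    new_labels = labels if olabel is None else labels + (olabel,)
                    heapq.heappush(queue, (nd, next(tie), nxt, new_labels))
        return float("inf"), []

    # Toy transducer whose best path emits "the cat"
    arcs = {0: [(1, None, 0.2)],
            1: [(2, "the", 0.1)],
            2: [(3, "cat", 0.4), (3, "cap", 0.9)]}
    print(best_path(arcs, start=0, finals={3: 0.0}))  # roughly (0.7, ['the', 'cat'])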
APA, Harvard, Vancouver, ISO, and other styles
21

Livescu, Karen 1975. "Analysis and modeling of non-native speech for automatic speech recognition." Thesis, Massachusetts Institute of Technology, 1999. http://hdl.handle.net/1721.1/80204.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Tran, Thao, and Nathalie Tkauc. "Face recognition and speech recognition for access control." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-39776.

Full text
Abstract:
This project is a collaboration with the company JayWay in Halmstad. In order to enter the office today, a tag-key is needed for the employees and a doorbell for the guests. If someone rings the doorbell, someone on the inside has to open the door manually, which is considered a disturbance during work time. The purpose of the project is to minimize the disturbances in the office. The goal of the project is to develop a system that uses face recognition and speech-to-text to control the lock system for the entrance door. The components used for the project are two Raspberry Pis, a 7-inch LCD touch display, a Raspberry Pi Camera Module V2, an external sound card, a microphone and a speaker. The whole project was written in Python, and the platform used was Amazon Web Services (AWS) for storage and the face recognition, while speech-to-text was provided by Google. The system is divided into three functions for employees, guests and deliveries. The employee function has two authentication steps, the face recognition and a randomly generated code that needs to be confirmed to avoid biometric spoofing. The guest function includes the speech-to-text service to state an employee's name that the guest wants to meet, and the employee is then notified. The delivery function informs the specific persons in the office who are responsible for the deliveries by sending a notification. The test proves that the system will always match with the right person when using the face recognition. It also shows what the threshold for the face recognition can be set to, to make sure that only authorized people enter the office. Using the two-step authentication, the face recognition and the code, makes the system secure and protects it against spoofing. One downside is that it is an extra step that takes time. The speech-to-text is set to Swedish and works quite well for Swedish-speaking persons. However, for a multicultural company it can be hard to use the speech-to-text service. It can also be hard for the service to listen and translate if there is a lot of background noise or if several people speak at the same time.
APA, Harvard, Vancouver, ISO, and other styles
23

May, Daniel Olen. "NONLINEAR DYNAMIC INVARIANTS FOR CONTINUOUS SPEECH RECOGNITION." MSSTATE, 2008. http://sun.library.msstate.edu/ETD-db/theses/available/etd-07072008-155511/.

Full text
Abstract:
In this work, nonlinear acoustic information is combined with traditional linear acoustic information in order to produce a noise-robust set of features for speech recognition. Classical acoustic modeling techniques for speech recognition have relied on a standard assumption of linear acoustics where signal processing is primarily performed in the signal's frequency domain. While these conventional techniques have demonstrated good performance under controlled conditions, the performance of these systems suffers significant degradations when the acoustic data is contaminated with previously unseen noise. The objective of this thesis was to determine whether nonlinear dynamic invariants are able to boost speech recognition performance when combined with traditional acoustic features. Several sets of experiments are used to evaluate both clean and noisy speech data. The invariants resulted in a maximum relative increase of 11.1% for the clean evaluation set. However, an average relative decrease of 7.6% was observed for the noise-contaminated evaluation sets. The fact that recognition performance decreased with the use of dynamic invariants suggests that additional research is required for robust filtering of phase spaces constructed from noisy time series.
APA, Harvard, Vancouver, ISO, and other styles
24

Varenhorst, Christopher J. "Making speech recognition work on the web." Thesis, Massachusetts Institute of Technology, 2011. http://hdl.handle.net/1721.1/66814.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 43-44). We present an improved Audio Controller for the Web-Accessible Multimodal Interface toolkit, a system that provides a simple way for developers to add speech recognition to web pages. Our improved system offers increased usability and performance for users and greater flexibility for developers. Tests performed show a 36% improvement in recognition response time in the best possible networking conditions. Preliminary tests show an improved user experience. The new Wowza platform also provides a means of upgrading other Audio Controllers easily. By Christopher J. Varenhorst. M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
25

Lee, Steven C. (Steven Cheng-Chi) 1975. "Probabilistic segmentation for segment-based speech recognition." Thesis, Massachusetts Institute of Technology, 1998. http://hdl.handle.net/1721.1/47605.

Full text
Abstract:
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (leaves 64-66). By Steven C. Lee. M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
26

Gür, Burkay. "Improving speech recognition accuracy for clinical conversations." Thesis, Massachusetts Institute of Technology, 2012. http://hdl.handle.net/1721.1/76817.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student submitted PDF version of thesis. Includes bibliographical references (p. 73-74). Accurate and comprehensive data form the lifeblood of health care. Unfortunately, there is much evidence that current data collection methods sometimes fail. Our hypothesis is that it should be possible to improve the thoroughness and quality of information gathered through clinical encounters by developing a computer system that (a) listens to a conversation between a patient and a provider, (b) uses automatic speech recognition technology to transcribe that conversation to text, (c) applies natural language processing methods to extract the important clinical facts from the conversation, (d) presents this information in real time to the participants, permitting correction of errors in understanding, and (e) organizes those facts into an encounter note that could serve as a first draft of the note produced by the clinician. In this thesis, we present our attempts to measure the performance of two state-of-the-art automatic speech recognizers (ASRs) for the task of transcribing clinical conversations, and explore potential ways of optimizing these software packages for the specific task. In the course of this thesis, we have (1) introduced a new method for quantitatively measuring the difference between two language models and showed that conversational and dictational speech have different underlying language models, (2) measured the perplexity of clinical conversations and dictations and shown that spontaneous speech has a higher perplexity than dictational speech, (3) improved speech recognition accuracy by language adaptation using a conversational corpus, and (4) introduced a fast and simple algorithm for cross-talk elimination in two-speaker settings. By Burkay Gür. M.Eng.
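Perplexity, the quantity measured in points (2) and (3) above, can be illustrated with a very small add-one-smoothed bigram model. The sketch below is generic: the miniature "clinical" sentences are invented for illustration and bear no relation to the corpora used in the thesis.

    import math
    from collections import Counter

    def bigram_perplexity(train_sentences, test_sentences):
        # Perplexity of the test text under an add-one-smoothed bigram model:
        #   PP = exp( -(1/N) * sum_i log P(w_i | w_{i-1}) )
        unigrams, bigrams = Counter(), Counter()
        vocab = set()
        for sent in train_sentences:
            words = ["<s>"] + sent.split() + ["</s>"]
            vocab.update(words)
            unigrams.update(words[:-1])
            bigrams.update(zip(words[:-1], words[1:]))
        V = len(vocab)

        log_prob, n_tokens = 0.0, 0
        for sent in test_sentences:
            words = ["<s>"] + sent.split() + ["</s>"]
            for prev, cur in zip(words[:-1], words[1:]):
                p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
                log_prob += math.log(p)
                n_tokens += 1
        return math.exp(-log_prob / n_tokens)

    # Invented toy sentences; prints roughly 4.4
    print(bigram_perplexity(["the patient reports chest pain",
                             "the patient denies chest pain"],
                            ["the patient reports pain"]))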
APA, Harvard, Vancouver, ISO, and other styles
27

Duchnowski, Paul. "A new structure for automatic speech recognition." Thesis, Massachusetts Institute of Technology, 1993. http://hdl.handle.net/1721.1/17333.

Full text
Abstract:
Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1993. Includes bibliographical references (leaves 102-110). By Paul Duchnowski. Sc.D.
APA, Harvard, Vancouver, ISO, and other styles
28

Dujari, Rajeev. "Parallel Viterbi search algorithm for speech recognition." Thesis, Massachusetts Institute of Technology, 1992. http://hdl.handle.net/1721.1/13111.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Wang, Stanley Xinlei. "Using graphone models in automatic speech recognition." Thesis, Massachusetts Institute of Technology, 2009. http://hdl.handle.net/1721.1/53114.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Includes bibliographical references (p. 87-90). This research explores applications of joint letter-phoneme subwords, known as graphones, in several domains to enable detection and recognition of previously unknown words. For these experiments, graphone models are integrated into the SUMMIT speech recognition framework. First, graphones are applied to automatically generate pronunciations of restaurant names for a speech recognizer. Word recognition evaluations show that graphones are effective for generating pronunciations for these words. Next, a graphone hybrid recognizer is built and tested for searching song lyrics by voice, as well as transcribing spoken lectures in an open vocabulary scenario. These experiments demonstrate significant improvement over traditional word-only speech recognizers. Modifications to the flat hybrid model, such as reducing the graphone set size, are also considered. Finally, a hierarchical hybrid model is built and compared with the flat hybrid model on the lecture transcription task. By Stanley Xinlei Wang. M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
30

Saenko, Ekaterina 1976. "Articulatory features for robust visual speech recognition." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/28736.

Full text
Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 99-105). This thesis explores a novel approach to visual speech modeling. Visual speech, or a sequence of images of the speaker's face, is traditionally viewed as a single stream of contiguous units, each corresponding to a phonetic segment. These units are defined heuristically by mapping several visually similar phonemes to one visual phoneme, sometimes referred to as a viseme. However, experimental evidence shows that phonetic models trained from visual data are not synchronous in time with acoustic phonetic models, indicating that visemes may not be the most natural building blocks of visual speech. Instead, we propose to model the visual signal in terms of the underlying articulatory features. This approach is a natural extension of feature-based modeling of acoustic speech, which has been shown to increase robustness of audio-based speech recognition systems. We start by exploring ways of defining visual articulatory features: first in a data-driven manner, using a large, multi-speaker visual speech corpus, and then in a knowledge-driven manner, using the rules of speech production. Based on these studies, we propose a set of articulatory features, and describe a computational framework for feature-based visual speech recognition. Multiple feature streams are detected in the input image sequence using Support Vector Machines, and then incorporated in a Dynamic Bayesian Network to obtain the final word hypothesis. Preliminary experiments show that our approach increases viseme classification rates in visually noisy conditions, and improves visual word recognition through feature-based context modeling. By Ekaterina Saenko. S.M.
APA, Harvard, Vancouver, ISO, and other styles
31

Chang, Jane W. (Jane Wen) 1970. "Speech recognition system robustness to microphone variations." Thesis, Massachusetts Institute of Technology, 1995. http://hdl.handle.net/1721.1/36536.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Al-Talabani, Abdulbasit. "Automatic Speech Emotion Recognition : feature space dimensionality and classification challenges." Thesis, University of Buckingham, 2015. http://bear.buckingham.ac.uk/101/.

Full text
Abstract:
In the last decade, research in Speech Emotion Recognition (SER) has become a major endeavour in Human Computer Interaction (HCI) and speech processing. Accurate SER is essential for many applications, like assessing customer satisfaction with quality of services, and detecting/assessing the emotional state of children in care. The large number of studies published on SER reflects the demand for its use. The main concern of this thesis is the investigation of SER from pattern recognition and machine learning points of view. In particular, we aim to identify appropriate mathematical models of SER and examine the process of designing automatic emotion recognition schemes. There are major challenges to automatic SER, including ambiguity about the list/definition of emotions, the lack of agreement on a manageable set of uncorrelated speech-based emotion-relevant features, and the difficulty of collecting emotion-related datasets under natural circumstances. We initiate our work by dealing with the identification of appropriate sets of emotion-related features/attributes extractible from speech signals, as considered from psychological and computational points of view. We investigate the use of pattern-recognition approaches to remove redundancies and achieve a compact digital representation of the extracted data with minimal loss of information. The thesis includes the design of new SER schemes, and complements to existing ones, and conducts large sets of experiments to empirically test their performance on different databases, identifying advantages and shortcomings of using speech alone for emotion recognition. Existing SER studies deal with the ambiguity/disagreement about a “limited” number of emotion-related features by expanding the list from the same speech signal sources/sites and applying various feature selection procedures as a means of reducing redundancies. Attempts are made to discover features more relevant to emotion in speech. One of our investigations focuses on proposing a new set of features for SER, extracted from Linear Predictive (LP)-residual speech. We demonstrate the usefulness of the proposed relatively small set of features by testing the performance of an SER scheme based on fusing our set of features with the existing set of thousands of features, using the common machine learning schemes of Support Vector Machines (SVM) and Artificial Neural Networks (ANN). The challenge of the growing dimensionality of the SER feature space and its impact on increased model complexity is another major focus of our research project. By studying the pros and cons of the commonly used feature selection approaches, we argue in favour of meta-feature selection and develop various methods in this direction, not only to reduce dimension, but also to adapt and de-correlate emotional feature spaces for improved SER model recognition accuracy. We use Principal Component Analysis (PCA) and propose Data Independent PCA (DIPCA), trained on independent emotional and non-emotional datasets. The DIPCA projections, especially when extracted from speech data coloured with different emotions or from neutral speech data, have capability comparable to PCA in terms of SER performance. Another approach adopted in this thesis for dimension reduction is Random Projection (RP) matrices, which are independent of the training data. We show that some versions of RP with an SVM classifier can offer an adaptation space for speaker-independent SER that avoids over-fitting and hence improves recognition accuracy. Using PCA trained on one set of data, while testing on emotional data features, has significant implications for machine learning in general. The thesis' other major contribution focuses on the classification aspects of SER. We investigate the drawbacks of the well-known SVM classifier when applied to data preprocessed by PCA and RP, and demonstrate the advantages of using the Linear Discriminant Classifier (LDC) instead, especially for PCA-de-correlated meta-features. We initiate a variety of LDC-based ensemble classification schemes, testing performance using a new form of bagging over different subsets of meta-features extracted by PCA, with encouraging results. The experiments were conducted on two benchmark datasets (Emo-Berlin and FAU-Aibo) and an in-house dataset in the Kurdish language. The recognition accuracies achieved are significantly higher than the state-of-the-art results on all datasets. The results, however, revealed a difficult challenge in the form of a persistent wide gap in accuracy over different datasets, which cannot be explained entirely by the differences between the natures of the datasets. We conducted various pilot studies, based on various visualizations of the confusion matrices for the “difficult” databases, to build multi-level SER schemes. These studies provide initial evidence of the presence of more than one “emotion” in the same portion of speech. A possible solution may be to present recognition accuracy in a score-based measurement like the spider chart. Such an approach may also reveal the presence of Doddington zoo phenomena in SER.
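To make the dimension-reduction step concrete, the sketch below projects a synthetic high-dimensional emotion-feature matrix down to 50 dimensions with both PCA and a data-independent Gaussian random projection, using scikit-learn. The matrix sizes and component counts are arbitrary placeholders, not the settings used in the thesis.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.random_projection import GaussianRandomProjection

    # Hypothetical stand-in for a large SER feature matrix
    # (e.g. thousands of utterance-level functionals per recording).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 1500))

    # PCA: fit projections on one dataset, then reuse them elsewhere
    # (the "data independent" idea is to fit on a separate corpus).
    pca = PCA(n_components=50).fit(X)
    X_pca = pca.transform(X)

    # Random projection: a data-independent alternative of similar dimension.
    rp = GaussianRandomProjection(n_components=50, random_state=0)
    X_rp = rp.fit_transform(X)

    print(X_pca.shape, X_rp.shape)   # (500, 50) (500, 50)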
APA, Harvard, Vancouver, ISO, and other styles
33

Wadkins, Eric J. "A continuous silent speech recognition system for AlterEgo, a silent speech interface." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/123121.

Full text
Abstract:
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 83-85). In this thesis, I present my work on a continuous silent speech recognition system for AlterEgo, a silent speech interface. By transcribing residual neurological signals sent from the brain to speech articulators during internal articulation, the system allows one to communicate without the need to speak or perform any visible movements or gestures. It is capable of transcribing continuous silent speech at a rate of over 100 words per minute. The system therefore provides a natural alternative to normal speech at a rate not far below that of conversational speech. This alternative method of communication enables those who cannot speak, such as people with speech or neurological disorders, as well as those in environments not suited for normal speech, to communicate more easily and quickly. In the same capacity, it can serve as a discreet, digital interface that augments the user with information and services without the use of an external device. I discuss herein the data processing and sequence prediction techniques used, describe the collected datasets, and evaluate various models for achieving such a continuous system, the most promising among them being a deep convolutional neural network (CNN) with connectionist temporal classification (CTC). I also share the results of various feature selection and visualization techniques, an experiment to quantify electrode contribution, and the use of a language model to boost transcription accuracy by leveraging the context provided by transcribing an entire sentence at once. By Eric J. Wadkins. M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science.
APA, Harvard, Vancouver, ISO, and other styles
34

Duguay, Richard. "Speech recognition : transition probability training in diphone bootstraping." Thesis, McGill University, 1999. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=21544.

Full text
Abstract:
This work explores possible methods of improving already well-trained diphone models using the same data set that was used to train the base monophones. The emphasis is placed on transition probability training. A simple approach to probability adaptation is used as a test of the expected magnitude of change in performance. Various other methods of probability modification are explored, including sample pruning, unseen model substitution, and the use of phonetically tied mixtures. Model performance improvement is observed by comparison with similar experiments.
APA, Harvard, Vancouver, ISO, and other styles
35

Kopru, Selcuk. "Coupling Speech Recognition And Rule-based Machine Translation." Phd thesis, METU, 2008. http://etd.lib.metu.edu.tr/upload/12610001/index.pdf.

Full text
Abstract:
The objective of this thesis was to study the coupling of automatic speech recognition (ASR) systems with rule-based machine translation (MT) systems. In this thesis, a unique approach to integrating ASR with MT for speech translation (ST) tasks was proposed. The proposed approach is unique, essentially because it includes the first rule-based MT system that can process speech data in a word graph format. Compared to other rule-based MT systems, our system processes both a word graph and a stream of words. Thus, the suggested integration method of the ASR and the rule-based MT system is more detailed than a simple software engineering practice. The second reason why it is unique is because this coupling approach performed better than the first-best and N-best list techniques, which are the only other methods used to integrate an ASR with a rule-based MT system. The enhanced performance of the coupling approach was verified with experiments. The utilization of rule-based MT systems for ST tasks is important; however, there are some unresolved issues. Most of the literature concerning coupling systems has focused on how to integrate ASR with statistical MT rather than rule-based MT. This is because statistical MT systems can process word graphs as input, and therefore the resolution of ambiguities can be moved to the MT component. With the new approach proposed in this thesis, this same advantage exists in rule-based MT systems. The success of such an approach could facilitate the efficient usage of rule-based systems for ST tasks.
APA, Harvard, Vancouver, ISO, and other styles
36

Bengio, Yoshua. "Connectionist models applied to automatic speech recognition." Thesis, McGill University, 1987. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=63920.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Peters, Richard Alan II. "A LINEAR PREDICTION CODING MODEL OF SPEECH (SYNTHESIS, LPC, COMPUTER, ELECTRONIC)." Thesis, The University of Arizona, 1985. http://hdl.handle.net/10150/291240.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Aboul-Hosn, Rafah. "Improvement of acoustic models in automatic speech recognition systems." Thesis, McGill University, 1995. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=23380.

Full text
Abstract:
This thesis explores the use of efficient acoustic modeling techniques to improve the performance of automatic speech recognition (ASR) systems. The principal idea behind this study is that the pronunciation of a word is not only affected by the variability of the speaker and the environment, but largely by the words that precede and follow it. Hence, to accurately represent a language, one should model the sounds in context with other sounds. Furthermore, due to the large number of models produced when every sound is represented in every context it can appear in, one needs to use a clustering technique by which sounds of similar properties are grouped together so as to limit the number of models needed. The aim of this research is twofold: the first is to explore the effects of using context dependent models on the performance of an ASR system; the second is to combine the context dependent models pertaining to a specific sound in a complex structure to produce a context independent model containing contextual information. Two such complex structures are designed and their performance is tested.
APA, Harvard, Vancouver, ISO, and other styles
39

Kuhn, Roland. "A cache-based natural language model for speech recognition /." Thesis, McGill University, 1988. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=61941.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Galler, Michael. "Methods for more efficient, effective and robust speech recognition." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1999. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape3/PQDD_0032/NQ64560.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Chang, Hung-An Ph D. Massachusetts Institute of Technology. "Multi-level acoustic modeling for automatic speech recognition." Thesis, Massachusetts Institute of Technology, 2012. http://hdl.handle.net/1721.1/74981.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 183-192). Context-dependent acoustic modeling is commonly used in large-vocabulary Automatic Speech Recognition (ASR) systems as a way to model coarticulatory variations that occur during speech production. Typically, the local phoneme context is used as a means to define context-dependent units. Because the number of possible context-dependent units can grow exponentially with the length of the contexts, many units will not have enough training examples to train a robust model, resulting in a data sparsity problem. For nearly two decades, this data sparsity problem has been dealt with by a clustering-based framework which systematically groups different context-dependent units into clusters such that each cluster can have enough data. Although dealing with the data sparsity issue, the clustering-based approach also makes all context-dependent units within a cluster have the same acoustic score, resulting in a quantization effect that can potentially limit the performance of the context-dependent model. In this work, a multi-level acoustic modeling framework is proposed to address both the data sparsity problem and the quantization effect. Under the multi-level framework, each context-dependent unit is associated with classifiers that target multiple levels of contextual resolution, and the outputs of the classifiers are linearly combined for scoring during recognition. By choosing the classifiers judiciously, both the data sparsity problem and the quantization effect can be dealt with. The proposed multi-level framework can also be integrated into existing large-vocabulary ASR systems, such as FST-based ASR systems, and is compatible with state-of-the-art error reduction techniques for ASR systems, such as discriminative training methods. Multiple sets of experiments have been conducted to compare the performance of the clustering-based acoustic model and the proposed multi-level model. In a phonetic recognition experiment on TIMIT, the multi-level model has about 8% relative improvement in terms of phone error rate, showing that the multi-level framework can help improve phonetic prediction accuracy. In a large-vocabulary transcription task, combining the proposed multi-level modeling framework with discriminative training can provide more than 20% relative improvement over a clustering baseline model in terms of Word Error Rate (WER), showing that the multi-level framework can be integrated into existing large-vocabulary decoding frameworks and that it combines well with discriminative training methods. In a speaker adaptive transcription task, the multi-level model has about 14% relative WER improvement, showing that the proposed framework can adapt better to new speakers, and potentially to new environments, than the conventional clustering-based approach. By Hung-An Chang. Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
42

Cai, Carrie Jun. "Adapting existing games for education using speech recognition." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/82184.

Full text
Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from PDF student-submitted version of thesis. Includes bibliographical references (p. 73-77).
Although memory exercises and arcade-style games are alike in their repetitive nature, memorization tasks like vocabulary drills tend to be mundane and tedious, while arcade-style games are popular, intense, and broadly addictive. The repetitive structure of arcade games suggests an opportunity to modify these well-known games for the purpose of learning. Arcade-style games like Tetris and Pac-Man are often difficult to adapt for educational purposes because their fast-paced intensity and keystroke-heavy nature leave little room for simultaneous practice of other skills. Incorporating spoken language technology could make it possible for users to learn as they play, keeping up with game speed through multimodal interaction. Two challenges exist in this research: first, it is unclear which learning strategy would be most effective when incorporated into an already fast-paced, mentally demanding game; second, it remains difficult to augment fast-paced games with speech interaction because the frustrating effect of recognition errors severely compromises entertainment. In this work, we designed and implemented Tetrilingo, a modified version of Tetris with speech recognition to help students practice and remember word-picture mappings. With our speech recognition prototype, we investigated the extent to which various forms of memory practice affect learning and engagement, and found that free-recall retrieval practice was less enjoyable for slower learners despite producing significant learning benefits over alternative learning strategies. Using utterances collected from learners interacting with Tetrilingo, we also evaluated several techniques to increase speech recognition accuracy in fast-paced games by leveraging game context. Results show that, because false negative recognition errors are self-perpetuating and more prevalent than false positives, relaxing the constraints of the speech recognizer towards greater leniency may enhance overall recognition performance.
by Carrie Jun Cai. S.M.
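The idea of relaxing recognizer constraints by leveraging game context can be sketched roughly as follows; the n-best hypothesis format, the confidence threshold, and the function names are assumptions for illustration, not the actual Tetrilingo implementation.

# Hypothetical sketch: use game context to bias recognition toward recall.
# Only words that are valid moves right now are accepted, and the acceptance
# threshold is relaxed to reduce false negatives.
from typing import List, NamedTuple, Optional, Set

class Hypothesis(NamedTuple):
    word: str
    confidence: float    # assumed to lie in [0, 1]

def accept(hyps: List[Hypothesis], valid_words: Set[str],
           lenient_threshold: float = 0.3) -> Optional[str]:
    """Return the best in-context word, accepting low-confidence matches."""
    in_context = [h for h in hyps if h.word in valid_words]
    if not in_context:
        return None
    best = max(in_context, key=lambda h: h.confidence)
    return best.word if best.confidence >= lenient_threshold else None

# Toy n-best list from a recognizer; only "manzana" is a valid move here.
nbest = [Hypothesis("manana", 0.41), Hypothesis("manzana", 0.38)]
print(accept(nbest, {"manzana", "gato"}))   # -> "manzana"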
APA, Harvard, Vancouver, ISO, and other styles
43

Muzumdar, Manish D. (Manish Deepak). "Automatic acoustic measurement optimization for segmental speech recognition." Thesis, Massachusetts Institute of Technology, 1996. http://hdl.handle.net/1721.1/41387.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1996. Includes bibliographical references (p. 73-74).
by Manish D. Muzumdar. M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
44

Chuangsuwanich, Ekapol. "Multilingual techniques for low resource automatic speech recognition." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/105571.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages [133]-143).
Out of the approximately 7,000 languages spoken around the world, only about 100 have Automatic Speech Recognition (ASR) capability, because a vast amount of resources is required to build a speech recognizer: often thousands of hours of transcribed speech data, a phonetic pronunciation dictionary or lexicon that spans all words in the language, and a text collection on the order of several million words. Moreover, ASR technologies usually require years of research to deal with the specific idiosyncrasies of each language. This makes building a speech recognizer for a language with few resources a daunting task. In this thesis, we propose a universal ASR framework for transcription and keyword spotting (KWS) tasks that works across a variety of languages. We investigate methods to deal with the need for a pronunciation dictionary by using a Pronunciation Mixture Model that can learn from existing lexicons and acoustic data to generate pronunciations for new words; when no dictionary is available, a graphemic lexicon provides performance comparable to an expert lexicon. To alleviate the need for text corpora, we investigate the use of subwords and web data, which helps improve KWS results. Finally, we reduce the need for speech recordings by using bottleneck (BN) features trained on multilingual corpora. We first propose the Low-rank Stacked Bottleneck architecture, which improves ASR performance over previous state-of-the-art systems. We then investigate a data-driven method to select data from the languages most similar to the target language, which improves the effectiveness of the BN features. Using the techniques described and proposed in this thesis, we are able to more than double KWS performance for a low-resource language compared with standard techniques geared towards rich-resource domains.
by Ekapol Chuangsuwanich. Ph. D.
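The data-driven selection of source languages for bottleneck-feature training mentioned in the abstract can be illustrated with a small sketch; the summary-statistic representation, the cosine similarity, and all names below are assumptions made for the example, not the method from the thesis.

# Hypothetical sketch: rank candidate source-language corpora by similarity to
# the target language and keep the closest ones for multilingual
# bottleneck-feature training.
from typing import Callable, Dict, List

def select_languages(target_stats: Dict[str, float],
                     source_stats: Dict[str, Dict[str, float]],
                     similarity: Callable[[Dict[str, float], Dict[str, float]], float],
                     top_k: int = 3) -> List[str]:
    """Return the top_k source languages most similar to the target."""
    ranked = sorted(source_stats,
                    key=lambda lang: similarity(target_stats, source_stats[lang]),
                    reverse=True)
    return ranked[:top_k]

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Toy similarity over shared summary statistics (assumed representation)."""
    keys = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

target = {"vowel_ratio": 0.42, "mean_f0": 0.61}
sources = {"lang_a": {"vowel_ratio": 0.40, "mean_f0": 0.58},
           "lang_b": {"vowel_ratio": 0.10, "mean_f0": 0.95}}
print(select_languages(target, sources, cosine, top_k=1))   # -> ['lang_a']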
APA, Harvard, Vancouver, ISO, and other styles
45

Napier, James M. (John Marion). "Integrating speech recognition and generation capabilities into timeliner." Thesis, Massachusetts Institute of Technology, 1996. http://hdl.handle.net/1721.1/41388.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Miyakawa, Laura S. (Laura Sinclair) 1979. "Distributed speech recognition within a segment-based framework." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/87362.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Livescu, Karen 1975. "Feature-based pronunciation modeling for automatic speech recognition." Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/34488.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 131-140).
Spoken language, especially conversational speech, is characterized by great variability in word pronunciation, including many variants that differ grossly from dictionary prototypes. This is one factor in the poor performance of automatic speech recognizers on conversational speech. One approach to handling this variation consists of expanding the dictionary with phonetic substitution, insertion, and deletion rules. Common rule sets, however, typically leave many pronunciation variants unaccounted for and increase word confusability due to the coarse granularity of phone units. We present an alternative approach, in which many types of variation are explained by representing a pronunciation as multiple streams of linguistic features rather than a single stream of phones. Features may correspond to the positions of the speech articulators, such as the lips and tongue, or to acoustic or perceptual categories. By allowing for asynchrony between features and per-feature substitutions, many pronunciation changes that are difficult to account for with phone-based models become quite natural. Although it is well known that many phenomena can be attributed to this "semi-independent evolution" of features, previous models of pronunciation variation have typically not taken advantage of it. In particular, we propose a class of feature-based pronunciation models represented as dynamic Bayesian networks (DBNs). The DBN framework allows us to naturally represent the factorization of the state space of feature combinations into feature-specific factors, and it provides standard algorithms for inference and parameter learning. We investigate the behavior of such a model in isolation using manually transcribed words. Compared to a phone-based baseline, the feature-based model has both higher coverage of observed pronunciations and a higher recognition rate for isolated words. We also discuss the ways in which such a model can be incorporated into various types of end-to-end speech recognizers and present several examples of implemented systems, for both acoustic speech recognition and lipreading tasks.
by Karen Livescu. Ph.D.
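The multi-stream view of pronunciation described in the abstract can be sketched in miniature as follows; the feature names, the substitution table, and the offset-based treatment of asynchrony are simplifying assumptions for illustration, not the thesis's DBN model.

# Hypothetical sketch: a pronunciation as parallel feature streams rather than
# one phone stream. A surface form is "explained" if, for each feature stream
# and some small alignment offset (asynchrony), every surface symbol equals the
# canonical one or is an allowed per-feature substitution.
from typing import Dict, List

ALLOWED_SUBS = {"lips": {"closed": {"narrow"}},      # toy substitution table
                "tongue": {"high": {"mid"}}}

def explains(canonical: Dict[str, List[str]],
             surface: Dict[str, List[str]],
             max_async: int = 1) -> bool:
    """Check whether the surface streams are reachable from the canonical ones."""
    for feat, canon_seq in canonical.items():
        surf_seq = surface.get(feat, [])
        stream_ok = False
        for offset in range(-max_async, max_async + 1):
            pairs = [(canon_seq[i], surf_seq[i + offset])
                     for i in range(len(canon_seq))
                     if 0 <= i + offset < len(surf_seq)]
            if pairs and all(c == s or s in ALLOWED_SUBS.get(feat, {}).get(c, set())
                             for c, s in pairs):
                stream_ok = True
                break
        if not stream_ok:
            return False
    return True

canonical = {"lips": ["closed", "open"], "tongue": ["high", "low"]}
surface   = {"lips": ["narrow", "open"], "tongue": ["high", "low"]}
print(explains(canonical, surface))   # -> True: lip substitution is allowed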
APA, Harvard, Vancouver, ISO, and other styles
48

Chung, Grace Yuet-Chee. "Hierarchical duration modeling for a speech recognition system." Thesis, Massachusetts Institute of Technology, 1997. http://hdl.handle.net/1721.1/42666.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Wang, Chao 1972. "Prosodic modeling for improved speech recognition and understanding." Thesis, Massachusetts Institute of Technology, 2001. http://hdl.handle.net/1721.1/86728.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Er, Chiangkai. "Speech recognition by clustering wavelet and PLP coefficients." Thesis, Massachusetts Institute of Technology, 1997. http://hdl.handle.net/1721.1/42742.

Full text
APA, Harvard, Vancouver, ISO, and other styles