To see the other types of publications on this topic, follow the link: Network speech recognition.

Dissertations / Theses on the topic 'Network speech recognition'


Consult the top 50 dissertations / theses for your research on the topic 'Network speech recognition.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Alphonso, Issac John. "Network training for continuous speech recognition." Master's thesis, Mississippi State : Mississippi State University, 2003. http://library.msstate.edu/etd/show.asp?etd=etd-10252003-105104.

2

Ma, Rui. "Parametric Speech Emotion Recognition Using Neural Network." Thesis, Högskolan i Gävle, Avdelningen för elektronik, matematik och naturvetenskap, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:hig:diva-17694.

Abstract:
The aim of this thesis work is to investigate an algorithm for speech emotion recognition using MATLAB. First, the five most commonly used features are selected and extracted from the speech signal. After this, statistical values such as the mean and variance are derived from the features. These data, along with their associated emotion targets, are fed to the MATLAB neural network tool to train and test the classifier. The overall system provides reliable performance, correctly classifying more than 82% of speech samples after proper training.
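As an illustration of the pipeline this abstract describes (frame-level features reduced to per-utterance statistics, then a neural network classifier), here is a minimal Python sketch. The choice of MFCCs, the librosa/scikit-learn tooling and the network size are our assumptions; the thesis itself selects five features and uses MATLAB's neural network tool.

```python
# Hedged sketch of the described pipeline; feature choice (MFCCs) and
# classifier size are illustrative assumptions, not taken from the thesis.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def utterance_statistics(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # frame-level features
    # Collapse variable-length frames into fixed statistics: mean and variance.
    return np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)])

# X: one statistics row per utterance, y: emotion labels (hypothetical data).
# clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X_train, y_train)
# print(clf.score(X_test, y_test))   # the thesis reports >82% on its own data
```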
3

Ali, Ahmed Mohamed Abdel Maksoud. "Multi-dialect Arabic broadcast speech recognition." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/31224.

Abstract:
Dialectal Arabic speech research suffers from a lack of labelled resources and of standardised orthography. There are three main challenges in dialectal Arabic speech recognition: (i) finding labelled dialectal Arabic speech data, (ii) training robust dialectal speech recognition models from limited labelled data and (iii) evaluating speech recognition for dialects with no orthographic rules. This thesis is concerned with the following three contributions.

Arabic Dialect Identification: We are mainly dealing with Arabic speech without prior knowledge of the spoken dialect. Arabic dialects can be sufficiently diverse that one can argue they are different languages rather than dialects of the same language. We have two contributions. First, we use crowdsourcing to annotate a multi-dialectal speech corpus collected from the Al Jazeera TV channel. We obtained utterance-level dialect labels for 57 hours of high-quality speech, drawn from almost 1,000 hours and covering four major varieties of dialectal Arabic (DA): Egyptian, Levantine, Gulf (Arabian peninsula) and North African (Moroccan). Second, we build an Arabic dialect identification (ADI) system. We explored two main groups of features, namely acoustic features and linguistic features. For the linguistic features, we look at a wide range of features addressing words, characters and phonemes. With respect to acoustic features, we look at raw features such as mel-frequency cepstral coefficients combined with shifted delta cepstra (MFCC-SDC), bottleneck features and the i-vector as a latent variable. We studied both generative and discriminative classifiers, in addition to deep learning approaches, namely the deep neural network (DNN) and the convolutional neural network (CNN). In our work, we propose a five-class Arabic dialect challenge comprising the four dialects mentioned above as well as Modern Standard Arabic.

Arabic Speech Recognition: We introduce our effort in building Arabic automatic speech recognition (ASR) and we create an open research community to advance it. This section has two main goals. First, creating a framework for Arabic ASR that is publicly available for research; we describe our effort in building two multi-genre broadcast (MGB) challenges. MGB-2 focuses on broadcast news, using more than 1,200 hours of speech and 130M words of text collected from the broadcast domain. MGB-3, however, focuses on dialectal multi-genre data with limited non-orthographic speech collected from YouTube, with special attention paid to transfer learning. Second, building a robust Arabic ASR system and reporting a competitive word error rate (WER) to use as a potential benchmark to advance the state of the art in Arabic ASR. Our overall system is a combination of five acoustic models (AMs): unidirectional long short-term memory (LSTM), bidirectional LSTM (BLSTM), time-delay neural network (TDNN), TDNN layers along with LSTM layers (TDNN-LSTM) and finally TDNN layers followed by BLSTM layers (TDNN-BLSTM). The AMs are purely sequence-trained neural networks, trained with lattice-free maximum mutual information (LF-MMI). The generated lattices are rescored using a four-gram language model (LM) and a recurrent neural network with maximum entropy (RNNME) LM. Our official WER is 13%, which is the lowest WER reported on this task.

Evaluation: The third part of the thesis addresses our effort in evaluating dialectal speech with no orthographic rules. Our methods learn from multiple transcribers and align the speech hypotheses to overcome the non-orthographic aspects. Our multi-reference WER (MR-WER) approach is similar to the BLEU score used in machine translation (MT). We have also automated this process by learning different spelling variants from Twitter data: we automatically mine a huge collection of tweets in an unsupervised fashion to build more than 11M n-to-m lexical pairs, and we propose a new evaluation metric, dialectal WER (WERd). Finally, we estimate the word error rate (e-WER) with no reference transcription, using decoding and language features. We show that our word error rate estimation is robust in many scenarios, with and without the decoding features.
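The MR-WER idea above can be made concrete with a small sketch: standard WER is an edit distance over words, and a simple multi-reference variant scores the hypothesis against the closest of several transcribers. This is a minimal reading of the metric; the thesis's actual alignment across transcribers is more sophisticated.

```python
# Minimal WER and multi-reference WER sketch (a simplified reading of MR-WER).
def edit_distance(ref, hyp):
    # Word-level Levenshtein distance: substitutions, insertions, deletions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def multi_reference_wer(refs, hyp):
    return min(wer(r, hyp) for r in refs)   # score against the closest transcriber

print(multi_reference_wer(["il3b kora", "el3ab kora"], "il3b kora"))   # 0.0
```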
4

Haider, Najmi Ghani. "A digital neural network approach to speech recognition." Thesis, Brunel University, 1989. http://bura.brunel.ac.uk/handle/2438/7116.

Abstract:
This thesis presents two novel methods for isolated-word speech recognition based on sub-word components. A digital neural network is the fundamental processing strategy in both methods. The first design is based on the 'Separate Segmentation & Labelling' (SS&L) approach: the spectral data of the input utterance is first segmented into phoneme-like units, which are then time-normalised by linear time normalisation. The neural network labels the time-normalised phoneme-like segments; 78.36% recognition accuracy is achieved for the phoneme-like unit. In the second design, no time normalisation is required. After segmentation, recognition is performed by classifying the data in a window as it is slid, one frame at a time, from the start to the end of each phoneme-like segment in the utterance; 73.97% recognition accuracy for the phoneme-like unit is achieved in this application. The parameters of the neural net have been optimised for maximum recognition performance. A segmentation strategy using the sum of the difference in filterbank channel energy over successive spectra produced 80.27% correct segmentation of isolated utterances into phoneme-like units. A linguistic processor based on that of Kashyap & Mittal [84] enables 93.11% and 93.49% word recognition accuracy to be achieved for the SS&L and 'Sliding Window' recognisers respectively. The linguistic processor has been redesigned to make it portable, so that it can easily be applied to any phoneme-based isolated-word speech recogniser.
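The 'Sliding Window' design lends itself to a compact sketch: classify a fixed-width window of spectra at every position inside a phoneme-like segment, then pool the per-window decisions. The window width, majority-vote pooling and numpy framing below are our assumptions; the thesis uses a digital neural network as the classifier.

```python
# Hedged sketch of the sliding-window recogniser's core loop; assumes each
# segment is at least `window` frames long and `classifier` returns a label.
import numpy as np

def classify_segment(frames, classifier, window=5):
    # frames: (n_frames, n_channels) filterbank spectra of one phoneme-like segment
    votes = []
    for start in range(frames.shape[0] - window + 1):    # slide one frame at a time
        votes.append(classifier(frames[start:start + window].ravel()))
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]                     # majority vote over windows
```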
5

Alexi, Ramsin. "A sub-neural network ensemble for speech recognition." Thesis, University of Sussex, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.298656.

6

McCulloch, Neil Andrew. "Neural network approaches to speech recognition and synthesis." Thesis, Keele University, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.387255.

7

Wang, Yonglian. "Speech Recognition under Stress." Available to subscribers only, 2009. http://proquest.umi.com/pqdweb?did=1968468151&sid=9&Fmt=2&clientId=1509&RQT=309&VName=PQD.

8

Isaacs, Dale. "A comparison of the network speech recognition and distributed speech recognition systems and their effect on speech enabling mobile devices." Master's thesis, University of Cape Town, 2010. http://hdl.handle.net/11427/11232.

Abstract:
Includes bibliographical references (leaves 67-75).

Over the past 10 years there has been an exponential increase in the number of mobile subscribers worldwide. Market research has shown that the number of mobile subscribers rose to 4.3 billion towards the end of Q1 2009. The unprecedented development of the telecommunication industry over the last decade has brought about the need for ubiquitous access to a host of different information resources and services. Today, speech remains the best medium of communication between people, and it is conceivable that speech-enabling mobile devices will allow users who only have mobile devices to access all the information which is now available over the world wide web.
9

Wu, Chunyang. "Structured deep neural networks for speech recognition." Thesis, University of Cambridge, 2018. https://www.repository.cam.ac.uk/handle/1810/276084.

Abstract:
Deep neural networks (DNNs) and deep learning approaches yield state-of-the-art performance in a range of machine learning tasks, including automatic speech recognition. The multi-layer transformations and activation functions in DNNs, or related network variations, allow complex and difficult data to be well modelled. However, the highly distributed representations associated with these models make it hard to interpret the parameters; the whole neural network is commonly treated as a "black box". The behaviours of activation functions and the meanings of network parameters are rarely controlled in standard DNN training. Though sensible performance can be achieved, the lack of interpretability of network structures and parameters makes better regularisation and adaptation of DNN models challenging. In regularisation, parameters have to be regularised universally and indiscriminately; for instance, the widely used L2 regularisation encourages all parameters to be zeros. In adaptation, a large number of independent parameters must be re-estimated, so adaptation schemes in this framework cannot be performed effectively when adaptation data are limited.

This thesis investigates structured deep neural networks. Special structures are explicitly designed and given desired interpretations in order to improve DNN regularisation and adaptation. For regularisation, parameters can be separately regularised based on their functions. For adaptation, parameters can be adapted in groups or partially adapted according to their roles in the network topology. Three forms of structured DNNs are proposed in this thesis; their contributions are as follows.

The first contribution of this thesis is the multi-basis adaptive neural network. This form of structured DNN introduces a set of parallel sub-networks with restricted connections. The design of restricted connectivity allows different aspects of the data to be explicitly learned. Sub-network outputs are then combined, and this combination module is used as the speaker-dependent structure that can be robustly estimated for adaptation.

The second contribution of this thesis is the stimulated deep neural network. This form of structured DNN relates and smooths activation functions in regions of the network. It aids the visualisation and interpretation of DNN models, and also has the potential to reduce over-fitting. Novel adaptation schemes can be performed on it, taking advantage of the smoothness that the stimulated DNN offers.

The third contribution of this thesis is the deep activation mixture model. This form of structured DNN also encourages the outputs of activation functions to form a smooth surface. The output of one hidden layer is explicitly modelled as the sum of a mixture model and a residual model: the mixture model forms an activation contour, and the residual model depicts fluctuations around this contour. The smoothness yielded by the mixture model helps to regularise the overall model and allows novel adaptation schemes.
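To make the first contribution concrete, here is an illustrative PyTorch sketch of the multi-basis idea: parallel sub-networks share the input, and only a small combination module is re-estimated per speaker. The layer sizes and the softmax combination are our assumptions, not the thesis's architecture.

```python
# Hedged sketch of a multi-basis network; sizes and the softmax combination
# are illustrative assumptions.
import torch
import torch.nn as nn

class MultiBasisNet(nn.Module):
    def __init__(self, n_in, n_hidden, n_out, n_bases=4):
        super().__init__()
        self.bases = nn.ModuleList(
            nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU(),
                          nn.Linear(n_hidden, n_out))
            for _ in range(n_bases))
        self.combo = nn.Parameter(torch.zeros(n_bases))   # speaker-dependent part

    def forward(self, x):
        outs = torch.stack([b(x) for b in self.bases])    # (n_bases, batch, n_out)
        w = torch.softmax(self.combo, dim=0)
        return (w[:, None, None] * outs).sum(dim=0)

# Adaptation: freeze everything except .combo, then fit it on the target
# speaker's few utterances.
```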
10

Schuy, Lars. "Speech features and their significance in speaker recognition." Thesis, University of Sussex, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.288845.

Abstract:
This thesis addresses the significance of speech features within the task of speaker recognition. Motivated by the perception of simple attributes like 'loud', 'smooth' and 'fast', more than 70 new speech features are developed. A set of basic speech features like pitch, loudness and speech speed is combined with these new features in a feature set, one set per utterance. A neural network classifier is used to evaluate the significance of these features by creating a speaker recognition system and analysing the behaviour of successfully trained single-speaker networks. An in-depth analysis of network weights allows a rating of significance and feature contribution. A subjective listening experiment validates and confirms the results of the neural network analysis.

The work starts with an extended sentence analysis: ten sentences are uttered by 630 speakers. The extraction of 100 speech features is outlined and a 100-element feature vector for each utterance is derived. Some features themselves and the methods of analysing them have been used elsewhere, for example pitch, sound pressure level, spectral envelope, loudness, speech speed and glottal-to-noise excitation. However, more than 70 of the 100 features are derivatives of these basic features and have not been described and used before in speaker recognition research, especially not within a rating of feature significance. These derivatives include histograms, 3rd and 4th moments, function approximation, as well as other statistical analyses applied to the basic features.

The first approach to assessing the significance of features and their possible use in a recognition system is based on a probability analysis. The analysis rests on the assumption that, within a speaker's ten utterances, single feature values have a small deviation and cluster around that speaker's mean value. The presented features indeed cluster into groups and show significant differences between speakers, thus enabling a clear separation of voices when applied to a small database of fewer than 20 speakers. The recognition and assessment of individual feature contribution becomes impossible when the database is extended to 200 speakers. To ensure continuous validation of feature contribution it is necessary to consider a different type of classifier.

These limitations are overcome with the introduction of neural network classifiers. A separate network is assigned to each speaker, resulting in the creation of 630 networks. All networks are of standard feed-forward backpropagation type and have a 100-input, 20-hidden-node, one-output architecture. The 6,300 available feature vectors are split into training, validation and test sets in the ratio 5:3:2. The networks are initially trained with the same 100-feature input database. Successful training was achieved within 30 to 100 epochs per network. The speaker related to the network with the highest output is declared the speaker represented by the input. The achieved recognition rate for 630 speakers is about 49%. A subsequent preclusion of features with minor significance raises the recognition rate to 57%.

The analysis of the network weight behaviour reveals two major points. First, a definite ranking order of significance exists between the 100 features. Many of the newly introduced derivatives of pitch, brightness, spectral voice patterns and speech speed contribute intensely to recognition, whereas feature groups related to the glottal-to-noise excitation ratio and sound pressure level play a less important role. The significance of features is rated by the training, testing and validation behaviour of the networks under data sets with reduced information content, the post-trained weight distribution and the standard deviation of the weight distribution within networks. The findings match the results of a subjective listening experiment. As a second major result, the analysis shows that there are large differences between speakers in the significance of features, i.e. not all speakers use the same feature set to the same extent. The speaker-related networks exhibit key features by which they are uniquely identifiable, and these key features vary from speaker to speaker. Some features like pitch are used by all networks; other features like sound pressure level and glottal-to-noise excitation ratio are used by only a few distinct classifiers. Again, the findings correspond with the results of a subjective listening experiment.

This thesis presents more than 70 new features which have never been used before in speaker recognition. A quantitative ranking order of 100 speech features is introduced. Such a ranking order has not been documented elsewhere and is comparatively new to the area of speaker recognition. This ranking order is further extended to describe the extent to which a classifier uses or omits single features, depending solely on the characteristics of the voice sample. Such a separation has not yet been documented and is a novel contribution. The close correspondence between the subjective listening experiment and the findings of the network classifiers shows that it is plausible to model the behaviour of human speech recognition with an artificial neural network. Again, such a validation is original in the area of speaker recognition.
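The classifier stage described above (one feed-forward network per speaker, identity decided by the highest output) can be sketched as follows. The scikit-learn tooling and training targets are our assumptions, while the 100-input, 20-hidden-node, one-output shape follows the abstract.

```python
# Hedged sketch of the one-network-per-speaker scheme.
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_speaker_networks(X, speaker_ids):
    # X: (n_utterances, 100) feature vectors; one net per speaker, with target 1
    # for that speaker's utterances and 0 otherwise.
    nets = {}
    for s in set(speaker_ids):
        targets = (np.asarray(speaker_ids) == s).astype(float)
        nets[s] = MLPRegressor(hidden_layer_sizes=(20,), max_iter=1000).fit(X, targets)
    return nets

def identify(nets, feature_vector):
    scores = {s: net.predict(feature_vector.reshape(1, -1))[0]
              for s, net in nets.items()}
    return max(scores, key=scores.get)   # speaker of the highest-output network
```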
11

Rochford, Matthew. "Visual Speech Recognition Using a 3D Convolutional Neural Network." DigitalCommons@CalPoly, 2019. https://digitalcommons.calpoly.edu/theses/2109.

Abstract:
Mainstream automatic speech recognition (ASR) makes use of audio data to identify spoken words; however, visual speech recognition (VSR) has recently been of increased interest to researchers. VSR is used when audio data is corrupted or missing entirely, and also to further enhance the accuracy of audio-based ASR systems. In this research, we present both a framework for building 3D feature cubes of lip data from videos and a 3D convolutional neural network (CNN) architecture for performing classification on a dataset of 100 spoken words, recorded in an uncontrolled environment. Our 3D-CNN architecture achieves a testing accuracy of 64%, comparable with recent works, but using an input data size that is up to 75% smaller. Overall, our research shows that 3D-CNNs can be successful in finding spatial-temporal features using unsupervised feature extraction and are a suitable choice for VSR-based systems.
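For readers unfamiliar with 3D convolutions, a minimal PyTorch sketch of a word classifier over lip-region feature cubes is shown below; the layer sizes are illustrative assumptions, and only the 100-word output follows the abstract.

```python
# Hedged 3D-CNN sketch for (frames x height x width) lip cubes.
import torch
import torch.nn as nn

class Lip3DCNN(nn.Module):
    def __init__(self, n_words=100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                      # halves time and space
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1))              # global pooling over the cube
        self.classify = nn.Linear(32, n_words)    # one logit per spoken word

    def forward(self, x):                         # x: (batch, 1, frames, H, W)
        return self.classify(self.features(x).flatten(1))

logits = Lip3DCNN()(torch.randn(2, 1, 20, 48, 48))   # two 20-frame clips
```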
12

Gangireddy, Siva Reddy. "Recurrent neural network language models for automatic speech recognition." Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/28990.

Abstract:
The goal of this thesis is to advance the use of recurrent neural network language models (RNNLMs) for large vocabulary continuous speech recognition (LVCSR). RNNLMs are currently state-of-the-art and have been shown to consistently reduce the word error rates (WERs) of LVCSR tasks when compared to other language models. In this thesis we propose various advances to RNNLMs: improved learning procedures, enhanced context, and adaptation. We learn better parameters via a novel pre-training approach and enhance the context using prosodic and syntactic features.

We present a pre-training method for RNNLMs in which the output weights of a feed-forward neural network language model (NNLM) are shared with the RNNLM. This is accomplished by first fine-tuning the weights of the NNLM, which are then used to initialise the output weights of an RNNLM with the same number of hidden units. To investigate the effectiveness of the proposed pre-training method, we carried out text-based experiments on the Penn Treebank Wall Street Journal data and ASR experiments on the TED lectures data. Across the experiments, we observe small but significant improvements in perplexity (PPL) and ASR WER.

Next, we present unsupervised adaptation of RNNLMs. We adapt the RNNLMs to a target domain (topic, genre or television programme) at test time, using ASR transcripts from first-pass recognition. We investigated two approaches: in the first, the forward-propagated hidden activations are scaled (learning hidden unit contributions, LHUC); in the second, we adapt all parameters of the RNNLM. We evaluated the adapted RNNLMs by reporting WERs on multi-genre broadcast speech data, and observe small (on average 0.1% absolute) but significant improvements in WER compared to a strong unadapted RNNLM model.

Finally, we present the context enhancement of RNNLMs using prosodic and syntactic features. The prosodic features are computed from the acoustics of the context words, and the syntactic features from the surface form of the words in the context. We trained the RNNLMs with word duration, pause duration, final phone duration, syllable duration, syllable F0, part-of-speech tag and Combinatory Categorial Grammar (CCG) supertag features. The proposed context-enhanced RNNLMs were evaluated by reporting PPL and WER on two speech recognition tasks, Switchboard and TED lectures. We observed substantial improvements in PPL (5% to 15% relative) and small but significant improvements in WER (0.1% to 0.5% absolute).
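The LHUC-style adaptation mentioned above scales each hidden unit by a learnable amplitude estimated from first-pass transcripts. A minimal PyTorch sketch follows; the 2·sigmoid parameterisation is a common LHUC choice, and the framing is ours rather than the thesis code.

```python
# Hedged LHUC sketch: per-unit learnable scales on hidden activations.
import torch
import torch.nn as nn

class LHUCScale(nn.Module):
    def __init__(self, n_hidden):
        super().__init__()
        self.r = nn.Parameter(torch.zeros(n_hidden))   # one amplitude per unit

    def forward(self, h):
        # 2*sigmoid keeps each scale in (0, 2); r = 0 gives the unadapted net.
        return h * (2.0 * torch.sigmoid(self.r))

# Adaptation: freeze the trained RNNLM, insert LHUCScale after the recurrent
# layer, and train only .r on the first-pass ASR transcripts.
```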
13

Lin, Alvin. "Video Based Automatic Speech Recognition Using Neural Networks." DigitalCommons@CalPoly, 2020. https://digitalcommons.calpoly.edu/theses/2343.

Abstract:
Neural network approaches have become popular in the field of automatic speech recognition (ASR). Most ASR methods use audio data to classify words. Lip-reading ASR techniques utilize only video data, which compensates for noisy environments where audio may be compromised. A comprehensive approach to video-based ASR is developed, including the vetting of datasets and the development of a preprocessing chain. This approach is based on neural networks, namely 3D convolutional neural networks (3D-CNN) and long short-term memory (LSTM); these types of neural networks are designed to take in temporal data such as videos. Various combinations of neural network architectures and preprocessing techniques are explored. The best-performing neural network architecture, a CNN with bidirectional LSTM, compares favorably against recent works on video-based ASR.
14

Bireddy, Chakradhar. "Hopfield Networks as an Error Correcting Technique for Speech Recognition." Thesis, University of North Texas, 2004. https://digital.library.unt.edu/ark:/67531/metadc5551/.

Abstract:
I experimented with Hopfield networks in the context of a voice-based, query-answering system. Hopfield networks are used to store and retrieve patterns. I used this technique to store queries represented as natural language sentences and I evaluated the accuracy of the technique for error correction in a spoken question-answering dialog between a computer and a user. I show that the use of an auto-associative Hopfield network helps make the speech recognition system more fault tolerant. I also looked at the available encoding schemes to convert a natural language sentence into a pattern of zeroes and ones that can be stored in the Hopfield network reliably, and I suggest scalable data representations which allow storing a large number of queries.
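The store-and-retrieve cycle at the heart of this approach is easy to sketch: queries are stored as ±1 patterns via Hebbian learning, and a noisy recognition result settles to the nearest stored query. The toy encoding below is illustrative; choosing a reliable sentence encoding is exactly the question the thesis investigates.

```python
# Hedged Hopfield sketch: Hebbian storage and iterative recall of +/-1 patterns.
import numpy as np

def train_hopfield(patterns):
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns).astype(float)   # Hebbian learning
    np.fill_diagonal(W, 0.0)                                  # no self-connections
    return W / n

def recall(W, x, steps=10):
    x = x.copy()
    for _ in range(steps):                        # synchronous updates until stable
        x_new = np.where(W @ x >= 0, 1, -1)
        if np.array_equal(x_new, x):
            break
        x = x_new
    return x

stored = np.array([[1, -1, 1, -1, 1, -1],
                   [1, 1, 1, -1, -1, -1]])        # two encoded queries
W = train_hopfield(stored)
print(recall(W, np.array([1, -1, 1, -1, 1, 1])))  # corrects the flipped last bit
```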
15

Hmad, N. F. "Deep neural network acoustic models for multi-dialect Arabic speech recognition." Thesis, Nottingham Trent University, 2015. http://irep.ntu.ac.uk/id/eprint/27934/.

Abstract:
Speech is a desirable communication method between humans and computers. The major concerns of automatic speech recognition (ASR) are determining a set of classification features and finding a suitable recognition model for these features. Hidden Markov Models (HMMs) have been demonstrated to be powerful models for representing time-varying signals. Artificial Neural Networks (ANNs) have also been widely used for representing time-varying quasi-stationary signals. Arabic is one of the oldest living languages and one of the oldest Semitic languages in the world; it is also the fifth most widely used language and is the mother tongue of roughly 200 million people. Arabic speech recognition has been a fertile area of research over the previous two decades, as attested by the various papers that have been published on this subject. This thesis investigates phoneme and acoustic models based on Deep Neural Networks (DNNs) and Deep Echo State Networks for multi-dialect Arabic speech recognition. Moreover, the TIMIT corpus, with its wide variety of American dialects, is also used to evaluate the proposed models.

The availability of speech data that is time-aligned and labelled at the phonemic level is a fundamental requirement for building speech recognition systems. A developed Arabic phoneme database (APD) was manually timed and phonetically labelled. This dataset was constructed from the King Abdul-Aziz Arabic Phonetics Database (KAPD) for the Saudi Arabia dialect and the Centre for Spoken Language Understanding (CSLU2002) database for different Arabic dialects. This dataset covers 8,148 Arabic phonemes. In addition, a corpus of 120 speakers (13 hours of Arabic speech) randomly selected from the Levantine Arabic dialect database is used for training, and 24 speakers (2.4 hours) for testing; these were revised and transcription errors manually corrected. The selected dataset is labelled automatically using the HTK Hidden Markov Model toolkit. The TIMIT corpus is also used for the phone recognition and acoustic modelling task; we used 462 speakers (3.14 hours) for training and 24 speakers (0.81 hours) for testing.

For ASR, a Deep Neural Network (DNN) is used to evaluate its adoption in developing a framewise phoneme recognition and acoustic modelling system for Arabic speech recognition. Restricted Boltzmann Machine (RBM) DNN models have not been explored for any Arabic corpora previously. This allows us to claim priority for adopting this RBM DNN model for the Levantine Arabic acoustic models. A post-processing enhancement was also applied to the DNN acoustic model outputs in order to improve the recognition accuracy and to obtain the accuracy at the phoneme level instead of the frame level. This post-processing significantly improved the recognition performance.

An Echo State Network (ESN) is developed and evaluated for Arabic phoneme recognition with different learning algorithms. This investigated the use of the conventional ESN trained with supervised and forced learning algorithms. A novel combined supervised/forced-supervised learning algorithm (unsupervised adaptation) was developed and tested on the proposed optimised Arabic phoneme recognition datasets. This new model is evaluated on the Levantine dataset and empirically compared with the results obtained from the baseline Deep Neural Networks (DNNs). A significant improvement in recognition performance was achieved with the ESN model compared to the baseline RBM DNN model's result. The results show that the ESN model has a better ability to recognise phoneme sequences than the DNN model for a small-vocabulary dataset. The adoption of the ESN model for acoustic modelling appears more valid than the adoption of the DNN model, as ESNs are recurrent models and are expected to support sequence modelling better than RBM DNN models, even with a contextual input window.

The TIMIT corpus is also used to investigate deep learning for framewise phoneme classification and acoustic modelling using DNNs and ESNs, to allow a direct and valid comparison between the systems investigated in this thesis and published work on framewise phoneme recognition using the TIMIT corpus. Our main finding on this corpus is that ESN networks outperform time-windowed RBM DNN ones. However, our ESN-based system shows 10% lower performance when compared to other systems recently reported in the literature that used the same corpus. This is due to hardware availability and to not applying speaker and noise adaptation, which could improve the results in this thesis; our aim is to investigate the proposed models for speech recognition and to make a direct comparison between them.
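As background for the ESN results above, here is a minimal numpy sketch of an echo state network: a fixed random recurrent reservoir driven by feature frames, with only a linear readout trained. The sizes, spectral radius and ridge-regression readout are standard choices, not the configurations evaluated in the thesis.

```python
# Hedged echo state network sketch; hyperparameters are standard defaults.
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(n_in, n_res, radius=0.9):
    W_in = rng.uniform(-0.1, 0.1, (n_res, n_in))
    W = rng.standard_normal((n_res, n_res))
    W *= radius / np.max(np.abs(np.linalg.eigvals(W)))   # echo-state property
    return W_in, W

def run_reservoir(W_in, W, inputs):
    states, x = [], np.zeros(W.shape[0])
    for u in inputs:                          # one acoustic feature frame per step
        x = np.tanh(W_in @ u + W @ x)
        states.append(x.copy())
    return np.array(states)

# Only the readout is trained, e.g. ridge regression from states S to
# one-hot phoneme targets Y:
# W_out = np.linalg.solve(S.T @ S + 1e-2 * np.eye(S.shape[1]), S.T @ Y).T
```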
16

Abraham, Aby. "Continous Speech Recognition Using Long Term Memory Cells." Ohio University / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1377777011.

17

Le, Chau Giang. "Application of a back-propagation neural network to isolated-word speech recognition." Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 1993. http://handle.dtic.mil/100.2/ADA272495.

18

Plahl, Christian [Verfasser]. "Neural network based feature extraction for speech and image recognition / Christian Plahl." Aachen : Hochschulbibliothek der Rheinisch-Westfälischen Technischen Hochschule Aachen, 2014. http://d-nb.info/1058851160/34.

19

Kocour, Martin. "Automatic Speech Recognition System Continually Improving Based on Subtitled Speech Data." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2019. http://www.nusl.cz/ntk/nusl-399164.

Abstract:
Large-vocabulary speech recognition systems nowadays achieve fairly high accuracy. However, their results often rest on tens or even hundreds of hours of manually annotated training data. Such data are often unavailable, or do not exist at all for the desired language. A possible solution is to use commonly available, but lower-quality, audiovisual data. This thesis deals with a technique for processing exactly such data and with their use for training acoustic models. It further discusses the possibility of using these data for continual improvement of the models, since such data are practically inexhaustible. For this purpose, a new approach to data selection was designed as part of this work.
20

Sun, Pengfei. "Conditional Neural Networks for Speech and Language Processing." OpenSIUC, 2017. https://opensiuc.lib.siu.edu/dissertations/1414.

Abstract:
Neural-network-based deep learning methods have achieved significant success in several real-world tasks, from machine translation to web recommendation, and are also greatly improving computer vision and natural language processing. Compared with conventional machine learning techniques, neural-network-based deep learning does not require careful engineering and considerable domain expertise to design a feature extractor that transforms the raw data into a suitable internal representation. Its efficacy with multiple levels of representation and feature learning allows this type of approach to process high-dimensional data. It integrates feature representation, learning and recognition into a single framework, in which learning starts at one level (beginning with the raw input) and ends at a higher, slightly more abstract level; by simply stacking enough such transformations, very complex functions can be obtained. In general, high-level feature representations facilitate the discrimination of patterns and can reduce the impact of irrelevant variations. However, previous studies indicate that deep composition of networks can make training gradients vanish. To overcome this weakness, several techniques have been developed, for instance dropout, stochastic gradient descent and residual network structures.

In this study, we incorporate latent information into different network structures (e.g., the restricted Boltzmann machine, recursive neural networks, and long short-term memory). The conditional latent information reflects high-dimensional correlations in the data structure, which a typical network structure may not learn due to limitations of the initial design (i.e., the network size and parameters). Similarly to residual nets, the conditional neural networks jointly learn global and local features, and the specifically designed network structure helps to incorporate the modulation derived from the probability distribution. The proposed models have been widely tested on different datasets: the conditional RBM has been applied to detect speech components, a language-model-based gated RBM has been used to recognize speech-related EEG patterns, and the conditional RNN has been tested on both general natural language modeling and medical notes prediction tasks. The results indicate that by introducing conditional branches into conventional network structures, latent features can be learned both globally and locally.
21

Owen, Mark. "A binary recurrent self-organising neural network and its application to speech recognition." Thesis, University of Cambridge, 1992. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.259480.

22

Heymann, Jahn [Verfasser]. "Robust multi-channel speech recognition with neural network supported statistical beamforming / Jahn Heymann." Paderborn : Universitätsbibliothek, 2020. http://d-nb.info/1226097383/34.

23

Novoa, Ilic José Eduardo. "Robust speech recognition in noisy and reverberant environments using deep neural network-based systems." Tesis, Universidad de Chile, 2018. http://repositorio.uchile.cl/handle/2250/168062.

Abstract:
In this thesis an uncertainty weighting scheme for deep neural network-hidden Markov model (DNN-HMM) based automatic speech recognition (ASR) is proposed to increase discriminability in the decoding process. To this end, the DNN pseudo-log-likelihoods are weighted according to the uncertainty variance assigned to the acoustic observation. The results presented here suggest that a substantial reduction in word error rate (WER) is achieved with clean training. Moreover, modelling the uncertainty propagation through the DNN is not required, and no approximations for non-linear activation functions are made. The presented method can be applied to any network topology that delivers log-likelihood-like scores. It can be combined with any noise removal technique and adds a minimal computational cost. This technique was exhaustively evaluated and combined with uncertainty-propagation-based schemes for computing the pseudo-log-likelihoods and uncertainty variance at the DNN output. Two proposed methods optimized the parameters of the weighting function by leveraging grid search, either on a development database representing the given task or on each utterance based on discrimination metrics. Experiments with the Aurora-4 task showed that, with clean training, the proposed weighting scheme can reduce WER by a maximum of 21% compared with a baseline system with spectral subtraction and uncertainty propagation using the unscented transform.

Additionally, it is proposed to replace the classical black-box integration of automatic speech recognition technology in human-robot interaction (HRI) applications with the incorporation of the HRI environment representation and modeling, and the robot and user states and contexts. Accordingly, this thesis focuses on the environment representation and modeling by training a DNN-HMM based automatic speech recognition engine that combines clean utterances with the acoustic channel responses and noise obtained from an HRI testbed built with a PR2 mobile manipulation robot. This method avoids recording a training database in all the possible acoustic environments of a given HRI scenario. In the generated testbed, the resulting ASR engine provided a WER at least 26% and 38% lower than publicly available speech recognition application programming interfaces (APIs) with the loudspeaker and human-speaker testing databases, respectively, with a limited amount of training data.

This thesis demonstrates that even state-of-the-art DNN-HMM based speech recognizers can benefit from combining systems whose acoustic models have been trained using different feature sets. In this context, the complementarity of DNN-HMM based ASR systems trained with the same data set but with different signal representations is discussed. DNN fusion methods based on flat-weight combination, the minimization of mutual information and the maximization of discrimination metrics were proposed and tested. Schemes that consider the combination of ASR systems with lattice combination and minimum Bayes risk decoding were also evaluated and combined with the DNN fusion techniques. The experimental results were obtained using publicly available, naturally recorded, highly reverberant speech data. Significant improvements in WER were observed by combining DNN-HMM based ASR systems with different feature sets, obtaining relative improvements of 10% with two classifiers and 18% with four classifiers, without any tuning or a priori information about the ASR accuracy.
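The core weighting idea can be sketched in a few lines: each frame's DNN pseudo-log-likelihoods are scaled by a function of the uncertainty variance assigned to the acoustic observation, so unreliable frames influence decoding less. The particular weighting function below is an illustrative assumption; the thesis optimises the parameters of its weighting function by grid search.

```python
# Hedged sketch of uncertainty weighting; the weighting function is
# illustrative, not the one optimised in the thesis.
import numpy as np

def weighted_pseudo_loglik(loglik, variance, alpha=1.0):
    # loglik: (frames, senones) DNN scores; variance: (frames,) uncertainty.
    w = 1.0 / (1.0 + alpha * variance)       # high uncertainty -> small weight
    return loglik * w[:, None]               # flattens scores on unreliable frames

# The weighted scores replace the usual pseudo-log-likelihoods in the HMM
# decoder; alpha would be tuned on a development set or per utterance.
```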
24

Mancini, Eleonora. "Disruptive Situations Detection on Public Transports through Speech Emotion Recognition." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amslaurea.unibo.it/24721/.

Abstract:
In this thesis, we describe a study on the application of Machine Learning and Deep Learning methods for Voice Activity Detection (VAD) and Speech Emotion Recognition (SER). The study is in the context of a European project whose objective is to detect disruptive situations in public transports. To this end, we developed an architecture, implemented a prototype and ran validation tests on a variety of options. The architecture consists of several modules. The denoising module was realized through the use of a filter and the VAD module through an open-source toolkit, while the SER system was entirely developed in this thesis. For the SER architecture we adopted two audio features (MFCC and RMS) and two kinds of classifiers, namely CNN and SVM, to detect emotions indicative of disruptive situations such as fighting or shouting. We aggregated several models through ensemble learning. The ensemble was evaluated on several datasets and showed encouraging experimental results, even compared to state-of-the-art baselines. The code is available at: https://github.com/helemanc/ambient-intelligence
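The feature/classifier pairing described above can be sketched as follows: MFCC and RMS features pooled per clip and fed to an SVM, one member of the thesis's ensemble. The mean/std pooling and parameter choices are our assumptions.

```python
# Hedged sketch of one SER ensemble member (MFCC + RMS features into an SVM).
import numpy as np
import librosa
from sklearn.svm import SVC

def clip_features(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    rms = librosa.feature.rms(y=y)
    feats = np.vstack([mfcc, rms])                       # (14, frames)
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])

# clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
# Ensembling, as in the thesis, would combine these probabilities with a CNN's.
```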
25

Abdallah, Moatassem Mahmoud. "A Study in Speaker Dependent Medium Vocabulary Word Recognition: Application to Human/Computer Interface." Diss., Virginia Tech, 2000. http://hdl.handle.net/10919/77997.

Abstract:
Human interfaces to computers continue to be an active area of research. The keyboard is considered the basic interface for editing control as well as text input. Problems of correct typing and typing speed have urged research into alternative means of keyboard replacement, or at least "resizing" its monopoly. Pointing devices (e.g. a mouse) have been developed, and supporting software with icons is now widely used. Two other means are being developed and operationally tested: the pen, for handwriting text, commands and drawings, and the spoken language interface, which is the subject of this thesis. The human/computer interface is an interactive man-machine communication facility that enjoys the following advantages.

- High input speed: some experiments reveal that the rate of information input by speech is three times faster than keyboard input and eight times faster than inputting characters by hand.
- No training needed: because the generation of speech is a very natural human action, it requires no special training.
- Parallel processing with other information: production of speech works quite well in conjunction with gestures of hands and feet for visual perception of information.
- Simple and economical input sensor: microphones are inexpensive and readily available.
- Coping with handicaps: these interfaces can be used in unusual circumstances of darkness, blindness, or other visual handicap.

This dissertation presents a design for a Human Computer Interface (HCI) system that can be trained to work with an individual speaker. A new approach is introduced to extract key voice features, called Median Linear Predictive Coding (MLPC). MLPC reduces the HCI calculation time and gives an improved recognition rate. This design eliminates the typical multi-layer perceptron (MLP) problems of complexity growth with vocabulary size, the large training times required, and the need for complete re-training whenever the vocabulary is extended. A novel modular neural network architecture, called a Pyramidal Modular Neural Network (PMNN), is introduced for recursive speech identification. In addition, many other system algorithms/components, such as speech endpoint detection and automatic noise thresholding, must be tailored correctly in order to achieve high recognition accuracy.
26

Sklar, Alexander Gabriel. "Channel Modeling Applied to Robust Automatic Speech Recognition." Scholarly Repository, 2007. http://scholarlyrepository.miami.edu/oa_theses/87.

Abstract:
In automatic speech recognition systems (ASRs), training is a critical phase for the system's success. Communication media, either analog (such as analog landline phones) or digital (VoIP), distort the speaker's speech signal, often in very complex ways: linear distortion occurs in all channels, in either the magnitude or phase spectrum. Non-linear but time-invariant distortion will always appear in all real systems. In digital systems we also have network effects, which produce packet losses, delays and repeated packets. Finally, one cannot really assert what path a signal will take, so error or distortion along the way is almost a certainty. The channel introduces an acoustic mismatch between the speaker's signal and the trained data in the ASR, which results in poor recognition performance. The approach so far has been to try to undo the havoc produced by the channels, i.e. compensate for the channel's behavior. In this thesis, we try to characterize the effects of different transmission media and use that as an inexpensive and repeatable way to train ASR systems.
27

Larsson, Joel. "Optimizing text-independent speaker recognition using an LSTM neural network." Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-26312.

Abstract:
In this paper a novel speaker recognition system is introduced. Automated speaker recognition has become increasingly popular, aiding crime investigations and authorization processes as computer science advances. Here, a recurrent neural network approach is used to learn to identify ten speakers within a set of 21 audio books. Audio signals are processed via spectral analysis into Mel Frequency Cepstral Coefficients, which serve as speaker-specific features and are input to the neural network. The Long Short-Term Memory algorithm is examined for the first time within this area, with interesting results. Experiments are made to find the optimal network model for the problem. These show that the network learns to identify the speakers well, text-independently, when the recording situation is the same. However, the system has problems recognizing speakers from different recordings, which is probably due to the noise sensitivity of the speech processing algorithm in use.
28

Kullmann, Emelie. "Speech to Text for Swedish using KALDI." Thesis, KTH, Optimeringslära och systemteori, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189890.

Abstract:
The field of speech recognition has, during the last decade, left the research stage and found its way into the public market. Most computers and mobile phones sold today support dictation and transcription in a number of chosen languages; Swedish is often not one of them. In this thesis, which is executed on behalf of Swedish Radio, an Automatic Speech Recognition model for Swedish is trained and its performance evaluated. The model is built using the open-source toolkit Kaldi. Two approaches to training the acoustic part of the model are investigated: first, using Hidden Markov Models and Gaussian Mixture Models and, second, using Hidden Markov Models and Deep Neural Networks. The latter approach, using deep neural networks, is found to achieve better performance in terms of Word Error Rate.
29

Tomashenko, Natalia. "Speaker adaptation of deep neural network acoustic models using Gaussian mixture model framework in automatic speech recognition systems." Thesis, Le Mans, 2017. http://www.theses.fr/2017LEMA1040/document.

Abstract:
Differences between training and testing conditions may significantly degrade recognition accuracy in automatic speech recognition (ASR) systems. Adaptation is an efficient way to reduce the mismatch between models and data from a particular speaker or channel. There are two dominant types of acoustic models (AMs) used in ASR: Gaussian mixture models (GMMs) and deep neural networks (DNNs). The GMM hidden Markov model (GMM-HMM) approach has been one of the most common techniques in ASR systems for many decades, and speaker adaptation is very effective for these AMs; various adaptation techniques have been developed for them. On the other hand, DNN-HMM AMs have recently achieved big advances and outperformed GMM-HMM models on various ASR tasks. However, speaker adaptation is still very challenging for these AMs. Many adaptation algorithms that work well for GMM systems cannot be easily applied to DNNs because of the different nature of these models. The main purpose of this thesis is to develop a method for efficient transfer of adaptation algorithms from the GMM framework to DNN models. A novel approach for speaker adaptation of DNN AMs is proposed and investigated. The idea of this approach is based on using so-called GMM-derived features as input to a DNN. The proposed technique provides a general framework for transferring adaptation algorithms developed for GMMs to DNN adaptation. It is explored for various state-of-the-art ASR systems and is shown to be effective in comparison with other speaker adaptation techniques and complementary to them.
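The GMM-derived-feature idea admits a compact sketch: train a GMM on acoustic frames and feed its per-component log-posteriors to the DNN, so that GMM-style adaptation (e.g. MAP) can be applied in front of an unchanged network. The scikit-learn tooling and dimensions below are our assumptions.

```python
# Hedged sketch of GMM-derived features as DNN input; dimensions and
# tooling are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

frames = np.random.randn(5000, 39)            # stand-in for MFCC+delta frames
gmm = GaussianMixture(n_components=32, covariance_type="diag",
                      random_state=0).fit(frames)

posteriors = gmm.predict_proba(frames)        # (frames, 32) component posteriors
dnn_input = np.log(posteriors + 1e-10)        # log-posteriors fed to the DNN

# Speaker adaptation would re-estimate the GMM (e.g. by MAP) and recompute
# these features, leaving the DNN itself untouched.
```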
30

Zeghidour, Neil. "Learning representations of speech from the raw waveform." Thesis, Paris Sciences et Lettres (ComUE), 2019. http://www.theses.fr/2019PSLEE004/document.

Abstract:
While deep neural networks are now used in almost every component of a speech recognition system, from acoustic to language modeling, the input to such systems is still a fixed, handcrafted spectral representation such as mel-filterbanks. This contrasts with computer vision, in which deep neural networks are now trained on raw pixels. Mel-filterbanks contain valuable and documented prior knowledge from human auditory perception as well as signal processing, and are the input to state-of-the-art speech recognition systems that are now on par with human performance in certain conditions. However, mel-filterbanks, as any fixed representation, are inherently limited by the fact that they are not fine-tuned for the task at hand. We hypothesize that learning the low-level representation of speech with the rest of the model, rather than using fixed features, could push the state of the art even further. We first explore a weakly-supervised setting and show that a single neural network can learn to separate phonetic information and speaker identity from mel-filterbanks or the raw waveform, and that these representations are robust across languages. Moreover, learning from the raw waveform provides significantly better speaker embeddings than learning from mel-filterbanks. These encouraging results lead us to develop a learnable alternative to mel-filterbanks that can be used directly in replacement of these features. In the second part of this thesis we introduce Time-Domain filterbanks, a lightweight neural network that takes the waveform as input, can be initialized as an approximation of mel-filterbanks, and can then be learned with the rest of the neural architecture. Across extensive and systematic experiments, we show that Time-Domain filterbanks consistently outperform mel-filterbanks and can be integrated into a new state-of-the-art speech recognition system, trained directly from the raw audio signal. Fixed speech features are also used for non-linguistic classification tasks, for which they are even less optimal; we perform dysarthria detection from the waveform with Time-Domain filterbanks and show that they significantly improve over mel-filterbanks or low-level descriptors. Finally, we discuss how our contributions fall within a broader shift towards fully learnable audio understanding systems.
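A minimal sketch of a learnable time-domain filterbank is given below: a 1-D convolution over the raw waveform whose squared, log-compressed responses play the role of mel-filterbank energies and whose kernels are trained with the rest of the network. The sizes are illustrative; the actual Time-Domain filterbanks use complex-valued filters and low-pass pooling.

```python
# Hedged sketch of a learnable waveform front-end; a simplification of the
# Time-Domain filterbanks described above.
import torch
import torch.nn as nn

class LearnableFilterbank(nn.Module):
    def __init__(self, n_filters=40, kernel=400, stride=160):  # ~25 ms / 10 ms at 16 kHz
        super().__init__()
        self.conv = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        # The kernels could be initialised to approximate mel filters, then
        # fine-tuned by backpropagation with the rest of the model.

    def forward(self, wav):                        # wav: (batch, samples)
        energy = self.conv(wav.unsqueeze(1)) ** 2  # squared filter responses
        return torch.log1p(energy)                 # log compression, mel-like output

feats = LearnableFilterbank()(torch.randn(2, 16000))   # -> (2, 40, frames)
```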
APA, Harvard, Vancouver, ISO, and other styles
31

Fox, Elizabeth J. S. "Call-independent identification in birds." University of Western Australia. School of Animal Biology, 2008. http://theses.library.uwa.edu.au/adt-WU2008.0218.

Full text
Abstract:
[Truncated abstract] The identification of individual animals based on acoustic parameters is a non-invasive method of identifying individuals, with considerable advantages over physical marking procedures. One requirement for an effective and practical method of acoustic individual identification is that it is call-independent, i.e. determining identity does not require a comparison of the same call or song type. This means that an individual's identity over time can be determined regardless of any changes to its vocal repertoire, and different individuals can be compared regardless of whether they share calls. Although several methods of acoustic identification currently exist, for example discriminant function analysis or spectrographic cross-correlation, none is call-independent. Call-independent identification has been developed for human speaker recognition, and this thesis aimed to: 1) determine whether call-independent identification is possible in birds, using methods similar to those used for human speaker recognition, 2) examine the impact of noise in a recording on identification accuracy and determine methods of removing the noise and increasing accuracy, 3) provide a comparison of features and classifiers to determine the best method of call-independent identification in birds, and 4) determine the practical limitations of call-independent identification in birds, with respect to increasing population size, changing vocal characteristics over time, using different call categories, and using the method in an open population. ... For classification, Gaussian mixture models and probabilistic neural networks resulted in higher accuracy, and were simpler to use, than multilayer perceptrons. Using the best methods of feature extraction and classification resulted in 86-95.5% identification accuracy for two passerine species, with all individuals correctly identified. A study of the limitations of the technique, in terms of population size, the category of call used, accuracy over time, and the effects of having an open population, found that acoustic identification using perceptual linear prediction and probabilistic neural networks can successfully identify individuals in a population of at least 40 individuals, can be used successfully on call categories other than song, and can be used in open populations in which a new recording may belong to a previously unknown individual. However, identity could only be determined accurately for less than three months, limiting the current technique to short-term field studies. This thesis demonstrates the application of speaker recognition technology to enable call-independent identification in birds. Call-independence is a prerequisite for the successful application of acoustic individual identification in many species, especially passerines, but has so far received little attention in the scientific literature. This thesis demonstrates that call-independent identification is possible in birds, tests and finds methods to overcome the practical limitations of the approach, and thereby enables its future use in biological studies, particularly for the conservation of threatened species.
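The speaker-recognition recipe the thesis adapts, one generative model per individual over frame-level spectral features, can be sketched as follows. MFCCs stand in here for the perceptual linear prediction features the thesis found best, and train_recordings is a hypothetical mapping from individual to its recording files.

# Sketch of call-independent identification with per-individual Gaussian
# mixture models; classification is by highest average frame log-likelihood.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def features(path):
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (frames, 13)

# Train one GMM per individual on its recordings, regardless of call type.
models = {
    ind: GaussianMixture(n_components=16, covariance_type='diag').fit(
        np.vstack([features(f) for f in files]))
    for ind, files in train_recordings.items()   # hypothetical training dict
}

def identify(path):
    x = features(path)
    # score() returns the average per-frame log-likelihood under each model.
    return max(models, key=lambda ind: models[ind].score(x))

Because the models describe the distribution of short-term spectral frames rather than whole call templates, a test recording need not contain any call type seen in training, which is exactly the call-independence property at issue.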
APA, Harvard, Vancouver, ISO, and other styles
32

Zamora, Martínez Francisco Julián. "Aportaciones al modelado conexionista de lenguaje y su aplicación al reconocimiento de secuencias y traducción automática." Doctoral thesis, Universitat Politècnica de València, 2012. http://hdl.handle.net/10251/18066.

Full text
Abstract:
Natural language processing is an application area of artificial intelligence, in particular of pattern recognition, that studies, among other things, how to incorporate syntactic information (a language model) about how the words of a given language fit together, so that recognition/translation systems can decide which hypothesis makes the most "common sense". It is a very broad area, and this work focuses only on the part concerned with language modelling and its application to several tasks: sequence recognition with hidden Markov models, and statistical machine translation. Specifically, this thesis centres on so-called connectionist language models, that is, language models based on neural networks. The good results of these models in several areas of natural language processing motivated this study. Owing to certain computational problems that connectionist language models suffer from, the systems found in the literature are built in two completely decoupled stages. In the first stage, a set of feasible hypotheses is obtained with a standard language model, under the assumption that this set is representative of the search space containing the best hypothesis. In the second stage, the connectionist language model is applied to that set and the highest-scoring hypothesis is extracted. This procedure is known as "rescoring". This scenario motivates the main objectives of this thesis: to propose a technique that can drastically reduce this computational cost while degrading the quality of the found solution as little as possible; to study the effect of integrating connectionist language models into the search process of the proposed tasks; and to propose some modifications of the original model that improve its quality. Zamora Martínez, FJ. (2012). Aportaciones al modelado conexionista de lenguaje y su aplicación al reconocimiento de secuencias y traducción automática [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/18066
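The two-stage "rescoring" pipeline this thesis takes as its starting point can be sketched in a few lines. Here score_with_nnlm is a hypothetical stand-in for the connectionist language model, and the interpolation weight is illustrative.

# Sketch of n-best rescoring: an n-gram model proposes hypotheses, a neural
# language model re-scores them, and the best log-linear combination wins.
def rescore(nbest, lam=0.7):
    # nbest: list of (hypothesis_words, ngram_logprob) pairs
    best, best_score = None, float('-inf')
    for words, ngram_lp in nbest:
        nn_lp = score_with_nnlm(words)                 # hypothetical neural LM
        score = lam * nn_lp + (1.0 - lam) * ngram_lp   # log-linear interpolation
        if score > best_score:
            best, best_score = words, score
    return best

The computational problem the thesis attacks is visible here: the expensive neural model is only ever applied to a small, pre-pruned hypothesis set instead of guiding the search itself.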
APA, Harvard, Vancouver, ISO, and other styles
33

Struhař, Michal. "Detekce chybné výslovnosti v mluvené řeči." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2008. http://www.nusl.cz/ntk/nusl-217311.

Full text
Abstract:
This thesis deals with the detection of speech disorders. One of its aims is choosing a suitable parameterization: short-time energy, zero-crossing rate, linear predictive analysis, perceptual linear predictive analysis, the RASTA method, cepstral analysis and mel-frequency cepstral coefficients can be chosen for detection. The next aim is the construction of a detector of speech disorders based on DTW (Dynamic Time Warping) and an artificial neural network. Detection proceeds on the basis of tokens collected from the chosen analysis and a phonetic transcription of the speech. The analyses, the detector and a phonetic transcription of Czech are implemented in the MATLAB simulation environment.
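A minimal version of the dynamic time warping at the heart of such a detector, assuming frame-level feature matrices (e.g. MFCCs) for a reference and a test utterance; the Euclidean local distance and the length normalisation are illustrative choices, not the thesis's exact configuration.

# Classic DTW distance between two variable-length feature sequences.
import numpy as np

def dtw_distance(a, b):
    # a, b: (frames, dims) feature matrices
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return D[n, m] / (n + m)    # length-normalised alignment cost

A detector can then threshold the distance between a patient's utterance and a correctly pronounced reference, or feed it to a classifier.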
APA, Harvard, Vancouver, ISO, and other styles
34

Lerjebo, Linus, and Johannes Hägglund. "Intelligent chatbot assistant: A study of Natural Language Processing and Artificial Intelligence." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-42691.

Full text
Abstract:
Development of and research in Artificial Intelligence have surged in recent years, including in the medical field. Despite the new technology and tools available, the staff is still under a heavy workload. The goal of this thesis is to analyze the possibilities of a chatbot whose purpose is to assist the medical staff and provide safety for the patients by guaranteeing that they are being monitored. With the use of technologies such as Artificial Intelligence, Natural Language Processing, and Voice over Internet Protocol, the chatbot can communicate with the patient. It works as an assistant for the staff and provides the information from the calls to the medical staff. With the answers provided by the calls, the staff no longer need to ask routine questions every time and can provide help more quickly. The chatbot is administrated through a web application where administrators can initiate calls and add patients to the database.
APA, Harvard, Vancouver, ISO, and other styles
35

Nguyen, Tien Dung. "Multimodal emotion recognition using deep learning techniques." Thesis, Queensland University of Technology, 2020. https://eprints.qut.edu.au/180753/1/Tien%20Dung_Nguyen_Thesis.pdf.

Full text
Abstract:
This thesis investigates the use of deep learning techniques to address the problem of machine understanding of human affective behaviour and to improve the accuracy of both unimodal and multimodal human emotion recognition. The objective was to explore how best to configure deep learning networks to capture, individually and jointly, the key features contributing to human emotions from three modalities (speech, face, and bodily movements) in order to accurately classify the expressed emotion. The outcome of the research should be useful for several applications, including the design of social robots.
APA, Harvard, Vancouver, ISO, and other styles
36

Cantrell, Mark E. "Diphone-based speech recognition using neural networks." Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 1996. http://handle.dtic.mil/100.2/ADA313872.

Full text
Abstract:
Thesis (M.S. in Systems Technology and M.S. in Computer Science), Naval Postgraduate School, June 1996. Thesis advisor(s): Dan C. Boger, Robert B. McGhee. Includes bibliographical references. Also available online.
APA, Harvard, Vancouver, ISO, and other styles
37

Rabi, Gihad. "Visual speech recognition by recurrent neural networks." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1997. http://www.collectionscanada.ca/obj/s4/f2/dsk2/tape16/PQDD_0010/MQ36169.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Prager, Richard William. "Parallel processing networks for automatic speech recognition." Thesis, University of Cambridge, 1987. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.238443.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Elvira, Jose M. "Neural networks for speech and speaker recognition." Thesis, Staffordshire University, 1994. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.262314.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Egorova, Ekaterina. "Multi-Task Neural Networks for Speech Recognition." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236134.

Full text
Abstract:
The first part of this thesis gives a theoretical overview of the principles of neural networks, including the possibilities of their use in speech recognition. It continues with a description of multi-task neural networks and related experiments. The practical part of the work involved changes to neural network training software to enable multi-task training; the prepared environment, including several dedicated scripts, is also described. The experiments presented in this thesis evaluate the use of articulatory characteristics of speech for multi-task training. They were carried out on two speech databases differing in quality and size and representing different languages, English and Vietnamese. Articulatory characteristics were also combined with other secondary tasks, such as context, to assess their complementarity. A comparison is made across neural networks of different sizes in order to characterize the relationship between network size and the effectiveness of multi-task training. The conclusion of the experiments is that multi-task training with articulatory characteristics as secondary tasks leads to better-trained neural networks and can yield more accurate phoneme recognition. Finally, the multi-task neural networks are tested as feature extractors in a speech recognition system.
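The multi-task architecture studied here, shared hidden layers with a primary phoneme head and a secondary articulatory head, can be sketched in PyTorch; the layer sizes, activations, and loss weighting below are illustrative assumptions, not the thesis's exact setup.

# Sketch of multi-task training: the secondary articulatory task shares the
# hidden layers with the primary phoneme task and regularises them.
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, n_in=440, n_hidden=1024, n_phones=135, n_artic=30):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid())
        self.phone_head = nn.Linear(n_hidden, n_phones)   # primary task
        self.artic_head = nn.Linear(n_hidden, n_artic)    # secondary task

    def forward(self, x):
        h = self.shared(x)
        return self.phone_head(h), self.artic_head(h)

net = MultiTaskNet()
criterion = nn.CrossEntropyLoss()
phone_logits, artic_logits = net(torch.randn(32, 440))
# Joint loss: the secondary term is down-weighted relative to the primary one.
loss = criterion(phone_logits, torch.randint(0, 135, (32,))) \
     + 0.3 * criterion(artic_logits, torch.randint(0, 30, (32,)))

At test time the secondary head is simply discarded, or the shared layers are reused as a feature extractor, which is the final use case the thesis evaluates.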
APA, Harvard, Vancouver, ISO, and other styles
41

MELLO, SIMON. "VATS : Voice-Activated Targeting System." Thesis, KTH, Skolan för industriell teknik och management (ITM), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-279837.

Full text
Abstract:
Machine learning implementations in computer vision and speech recognition are widespread and growing, with both low- and high-level applications required. This paper looks at the former and asks whether basic implementations are good enough for real-world applications. To demonstrate this, a simple artificial neural network coded in Python, together with existing Python libraries, is used to control a laser pointer via a servomotor and an Arduino, creating a voice-activated targeting system. The neural network trained on MNIST data consistently achieves an accuracy of 0.95 ± 0.01 when classifying MNIST test data, and also classifies captured images correctly if noise levels are low. The same holds for the speech recognition, which rarely gives wrong readings. The final prototype succeeds in all domains except turning the correctly classified images into targets that the Arduino can read and aim at, failing to merge the computer vision and speech recognition.
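The step the prototype failed at, passing a classified target to the Arduino, amounts to a few lines of glue; a hypothetical sketch follows, where the serial port, baud rate, protocol, and angle mapping are all assumptions for illustration, not the project's actual interface.

# Hypothetical bridge from a classified target index to a servo angle,
# sent to the Arduino over a serial link (pyserial).
import serial

def aim_at(target_index, n_targets=10, port='/dev/ttyACM0'):
    angle = int(180 * target_index / (n_targets - 1))   # map class to servo angle
    with serial.Serial(port, 9600, timeout=1) as link:
        link.write(f'{angle}\n'.encode())               # Arduino parses the angle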
APA, Harvard, Vancouver, ISO, and other styles
42

Gaspar, Laszlo. "Assessment of signal reprocessing for speech recognition." Thesis, Nottingham Trent University, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.320597.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Robinson, Anthony John. "Dynamic error propagation networks." Thesis, University of Cambridge, 1989. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.303145.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Veselý, Michal. "Dynamický dekodér pro rozpoznávání řeči." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2017. http://www.nusl.cz/ntk/nusl-363846.

Full text
Abstract:
The result of this work is a fully working and significantly optimized implementation of a dynamic decoder. The decoder is based on dynamic recognition network generation and decoding with a modified version of the Token Passing algorithm. The implemented solution provides results very similar to those of the original static decoder from BSCORE (the API of the Phonexia company), while offering a significant reduction in memory usage. This makes the use of more complex language models possible, and it also facilitates integrating the speech recognition into mobile devices or dynamically adding new words to the system.
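For readers unfamiliar with the algorithm family this decoder modifies, a bare-bones token passing pass over a recognition network might look as follows; the graph representation and the per-frame acoustic-score interface are illustrative assumptions, not Phonexia's implementation.

# Token passing: propagate scored tokens through the network, keeping only the
# best token per state at each frame.
import collections

Token = collections.namedtuple('Token', 'score history')

def token_passing(arcs, frames, start=0):
    # arcs: dict state -> list of (next_state, label)
    # frames: list of dicts mapping label -> acoustic log-likelihood
    tokens = {start: Token(0.0, ())}
    for frame in frames:
        new = {}
        for state, tok in tokens.items():
            for nxt, label in arcs.get(state, []):
                cand = Token(tok.score + frame[label], tok.history + (label,))
                if nxt not in new or cand.score > new[nxt].score:
                    new[nxt] = cand          # keep the best token per state
        tokens = new                         # (a real decoder also prunes by beam)
    return max(tokens.values(), key=lambda t: t.score)

The "dynamic" part of the thesis lies in generating the arcs on the fly from the language model instead of compiling the whole network statically, which is where the memory savings come from.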
APA, Harvard, Vancouver, ISO, and other styles
45

Wärmegård, Erik. "Intelligent chatbot assistant: A study of integration with VOIP and Artificial Intelligence." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-42693.

Full text
Abstract:
Development of and research on Artificial Intelligence have increased during recent years, and the field of medicine is no exception as a target audience for this modern technology. Despite new research and tools in favor of medical care, the staff is still under heavy workloads. The goal of this thesis is to analyze and propose the possibility of a chatbot that aims to ease the pressure on the medical staff while offering a guarantee that patients are being monitored. With Artificial Intelligence, VOIP, Natural Language Processing, and web development, this chatbot can communicate with a patient and act as an assistant tool that carries out preparatory work for the medical staff's decision making. The chatbot system is integrated through a web application where the administrator can initiate calls and store clients in the database. To ascertain that the system operates in real time, several tests have been carried out concerning the latency between subsystems and the quality of service.
APA, Harvard, Vancouver, ISO, and other styles
46

Fábry, Marko. "Zobrazení a analýza aktivit neuronové sítě ve skrytých vrstvách." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255399.

Full text
Abstract:
The goal of this work was to create a system capable of visualising the activation values produced by neurons in the hidden layers of neural networks used for speech recognition. The work also describes experiments comparing visualisation methods, visualisations of neural networks with different architectures, and networks trained on different types of input data. The visualisation system implemented in this work is based on previous work by Khe Chai Sim and is extended with new methods of data normalization. The Kaldi toolkit was used to prepare the neural network training data, and the CNTK framework was used for training. The core of this work, the visualisation system, was implemented in the Python scripting language.
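Collecting hidden-layer activations for this kind of visualisation is straightforward with a forward hook; the sketch below assumes PyTorch and matplotlib rather than the Kaldi/CNTK pipeline the thesis actually used, and net is a hypothetical trained acoustic model.

# Capture one hidden layer's activations over a batch of frames and plot them.
import torch
import matplotlib.pyplot as plt

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Attach a hook to one layer of a hypothetical trained model `net`.
net.hidden2.register_forward_hook(save_activation('hidden2'))
net(torch.randn(100, 440))      # 100 feature frames through the network

plt.imshow(activations['hidden2'].numpy().T, aspect='auto', origin='lower')
plt.xlabel('frame'); plt.ylabel('hidden unit'); plt.show()

Normalising these matrices per unit or per frame, the kind of choice the thesis studies, changes what structure becomes visible in the plot.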
APA, Harvard, Vancouver, ISO, and other styles
47

Chan, William. "End-to-End Speech Recognition Models." Research Showcase @ CMU, 2016. http://repository.cmu.edu/dissertations/723.

Full text
Abstract:
For the past few decades, the bane of Automatic Speech Recognition (ASR) systems has been phonemes and Hidden Markov Models (HMMs). HMMs assume conditional independence between observations, and the reliance on explicit phonetic representations requires expensive handcrafted pronunciation dictionaries. Learning is often via detached proxy problems, and there especially exists a disconnect between acoustic model performance and actual speech recognition performance. Connectionist Temporal Classification (CTC) character models were recently proposed as attempts to solve some of these issues, namely jointly learning the pronunciation model and acoustic model. However, HMM and CTC models still suffer from conditional independence assumptions and must rely heavily on language models during decoding. In this thesis, we question the traditional paradigm of ASR and highlight the limitations of HMM and CTC models. We propose a novel approach to ASR with neural attention models, and we directly optimize speech transcriptions. Our proposed method is not only an end-to-end trained system but also an end-to-end model. The end-to-end model jointly learns all the traditional components of a speech recognition system: the pronunciation model, acoustic model and language model. Our model can directly emit English/Chinese characters or even word pieces given the audio signal. There is no need for explicit phonetic representations, intermediate heuristic loss functions or conditional independence assumptions. We demonstrate our end-to-end speech recognition model on various ASR tasks. We show competitive results compared to a state-of-the-art HMM based system on the Google voice search task. We demonstrate an online end-to-end Chinese Mandarin model and show how to jointly optimize the Pinyin transcriptions during training. Finally, we also show state-of-the-art results on the Wall Street Journal ASR task compared to other end-to-end models.
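The CTC character-model baseline discussed above is easy to sketch with PyTorch's built-in loss; the shapes and vocabulary size below are illustrative.

# Sketch of CTC training over character targets: the network emits per-frame
# character distributions (including a blank), and CTC sums over alignments.
import torch
import torch.nn as nn

T, B, C = 200, 4, 30          # time steps, batch, characters (blank = 0)
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(2)
targets = torch.randint(1, C, (B, 25))                # character transcripts
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 25, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # acoustic and "pronunciation" behaviour trained jointly

The thesis's criticism is visible in this setup: each frame's distribution is emitted independently, so the model cannot condition on its own previous outputs the way an attention decoder can.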
APA, Harvard, Vancouver, ISO, and other styles
48

Knight, Stephen. "Speech intelligibility estimation via neural networks /." Online version of thesis, 1990. http://hdl.handle.net/1850/10595.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Janod, Killian. "La représentation des documents par réseaux de neurones pour la compréhension de documents parlés." Thesis, Avignon, 2017. http://www.theses.fr/2017AVIG0222/document.

Full text
Abstract:
Spoken language understanding aims to extract relevant items of meaning from the spoken signal. There are two distinct types of spoken language understanding: understanding of human/human dialogue and understanding of human/machine dialogue. The structure of the dialogues and the goal of the understanding process vary with the type of conversation. In both cases, however, automatic systems usually include a speech recognition step to generate a textual transcript of the spoken signal. Speech recognition systems in adverse conditions, even the most advanced ones, produce erroneous or partly erroneous transcripts. Those errors can be explained by the presence of information of various natures and functions, such as speaker and environment specificities, and they can have a serious adverse impact on the performance of the understanding process. The first contribution of this thesis shows that using deep autoencoders produces a more abstract latent representation of the transcript, which allows a spoken language understanding system to be more robust to automatic transcription mistakes.
In the second part, we propose two approaches that generate more robust representations by combining multiple views of a given dialogue in order to improve the results of the spoken language understanding system. The first approach combines multiple topic spaces, either simply with an autoencoder or in a latent topic space, to produce a representation that increases the effectiveness and robustness of the understanding system. The second introduces a form of supervision into the autoencoder denoising process. These experiments show that injecting transcription supervision directly into a denoising autoencoder degrades the latent representations, whereas the proposed architectures make the performance of an understanding system based on automatic transcripts comparable to one based on manual transcripts.
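A minimal sketch of the denoising-autoencoder building block the thesis works with, assuming PyTorch; the input dimensionality (e.g. a bag-of-words vector of the transcript), corruption rate, and layer sizes are illustrative.

# Denoising autoencoder: corrupt the input, reconstruct the clean version, and
# keep the latent code as a representation robust to (ASR-like) noise.
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, n_in=2000, n_latent=400):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_latent), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_latent, n_in), nn.Sigmoid())

    def forward(self, x, corruption=0.3):
        noisy = x * (torch.rand_like(x) > corruption).float()   # masking noise
        z = self.encoder(noisy)
        return self.decoder(z), z   # reconstruction and robust representation

ae = DenoisingAE()
x = torch.rand(16, 2000)            # e.g. bag-of-words vectors of transcripts
recon, latent = ae(x)
loss = nn.functional.binary_cross_entropy(recon, x)

The thesis's multi-view variants differ in what plays the role of the "clean" target and what extra views or supervision are fed to the encoder, but the training loop is of this shape.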
APA, Harvard, Vancouver, ISO, and other styles
50

Reikeras, Helge. "Audio-visual automatic speech recognition using Dynamic Bayesian Networks." Thesis, Stellenbosch : University of Stellenbosch, 2011. http://hdl.handle.net/10019.1/6777.

Full text
APA, Harvard, Vancouver, ISO, and other styles