Academic literature on the topic 'DeepSpeech'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'DeepSpeech.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "DeepSpeech"

1

Messaoudi, Abir, Hatem Haddad, Chayma Fourati, Moez BenHaj Hmida, Aymen Ben Elhaj Mabrouk, and Mohamed Graiet. "Tunisian Dialectal End-to-end Speech Recognition based on DeepSpeech." Procedia Computer Science 189 (2021): 183–90. http://dx.doi.org/10.1016/j.procs.2021.05.082.

2

Galatang, Danny Henry, and Suyanto Suyanto. "Syllable-Based Indonesian Automatic Speech Recognition." International Journal on Electrical Engineering and Informatics 12, no. 4 (December 31, 2020): 720–28. http://dx.doi.org/10.15676/ijeei.2020.12.4.2.

Abstract:
The syllable-based automatic speech recognition (ASR) systems commonly perform better than the phoneme-based ones. This paper focuses on developing an Indonesian monosyllable-based ASR (MSASR) system using an ASR engine called SPRAAK and comparing it to a phoneme-based one. The Mozilla DeepSpeech-based end-to-end ASR (MDS-E2EASR), one of the state-of-the-art models based on characters (similar to the phoneme-based model), is also investigated to confirm the result. Besides, a novel Kaituoxu SpeechTransformer (KST) E2EASR is also examined. Testing on the Indonesian speech corpus of 5,439 words shows that the proposed MSASR produces much higher word accuracy (76.57%) than the monophone-based one (63.36%). Its performance is comparable to that of the character-based MDS-E2EASR, which produces 76.90%, and the character-based KST-E2EASR (78.00%). In the future, this monosyllable-based ASR could be extended to a bisyllable-based one to give higher word accuracy. Nevertheless, extensive bisyllable acoustic models must be handled using an advanced method.
3

Vazhenina, Daria, and Konstantin Markov. "End-to-End Noisy Speech Recognition Using Fourier and Hilbert Spectrum Features." Electronics 9, no. 7 (July 17, 2020): 1157. http://dx.doi.org/10.3390/electronics9071157.

Abstract:
Despite the progress of deep neural networks over the last decade, state-of-the-art speech recognizers in noisy conditions are still far from reaching satisfactory performance. Methods to improve noise robustness usually include adding components to the recognition system that often need optimization. For this reason, data augmentation of the input features derived from the Short-Time Fourier Transform (STFT) has become a popular approach. However, for many speech processing tasks, there is evidence that combining STFT-based and Hilbert–Huang transform (HHT)-based features improves the overall performance. The Hilbert spectrum can be obtained using adaptive mode decomposition (AMD) techniques, which are noise-robust and suitable for non-linear and non-stationary signal analysis. In this study, we developed a DeepSpeech2-based recognition system by adding a combination of STFT and HHT spectrum-based features. We propose several ways to combine those features at different levels of the neural network. All evaluations were performed using the WSJ and CHiME-4 databases. Experimental results show that combining STFT and HHT spectra leads to a 5–7% relative improvement in noisy speech recognition.
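As a rough illustration of the kind of early feature fusion described in this abstract, the minimal Python sketch below stacks log-magnitude STFT features with a framed Hilbert-envelope feature. It is not the authors' pipeline: the paper derives the Hilbert spectrum via adaptive mode decomposition, whereas this sketch uses a plain analytic-signal envelope, and the frame and hop sizes are assumptions for illustration.

# Sketch: concatenate STFT-based and Hilbert-envelope-based features for one utterance.
import numpy as np
from scipy.signal import stft, hilbert

def combined_features(signal, sr=16000, frame=400, hop=160):
    # Log-magnitude STFT features (frames x bins)
    _, _, Z = stft(signal, fs=sr, nperseg=frame, noverlap=frame - hop)
    log_stft = np.log1p(np.abs(Z)).T

    # Instantaneous-amplitude envelope of the analytic signal,
    # averaged over the same frame grid (frames x 1)
    env = np.abs(hilbert(signal))
    n_frames = log_stft.shape[0]
    env = np.resize(env, n_frames * hop).reshape(n_frames, hop).mean(axis=1, keepdims=True)

    # Early fusion: concatenate along the feature axis
    return np.concatenate([log_stft, env], axis=1)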
4

Coto-Solano, Rolando, James N. Stanford, and Sravana K. Reddy. "Advances in Completely Automated Vowel Analysis for Sociophonetics: Using End-to-End Speech Recognition Systems With DARLA." Frontiers in Artificial Intelligence 4 (September 24, 2021). http://dx.doi.org/10.3389/frai.2021.662097.

Abstract:
In recent decades, computational approaches to sociophonetic vowel analysis have been steadily increasing, and sociolinguists now frequently use semi-automated systems for phonetic alignment and vowel formant extraction, including FAVE (Forced Alignment and Vowel Extraction, Rosenfelder et al., 2011; Evanini et al., Proceedings of Interspeech, 2009), Penn Aligner (Yuan and Liberman, J. Acoust. Soc. America, 2008, 123, 3878), and DARLA (Dartmouth Linguistic Automation) (Reddy and Stanford, DARLA Dartmouth Linguistic Automation: Online Tools for Linguistic Research, 2015a). Yet these systems still have a major bottleneck: manual transcription. For most modern sociolinguistic vowel alignment and formant extraction, researchers must first create manual transcriptions. This human step is painstaking, time-consuming, and resource intensive. If this manual step could be replaced with completely automated methods, sociolinguists could potentially tap into vast datasets that have previously been unexplored, including legacy recordings that are underutilized due to lack of transcriptions. Moreover, if sociolinguists could quickly and accurately extract phonetic information from the millions of hours of new audio content posted on the Internet every day, a virtual ocean of speech from newly created podcasts, videos, live-streams, and other audio content would now inform research. How close are the current technological tools to achieving such groundbreaking changes for sociolinguistics? Prior work (Reddy et al., Proceedings of the North American Association for Computational Linguistics 2015 Conference, 2015b, 71–75) showed that an HMM-based Automated Speech Recognition system, trained with CMU Sphinx (Lamere et al., 2003), was accurate enough for DARLA to uncover evidence of the US Southern Vowel Shift without any human transcription. Even so, because that automatic speech recognition (ASR) system relied on a small training set, it produced numerous transcription errors. Six years have passed since that study, and since that time numerous end-to-end automatic speech recognition (ASR) algorithms have shown considerable improvement in transcription quality. One example of such a system is the RNN/CTC-based DeepSpeech from Mozilla (Hannun et al., 2014). (RNN stands for recurrent neural networks, the learning mechanism for DeepSpeech. CTC stands for connectionist temporal classification, the mechanism to merge phones into words). The present paper combines DeepSpeech with DARLA to push the technological envelope and determine how well contemporary ASR systems can perform in completely automated vowel analyses with sociolinguistic goals. Specifically, we used these techniques on audio recordings from 352 North American English speakers in the International Dialects of English Archive (IDEA), extracting 88,500 tokens of vowels in stressed position from spontaneous, free speech passages. With this large dataset we conducted acoustic sociophonetic analyses of the Southern Vowel Shift and the Northern Cities Chain Shift in the North American IDEA speakers. We compared the results using three different sources of transcriptions: 1) IDEA’s manual transcriptions as the baseline “ground truth”, 2) the ASR built on CMU Sphinx used by Reddy et al. (Proceedings of the North American Association for Computational Linguistics 2015 Conference, 2015b, 71–75), and 3) the latest publicly available Mozilla DeepSpeech system.
We input these three different transcriptions to DARLA, which automatically aligned and extracted the vowel formants from the 352 IDEA speakers. Our quantitative results show that newer ASR systems like DeepSpeech show considerable promise for sociolinguistic applications like DARLA. We found that DeepSpeech’s automated transcriptions had a significantly lower character error rate than those from the prior Sphinx system (from 46% to 35%). When we performed the sociolinguistic analysis of the extracted vowel formants from DARLA, we found that the automated transcriptions from DeepSpeech matched the results from the ground truth for the Southern Vowel Shift (SVS): five vowels showed a shift in both transcriptions, and two vowels didn’t show a shift in either transcription. The Northern Cities Shift (NCS) was more difficult to detect, but ground truth and DeepSpeech matched for four vowels: One of the vowels showed a clear shift, and three showed no shift in either transcription. Our study therefore shows how technology has made progress toward greater automation in vowel sociophonetics, while also showing what remains to be done. Our statistical modeling provides a quantified view of both the abilities and the limitations of a completely “hands-free” analysis of vowel shifts in a large dataset. Naturally, when comparing a completely automated system against a semi-automated system involving human manual work, there will always be a tradeoff between accuracy on the one hand versus speed and replicability on the other hand [Kendall and Joseph, Towards best practices in sociophonetics (with Marianna DiPaolo), 2014]. The amount of “noise” that can be tolerated for a given study will depend on the particular research goals and researchers’ preferences. Nonetheless, our study shows that, for certain large-scale applications and research goals, a completely automated approach using publicly available ASR can produce meaningful sociolinguistic results across large datasets, and these results can be generated quickly, efficiently, and with full replicability.
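For readers unfamiliar with the character error rate (CER) metric used in the comparison above, a minimal, generic sketch of how CER is typically computed (character-level Levenshtein edit distance normalized by reference length) follows; it is not the DARLA evaluation code.

# Minimal character error rate (CER): edit distance between reference and
# hypothesis transcriptions, divided by the reference length.
def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    prev = list(range(len(h) + 1))        # dynamic-programming row
    for i, rc in enumerate(r, start=1):
        curr = [i] + [0] * len(h)
        for j, hc in enumerate(h, start=1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (rc != hc))  # substitution
        prev = curr
    return prev[len(h)] / max(len(r), 1)

# Example: cer("the cat sat", "the cat sad") is 1/11, roughly 0.09.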
5

Hjortnæs, Nils, Niko Partanen, Michael Rießler, and Francis M. Tyers. "The Relevance of the Source Language in Transfer Learning for ASR." Proceedings of the Workshop on Computational Methods for Endangered Languages 1, no. 2 (2021). http://dx.doi.org/10.33011/computel.v1i.959.

Abstract:
This study presents new experiments on Zyrian Komi speech recognition. We use DeepSpeech to train ASR models from a language documentation corpus that contains both contemporary and archival recordings. Earlier studies have shown that transfer learning from English and using a domain-matching Komi language model both improve the CER and WER. In this study we experiment with transfer learning from a more relevant source language, Russian, and with including Russian text in the language model construction. The motivation for this is that Russian and Komi are contemporary contact languages, and Russian is regularly present in the corpus. We found that despite the close contact of Russian and Komi, the larger English speech corpus yielded better performance when used as the source language. Additionally, we can report that an update to the DeepSpeech version alone already improved the CER by 3.9% compared with the earlier studies, which is an important step in the development of Komi ASR.
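As a rough sketch of what such a transfer-learning run looks like in practice, the snippet below drives the Mozilla DeepSpeech training script from Python. The flag names follow the DeepSpeech (v0.9.x) transfer-learning documentation as I recall it, and all file paths are placeholders; this is not the authors' training setup.

# Sketch: transfer learning with the DeepSpeech training script (flags and
# paths are assumptions; check the documentation of your DeepSpeech version).
import subprocess

subprocess.run([
    "python3", "DeepSpeech.py",
    "--train_files", "komi_train.csv",
    "--dev_files", "komi_dev.csv",
    "--test_files", "komi_test.csv",
    "--alphabet_config_path", "komi_alphabet.txt",
    # Start from a source-language checkpoint and re-initialize the output
    # layer so the target alphabet can differ from the source alphabet.
    "--load_checkpoint_dir", "source_checkpoint/",
    "--drop_source_layers", "1",
    "--epochs", "30",
], check=True)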
6

Bausch, Johannes, Sathyawageeswar Subramanian, and Stephen Piddock. "A quantum search decoder for natural language processing." Quantum Machine Intelligence 3, no. 1 (April 30, 2021). http://dx.doi.org/10.1007/s42484-021-00041-1.

Abstract:
Probabilistic language models, e.g. those based on recurrent neural networks such as long short-term memory models (LSTMs), often face the problem of finding a high probability prediction from a sequence of random variables over a set of tokens. This is commonly addressed using a form of greedy decoding such as beam search, where a limited number of highest-likelihood paths (the beam width) of the decoder are kept, and at the end the maximum-likelihood path is chosen. In this work, we construct a quantum algorithm to find the globally optimal parse (i.e. for infinite beam width) with high constant success probability. When the input to the decoder follows a power law with exponent k > 0, our algorithm has runtime R^{nf(R, k)}, where R is the alphabet size and n the input length; here f < 1/2, and f → 0 exponentially fast with increasing k, hence making our algorithm always more than quadratically faster than its classical counterpart. We further modify our procedure to recover a finite beam width variant, which enables an even stronger empirical speedup while still retaining higher accuracy than possible classically. Finally, we apply this quantum beam search decoder to Mozilla’s implementation of Baidu’s DeepSpeech neural net, which we show to exhibit such a power-law word rank-frequency distribution.
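As context for the decoding problem this paper speeds up, a minimal classical beam search over per-timestep token probabilities might look like the sketch below. It is illustrative only: the quantum decoder in the paper, and DeepSpeech's actual CTC beam search with a language model, are substantially more involved.

# Sketch: classical beam search over a (timesteps x alphabet) matrix of
# per-step log-probabilities; keeps the beam_width best prefixes per step.
import numpy as np

def beam_search(log_probs, beam_width=8):
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for step in log_probs:
        candidates = [(seq + (tok,), score + step[tok])
                      for seq, score in beams
                      for tok in range(len(step))]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Example: 3 timesteps over a 4-symbol alphabet
best_seq, best_score = beam_search(np.log(np.random.dirichlet(np.ones(4), size=3)))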

Dissertations / Theses on the topic "DeepSpeech"

1

Laryea, Joycelyn, and Nipunika Jayasundara. "Automatic Speech Recognition System for Somali in the interest of reducing Maternal Morbidity and Mortality." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-34436.

Abstract:
Developing an Automatic Speech Recognition (ASR) system for the Somali language, though not novel, has not been actively explored; hence there has been no successful model for conversational speech, nor are related works accessible as open source. The unavailability of digital data is what labels Somali a low-resource language and poses the greatest impediment to the development of an ASR system for Somali. The incentive to develop an ASR system for the Somali language is to contribute to reducing the Maternal Mortality Rate (MMR) in Somalia. Researchers acquire interview audio data regarding maternal health and behaviour in the Somali language; to engage the relevant stakeholders and bring about the needed change, these recordings must be transcribed into text, which is an important step towards translation into any language. This work investigates available ASR for Somali and attempts to develop a prototype ASR system to convert Somali audio into Somali text. To achieve this target, we first identified the available open-source systems for speech recognition and selected the DeepSpeech engine for the implementation of the prototype. With three hours of audio data, the transcription accuracy is not yet adequate, and the system cannot be deployed for use. We attribute this to insufficient training data and estimate that acquiring about 1,200 hours of audio to train the DeepSpeech engine would make the effort towards a Somali ASR far more viable.

Conference papers on the topic "DeepSpeech"

1

Xu, Jiahua, Kaveen Matta, Shaiful Islam, and Andreas Nürnberger. "German Speech Recognition System using DeepSpeech." In NLPIR 2020: 4th International Conference on Natural Language Processing and Information Retrieval. New York, NY, USA: ACM, 2020. http://dx.doi.org/10.1145/3443279.3443313.

2

de Morais, Rene Avalloni, and Baidya Nath Saha. "End-to-End Speech Recognition Using Recurrent Neural Network (RNN)." In Intelligent Computing and Technologies Conference. AIJR Publisher, 2021. http://dx.doi.org/10.21467/proceedings.115.20.

Abstract:
Deep learning algorithms have made dramatic progress in the areas of natural language processing and automatic human speech recognition. However, the accuracy of deep learning algorithms depends on the amount and quality of the data, and training deep models requires high-performance computing resources. In this backdrop, this paper addresses an end-to-end speech recognition system in which we fine-tune the Mozilla DeepSpeech architecture using two different datasets: the LibriSpeech clean dataset and the Harvard speech dataset. We train Long Short-Term Memory (LSTM) based deep Recurrent Neural Network (RNN) models on the Google Colab platform and use its GPU resources. Extensive experimental results demonstrate that the Mozilla DeepSpeech model can be fine-tuned on different audio datasets to recognize speech successfully.
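For reference, running inference with a trained or fine-tuned model through the released deepspeech Python package looks roughly like the sketch below. The model and scorer file names are placeholders, and the fine-tuning itself is done with the separate DeepSpeech training scripts, not this inference API.

# Sketch: inference with the `deepspeech` Python package (v0.9.x API).
# Audio must be 16-bit PCM at the model's sample rate (16 kHz for the
# released English models); file names below are placeholders.
import wave
import numpy as np
import deepspeech

model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

with wave.open("utterance.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(model.stt(audio))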
3

Tripathi, Ayush, Swapnil Bhosale, and Sunil Kumar Kopparapu. "Improved Speaker Independent Dysarthria Intelligibility Classification Using Deepspeech Posteriors." In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020. http://dx.doi.org/10.1109/icassp40776.2020.9054492.

4

Rasley, Jeff, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. "DeepSpeed." In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2020. http://dx.doi.org/10.1145/3394486.3406703.
