Dissertations on the topic "Low resource language"
Browse the top 31 dissertations for research on the topic "Low resource language".
Jansson, Herman. "Low-resource Language Question Answering System with BERT." Thesis, Mittuniversitetet, Institutionen för informationssystem och –teknologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-42317.
Zhang, Yuan. "Transfer learning for low-resource natural language analysis." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/108847.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 131-142).
Expressive machine learning models such as deep neural networks are highly effective when they can be trained with large amounts of in-domain labeled training data. While such annotations may not be readily available for the target task, it is often possible to find labeled data for another related task. The goal of this thesis is to develop novel transfer learning techniques that can effectively leverage annotations in source tasks to improve performance of the target low-resource task. In particular, we focus on two transfer learning scenarios: (1) transfer across languages and (2) transfer across tasks or domains in the same language. In multilingual transfer, we tackle challenges from two perspectives. First, we show that linguistic prior knowledge can be utilized to guide syntactic parsing with little human intervention, by using a hierarchical low-rank tensor method. In both unsupervised and semi-supervised transfer scenarios, this method consistently outperforms state-of-the-art multilingual transfer parsers and the traditional tensor model across more than ten languages. Second, we study lexical-level multilingual transfer in low-resource settings. We demonstrate that only a few (e.g., ten) word translation pairs suffice for an accurate transfer for part-of-speech (POS) tagging. Averaged across six languages, our approach achieves a 37.5% improvement over the monolingual top-performing method when using a comparable amount of supervision. In the second monolingual transfer scenario, we propose an aspect-augmented adversarial network that allows aspect transfer over the same domain. We use this method to transfer across different aspects in the same pathology reports, where traditional domain adaptation approaches commonly fail. Experimental results demonstrate that our approach outperforms different baselines and model variants, yielding a 24% gain on this pathology dataset.
by Yuan Zhang.
Ph. D.
Zouhair, Taha. "Automatic Speech Recognition for low-resource languages using Wav2Vec2 : Modern Standard Arabic (MSA) as an example of a low-resource language." Thesis, Högskolan Dalarna, Institutionen för information och teknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:du-37702.
Packham, Sean. "Crowdsourcing a text corpus for a low resource language." Master's thesis, University of Cape Town, 2016. http://hdl.handle.net/11427/20436.
Lakew, Surafel Melaku. "Multilingual Neural Machine Translation for Low Resource Languages." Doctoral thesis, Università degli studi di Trento, 2020. http://hdl.handle.net/11572/257906.
Mairidan, Wushouer. "Pivot-Based Bilingual Dictionary Creation for Low-Resource Languages." 京都大学 (Kyoto University), 2015. http://hdl.handle.net/2433/199441.
Samson, Juan Sarah Flora. "Exploiting resources from closely-related languages for automatic speech recognition in low-resource languages from Malaysia." Thesis, Université Grenoble Alpes (ComUE), 2015. http://www.theses.fr/2015GREAM061/document.
Languages in Malaysia are dying at an alarming rate. As of today, 15 languages are in danger while two languages are extinct. One of the methods to save languages is by documenting them, but this is a tedious task when performed manually. An Automatic Speech Recognition (ASR) system could be a tool to help speed up the process of documenting speech from native speakers. However, building ASR systems for a target language requires a large amount of training data, as current state-of-the-art techniques are based on an empirical approach. Hence, there are many challenges in building ASR for languages that have limited data available. The main aim of this thesis is to investigate the effects of using data from closely-related languages to build ASR for low-resource languages in Malaysia. Past studies have shown that cross-lingual and multilingual methods could improve the performance of low-resource ASR. In this thesis, we try to answer several questions concerning these approaches: How do we know which language is beneficial for our low-resource language? How does the relationship between source and target languages influence speech recognition performance? Is pooling language data an optimal approach for a multilingual strategy? Our case study is Iban, an under-resourced language spoken on the island of Borneo. We study the effects of using data from Malay, a locally dominant language which is close to Iban, for developing Iban ASR under different resource constraints. We have proposed several approaches to adapt Malay data to obtain pronunciation and acoustic models for Iban speech. Building a pronunciation dictionary from scratch is time consuming, as one needs to properly define the sound units of each word in a vocabulary. We developed a semi-supervised approach to quickly build a pronunciation dictionary for Iban. It was based on bootstrapping techniques for improving Malay data to match Iban pronunciations. To increase the performance of low-resource acoustic models, we explored two acoustic modelling techniques, Subspace Gaussian Mixture Models (SGMM) and Deep Neural Networks (DNN). We performed cross-lingual strategies using both frameworks for adapting out-of-language data to Iban speech. Results show that using Malay data is beneficial for increasing the performance of Iban ASR. We also tested SGMM and DNN to improve low-resource non-native ASR. We proposed a fine merging strategy for obtaining an optimal multi-accent SGMM. In addition, we developed an accent-specific DNN using native speech data. After applying both methods, we obtained significant improvements in ASR accuracy. From our study, we observe that using SGMM and DNN for a cross-lingual strategy is effective when training data is very limited.
Tafreshi, Shabnam. "Cross-Genre, Cross-Lingual, and Low-Resource Emotion Classification." Thesis, The George Washington University, 2021. http://pqdtopen.proquest.com/#viewpdf?dispub=28088437.
Singh, Mittul [Verfasser], and Dietrich [Akademischer Betreuer] Klakow. "Handling long-term dependencies and rare words in low-resource language modelling / Mittul Singh ; Betreuer: Dietrich Klakow." Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2017. http://d-nb.info/1141677962/34.
Dyer, Andrew. "Low Supervision, Low Corpus size, Low Similarity! Challenges in cross-lingual alignment of word embeddings : An exploration of the limitations of cross-lingual word embedding alignment in truly low resource scenarios." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-395946.
Arbi, Haza Nasution. "Bilingual Lexicon Induction Framework for Closely Related Languages." Kyoto University, 2018. http://hdl.handle.net/2433/235115.
Godard, Pierre. "Unsupervised word discovery for computational language documentation." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLS062/document.
Language diversity is under considerable pressure: half of the world’s languages could disappear by the end of this century. This realization has sparked many initiatives in documentary linguistics in the past two decades, and 2019 has been proclaimed the International Year of Indigenous Languages by the United Nations, to raise public awareness of the issue and foster initiatives for language documentation and preservation. Yet documentation and preservation are time-consuming processes, and the supply of field linguists is limited. Consequently, the emerging field of computational language documentation (CLD) seeks to assist linguists by providing them with automatic processing tools. The Breaking the Unwritten Language Barrier (BULB) project, for instance, constitutes one of the efforts defining this new field, bringing together linguists and computer scientists. This thesis examines the particular problem of discovering words in an unsegmented stream of characters, or phonemes, transcribed from speech in a very-low-resource setting. This primarily involves a segmentation procedure, which can also be paired with an alignment procedure when a translation is available. Using two realistic Bantu corpora for language documentation, one in Mboshi (Republic of the Congo) and the other in Myene (Gabon), we benchmark various monolingual and bilingual unsupervised word discovery methods. We then show that using expert knowledge in the Adaptor Grammar framework can vastly improve segmentation results, and we indicate ways to use this framework as a decision tool for the linguist. We also propose a tonal variant for a strong nonparametric Bayesian segmentation algorithm, making use of a modified backoff scheme designed to capture tonal structure. To leverage the weak supervision given by a translation, we finally propose and extend an attention-based neural segmentation method, significantly improving the segmentation performance of an existing bilingual method.
Black, Kevin P. "Interactive Machine Assistance: A Case Study in Linking Corpora and Dictionaries." BYU ScholarsArchive, 2015. https://scholarsarchive.byu.edu/etd/5620.
Meftah, Sara. "Neural Transfer Learning for Domain Adaptation in Natural Language Processing." Thesis, université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG021.
Recent approaches based on end-to-end deep neural networks have revolutionised Natural Language Processing (NLP), achieving remarkable results in several tasks and languages. Nevertheless, these approaches are limited by their "gluttony" in terms of annotated data, since they rely on a supervised training paradigm, i.e. training from scratch on large amounts of annotated data. Therefore, there is a wide gap between the capabilities of NLP technologies for high-resource languages and those for the long tail of low-resourced languages. Moreover, NLP researchers have focused much of their effort on training NLP models on the news domain, due to the availability of training data. However, many research works have highlighted that models trained on news fail to work efficiently on out-of-domain data, due to their lack of robustness against domain shifts. This thesis presents a study of transfer learning approaches, through which we propose different methods to benefit from the pre-learned knowledge on the high-resourced domain to enhance the performance of neural NLP models in low-resourced settings. Precisely, we apply our approaches to transfer from the news domain to the social media domain. Indeed, despite the importance of its valuable content for a variety of applications (e.g. public security, health monitoring, or trend highlighting), this domain is still poor in terms of annotated data. We present different contributions. First, we propose two methods to transfer the knowledge encoded in the neural representations of a source model, pretrained on large labelled datasets from the source domain, to the target model, which is further adapted by fine-tuning on a few annotated examples from the target domain. The first method transfers contextualised representations pretrained in a supervised manner, while the second transfers pretrained weights, used to initialise the target model's parameters. Second, we perform a series of analyses to spot the limits of the above-mentioned proposed methods. We find that even if the proposed transfer learning approach enhances the performance on the social media domain, a hidden negative transfer may mitigate the final gain brought by transfer learning. In addition, an interpretive analysis of the pretrained model shows that pretrained neurons may be biased by what they have learned from the source domain and thus struggle to learn uncommon target-specific patterns. Third, stemming from our analysis, we propose a new adaptation scheme which augments the target model with normalised, weighted and randomly initialised neurons that yield a better adaptation while maintaining the valuable source knowledge. Finally, we propose a model that, in addition to the pre-learned knowledge from the high-resource source domain, takes advantage of various supervised NLP tasks.
Karagol-Ayan, Burcu. "Resource generation from structured documents for low-density languages." College Park, Md.: University of Maryland, 2007. http://hdl.handle.net/1903/7580.
Thesis research directed by: Dept. of Computer Science. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
Karim, Hiva. "Best way for collecting data for low-resourced languages." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-35945.
Aufrant, Lauriane. "Training parsers for low-resourced languages : improving cross-lingual transfer with monolingual knowledge." Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLS089/document.
As a result of the recent blossoming of Machine Learning techniques, the Natural Language Processing field faces an increasingly thorny bottleneck: the most efficient algorithms rely entirely on the availability of large training data. These technological advances consequently remain unavailable for the 7,000 languages in the world, most of which are low-resourced. One way to bypass this limitation is the approach of cross-lingual transfer, whereby resources available in another (source) language are leveraged to help build accurate systems in the desired (target) language. However, despite promising results in research settings, the standard transfer techniques lack the flexibility regarding cross-lingual resources needed to be fully usable in real-world scenarios: exploiting very sparse resources, or assorted arrays of resources. This limitation strongly diminishes the applicability of that approach. This thesis consequently proposes to combine multiple sources and resources for transfer, with an emphasis on selectivity: can we estimate which resource of which language is useful for which input? This strategy is put into practice in the frame of transition-based dependency parsing. To this end, a new transfer framework is designed, with a cascading architecture: it enables the desired combination, while ensuring better targeted exploitation of each resource, down to the level of the word. Empirical evaluation indeed dampens the enthusiasm for the purely cross-lingual approach -- it remains in general preferable to annotate just a few target sentences -- but also highlights its complementarity with other approaches. Several metrics are developed to precisely characterize cross-lingual similarities, syntactic idiosyncrasies, and the added value of cross-lingual information compared to monolingual training. The substantial benefits of typological knowledge are also explored. The whole study relies on a series of technical improvements regarding the parsing framework: this work includes the release of new open-source software, PanParser, which revisits the so-called dynamic oracles to extend their use cases. Several purely monolingual contributions complete this work, including an exploration of monolingual cascading, which offers promising perspectives with easy-then-hard strategies.
Fapšo, Michal. "Vyhledávání výrazů v řeči pomocí mluvených příkladů." Doctoral thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-261237.
Vu, Ngoc Thang [Verfasser]. "Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information / Ngoc Thang Vu." Aachen : Shaker, 2014. http://d-nb.info/1058315811/34.
Vu, Ngoc Thang [Verfasser], and T. [Akademischer Betreuer] Schultz. "Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information / Ngoc Thang Vu. Betreuer: T. Schultz." Karlsruhe : KIT-Bibliothek, 2014. http://d-nb.info/1051848229/34.
Susman, Derya. "Turkish Large Vocabulary Continuous Speech Recognition By Using Limited Audio Corpus." Master's thesis, METU, 2012. http://etd.lib.metu.edu.tr/upload/12614207/index.pdf.
Cecovniuc, Ioana. "¿Qué prefiere usted, pagar en metálico o con cardo?: los falsos amigos y su concienciación lingüística en las aulas plurilingües de ELE." Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/401588.
The major problems relating to false friends entail difficulties for most students in terms of pronunciation, writing and interaction in a foreign language. By false friends we mean words or structures of different languages that display morphological affinity but, at the same time, semantic divergence. Examples of (total or partial) false linguistic relations between mother tongue and target language abound. Nevertheless, practical experience with this subject is quite scarce. False friends are either omitted or reduced, in general, to lists or inventories of the most notable cases. Hence, false friends are rarely integrated in the foreign language classroom along with other teaching and learning strategies and tactics for meaningful instruction. Consequently, this Ph.D. thesis aims at providing relevant aspects of both theoretical and practical nature in relation to the typology, the uses and the didactic value of false friends as instances of interlinguistic influence. In conjunction therewith, the contrastive perspective makes available the epistemological elements of the present analysis: speakers are naturally prone to transfer structures and meanings, and the distributions of these structures and meanings, from their mother tongue or their first foreign language to another foreign language they are learning. This basic statement is analyzed by means of Contrastive Linguistics, which includes psycholinguistic and educational postulates of two complementary methodologies: Contrastive Analysis and Error Analysis. In addition, the principles that define Plurilingualism are considered by investigating recent studies regarding interlinguistic influences when learning a foreign language. In light of the theoretical assumptions, interferences result from the relationship between the mother tongue and the foreign language. However, learning two or more foreign languages is a worldwide tendency nowadays, so the delimitations in the learning processes cannot be drawn between the mother tongue and the target language exclusively. Thus, a first foreign language can be a factor that positively or negatively influences the learning of a second foreign language. In this regard, the present thesis is an attempt to cover different aspects of the topic of false friends when Spanish (the second foreign language to learn) comes into contact with English (first foreign language) and Romanian or Dutch (mother tongues).
False friends are one of the main difficulties faced by learners who seek to speak, write and interact in a foreign language. A false friend is a word or structure in a foreign language that resembles, in written or oral form, a word or structure in the speaker's mother tongue but has a different meaning. Examples of (total or partial) false linguistic relations between mother tongue and target language abound. Nevertheless, practical experience with this subject is quite scarce. False friends are either omitted or reduced, in general, to lists or inventories of the most notable cases and, therefore, are not integrated in the classroom along with other teaching and learning strategies and tactics for meaningful instruction. Consequently, this doctoral thesis aims to contribute relevant aspects, both theoretical and practical, regarding the typology, the uses, and the didactic usefulness and interest in ELE classrooms of false friends as instances of interlinguistic interference. In connection with this, the contrastive perspective supplies the epistemological elements of the analysis: speakers are naturally prone to transfer structures and meanings, and the distributions of these structures and meanings, from their mother tongue or their first learned language to another foreign language they are learning. This basic assumption defines the field of application of Contrastive Linguistics, which encompasses psycholinguistic and pedagogical postulates of two complementary methodologies: Contrastive Analysis and Error Analysis. The principles of Plurilingualism are also assessed by examining the latest developments on interlinguistic influences when learning a foreign language. In light of the theoretical premises, interferences are an effect of the relationship between the mother tongue and the foreign language being studied. On the other hand, learning two or more foreign languages is nowadays a worldwide tendency, so the delimitation in learning processes does not occur exclusively between the mother tongue and the target language. Thus, a first foreign language already learned can be a factor that positively or negatively influences a second or third foreign language being learned. In this sense, this research focuses on false friends when Spanish (second foreign language) comes into contact with English (first foreign language) and Romanian or Dutch (mother tongues).
Pronto, Lindon N. "Exploring German and American Modes of Pedagogical and Institutional Sustainability: Forging a Way into the Future." Scholarship @ Claremont, 2012. http://scholarship.claremont.edu/pitzer_theses/21.
Farra, Noura. "Cross-Lingual and Low-Resource Sentiment Analysis." Thesis, 2019. https://doi.org/10.7916/d8-x3b7-1r92.
Zhang, Xiao. "Flexible Structured Prediction in Natural Language Processing with Partially Annotated Corpora." Thesis, 2020.
Kamran, Amir. "Hybrid Machine Translation Approaches for Low-Resource Languages." Master's thesis, 2011. http://www.nusl.cz/ntk/nusl-313015.
Wang, Haipeng. "Query-by-example spoken term detection for low-resource languages." Thesis, Chinese University of Hong Kong, 2014. http://repository.lib.cuhk.edu.hk/en/item/cuhk-1290682.
The framework of acoustic segment modeling (ASM) is adopted for unsupervised training of a speech tokenizer. Three novel algorithms are developed for segment labeling in the ASM framework. The proposed algorithms are based on the use of different class-by-segment posterior representations and spectral clustering techniques. The posterior representations are shown to be more robust than conventional spectral representations. Spectral clustering has achieved significant success in many applications. Reformulations of spectral clustering algorithms are made to make them computationally feasible for clustering a large number of speech segments. Experiments on a multilingual speech database demonstrate the advantage of the proposed algorithms over existing approaches.
The speech tokenizer obtained with ASM is applied to query-by-example spoken term detection (QbyE STD). The detection of spoken queries is based on a frame-based template matching framework. The ASM tokenizer serves as the front-end to generate posterior features, which are used for temporal template matching by dynamic time warping (DTW). Experiments show that the ASM tokenizer outperforms a GMM tokenizer and language-mismatched phoneme recognizers. Moreover, a two-step approach is proposed for efficient search.
The frame-based template matching framework for QbyE STD is enhanced in three ways. A novel DTW matrix combination approach is proposed for the fusion of multiple systems with different posterior features. Pseudo-relevance feedback is used for query expansion, and score normalization is applied to calibrate the score distributions of different query terms. Experimental results show that the performance of the QbyE STD system is significantly improved by the three approaches.
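Spoken term detection is a technique for locating a given term in a large speech database. It is of great value both in academic research and in practical applications. Conventional research on spoken term detection has mainly targeted resource-rich languages. This thesis studies spoken term detection for low-resource languages. Under the conditions assumed in this thesis, the target language does not have sufficient resources to train a speech recognition system, and the query terms are given in the form of spoken examples.
This thesis adopts the acoustic segment modeling (ASM) framework for unsupervised training of a speech tokenizer. We propose three new methods for clustering speech segments in the ASM framework. Our methods are based on a new, robust speech segment feature and use spectral clustering. Experiments show that our methods outperform three other commonly used baseline methods and achieve better modeling results.
We apply the ASM tokenizer to a template-matching-based spoken term detection system. In this system, the ASM tokenizer is treated as a front-end feature transformation module for extracting posterior features. To improve detection efficiency, we also propose a two-step detection method. Experimental results show that our method achieves high detection accuracy.
To further improve detection accuracy, this thesis optimizes the template-matching-based spoken term detection system from three angles. First, we propose performing system fusion on the distance matrices of dynamic time warping. Second, we propose using pseudo-relevance feedback to obtain more query examples. Finally, we normalize the system scores, which makes it easier to set a uniform score threshold. Experimental results show that all three methods effectively improve the performance of the spoken term detection system.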
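The abstract above describes matching a spoken query against speech by dynamic time warping over frame-level posterior features. As a rough illustration only, and not the thesis's actual system, the sketch below aligns two complete posterior sequences. The function names, the negative-log inner-product frame distance, and the length normalisation are all assumptions made for this example; a real detector would also search for the query inside a longer utterance rather than align whole sequences.

```python
import math


def frame_distance(p, q):
    """Distance between two posterior vectors (assumed: -log of their inner product)."""
    dot = sum(a * b for a, b in zip(p, q))
    # Floor the inner product so that orthogonal posteriors give a large finite cost.
    return -math.log(max(dot, 1e-10))


def dtw_cost(query, utterance):
    """Length-normalised DTW alignment cost between two posterior sequences.

    Both arguments are lists of frames; each frame is a list of class
    posteriors. Lower cost means the two sequences match better.
    """
    n, m = len(query), len(utterance)
    inf = float("inf")
    # acc[i][j] holds the best accumulated cost of aligning the first i
    # query frames with the first j utterance frames.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], utterance[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / (n + m)
```

With this setup, a time-stretched copy of the query aligns at near-zero cost, while a sequence with the classes in the opposite order scores much higher. Thresholding the normalised cost is one simple way to turn such a score into a detection decision.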
Wang, Haipeng.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2014.
Includes bibliographical references (leaves 110-127).
Abstracts also in Chinese.
Title from PDF title page (viewed on 05, December, 2016).
Detailed summary in vernacular field only.
Cooper, Erica Lindsay. "Text-to-Speech Synthesis Using Found Data for Low-Resource Languages." Thesis, 2019. https://doi.org/10.7916/d8-vdzp-j870.
Lozano, Argüelles Cristina. "Formación y uso de la tecnología de los profesores de escuelas de inmersión en español." Thesis, 2014. http://hdl.handle.net/1805/6037.
The purpose of this research is to examine how Spanish teachers use technology and the training they have received to integrate ICT into their classes. Specifically, we are interested in their attitude toward technology and their level of confidence with it, which resources they have available and which they use in their classes, how they learn to use them (formally and informally), what problems they perceive, and how they would like to improve the integration of technology in their classes. The study focuses on a group of Spanish immersion schools in the states of Indiana, Kentucky and Ohio.
Michell, Colin Simon. "Investigating the use of forensic stylistic and stylometric techniques in the analyses of authorship on a publicly accessible social networking site (Facebook)." Diss., 2013. http://hdl.handle.net/10500/13324.
Linguistics
MA (Linguistics)
Mahabeer, Sandhya D. "Barriers in acquiring basic English reading and spelling skills by Zulu-speaking foundation phase learners." Diss., 2003. http://hdl.handle.net/10500/1166.
Educational Studies
M. Ed. (Guidance & Counselling)