Dissertations on the topic "Low resource language"

Consult the top 31 dissertations for your research on the topic "Low resource language".

1

Jansson, Herman. "Low-resource Language Question Answering System with BERT." Thesis, Mittuniversitetet, Institutionen för informationssystem och –teknologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-42317.

Abstract:
The complexity of staying at the forefront of information retrieval systems is constantly increasing. BERT, a recent natural language processing technology, has reached superhuman performance on reading comprehension tasks in high-resource languages. However, several researchers have stated that multilingual models are not enough for low-resource languages, since they lack a thorough understanding of those languages. Recently, a Swedish pre-trained BERT model has been introduced which is trained on significantly more Swedish data than the multilingual models currently available. This study compares multilingual and Swedish monolingual BERT-derived models for question answering, fine-tuning them on both an English and a machine-translated Swedish SQuADv2 data set. The models are evaluated on the SQuADv2 benchmark and within an implemented question answering system built upon the classical retriever-reader methodology. This study introduces a naive and a more robust prediction method for the proposed question answering system, as well as finding a sweet spot for each individual model approach integrated into the system. The question answering system is evaluated and compared against another question answering library at the leading edge of the area, using a custom-crafted Swedish evaluation data set. The results show that the model fine-tuned from the Swedish pre-trained model on the Swedish SQuADv2 data set was superior in all evaluation metrics except speed. The comparison between the different systems resulted in a higher evaluation score but a slower prediction time for this study's system.
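
A minimal sketch of the reader step in the retriever-reader setup described above, using the Hugging Face transformers question-answering pipeline. The checkpoint path is hypothetical and stands in for a Swedish pre-trained BERT fine-tuned on a machine-translated Swedish SQuADv2, as in the thesis.

```python
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="path/to/swedish-bert-finetuned-squadv2",  # assumption, not a real model id
)

result = qa(
    question="Vilket år grundades universitetet?",
    context="Mittuniversitetet är ett svenskt universitet som grundades 2005.",
)
print(result["answer"], result["score"])  # predicted answer span and its confidence
```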
2

Zhang, Yuan Ph D. Massachusetts Institute of Technology. "Transfer learning for low-resource natural language analysis." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/108847.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 131-142).
Expressive machine learning models such as deep neural networks are highly effective when they can be trained with large amounts of in-domain labeled training data. While such annotations may not be readily available for the target task, it is often possible to find labeled data for another related task. The goal of this thesis is to develop novel transfer learning techniques that can effectively leverage annotations in source tasks to improve performance of the target low-resource task. In particular, we focus on two transfer learning scenarios: (1) transfer across languages and (2) transfer across tasks or domains in the same language. In multilingual transfer, we tackle challenges from two perspectives. First, we show that linguistic prior knowledge can be utilized to guide syntactic parsing with little human intervention, by using a hierarchical low-rank tensor method. In both unsupervised and semi-supervised transfer scenarios, this method consistently outperforms state-of-the-art multilingual transfer parsers and the traditional tensor model across more than ten languages. Second, we study lexical-level multilingual transfer in low-resource settings. We demonstrate that only a few (e.g., ten) word translation pairs suffice for an accurate transfer for part-of-speech (POS) tagging. Averaged across six languages, our approach achieves a 37.5% improvement over the monolingual top-performing method when using a comparable amount of supervision. In the second monolingual transfer scenario, we propose an aspect-augmented adversarial network that allows aspect transfer over the same domain. We use this method to transfer across different aspects in the same pathology reports, where traditional domain adaptation approaches commonly fail. Experimental results demonstrate that our approach outperforms different baselines and model variants, yielding a 24% gain on this pathology dataset.
by Yuan Zhang.
Ph. D.
3

Zouhair, Taha. "Automatic Speech Recognition for low-resource languages using Wav2Vec2 : Modern Standard Arabic (MSA) as an example of a low-resource language." Thesis, Högskolan Dalarna, Institutionen för information och teknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:du-37702.

Abstract:
The need for fully automatic translation at DigitalTolk, a Stockholm-based company providing translation services, led to exploring Automatic Speech Recognition as a first step for Modern Standard Arabic (MSA). Facebook AI recently released a second version of its Wav2Vec models, dubbed Wav2Vec 2.0, which uses deep neural networks and provides several English pretrained models along with a multilingual model trained on 53 different languages, referred to as the Cross-Lingual Speech Representation (XLSR-53). The small English pretrained model and XLSR-53 are tested on the Arabic data from Mozilla Common Voice, and the results stemming from them are discussed. In this research, the small model did not yield any results and may have needed more unlabelled data to train, whereas the large model proved successful in predicting the audio recordings in Arabic, achieving an unprecedented Word Error Rate of 24.40%. The small model turned out to be unsuitable for training, especially on languages other than English for which unlabelled data is insufficient. The large model, on the other hand, gave very promising results despite the low amount of data, and should be the model of choice for any future training on low-resource languages such as Arabic.
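
A minimal sketch of the evaluation implied above: transcribing an Arabic recording with a Wav2Vec2 CTC model and scoring it with Word Error Rate. The checkpoint path is hypothetical (it stands in for XLSR-53 after fine-tuning on Common Voice Arabic), and the reference transcript is illustrative.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from jiwer import wer

ckpt = "path/to/xlsr53-finetuned-arabic"  # assumption, not a real model id
processor = Wav2Vec2Processor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt)

waveform, sr = torchaudio.load("clip.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000)  # model expects 16 kHz

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
hypothesis = processor.batch_decode(torch.argmax(logits, dim=-1))[0]

print(hypothesis)
print("WER:", wer("النص المرجعي هنا", hypothesis))  # illustrative reference transcript
```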
4

Packham, Sean. "Crowdsourcing a text corpus for a low resource language." Master's thesis, University of Cape Town, 2016. http://hdl.handle.net/11427/20436.

Abstract:
Low-resourced languages, such as South Africa's isiXhosa, have a limited number of digitised texts, making it challenging to build language corpora and the information retrieval services, such as search and translation, that depend on them. Researchers have been unable to assemble isiXhosa corpora of sufficient size and quality to produce working machine translation systems, and it has been acknowledged that there is little to no training data and that sourcing translations from professionals can be a costly process. A crowdsourcing translation game which paid participants for their contributions was proposed as a solution to source original and relevant parallel corpora for low-resource languages such as isiXhosa. The objective of this dissertation is to report on the four experiments that were conducted to assess user motivation and contribution quantity under various scenarios using the developed crowdsourcing translation game. The first experiment was a pilot study to test a custom-built system and to find out whether social network users would volunteer to participate in a translation game for free. The second experiment tested multiple payment schemes with users from the University of Cape Town; the schemes rewarded users with consistent, increasing or decreasing amounts for subsequent contributions. Experiment 3 tested whether the same users from Experiment 2 would continue contributing if payments were taken away. The last experiment tested a payment scheme that did not offer a direct and guaranteed reward: users were paid based on their leaderboard placement, and only a limited number of the top leaderboard spots were allocated rewards. Experiments 1 and 3 showed that people do not volunteer without financial incentives; Experiments 2 and 4 showed that people want increased rewards when putting in increased effort; Experiment 3 also showed that people will not continue contributing if the financial incentives are taken away; and Experiment 4 showed that the possibility of incentives is as attractive as offering guaranteed incentives.
5

Lakew, Surafel Melaku. "Multilingual Neural Machine Translation for Low Resource Languages." Doctoral thesis, Università degli studi di Trento, 2020. http://hdl.handle.net/11572/257906.

Abstract:
Machine Translation (MT) is the task of mapping a source language to a target language. The recent introduction of neural MT (NMT) has shown promising results for high-resource languages; however, it performs poorly in low-resource language (LRL) settings. Furthermore, the vast majority of the 7,000+ languages around the world do not have parallel data, creating a zero-resource language (ZRL) scenario. In this thesis, we present our approach to improving NMT for LRLs and ZRLs by leveraging multilingual NMT modeling (M-NMT), an approach that allows building a single NMT system to translate across multiple source and target languages. This thesis i) analyzes the effectiveness of M-NMT for LRL and ZRL translation tasks, spanning two NMT architectures (Recurrent and Transformer), ii) presents a self-learning approach for improving the zero-shot translation directions of ZRLs, iii) proposes a dynamic transfer-learning approach from a pre-trained (parent) model to an LRL (child) model by tailoring to the vocabulary entries of the latter, iv) extends M-NMT to translate from a source language to specific language varieties (e.g. dialects), and finally v) proposes an approach that can control the verbosity of an NMT model's output. Our experimental findings show the effectiveness of the proposed approaches in improving NMT for LRLs and ZRLs.
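
A minimal sketch of the standard mechanism behind single-model multilingual NMT, which M-NMT builds on: a token naming the desired target language is prepended to every source sentence, so one encoder-decoder can serve many translation directions, including zero-shot ones. The tag format and training pairs are illustrative.

```python
def tag_source(sentence: str, target_lang: str) -> str:
    """Prefix a source sentence with a target-language token such as <2it>."""
    return f"<2{target_lang}> {sentence}"

# Illustrative training pairs: (source sentence, target language code, reference).
training_pairs = [
    ("Good morning", "it", "Buongiorno"),
    ("How are you?", "fr", "Comment allez-vous ?"),
]

for src, tgt_lang, tgt in training_pairs:
    print(tag_source(src, tgt_lang), "->", tgt)
# At inference time, tagging a direction never seen in training (e.g. <2fr>
# on Italian input) is what enables zero-shot translation in such systems.
```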
6

Mairidan, Wushouer. "Pivot-Based Bilingual Dictionary Creation for Low-Resource Languages." 京都大学 (Kyoto University), 2015. http://hdl.handle.net/2433/199441.

7

Samson, Juan Sarah Flora. "Exploiting resources from closely-related languages for automatic speech recognition in low-resource languages from Malaysia." Thesis, Université Grenoble Alpes (ComUE), 2015. http://www.theses.fr/2015GREAM061/document.

Abstract:
Languages in Malaysia are dying at an alarming rate. As of today, 15 languages are in danger while two languages are extinct. One of the methods to save languages is to document them, but this is a tedious task when performed manually. An Automatic Speech Recognition (ASR) system could be a tool to help speed up the process of documenting speech from native speakers. However, building ASR systems for a target language requires a large amount of training data, as current state-of-the-art techniques are based on empirical approaches. Hence, there are many challenges in building ASR for languages that have limited data available. The main aim of this thesis is to investigate the effects of using data from closely-related languages to build ASR for low-resource languages in Malaysia. Past studies have shown that cross-lingual and multilingual methods can improve the performance of low-resource ASR. In this thesis, we try to answer several questions concerning these approaches: How do we know which language is beneficial for our low-resource language? How does the relationship between source and target languages influence speech recognition performance? Is pooling language data an optimal approach for a multilingual strategy? Our case study is Iban, an under-resourced language spoken on the island of Borneo. We study the effects of using data from Malay, a local dominant language which is close to Iban, for developing Iban ASR under different resource constraints, and we propose several approaches to adapt Malay data to obtain pronunciation and acoustic models for Iban speech. Building a pronunciation dictionary from scratch is time-consuming, as one needs to properly define the sound units of each word in a vocabulary, so we developed a semi-supervised approach to quickly build a pronunciation dictionary for Iban, based on bootstrapping techniques for improving Malay data to match Iban pronunciations. To increase the performance of low-resource acoustic models we explored two acoustic modelling techniques: Subspace Gaussian Mixture Models (SGMM) and Deep Neural Networks (DNN). We performed cross-lingual strategies using both frameworks to adapt out-of-language data to Iban speech. Results show that using Malay data is beneficial for increasing the performance of Iban ASR. We also tested SGMM and DNN to improve low-resource non-native ASR: we proposed a fine merging strategy for obtaining an optimal multi-accent SGMM, and we developed an accent-specific DNN using native speech data. After applying both methods, we obtained significant improvements in ASR accuracy. From our study, we observe that using SGMM and DNN for a cross-lingual strategy is effective when training data is very limited.
8

Tafreshi, Shabnam. "Cross-Genre, Cross-Lingual, and Low-Resource Emotion Classification." Thesis, The George Washington University, 2021. http://pqdtopen.proquest.com/#viewpdf?dispub=28088437.

Abstract:
Emotions can be defined as a natural, instinctive state of mind arising from one's circumstances, mood, and relationships with others. How and what humans feel has long been a question to be answered by psychology. Enabling computers to recognize human emotions has been of interest to researchers since the 1990s (Picard et al., 1995). Ever since, this area of research has grown significantly and emotion detection is becoming an important component in many natural language processing tasks. Several theories exist for defining emotions and are chosen by researchers according to their needs. For instance, according to appraisal theory, a psychology theory, emotions are produced by our evaluations (appraisals or estimates) of events that cause a specific reaction in different people. Some emotions are easy and universal, while others are complex and nuanced. Emotion classification is generally the process of labeling a piece of text with one or more corresponding emotion labels. Psychologists have developed numerous models and taxonomies of emotions; the model or taxonomy depends on the problem, and thorough study is often required to select the best model. Early studies of emotion classification focused on building computational models to classify basic emotion categories. In recent years, increasing volumes of social media and the digitization of data have opened a new horizon in this area of study, where emotion classification is a key component of applications including mood and behavioral studies as well as disaster relief, amongst many others. Sophisticated models have been built to detect and classify emotion in text, but few analyze how well a model is able to learn emotion cues. The ability to learn emotion cues properly and to generalize this learning is very important. This work investigates the robustness of emotion classification approaches across genres and languages, with a focus on quantifying how well state-of-the-art models are able to learn emotion cues. First, we use multi-task learning and hierarchical models to build emotion models trained on data combined from multiple genres. Our hypothesis is that a multi-genre, noisy training environment will help the classifier learn emotion cues that are prevalent across genres. Second, we explore splitting text (i.e. sentences) into clauses and testing whether the model's performance improves. Emotion analysis needs fine-grained annotation, and clause-level annotation can be beneficial for designing features that improve emotion detection performance. Intuitively, clause-level annotations may help the model focus on emotion cues while ignoring irrelevant portions of the text. Third, we adopt a transfer learning approach for cross-lingual/genre emotion classification to focus the classifier's attention on emotion cues which are consistent across languages. Fourth, we empirically show how to combine different genres to build robust models that can be used as source models for emotion transfer to low-resource target languages. Finally, this study involved curating and re-annotating popular emotional data sets in different genres, annotating a multi-genre corpus of Persian tweets and news, and generating a collection of emotional sentences for a low-resource language, Azerbaijani, a language spoken in the northwest of Iran.
9

Singh, Mittul [Verfasser], and Dietrich [Akademischer Betreuer] Klakow. "Handling long-term dependencies and rare words in low-resource language modelling / Mittul Singh ; Betreuer: Dietrich Klakow." Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2017. http://d-nb.info/1141677962/34.

10

Dyer, Andrew. "Low Supervision, Low Corpus size, Low Similarity! Challenges in cross-lingual alignment of word embeddings : An exploration of the limitations of cross-lingual word embedding alignment in truly low resource scenarios." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-395946.

Abstract:
Cross-lingual word embeddings are an increasingly important resource in cross-lingual methods for NLP, particularly for their role in transfer learning and unsupervised machine translation, purportedly opening up the opportunity for NLP applications for low-resource languages. However, most research in this area implicitly expects the availability of vast monolingual corpora for training embeddings, a scenario which is not realistic for many of the world's languages. Moreover, much of the reporting of the performance of cross-lingual word embeddings is based on a fairly narrow set of mostly European language pairs. Our study examines the performance of cross-lingual alignment across a more diverse set of language pairs; controls for the effect of the corpus size on which the monolingual embedding spaces are trained; and studies the impact of spectral graph properties of the embedding space on alignment. Through our experiments on a more diverse set of language pairs, we find that performance in bilingual lexicon induction is generally poor in heterogeneous pairs, and that even using a gold or heuristically derived dictionary has little impact on the performance for these pairs of languages. We also find that the performance for these languages increases only slowly with corpus size. Finally, we find a moderate correlation between the isospectral difference of the source and target embeddings and the performance of bilingual lexicon induction. We infer that methods other than cross-lingual alignment may be more appropriate in the case of both low-resource languages and heterogeneous language pairs.
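
A minimal sketch of supervised cross-lingual embedding alignment by orthogonal Procrustes, the standard baseline behind experiments of this kind: given a seed dictionary of translation pairs, learn the orthogonal map W minimising ||XW - Y||_F and apply it to the whole source space. The vectors here are random toy data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 50, 10  # toy dimensionality and a small seed dictionary

X = rng.normal(size=(n_pairs, d))  # source-language vectors of dictionary words
Y = rng.normal(size=(n_pairs, d))  # target-language vectors of their translations

# Orthogonal Procrustes solution: W = U V^T, where U S V^T = SVD(X^T Y).
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

aligned = X @ W  # source vectors mapped into the target space
# Bilingual lexicon induction would now retrieve the nearest target neighbour
# of each aligned source vector, e.g. by cosine similarity.
```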
11

Arbi, Haza Nasution. "Bilingual Lexicon Induction Framework for Closely Related Languages." Kyoto University, 2018. http://hdl.handle.net/2433/235115.

12

Godard, Pierre. "Unsupervised word discovery for computational language documentation." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLS062/document.

Abstract:
Language diversity is under considerable pressure: half of the world's languages could disappear by the end of this century. This realization has sparked many initiatives in documentary linguistics in the past two decades, and 2019 was proclaimed the International Year of Indigenous Languages by the United Nations, to raise public awareness of the issue and foster initiatives for language documentation and preservation. Yet documentation and preservation are time-consuming processes, and the supply of field linguists is limited. Consequently, the emerging field of computational language documentation (CLD) seeks to assist linguists by providing them with automatic processing tools. The Breaking the Unwritten Language Barrier (BULB) project, for instance, constitutes one of the efforts defining this new field, bringing together linguists and computer scientists. This thesis examines the particular problem of discovering words in an unsegmented stream of characters, or phonemes, transcribed from speech in a very-low-resource setting. This primarily involves a segmentation procedure, which can also be paired with an alignment procedure when a translation is available. Using two realistic Bantu corpora for language documentation, one in Mboshi (Republic of the Congo) and the other in Myene (Gabon), we benchmark various monolingual and bilingual unsupervised word discovery methods. We then show that using expert knowledge in the Adaptor Grammar framework can vastly improve segmentation results, and we indicate ways to use this framework as a decision tool for the linguist. We also propose a tonal variant for a strong nonparametric Bayesian segmentation algorithm, making use of a modified backoff scheme designed to capture tonal structure. To leverage the weak supervision given by a translation, we finally propose and extend an attention-based neural segmentation method, significantly improving the segmentation performance of an existing bilingual method.
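
A toy illustration of the segmentation problem at the core of this work: recovering word boundaries in an unsegmented phoneme or character stream. Here a small known lexicon with unigram probabilities is searched by dynamic programming; the thesis itself uses unsupervised nonparametric Bayesian models, for which this supervised toy is only a stand-in, and the lexicon entries are invented.

```python
import math

lexicon = {"mbo": 0.2, "shi": 0.2, "mboshi": 0.5, "a": 0.1}  # illustrative entries

def segment(stream: str):
    """Best segmentation of `stream` into lexicon words (Viterbi over prefixes)."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(stream)
    for end in range(1, len(stream) + 1):
        for start in range(end):
            word = stream[start:end]
            if word in lexicon and best[start][1] is not None:
                score = best[start][0] + math.log(lexicon[word])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[-1][1]

print(segment("mboshia"))  # -> ['mboshi', 'a']
```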
13

Black, Kevin P. "Interactive Machine Assistance: A Case Study in Linking Corpora and Dictionaries." BYU ScholarsArchive, 2015. https://scholarsarchive.byu.edu/etd/5620.

Abstract:
Machine learning can provide assistance to humans in making decisions, including linguistic decisions such as determining the part of speech of a word. Supervised machine learning methods derive patterns indicative of possible labels (decisions) from annotated example data. For many problems, including most language analysis problems, acquiring annotated data requires human annotators who are trained to understand the problem and to disambiguate among multiple possible labels. Hence, the availability of experts can limit the scope and quantity of annotated data. Machine-learned pre-annotation assistance, which suggests probable labels for unannotated items, can enable expert annotators to work more quickly and thus to produce broader and larger annotated resources more cost-efficiently. Yet, because annotated data is required to build the pre-annotation model, bootstrapping is an obstacle to utilizing pre-annotation assistance, especially for low-resource problems where little or no annotated data exists. Interactive pre-annotation assistance can mitigate bootstrapping costs, even for low-resource problems, by continually refining the pre-annotation model with new annotated examples as the annotators work. In practice, continually refining models has seldom been done except for the simplest of models which can be trained quickly. As a case study in developing sophisticated, interactive, machine-assisted annotation, this work employs the task of corpus-dictionary linkage (CDL), which is to link each word token in a corpus to its correct dictionary entry. CDL resources, such as machine-readable dictionaries and concordances, are essential aids in many tasks including language learning and corpus studies. We employ a pipeline model to provide CDL pre-annotations, with one model per CDL sub-task. We evaluate different models for lemmatization, the most significant CDL sub-task since many dictionary entry headwords are usually lemmas. The best performing lemmatization model is a hybrid which uses a maximum entropy Markov model (MEMM) to handle unknown (novel) word tokens and other component models to handle known word tokens. We extend the hybrid model design to the other CDL sub-tasks in the pipeline. We develop an incremental training algorithm for the MEMM which avoids wasting previous computation as would be done by simply retraining from scratch. The incremental training algorithm facilitates the addition of new dictionary entries over time (i.e., new labels) and also facilitates learning from partially annotated sentences which allows annotators to annotate words in any order. We validate that the hybrid model attains high accuracy and can be trained sufficiently quickly to provide interactive pre-annotation assistance by simulating CDL annotation on Quranic Arabic and classical Syriac data.
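
A minimal sketch of the hybrid lemmatization design described above: known tokens are handled by a lookup component learned from annotated data, and unknown tokens fall back to a statistical guesser (an MEMM in the thesis; the suffix-stripping fallback below is a purely illustrative stand-in).

```python
known_lemmas = {"books": "book", "running": "run"}  # learned token -> lemma table

def guess_lemma(token: str) -> str:
    """Hypothetical fallback for unknown tokens; stands in for the MEMM."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def lemmatize(token: str) -> str:
    # Dispatch: exact-match component for known tokens, guesser otherwise.
    return known_lemmas.get(token, guess_lemma(token))

print([lemmatize(t) for t in ["books", "running", "walked"]])  # ['book', 'run', 'walk']
```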
14

Meftah, Sara. "Neural Transfer Learning for Domain Adaptation in Natural Language Processing." Thesis, université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG021.

Abstract:
Recent approaches based on end-to-end deep neural networks have revolutionised Natural Language Processing (NLP), achieving remarkable results in several tasks and languages. Nevertheless, these approaches are limited by their "gluttony" in terms of annotated data, since they rely on a supervised training paradigm, i.e. training from scratch on large amounts of annotated data. Therefore, there is a wide gap between the capabilities of NLP technologies for high-resource languages and the long tail of low-resourced languages. Moreover, NLP researchers have focused much of their effort on training NLP models on the news domain, due to the availability of training data, yet many research works have highlighted that models trained on news fail to work efficiently on out-of-domain data, due to their lack of robustness against domain shifts. This thesis presents a study of transfer learning approaches through which we propose different methods to benefit from knowledge pre-learned on a high-resourced domain to enhance the performance of neural NLP models in low-resourced settings. Precisely, we apply our approaches to transfer from the news domain to the social media domain. Indeed, despite the importance of its valuable content for a variety of applications (e.g. public security, health monitoring, or highlighting trends), this domain is still poor in terms of annotated data. We present different contributions. First, we propose two methods to transfer the knowledge encoded in the neural representations of a source model, pretrained on large labelled datasets from the source domain, to a target model, further adapted by fine-tuning on a few annotated examples from the target domain. The first method transfers contextualised representations pretrained with supervision, while the second transfers pretrained weights used to initialise the target model's parameters. Second, we perform a series of analyses to spot the limits of the above-mentioned methods. We find that even if the proposed transfer learning approach enhances performance on the social media domain, a hidden negative transfer may mitigate the final gain brought by transfer learning. In addition, an interpretive analysis of the pretrained model shows that pretrained neurons may be biased by what they have learned from the source domain, and thus struggle to learn patterns specific to the target domain. Third, stemming from our analysis, we propose a new adaptation scheme which augments the target model with normalised, weighted and randomly initialised neurons that enable better adaptation while maintaining the valuable source knowledge. Finally, we propose a model that, in addition to the knowledge pre-learned from the high-resource source domain, takes advantage of various supervised NLP tasks.
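
A minimal sketch, in PyTorch, of the two transfer schemes described above: (1) initialising a target model from weights pretrained on the source domain, and (2) augmenting the model with extra randomly initialised units so it can learn target-specific patterns. Layer types, sizes and the tagset are illustrative, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

hidden, labels = 128, 17  # illustrative sizes (e.g. a POS tagset)

source_encoder = nn.GRU(input_size=64, hidden_size=hidden, batch_first=True)
# ... assume source_encoder was trained on the annotated news (source) domain ...

# Scheme 1: weight transfer -- copy source parameters into the target model.
target_encoder = nn.GRU(input_size=64, hidden_size=hidden, batch_first=True)
target_encoder.load_state_dict(source_encoder.state_dict())

# Scheme 2: add fresh, randomly initialised units alongside the transferred
# representations, then fine-tune everything on the target (social media) data.
extra = 32
random_units = nn.Linear(hidden, extra)
classifier = nn.Linear(hidden + extra, labels)

def forward(x: torch.Tensor) -> torch.Tensor:
    out, _ = target_encoder(x)                                   # transferred features
    aug = torch.cat([out, torch.relu(random_units(out))], dim=-1)
    return classifier(aug)                                       # per-token label scores

print(forward(torch.randn(2, 5, 64)).shape)  # torch.Size([2, 5, 17])
```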
15

Karagol-Ayan, Burcu. "Resource generation from structured documents for low-density languages." College Park, Md.: University of Maryland, 2007. http://hdl.handle.net/1903/7580.

Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2007.
Thesis research directed by: Dept. of Computer Science. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
16

Karim, Hiva. "Best way for collecting data for low-resourced languages." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-35945.

Abstract:
Low resource languages possess a limited number of digitized texts, making it challenging to generate a satisfactory language audio corpus and information retrieval services. Low resource languages, especially those spoken exclusively in African countries, lack a well-defined and annotated language corpus, making it a big obstacle for experts to provide a comprehensive text processing system. In this study, I found out the best practices for producing and collecting data for such zero/low resource languages by means of crowd-sourcing. For the purpose of this study, a number of research articles (n=260) were extracted from Google Scholar, Microsoft Academic, and ScienceDirect. Of these articles, only 60, which met the demands of the inclusion criteria, were considered for an eligibility review. A full-text version of these research articles was downloaded and carefully screened to ensure eligibility. As a result of the eligibility assessment of these potentially eligible 60 full-text articles, only 25 were selected and qualified for inclusion in the final review. From the final pool of selected articles concerning data generation practices and collection for low resource languages, it can be concluded that speech-based audio data is one of the most common and accessible data types. It can be contended that the collection of audio data from speech-based resources, such as native speakers of the intended language and available audio recordings, taking advantage of new technologies, is the most practical, cost-effective, and common method for collecting data for low resource languages.
17

Aufrant, Lauriane. "Training parsers for low-resourced languages : improving cross-lingual transfer with monolingual knowledge." Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLS089/document.

Abstract:
As a result of the recent blossoming of Machine Learning techniques, the Natural Language Processing field faces an increasingly thorny bottleneck: the most efficient algorithms entirely rely on the availability of large training data. These technological advances consequently remain unavailable for the 7,000 languages in the world, most of which are low-resourced. One way to bypass this limitation is the approach of cross-lingual transfer, whereby resources available in another (source) language are leveraged to help build accurate systems in the desired (target) language. However, despite promising results in research settings, the standard transfer techniques lack the flexibility regarding cross-lingual resources needed to be fully usable in real-world scenarios: exploiting very sparse resources, or assorted arrays of resources. This limitation strongly diminishes the applicability of that approach. This thesis consequently proposes to combine multiple sources and resources for transfer, with an emphasis on selectivity: can we estimate which resource of which language is useful for which input? This strategy is put into practice in the frame of transition-based dependency parsing. To this end, a new transfer framework is designed with a cascading architecture: it enables the desired combination while ensuring better targeted exploitation of each resource, down to the level of the word. Empirical evaluation indeed dampens the enthusiasm for the purely cross-lingual approach -- it remains in general preferable to annotate just a few target sentences -- but also highlights its complementarity with other approaches. Several metrics are developed to characterize precisely cross-lingual similarities, syntactic idiosyncrasies, and the added value of cross-lingual information compared to monolingual training. The substantial benefits of typological knowledge are also explored. The whole study relies on a series of technical improvements to the parsing framework: this work includes the release of new open-source software, PanParser, which revisits the so-called dynamic oracles to extend their use cases. Several purely monolingual contributions complete this work, including an exploration of monolingual cascading, which offers promising perspectives with easy-then-hard strategies (e.g. processing all easy dependencies first, then only the hard ones).
18

Fapšo, Michal. "Vyhledávání výrazů v řeči pomocí mluvených příkladů." Doctoral thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-261237.

Abstract:
This thesis deals with query-by-example spoken term detection (QbE STD). Queries are entered in spoken form and searched for in a set of speech recordings; the output of the search is a list of detections with their scores and timing. We describe, analyze, and compare three different approaches to QbE STD in language-dependent and language-independent settings, with one and five examples per query. For our experiments we used Czech, Hungarian, English, and Arabic (Levantine) data, and for each of these languages we trained a 3-state phone recognizer. This gave us 16 possible combinations of the evaluation language and the language on which the recognizer was trained: four combinations were language-dependent and 12 were language-independent. All QbE systems were evaluated on the same data and the same phone posterior features, using two metrics, non-pooled Figure-of-Merit (FOM) and our proposed utterance-normalized non-pooled Figure-of-Merit, which provided relevant data for comparing these QbE approaches and for gaining better insight into their behavior. The QbE approaches used in this work are: sequential statistical modelling (GMM/HMM), template matching of features (DTW), and matching of phone lattices (WFST). To compare the results of the QbE approaches with conventional STD systems searching for textual queries, we also evaluated the language-dependent configurations with an acoustic keyword spotting system (AKWS) and a system searching for phone strings in lattices (WFSTlat). The core of this work is the development, analysis, and improvement of the WFST QbE STD system, which after improvement achieves results similar to the DTW system in language-dependent settings.
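
A minimal sketch of the DTW approach named above: aligning a spoken query with a window of a search utterance over frame-level phone-posterior features, using a cosine local distance. A real QbE STD system slides such a window over whole recordings and reports low-cost alignments as detections; the feature matrices below are random toy data.

```python
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def dtw_cost(query: np.ndarray, window: np.ndarray) -> float:
    """Cumulative DTW alignment cost between two (frames x dims) matrices."""
    q, w = len(query), len(window)
    D = np.full((q + 1, w + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, q + 1):
        for j in range(1, w + 1):
            d = cosine_dist(query[i - 1], window[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[q, w] / (q + w)  # length-normalised so scores are comparable

rng = np.random.default_rng(1)
query = rng.random((20, 30))   # 20 frames of 30-dim phone posteriors (toy data)
window = rng.random((25, 30))
print(dtw_cost(query, window))
```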
19

Vu, Ngoc Thang [Verfasser]. "Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information / Ngoc Thang Vu." Aachen : Shaker, 2014. http://d-nb.info/1058315811/34.

20

Vu, Ngoc Thang [Verfasser], and T. [Akademischer Betreuer] Schultz. "Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information / Ngoc Thang Vu. Betreuer: T. Schultz." Karlsruhe : KIT-Bibliothek, 2014. http://d-nb.info/1051848229/34.

21

Susman, Derya. "Turkish Large Vocabulary Continuous Speech Recognition By Using Limited Audio Corpus." Master's thesis, METU, 2012. http://etd.lib.metu.edu.tr/upload/12614207/index.pdf.

Abstract:
Speech recognition in the Turkish language is a challenging problem from several perspectives, most of which are related to the morphological structure of the language. Since Turkish is an agglutinative language, it is possible to generate many words from a single stem by using suffixes. This characteristic of the language increases the number of out-of-vocabulary (OOV) words, which degrade the performance of a speech recognizer dramatically. Turkish also allows words to be ordered in a relatively free manner, which makes it difficult to generate robust language models. In this thesis, the existing models and approaches which address the problem of Turkish LVCSR (Large Vocabulary Continuous Speech Recognition) are explored. Different recognition units (words, morphs, stems and endings) are used in generating the n-gram language models; 3-gram and 4-gram language models are generated with respect to the recognition unit. Since speech recognition is a machine learning problem, the performance of the recognizer depends on the sufficiency of the audio data used in acoustic model training. However, it is difficult to obtain rich audio corpora for the Turkish language. In this thesis, existing approaches are used to solve the problem of Turkish LVCSR with a limited audio corpus, and several data selection approaches are proposed to improve the robustness of the acoustic model.
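
A minimal sketch of the sub-word language modelling idea above: counting 3-grams over morph-like units instead of whole words, so that unseen inflected forms decompose into known units. The segmentations are illustrative, not the output of a real morphological analyser.

```python
from collections import Counter
from itertools import islice

# "evlerimizde" ("in our houses") segmented as stem + suffixes, illustratively.
corpus = [
    ["ev", "+ler", "+imiz", "+de"],
    ["ev", "+ler", "+de"],
]

def ngrams(units, n=3):
    padded = ["<s>"] * (n - 1) + units + ["</s>"]
    return zip(*(islice(padded, i, None) for i in range(n)))

counts = Counter(g for sentence in corpus for g in ngrams(sentence))
print(counts.most_common(3))  # raw counts; a real LM would also smooth these
```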
22

Cecovniuc, Ioana. "¿Qué prefiere usted, pagar en metálico o con cardo?: los falsos amigos y su concienciación lingüística en las aulas plurilingües de ELE." Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/401588.

Abstract:
The major problems relating to false friends entail difficulties for most students in terms of pronunciation, writing and interaction in a foreign language. With false friends we refer to words or structures of different languages that display morphological affinity but, at the same time, semantic divergence. Examples of (total or partial) false linguistic relations between mother tongue and target language abound. Nevertheless, practical experience concerning this subject is quite scarce: false friends are either omitted or reduced, in general, to lists or inventories of the most notable cases, and hence are not (much) integrated in the foreign language classroom along with other teaching and learning strategies and tactics for a meaningful instruction. Consequently, this Ph.D. thesis aims at providing relevant aspects of both theoretical and practical nature in relation to the typology, the uses and the didactic value of false friends as instances of interlinguistic influence. In conjunction therewith, the contrastive perspective makes available the epistemological elements of the present analysis: speakers are naturally prone to transfer structures and meanings, and distributions of these structures and meanings, from their mother tongue or their first foreign language to another foreign language they are learning. This basic statement is analyzed by means of Contrastive Linguistics, which includes psycholinguistic and educational postulates of two complementary methodologies: Contrastive Analysis and Error Analysis. In addition, the principles that define Plurilingualism are considered by investigating recent studies regarding interlinguistic influences when learning a foreign language. In light of the theoretical assumptions, interferences result from the relationship between the mother tongue and the foreign language. However, learning two or more foreign languages is a worldwide tendency nowadays, so delimitations in the learning processes cannot be drawn exclusively between the mother tongue and the target language: a first foreign language can be a factor that positively or negatively influences the learning of a second foreign language. In this regard, the present work attempts to cover different aspects of the topic of false friends when Spanish (the second foreign language being learned) comes into contact with English (the first foreign language) and Romanian or Dutch (the mother tongues).
23

Pronto, Lindon N. "Exploring German and American Modes of Pedagogical and Institutional Sustainability: Forging a Way into the Future." Scholarship @ Claremont, 2012. http://scholarship.claremont.edu/pitzer_theses/21.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
Rooted deep in Germany's past is its modern socio-political grounding for environmental respect and sustainability. This grounding translates into individual and collective action and extends to the economic and policy realms as much as to educational institutions. This thesis evaluates research conducted in Germany with a view to which approaches are most transferable to the United States liberal arts setting. Furthermore, exemplary American models of institutional sustainability and environmental education are explored and combined with those from abroad to produce a blueprint and action plan fitting for the American college and university.
24

Farra, Noura. "Cross-Lingual and Low-Resource Sentiment Analysis." Thesis, 2019. https://doi.org/10.7916/d8-x3b7-1r92.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language which lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis to date have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages. This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model which can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language. Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small, in-domain parallel corpus works best as a transfer medium if it is available. We describe the conditions under which other resources and embedding generation methods are successful, and these include our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis. To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments. The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language. In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment.
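As a rough illustration of the direct-transfer approach described above, the sketch below trains a sentiment classifier on source-language sentences represented with bilingual word vectors and applies it unchanged to the target language. The tiny vocabulary and tied translation pairs are hypothetical stand-ins for bilingual embeddings pre-trained on an in-domain parallel corpus; this is a sketch of the general idea, not the author's system.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
DIM = 50

# Toy shared space: in practice these would be bilingual word vectors
# pre-trained on a parallel corpus (assumption for this sketch).
bilingual_emb = {w: rng.normal(size=DIM)
                 for w in ["good", "bad", "movie", "bueno", "malo", "película"]}
# Tie translation pairs together so both languages share one space.
bilingual_emb["bueno"] = bilingual_emb["good"]
bilingual_emb["malo"] = bilingual_emb["bad"]
bilingual_emb["película"] = bilingual_emb["movie"]

def sentence_vector(tokens):
    # Average the embeddings of the tokens we can look up.
    vecs = [bilingual_emb[t] for t in tokens if t in bilingual_emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

# Train on labeled high-resource source-language data only.
src = [(["good", "movie"], 1), (["bad", "movie"], 0)]
X = np.stack([sentence_vector(toks) for toks, _ in src])
y = [label for _, label in src]
clf = LogisticRegression().fit(X, y)

# Zero-shot prediction on the low-resource target language:
# no target labels, no translation, only the shared embedding space.
print(clf.predict(sentence_vector(["bueno", "película"])[None, :]))  # -> [1]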
25

Zhang, Xiao. "Flexible Structured Prediction in Natural Language Processing with Partially Annotated Corpora." Thesis, 2020.

Find the full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
Structured prediction makes coherent decisions over structured objects, capturing the interrelations of the predicted variables. It has been widely used in many areas, such as bioinformatics, computer vision, speech recognition, and natural language processing. Machine learning with reduced supervision aims to lessen the laborious and error-prone annotation effort and to benefit low-resource languages. In this dissertation we study structured prediction with reduced supervision for two sets of problems, sequence labeling and dependency parsing, both of which are representative structured prediction problems in NLP. We investigate three different approaches.
The first approach is learning with a modular architecture by task decomposition. By decomposing the labels into a location sub-label and a type sub-label, we design neural modules to tackle these sub-labels respectively, with an additional module to fuse the information. Experiments on benchmark datasets show that the modular architecture outperforms existing models and can use partially labeled data together with fully labeled data to improve on the performance of using fully labeled data alone.
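As a toy illustration of this decomposition (written under our own assumptions, not taken from the dissertation's code), each BIO-style tag splits into a location sub-label and a type sub-label, which separate neural modules can predict and a further module can fuse:

def decompose(tag):
    # 'B-PER' -> ('B', 'PER'); the 'O' tag carries no type sub-label.
    if tag == "O":
        return "O", None
    location, entity_type = tag.split("-", 1)
    return location, entity_type

def recompose(location, entity_type):
    return "O" if location == "O" else f"{location}-{entity_type}"

tags = ["B-PER", "I-PER", "O", "B-LOC"]
subs = [decompose(t) for t in tags]
print(subs)                                # [('B', 'PER'), ('I', 'PER'), ('O', None), ('B', 'LOC')]
print([recompose(l, t) for l, t in subs])  # round-trips to the original tags

Partially labeled data fits this factorization naturally: an example annotated only with entity boundaries still supervises the location module, while the type module's loss is simply masked out for that example.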

The second approach builds the neural CRF autoencoder (NCRFAE) model, which combines a discriminative component and a generative component for semi-supervised sequence labeling. The model has a unified structure with shared parameters, using different loss functions for labeled and unlabeled data. We develop a variant of the EM algorithm for optimizing the model with tractable inference. Experiments on the POS tagging task in several languages show that the model outperforms existing systems in both supervised and semi-supervised setups.
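The sketch below illustrates the shared-parameter idea in a deliberately simplified, token-independent form (an assumption for illustration; the actual model uses a linear-chain CRF and an EM variant for inference). One encoder scores tags; labeled sentences receive both a discriminative and a generative loss, while unlabeled sentences receive only the generative reconstruction loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

V, T, H = 100, 5, 32                 # vocab size, tag set size, hidden size
emb = nn.Embedding(V, H)
encoder = nn.Linear(H, T)            # token -> tag scores (CRF omitted here)
decoder = nn.Linear(T, V)            # tag posterior -> word logits

def loss(tokens, tags=None):
    scores = encoder(emb(tokens))               # (len, T)
    q = F.softmax(scores, dim=-1)               # per-token tag posterior
    recon = decoder(q)                          # expected word logits
    gen = F.cross_entropy(recon, tokens)        # generative component
    if tags is None:                            # unlabeled sentence
        return gen
    disc = F.cross_entropy(scores, tags)        # discriminative component
    return disc + gen                           # shared parameters, two losses

x = torch.randint(0, V, (7,))
print(loss(x, torch.randint(0, T, (7,))).item())  # labeled loss
print(loss(x).item())                             # unlabeled loss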

The third approach builds two models for semi-supervised dependency parsing, namely the local autoencoding parser (LAP) and the global autoencoding parser (GAP). LAP assumes the chain-structured sentence has a latent representation and uses this representation to construct the dependency tree, while GAP treats the dependency tree itself as a latent variable. Both models have unified structures for sentences with and without annotated parse trees. Experiments on several languages show that both parsers can use unlabeled sentences to improve on the performance obtained with labeled sentences alone; LAP is faster, while GAP outperforms existing models.
26

Kamran, Amir. "Hybrid Machine Translation Approaches for Low-Resource Languages." Master's thesis, 2011. http://www.nusl.cz/ntk/nusl-313015.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
In recent years, corpus-based machine translation systems have produced significant results for a number of language pairs. However, for low-resource languages like Urdu, purely statistical or purely example-based methods do not perform well. On the other hand, rule-based approaches require a huge amount of time and resources to develop the rules, which makes them impractical in most scenarios. Hybrid machine translation systems, which combine the best of different approaches to achieve quality translation, may be one way to overcome these problems. The goal of the thesis is to explore different combinations of approaches and to evaluate their performance against the standard corpus-based methods currently in use. This includes: 1. the use of syntax-based and dependency-based reordering rules with statistical machine translation; 2. the automatic extraction of lexical and syntactic rules using statistical methods to facilitate transfer-based machine translation. The novel element of the proposed work is an algorithm that learns reordering rules automatically for English-to-Urdu statistical machine translation; the approach can be extended to learn lexical and syntactic rules for building a rule-based machine translation system.
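As a hedged illustration of what one such pre-reordering rule could look like (the rule and POS tags below are invented for the example, not produced by the thesis's learning algorithm), the sketch moves an English verb after its object to approximate Urdu's SOV order before statistical translation:

def reorder_svo_to_sov(tagged):
    # tagged: list of (word, POS); hold each verb back until the
    # following noun-phrase head has been emitted.
    out, held_verbs = [], []
    for word, pos in tagged:
        if pos.startswith("VB"):
            held_verbs.append((word, pos))
        else:
            out.append((word, pos))
            if pos in ("NN", "NNS", "PRP") and held_verbs:
                out.extend(held_verbs)       # emit verbs after the object
                held_verbs = []
    return out + held_verbs

sent = [("She", "PRP"), ("reads", "VBZ"), ("books", "NNS")]
print([w for w, _ in reorder_svo_to_sov(sent)])   # ['She', 'books', 'reads']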
27

"Query-by-example spoken term detection for low-resource languages." 2014. http://repository.lib.cuhk.edu.hk/en/item/cuhk-1290682.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
In this thesis, we consider the problem of query-by-example (QbyE) spoken term detection (STD) for low-resource languages. The problem is to automatically detect and locate the occurrences of a query term in a large audio database. The query term is given in the form of one or more audio examples. This research is motivated by the demand for information retrieval technologies that can handle speech data of low-resource languages. The major technical difficulty is that manual transcriptions and linguistic knowledge are not available for these languages.
The framework of acoustic segment modeling (ASM) is adopted for unsupervised training of a speech tokenizer. Three novel algorithms are developed for segment labeling in the ASM framework. The proposed algorithms are based on different class-by-segment posterior representations and spectral clustering techniques. The posterior representations are shown to be more robust than conventional spectral representations. Spectral clustering has achieved significant success in many applications; here, the algorithms are reformulated to be computationally feasible for clustering a large number of speech segments. Experiments on a multilingual speech database demonstrate the advantage of the proposed algorithms over existing approaches.
The speech tokenizer obtained with ASM is applied to QbyE STD. The detection of spoken queries is based on a frame-based template matching framework. The ASM tokenizer serves as the front-end to generate posterior features, which are used for temporal template matching by dynamic time warping (DTW). Experiments show that the ASM tokenizer outperforms a GMM tokenizer and language-mismatched phoneme recognizers. Moreover, a two-step approach is proposed for efficient search.
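A minimal sketch of the matching step follows, under the assumption that query and test utterances have already been converted to posteriorgrams by the ASM tokenizer; a real system would slide the query over the database with subsequence DTW rather than align whole utterances.

import numpy as np

def dtw_cost(query, test):
    # query, test: (n, d) and (m, d) frame-posterior matrices.
    n, m = len(query), len(test)
    # Frame-level distance: 1 - cosine similarity between posteriors.
    sim = query @ test.T
    sim /= (np.linalg.norm(query, axis=1)[:, None]
            * np.linalg.norm(test, axis=1)[None, :])
    dist = 1.0 - sim
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j],      # deletion
                                               D[i, j - 1],      # insertion
                                               D[i - 1, j - 1])  # match
    return D[n, m] / (n + m)          # length-normalized alignment cost

q = np.random.rand(20, 50)            # stand-ins for ASM posteriorgrams
t = np.random.rand(35, 50)
print(dtw_cost(q, t))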
The frame-based template matching framework for QbyE STD is enhanced in three ways. A novel DTW matrix combination approach is proposed for fusing multiple systems with different posterior features. Pseudo-relevance feedback is used for query expansion, and score normalization is applied to calibrate the score distributions of different query terms. Experimental results show that all three approaches significantly improve the performance of the QbyE STD system.
Spoken term detection is the task of locating the occurrences of a given query term in a large speech database; it is valuable both in academic research and in practical applications. Traditional research on spoken term detection has mainly targeted resource-rich languages. This thesis studies spoken term detection for low-resource languages: under the setting considered here, the target language lacks the resources needed to train a speech recognition system, and query terms are given in the form of audio examples.
This thesis adopts the acoustic segment modeling (ASM) framework for unsupervised training of a speech tokenizer. Three new methods are proposed for clustering the speech segments within the ASM framework, based on a new robust segment representation and on spectral clustering techniques. Experiments show that these methods outperform three commonly used baselines and achieve better modeling performance.
The ASM tokenizer is then used in a template-matching-based spoken term detection system, where it serves as a front-end feature transformation module that extracts posterior probability features. A two-step detection method is also proposed to improve detection efficiency. Experiments confirm that this approach achieves high detection accuracy.
To further improve accuracy, the template-matching system is enhanced from three angles: system fusion performed on the dynamic time warping distance matrices, pseudo-relevance feedback to obtain additional query examples, and score normalization so that a single unified detection threshold can be set. Experimental results show that all three methods effectively improve system performance.
Wang, Haipeng.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2014.
Includes bibliographical references (leaves 110-127).
Abstracts also in Chinese.
Title from PDF title page (viewed on 5 December 2016).
Detailed summary in vernacular field only.
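Of the three enhancements described above, score normalization is the simplest to illustrate. The sketch below z-normalizes each query's detection scores, a common calibration choice assumed here for illustration, so that one global threshold can serve query terms with very different raw score ranges:

import numpy as np

def z_normalize(scores):
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + 1e-8)

raw = {"query_A": [0.91, 0.85, 0.40], "query_B": [0.30, 0.28, 0.05]}
calibrated = {q: z_normalize(s) for q, s in raw.items()}
print(calibrated)   # both queries' scores now live on a comparable scale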
28

Cooper, Erica Lindsay. "Text-to-Speech Synthesis Using Found Data for Low-Resource Languages." Thesis, 2019. https://doi.org/10.7916/d8-vdzp-j870.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
Text-to-speech synthesis is a key component of interactive, speech-based systems. Typically, building a high-quality voice requires collecting dozens of hours of speech from a single professional speaker in an anechoic chamber with a high-quality microphone. There are about 7,000 languages spoken in the world, and most do not enjoy the speech research attention historically paid to such languages as English, Spanish, Mandarin, and Japanese. Speakers of these so-called "low-resource languages" therefore do not equally benefit from these technological advances. While it takes a great deal of time and resources to collect a traditional text-to-speech corpus for a given language, we may instead be able to make use of various sources of "found" data which may be available. In particular, sources such as radio broadcast news and ASR corpora are available for many languages. While this kind of data does not exactly match what one would collect for a more standard TTS corpus, it may nevertheless contain parts which are usable for producing natural and intelligible parametric TTS voices.

In the first part of this thesis, we examine various types of found speech data in comparison with data collected for TTS, in terms of a variety of acoustic and prosodic features. We find that radio broadcast news in particular is a good match. Audiobooks may also be a good match despite their largely more expressive style, and certain speakers in conversational and read ASR corpora also resemble TTS speakers in their manner of speaking and thus their data may be usable for training TTS voices.

In the rest of the thesis, we conduct a variety of experiments in training voices on non-traditional sources of data, such as ASR data, radio broadcast news, and audiobooks. We aim to discover which methods produce the most intelligible and natural-sounding voices, focusing on three main approaches:

1) Training data subset selection. In noisy, heterogeneous data sources, we may wish to locate subsets of the data that are well-suited for building voices, based on acoustic and prosodic features that are known to correspond with TTS-style speech, while excluding utterances that introduce noise or other artifacts. We find that choosing subsets of speakers for training data can result in voices that are more intelligible.

2) Augmenting the frontend feature set with new features. In cleaner sources of found data, we may wish to train voices on all of the data, but we may get improvements in naturalness by including acoustic and prosodic features at the frontend and synthesizing in a manner that better matches the TTS style. We find that this approach is promising for creating more natural-sounding voices, regardless of the underlying acoustic model.

3) Adaptation. Another way to make use of high-quality data while also including informative acoustic and prosodic features is to adapt to subsets, rather than to select and train only on subsets. We also experiment with training on mixed high- and low-quality data, and adapting towards the high-quality set, which produces more intelligible voices than training on either type of data by itself.

We hope that our findings may serve as guidelines for anyone wishing to build their own TTS voice using non-traditional sources of found data.
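The following sketch shows one plausible form of training-data subset selection (an illustrative assumption; the thesis selects by speaker and by acoustic-prosodic criteria known to correspond with TTS-style speech): rank utterances by their distance from the corpus centroid in a z-scored feature space and keep the most typical ones.

import numpy as np

def select_tts_like(features, keep_fraction=0.5):
    # features: (n_utterances, n_features), e.g. f0 mean/range, energy,
    # speaking rate; outliers are likely to introduce artifacts.
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-8
    dist = np.linalg.norm((features - mu) / sigma, axis=1)
    n_keep = int(len(features) * keep_fraction)
    return np.argsort(dist)[:n_keep]   # indices of the most typical utterances

feats = np.random.rand(200, 6)         # placeholder found-data features
print(select_tts_like(feats, 0.25)[:10])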
29

Lozano Argüelles, Cristina. "Formación y uso de la tecnología de los profesores de escuelas de inmersión en español." Thesis, 2014. http://hdl.handle.net/1805/6037.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
The purpose of this research is to examine in depth the technology use of Spanish immersion teachers and the training they have received to integrate ICT into their classes. Specifically, we are interested in their attitude toward and level of confidence with technology, which resources they have available and which they use in class, how they learn to use them (formally and informally), what problems they perceive, and how they would like to improve the integration of technology in their classes. The study focuses on a group of Spanish immersion schools in the states of Indiana, Kentucky, and Ohio.
30

Michell, Colin Simon. "Investigating the use of forensic stylistic and stylometric techniques in the analyses of authorship on a publicly accessible social networking site (Facebook)." Diss., 2013. http://hdl.handle.net/10500/13324.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
This research study examines the forensic application of a selection of stylistic and stylometric techniques in a simulated authorship attribution case involving texts on the social networking site Facebook. Eight participants each submitted 2,000 words of self-authored text from their personal Facebook messages, and one of them submitted an extra 2,000 words to act as the 'disputed text'. The texts were analysed in terms of the first 1,000 words received and then at the full 2,000-word level, to determine what effect text length has on the effectiveness of the chosen style markers (keywords, function words, most frequently occurring words, punctuation, use of digitally mediated communication features, and spelling). It was found that, although the author of the disputed text was correctly identified at the 1,000-word level, the results were not entirely conclusive; at the 2,000-word level the results were more promising, with certain style markers proving particularly effective.
Linguistics
MA (Linguistics)
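A small sketch of one family of style markers used in the study, relative function-word frequencies compared across candidate authors; the word list, texts, and cosine comparison are illustrative assumptions rather than the study's exact procedure.

from collections import Counter
import math

FUNCTION_WORDS = ["the", "of", "and", "a", "to", "in", "that", "it", "is", "i"]

def profile(text):
    # Relative frequency of each function word in the text.
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

candidates = {
    "author_1": "i said that it is fine and i will see to it",
    "author_2": "the start of the match was the best part of the game",
}
disputed = "i believe it is i who said that"
scores = {a: cosine(profile(t), profile(disputed)) for a, t in candidates.items()}
print(max(scores, key=scores.get))     # best-matching candidate author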
31

Mahabeer, Sandhya D. "Barriers in acquiring basic English reading and spelling skills by Zulu-speaking Foundation Phase learners." Diss., 2003. http://hdl.handle.net/10500/1166.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
This study focuses on the barriers that hinder Zulu-speaking English-second-language learners in the Foundation Phase from acquiring basic reading and spelling skills. Nine hypotheses were developed from the literature study. Emanating from this, a quantitative empirical investigation, undertaken at various Foundation Phase schools in and around the greater Durban area, examined these barriers, with a questionnaire as the main instrument. The study highlighted the relationships between the various variables, which were, in the main, found to be significant. The research indicates that contextual, language, school and intrinsic factors are significantly correlated with the problems L2 learners experience in acquiring English reading and spelling skills. The limitations of the investigation are discussed and recommendations based on the results are put forward.
Educational Studies
M. Ed. (Guidance & Counselling)
