Dissertations on the topic "Low resource language"

Consult the top 31 dissertations for your research on the topic "Low resource language".

1

Jansson, Herman. "Low-resource Language Question Answering System with BERT." Thesis, Mittuniversitetet, Institutionen för informationssystem och –teknologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-42317.

Abstract:
The complexity of staying at the forefront of information retrieval systems is constantly increasing. BERT, a recent natural language processing technology, has reached superhuman performance on reading comprehension tasks in high-resource languages. However, several researchers have stated that multilingual models are not enough for low-resource languages, since they lack a thorough understanding of those languages. Recently, a Swedish pre-trained BERT model has been introduced which is trained on significantly more Swedish data than the multilingual models currently available. This study compares multilingual and Swedish monolingual BERT-derived models for question answering, fine-tuning them on both an English and a machine-translated Swedish SQuADv2 data set. The models are evaluated on the SQuADv2 benchmark and within an implemented question answering system built upon the classical retriever-reader methodology. This study introduces a naive and a more robust prediction method for the proposed question answering system, as well as finding a sweet spot for each individual model approach integrated into the system. The question answering system is evaluated and compared against another question answering library at the leading edge of the area, using a custom-crafted Swedish evaluation data set. The results show that the model fine-tuned from the Swedish pre-trained model on the Swedish SQuADv2 data set was superior in all evaluation metrics except speed. The comparison between the different systems resulted in a higher evaluation score but a slower prediction time for this study's system.
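
A minimal sketch of the reader step in the retriever-reader setup described above, using the Hugging Face transformers question-answering pipeline. The checkpoint path is hypothetical and stands in for a Swedish pre-trained BERT fine-tuned on a machine-translated Swedish SQuADv2, as in the thesis.

```python
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="path/to/swedish-bert-finetuned-squadv2",  # assumption, not a real model id
)

result = qa(
    question="Vilket år grundades universitetet?",
    context="Mittuniversitetet är ett svenskt universitet som grundades 2005.",
)
print(result["answer"], result["score"])  # predicted answer span and its confidence
```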
2

Zhang, Yuan Ph D. Massachusetts Institute of Technology. "Transfer learning for low-resource natural language analysis." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/108847.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 131-142).
Expressive machine learning models such as deep neural networks are highly effective when they can be trained with large amounts of in-domain labeled training data. While such annotations may not be readily available for the target task, it is often possible to find labeled data for another related task. The goal of this thesis is to develop novel transfer learning techniques that can effectively leverage annotations in source tasks to improve performance of the target low-resource task. In particular, we focus on two transfer learning scenarios: (1) transfer across languages and (2) transfer across tasks or domains in the same language. In multilingual transfer, we tackle challenges from two perspectives. First, we show that linguistic prior knowledge can be utilized to guide syntactic parsing with little human intervention, by using a hierarchical low-rank tensor method. In both unsupervised and semi-supervised transfer scenarios, this method consistently outperforms state-of-the-art multilingual transfer parsers and the traditional tensor model across more than ten languages. Second, we study lexical-level multilingual transfer in low-resource settings. We demonstrate that only a few (e.g., ten) word translation pairs suffice for an accurate transfer for part-of-speech (POS) tagging. Averaged across six languages, our approach achieves a 37.5% improvement over the monolingual top-performing method when using a comparable amount of supervision. In the second monolingual transfer scenario, we propose an aspect-augmented adversarial network that allows aspect transfer over the same domain. We use this method to transfer across different aspects in the same pathology reports, where traditional domain adaptation approaches commonly fail. Experimental results demonstrate that our approach outperforms different baselines and model variants, yielding a 24% gain on this pathology dataset.
by Yuan Zhang.
Ph. D.
3

Zouhair, Taha. "Automatic Speech Recognition for low-resource languages using Wav2Vec2 : Modern Standard Arabic (MSA) as an example of a low-resource language." Thesis, Högskolan Dalarna, Institutionen för information och teknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:du-37702.

Abstract:
The need for fully automatic translation at DigitalTolk, a Stockholm-based company providing translation services, led to exploring Automatic Speech Recognition as a first step for Modern Standard Arabic (MSA). Facebook AI recently released a second version of its Wav2Vec models, dubbed Wav2Vec 2.0, which uses deep neural networks and provides several English pretrained models along with a multilingual model trained on 53 different languages, referred to as the Cross-Lingual Speech Representation (XLSR-53). The small English pretrained model and XLSR-53 are tested on the Arabic data from Mozilla Common Voice, and the results stemming from them are discussed. In this research, the small model did not yield any results and may have needed more unlabelled data to train, whereas the large model proved successful in predicting the audio recordings in Arabic, achieving an unprecedented Word Error Rate of 24.40%. The small model turned out to be unsuitable for training, especially on languages other than English for which unlabelled data is insufficient. The large model, on the other hand, gave very promising results despite the low amount of data, and should be the model of choice for any future training on low-resource languages such as Arabic.
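
A minimal sketch of the evaluation implied above: transcribing an Arabic recording with a Wav2Vec2 CTC model and scoring it with Word Error Rate. The checkpoint path is hypothetical (it stands in for XLSR-53 after fine-tuning on Common Voice Arabic), and the reference transcript is illustrative.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from jiwer import wer

ckpt = "path/to/xlsr53-finetuned-arabic"  # assumption, not a real model id
processor = Wav2Vec2Processor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt)

waveform, sr = torchaudio.load("clip.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000)  # model expects 16 kHz

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
hypothesis = processor.batch_decode(torch.argmax(logits, dim=-1))[0]

print(hypothesis)
print("WER:", wer("النص المرجعي هنا", hypothesis))  # illustrative reference transcript
```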
4

Packham, Sean. "Crowdsourcing a text corpus for a low resource language." Master's thesis, University of Cape Town, 2016. http://hdl.handle.net/11427/20436.

Abstract:
Low-resourced languages, such as South Africa's isiXhosa, have a limited number of digitised texts, making it challenging to build language corpora and the information retrieval services, such as search and translation, that depend on them. Researchers have been unable to assemble isiXhosa corpora of sufficient size and quality to produce working machine translation systems, and it has been acknowledged that there is little to no training data and that sourcing translations from professionals can be a costly process. A crowdsourcing translation game which paid participants for their contributions was proposed as a solution to source original and relevant parallel corpora for low-resource languages such as isiXhosa. The objective of this dissertation is to report on the four experiments that were conducted to assess user motivation and contribution quantity under various scenarios using the developed crowdsourcing translation game. The first experiment was a pilot study to test a custom-built system and to find out whether social network users would volunteer to participate in a translation game for free. The second experiment tested multiple payment schemes with users from the University of Cape Town; the schemes rewarded users with consistent, increasing or decreasing amounts for subsequent contributions. Experiment 3 tested whether the same users from Experiment 2 would continue contributing if payments were taken away. The last experiment tested a payment scheme that did not offer a direct and guaranteed reward: users were paid based on their leaderboard placement, and only a limited number of the top leaderboard spots were allocated rewards. Experiments 1 and 3 showed that people do not volunteer without financial incentives; Experiments 2 and 4 showed that people want increased rewards when putting in increased effort; Experiment 3 also showed that people will not continue contributing if the financial incentives are taken away; and Experiment 4 showed that the possibility of incentives is as attractive as offering guaranteed incentives.
5

Lakew, Surafel Melaku. "Multilingual Neural Machine Translation for Low Resource Languages." Doctoral thesis, Università degli studi di Trento, 2020. http://hdl.handle.net/11572/257906.

Abstract:
Machine Translation (MT) is the task of mapping a source language to a target language. The recent introduction of neural MT (NMT) has shown promising results for high-resource languages; however, it performs poorly in low-resource language (LRL) settings. Furthermore, the vast majority of the 7,000+ languages around the world do not have parallel data, creating a zero-resource language (ZRL) scenario. In this thesis, we present our approach to improving NMT for LRLs and ZRLs by leveraging multilingual NMT modeling (M-NMT), an approach that allows building a single NMT system to translate across multiple source and target languages. This thesis i) analyzes the effectiveness of M-NMT for LRL and ZRL translation tasks, spanning two NMT architectures (Recurrent and Transformer), ii) presents a self-learning approach for improving the zero-shot translation directions of ZRLs, iii) proposes a dynamic transfer-learning approach from a pre-trained (parent) model to an LRL (child) model by tailoring to the vocabulary entries of the latter, iv) extends M-NMT to translate from a source language to specific language varieties (e.g. dialects), and finally v) proposes an approach that can control the verbosity of an NMT model's output. Our experimental findings show the effectiveness of the proposed approaches in improving NMT for LRLs and ZRLs.
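
A minimal sketch of the standard mechanism behind single-model multilingual NMT, which M-NMT builds on: a token naming the desired target language is prepended to every source sentence, so one encoder-decoder can serve many translation directions, including zero-shot ones. The tag format and training pairs are illustrative.

```python
def tag_source(sentence: str, target_lang: str) -> str:
    """Prefix a source sentence with a target-language token such as <2it>."""
    return f"<2{target_lang}> {sentence}"

# Illustrative training pairs: (source sentence, target language code, reference).
training_pairs = [
    ("Good morning", "it", "Buongiorno"),
    ("How are you?", "fr", "Comment allez-vous ?"),
]

for src, tgt_lang, tgt in training_pairs:
    print(tag_source(src, tgt_lang), "->", tgt)
# At inference time, tagging a direction never seen in training (e.g. <2fr>
# on Italian input) is what enables zero-shot translation in such systems.
```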
6

Mairidan, Wushouer. "Pivot-Based Bilingual Dictionary Creation for Low-Resource Languages." 京都大学 (Kyoto University), 2015. http://hdl.handle.net/2433/199441.

7

Samson, Juan Sarah Flora. "Exploiting resources from closely-related languages for automatic speech recognition in low-resource languages from Malaysia." Thesis, Université Grenoble Alpes (ComUE), 2015. http://www.theses.fr/2015GREAM061/document.

Abstract:
Languages in Malaysia are dying at an alarming rate. As of today, 15 languages are in danger while two languages are extinct. One of the methods to save languages is to document them, but this is a tedious task when performed manually. An Automatic Speech Recognition (ASR) system could be a tool to help speed up the process of documenting speech from native speakers. However, building ASR systems for a target language requires a large amount of training data, as current state-of-the-art techniques are based on empirical approaches. Hence, there are many challenges in building ASR for languages that have limited data available. The main aim of this thesis is to investigate the effects of using data from closely-related languages to build ASR for low-resource languages in Malaysia. Past studies have shown that cross-lingual and multilingual methods can improve the performance of low-resource ASR. In this thesis, we try to answer several questions concerning these approaches: How do we know which language is beneficial for our low-resource language? How does the relationship between source and target languages influence speech recognition performance? Is pooling language data an optimal approach for a multilingual strategy? Our case study is Iban, an under-resourced language spoken on the island of Borneo. We study the effects of using data from Malay, a local dominant language which is close to Iban, for developing Iban ASR under different resource constraints, and we propose several approaches to adapt Malay data to obtain pronunciation and acoustic models for Iban speech. Building a pronunciation dictionary from scratch is time-consuming, as one needs to properly define the sound units of each word in a vocabulary, so we developed a semi-supervised approach to quickly build a pronunciation dictionary for Iban, based on bootstrapping techniques for improving Malay data to match Iban pronunciations. To increase the performance of low-resource acoustic models we explored two acoustic modelling techniques: Subspace Gaussian Mixture Models (SGMM) and Deep Neural Networks (DNN). We performed cross-lingual strategies using both frameworks to adapt out-of-language data to Iban speech. Results show that using Malay data is beneficial for increasing the performance of Iban ASR. We also tested SGMM and DNN to improve low-resource non-native ASR: we proposed a fine merging strategy for obtaining an optimal multi-accent SGMM, and we developed an accent-specific DNN using native speech data. After applying both methods, we obtained significant improvements in ASR accuracy. From our study, we observe that using SGMM and DNN for a cross-lingual strategy is effective when training data is very limited.
8

Tafreshi, Shabnam. "Cross-Genre, Cross-Lingual, and Low-Resource Emotion Classification." Thesis, The George Washington University, 2021. http://pqdtopen.proquest.com/#viewpdf?dispub=28088437.

Abstract:
Emotions can be defined as a natural, instinctive state of mind arising from one's circumstances, mood, and relationships with others. How and what humans feel has long been a question to be answered by psychology. Enabling computers to recognize human emotions has been of interest to researchers since the 1990s (Picard et al., 1995). Ever since, this area of research has grown significantly and emotion detection is becoming an important component in many natural language processing tasks. Several theories exist for defining emotions and are chosen by researchers according to their needs. For instance, according to appraisal theory, a psychology theory, emotions are produced by our evaluations (appraisals or estimates) of events that cause a specific reaction in different people. Some emotions are easy and universal, while others are complex and nuanced. Emotion classification is generally the process of labeling a piece of text with one or more corresponding emotion labels. Psychologists have developed numerous models and taxonomies of emotions; the model or taxonomy depends on the problem, and thorough study is often required to select the best model. Early studies of emotion classification focused on building computational models to classify basic emotion categories. In recent years, increasing volumes of social media and the digitization of data have opened a new horizon in this area of study, where emotion classification is a key component of applications including mood and behavioral studies as well as disaster relief, amongst many others. Sophisticated models have been built to detect and classify emotion in text, but few analyze how well a model is able to learn emotion cues. The ability to learn emotion cues properly and to generalize this learning is very important. This work investigates the robustness of emotion classification approaches across genres and languages, with a focus on quantifying how well state-of-the-art models are able to learn emotion cues. First, we use multi-task learning and hierarchical models to build emotion models trained on data combined from multiple genres. Our hypothesis is that a multi-genre, noisy training environment will help the classifier learn emotion cues that are prevalent across genres. Second, we explore splitting text (i.e. sentences) into clauses and testing whether the model's performance improves. Emotion analysis needs fine-grained annotation, and clause-level annotation can be beneficial for designing features that improve emotion detection performance. Intuitively, clause-level annotations may help the model focus on emotion cues while ignoring irrelevant portions of the text. Third, we adopt a transfer learning approach for cross-lingual/genre emotion classification to focus the classifier's attention on emotion cues which are consistent across languages. Fourth, we empirically show how to combine different genres to build robust models that can be used as source models for emotion transfer to low-resource target languages. Finally, this study involved curating and re-annotating popular emotional data sets in different genres, annotating a multi-genre corpus of Persian tweets and news, and generating a collection of emotional sentences for a low-resource language, Azerbaijani, a language spoken in the northwest of Iran.
9

Singh, Mittul [Verfasser], and Dietrich [Akademischer Betreuer] Klakow. "Handling long-term dependencies and rare words in low-resource language modelling / Mittul Singh ; Betreuer: Dietrich Klakow." Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2017. http://d-nb.info/1141677962/34.

10

Dyer, Andrew. "Low Supervision, Low Corpus size, Low Similarity! Challenges in cross-lingual alignment of word embeddings : An exploration of the limitations of cross-lingual word embedding alignment in truly low resource scenarios." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-395946.

Abstract:
Cross-lingual word embeddings are an increasingly important resource in cross-lingual methods for NLP, particularly for their role in transfer learning and unsupervised machine translation, purportedly opening up the opportunity for NLP applications for low-resource languages. However, most research in this area implicitly expects the availability of vast monolingual corpora for training embeddings, a scenario which is not realistic for many of the world's languages. Moreover, much of the reporting of the performance of cross-lingual word embeddings is based on a fairly narrow set of mostly European language pairs. Our study examines the performance of cross-lingual alignment across a more diverse set of language pairs; controls for the effect of the corpus size on which the monolingual embedding spaces are trained; and studies the impact of spectral graph properties of the embedding space on alignment. Through our experiments on a more diverse set of language pairs, we find that performance in bilingual lexicon induction is generally poor in heterogeneous pairs, and that even using a gold or heuristically derived dictionary has little impact on the performance for these pairs of languages. We also find that the performance for these languages increases only slowly with corpus size. Finally, we find a moderate correlation between the isospectral difference of the source and target embeddings and the performance of bilingual lexicon induction. We infer that methods other than cross-lingual alignment may be more appropriate in the case of both low-resource languages and heterogeneous language pairs.
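
A minimal sketch of supervised cross-lingual embedding alignment by orthogonal Procrustes, the standard baseline behind experiments of this kind: given a seed dictionary of translation pairs, learn the orthogonal map W minimising ||XW - Y||_F and apply it to the whole source space. The vectors here are random toy data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 50, 10  # toy dimensionality and a small seed dictionary

X = rng.normal(size=(n_pairs, d))  # source-language vectors of dictionary words
Y = rng.normal(size=(n_pairs, d))  # target-language vectors of their translations

# Orthogonal Procrustes solution: W = U V^T, where U S V^T = SVD(X^T Y).
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

aligned = X @ W  # source vectors mapped into the target space
# Bilingual lexicon induction would now retrieve the nearest target neighbour
# of each aligned source vector, e.g. by cosine similarity.
```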
11

Arbi, Haza Nasution. "Bilingual Lexicon Induction Framework for Closely Related Languages." Kyoto University, 2018. http://hdl.handle.net/2433/235115.

12

Godard, Pierre. "Unsupervised word discovery for computational language documentation." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLS062/document.

Abstract:
Language diversity is under considerable pressure: half of the world's languages could disappear by the end of this century. This realization has sparked many initiatives in documentary linguistics in the past two decades, and 2019 was proclaimed the International Year of Indigenous Languages by the United Nations, to raise public awareness of the issue and foster initiatives for language documentation and preservation. Yet documentation and preservation are time-consuming processes, and the supply of field linguists is limited. Consequently, the emerging field of computational language documentation (CLD) seeks to assist linguists by providing them with automatic processing tools. The Breaking the Unwritten Language Barrier (BULB) project, for instance, constitutes one of the efforts defining this new field, bringing together linguists and computer scientists. This thesis examines the particular problem of discovering words in an unsegmented stream of characters, or phonemes, transcribed from speech in a very-low-resource setting. This primarily involves a segmentation procedure, which can also be paired with an alignment procedure when a translation is available. Using two realistic Bantu corpora for language documentation, one in Mboshi (Republic of the Congo) and the other in Myene (Gabon), we benchmark various monolingual and bilingual unsupervised word discovery methods. We then show that using expert knowledge in the Adaptor Grammar framework can vastly improve segmentation results, and we indicate ways to use this framework as a decision tool for the linguist. We also propose a tonal variant for a strong nonparametric Bayesian segmentation algorithm, making use of a modified backoff scheme designed to capture tonal structure. To leverage the weak supervision given by a translation, we finally propose and extend an attention-based neural segmentation method, significantly improving the segmentation performance of an existing bilingual method.
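
A toy illustration of the segmentation problem at the core of this work: recovering word boundaries in an unsegmented phoneme or character stream. Here a small known lexicon with unigram probabilities is searched by dynamic programming; the thesis itself uses unsupervised nonparametric Bayesian models, for which this supervised toy is only a stand-in, and the lexicon entries are invented.

```python
import math

lexicon = {"mbo": 0.2, "shi": 0.2, "mboshi": 0.5, "a": 0.1}  # illustrative entries

def segment(stream: str):
    """Best segmentation of `stream` into lexicon words (Viterbi over prefixes)."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(stream)
    for end in range(1, len(stream) + 1):
        for start in range(end):
            word = stream[start:end]
            if word in lexicon and best[start][1] is not None:
                score = best[start][0] + math.log(lexicon[word])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[-1][1]

print(segment("mboshia"))  # -> ['mboshi', 'a']
```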
13

Black, Kevin P. "Interactive Machine Assistance: A Case Study in Linking Corpora and Dictionaries." BYU ScholarsArchive, 2015. https://scholarsarchive.byu.edu/etd/5620.

Abstract:
Machine learning can provide assistance to humans in making decisions, including linguistic decisions such as determining the part of speech of a word. Supervised machine learning methods derive patterns indicative of possible labels (decisions) from annotated example data. For many problems, including most language analysis problems, acquiring annotated data requires human annotators who are trained to understand the problem and to disambiguate among multiple possible labels. Hence, the availability of experts can limit the scope and quantity of annotated data. Machine-learned pre-annotation assistance, which suggests probable labels for unannotated items, can enable expert annotators to work more quickly and thus to produce broader and larger annotated resources more cost-efficiently. Yet, because annotated data is required to build the pre-annotation model, bootstrapping is an obstacle to utilizing pre-annotation assistance, especially for low-resource problems where little or no annotated data exists. Interactive pre-annotation assistance can mitigate bootstrapping costs, even for low-resource problems, by continually refining the pre-annotation model with new annotated examples as the annotators work. In practice, continually refining models has seldom been done except for the simplest of models which can be trained quickly. As a case study in developing sophisticated, interactive, machine-assisted annotation, this work employs the task of corpus-dictionary linkage (CDL), which is to link each word token in a corpus to its correct dictionary entry. CDL resources, such as machine-readable dictionaries and concordances, are essential aids in many tasks including language learning and corpus studies. We employ a pipeline model to provide CDL pre-annotations, with one model per CDL sub-task. We evaluate different models for lemmatization, the most significant CDL sub-task since many dictionary entry headwords are usually lemmas. The best performing lemmatization model is a hybrid which uses a maximum entropy Markov model (MEMM) to handle unknown (novel) word tokens and other component models to handle known word tokens. We extend the hybrid model design to the other CDL sub-tasks in the pipeline. We develop an incremental training algorithm for the MEMM which avoids wasting previous computation as would be done by simply retraining from scratch. The incremental training algorithm facilitates the addition of new dictionary entries over time (i.e., new labels) and also facilitates learning from partially annotated sentences which allows annotators to annotate words in any order. We validate that the hybrid model attains high accuracy and can be trained sufficiently quickly to provide interactive pre-annotation assistance by simulating CDL annotation on Quranic Arabic and classical Syriac data.
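
A minimal sketch of the hybrid lemmatization design described above: known tokens are handled by a lookup component learned from annotated data, and unknown tokens fall back to a statistical guesser (an MEMM in the thesis; the suffix-stripping fallback below is a purely illustrative stand-in).

```python
known_lemmas = {"books": "book", "running": "run"}  # learned token -> lemma table

def guess_lemma(token: str) -> str:
    """Hypothetical fallback for unknown tokens; stands in for the MEMM."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def lemmatize(token: str) -> str:
    # Dispatch: exact-match component for known tokens, guesser otherwise.
    return known_lemmas.get(token, guess_lemma(token))

print([lemmatize(t) for t in ["books", "running", "walked"]])  # ['book', 'run', 'walk']
```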
14

Meftah, Sara. "Neural Transfer Learning for Domain Adaptation in Natural Language Processing." Thesis, université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG021.

Abstract:
Recent approaches based on end-to-end deep neural networks have revolutionised Natural Language Processing (NLP), achieving remarkable results in several tasks and languages. Nevertheless, these approaches are limited by their "gluttony" in terms of annotated data, since they rely on a supervised training paradigm, i.e. training from scratch on large amounts of annotated data. Therefore, there is a wide gap between the capabilities of NLP technologies for high-resource languages and the long tail of low-resourced languages. Moreover, NLP researchers have focused much of their effort on training NLP models on the news domain, due to the availability of training data, yet many research works have highlighted that models trained on news fail to work efficiently on out-of-domain data, due to their lack of robustness against domain shifts. This thesis presents a study of transfer learning approaches through which we propose different methods to benefit from knowledge pre-learned on a high-resourced domain to enhance the performance of neural NLP models in low-resourced settings. Precisely, we apply our approaches to transfer from the news domain to the social media domain. Indeed, despite the importance of its valuable content for a variety of applications (e.g. public security, health monitoring, or highlighting trends), this domain is still poor in terms of annotated data. We present different contributions. First, we propose two methods to transfer the knowledge encoded in the neural representations of a source model, pretrained on large labelled datasets from the source domain, to a target model, further adapted by fine-tuning on a few annotated examples from the target domain. The first method transfers contextualised representations pretrained with supervision, while the second transfers pretrained weights used to initialise the target model's parameters. Second, we perform a series of analyses to spot the limits of the above-mentioned methods. We find that even if the proposed transfer learning approach enhances performance on the social media domain, a hidden negative transfer may mitigate the final gain brought by transfer learning. In addition, an interpretive analysis of the pretrained model shows that pretrained neurons may be biased by what they have learned from the source domain, and thus struggle to learn patterns specific to the target domain. Third, stemming from our analysis, we propose a new adaptation scheme which augments the target model with normalised, weighted and randomly initialised neurons that enable better adaptation while maintaining the valuable source knowledge. Finally, we propose a model that, in addition to the knowledge pre-learned from the high-resource source domain, takes advantage of various supervised NLP tasks.
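
A minimal sketch, in PyTorch, of the two transfer schemes described above: (1) initialising a target model from weights pretrained on the source domain, and (2) augmenting the model with extra randomly initialised units so it can learn target-specific patterns. Layer types, sizes and the tagset are illustrative, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

hidden, labels = 128, 17  # illustrative sizes (e.g. a POS tagset)

source_encoder = nn.GRU(input_size=64, hidden_size=hidden, batch_first=True)
# ... assume source_encoder was trained on the annotated news (source) domain ...

# Scheme 1: weight transfer -- copy source parameters into the target model.
target_encoder = nn.GRU(input_size=64, hidden_size=hidden, batch_first=True)
target_encoder.load_state_dict(source_encoder.state_dict())

# Scheme 2: add fresh, randomly initialised units alongside the transferred
# representations, then fine-tune everything on the target (social media) data.
extra = 32
random_units = nn.Linear(hidden, extra)
classifier = nn.Linear(hidden + extra, labels)

def forward(x: torch.Tensor) -> torch.Tensor:
    out, _ = target_encoder(x)                                   # transferred features
    aug = torch.cat([out, torch.relu(random_units(out))], dim=-1)
    return classifier(aug)                                       # per-token label scores

print(forward(torch.randn(2, 5, 64)).shape)  # torch.Size([2, 5, 17])
```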
15

Karagol-Ayan, Burcu. "Resource generation from structured documents for low-density languages." College Park, Md.: University of Maryland, 2007. http://hdl.handle.net/1903/7580.

Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2007.
Thesis research directed by: Dept. of Computer Science. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
16

Karim, Hiva. "Best way for collecting data for low-resourced languages." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-35945.

Abstract:
Low resource languages possess a limited number of digitized texts, making it challenging to generate a satisfactory language audio corpus and information retrieval services. Low resource languages, especially those spoken exclusively in African countries, lack a well-defined and annotated language corpus, making it a big obstacle for experts to provide a comprehensive text processing system. In this study, I found out the best practices for producing and collecting data for such zero/low resource languages by means of crowd-sourcing. For the purpose of this study, a number of research articles (n=260) were extracted from Google Scholar, Microsoft Academic, and ScienceDirect. Of these articles, only 60, which met the demands of the inclusion criteria, were considered for an eligibility review. A full-text version of these research articles was downloaded and carefully screened to ensure eligibility. As a result of the eligibility assessment of these potentially eligible 60 full-text articles, only 25 were selected and qualified for inclusion in the final review. From the final pool of selected articles concerning data generation practices and collection for low resource languages, it can be concluded that speech-based audio data is one of the most common and accessible data types. It can be contended that the collection of audio data from speech-based resources, such as native speakers of the intended language and available audio recordings, taking advantage of new technologies, is the most practical, cost-effective, and common method for collecting data for low resource languages.
17

Aufrant, Lauriane. "Training parsers for low-resourced languages : improving cross-lingual transfer with monolingual knowledge." Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLS089/document.

Abstract:
As a result of the recent blossoming of Machine Learning techniques, the Natural Language Processing field faces an increasingly thorny bottleneck: the most efficient algorithms entirely rely on the availability of large training data. These technological advances consequently remain unavailable for the 7,000 languages in the world, most of which are low-resourced. One way to bypass this limitation is the approach of cross-lingual transfer, whereby resources available in another (source) language are leveraged to help build accurate systems in the desired (target) language. However, despite promising results in research settings, the standard transfer techniques lack the flexibility regarding cross-lingual resources needed to be fully usable in real-world scenarios: exploiting very sparse resources, or assorted arrays of resources. This limitation strongly diminishes the applicability of that approach. This thesis consequently proposes to combine multiple sources and resources for transfer, with an emphasis on selectivity: can we estimate which resource of which language is useful for which input? This strategy is put into practice in the frame of transition-based dependency parsing. To this end, a new transfer framework is designed with a cascading architecture: it enables the desired combination while ensuring better targeted exploitation of each resource, down to the level of the word. Empirical evaluation indeed dampens the enthusiasm for the purely cross-lingual approach -- it remains in general preferable to annotate just a few target sentences -- but also highlights its complementarity with other approaches. Several metrics are developed to characterize precisely cross-lingual similarities, syntactic idiosyncrasies, and the added value of cross-lingual information compared to monolingual training. The substantial benefits of typological knowledge are also explored. The whole study relies on a series of technical improvements to the parsing framework: this work includes the release of new open-source software, PanParser, which revisits the so-called dynamic oracles to extend their use cases. Several purely monolingual contributions complete this work, including an exploration of monolingual cascading, which offers promising perspectives with easy-then-hard strategies (e.g. processing all easy dependencies first, then only the hard ones).
18

Fapšo, Michal. "Vyhledávání výrazů v řeči pomocí mluvených příkladů." Doctoral thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-261237.

Abstract:
This thesis deals with query-by-example spoken term detection (QbE STD). Queries are entered in spoken form and searched for in a set of speech recordings; the output of the search is a list of detections with their scores and timing. We describe, analyze, and compare three different approaches to QbE STD in language-dependent and language-independent settings, with one and five examples per query. For our experiments we used Czech, Hungarian, English, and Arabic (Levantine) data, and for each of these languages we trained a 3-state phone recognizer. This gave us 16 possible combinations of the evaluation language and the language on which the recognizer was trained: four combinations were language-dependent and 12 were language-independent. All QbE systems were evaluated on the same data and the same phone posterior features, using two metrics, non-pooled Figure-of-Merit (FOM) and our proposed utterance-normalized non-pooled Figure-of-Merit, which provided relevant data for comparing these QbE approaches and for gaining better insight into their behavior. The QbE approaches used in this work are: sequential statistical modelling (GMM/HMM), template matching of features (DTW), and matching of phone lattices (WFST). To compare the results of the QbE approaches with conventional STD systems searching for textual queries, we also evaluated the language-dependent configurations with an acoustic keyword spotting system (AKWS) and a system searching for phone strings in lattices (WFSTlat). The core of this work is the development, analysis, and improvement of the WFST QbE STD system, which after improvement achieves results similar to the DTW system in language-dependent settings.
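
A minimal sketch of the DTW approach named above: aligning a spoken query with a window of a search utterance over frame-level phone-posterior features, using a cosine local distance. A real QbE STD system slides such a window over whole recordings and reports low-cost alignments as detections; the feature matrices below are random toy data.

```python
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def dtw_cost(query: np.ndarray, window: np.ndarray) -> float:
    """Cumulative DTW alignment cost between two (frames x dims) matrices."""
    q, w = len(query), len(window)
    D = np.full((q + 1, w + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, q + 1):
        for j in range(1, w + 1):
            d = cosine_dist(query[i - 1], window[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[q, w] / (q + w)  # length-normalised so scores are comparable

rng = np.random.default_rng(1)
query = rng.random((20, 30))   # 20 frames of 30-dim phone posteriors (toy data)
window = rng.random((25, 30))
print(dtw_cost(query, window))
```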
19

Vu, Ngoc Thang [Verfasser]. "Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information / Ngoc Thang Vu." Aachen : Shaker, 2014. http://d-nb.info/1058315811/34.

20

Vu, Ngoc Thang [Verfasser], and T. [Akademischer Betreuer] Schultz. "Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information / Ngoc Thang Vu. Betreuer: T. Schultz." Karlsruhe : KIT-Bibliothek, 2014. http://d-nb.info/1051848229/34.

21

Susman, Derya. "Turkish Large Vocabulary Continuous Speech Recognition By Using Limited Audio Corpus." Master's thesis, METU, 2012. http://etd.lib.metu.edu.tr/upload/12614207/index.pdf.

Abstract:
Speech recognition in the Turkish language is a challenging problem from several perspectives, most of which are related to the morphological structure of the language. Since Turkish is an agglutinative language, it is possible to generate many words from a single stem by using suffixes. This characteristic of the language increases the number of out-of-vocabulary (OOV) words, which degrade the performance of a speech recognizer dramatically. Turkish also allows words to be ordered in a relatively free manner, which makes it difficult to generate robust language models. In this thesis, the existing models and approaches which address the problem of Turkish LVCSR (Large Vocabulary Continuous Speech Recognition) are explored. Different recognition units (words, morphs, stems and endings) are used in generating the n-gram language models; 3-gram and 4-gram language models are generated with respect to the recognition unit. Since speech recognition is a machine learning problem, the performance of the recognizer depends on the sufficiency of the audio data used in acoustic model training. However, it is difficult to obtain rich audio corpora for the Turkish language. In this thesis, existing approaches are used to solve the problem of Turkish LVCSR with a limited audio corpus, and several data selection approaches are proposed to improve the robustness of the acoustic model.
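
A minimal sketch of the sub-word language modelling idea above: counting 3-grams over morph-like units instead of whole words, so that unseen inflected forms decompose into known units. The segmentations are illustrative, not the output of a real morphological analyser.

```python
from collections import Counter
from itertools import islice

# "evlerimizde" ("in our houses") segmented as stem + suffixes, illustratively.
corpus = [
    ["ev", "+ler", "+imiz", "+de"],
    ["ev", "+ler", "+de"],
]

def ngrams(units, n=3):
    padded = ["<s>"] * (n - 1) + units + ["</s>"]
    return zip(*(islice(padded, i, None) for i in range(n)))

counts = Counter(g for sentence in corpus for g in ngrams(sentence))
print(counts.most_common(3))  # raw counts; a real LM would also smooth these
```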
22

Cecovniuc, Ioana. "¿Qué prefiere usted, pagar en metálico o con cardo?: los falsos amigos y su concienciación lingüística en las aulas plurilingües de ELE." Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/401588.

Abstract:
The major problems relating to false friends entail difficulties for most students in terms of pronunciation, writing and interaction in a foreign language. With false friends we refer to words or structures of different languages that display morphological affinity but, at the same time, semantic divergence. Examples of (total or partial) false linguistic relations between mother tongue and target language abound. Nevertheless, practical experience concerning this subject is quite scarce: false friends are either omitted or reduced, in general, to lists or inventories of the most notable cases, and hence are not (much) integrated in the foreign language classroom along with other teaching and learning strategies and tactics for a meaningful instruction. Consequently, this Ph.D. thesis aims at providing relevant aspects of both theoretical and practical nature in relation to the typology, the uses and the didactic value of false friends as instances of interlinguistic influence. In conjunction therewith, the contrastive perspective makes available the epistemological elements of the present analysis: speakers are naturally prone to transfer structures and meanings, and distributions of these structures and meanings, from their mother tongue or their first foreign language to another foreign language they are learning. This basic statement is analyzed by means of Contrastive Linguistics, which includes psycholinguistic and educational postulates of two complementary methodologies: Contrastive Analysis and Error Analysis. In addition, the principles that define Plurilingualism are considered by investigating recent studies regarding interlinguistic influences when learning a foreign language. In light of the theoretical assumptions, interferences result from the relationship between the mother tongue and the foreign language. However, learning two or more foreign languages is a worldwide tendency nowadays, so delimitations in the learning processes cannot be drawn exclusively between the mother tongue and the target language: a first foreign language can be a factor that positively or negatively influences the learning of a second foreign language. In this regard, the present work attempts to cover different aspects of the topic of false friends when Spanish (the second foreign language being learned) comes into contact with English (the first foreign language) and Romanian or Dutch (the mother tongues).
23

Pronto, Lindon N. "Exploring German and American Modes of Pedagogical and Institutional Sustainability: Forging a Way into the Future." Scholarship @ Claremont, 2012. http://scholarship.claremont.edu/pitzer_theses/21.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
Rooted deep in Germany's past is its modern socio-political grounding for environmental respect and sustainability. This grounding translates into individual and collective action and extends to the economic and policy realms as much as to educational institutions. This thesis evaluates research conducted in Germany with a view to which approaches are most transferable to the United States liberal arts setting. Furthermore, exemplary American models of institutional sustainability and environmental education are explored and combined with those from abroad to produce a blueprint and action plan fitting for the American college and university.
24

Farra, Noura. "Cross-Lingual and Low-Resource Sentiment Analysis." Thesis, 2019. https://doi.org/10.7916/d8-x3b7-1r92.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language which lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis to date have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages. This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model which can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language. Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small, in-domain parallel corpus works best as a transfer medium if it is available. We describe the conditions under which other resources and embedding generation methods are successful, and these include our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis. To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments. The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language. In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment.
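As a rough illustration of the direct-transfer approach described above, the sketch below trains a sentiment classifier on source-language sentences represented with bilingual word vectors and applies it unchanged to the target language. The tiny vocabulary and tied translation pairs are hypothetical stand-ins for bilingual embeddings pre-trained on an in-domain parallel corpus; this is a sketch of the general idea, not the author's system.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
DIM = 50

# Toy shared space: in practice these would be bilingual word vectors
# pre-trained on a parallel corpus (assumption for this sketch).
bilingual_emb = {w: rng.normal(size=DIM)
                 for w in ["good", "bad", "movie", "bueno", "malo", "película"]}
# Tie translation pairs together so both languages share one space.
bilingual_emb["bueno"] = bilingual_emb["good"]
bilingual_emb["malo"] = bilingual_emb["bad"]
bilingual_emb["película"] = bilingual_emb["movie"]

def sentence_vector(tokens):
    # Average the embeddings of the tokens we can look up.
    vecs = [bilingual_emb[t] for t in tokens if t in bilingual_emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

# Train on labeled high-resource source-language data only.
src = [(["good", "movie"], 1), (["bad", "movie"], 0)]
X = np.stack([sentence_vector(toks) for toks, _ in src])
y = [label for _, label in src]
clf = LogisticRegression().fit(X, y)

# Zero-shot prediction on the low-resource target language:
# no target labels, no translation, only the shared embedding space.
print(clf.predict(sentence_vector(["bueno", "película"])[None, :]))  # -> [1]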
25

Zhang, Xiao. "Flexible Structured Prediction in Natural Language Processing with Partially Annotated Corpora." Thesis, 2020.

Find the full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
Structured prediction makes coherent decisions over structured objects, capturing the interrelations of the predicted variables. It has been widely used in many areas, such as bioinformatics, computer vision, speech recognition, and natural language processing. Machine learning with reduced supervision aims to lessen the laborious and error-prone annotation effort and to benefit low-resource languages. In this dissertation we study structured prediction with reduced supervision for two sets of problems, sequence labeling and dependency parsing, both of which are representative structured prediction problems in NLP. We investigate three different approaches.
The first approach is learning with a modular architecture by task decomposition. By decomposing the labels into a location sub-label and a type sub-label, we design neural modules to tackle these sub-labels respectively, with an additional module to fuse the information. Experiments on benchmark datasets show that the modular architecture outperforms existing models and can use partially labeled data together with fully labeled data to improve on the performance of using fully labeled data alone.
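As a toy illustration of this decomposition (written under our own assumptions, not taken from the dissertation's code), each BIO-style tag splits into a location sub-label and a type sub-label, which separate neural modules can predict and a further module can fuse:

def decompose(tag):
    # 'B-PER' -> ('B', 'PER'); the 'O' tag carries no type sub-label.
    if tag == "O":
        return "O", None
    location, entity_type = tag.split("-", 1)
    return location, entity_type

def recompose(location, entity_type):
    return "O" if location == "O" else f"{location}-{entity_type}"

tags = ["B-PER", "I-PER", "O", "B-LOC"]
subs = [decompose(t) for t in tags]
print(subs)                                # [('B', 'PER'), ('I', 'PER'), ('O', None), ('B', 'LOC')]
print([recompose(l, t) for l, t in subs])  # round-trips to the original tags

Partially labeled data fits this factorization naturally: an example annotated only with entity boundaries still supervises the location module, while the type module's loss is simply masked out for that example.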

The second approach builds the neural CRF autoencoder (NCRFAE) model, which combines a discriminative component and a generative component for semi-supervised sequence labeling. The model has a unified structure with shared parameters, using different loss functions for labeled and unlabeled data. We develop a variant of the EM algorithm for optimizing the model with tractable inference. Experiments on the POS tagging task in several languages show that the model outperforms existing systems in both supervised and semi-supervised setups.
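The sketch below illustrates the shared-parameter idea in a deliberately simplified, token-independent form (an assumption for illustration; the actual model uses a linear-chain CRF and an EM variant for inference). One encoder scores tags; labeled sentences receive both a discriminative and a generative loss, while unlabeled sentences receive only the generative reconstruction loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

V, T, H = 100, 5, 32                 # vocab size, tag set size, hidden size
emb = nn.Embedding(V, H)
encoder = nn.Linear(H, T)            # token -> tag scores (CRF omitted here)
decoder = nn.Linear(T, V)            # tag posterior -> word logits

def loss(tokens, tags=None):
    scores = encoder(emb(tokens))               # (len, T)
    q = F.softmax(scores, dim=-1)               # per-token tag posterior
    recon = decoder(q)                          # expected word logits
    gen = F.cross_entropy(recon, tokens)        # generative component
    if tags is None:                            # unlabeled sentence
        return gen
    disc = F.cross_entropy(scores, tags)        # discriminative component
    return disc + gen                           # shared parameters, two losses

x = torch.randint(0, V, (7,))
print(loss(x, torch.randint(0, T, (7,))).item())  # labeled loss
print(loss(x).item())                             # unlabeled loss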

The third approach builds two models for semi-supervised dependency parsing, namely the local autoencoding parser (LAP) and the global autoencoding parser (GAP). LAP assumes the chain-structured sentence has a latent representation and uses this representation to construct the dependency tree, while GAP treats the dependency tree itself as a latent variable. Both models have unified structures for sentences with and without annotated parse trees. Experiments on several languages show that both parsers can use unlabeled sentences to improve on the performance obtained with labeled sentences alone; LAP is faster, while GAP outperforms existing models.
26

Kamran, Amir. "Hybrid Machine Translation Approaches for Low-Resource Languages." Master's thesis, 2011. http://www.nusl.cz/ntk/nusl-313015.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
In recent years, corpus-based machine translation systems have produced significant results for a number of language pairs. However, for low-resource languages like Urdu, purely statistical or purely example-based methods do not perform well. On the other hand, rule-based approaches require a huge amount of time and resources to develop the rules, which makes them impractical in most scenarios. Hybrid machine translation systems, which combine the best of different approaches to achieve quality translation, may be one way to overcome these problems. The goal of the thesis is to explore different combinations of approaches and to evaluate their performance against the standard corpus-based methods currently in use. This includes: 1. the use of syntax-based and dependency-based reordering rules with statistical machine translation; 2. the automatic extraction of lexical and syntactic rules using statistical methods to facilitate transfer-based machine translation. The novel element of the proposed work is an algorithm that learns reordering rules automatically for English-to-Urdu statistical machine translation; the approach can be extended to learn lexical and syntactic rules for building a rule-based machine translation system.
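As a hedged illustration of what one such pre-reordering rule could look like (the rule and POS tags below are invented for the example, not produced by the thesis's learning algorithm), the sketch moves an English verb after its object to approximate Urdu's SOV order before statistical translation:

def reorder_svo_to_sov(tagged):
    # tagged: list of (word, POS); hold each verb back until the
    # following noun-phrase head has been emitted.
    out, held_verbs = [], []
    for word, pos in tagged:
        if pos.startswith("VB"):
            held_verbs.append((word, pos))
        else:
            out.append((word, pos))
            if pos in ("NN", "NNS", "PRP") and held_verbs:
                out.extend(held_verbs)       # emit verbs after the object
                held_verbs = []
    return out + held_verbs

sent = [("She", "PRP"), ("reads", "VBZ"), ("books", "NNS")]
print([w for w, _ in reorder_svo_to_sov(sent)])   # ['She', 'books', 'reads']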
27

"Query-by-example spoken term detection for low-resource languages." 2014. http://repository.lib.cuhk.edu.hk/en/item/cuhk-1290682.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
In this thesis, we consider the problem of query-by-example (QbyE) spoken term detection (STD) for low-resource languages. The problem is to automatically detect and locate the occurrences of a query term in a large audio database. The query term is given in the form of one or more audio examples. This research is motivated by the demand for information retrieval technologies that can handle speech data of low-resource languages. The major technical difficulty is that manual transcriptions and linguistic knowledge are not available for these languages.
The framework of acoustic segment modeling (ASM) is adopted for unsupervised training of a speech tokenizer. Three novel algorithms are developed for segment labeling in the ASM framework. The proposed algorithms are based on different class-by-segment posterior representations and spectral clustering techniques. The posterior representations are shown to be more robust than conventional spectral representations. Spectral clustering has achieved significant success in many applications; here, the algorithms are reformulated to be computationally feasible for clustering a large number of speech segments. Experiments on a multilingual speech database demonstrate the advantage of the proposed algorithms over existing approaches.
The speech tokenizer obtained with ASM is applied to QbyE STD. The detection of spoken queries is based on a frame-based template matching framework. The ASM tokenizer serves as the front-end to generate posterior features, which are used for temporal template matching by dynamic time warping (DTW). Experiments show that the ASM tokenizer outperforms a GMM tokenizer and language-mismatched phoneme recognizers. Moreover, a two-step approach is proposed for efficient search.
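A minimal sketch of the matching step follows, under the assumption that query and test utterances have already been converted to posteriorgrams by the ASM tokenizer; a real system would slide the query over the database with subsequence DTW rather than align whole utterances.

import numpy as np

def dtw_cost(query, test):
    # query, test: (n, d) and (m, d) frame-posterior matrices.
    n, m = len(query), len(test)
    # Frame-level distance: 1 - cosine similarity between posteriors.
    sim = query @ test.T
    sim /= (np.linalg.norm(query, axis=1)[:, None]
            * np.linalg.norm(test, axis=1)[None, :])
    dist = 1.0 - sim
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j],      # deletion
                                               D[i, j - 1],      # insertion
                                               D[i - 1, j - 1])  # match
    return D[n, m] / (n + m)          # length-normalized alignment cost

q = np.random.rand(20, 50)            # stand-ins for ASM posteriorgrams
t = np.random.rand(35, 50)
print(dtw_cost(q, t))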
The frame-based template matching framework for QbyE STD is enhanced in three ways. A novel DTW matrix combination approach is proposed for fusing multiple systems with different posterior features. Pseudo-relevance feedback is used for query expansion, and score normalization is applied to calibrate the score distributions of different query terms. Experimental results show that all three approaches significantly improve the performance of the QbyE STD system.
Spoken term detection is the task of locating the occurrences of a given query term in a large speech database; it is valuable both in academic research and in practical applications. Traditional research on spoken term detection has mainly targeted resource-rich languages. This thesis studies spoken term detection for low-resource languages: under the setting considered here, the target language lacks the resources needed to train a speech recognition system, and query terms are given in the form of audio examples.
This thesis adopts the acoustic segment modeling (ASM) framework for unsupervised training of a speech tokenizer. Three new methods are proposed for clustering the speech segments within the ASM framework, based on a new robust segment representation and on spectral clustering techniques. Experiments show that these methods outperform three commonly used baselines and achieve better modeling performance.
The ASM tokenizer is then used in a template-matching-based spoken term detection system, where it serves as a front-end feature transformation module that extracts posterior probability features. A two-step detection method is also proposed to improve detection efficiency. Experiments confirm that this approach achieves high detection accuracy.
To further improve accuracy, the template-matching system is enhanced from three angles: system fusion performed on the dynamic time warping distance matrices, pseudo-relevance feedback to obtain additional query examples, and score normalization so that a single unified detection threshold can be set. Experimental results show that all three methods effectively improve system performance.
Wang, Haipeng.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2014.
Includes bibliographical references (leaves 110-127).
Abstracts also in Chinese.
Title from PDF title page (viewed on 5 December 2016).
Detailed summary in vernacular field only.
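Of the three enhancements described above, score normalization is the simplest to illustrate. The sketch below z-normalizes each query's detection scores, a common calibration choice assumed here for illustration, so that one global threshold can serve query terms with very different raw score ranges:

import numpy as np

def z_normalize(scores):
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + 1e-8)

raw = {"query_A": [0.91, 0.85, 0.40], "query_B": [0.30, 0.28, 0.05]}
calibrated = {q: z_normalize(s) for q, s in raw.items()}
print(calibrated)   # both queries' scores now live on a comparable scale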
28

Cooper, Erica Lindsay. "Text-to-Speech Synthesis Using Found Data for Low-Resource Languages." Thesis, 2019. https://doi.org/10.7916/d8-vdzp-j870.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
Text-to-speech synthesis is a key component of interactive, speech-based systems. Typically, building a high-quality voice requires collecting dozens of hours of speech from a single professional speaker in an anechoic chamber with a high-quality microphone. There are about 7,000 languages spoken in the world, and most do not enjoy the speech research attention historically paid to such languages as English, Spanish, Mandarin, and Japanese. Speakers of these so-called "low-resource languages" therefore do not equally benefit from these technological advances. While it takes a great deal of time and resources to collect a traditional text-to-speech corpus for a given language, we may instead be able to make use of various sources of "found" data which may be available. In particular, sources such as radio broadcast news and ASR corpora are available for many languages. While this kind of data does not exactly match what one would collect for a more standard TTS corpus, it may nevertheless contain parts which are usable for producing natural and intelligible parametric TTS voices.

In the first part of this thesis, we examine various types of found speech data in comparison with data collected for TTS, in terms of a variety of acoustic and prosodic features. We find that radio broadcast news in particular is a good match. Audiobooks may also be a good match despite their largely more expressive style, and certain speakers in conversational and read ASR corpora also resemble TTS speakers in their manner of speaking and thus their data may be usable for training TTS voices.

In the rest of the thesis, we conduct a variety of experiments in training voices on non-traditional sources of data, such as ASR data, radio broadcast news, and audiobooks. We aim to discover which methods produce the most intelligible and natural-sounding voices, focusing on three main approaches:

1) Training data subset selection. In noisy, heterogeneous data sources, we may wish to locate subsets of the data that are well-suited for building voices, based on acoustic and prosodic features that are known to correspond with TTS-style speech, while excluding utterances that introduce noise or other artifacts. We find that choosing subsets of speakers for training data can result in voices that are more intelligible.

2) Augmenting the frontend feature set with new features. In cleaner sources of found data, we may wish to train voices on all of the data, but we may get improvements in naturalness by including acoustic and prosodic features at the frontend and synthesizing in a manner that better matches the TTS style. We find that this approach is promising for creating more natural-sounding voices, regardless of the underlying acoustic model.

3) Adaptation. Another way to make use of high-quality data while also including informative acoustic and prosodic features is to adapt to subsets, rather than to select and train only on subsets. We also experiment with training on mixed high- and low-quality data, and adapting towards the high-quality set, which produces more intelligible voices than training on either type of data by itself.

We hope that our findings may serve as guidelines for anyone wishing to build their own TTS voice using non-traditional sources of found data.
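The following sketch shows one plausible form of training-data subset selection (an illustrative assumption; the thesis selects by speaker and by acoustic-prosodic criteria known to correspond with TTS-style speech): rank utterances by their distance from the corpus centroid in a z-scored feature space and keep the most typical ones.

import numpy as np

def select_tts_like(features, keep_fraction=0.5):
    # features: (n_utterances, n_features), e.g. f0 mean/range, energy,
    # speaking rate; outliers are likely to introduce artifacts.
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-8
    dist = np.linalg.norm((features - mu) / sigma, axis=1)
    n_keep = int(len(features) * keep_fraction)
    return np.argsort(dist)[:n_keep]   # indices of the most typical utterances

feats = np.random.rand(200, 6)         # placeholder found-data features
print(select_tts_like(feats, 0.25)[:10])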
29

Lozano Argüelles, Cristina. "Formación y uso de la tecnología de los profesores de escuelas de inmersión en español." Thesis, 2014. http://hdl.handle.net/1805/6037.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
The purpose of this research is to examine in depth the technology use of Spanish immersion teachers and the training they have received to integrate ICT into their classes. Specifically, we are interested in their attitude toward and level of confidence with technology, which resources they have available and which they use in class, how they learn to use them (formally and informally), what problems they perceive, and how they would like to improve the integration of technology in their classes. The study focuses on a group of Spanish immersion schools in the states of Indiana, Kentucky, and Ohio.
30

Michell, Colin Simon. "Investigating the use of forensic stylistic and stylometric techniques in the analyses of authorship on a publicly accessible social networking site (Facebook)." Diss., 2013. http://hdl.handle.net/10500/13324.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
This research study examines the forensic application of a selection of stylistic and stylometric techniques in a simulated authorship attribution case involving texts on the social networking site Facebook. Eight participants each submitted 2,000 words of self-authored text from their personal Facebook messages, and one of them submitted an extra 2,000 words to act as the 'disputed text'. The texts were analysed in terms of the first 1,000 words received and then at the full 2,000-word level, to determine what effect text length has on the effectiveness of the chosen style markers (keywords, function words, most frequently occurring words, punctuation, use of digitally mediated communication features, and spelling). It was found that, although the author of the disputed text was correctly identified at the 1,000-word level, the results were not entirely conclusive; at the 2,000-word level the results were more promising, with certain style markers proving particularly effective.
Linguistics
MA (Linguistics)
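A small sketch of one family of style markers used in the study, relative function-word frequencies compared across candidate authors; the word list, texts, and cosine comparison are illustrative assumptions rather than the study's exact procedure.

from collections import Counter
import math

FUNCTION_WORDS = ["the", "of", "and", "a", "to", "in", "that", "it", "is", "i"]

def profile(text):
    # Relative frequency of each function word in the text.
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

candidates = {
    "author_1": "i said that it is fine and i will see to it",
    "author_2": "the start of the match was the best part of the game",
}
disputed = "i believe it is i who said that"
scores = {a: cosine(profile(t), profile(disputed)) for a, t in candidates.items()}
print(max(scores, key=scores.get))     # best-matching candidate author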
31

Mahabeer, Sandhya D. "Barriers in acquiring basic English reading and spelling skills by Zulu-speaking Foundation Phase learners." Diss., 2003. http://hdl.handle.net/10500/1166.

Full text of the source
APA, Harvard, Vancouver, ISO and other styles
Abstract:
This study focuses on the barriers that hinder Zulu-speaking English-second-language learners in the Foundation Phase from acquiring basic reading and spelling skills. Nine hypotheses were developed from the literature study. Emanating from this, a quantitative empirical investigation, undertaken at various Foundation Phase schools in and around the greater Durban area, examined these barriers, with a questionnaire as the main instrument. The study highlighted the relationships between the various variables, which were, in the main, found to be significant. The research indicates that contextual, language, school and intrinsic factors are significantly correlated with the problems L2 learners experience in acquiring English reading and spelling skills. The limitations of the investigation are discussed and recommendations based on the results are put forward.
Educational Studies
M. Ed. (Guidance & Counselling)
