A selection of scholarly literature on the topic "Low resource language"

Format your source according to APA, MLA, Chicago, Harvard, and other citation styles

Consult the lists of relevant articles, books, dissertations, conference theses, and other scholarly sources on the topic "Low resource language".

Next to each work in the list of references there is an "Add to bibliography" button. Click it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, and others.

You can also download the full text of a scholarly publication in .pdf format and read its abstract online, where these are available in the metadata.

Journal articles on the topic "Low resource language":

1

Lin, Donghui, Yohei Murakami, and Toru Ishida. "Towards Language Service Creation and Customization for Low-Resource Languages." Information 11, no. 2 (January 27, 2020): 67. http://dx.doi.org/10.3390/info11020067.

Abstract:
The most challenging issue with low-resource languages is the difficulty of obtaining enough language resources. In this paper, we propose a language service framework for low-resource languages that enables the automatic creation and customization of new resources from existing ones. To achieve this goal, we first introduce a service-oriented language infrastructure, the Language Grid; it realizes new language services by supporting the sharing and combining of language resources. We then show the applicability of the Language Grid to low-resource languages. Furthermore, we describe how we can now realize the automation and customization of language services. Finally, we illustrate our design concept by detailing a case study of automating and customizing bilingual dictionary induction for low-resource Turkic languages and Indonesian ethnic languages.
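
The dictionary-induction idea can be made concrete by composing two existing dictionaries through a shared pivot language. The sketch below is a minimal illustration of the general pivot technique, not the Language Grid implementation; the toy Uyghur, English, and Kazakh entries are invented.

```python
# Minimal sketch: compose source->pivot and pivot->target dictionaries to
# propose source->target entries. Real systems then filter these candidates.
from collections import defaultdict

def induce_dictionary(src_to_pivot, pivot_to_tgt):
    """Compose two bilingual dictionaries through a shared pivot language."""
    induced = defaultdict(set)
    for src_word, pivot_words in src_to_pivot.items():
        for pivot_word in pivot_words:
            for tgt_word in pivot_to_tgt.get(pivot_word, ()):
                induced[src_word].add(tgt_word)
    return dict(induced)

# Hypothetical Uyghur->English and English->Kazakh entries.
uig_eng = {"su": {"water"}, "kitab": {"book"}}
eng_kaz = {"water": {"su"}, "book": {"kitap"}}
print(induce_dictionary(uig_eng, eng_kaz))  # {'su': {'su'}, 'kitab': {'kitap'}}
```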
2

Zhou, Shuyan, Shruti Rijhwani, John Wieting, Jaime Carbonell, and Graham Neubig. "Improving Candidate Generation for Low-resource Cross-lingual Entity Linking." Transactions of the Association for Computational Linguistics 8 (July 2020): 109–24. http://dx.doi.org/10.1162/tacl_a_00303.

Abstract:
Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on resources from Wikipedia have proven successful in the realm of relatively high-resource languages, but these do not extend well to low-resource languages with few, if any, Wikipedia pages. Recently, transfer learning methods have been shown to reduce the demand for resources in low-resource languages by utilizing resources in closely related languages, but performance still lags far behind that of their high-resource counterparts. In this paper, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios. The methods are simple but effective: we experiment with our approach on seven XEL datasets and find that they yield an average gain of 16.9% in Top-30 gold candidate recall, compared with state-of-the-art baselines. Our improved model also yields an average gain of 7.9% in in-KB accuracy of end-to-end XEL.
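
The headline metric here, Top-k gold candidate recall, is simple to compute once candidates are ranked. The sketch below assumes candidate lists and gold entity ids as plain Python lists, an illustrative convention rather than the paper's code.

```python
# Minimal sketch of Top-k gold candidate recall: the fraction of mentions
# whose gold KB entity appears among the first k retrieved candidates.
def top_k_recall(predictions, golds, k=30):
    """predictions: ranked candidate lists, one per mention.
    golds: gold KB entity ids, aligned with predictions."""
    hits = sum(1 for cands, gold in zip(predictions, golds) if gold in cands[:k])
    return hits / len(golds)

preds = [["Q1", "Q5", "Q7"], ["Q9", "Q2"], ["Q4"]]   # invented candidate ids
golds = ["Q5", "Q3", "Q4"]
print(f"Top-30 recall: {top_k_recall(preds, golds):.2%}")  # 66.67%
```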
3

Mati, Diellza Nagavci, Mentor Hamiti, Arsim Susuri, Besnik Selimi, and Jaumin Ajdari. "Building Dictionaries for Low Resource Languages: Challenges of Unsupervised Learning." Annals of Emerging Technologies in Computing 5, no. 3 (July 1, 2021): 52–58. http://dx.doi.org/10.33166/aetic.2021.03.005.

Abstract:
The development of natural language processing resources for Albanian has grown steadily in recent years. This paper presents research on unsupervised learning: the challenges associated with building a dictionary for the Albanian language and creating part-of-speech tagging models. Most languages have their own dictionaries, but low-resource languages suffer from a lack of such resources, even though a dictionary facilitates the sharing of information and services for users and whole communities through natural language processing. The experimentation corpus for the Albanian language includes 250K sentences from different disciplines, together with a proposed part-of-speech tag set that can adequately represent the underlying linguistic phenomena. The purpose of this paper is to contribute to the development of Albanian. Experiments with the Albanian corpus revealed that its use of articles and pronouns resembles that of higher-resource languages. According to this study, the total expected frequency as a means of correctly tagging words has proven effective for populating the Albanian dictionary.
4

Shikali, Casper S., and Refuoe Mokhosi. "Enhancing African low-resource languages: Swahili data for language modelling." Data in Brief 31 (August 2020): 105951. http://dx.doi.org/10.1016/j.dib.2020.105951.

5

Chen, Siqi, Yijie Pei, Zunwang Ke, and Wushour Silamu. "Low-Resource Named Entity Recognition via the Pre-Training Model." Symmetry 13, no. 5 (May 2, 2021): 786. http://dx.doi.org/10.3390/sym13050786.

Abstract:
Named entity recognition (NER) is an important task in natural language processing, which needs to determine entity boundaries and classify them into pre-defined categories. For low-resource languages, most state-of-the-art systems require tens of thousands of annotated sentences to obtain high performance. However, there is minimal annotated data available for Uyghur and Hungarian (UH languages) NER tasks. There are also specificities in each task: differences in words and word order across languages make it a challenging problem. In this paper, we present an effective solution for providing a meaningful and easy-to-use feature extractor for named entity recognition tasks: fine-tuning the pre-trained language model. Therefore, we propose a fine-tuning method for a low-resource language model, which constructs a fine-tuning dataset through data augmentation; then the dataset of a high-resource language is added; and finally the cross-language pre-trained model is fine-tuned on this dataset. In addition, we propose an attention-based fine-tuning strategy that uses symmetry to better select relevant semantic and syntactic information from pre-trained language models and applies these symmetry features to named entity recognition tasks. We evaluated our approach on Uyghur and Hungarian datasets, where it showed excellent performance compared with several strong baselines. We close with an overview of the available resources for named entity recognition and some of the open research questions.
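
The data augmentation step is described only at a high level above; one common instantiation for NER is mention substitution over BIO-tagged data, sketched below under that assumption. The tag scheme, mention bank, and example are invented, not the authors' pipeline.

```python
# Minimal sketch: synthesize new NER training sentences by swapping each
# entity span for a random mention of the same type.
import random

def augment(tokens, tags, mention_bank):
    """Swap each BIO entity span for a random same-type mention from the bank."""
    out_tokens, out_tags, i = [], [], 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tags) and tags[j] == f"I-{etype}":
                j += 1
            mention = random.choice(mention_bank[etype])
            out_tokens.extend(mention)
            out_tags.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(mention) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tags[i])
            i += 1
    return out_tokens, out_tags

# Hypothetical example with a person-name bank.
bank = {"PER": [["Alim"], ["Gül", "Ayxan"]]}
tokens = ["bügün", "Erkin", "Tömür", "keldi"]
tags = ["O", "B-PER", "I-PER", "O"]
print(augment(tokens, tags, bank))
```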
6

Rijhwani, Shruti, Jiateng Xie, Graham Neubig, and Jaime Carbonell. "Zero-Shot Neural Transfer for Cross-Lingual Entity Linking." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6924–31. http://dx.doi.org/10.1609/aaai.v33i01.33016924.

Abstract:
Cross-lingual entity linking maps an entity mention in a source language to its corresponding entry in a structured knowledge base that is in a different (target) language. While previous work relies heavily on bilingual lexical resources to bridge the gap between the source and the target languages, these resources are scarce or unavailable for many low-resource languages. To address this problem, we investigate zero-shot cross-lingual entity linking, in which we assume no bilingual lexical resources are available in the source low-resource language. Specifically, we propose pivot-based entity linking, which leverages information from a high-resource “pivot” language to train character-level neural entity linking models that are transferred to the source low-resource language in a zero-shot manner. With experiments on 9 low-resource languages and transfer through a total of 54 languages, we show that our proposed pivot-based framework improves entity linking accuracy 17% (absolute) on average over the baseline systems, for the zero-shot scenario. Further, we also investigate the use of language-universal phonological representations which improves average accuracy (absolute) by 36% when transferring between languages that use different scripts.
7

Mi, Chenggang, Shaolin Zhu, and Rui Nie. "Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion." Computational Intelligence and Neuroscience 2021 (April 8, 2021): 1–9. http://dx.doi.org/10.1155/2021/9975078.

Abstract:
Loanword identification has been studied in recent years to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation and cross-lingual information retrieval. However, recent studies on this topic usually focus on high-resource languages (such as Chinese, English, and Russian); for low-resource languages, such as Uyghur and Mongolian, loanword identification tends to perform worse due to the limitation of resources and lack of annotated data. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource loanword identification; then, a loanword identification model based on a log-linear RNN is introduced to improve performance by incorporating features such as word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) into one model. Experimental results on loanword identification in Uyghur (in this study, we mainly focus on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) show that our proposed method achieves the best performance compared with several strong baseline systems.
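
Among the fused features above, pronunciation similarity is the easiest to illustrate. The sketch below approximates it with a normalized string-similarity ratio over romanized forms; this is an assumption made for illustration, as the paper may compute the feature differently.

```python
# Minimal sketch of a pronunciation-similarity feature for loanword detection.
from difflib import SequenceMatcher

def pron_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two romanized pronunciations."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical romanizations: a Uyghur word vs. a possible Russian donor form.
print(pron_similarity("radiyo", "radio"))  # high ratio: likely loanword pair
print(pron_similarity("radiyo", "kitap"))  # low ratio: unrelated word
```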
8

Lee, Chanhee, Kisu Yang, Taesun Whang, Chanjun Park, Andrew Matteson, and Heuiseok Lim. "Exploring the Data Efficiency of Cross-Lingual Post-Training in Pretrained Language Models." Applied Sciences 11, no. 5 (February 24, 2021): 1974. http://dx.doi.org/10.3390/app11051974.

Abstract:
Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Even though language modeling is unsupervised and thus collecting data for it is relatively less expensive, it is still a challenging process for languages with limited resources. This results in great technological disparity between high- and low-resource languages for numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data efficient training of pretrained language models. It is achieved by formulating language modeling of low-resource languages as a domain adaptation task using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at a sequence level. To evaluate our method, we post-train a RoBERTa model pretrained in English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves the data efficiency by a factor of up to 32 compared to monolingual training.
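
The selective parameter reuse described above can be approximated in a few lines with the Hugging Face Transformers library, which is an assumed stand-in here (the authors' own code may differ): freeze the English-pretrained Transformer body and leave only the language-specific embeddings trainable. The sketch omits the proposed implicit translation layers.

```python
# Minimal sketch: reuse a pretrained body, train only embeddings for the
# new language during the first post-training stage.
from transformers import RobertaModel

model = RobertaModel.from_pretrained("roberta-base")  # high-resource parent

for name, param in model.named_parameters():
    # Embedding parameters are treated as language-specific; the reused
    # Transformer layers stay frozen.
    param.requires_grad = name.startswith("embeddings.")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")
```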
9

Andrabi, Syed Abdul Basit, et al. "A Review of Machine Translation for South Asian Low Resource Languages." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12, no. 5 (April 10, 2021): 1134–47. http://dx.doi.org/10.17762/turcomat.v12i5.1777.

Abstract:
Machine translation is an application of natural language processing. Humans use natural languages to communicate with one another, whereas programming languages mediate between humans and computers. NLP involves a broad set of techniques for the analysis, manipulation, and automatic generation of human (natural) languages with the help of computers. In the present information age, it is essential to provide people with access to information for their development, and equal emphasis must be placed on removing language barriers between different divisions of society. NLP strives to remove these barriers through machine translation, in which one natural language is transformed into another with the aid of computers. The first few years of this area were dedicated to the development of rule-based systems; later, owing to the increase in computational power, there was a transition towards statistical machine translation. The guiding principle of machine translation is that the meaning of the text should be preserved during translation. This paper analyses the machine translation approaches used for resource-poor languages, determines the needs and challenges researchers face, and reviews the machine translation systems available for resource-poor languages.
10

Zhang, Mozhi, Yoshinari Fujinuma, and Jordan Boyd-Graber. "Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 9547–54. http://dx.doi.org/10.1609/aaai.v34i05.6500.

Abstract:
Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (caco) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. Experiments confirm that character-level knowledge transfer is more data-efficient than word-level transfer between related languages.
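
The shared character-level space that lets related languages transfer knowledge can be caricatured with a hashing-based n-gram embedder. The sketch below illustrates the intuition only; it is not the paper's jointly trained model, and the bucket hashing is an assumption.

```python
# Minimal sketch: embed words from their written forms so that cognates with
# similar spellings receive similar vectors.
import zlib
import numpy as np

def char_ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class CharEmbedder:
    """Word vectors built from hashed character n-grams."""
    def __init__(self, dim=64, buckets=10_000, seed=0):
        self.table = np.random.default_rng(seed).standard_normal((buckets, dim))

    def embed(self, word):
        ids = [zlib.crc32(g.encode()) % len(self.table) for g in char_ngrams(word)]
        return self.table[ids].mean(axis=0)

emb = CharEmbedder()
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Cognates that share spelling also share n-grams, so they tend to land
# closer together than unrelated words.
print(cos(emb.embed("nación"), emb.embed("nação")))  # tends to be higher
print(cos(emb.embed("nación"), emb.embed("perro")))  # tends to be lower
```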

Dissertations on the topic "Low resource language":

1

Jansson, Herman. "Low-resource Language Question Answering System with BERT." Thesis, Mittuniversitetet, Institutionen för informationssystem och –teknologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-42317.

Abstract:
The complexity of staying at the forefront of information retrieval systems is constantly increasing. BERT, a recent natural language processing technology, has reached superhuman performance on reading comprehension tasks in high-resource languages. However, several researchers have stated that multilingual models are not enough for low-resource languages, since they lack a thorough understanding of those languages. Recently, a Swedish pre-trained BERT model has been introduced which is trained on significantly more Swedish data than the multilingual models currently available. This study compares multilingual and Swedish monolingual BERT models for question answering, using both an English and a machine-translated Swedish SQuADv2 data set during fine-tuning. The models are evaluated on the SQuADv2 benchmark and within an implemented question answering system built upon the classical retriever-reader methodology. This study introduces both a naive and a more robust prediction method for the proposed question answering system, as well as finding a sweet spot for each individual model approach integrated into the system. The question answering system is evaluated against another leading question answering library using a custom-crafted Swedish evaluation data set. The results show that the fine-tuned model based on the Swedish pre-trained model and the Swedish SQuADv2 data set was superior in all evaluation metrics except speed. The comparison between the different systems resulted in a higher evaluation score but a slower prediction time for this study's system.
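
The retriever-reader methodology named above can be sketched in a few lines: a lightweight retriever narrows the corpus, then a reader extracts the answer. Below, the retriever is TF-IDF and the reader is a stub standing in for a fine-tuned BERT model; the corpus and question are invented placeholders, not the thesis implementation.

```python
# Minimal sketch of a retriever-reader QA pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Stockholm is the capital of Sweden.",
    "BERT is a pretrained language model.",
    "SQuADv2 contains unanswerable questions.",
]

def retrieve(question, k=2):
    """Return the k passages most similar to the question under TF-IDF."""
    vec = TfidfVectorizer().fit(corpus + [question])
    sims = cosine_similarity(vec.transform([question]), vec.transform(corpus))[0]
    return [corpus[i] for i in sims.argsort()[::-1][:k]]

def read(question, passages):
    # Stand-in for a fine-tuned BERT reader that scores answer spans.
    return passages[0]

question = "What is the capital of Sweden?"
print(read(question, retrieve(question)))
```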
2

Zhang, Yuan Ph D. Massachusetts Institute of Technology. "Transfer learning for low-resource natural language analysis." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/108847.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 131-142).
Expressive machine learning models such as deep neural networks are highly effective when they can be trained with large amounts of in-domain labeled training data. While such annotations may not be readily available for the target task, it is often possible to find labeled data for another related task. The goal of this thesis is to develop novel transfer learning techniques that can effectively leverage annotations in source tasks to improve performance of the target low-resource task. In particular, we focus on two transfer learning scenarios: (1) transfer across languages and (2) transfer across tasks or domains in the same language. In multilingual transfer, we tackle challenges from two perspectives. First, we show that linguistic prior knowledge can be utilized to guide syntactic parsing with little human intervention, by using a hierarchical low-rank tensor method. In both unsupervised and semi-supervised transfer scenarios, this method consistently outperforms state-of-the-art multilingual transfer parsers and the traditional tensor model across more than ten languages. Second, we study lexical-level multilingual transfer in low-resource settings. We demonstrate that only a few (e.g., ten) word translation pairs suffice for an accurate transfer for part-of-speech (POS) tagging. Averaged across six languages, our approach achieves a 37.5% improvement over the monolingual top-performing method when using a comparable amount of supervision. In the second monolingual transfer scenario, we propose an aspect-augmented adversarial network that allows aspect transfer over the same domain. We use this method to transfer across different aspects in the same pathology reports, where traditional domain adaptation approaches commonly fail. Experimental results demonstrate that our approach outperforms different baselines and model variants, yielding a 24% gain on this pathology dataset.
by Yuan Zhang.
Ph. D.
3

Zouhair, Taha. "Automatic Speech Recognition for low-resource languages using Wav2Vec2: Modern Standard Arabic (MSA) as an example of a low-resource language." Thesis, Högskolan Dalarna, Institutionen för information och teknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:du-37702.

Abstract:
The need for fully automatic translation at DigitalTolk, a Stockholm-based company providing translation services, led to exploring Automatic Speech Recognition as a first step for Modern Standard Arabic (MSA). Facebook AI recently released a second version of its Wav2Vec models, dubbed Wav2Vec 2.0, which uses deep neural networks and provides several English pretrained models along with a multilingual model trained on 53 different languages, referred to as the Cross-Lingual Speech Representation (XLSR-53). The small English pretrained model and XLSR-53 are tested on the Arabic data from Mozilla Common Voice, and the resulting outcomes are discussed. In this research, the small model did not yield any results and may have needed more unlabelled data for training, whereas the large model proved successful in predicting the Arabic audio recordings, achieving a Word Error Rate of 24.40%, an unprecedented result. The small model turned out to be unsuitable for training, especially on languages other than English and where unlabelled data is scarce. On the other hand, the large model gave very promising results despite the low amount of data, and should be the model of choice for any future training on low-resource languages such as Arabic.
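
The Word Error Rate quoted above is word-level edit distance divided by reference length; a minimal sketch with invented strings follows.

```python
# Minimal sketch of Word Error Rate (WER) via dynamic programming.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

print(f"WER: {wer('the cat sat', 'the cat sat down'):.2%}")  # 33.33%
```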
4

Packham, Sean. "Crowdsourcing a text corpus for a low resource language." Master's thesis, University of Cape Town, 2016. http://hdl.handle.net/11427/20436.

Abstract:
Low-resourced languages, such as South Africa's isiXhosa, have a limited number of digitised texts, making it challenging to build language corpora and the information retrieval services, such as search and translation, that depend on them. Researchers have been unable to assemble isiXhosa corpora of sufficient size and quality to produce working machine translation systems; it has been acknowledged that there is little to no training data, and sourcing translations from professionals can be a costly process. A crowdsourcing translation game which paid participants for their contributions was proposed as a solution for sourcing original and relevant parallel corpora for low-resource languages such as isiXhosa. The objective of this dissertation is to report on four experiments that were conducted to assess user motivation and contribution quantity under various scenarios using the developed crowdsourcing translation game. The first experiment was a pilot study to test a custom-built system and to find out whether social network users would volunteer to participate in a translation game for free. The second experiment tested multiple payment schemes with users from the University of Cape Town; the schemes rewarded users with consistent, increasing, or decreasing amounts for subsequent contributions. Experiment 3 tested whether the same users from Experiment 2 would continue contributing if payments were taken away. The last experiment tested a payment scheme that did not offer a direct and guaranteed reward: users were paid based on their leaderboard placement, and only a limited number of the top leaderboard spots were allocated rewards. Experiments 1 and 3 showed that people do not volunteer without financial incentives; Experiments 2 and 4 showed that people want increased rewards when putting in increased effort; Experiment 3 also showed that people will not continue contributing if financial incentives are taken away; and Experiment 4 showed that the possibility of an incentive is as attractive as offering a guaranteed incentive.
5

Lakew, Surafel Melaku. "Multilingual Neural Machine Translation for Low Resource Languages." Doctoral thesis, Università degli studi di Trento, 2020. http://hdl.handle.net/11572/257906.

Abstract:
Machine Translation (MT) is the task of mapping a source language to a target language. The recent introduction of neural MT (NMT) has shown promising results for high-resource languages, but it performs poorly in low-resource language (LRL) settings. Furthermore, the vast majority of the 7,000+ languages around the world do not have parallel data, creating a zero-resource language (ZRL) scenario. In this thesis, we present our approach to improving NMT for LRLs and ZRLs, leveraging multilingual NMT modeling (M-NMT), an approach that allows building a single NMT system to translate across multiple source and target languages. This thesis begins by i) analyzing the effectiveness of M-NMT for LRL and ZRL translation tasks, spanning two NMT architectures (recurrent and Transformer); ii) presenting a self-learning approach for improving the zero-shot translation directions of ZRLs; iii) proposing a dynamic transfer-learning approach from a pre-trained (parent) model to an LRL (child) model by tailoring to the vocabulary entries of the latter; iv) extending M-NMT to translate from a source language to specific language varieties (e.g. dialects); and finally, v) proposing an approach that can control the verbosity of an NMT model's output. Our experimental findings show the effectiveness of the proposed approaches in improving NMT for LRLs and ZRLs.
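
A single multilingual model translating into several targets is commonly steered with a target-language tag prepended to each source sentence. The sketch below shows that general convention; the tag format and sentences are illustrative assumptions, not the thesis's exact preprocessing.

```python
# Minimal sketch: prepend a target-language tag so one model can serve
# many translation directions.
def tag_for_target(src_sentence: str, tgt_lang: str) -> str:
    return f"<2{tgt_lang}> {src_sentence}"

corpus = [("Hello world", "am"), ("Good morning", "ti")]
for src, tgt in corpus:
    print(tag_for_target(src, tgt))
# <2am> Hello world
# <2ti> Good morning
```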
6

Mairidan, Wushouer. "Pivot-Based Bilingual Dictionary Creation for Low-Resource Languages." 京都大学 (Kyoto University), 2015. http://hdl.handle.net/2433/199441.

7

Samson, Juan Sarah Flora. "Exploiting resources from closely-related languages for automatic speech recognition in low-resource languages from Malaysia." Thesis, Université Grenoble Alpes (ComUE), 2015. http://www.theses.fr/2015GREAM061/document.

Abstract:
Languages in Malaysia are dying at an alarming rate. As of today, 15 languages are in danger while two languages are extinct. One of the methods to save languages is by documenting them, but it is a tedious task when performed manually. An Automatic Speech Recognition (ASR) system could be a tool to help speed up the process of documenting speech from native speakers. However, building ASR systems for a target language requires a large amount of training data, as current state-of-the-art techniques are based on empirical approaches. Hence, there are many challenges in building ASR for languages that have limited data available. The main aim of this thesis is to investigate the effects of using data from closely-related languages to build ASR for low-resource languages in Malaysia. Past studies have shown that cross-lingual and multilingual methods could improve the performance of low-resource ASR. In this thesis, we try to answer several questions concerning these approaches: How do we know which language is beneficial for our low-resource language? How does the relationship between source and target languages influence speech recognition performance? Is pooling language data an optimal approach for a multilingual strategy? Our case study is Iban, an under-resourced language spoken on the island of Borneo. We study the effects of using data from Malay, a local dominant language which is close to Iban, for developing Iban ASR under different resource constraints. We propose several approaches to adapt Malay data to obtain pronunciation and acoustic models for Iban speech. Building a pronunciation dictionary from scratch is time consuming, as one needs to properly define the sound units of each word in a vocabulary. We developed a semi-supervised approach to quickly build a pronunciation dictionary for Iban, based on bootstrapping techniques for improving Malay data to match Iban pronunciations. To increase the performance of low-resource acoustic models we explored two acoustic modelling techniques, Subspace Gaussian Mixture Models (SGMM) and Deep Neural Networks (DNN). We performed cross-lingual strategies using both frameworks for adapting out-of-language data to Iban speech. Results show that using Malay data is beneficial for increasing the performance of Iban ASR. We also tested SGMM and DNN to improve low-resource non-native ASR. We proposed a fine merging strategy for obtaining an optimal multi-accent SGMM. In addition, we developed an accent-specific DNN using native speech data. After applying both methods, we obtained significant improvements in ASR accuracy. From our study, we observe that using SGMM and DNN for a cross-lingual strategy is effective when training data is very limited.
8

Tafreshi, Shabnam. "Cross-Genre, Cross-Lingual, and Low-Resource Emotion Classification." Thesis, The George Washington University, 2021. http://pqdtopen.proquest.com/#viewpdf?dispub=28088437.

Abstract:
Emotions can be defined as a natural, instinctive state of mind arising from one's circumstances, mood, and relationships with others. How and what humans feel has long been a question for psychology to answer. Enabling computers to recognize human emotions has been of interest to researchers since the 1990s (Picard et al., 1995). Ever since, this area of research has grown significantly and emotion detection has become an important component in many natural language processing tasks. Several theories exist for defining emotions, and researchers choose among them according to their needs. For instance, according to appraisal theory, a psychology theory, emotions are produced by our evaluations (appraisals or estimates) of events that cause a specific reaction in different people. Some emotions are easy and universal, while others are complex and nuanced. Emotion classification is generally the process of labeling a piece of text with one or more corresponding emotion labels. Psychologists have developed numerous models and taxonomies of emotions. The model or taxonomy depends on the problem, and thorough study is often required to select the best model. Early studies of emotion classification focused on building computational models to classify basic emotion categories. In recent years, increasing volumes of social media and the digitization of data have opened a new horizon in this area of study, where emotion classification is a key component of applications including mood and behavioral studies as well as disaster relief, amongst many other applications. Sophisticated models have been built to detect and classify emotion in text, but few analyze how well a model is able to learn emotion cues. The ability to learn emotion cues properly and to generalize this learning is very important. This work investigates the robustness of emotion classification approaches across genres and languages, with a focus on quantifying how well state-of-the-art models are able to learn emotion cues. First, we use multi-task learning and hierarchical models to build emotion models that were trained on data combined from multiple genres. Our hypothesis is that a multi-genre, noisy training environment will help the classifier learn emotion cues that are prevalent across genres. Second, we explore splitting text (i.e. sentences) into clauses and testing whether the model's performance improves. Emotion analysis needs fine-grained annotation, and clause-level annotation can be beneficial for designing features that improve emotion detection performance. Intuitively, clause-level annotations may help the model focus on emotion cues while ignoring irrelevant portions of the text. Third, we adopted a transfer learning approach for cross-lingual/genre emotion classification to focus the classifier's attention on emotion cues which are consistent across languages. Fourth, we empirically show how to combine different genres to build robust models that can be used as source models for emotion transfer to low-resource target languages. Finally, this study involved curating and re-annotating popular emotional data sets in different genres, annotating a multi-genre corpus of Persian tweets and news, and generating a collection of emotional sentences for a low-resource language, Azerbaijani, a language spoken in the northwest of Iran.
9

Singh, Mittul [Verfasser], and Dietrich [Akademischer Betreuer] Klakow. "Handling long-term dependencies and rare words in low-resource language modelling / Mittul Singh ; Betreuer: Dietrich Klakow." Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2017. http://d-nb.info/1141677962/34.

10

Dyer, Andrew. "Low Supervision, Low Corpus size, Low Similarity! Challenges in cross-lingual alignment of word embeddings : An exploration of the limitations of cross-lingual word embedding alignment in truly low resource scenarios." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-395946.

Abstract:
Cross-lingual word embeddings are an increasingly important resource in cross-lingual methods for NLP, particularly for their role in transfer learning and unsupervised machine translation, purportedly opening up the opportunity for NLP applications for low-resource languages. However, most research in this area implicitly expects the availability of vast monolingual corpora for training embeddings, a scenario which is not realistic for many of the world's languages. Moreover, much of the reporting of the performance of cross-lingual word embeddings is based on a fairly narrow set of mostly European language pairs. Our study examines the performance of cross-lingual alignment across a more diverse set of language pairs; controls for the effect of the corpus size on which the monolingual embedding spaces are trained; and studies the impact of spectral graph properties of the embedding space on alignment. Through our experiments on a more diverse set of language pairs, we find that performance in bilingual lexicon induction is generally poor in heterogeneous pairs, and that even using a gold or heuristically derived dictionary has little impact on performance for these pairs of languages. We also find that performance for these languages increases only slowly with corpus size. Finally, we find a moderate correlation between the isospectral difference of the source and target embeddings and the performance of bilingual lexicon induction. We infer that methods other than cross-lingual alignment may be more appropriate in the case of both low-resource languages and heterogeneous language pairs.
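
Supervised cross-lingual alignment in this literature is commonly solved with orthogonal Procrustes; the sketch below shows the closed-form solution, using random matrices as stand-ins for real embedding pairs from a seed dictionary.

```python
# Minimal sketch of orthogonal Procrustes alignment of two embedding spaces.
import numpy as np

def procrustes(X, Y):
    """Solve min_W ||XW - Y||_F s.t. W orthogonal (Schönemann, 1966)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 300))   # source-language vectors (seed pairs)
Y = rng.standard_normal((1000, 300))   # aligned target-language vectors
W = procrustes(X, Y)
print(np.allclose(W @ W.T, np.eye(300)))  # True: W is orthogonal
```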

Books on the topic "Low resource language":

1

Carver, Tina Kasloff. A writing book: English in everyday life: a teacher's resource book. 2nd ed. Upper Saddle River, NJ: Prentice Hall Regents, 1998.

2

Shefner, Jon. The illusion of civil society: Democratization and community mobilization in low-income Mexico. University Park, Pa: Pennsylvania State University Press, 2008.

3

Canadian Legal Information Centre. Plain Language Centre. Plain Language Resource Centre catalogue. [Ottawa: Multiculturalism and Citizenship Canada], 1992.

4

Hahn, Walther von, and Cristina Vertan. Multilingual processing in eastern and southern EU languages: Low-resourced technologies and translation. Newcastle upon Tyne, UK: Cambridge Scholars Publishing, 2012.

5

Corson, David. Language policy in schools: A resource for teachers and administrators. Mahwah, NJ: Lawrence Erlbaum Associates, 1999.

6

Library, Canada Multiculturalism and Citizenship Canada Departmental. Plain Language Resource Centre catalogue = Catalogue du centre de ressources sur le langage clair et simple. [Ottawa: Multiculturalism and Citizenship Canada], 1992.

7

Library, Canada Multiculturalism and Citizenship Canada Departmental. Plain Language Resource Centre catalogue = Catalogue du centre de ressources sur le langage clair et simple. [Ottawa: Multiculturalism and Citizenship Canada], 1992.

8

Megías, José Manuel Lucía. Literatura románica en Internet: Los textos. Madrid: Editorial Castalia, 2002.

9

Franklin, Kristine L. El aullido de los monos. New York: Atheneum, 1994.

10

United States. Congress. Senate. Committee on Labor and Human Resources. Subcommittee on Education, Arts, and Humanities. Foreign Language Competence for the Future Act of 1989: Hearing before the Subcommittee on Education, Arts, and Humanities of the Committee on Labor and Human Resources, United States Senate, One Hundred First Congress, first session, on S. 1690, to establish programs to improve foreign language instruction, and for other purposes, S. 1540, to establish a critical languages and area studies program, October 31, 1989. Washington: U.S. G.P.O., 1990.


Book chapters on the topic "Low resource language":

1

Grießhaber, Daniel, Ngoc Thang Vu, and Johannes Maucher. "Low-Resource Text Classification Using Domain-Adversarial Learning." In Statistical Language and Speech Processing, 129–39. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-030-00810-9_12.

2

Juan, Sarah Samson, Muhamad Fikri Che Ismail, Hamimah Ujir, and Irwandi Hipiny. "Language Modelling for a Low-Resource Language in Sarawak, Malaysia." In Lecture Notes in Electrical Engineering, 147–58. Singapore: Springer Singapore, 2019. http://dx.doi.org/10.1007/978-981-15-1289-6_14.

3

Zhu, ShaoLin, Xiao Li, YaTing Yang, Lei Wang, and ChengGang Mi. "Learning Bilingual Lexicon for Low-Resource Language Pairs." In Natural Language Processing and Chinese Computing, 760–70. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-73618-1_66.

4

Xu, Nuo, Yinqiao Li, Chen Xu, Yanyang Li, Bei Li, Tong Xiao, and Jingbo Zhu. "Analysis of Back-Translation Methods for Low-Resource Neural Machine Translation." In Natural Language Processing and Chinese Computing, 466–75. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-32236-6_42.

5

Hu, Yong, Heyan Huang, Tian Lan, Xiaochi Wei, Yuxiang Nie, Jiarui Qi, Liner Yang, and Xian-Ling Mao. "Multi-task Learning for Low-Resource Second Language Acquisition Modeling." In Web and Big Data, 603–11. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-60259-8_44.

6

Srivastava, Brij Mohan Lal, and Manish Shrivastava. "Articulatory Gesture Rich Representation Learning of Phonological Units in Low Resource Settings." In Statistical Language and Speech Processing, 80–95. Cham: Springer International Publishing, 2016. http://dx.doi.org/10.1007/978-3-319-45925-7_7.

7

Juan, Sarah Samson, Laurent Besacier, Benjamin Lecouteux, and Tien-Ping Tan. "Merging of Native and Non-native Speech for Low-resource Accented ASR." In Statistical Language and Speech Processing, 255–66. Cham: Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-25789-1_24.

8

James, Cynthia C., and Kean Wah Lee. "Narrative Inquiry into Teacher Identity, Context, and Technology Integration in Low-Resource ESL Classrooms." In Language Learning with Technology, 65–76. Singapore: Springer Singapore, 2021. http://dx.doi.org/10.1007/978-981-16-2697-5_5.

9

Wu, Jing, Hongxu Hou, Zhipeng Shen, Jian Du, and Jinting Li. "Adapting Attention-Based Neural Network to Low-Resource Mongolian-Chinese Machine Translation." In Natural Language Understanding and Intelligent Applications, 470–80. Cham: Springer International Publishing, 2016. http://dx.doi.org/10.1007/978-3-319-50496-4_39.

10

Kunchukuttan, Anoop, and Pushpak Bhattacharyya. "A Case Study on Indic Language Translation." In Machine Translation and Transliteration Involving Related and Low-resource Languages, 93–112. Boca Raton: Chapman and Hall/CRC, 2021. http://dx.doi.org/10.1201/9781003096771-7.


Conference papers on the topic "Low resource language":

1

Gandhe, Ankur, Florian Metze, and Ian Lane. "Neural network language models for low resource languages." In Interspeech 2014. ISCA: ISCA, 2014. http://dx.doi.org/10.21437/interspeech.2014-560.

2

Feng, Xiaocheng, Xiachong Feng, Bing Qin, Zhangyin Feng, and Ting Liu. "Improving Low Resource Named Entity Recognition using Cross-lingual Knowledge Transfer." In Twenty-Seventh International Joint Conference on Artificial Intelligence {IJCAI-18}. California: International Joint Conferences on Artificial Intelligence Organization, 2018. http://dx.doi.org/10.24963/ijcai.2018/566.

Abstract:
Neural networks have been widely used for high-resource language (e.g. English) named entity recognition (NER) and have shown state-of-the-art results. However, for low-resource languages, such as Dutch and Spanish, taggers tend to have lower performance due to the limitation of resources and lack of annotated data. To narrow this gap, we propose three novel strategies to enrich the semantic representations of low-resource languages: we first develop neural networks to improve low-resource word representations by knowledge transfer from a high-resource language using bilingual lexicons. Further, a lexicon extension strategy is designed to address the out-of-lexicon problem by automatically learning semantic projections. Thirdly, we regard word-level entity type distribution features as external language-independent knowledge and incorporate them into our neural architecture. Experiments on two low-resource languages (Dutch and Spanish) demonstrate the effectiveness of these additional semantic representations (average 4.8% improvement). Moreover, on the Chinese OntoNotes 4.0 dataset, our approach achieved an F-score of 83.07% with a 2.91% absolute gain compared to the state-of-the-art results.
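
The word-level entity type distribution feature mentioned above can be computed directly from annotated data; a minimal sketch with invented counts follows.

```python
# Minimal sketch: for each word, the empirical distribution of entity tags
# it received in an annotated corpus, usable as a language-independent feature.
from collections import Counter, defaultdict

def type_distributions(tagged_corpus):
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    dists = {}
    for word, ctr in counts.items():
        total = sum(ctr.values())
        dists[word] = {tag: round(n / total, 3) for tag, n in ctr.items()}
    return dists

corpus = [("Amsterdam", "LOC"), ("Amsterdam", "ORG"),
          ("Amsterdam", "LOC"), ("fiets", "O")]
print(type_distributions(corpus))
# {'Amsterdam': {'LOC': 0.667, 'ORG': 0.333}, 'fiets': {'O': 1.0}}
```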
3

Khemchandani, Yash, Sarvesh Mehtani, Vaidehi Patil, Abhijeet Awasthi, Partha Talukdar, and Sunita Sarawagi. "Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study." In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics, 2021. http://dx.doi.org/10.18653/v1/2021.acl-long.105.

4

Joshi, Ishani, Purvi Koringa, and Suman Mitra. "Word Embeddings in Low Resource Gujarati Language." In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). IEEE, 2019. http://dx.doi.org/10.1109/icdarw.2019.40090.

5

Wong, Tak-sum, and John Lee. "Character Profiling in Low-Resource Language Documents." In ADCS '19: Australasian Document Computing Symposium. New York, NY, USA: ACM, 2019. http://dx.doi.org/10.1145/3372124.3372129.

6

Qi, Zhaodi, Yong Ma, and Mingliang Gu. "A Study on Low-resource Language Identification." In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2019. http://dx.doi.org/10.1109/apsipaasc47483.2019.9023075.

7

Kumar, Sachin, Antonios Anastasopoulos, Shuly Wintner, and Yulia Tsvetkov. "Machine Translation into Low-resource Language Varieties." In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Stroudsburg, PA, USA: Association for Computational Linguistics, 2021. http://dx.doi.org/10.18653/v1/2021.acl-short.16.

8

Sailor, Hardik, Ankur Patil, and Hemant Patil. "Advances in Low Resource ASR: A Deep Learning Perspective." In The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages. ISCA: ISCA, 2018. http://dx.doi.org/10.21437/sltu.2018-4.

9

Liu, Chunxi, Matthew Wiesner, Shinji Watanabe, Craig Harman, Jan Trmal, Najim Dehak, and Sanjeev Khudanpur. "Low-Resource Contextual Topic Identification on Speech." In 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018. http://dx.doi.org/10.1109/slt.2018.8639544.

10

Beloucif, Meriem, Ana Valeria Gonzalez, Marcel Bollmann, and Anders Søgaard. "Naive Regularizers for Low-Resource Neural Machine Translation." In Recent Advances in Natural Language Processing. Incoma Ltd., Shoumen, Bulgaria, 2019. http://dx.doi.org/10.26615/978-954-452-056-4_013.

