Journal articles on the topic "Low resource language"

To see other types of publications on this topic, follow the link: Low resource language.

Format your citation in APA, MLA, Chicago, Harvard, and other styles

Choose a source type:

Consult the top 50 journal articles for your research on the topic "Low resource language".

Next to every work in the reference list there is an "Add to bibliography" button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scholarly publication in .pdf format and read its abstract online, if the corresponding data are present in the publication's metadata.

Browse journal articles across many disciplines and compile your bibliography correctly.

1

Lin, Donghui, Yohei Murakami, and Toru Ishida. "Towards Language Service Creation and Customization for Low-Resource Languages." Information 11, no. 2 (January 27, 2020): 67. http://dx.doi.org/10.3390/info11020067.

Abstract:
The most challenging issue with low-resource languages is the difficulty of obtaining enough language resources. In this paper, we propose a language service framework for low-resource languages that enables the automatic creation and customization of new resources from existing ones. To achieve this goal, we first introduce a service-oriented language infrastructure, the Language Grid; it realizes new language services by supporting the sharing and combining of language resources. We then show the applicability of the Language Grid to low-resource languages. Furthermore, we describe how we can now realize the automation and customization of language services. Finally, we illustrate our design concept by detailing a case study of automating and customizing bilingual dictionary induction for low-resource Turkic languages and Indonesian ethnic languages.
2

Zhou, Shuyan, Shruti Rijhwani, John Wieting, Jaime Carbonell, and Graham Neubig. "Improving Candidate Generation for Low-resource Cross-lingual Entity Linking." Transactions of the Association for Computational Linguistics 8 (July 2020): 109–24. http://dx.doi.org/10.1162/tacl_a_00303.

Abstract:
Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on resources from Wikipedia have proven successful in the realm of relatively high-resource languages, but these do not extend well to low-resource languages with few, if any, Wikipedia pages. Recently, transfer learning methods have been shown to reduce the demand for resources in the low-resource languages by utilizing resources in closely related languages, but the performance still lags far behind their high-resource counterparts. In this paper, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios. The methods are simple, but effective: We experiment with our approach on seven XEL datasets and find that they yield an average gain of 16.9% in Top-30 gold candidate recall, compared with state-of-the-art baselines. Our improved model also yields an average gain of 7.9% in in-KB accuracy of end-to-end XEL.
3

Mati, Diellza Nagavci, Mentor Hamiti, Arsim Susuri, Besnik Selimi, and Jaumin Ajdari. "Building Dictionaries for Low Resource Languages: Challenges of Unsupervised Learning." Annals of Emerging Technologies in Computing 5, no. 3 (July 1, 2021): 52–58. http://dx.doi.org/10.33166/aetic.2021.03.005.

Abstract:
The development of natural language processing resources for Albanian has grown steadily in recent years. This paper presents research on unsupervised learning: the challenges associated with building a dictionary for the Albanian language and creating part-of-speech tagging models. Most languages have their own dictionaries, but low-resource languages suffer from a lack of such resources, which natural language processing could otherwise use to share information and services with users and whole communities. The experimentation corpora for the Albanian language include 250K sentences from different disciplines, together with a proposed part-of-speech tag set that can adequately represent the underlying linguistic phenomena. The purpose of this paper is to contribute to the development of Albanian language resources. The results of experiments with the Albanian language corpus revealed that its use of articles and pronouns resembles that of higher-resource languages. According to this study, total expected frequency as a means of correctly tagging words has proven effective for populating the Albanian language dictionary.
4

Shikali, Casper S., and Refuoe Mokhosi. "Enhancing African low-resource languages: Swahili data for language modelling." Data in Brief 31 (August 2020): 105951. http://dx.doi.org/10.1016/j.dib.2020.105951.

5

Chen, Siqi, Yijie Pei, Zunwang Ke, and Wushour Silamu. "Low-Resource Named Entity Recognition via the Pre-Training Model." Symmetry 13, no. 5 (May 2, 2021): 786. http://dx.doi.org/10.3390/sym13050786.

Abstract:
Named entity recognition (NER) is an important task in the processing of natural language, which needs to determine entity boundaries and classify them into pre-defined categories. For low-resource languages, most state-of-the-art systems require tens of thousands of annotated sentences to obtain high performance. However, there is minimal annotated data available for the Uyghur and Hungarian (UH languages) NER tasks. There are also specificities in each task: differences in words and word order across languages make it a challenging problem. In this paper, we present an effective solution for providing a meaningful and easy-to-use feature extractor for named entity recognition tasks: fine-tuning the pre-trained language model. We propose a fine-tuning method for a low-resource language model, which constructs a fine-tuning dataset through data augmentation; the dataset of a high-resource language is then added; and finally the cross-language pre-trained model is fine-tuned on this dataset. In addition, we propose an attention-based fine-tuning strategy that uses symmetry to better select relevant semantic and syntactic information from pre-trained language models and applies these symmetry features to named entity recognition tasks. We evaluated our approach on Uyghur and Hungarian datasets, where it showed strong performance compared to several strong baselines. We close with an overview of the available resources for named entity recognition and some of the open research questions.
6

Rijhwani, Shruti, Jiateng Xie, Graham Neubig, and Jaime Carbonell. "Zero-Shot Neural Transfer for Cross-Lingual Entity Linking." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6924–31. http://dx.doi.org/10.1609/aaai.v33i01.33016924.

Abstract:
Cross-lingual entity linking maps an entity mention in a source language to its corresponding entry in a structured knowledge base that is in a different (target) language. While previous work relies heavily on bilingual lexical resources to bridge the gap between the source and the target languages, these resources are scarce or unavailable for many low-resource languages. To address this problem, we investigate zero-shot cross-lingual entity linking, in which we assume no bilingual lexical resources are available in the source low-resource language. Specifically, we propose pivot-based entity linking, which leverages information from a high-resource "pivot" language to train character-level neural entity linking models that are transferred to the source low-resource language in a zero-shot manner. With experiments on 9 low-resource languages and transfer through a total of 54 languages, we show that our proposed pivot-based framework improves entity linking accuracy by 17% (absolute) on average over the baseline systems for the zero-shot scenario. Further, we also investigate the use of language-universal phonological representations, which improves average accuracy (absolute) by 36% when transferring between languages that use different scripts.
7

Mi, Chenggang, Shaolin Zhu, and Rui Nie. "Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion." Computational Intelligence and Neuroscience 2021 (April 8, 2021): 1–9. http://dx.doi.org/10.1155/2021/9975078.

Abstract:
Loanword identification has been studied in recent years to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation and cross-lingual information retrieval. However, recent studies on this topic usually focus on high-resource languages (such as Chinese, English, and Russian); for low-resource languages, such as Uyghur and Mongolian, loanword identification tends to perform worse due to limited resources and a lack of annotated data. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource language loanword identification; then, a loanword identification model based on a log-linear RNN is introduced to improve the performance of low-resource loanword identification by incorporating features such as word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) into one model. Experimental results on loanword identification in Uyghur (in this study, we mainly focus on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) showed that our proposed method achieves the best performance compared with several strong baseline systems.
8

Lee, Chanhee, Kisu Yang, Taesun Whang, Chanjun Park, Andrew Matteson, and Heuiseok Lim. "Exploring the Data Efficiency of Cross-Lingual Post-Training in Pretrained Language Models." Applied Sciences 11, no. 5 (February 24, 2021): 1974. http://dx.doi.org/10.3390/app11051974.

Abstract:
Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Even though language modeling is unsupervised and thus collecting data for it is relatively less expensive, it is still a challenging process for languages with limited resources. This results in great technological disparity between high- and low-resource languages for numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data efficient training of pretrained language models. It is achieved by formulating language modeling of low-resource languages as a domain adaptation task using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at a sequence level. To evaluate our method, we post-train a RoBERTa model pretrained in English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves the data efficiency by a factor of up to 32 compared to monolingual training.
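A minimal sketch, assuming a generic Hugging Face masked LM, may help readers picture the selective parameter reuse described above: the Transformer body is frozen and only the embedding layers stay trainable. This illustrates the general flavour of cross-lingual post-training; the paper's implicit translation layers and staged training schedule are not reproduced here.

```python
# Sketch: freeze the pretrained Transformer body, train only embeddings.
# The model name is a placeholder; the paper's exact XPT recipe is omitted.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("roberta-base")

for name, param in model.named_parameters():
    # Train only language-specific embedding parameters; keep the rest frozen.
    param.requires_grad = "embed" in name

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors, e.g. {trainable[:2]}")
```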
9

Andrabi, Syed Abdul Basit, et al. "A Review of Machine Translation for South Asian Low Resource Languages." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12, no. 5 (April 10, 2021): 1134–47. http://dx.doi.org/10.17762/turcomat.v12i5.1777.

Abstract:
Machine translation is an application of natural language processing. Humans use natural languages to communicate with one another, whereas programming languages are used for communication between humans and computers. NLP is the field that covers a broad set of techniques for the analysis, manipulation, and automatic generation of human (natural) languages with the help of computers. In the present information age, it is essential to give people access to information for their development, and it is equally important to remove the language barriers between different divisions of society. The area of NLP strives to close this gap through machine translation, in which one natural language is transformed into another with the aid of computers. The first few years of this area were dedicated to the development of rule-based systems; later, due to the increase in computational power, there was a transition towards statistical machine translation. The motive of machine translation is that the meaning of the translated text should be preserved during translation. This research paper aims to analyse the machine translation approaches used for resource-poor languages and to determine the needs and challenges the researchers face. This paper also reviews the machine translation systems that are available for resource-poor languages.
10

Zhang, Mozhi, Yoshinari Fujinuma, and Jordan Boyd-Graber. "Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 9547–54. http://dx.doi.org/10.1609/aaai.v34i05.6500.

Abstract:
Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (CACO) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. Experiments confirm that character-level knowledge transfer is more data-efficient than word-level transfer between related languages.
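As a rough illustration of the character-level transfer idea above, the sketch below jointly parameterizes a shared character-based embedder and a word-based classifier in PyTorch. The layer sizes and the byte-level "alphabet" are assumptions for illustration, not the CACO implementation.

```python
# Sketch: a shared character embedder builds word vectors from spellings,
# so similarly spelled words in related languages get similar vectors.
import torch
import torch.nn as nn

class CharWordEmbedder(nn.Module):
    def __init__(self, n_chars=256, char_dim=32, word_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.rnn = nn.LSTM(char_dim, word_dim // 2,
                           bidirectional=True, batch_first=True)

    def forward(self, char_ids):              # (n_words, max_word_len)
        h, _ = self.rnn(self.char_emb(char_ids))
        return h.mean(dim=1)                  # one vector per word

class DocClassifier(nn.Module):
    def __init__(self, n_classes=2, word_dim=64):
        super().__init__()
        self.embedder = CharWordEmbedder(word_dim=word_dim)
        self.out = nn.Linear(word_dim, n_classes)

    def forward(self, doc_char_ids):
        words = self.embedder(doc_char_ids)
        return self.out(words.mean(dim=0, keepdim=True))  # doc = word average

def encode(word, max_len=12):
    # Byte-level "characters" keep the example self-contained.
    ids = list(word.encode("utf-8"))[:max_len]
    return torch.tensor([ids + [0] * (max_len - len(ids))])

model = DocClassifier()
doc = torch.cat([encode(w) for w in "ein kleines beispiel".split()])
print(model(doc).shape)  # torch.Size([1, 2])
```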
11

ZENNAKI, O., N. SEMMAR, and L. BESACIER. "A neural approach for inducing multilingual resources and natural language processing tools for low-resource languages." Natural Language Engineering 25, no. 1 (August 6, 2018): 43–67. http://dx.doi.org/10.1017/s1351324918000293.

Abstract:
This work focuses on the rapid development of linguistic annotation tools for low-resource languages (languages that have no labeled training data). We experiment with several cross-lingual annotation projection methods using recurrent neural network (RNN) models. The distinctive feature of our approach is that our multilingual word representation requires only a parallel corpus between source and target languages. More precisely, our approach has the following characteristics: (a) it does not use word alignment information, (b) it does not assume any knowledge about target languages (one requirement is that the two languages (source and target) are not too syntactically divergent), which makes it applicable to a wide range of low-resource languages, and (c) it provides authentic multilingual taggers (one tagger for N languages). We investigate both uni- and bidirectional RNN models and propose a method to include external information (for instance, low-level information from part-of-speech tags) in the RNN to train higher-level taggers (for instance, Super Sense taggers). We demonstrate the validity and genericity of our model by using parallel corpora (obtained by manual or automatic translation). Our experiments are conducted to induce cross-lingual part-of-speech and Super Sense taggers. We also use our approach in a weakly supervised context, and it shows an excellent potential for very low-resource settings (less than 1k training utterances).
12

Murakami, Yohei. "Indonesia Language Sphere: an ecosystem for dictionary development for low-resource languages." Journal of Physics: Conference Series 1192 (March 2019): 012001. http://dx.doi.org/10.1088/1742-6596/1192/1/012001.

13

Grönroos, Stig-Arne, Kristiina Jokinen, Katri Hiovain, Mikko Kurimo, and Sami Virpioja. "Low-Resource Active Learning of North Sámi Morphological Segmentation." Septentrio Conference Series, no. 2 (June 17, 2015): 20. http://dx.doi.org/10.7557/5.3465.

Abstract:
Many Uralic languages have a rich morphological structure, but lack tools of morphological analysis needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation of the North Sámi language with a large unannotated corpus and a small amount of human-annotated word forms selected using an active learning approach. For statistical learning, we use the semi-supervised Morfessor Baseline and FlatCat methods. After annotating 237 words with our active learning setup, we improve morph boundary recall by over 20% with no loss of precision.
14

Adjeisah, Michael, Guohua Liu, Douglas Omwenga Nyabuga, Richard Nuetey Nortey, and Jinling Song. "Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation." Computational Intelligence and Neuroscience 2021 (April 11, 2021): 1–10. http://dx.doi.org/10.1155/2021/6682385.

Abstract:
Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains enigmatic. This research contributes to the domain with a low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often difficult to learn what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic-parallel corpus obtained by translating a monolingual corpus from the target language based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we extensively use three different sentence-level similarity metrics after round-trip translation. Experimental results on a diverse amount of available parallel corpora demonstrate that injecting a pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach exhibits substantial gains in BLEU and TER scores.
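The Mahalanobis filtering step admits a compact sketch: each sentence pair is represented by one joint feature vector (random placeholders below; in practice these would come from a bilingual encoder), and pairs far from the bulk of the data are discarded. The 80th-percentile threshold is an assumed policy, not the paper's setting.

```python
# Sketch: score sentence pairs by squared Mahalanobis distance, keep the
# closest ones. Feature vectors here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
pairs = rng.normal(size=(1000, 16))        # placeholder pair features

mu = pairs.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(pairs, rowvar=False))

diff = pairs - mu
sq_mahalanobis = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

keep = sq_mahalanobis < np.percentile(sq_mahalanobis, 80)
print(f"kept {keep.sum()} of {len(pairs)} synthetic pairs")
```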
15

Grönroos, Stig-Arne, Katri Hiovain, Peter Smit, Ilona Rauhala, Kristiina Jokinen, Mikko Kurimo, and Sami Virpioja. "Low-Resource Active Learning of Morphological Segmentation." Northern European Journal of Language Technology 4 (March 13, 2016): 47–72. http://dx.doi.org/10.3384/nejlt.2000-1533.1644.

Abstract:
Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word initial and final substrings. For North Sámi we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.
16

Park, Chanjun, Yeongwook Yang, Kinam Park, and Heuiseok Lim. "Decoding Strategies for Improving Low-Resource Machine Translation." Electronics 9, no. 10 (September 24, 2020): 1562. http://dx.doi.org/10.3390/electronics9101562.

Abstract:
Pre-processing and post-processing are significant aspects of natural language processing (NLP) application software. Pre-processing in neural machine translation (NMT) includes subword tokenization to alleviate the problem of unknown words, parallel corpus filtering that only filters data suitable for training, and data augmentation to ensure that the corpus contains sufficient content. Post-processing includes automatic post editing and the application of various strategies during decoding in the translation process. Most recent NLP research is based on the Pretrain-Finetuning Approach (PFA). However, when small and medium-sized organizations with insufficient hardware attempt to provide NLP services, throughput and memory problems often occur. These difficulties increase when utilizing PFA to process low-resource languages, as PFA requires large amounts of data, and the data for low-resource languages are often insufficient. Utilizing the current research premise that NMT model performance can be enhanced through various pre-processing and post-processing strategies without changing the model, we applied various decoding strategies to Korean–English NMT, which relies on a low-resource language pair. Through comparative experiments, we proved that translation performance could be enhanced without changes to the model. We experimentally examined how performance changed in response to beam size changes and n-gram blocking, and whether performance was enhanced when a length penalty was applied. The results showed that various decoding strategies enhance the performance and compare well with previous Korean–English NMT approaches. Therefore, the proposed methodology can improve the performance of NMT models, without the use of PFA; this presents a new perspective for improving machine translation performance.
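For readers who want to see what such decoding-time knobs look like in practice, the sketch below sets beam size, n-gram blocking, and a length penalty through the Hugging Face generate() API as a stand-in for the paper's NMT system; the Korean-English checkpoint name is an assumption used purely for illustration.

```python
# Sketch: decoding strategies are runtime choices that leave the trained
# model untouched. Model name is an assumed public checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "Helsinki-NLP/opus-mt-ko-en"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("안녕하세요", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,              # beam size
    no_repeat_ngram_size=3,   # n-gram blocking
    length_penalty=1.2,       # values > 1 favour longer hypotheses
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```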
17

Zhu, ShaoLin, Xiao Li, YaTing Yang, Lei Wang, and ChengGang Mi. "A Novel Deep Learning Method for Obtaining Bilingual Corpus from Multilingual Website." Mathematical Problems in Engineering 2019 (January 10, 2019): 1–7. http://dx.doi.org/10.1155/2019/7495436.

Abstract:
Machine translation needs a large number of parallel sentence pairs to ensure good translation performance. However, the lack of parallel corpora heavily limits machine translation for low-resource language pairs. We propose a novel method that combines continuous word embeddings with deep learning to obtain parallel sentences. Since parallel sentences are invaluable for low-resource language pairs, we introduce cross-lingual semantic representation to induce bilingual signals. Our experiments show that we can achieve promising results even when external resources for low-resource languages are lacking. Finally, we construct a state-of-the-art machine translation system for a low-resource language pair.
18

Chen, Xilun, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. "Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification." Transactions of the Association for Computational Linguistics 6 (December 2018): 557–70. http://dx.doi.org/10.1162/tacl_a_00039.

Abstract:
In recent years great success has been achieved in sentiment classification for English, thanks in part to the availability of copious annotated resources. Unfortunately, most languages do not enjoy such an abundance of labeled data. To tackle the sentiment classification problem in low-resource languages without adequate annotated data, we propose an Adversarial Deep Averaging Network (ADAN) to transfer the knowledge learned from labeled data on a resource-rich source language to low-resource languages where only unlabeled data exist. ADAN has two discriminative branches: a sentiment classifier and an adversarial language discriminator. Both branches take input from a shared feature extractor to learn hidden representations that are simultaneously indicative for the classification task and invariant across languages. Experiments on Chinese and Arabic sentiment classification demonstrate that ADAN significantly outperforms state-of-the-art systems.
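A simplified sketch of the two-branch layout may help: below, a shared feature extractor feeds both a sentiment classifier and a language discriminator, with a gradient-reversal function standing in for the paper's adversarial training scheme. All sizes are assumptions, not the released ADAN implementation.

```python
# Sketch: reversed gradients from the language branch push the shared
# extractor towards language-invariant features.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad  # reverse gradients flowing back into the extractor

feature = nn.Sequential(nn.Linear(300, 128), nn.ReLU())  # shared extractor
sentiment = nn.Linear(128, 2)   # trained on labeled source-language data
language = nn.Linear(128, 2)    # adversarial branch: source vs. target

x = torch.randn(8, 300)         # averaged word embeddings (placeholder)
f = feature(x)
sent_logits = sentiment(f)                    # task loss uses these
lang_logits = language(GradReverse.apply(f))  # adversarial loss uses these
print(sent_logits.shape, lang_logits.shape)
```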
19

Haque, Rejwanul, Mohammed Hasanuzzaman, and Andy Way. "Terminology Translation in Low-Resource Scenarios." Information 10, no. 9 (August 30, 2019): 273. http://dx.doi.org/10.3390/info10090273.

Abstract:
Measuring term translation quality in machine translation (MT), which is usually done by domain experts, is a time-consuming and expensive task. In fact, this is unimaginable in an industrial setting where customised MT systems often need to be updated for many reasons (e.g., availability of new training data, leading MT techniques). To the best of our knowledge, as of yet, there is no publicly-available solution to evaluate terminology translation in MT automatically. Hence, there is a genuine need to have a faster and less-expensive solution to this problem, which could help end-users to identify term translation problems in MT instantly. This study presents a faster and less expensive strategy for evaluating terminology translation in MT. High correlations of our evaluation results with human judgements demonstrate the effectiveness of the proposed solution. The paper also introduces a classification framework, TermCat, that can automatically classify term translation-related errors and expose specific problems in relation to terminology translation in MT. We carried out our experiments with a low-resource language pair, English–Hindi, and found that our classifier, whose accuracy varies across the translation directions, error classes, the morphological nature of the languages, and MT models, generally performs competently in the terminology translation classification task.
20

Yi, Jiangyan, Jianhua Tao, Zhengqi Wen, and Ye Bai. "Language-Adversarial Transfer Learning for Low-Resource Speech Recognition." IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, no. 3 (March 2019): 621–30. http://dx.doi.org/10.1109/taslp.2018.2889606.

21

Li, Zhe, Xiuhong Li, Jiabao Sheng, and Wushour Slamu. "AgglutiFiT: Efficient Low-Resource Agglutinative Language Model Fine-Tuning." IEEE Access 8 (2020): 148489–99. http://dx.doi.org/10.1109/access.2020.3015854.

22

Xu, Ping, and P. Fung. "Cross-Lingual Language Modeling for Low-Resource Speech Recognition." IEEE Transactions on Audio, Speech, and Language Processing 21, no. 6 (June 2013): 1134–44. http://dx.doi.org/10.1109/tasl.2013.2244088.

23

Ortega, John E., Richard Castro Mamani, and Kyunghyun Cho. "Neural machine translation with a polysynthetic low resource language." Machine Translation 34, no. 4 (December 2020): 325–46. http://dx.doi.org/10.1007/s10590-020-09255-9.

24

Carl, Michael, Maite Melero, Toni Badia, Vincent Vandeghinste, Peter Dirix, Ineke Schuurman, Stella Markantonatou, Sokratis Sofianopoulos, Marina Vassiliou, and Olga Yannoutsou. "METIS-II: low resource machine translation." Machine Translation 22, no. 1-2 (March 2008): 67–99. http://dx.doi.org/10.1007/s10590-008-9048-z.

25

Ranasinghe, Tharindu, and Marcos Zampieri. "An Evaluation of Multilingual Offensive Language Identification Methods for the Languages of India." Information 12, no. 8 (July 29, 2021): 306. http://dx.doi.org/10.3390/info12080306.

Abstract:
The pervasiveness of offensive content in social media has become an important reason for concern for online platforms. With the aim of improving online safety, a large number of studies applying computational models to identify such content have been published in the last few years, with promising results. The majority of these studies, however, deal with high-resource languages such as English due to the availability of datasets in these languages. Recent work has addressed offensive language identification from a low-resource perspective, exploring data augmentation strategies and trying to take advantage of existing multilingual pretrained models to cope with data scarcity in low-resource scenarios. In this work, we revisit the problem of low-resource offensive language identification by evaluating the performance of multilingual transformers in offensive language identification for languages spoken in India. We investigate languages from different families such as Indo-Aryan (e.g., Bengali, Hindi, and Urdu) and Dravidian (e.g., Tamil, Malayalam, and Kannada), creating important new technology for these languages. The results show that multilingual offensive language identification models perform better than monolingual models and that cross-lingual transformers show strong zero-shot and few-shot performance across languages.
26

Simpson, Edwin, Jonas Pfeiffer, and Iryna Gurevych. "Low Resource Sequence Tagging with Weak Labels." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 8862–69. http://dx.doi.org/10.1609/aaai.v34i05.6415.

Abstract:
Current methods for sequence tagging depend on large quantities of domain-specific training data, limiting their use in new, user-defined tasks with few or no annotations. While crowdsourcing can be a cheap source of labels, it often introduces errors that degrade the performance of models trained on such crowdsourced data. Another solution is to use transfer learning to tackle low resource sequence labelling, but current approaches rely heavily on similar high resource datasets in different languages. In this paper, we propose a domain adaptation method using Bayesian sequence combination to exploit pre-trained models and unreliable crowdsourced data that does not require high resource data in a different language. Our method boosts performance by learning the relationship between each labeller and the target task and trains a sequence labeller on the target domain with little or no gold-standard data. We apply our approach to labelling diagnostic classes in medical and educational case studies, showing that the model achieves strong performance through zero-shot transfer learning and is more effective than alternative ensemble methods. Using NER and information extraction tasks, we show how our approach can train a model directly from crowdsourced labels, outperforming pipeline approaches that first aggregate the crowdsourced data, then train on the aggregated labels.
27

Taghizadeh, Nasrin, and Hesham Faili. "Automatic Wordnet Development for Low-Resource Languages using Cross-Lingual WSD." Journal of Artificial Intelligence Research 56 (May 20, 2016): 61–87. http://dx.doi.org/10.1613/jair.4968.

Abstract:
Wordnets are an effective resource for natural language processing and information retrieval, especially for semantic processing and meaning related tasks. So far, wordnets have been constructed for many languages. However, the automatic development of wordnets for low-resource languages has not been well studied. In this paper, an Expectation-Maximization algorithm is used to create high quality and large scale wordnets for poor-resource languages. The proposed method benefits from possessing cross-lingual word sense disambiguation and develops a wordnet by only using a bi-lingual dictionary and a mono-lingual corpus. The proposed method has been executed with the Persian language and the resulting wordnet has been evaluated through several experiments. The results show that the induced wordnet has a precision score of 90% and a recall score of 35%.
28

Chiarcos, Christian, Ilya Khait, Émilie Pagé-Perron, Niko Schenk, Jayanth, Christian Fäth, Julius Steuer, William Mcgrath, and Jinyan Wang. "Annotating a Low-Resource Language with LLOD Technology: Sumerian Morphology and Syntax." Information 9, no. 11 (November 19, 2018): 290. http://dx.doi.org/10.3390/info9110290.

Abstract:
This paper describes work on the morphological and syntactic annotation of Sumerian cuneiform as a model for low resource languages in general. Cuneiform texts are invaluable sources for the study of history, languages, economy, and cultures of Ancient Mesopotamia and its surrounding regions. Assyriology, the discipline dedicated to their study, has vast research potential, but lacks the modern means for computational processing and analysis. Our project, Machine Translation and Automated Analysis of Cuneiform Languages, aims to fill this gap by bringing together corpus data, lexical data, linguistic annotations and object metadata. The project’s main goal is to build a pipeline for machine translation and annotation of Sumerian Ur III administrative texts. The rich and structured data is then to be made accessible in the form of (Linguistic) Linked Open Data (LLOD), which should open them to a larger research community. Our contribution is two-fold: in terms of language technology, our work represents the first attempt to develop an integrative infrastructure for the annotation of morphology and syntax on the basis of RDF technologies and LLOD resources. With respect to Assyriology, we work towards producing the first syntactically annotated corpus of Sumerian.
29

Paul, Michael, Andrew Finch, and Eiichiro Sumita. "How to Choose the Best Pivot Language for Automatic Translation of Low-Resource Languages." ACM Transactions on Asian Language Information Processing 12, no. 4 (October 2013): 1–17. http://dx.doi.org/10.1145/2505126.

30

Tahir, Bilal, and Muhammad Amir Mehmood. "Corpulyzer: A Novel Framework for Building Low Resource Language Corpora." IEEE Access 9 (2021): 8546–63. http://dx.doi.org/10.1109/access.2021.3049793.

31

Kadyan, Virender, Syed Shahnawazuddin, and Amitoj Singh. "Developing children’s speech recognition system for low resource Punjabi language." Applied Acoustics 178 (July 2021): 108002. http://dx.doi.org/10.1016/j.apacoust.2021.108002.

32

Rubino, Raphael, Benjamin Marie, Raj Dabre, Atsushi Fujita, Masao Utiyama, and Eiichiro Sumita. "Extremely low-resource neural machine translation for Asian languages." Machine Translation 34, no. 4 (December 2020): 347–82. http://dx.doi.org/10.1007/s10590-020-09258-6.

Abstract:
This paper presents a set of effective approaches to handle extremely low-resource language pairs for self-attention based neural machine translation (NMT) focusing on English and four Asian languages. Starting from an initial set of parallel sentences used to train bilingual baseline models, we introduce additional monolingual corpora and data processing techniques to improve translation quality. We describe a series of best practices and empirically validate the methods through an evaluation conducted on eight translation directions, based on state-of-the-art NMT approaches such as hyper-parameter search, data augmentation with forward and backward translation in combination with tags and noise, as well as joint multilingual training. Experiments show that the commonly used default architecture of self-attention NMT models does not reach the best results, validating previous work on the importance of hyper-parameter tuning. Additionally, empirical results indicate the amount of synthetic data required to efficiently increase the parameters of the models leading to the best translation quality measured by automatic metrics. We show that the best NMT models trained on large amount of tagged back-translations outperform three other synthetic data generation approaches. Finally, comparison with statistical machine translation (SMT) indicates that extremely low-resource NMT requires a large amount of synthetic parallel data obtained with back-translation in order to close the performance gap with the preceding SMT approach.
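The tagged back-translation ingredient mentioned above is easy to sketch: machine-generated source sentences are prefixed with a reserved tag so the NMT model can tell them apart from genuine parallel data. The tag token and helper below are illustrative assumptions, not the paper's tooling.

```python
# Sketch of tagged back-translation data preparation.
BT_TAG = "<BT>"

def merge_training_data(real_pairs, back_translated_pairs):
    """Both arguments are lists of (source, target) sentence strings."""
    tagged = [(f"{BT_TAG} {src}", tgt) for src, tgt in back_translated_pairs]
    return real_pairs + tagged

real = [("ich bin hier", "i am here")]
synthetic = [("wir sind da", "we are there")]  # source side was generated
print(merge_training_data(real, synthetic))
```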
33

Ranathunga, Surangika, and Isuru Udara Liyanage. "Sentiment Analysis of Sinhala News Comments." ACM Transactions on Asian and Low-Resource Language Information Processing 20, no. 4 (May 26, 2021): 1–23. http://dx.doi.org/10.1145/3445035.

Abstract:
Sinhala is a low-resource language, for which basic language and linguistic tools have not been properly defined. This affects the development of NLP-based end-user applications for Sinhala. Thus, when implementing NLP tools such as sentiment analyzers, we have to rely only on language-independent techniques. This article presents the use of such language-independent techniques in implementing a sentiment analysis system for Sinhala news comments. We demonstrate that for low-resource languages such as Sinhala, the use of recently introduced word embedding models as semantic features can compensate for the lack of well-developed language-specific linguistic or language resources, and text classification with acceptable accuracy is indeed possible using both traditional statistical classifiers and Deep Learning models. The developed classification models, a corpus of 8.9 million tokens extracted from Sinhala news articles and user comments, and Sinhala Word2Vec and fastText word embedding models are now available for public use; 9,048 news comments annotated with POSITIVE/NEGATIVE/NEUTRAL polarities have also been released.
34

Boito, Marcely Zanon, Aline Villavicencio, and Laurent Besacier. "Investigating alignment interpretability for low-resource NMT." Machine Translation 34, no. 4 (December 2020): 305–23. http://dx.doi.org/10.1007/s10590-020-09254-w.

35

Nasution, Arbi Haza, Yohei Murakami, and Toru Ishida. "Plan Optimization to Bilingual Dictionary Induction for Low-resource Language Families." ACM Transactions on Asian and Low-Resource Language Information Processing 20, no. 2 (March 10, 2021): 1–28. http://dx.doi.org/10.1145/3448215.

Abstract:
Creating a bilingual dictionary is the first crucial step in enriching low-resource languages. Especially for closely related ones, it has been shown that the constraint-based approach is useful for inducing bilingual lexicons from two bilingual dictionaries via the pivot language. However, if there are no available machine-readable dictionaries as input, we need to consider manual creation by bilingual native speakers. To reach the goal of comprehensively creating multiple bilingual dictionaries, even if we already have several existing machine-readable bilingual dictionaries, it is still difficult to determine the execution order of the constraint-based approach so as to reduce the total cost. Plan optimization is crucial in composing the order of bilingual dictionary creation with consideration of the methods and their costs. We formalize the plan optimization for creating bilingual dictionaries by utilizing a Markov Decision Process (MDP), with the goal of obtaining a more accurate estimation of the most feasible optimal plan with the least total cost before fully implementing the constraint-based bilingual lexicon induction. We model a prior beta distribution of bilingual lexicon induction precision, with language similarity and topology polysemy as its α and β parameters; it is further used to model the cost function and state transition probability. We estimated the cost of all investment plans as a baseline for evaluating the proposed MDP-based approach with total cost as an evaluation metric. After utilizing the posterior beta distribution from the first batch of experiments to construct the prior beta distribution in the second batch of experiments, the result shows a 61.5% cost reduction compared to the estimated all-investment plans and a 39.4% cost reduction compared to the estimated MDP optimal plan. The MDP-based proposal outperformed the baseline on total cost.
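The beta-distribution bookkeeping at the core of this estimation admits a small worked sketch: a Beta prior over induction precision is updated by the conjugate Beta-Binomial rule after manually checking induced dictionary entries. The prior values below are simple assumptions, not ones derived from language similarity and polysemy as in the paper.

```python
# Sketch: conjugate Beta-Binomial update of an assumed precision prior.
alpha, beta = 8.0, 2.0        # assumed prior, optimistic about precision

correct, wrong = 45, 15       # outcome of checking 60 induced entries
alpha_post = alpha + correct  # conjugate update
beta_post = beta + wrong

mean_precision = alpha_post / (alpha_post + beta_post)
print(f"posterior mean precision: {mean_precision:.3f}")  # 0.757
```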
36

Badenhorst, Jaco, and Febe de Wet. "The Usefulness of Imperfect Speech Data for ASR Development in Low-Resource Languages." Information 10, no. 9 (August 28, 2019): 268. http://dx.doi.org/10.3390/info10090268.

Abstract:
When the National Centre for Human Language Technology (NCHLT) Speech corpus was released, it created various opportunities for speech technology development in the 11 official, but critically under-resourced, languages of South Africa. Since then, the substantial improvements in acoustic modeling that deep architectures achieved for well-resourced languages ushered in a new data requirement: their development requires hundreds of hours of speech. A suitable strategy for the enlargement of speech resources for the South African languages is therefore required. The first possibility was to look for data that has already been collected but has not been included in an existing corpus. Additional data was collected during the NCHLT project that was not included in the official corpus: it only contains a curated, but limited subset of the data. In this paper, we first analyze the additional resources that could be harvested from the auxiliary NCHLT data. We also measure the effect of this data on acoustic modeling. The analysis incorporates recent factorized time-delay neural networks (TDNN-F). These models significantly reduce phone error rates for all languages. In addition, data augmentation and cross-corpus validation experiments for a number of the datasets illustrate the utility of the auxiliary NCHLT data.
37

Yu, Chongchong, Yunbing Chen, Yueqiao Li, Meng Kang, Shixuan Xu, and Xueer Liu. "Cross-Language End-to-End Speech Recognition Research Based on Transfer Learning for the Low-Resource Tujia Language." Symmetry 11, no. 2 (February 2, 2019): 179. http://dx.doi.org/10.3390/sym11020179.

Abstract:
To rescue and preserve an endangered language, this paper studied an end-to-end speech recognition model based on sample transfer learning for the low-resource Tujia language. From the perspective of the Tujia language international phonetic alphabet (IPA) label layer, using a Chinese corpus as an extension of the Tujia language can effectively solve the problem of an insufficient corpus in the Tujia language, constructing a cross-language corpus and an IPA dictionary that is unified between the Chinese and Tujia languages. The convolutional neural network (CNN) and bi-directional long short-term memory (BiLSTM) network were used to extract the cross-language acoustic features and train shared hidden layer weights for the Tujia language and Chinese phonetic corpus. In addition, the automatic speech recognition function of the Tujia language was realized using the end-to-end method that consists of symmetric encoding and decoding. Furthermore, transfer learning was used to establish the model of the cross-language end-to-end Tujia language recognition system. The experimental results showed that the recognition error rate of the proposed model is 46.19%, which is 2.11% lower than that of the model that only used the Tujia language data for training. Therefore, this approach is feasible and effective.
38

Kuriyozov, Elmurod, and Sanatbek Matlatipov. "Building a New Sentiment Analysis Dataset for Uzbek Language and Creating Baseline Models." Proceedings 21, no. 1 (August 2, 2019): 37. http://dx.doi.org/10.3390/proceedings2019021037.

Abstract:
Making natural language processing technologies available for low-resource languages is an important goal to improve access to technology in their communities of speakers. In this paper, we provide the first annotated corpora for polarity classification for the Uzbek language. Our methodology considers collecting a medium-size manually annotated dataset and a larger-size dataset automatically translated from existing resources. Then, we use these datasets to train sentiment analysis models for the Uzbek language, using both traditional machine learning techniques and recent deep learning models.
39

Maruyama, Takumi, and Kazuhide Yamamoto. "Extremely Low-Resource Text Simplification with Pre-trained Transformer Language Model." International Journal of Asian Language Processing 30, no. 01 (March 2020): 2050001. http://dx.doi.org/10.1142/s2717554520500010.

Abstract:
Inspired by the machine translation task, recent text simplification approaches regard the task as monolingual text-to-text generation, and neural machine translation models have significantly improved the performance of simplification tasks. Although such models require a large-scale parallel corpus, corpora for text simplification are very few in number and smaller in size than those for machine translation. Therefore, we have attempted to facilitate the training of simplification rewritings using pre-training from a large-scale monolingual corpus such as Wikipedia articles. In addition, we propose a translation language model to seamlessly conduct fine-tuning of text simplification from the pre-training of the language model. The experimental results show that the translation language model substantially outperforms a state-of-the-art model under a low-resource setting. In addition, a pre-trained translation language model with only 3,000 supervised examples can achieve a performance comparable to that of the state-of-the-art model using 30,000 supervised examples.
40

Nisar, Shibli, and Muhammad Tariq. "Dialect recognition for low resource language using an adaptive filter bank." International Journal of Wavelets, Multiresolution and Information Processing 16, no. 04 (July 2018): 1850031. http://dx.doi.org/10.1142/s0219691318500315.

Abstract:
Dialect recognition of low-resource languages is the next stage in the technological advancement of speech recognition. Traditional methods for dialect recognition, such as mel frequency cepstral coefficients (MFCC) and the discrete wavelet transform (DWT), work well for high-resource languages; however, their performance is low when applied to low-resource languages. This paper presents a new approach to Pashto dialect recognition using an adaptive filter bank with MFCC and DWT. In this approach, features are extracted using the adaptive filter bank in MFCC and DWT, followed by classification through hidden Markov models (HMM), support vector machines (SVM), and K-nearest neighbors (KNN). The results obtained from the proposed method are very satisfactory, with an overall dialect recognition accuracy of [Formula: see text].
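A minimal sketch of the classical MFCC-plus-classifier pipeline referenced above is given below, using librosa features and a K-nearest-neighbors classifier. The file names are hypothetical, and librosa's standard mel filter bank stands in for the paper's adaptive filter bank, which is its actual contribution and is not reproduced here.

```python
# Sketch: utterance-level MFCC features followed by a 1-NN dialect classifier.
import librosa
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def mfcc_features(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)  # utterance-level summary vector

train_files = [("dialect_a_01.wav", 0), ("dialect_b_01.wav", 1)]  # hypothetical
X = np.stack([mfcc_features(f) for f, _ in train_files])
y = np.array([label for _, label in train_files])

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([mfcc_features("unknown.wav")]))
```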
41

Poncelet, Jakob, Vincent Renkens, and Hugo Van hamme. "Low resource end-to-end spoken language understanding with capsule networks." Computer Speech & Language 66 (March 2021): 101142. http://dx.doi.org/10.1016/j.csl.2020.101142.

42

Bhowmik, Kowshik, and Anca Ralescu. "Leveraging Vector Space Similarity for Learning Cross-Lingual Word Embeddings: A Systematic Review." Digital 1, no. 3 (July 1, 2021): 145–61. http://dx.doi.org/10.3390/digital1030011.

Abstract:
This article presents a systematic literature review on quantifying the proximity between independently trained monolingual word embedding spaces. A search was carried out in the broader context of inducing bilingual lexicons from cross-lingual word embeddings, especially for low-resource languages. The returned articles were then classified. Cross-lingual word embeddings have drawn the attention of researchers in the field of natural language processing (NLP). Although existing methods have yielded satisfactory results for resource-rich languages and languages related to them, some researchers have pointed out that the same is not true for low-resource and distant languages. In this paper, we report the research on methods proposed to provide better representation for low-resource and distant languages in the cross-lingual word embedding space.
43

Purwarianti, Ayu, Athia Saelan, Irfan Afif, Filman Ferdian, and Alfan Farizki Wicaksono. "Natural Language Understanding Tools with Low Language Resource in Building Automatic Indonesian Mind Map Generator." International Journal on Electrical Engineering and Informatics 5, no. 3 (September 30, 2013): 256–69. http://dx.doi.org/10.15676/ijeei.2013.5.3.2.

44

Basu, Joyanta, Soma Khan, Rajib Roy, Tapan Kumar Basu, and Swanirbhar Majumder. "Multilingual Speech Corpus in Low-Resource Eastern and Northeastern Indian Languages for Speaker and Language Identification." Circuits, Systems, and Signal Processing 40, no. 10 (April 20, 2021): 4986–5013. http://dx.doi.org/10.1007/s00034-021-01704-x.

45

González-Garduño, Ana V. "Reinforcement Learning for Improved Low Resource Dialogue Generation." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9884–85. http://dx.doi.org/10.1609/aaai.v33i01.33019884.

Abstract:
In this thesis, I focus on language independent methods of improving utterance understanding and response generation and attempt to tackle some of the issues surrounding current systems. The aim is to create a unified approach to dialogue generation inspired by developments in both goal oriented and open ended dialogue systems. The main contributions in this thesis are: 1) Introducing hybrid approaches to dialogue generation using retrieval and encoder-decoder architectures to produce fluent but precise utterances in dialogues, 2) Proposing supervised, semi-supervised and Reinforcement Learning methods for domain adaptation in goal oriented dialogue and 3) Introducing models that can adapt cross-lingually.
46

Cooper, Jones, and Prys. "Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology." Information 10, no. 8 (July 25, 2019): 247. http://dx.doi.org/10.3390/info10080247.

Abstract:
Collecting speech data for a low-resource language is challenging when funding and resources are limited. This paper describes the process of designing, creating and using the Paldaruo Speech Corpus for developing speech technology for Welsh. Specifically, this paper focuses on the crowdsourcing of data using an app on smartphones and mobile devices, allowing speakers from across Wales to contribute. We discuss the development of reading prompts: isolated words and full sentences, as well as the metadata collected from contributors. We also provide background on the design of the Paldaruo App as well as the main uses for the corpus and its availability and licensing. The corpus was designed for the development of speech recognition for Welsh and has been used to create a number of other resources. These methods can be extended to other languages, and suggestions for other low-resource languages are discussed.
47

Yuan, Yang, Xiao Li, and Ya-Ting Yang. "Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages." Information 11, no. 1 (December 29, 2019): 24. http://dx.doi.org/10.3390/info11010024.

Abstract:
To overcome data sparseness in word embeddings trained on low-resource languages, we propose a punctuation and parallel corpus based word embedding model. In particular, we generate the global word-pair co-occurrence matrix with a punctuation-based distance attenuation function, and integrate it with the intermediate word vectors generated from a small-scale bilingual parallel corpus to train word embeddings. Experimental results show that compared with several widely used baseline models such as GloVe and Word2vec, our model significantly improves the performance of word embeddings for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improved by 0.71 percentage points on the word analogy task, and achieved the best results on all of the word similarity tasks.
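The punctuation-based attenuation idea can be sketched directly: each co-occurrence is weighted by a function that decays with distance and is damped further for every punctuation mark between the two words. The exact attenuation form below is an assumption for illustration, not the function defined in the paper.

```python
# Sketch: punctuation-aware weighted co-occurrence counts.
from collections import defaultdict

PUNCT = {",", ".", ";", "!", "?"}

def cooccurrence(tokens, window=5):
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        if w in PUNCT:
            continue
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            v = tokens[j]
            if v in PUNCT:
                continue
            dist = j - i
            n_punct = sum(t in PUNCT for t in tokens[i + 1:j])
            weight = (1.0 / dist) * (0.5 ** n_punct)  # assumed attenuation
            counts[(w, v)] += weight
    return dict(counts)

print(cooccurrence("the cat sat , the dog ran".split()))
```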
48

Salam, Khan Md Anwarus, Setsuo Yamada, and Nishino Tetsuro. "Improve Example-Based Machine Translation Quality for Low-Resource Language Using Ontology." International Journal of Networked and Distributed Computing 5, no. 3 (2017): 176. http://dx.doi.org/10.2991/ijndc.2017.5.3.6.

49

Hutchinson, Brian, Mari Ostendorf, and Maryam Fazel. "A Sparse Plus Low-Rank Exponential Language Model for Limited Resource Scenarios." IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, no. 3 (March 2015): 494–504. http://dx.doi.org/10.1109/taslp.2014.2379593.

50

Tran, Van-Khanh, and Le-Minh Nguyen. "Variational model for low-resource natural language generation in spoken dialogue systems." Computer Speech & Language 65 (January 2021): 101120. http://dx.doi.org/10.1016/j.csl.2020.101120.

