Journal articles on the topic 'Low resource language'

Consult the top 50 journal articles for your research on the topic 'Low resource language.'

1

Pakray, Partha, Alexander Gelbukh, and Sivaji Bandyopadhyay. "Natural language processing applications for low-resource languages." Natural Language Processing 31, no. 2 (2025): 183–97. https://doi.org/10.1017/nlp.2024.33.

Abstract:
Natural language processing (NLP) has significantly advanced our ability to model and interact with human language through technology. However, these advancements have disproportionately benefited high-resource languages with abundant data for training complex models. Low-resource languages, often spoken by smaller or marginalized communities, have yet to realize the full potential of NLP applications. The primary challenges in developing NLP applications for low-resource languages stem from the lack of large, well-annotated datasets, standardized tools, and linguistic resources. This scarcity of resources hinders the performance of data-driven approaches that have excelled in high-resource settings. Further, low-resource languages frequently exhibit complex grammatical structures, diverse vocabularies, and unique social contexts, which pose additional challenges for standard NLP techniques. Innovative strategies are emerging to address these challenges. Researchers are actively collecting and curating datasets, even utilizing community engagement platforms to expand data resources. Transfer learning, where models pre-trained on high-resource languages are adapted to low-resource settings, has shown significant promise. Multilingual models such as multilingual Bidirectional Encoder Representations from Transformers (mBERT) and XLM-RoBERTa (XLM-R), trained on vast quantities of multilingual data, offer a powerful avenue for cross-lingual knowledge transfer. Additionally, researchers are exploring integrating multimodal approaches, combining textual data with images, audio, or video, to enhance NLP performance in low-resource language scenarios. This survey covers applications like part-of-speech tagging, morphological analysis, sentiment analysis, hate speech detection, dependency parsing, language identification, discourse annotation guidelines, question answering, machine translation, information retrieval, and predictive authoring for augmentative and alternative communication systems. The review also highlights machine learning approaches, deep learning approaches, Transformers, and cross-lingual transfer learning as practical techniques. Developing practical NLP applications for low-resource languages is crucial for preserving linguistic diversity, fostering inclusion within the digital world, and expanding our understanding of human language. While challenges remain, the strategies outlined in this survey demonstrate the ongoing progress and highlight the potential for NLP to empower communities that speak low-resource languages and contribute to a more equitable landscape within language technology.
2

Lin, Donghui, Yohei Murakami, and Toru Ishida. "Towards Language Service Creation and Customization for Low-Resource Languages." Information 11, no. 2 (2020): 67. http://dx.doi.org/10.3390/info11020067.

Abstract:
The most challenging issue with low-resource languages is the difficulty of obtaining enough language resources. In this paper, we propose a language service framework for low-resource languages that enables the automatic creation and customization of new resources from existing ones. To achieve this goal, we first introduce a service-oriented language infrastructure, the Language Grid; it realizes new language services by supporting the sharing and combining of language resources. We then show the applicability of the Language Grid to low-resource languages. Furthermore, we describe how we can now realize the automation and customization of language services. Finally, we illustrate our design concept by detailing a case study of automating and customizing bilingual dictionary induction for low-resource Turkic languages and Indonesian ethnic languages.
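
To make the idea of creating new resources by combining existing ones concrete, the toy sketch below composes two bilingual dictionaries through a pivot language to induce entries for a new low-resource pair. The word lists and language choices are invented placeholders for illustration, not data or services from the Language Grid.

```python
# Toy illustration of pivot-based bilingual dictionary induction: chain an existing
# source->pivot dictionary with a pivot->target dictionary to obtain source->target entries.
# All entries below are invented placeholders.
uz_to_id = {"kitob": "buku", "suv": "air"}     # Uzbek -> Indonesian (pivot)
id_to_min = {"buku": "buku", "air": "aia"}     # Indonesian -> Minangkabau

def compose(first, second):
    """Keep a pair only when the pivot word is covered by the second dictionary."""
    return {src: second[piv] for src, piv in first.items() if piv in second}

print(compose(uz_to_id, id_to_min))            # {'kitob': 'buku', 'suv': 'aia'}
```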
3

Ranasinghe, Tharindu, and Marcos Zampieri. "Multilingual Offensive Language Identification for Low-resource Languages." ACM Transactions on Asian and Low-Resource Language Information Processing 21, no. 1 (2022): 1–13. http://dx.doi.org/10.1145/3457610.

Abstract:
Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English partially because most annotated datasets available contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in the TRAC-2 shared task [23], 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020 [58], 0.8568 F1 macro for Hindi in the HASOC 2019 shared task [27], and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
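
A minimal sketch of the transfer recipe described above, assuming the Hugging Face transformers and PyTorch libraries: fine-tune a cross-lingual encoder on labeled English examples, then predict directly on text in another language. The model name, data, and hyperparameters are illustrative stand-ins, not the paper's configuration.

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

en_texts = ["you are an idiot", "have a nice day"]    # toy English training examples
en_labels = torch.tensor([1, 0])                      # 1 = offensive, 0 = not offensive

batch = tok(en_texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                                    # a few toy steps instead of full training
    loss = model(**batch, labels=en_labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# zero-shot prediction on a sentence in another language (Danish here)
model.eval()
da = tok(["hav en god dag"], return_tensors="pt")
print(model(**da).logits.softmax(-1))
```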
4

Cassano, Federico, John Gouwar, Francesca Lucchetti, et al. "Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs." Proceedings of the ACM on Programming Languages 8, OOPSLA2 (2024): 677–708. http://dx.doi.org/10.1145/3689735.

Abstract:
Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming language. Code LLMs produce impressive results on high-resource programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available (e.g., OCaml, Racket, and several others). This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, called MultiPL-T, generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. MultiPL-T translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize unit tests for commented code from a high-resource source language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate the code from the high-resource source language to a target low-resource language. This gives us a corpus of candidate training data in the target language, but many of these translations are wrong. 3) We use a lightweight compiler to compile the test cases generated in (1) from the source language to the target language, which allows us to filter out obviously wrong translations. The result is a training corpus in the target low-resource language where all items have been validated with test cases. We apply this approach to generate tens of thousands of new, validated training items for five low-resource languages: Julia, Lua, OCaml, R, and Racket, using Python as the source high-resource language. Furthermore, we use an open Code LLM (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. Using datasets generated with MultiPL-T, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket that outperform other fine-tunes of these base models on the natural language to code task. We also present Racket fine-tunes for two very recent models, DeepSeek Coder and StarCoder2, to show that MultiPL-T continues to outperform other fine-tuning approaches for low-resource languages. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.
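
A rough sketch of the three-step pipeline above, assuming a Lua interpreter on PATH. The generate_tests and translate callables are hypothetical stand-ins for the Code LLM prompts (steps 1 and 2); only the test-execution filter (step 3) is shown concretely.

```python
import pathlib
import subprocess
import tempfile

def passes_tests(code: str, tests: str, cmd=("lua",), timeout=30) -> bool:
    """Step 3: keep a translated item only if its translated unit tests run successfully."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "candidate.lua"
        path.write_text(code + "\n" + tests, encoding="utf-8")
        try:
            return subprocess.run([*cmd, str(path)], capture_output=True,
                                  timeout=timeout).returncode == 0
        except subprocess.TimeoutExpired:
            return False

def build_corpus(python_items, generate_tests, translate):
    """generate_tests and translate are hypothetical wrappers around Code LLM calls."""
    corpus = []
    for item in python_items:
        tests_py = generate_tests(item)                  # step 1: synthesize unit tests
        code_lua, tests_lua = translate(item, tests_py)  # step 2: translate code and tests
        if passes_tests(code_lua, tests_lua):            # step 3: keep validated items only
            corpus.append({"code": code_lua, "tests": tests_lua})
    return corpus
```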
5

Rai, Abigail. "Part-of-Speech (POS) Tagging of Low-Resource Language (Limbu) with Deep Learning." Panamerican Mathematical Journal 35, no. 1s (2024): 149–57. http://dx.doi.org/10.52783/pmj.v35.i1s.2297.

Abstract:
POS tagging is a basic Natural Language Processing (NLP) task that labels the words in an input text with their grammatical categories. Although POS tagging is a fundamental application for well-resourced languages, it remains largely unexplored for languages such as Limbu, for which only a few tagged datasets and linguistic resources exist. This research uses deep learning techniques, transfer learning, and a BiLSTM-CRF model to develop an accurate POS-tagging system for the Limbu language. Using annotated and unannotated language data, we assemble a small yet informative dataset of Limbu text. Multilingual pre-trained models were adapted to improve performance on the low-resource language. The proposed model attains 90% accuracy, considerably better than traditional rule-based and machine learning methods for Limbu POS tagging. The results indicate that deep learning methods can address the linguistic issues facing low-resource languages even with limited data. In turn, this study provides a cornerstone for follow-up NLP applications for Limbu and similar low-resource languages, demonstrating how deep learning can fill the gap where data is scarce.
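
A compact sketch of a BiLSTM-CRF tagger of the kind described above, assuming PyTorch and the pytorch-crf package; the vocabulary size, tag set, and dimensions are illustrative rather than the paper's configuration.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # provided by the pytorch-crf package (assumed installed)

class BiLSTMCRFTagger(nn.Module):
    """Minimal BiLSTM-CRF sequence tagger with illustrative sizes."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, token_ids, tags=None, mask=None):
        feats = self.emissions(self.lstm(self.embed(token_ids))[0])
        if tags is not None:                        # training: negative log-likelihood
            return -self.crf(feats, tags, mask=mask, reduction="mean")
        return self.crf.decode(feats, mask=mask)    # inference: best tag sequence

# toy usage with random ids; a real system maps Limbu tokens and tags to indices
model = BiLSTMCRFTagger(vocab_size=5000, num_tags=12)
tokens = torch.randint(1, 5000, (2, 7))
tags = torch.randint(0, 12, (2, 7))
loss = model(tokens, tags, mask=torch.ones_like(tokens, dtype=torch.bool))
loss.backward()
print(model(tokens))                                # decoded tag indices
```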
6

Nitu, Melania, and Mihai Dascalu. "Natural Language Processing Tools for Romanian – Going Beyond a Low-Resource Language." Interaction Design and Architecture(s), no. 60 (March 15, 2024): 7–26. http://dx.doi.org/10.55612/s-5002-060-001sp.

Abstract:
Advances in Natural Language Processing bring innovative instruments to the educational field to improve the quality of the didactic process by addressing challenges like language barriers and creating personalized learning experiences. Most research in the domain is dedicated to high-resource languages, such as English, while languages with limited coverage, like Romanian, are still underrepresented in the field. Operating on low-resource languages is essential to ensure equitable access to educational opportunities and to preserve linguistic diversity. Through continuous investments in developing Romanian educational instruments, we are rapidly going beyond a low-resource language. This paper presents recent educational instruments and frameworks dedicated to Romanian, leveraging state-of-the-art NLP techniques, such as building advanced Romanian language models and benchmarks encompassing tools for language learning, text comprehension, question answering, automatic essay scoring, and information retrieval. The methods and insights gained are transferable to other low-resource languages, emphasizing methodological adaptability, collaborative frameworks, and technology transfer to address similar challenges in diverse linguistic contexts. Two use cases are presented, focusing on assessing student performance in Moodle courses and extracting main ideas from students’ feedback. These practical applications in Romanian academic settings serve as examples for enhancing educational practices in other less-resourced languages.
7

Zhou, Shuyan, Shruti Rijhwani, John Wieting, Jaime Carbonell, and Graham Neubig. "Improving Candidate Generation for Low-resource Cross-lingual Entity Linking." Transactions of the Association for Computational Linguistics 8 (July 2020): 109–24. http://dx.doi.org/10.1162/tacl_a_00303.

Abstract:
Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on resources from Wikipedia have proven successful in the realm of relatively high-resource languages, but these do not extend well to low-resource languages with few, if any, Wikipedia pages. Recently, transfer learning methods have been shown to reduce the demand for resources in the low-resource languages by utilizing resources in closely related languages, but the performance still lags far behind their high-resource counterparts. In this paper, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios. The methods are simple, but effective: We experiment with our approach on seven XEL datasets and find that they yield an average gain of 16.9% in Top-30 gold candidate recall, compared with state-of-the-art baselines. Our improved model also yields an average gain of 7.9% in in-KB accuracy of end-to-end XEL.
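
As a simplified illustration of the candidate generation step (not the paper's method), the sketch below retrieves KB candidates for each mention by character n-gram TF-IDF similarity, which tolerates the spelling variation typical of low-resource settings. It assumes scikit-learn and uses toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

kb_entries = ["Addis Ababa", "Oromia Region", "Abiy Ahmed", "Lake Tana"]  # toy target-language KB
mentions = ["Adis Abeba", "Abiy Ahmad"]                                   # toy source-text mentions

# character n-grams tolerate transliteration and spelling variation
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit(kb_entries + mentions)
scores = cosine_similarity(vec.transform(mentions), vec.transform(kb_entries))

for mention, row in zip(mentions, scores):
    top = row.argsort()[::-1][:3]   # top-3 candidate entities for each mention
    print(mention, "->", [(kb_entries[i], round(row[i], 2)) for i in top])
```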
8

Vargas, Francielle, Wolfgang Schmeisser-Nieto, Zohar Rabinovich, Thiago A. S. Pardo, and Fabrício Benevenuto. "Discourse annotation guideline for low-resource languages." Natural Language Processing 31, no. 2 (2025): 700–743. https://doi.org/10.1017/nlp.2024.19.

Abstract:
Most existing discourse annotation guidelines have focused on the English language. As a result, there is a significant lack of research and resources concerning computational discourse-level language understanding and generation for other languages. To fill this relevant gap, we introduce the first discourse annotation guideline using rhetorical structure theory (RST) for low-resource languages. Specifically, this guideline provides accurate examples of discourse coherence relations in three Romance languages: Italian, Portuguese, and Spanish. We further discuss theoretical definitions of RST and compare different artificial intelligence discourse frameworks, hence offering a reliable and accessible survey to new researchers and annotators.
9

Li, Zihao, Yucheng Shi, Zirui Liu, et al. "Language Ranker: A Metric for Quantifying LLM Performance Across High and Low-Resource Languages." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 27 (2025): 28186–94. https://doi.org/10.1609/aaai.v39i27.35038.

Abstract:
The development of Large Language Models (LLMs) relies on extensive text corpora, which are often unevenly distributed across languages. This imbalance results in LLMs performing significantly better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate. Currently, there is a lack of quantitative methods to evaluate the performance of LLMs in these low-resource languages. To address this gap, we propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations. By comparing the LLM's internal representation of various languages against a baseline derived from English, we can assess the model's multilingual capabilities in a robust and language-agnostic manner. Our analysis reveals that high-resource languages exhibit higher similarity scores with English, demonstrating superior performance, while low-resource languages show lower similarity scores, underscoring the effectiveness of our metric in assessing language-specific capabilities. Besides, the experiments show that there is a strong correlation between the LLM’s performance in different languages and the proportion of those languages in its pre-training corpus. These insights underscore the efficacy of the Language Ranker as a tool for evaluating LLM performance across different languages, particularly those with limited resources.
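
A simplified sketch in the spirit of the metric described above, assuming the Hugging Face transformers library: mean-pooled internal representations of parallel sentences are compared against an English baseline with cosine similarity. The model checkpoint and sentences are illustrative stand-ins, not the paper's evaluation setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "xlm-roberta-base"   # stand-in model; the paper evaluates various LLMs
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

def embed(text):
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (1, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)     # mean-pooled sentence vector

english = embed("The weather is nice today.")
parallel = {"German": "Das Wetter ist heute schön.",
            "Swahili": "Hali ya hewa ni nzuri leo."}
for lang, sent in parallel.items():
    sim = torch.cosine_similarity(english, embed(sent)).item()
    print(f"{lang}: similarity to English = {sim:.3f}")
```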
10

Yusup, Azragul, Degang Chen, Yifei Ge, Hongliang Mao, and Nujian Wang. "Resource Construction and Ensemble Learning based Sentiment Analysis for the Low-resource Language Uyghur." 網際網路技術學刊 (Journal of Internet Technology) 24, no. 4 (2023): 1009–16. http://dx.doi.org/10.53106/160792642023072404018.

Abstract:
To address the present scarcity of low-resource sentiment analysis corpora, this paper proposes a sentence-level sentiment analysis resource conversion method, HTL, based on the syntactic-semantic knowledge of the low-resource language Uyghur, which converts a high-resource corpus into a low-resource corpus. In the conversion process, a k-fold cross-filtering method is proposed to reduce the distortion of data samples and is used to select high-quality samples for conversion. Finally, the Uyghur sentiment analysis dataset USD is constructed; the baseline for this dataset is verified under an LSTM model, and the accuracy and F1 values reach 81.07% and 81.13%, respectively, which can provide a reference for the construction of low-resource language corpora. Meanwhile, this paper also proposes a sentiment analysis model based on logistic regression ensemble learning, SA-LREL, which combines the advantages of several lightweight network models such as TextCNN, RNN, and RCNN as base models; the meta-model is constructed using logistic regression functions for the ensemble. The accuracy and F1 values reach 82.17% and 81.86%, respectively, on the test set, and the experimental results show that the method can effectively improve the performance of the Uyghur sentiment analysis task.
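
The stacking mechanics with a logistic-regression meta-model can be sketched with scikit-learn as below. The paper's base models are neural (TextCNN, RNN, RCNN) over Uyghur text, so the lightweight classifiers and toy English data here are only stand-ins for showing the ensemble structure.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy stand-in data for a binary sentiment task
texts = ["great movie", "wonderful acting", "loved it", "really enjoyable",
         "terrible plot", "awful film", "hated it", "very boring"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

stack = StackingClassifier(
    estimators=[("svm", LinearSVC()), ("nb", MultinomialNB())],  # lightweight base models
    final_estimator=LogisticRegression(),                        # logistic-regression meta-model
    cv=2,
)
clf = make_pipeline(TfidfVectorizer(), stack).fit(texts, labels)
print(clf.predict(["boring movie", "wonderful plot"]))
```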
11

Mati, Diellza Nagavci, Mentor Hamiti, Arsim Susuri, Besnik Selimi, and Jaumin Ajdari. "Building Dictionaries for Low Resource Languages: Challenges of Unsupervised Learning." Annals of Emerging Technologies in Computing 5, no. 3 (2021): 52–58. http://dx.doi.org/10.33166/aetic.2021.03.005.

Abstract:
The development of natural language processing resources for Albanian has grown steadily in recent years. This paper presents research conducted on unsupervised learning: the challenges associated with building a dictionary for the Albanian language and creating part-of-speech tagging models. The majority of languages have their own dictionary, but low-resource languages suffer from a lack of such resources. A dictionary facilitates the sharing of information and services for users and whole communities through natural language processing. The experimentation corpus for the Albanian language includes 250K sentences from different disciplines, with a proposal for a part-of-speech tagging tag set that can adequately represent the underlying linguistic phenomena. Contributing to the development of Albanian is the purpose of this paper. The results of experiments with the Albanian language corpus revealed that its use of articles and pronouns resembles that of more high-resource languages. According to this study, the total expected frequency as a means for correctly tagging words has proven effective for populating the Albanian language dictionary.
12

Kashyap, Gaurav. "Multilingual NLP: Techniques for Creating Models that Understand and Generate Multiple Languages with Minimal Resources." International Journal of Scientific Research in Engineering and Management 08, no. 12 (2024): 1–5. https://doi.org/10.55041/ijsrem7648.

Abstract:
Models that can process human language in a variety of applications have been developed as a result of the quick development of natural language processing (NLP). Scaling NLP technologies to support multiple languages with minimal resources is still a major challenge, even though many models work well in high-resource languages. By developing models that can comprehend and produce text in multiple languages, especially those with little linguistic information, multilingual natural language processing (NLP) seeks to overcome this difficulty. This study examines the methods used in multilingual natural language processing (NLP), such as data augmentation, transfer learning, and multilingual pre-trained models. It also talks about the innovations and trade-offs involved in developing models that can effectively handle multiple languages with little effort. Many low-resource languages have been underserved by the quick advances in natural language processing, which have mostly benefited high-resource languages. The methods for creating multilingual NLP models that can efficiently handle several languages with little resource usage are examined in this paper. We discuss unsupervised morphology-based approaches to expand vocabularies, the importance of community involvement in low-resource language technology, and the limitations of current multilingual models. With the creation of strong language models capable of handling a variety of tasks, the field of natural language processing has advanced significantly in recent years. But not all languages have benefited equally from the advancements, with high-resource languages like English receiving disproportionate attention. [9] As a result, there are huge differences in the performance and accessibility of natural language processing (NLP) systems for the languages spoken around the world, many of which are regarded as low-resource. Researchers have looked into a number of methods for developing multilingual natural language processing (NLP) models that can comprehend and produce text in multiple languages with little effort in order to rectify this imbalance. Using unsupervised morphology-based techniques to increase the vocabulary of low-resource languages is one promising strategy. Keywords: Multilingual NLP, Low-resource Languages, Morphology, Vocabulary Expansion, Creole Languages
13

Rijhwani, Shruti, Jiateng Xie, Graham Neubig, and Jaime Carbonell. "Zero-Shot Neural Transfer for Cross-Lingual Entity Linking." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6924–31. http://dx.doi.org/10.1609/aaai.v33i01.33016924.

Abstract:
Cross-lingual entity linking maps an entity mention in a source language to its corresponding entry in a structured knowledge base that is in a different (target) language. While previous work relies heavily on bilingual lexical resources to bridge the gap between the source and the target languages, these resources are scarce or unavailable for many low-resource languages. To address this problem, we investigate zero-shot cross-lingual entity linking, in which we assume no bilingual lexical resources are available in the source low-resource language. Specifically, we propose pivot-based entity linking, which leverages information from a high-resource “pivot” language to train character-level neural entity linking models that are transferred to the source low-resource language in a zero-shot manner. With experiments on 9 low-resource languages and transfer through a total of 54 languages, we show that our proposed pivot-based framework improves entity linking accuracy 17% (absolute) on average over the baseline systems, for the zero-shot scenario. Further, we also investigate the use of language-universal phonological representations which improves average accuracy (absolute) by 36% when transferring between languages that use different scripts.
14

Qarah, Faisal, and Tawfeeq Alsanoosy. "Evaluation of Arabic Large Language Models on Moroccan Dialect." Engineering, Technology & Applied Science Research 15, no. 3 (2025): 22478–85. https://doi.org/10.48084/etasr.10331.

Abstract:
Large Language Models (LLMs) have shown outstanding performance in many Natural Language Processing (NLP) tasks for high-resource languages, especially English, primarily because most of them were trained on widely available text resources. As a result, many low-resource languages, such as Arabic and African languages and their dialects, are not well studied, raising concerns about whether LLMs can perform fairly across them. Therefore, evaluating the performance of LLMs for low-resource languages and diverse dialects is crucial. This study investigated the performance of LLMs in Moroccan Arabic, a low-resource dialect spoken by approximately 30 million people. The performance of 14 Arabic pre-trained models was evaluated on the Moroccan dialect, employing 11 datasets across various NLP tasks such as text classification, sentiment analysis, and offensive language detection. The evaluation results showed that MARBERTv2 achieved the highest overall average F1-score of 83.47, while the second-best model, DarijaBERT-mix, had an average F1-score of 83.38. These findings provide valuable insights into the effectiveness of current LLMs for low-resource languages, particularly the Moroccan dialect.
15

Lee, Chanhee, Kisu Yang, Taesun Whang, Chanjun Park, Andrew Matteson, and Heuiseok Lim. "Exploring the Data Efficiency of Cross-Lingual Post-Training in Pretrained Language Models." Applied Sciences 11, no. 5 (2021): 1974. http://dx.doi.org/10.3390/app11051974.

Abstract:
Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Even though language modeling is unsupervised and thus collecting data for it is relatively less expensive, it is still a challenging process for languages with limited resources. This results in great technological disparity between high- and low-resource languages for numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data efficient training of pretrained language models. It is achieved by formulating language modeling of low-resource languages as a domain adaptation task using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at a sequence level. To evaluate our method, we post-train a RoBERTa model pretrained in English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves the data efficiency by a factor of up to 32 compared to monolingual training.
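
A simplified sketch of the selective-reuse idea, assuming the Hugging Face transformers library: start from an English-pretrained model, swap in a target-language vocabulary, and post-train with MLM while keeping most encoder parameters frozen. The Korean tokenizer checkpoint below is only for illustration, and the paper's implicit translation layers are omitted.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("roberta-base")        # high-resource source model
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")      # target-language vocabulary (illustrative)

model.resize_token_embeddings(len(tokenizer))                       # new language-specific embeddings

for name, param in model.named_parameters():
    # train only the embeddings and the MLM head; reuse the English transformer body as-is
    param.requires_grad = ("embeddings" in name) or ("lm_head" in name)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-5)
# ...continue with a standard masked-language-modeling loop over target-language text
```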
16

Lee, Jaeseong, Dohyeon Lee, and Seung-won Hwang. "Script, Language, and Labels: Overcoming Three Discrepancies for Low-Resource Language Specialization." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 11 (2023): 13004–13. http://dx.doi.org/10.1609/aaai.v37i11.26528.

Abstract:
Although multilingual pretrained models (mPLMs) have enabled support for various natural language processing tasks in diverse languages, their limited coverage of 100+ languages leaves 6500+ languages ‘unseen’. One common approach for an unseen language is specializing the model for it as a target, by performing additional masked language modeling (MLM) with the target language corpus. However, we argue that, due to the discrepancy from multilingual MLM pretraining, a naive specialization as such can be suboptimal. Specifically, we pose three discrepancies to overcome. Script and linguistic discrepancies of the target language from the related seen languages hinder a positive transfer, for which we propose to maximize representation similarity, unlike existing approaches that maximize overlaps. In addition, the label space for MLM prediction can vary across languages, for which we propose to reinitialize the top layers for a more effective adaptation. Experiments over four different language families and three tasks show that our method improves the task performance of unseen languages with statistical significance, while the previous approach fails to.
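
A minimal sketch of the top-layer re-initialization idea, assuming the Hugging Face transformers library and PyTorch; the choice of two layers and the initialization scale are illustrative.

```python
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

for layer in model.roberta.encoder.layer[-2:]:        # re-initialize the top two transformer blocks
    for module in layer.modules():
        if isinstance(module, torch.nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, torch.nn.LayerNorm):
            torch.nn.init.ones_(module.weight)
            torch.nn.init.zeros_(module.bias)
# ...then continue masked-language-model training on the target-language corpus
```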
17

Mozafari, Marzieh, Khouloud Mnassri, Reza Farahbakhsh, and Noel Crespi. "Offensive language detection in low resource languages: A use case of Persian language." PLOS ONE 19, no. 6 (2024): e0304166. http://dx.doi.org/10.1371/journal.pone.0304166.

Abstract:
THIS ARTICLE USES WORDS OR LANGUAGE THAT IS CONSIDERED PROFANE, VULGAR, OR OFFENSIVE BY SOME READERS. Different types of abusive content such as offensive language, hate speech, aggression, etc. have become prevalent in social media and many efforts have been dedicated to automatically detect this phenomenon in different resource-rich languages such as English. This is mainly due to the comparative lack of annotated data related to offensive language in low-resource languages, especially the ones spoken in Asian countries. To reduce the vulnerability among social media users from these regions, it is crucial to address the problem of offensive language in such low-resource languages. Hence, we present a new corpus of Persian offensive language consisting of 6,000 out of 520,000 randomly sampled micro-blog posts from X (Twitter) to deal with offensive language detection in Persian as a low-resource language in this area. We introduce a method for creating the corpus and annotating it according to the annotation practices of recent efforts for some benchmark datasets in other languages which results in categorizing offensive language and the target of offense as well. We perform extensive experiments with three classifiers in different levels of annotation with a number of classical Machine Learning (ML), Deep Learning (DL), and transformer-based neural networks including monolingual and multilingual pre-trained language models. Furthermore, we propose an ensemble model integrating the aforementioned models to boost the performance of our offensive language detection task. Initial results on single models indicate that SVMs trained on character or word n-grams, together with the monolingual transformer-based pre-trained language model ParsBERT, are the best-performing models in identifying offensive vs. non-offensive content, targeted vs. untargeted offense, and offenses directed at individuals or groups. In addition, the stacking ensemble model outperforms the single models by a substantial margin, obtaining a respective 5% macro F1-score improvement for the three levels of annotation.
18

Laskar, Sahinur Rahman, Abdullah Faiz Ur Rahman Khilji, Partha Pakray, and Sivaji Bandyopadhyay. "Improved neural machine translation for low-resource English–Assamese pair." Journal of Intelligent & Fuzzy Systems 42, no. 5 (2022): 4727–38. http://dx.doi.org/10.3233/jifs-219260.

Abstract:
Language translation is essential to bring the world closer and plays a significant part in building a community among people of different linguistic backgrounds. Machine translation dramatically helps in removing the language barrier and allows easier communication among linguistically diverse communities. Due to the unavailability of resources, major languages of the world are accounted as low-resource languages. This leads to a challenging task of automating translation among various such languages to benefit indigenous speakers. This article investigates neural machine translation for the English–Assamese resource-poor language pair by tackling insufficient data and out-of-vocabulary problems. We have also proposed an approach of data augmentation-based NMT, which exploits synthetic parallel data and shows significantly improved translation accuracy for English-to-Assamese and Assamese-to-English translation and obtained state-of-the-art results.
19

A. Baldha, Nirav. "Question Answering for Low Resource Languages Using Natural Language Processing." International Journal of Scientific Research and Engineering Trends 8, no. 2 (2022): 1122–26. http://dx.doi.org/10.61137/ijsret.vol.8.issue2.207.

20

Shikali, Casper S., and Refuoe Mokhosi. "Enhancing African low-resource languages: Swahili data for language modelling." Data in Brief 31 (August 2020): 105951. http://dx.doi.org/10.1016/j.dib.2020.105951.

21

Xiao, Yubei, Ke Gong, Pan Zhou, Guolin Zheng, Xiaodan Liang, and Liang Lin. "Adversarial Meta Sampling for Multilingual Low-Resource Speech Recognition." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 16 (2021): 14112–20. http://dx.doi.org/10.1609/aaai.v35i16.17661.

Abstract:
Low-resource automatic speech recognition (ASR) is challenging, as the low-resource target language data cannot well train an ASR model. To solve this issue, meta-learning formulates ASR for each source language into many small ASR tasks and meta-learns a model initialization on all tasks from different source languages to access fast adaptation on unseen target languages. However, for different source languages, the quantity and difficulty vary greatly because of their different data scales and diverse phonological systems, which leads to task-quantity and task-difficulty imbalance issues and thus a failure of multilingual meta-learning ASR (MML-ASR). In this work, we solve this problem by developing a novel adversarial meta sampling (AMS) approach to improve MML-ASR. When sampling tasks in MML-ASR, AMS adaptively determines the task sampling probability for each source language. Specifically, for each source language, if the query loss is large, it means that its tasks are not well sampled to train the ASR model in terms of their quantity and difficulty and thus should be sampled more frequently for extra learning. Inspired by this fact, we feed the historical task query loss of all source language domains into a network to learn a task sampling policy for adversarially increasing the current query loss of MML-ASR. Thus, the learnt task sampling policy can master the learning situation of each language and thus predict good task sampling probabilities for each language for more effective learning. Finally, experimental results on two multilingual datasets show significant performance improvement when applying our AMS on MML-ASR, and also demonstrate the applicability of AMS to other low-resource speech tasks and transfer learning ASR approaches.
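
The core intuition (sample tasks more often from source languages whose recent query loss is high) can be illustrated with a hand-rolled temperature softmax, as below. AMS itself learns this sampling policy adversarially with a network, so this is only a simplified stand-in with made-up numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

languages = ["sw", "am", "yo", "km"]
query_loss = np.array([1.8, 0.9, 1.2, 2.4])      # running average meta-query loss per language

def sampling_probs(losses, temperature=1.0):
    z = losses / temperature
    z = z - z.max()                               # numerical stability
    p = np.exp(z)
    return p / p.sum()                            # higher loss -> higher sampling probability

probs = sampling_probs(query_loss)
for _ in range(3):                                # draw the next meta-training tasks
    lang = rng.choice(languages, p=probs)
    print("sample task from:", lang, "probs:", np.round(probs, 2))
```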
22

Chen, Siqi, Yijie Pei, Zunwang Ke, and Wushour Silamu. "Low-Resource Named Entity Recognition via the Pre-Training Model." Symmetry 13, no. 5 (2021): 786. http://dx.doi.org/10.3390/sym13050786.

Abstract:
Named entity recognition (NER) is an important task in the processing of natural language, which needs to determine entity boundaries and classify them into pre-defined categories. For low-resource languages, most state-of-the-art systems require tens of thousands of annotated sentences to obtain high performance. However, there is minimal annotated data available about Uyghur and Hungarian (UH languages) NER tasks. There are also specificities in each task—differences in words and word order across languages make it a challenging problem. In this paper, we present an effective solution to providing a meaningful and easy-to-use feature extractor for named entity recognition tasks: fine-tuning the pre-trained language model. Therefore, we propose a fine-tuning method for a low-resource language model, which constructs a fine-tuning dataset through data augmentation; then the dataset of a high-resource language is added; and finally the cross-language pre-trained model is fine-tuned on this dataset. In addition, we propose an attention-based fine-tuning strategy that uses symmetry to better select relevant semantic and syntactic information from pre-trained language models and apply these symmetry features to named entity recognition tasks. We evaluated our approach on Uyghur and Hungarian datasets, which showed wonderful performance compared to some strong baselines. We close with an overview of the available resources for named entity recognition and some of the open research questions.
23

Gunnam, Vinodh. "Tackling Low-Resource Languages: Efficient Transfer Learning Techniques for Multilingual NLP." International Journal for Research Publication and Seminar 13, no. 4 (2022): 354–59. http://dx.doi.org/10.36676/jrps.v13.i4.1601.

Abstract:
This study reviews the most efficient transfer learning techniques for low-resource languages (LRLs) in multilingual NLP. Such languages lack the reliable data and resources needed to achieve high model accuracy. One of the solutions presented is transfer learning, a technique that enables knowledge from high-resource languages to be utilized for LRLs. The study also uses simulation reports, real-time case studies, and practical experience to show how these techniques work. The major issues are outlined, such as the lack of training data, model complexity, and linguistic variation, together with solutions such as data augmentation, few-shot learning, and pre-trained multilingual models. These approaches make way for more diverse NLP systems and help pave the way for language inclusion.
24

Thakkar, Gaurish, Nives Mikelić Preradović, and Marko Tadić. "Transferring Sentiment Cross-Lingually within and across Same-Family Languages." Applied Sciences 14, no. 13 (2024): 5652. http://dx.doi.org/10.3390/app14135652.

Abstract:
Natural language processing for languages with limited resources is hampered by a lack of data. Using English as a hub language for such languages, cross-lingual sentiment analysis has been developed. The sheer quantity of English language resources raises questions about its status as the primary resource. This research aims to examine the impact on sentiment analysis of adding data from same-family versus distant-family languages. We analyze the performance using low-resource and high-resource data from the same language family (Slavic), investigate the effect of using a distant-family language (English) and report the results for both settings. Quantitative experiments using multi-task learning demonstrate that adding a large quantity of data from related and distant-family languages is advantageous for cross-lingual sentiment transfer.
25

Bajpai, Ashutosh, and Tanmoy Chakraborty. "Multilingual LLMs Inherently Reward In-Language Time-Sensitive Semantic Alignment for Low-Resource Languages." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 22 (2025): 23469–77. https://doi.org/10.1609/aaai.v39i22.34515.

Abstract:
The unwavering disparity in labeled resources between resource-rich languages and those considered low-resource remains a significant impediment for Large Language Models (LLMs). Recent strides in cross-lingual in-context learning (X-ICL), mainly through semantically aligned examples retrieved from multilingual pre-trained transformers, have shown promise in mitigating this issue. However, our investigation reveals that LLMs intrinsically reward in-language semantically aligned cross-lingual instances over direct cross-lingual semantic alignments, with a pronounced disparity in handling time-sensitive queries in the X-ICL setup. Such queries demand sound temporal reasoning ability from LLMs, yet the advancements have predominantly focused on English. This study aims to bridge this gap by improving temporal reasoning capabilities in low-resource languages. To this end, we introduce mTEMPREASON, a temporal reasoning dataset aimed at the varied degrees of low-resource languages and propose Cross-Lingual Time-Sensitive Semantic Alignment (CLiTSSA), a novel method to improve temporal reasoning in these contexts. To facilitate this, we construct an extension of mTEMPREASON comprising pairs of parallel cross-language temporal queries along with their anticipated in-language semantic similarity scores. Our empirical evidence underscores the superior performance of CLiTSSA compared to established baselines across three languages (Romanian, German, and French), encompassing three temporal tasks and including a diverse set of four contemporaneous LLMs. This marks a significant step forward in addressing resource disparity in the context of temporal reasoning across languages.
26

Andrabi, Syed Abdul Basit, et al. "A Review of Machine Translation for South Asian Low Resource Languages." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12, no. 5 (2021): 1134–47. http://dx.doi.org/10.17762/turcomat.v12i5.1777.

Abstract:
Machine translation is an application of natural language processing. Humans use native languages to communicate with one another, whereas programming languages communicate between humans and computers. NLP is the field that involves a broad set of techniques for analysis, manipulation and automatic generation of human languages or natural languages with the help of computers. It is essential to provide access to information to people for their development in the present information age. It is necessary to put equal emphasis on removing the barrier of language between different divisions of society. The area of NLP strives to fill this gap of the language barrier by applying machine translation. One natural language is transformed into another natural language with the aid of computers. The first few years of this area were dedicated to the development of rule-based systems. Later on, due to the increase in computational power, there was a transition towards statistical machine translation. The motive of machine translation is that the meaning of the translated text should be preserved during translation. This research paper aims to analyse the machine translation approaches used for resource-poor languages and determine the needs and challenges the researchers face. This paper also reviews the machine translation systems that are available for resource-poor languages.
27

Kalluri, Kartheek. "Adapting LLMs for Low Resource Languages: Techniques and Ethical Considerations." International Journal of Scientific Research in Engineering and Management 08, no. 12 (2024): 1–6. https://doi.org/10.55041/isjem00140.

Abstract:
This study examines how to adapt large language models (LLMs) to resource-scarce languages and analyzes the ethical considerations involved. It follows a mixed-methods design consisting of a literature review, corpus collection, expert interviews, and stakeholder meetings. The adaptation techniques examined include data augmentation, multilingual pre-training, architectural changes, and parameter-efficient fine-tuning. The quantitative analysis indicated model performance improvements for under-resourced languages, particularly through cross-lingual knowledge transfer and data augmentation, although results varied across languages and tasks. The qualitative analysis surfaced ethical issues and articulated an ethical framework built on inclusiveness, transparency, and stakeholder involvement, covering bias, cultural sensitivity, data privacy, and impacts on linguistic diversity. Overall, while transfer learning and data augmentation lend themselves well to adapting LLMs to low-resource languages, careful consideration must still be given to their implications to ensure fair and contextually appropriate use. Keywords: adaptive large language models (LLMs), resource-scarce languages, data augmentation, multilingual pre-training, cross-lingual knowledge transfer, ethical considerations, cultural sensitivity
28

Rakhimova, Diana, Aidana Karibayeva, and Assem Turarbek. "The Task of Post-Editing Machine Translation for the Low-Resource Language." Applied Sciences 14, no. 2 (2024): 486. http://dx.doi.org/10.3390/app14020486.

Abstract:
In recent years, machine translation has made significant advancements; however, its effectiveness can vary widely depending on the language pair. Languages with limited resources, such as Kazakh, Uzbek, Kalmyk, Tatar, and others, often encounter challenges in achieving high-quality machine translations. Kazakh is an agglutinative language with complex morphology, making it a low-resource language. This article addresses the task of post-editing machine translation for the Kazakh language. The research begins by discussing the history and evolution of machine translation and how it has developed to meet the unique needs of languages with limited resources. The research resulted in the development of a machine translation post-editing system. The system utilizes modern machine learning methods, starting with neural machine translation using the BRNN model in the initial post-editing stage. Subsequently, the transformer model is applied to further edit the text. Complex structural and grammatical forms are processed, and abbreviations are replaced. Practical experiments were conducted on various texts: news publications, legislative documents, IT sphere, etc. This article serves as a valuable resource for researchers and practitioners in the field of machine translation, shedding light on effective post-editing strategies to enhance translation quality, particularly in scenarios involving languages with limited resources such as Kazakh and Uzbek. The obtained results were tested and evaluated using specialized metrics—BLEU, TER, and WER.
29

Kim, Bosung, Juae Kim, Youngjoong Ko, and Jungyun Seo. "Commonsense Knowledge Augmentation for Low-Resource Languages via Adversarial Learning." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 7 (2021): 6393–401. http://dx.doi.org/10.1609/aaai.v35i7.16793.

Abstract:
Commonsense reasoning is one of the ultimate goals of artificial intelligence research because it simulates the human thinking process. However, most commonsense reasoning studies have focused on English because available commonsense knowledge for low-resource languages is scarce due to high construction costs. Translation is one of the typical methods for augmenting data for low-resource languages; however, translation entails ambiguity problems, where one word can be translated into multiple words due to polysemes and homonyms. Previous studies have suggested methods to measure the validity of translated multiple triples by using additional metadata and manually labeled data. However, such handcrafted datasets are not available for many low-resource languages. In this paper, we propose a knowledge augmentation method using adversarial networks that does not require any labeled data. Our adversarial networks can transfer knowledge learned from a resource-rich language to low-resource languages and thus measure the validity score of translated triples even without labeled data. We designed experiments to demonstrate that high-scoring triples obtained by the proposed model can be considered augmented knowledge. The experimental results show that our proposed method for a low-resource language, Korean, achieved 93.7% precision@1 on a manually labeled benchmark. Furthermore, to verify our model for other low-resource languages, we introduced new test sets for knowledge validation in 16 different languages. Our adversarial model obtains strong results for all language test sets. We will release the augmented Korean knowledge and test sets for 16 languages.
30

Zhang, Mozhi, Yoshinari Fujinuma, and Jordan Boyd-Graber. "Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 9547–54. http://dx.doi.org/10.1609/aaai.v34i05.6500.

Abstract:
Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (caco) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder derives vector representations for input words from their written forms, and the classifier makes predictions based on the word vectors. We use a joint character representation for both the source language and the target language, which allows the embedder to generalize knowledge about source language words to target language words with similar forms. We propose a multi-task objective that can further improve the model if additional cross-lingual or monolingual resources are available. Experiments confirm that character-level knowledge transfer is more data-efficient than word-level transfer between related languages.
31

ZENNAKI, O., N. SEMMAR, and L. BESACIER. "A neural approach for inducing multilingual resources and natural language processing tools for low-resource languages." Natural Language Engineering 25, no. 1 (2018): 43–67. http://dx.doi.org/10.1017/s1351324918000293.

Abstract:
This work focuses on the rapid development of linguistic annotation tools for low-resource languages (languages that have no labeled training data). We experiment with several cross-lingual annotation projection methods using recurrent neural network (RNN) models. The distinctive feature of our approach is that our multilingual word representation requires only a parallel corpus between source and target languages. More precisely, our approach has the following characteristics: (a) it does not use word alignment information, (b) it does not assume any knowledge about target languages (one requirement is that the two languages (source and target) are not too syntactically divergent), which makes it applicable to a wide range of low-resource languages, (c) it provides authentic multilingual taggers (one tagger for N languages). We investigate both uni- and bidirectional RNN models and propose a method to include external information (for instance, low-level information from part-of-speech tags) in the RNN to train higher level taggers (for instance, Super Sense taggers). We demonstrate the validity and genericity of our model by using parallel corpora (obtained by manual or automatic translation). Our experiments are conducted to induce cross-lingual part-of-speech and Super Sense taggers. We also use our approach in a weakly supervised context, and it shows an excellent potential for very low-resource settings (less than 1k training utterances).
32

Meeus, Quentin, Marie-Francine Moens, and Hugo Van hamme. "Bidirectional Representations for Low-Resource Spoken Language Understanding." Applied Sciences 13, no. 20 (2023): 11291. http://dx.doi.org/10.3390/app132011291.

Abstract:
Speech representation models lack the ability to efficiently store semantic information and require fine tuning to deliver decent performance. In this research, we introduce a transformer encoder–decoder framework with a multiobjective training strategy, incorporating connectionist temporal classification (CTC) and masked language modeling (MLM) objectives. This approach enables the model to learn contextual bidirectional representations. We evaluate the representations in a challenging low-resource scenario, where training data is limited, necessitating expressive speech embeddings to compensate for the scarcity of examples. Notably, we demonstrate that our model’s initial embeddings outperform comparable models on multiple datasets before fine tuning. Fine tuning the top layers of the representation model further enhances performance, particularly on the Fluent Speech Command dataset, even under low-resource conditions. Additionally, we introduce the concept of class attention as an efficient module for spoken language understanding, characterized by its speed and minimal parameter requirements. Class attention not only aids in explaining model predictions but also enhances our understanding of the underlying decision-making processes. Our experiments cover both English and Dutch languages, offering a comprehensive evaluation of our proposed approach.
33

Berthelier, Benoit. "Division and the Digital Language Divide: A Critical Perspective on Natural Language Processing Resources for the South and North Korean Languages." Korean Studies 47, no. 1 (2023): 243–73. http://dx.doi.org/10.1353/ks.2023.a908624.

Abstract:
The digital world is marked by large asymmetries in the volume of content available between different languages. As a direct corollary, this inequality also exists, amplified, in the number of resources (labeled and unlabeled datasets, pretrained models, academic research) available for the computational analysis of these languages or what is generally called natural language processing (NLP). NLP literature divides languages between high- and low-resource languages. Thanks to early private and public investment in the field, the Korean language is generally considered to be a high-resource language. Yet, the good fortunes of Korean in the age of machine learning obscure the divided state of the language, as recensions of available resources and research solely focus on the standard language of South Korea, thus making it the sole representative of an otherwise diverse linguistic family that includes the Northern standard language as well as regional and diasporic dialects. This paper shows that the resources developed for the South Korean language do not necessarily transfer to the North Korean language. However, it also argues that this does not make North Korean a low-resource language. On one hand, South Korean resources can be augmented with North Korean data to achieve better performance. On the other, North Korean has more resources than commonly assumed. Retracing the long history of NLP research in North Korea, the paper shows that a large number of datasets and research exists for the North Korean language even if they are not easily available. The paper concludes by exploring the possibility of "unified" language models and underscoring the need for active NLP research collaboration across the Korean peninsula.
APA, Harvard, Vancouver, ISO, and other styles
34

Mi, Chenggang, Shaolin Zhu, and Rui Nie. "Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion." Computational Intelligence and Neuroscience 2021 (April 8, 2021): 1–9. http://dx.doi.org/10.1155/2021/9975078.

Full text
Abstract:
Loanword identification has been studied in recent years to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation and cross-lingual information retrieval. However, recent studies on this topic usually focus on high-resource languages (such as Chinese, English, and Russian); for low-resource languages such as Uyghur and Mongolian, due to limited resources and a lack of annotated data, loanword identification tends to have lower performance. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource loanword identification; then, a loanword identification model based on a log-linear RNN is introduced to improve performance by incorporating features such as word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) tags into one model. Experimental results on loanword identification in Uyghur (in this study, we mainly focus on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) show that our proposed method achieves the best performance compared with several strong baseline systems.
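As a rough illustration of the feature-fusion idea in this abstract (not the authors' log-linear RNN), the sketch below concatenates word embeddings, character-level features, a POS embedding, and a pronunciation-similarity score before a bidirectional LSTM tagger; all dimensions are assumed.

```python
import torch
import torch.nn as nn

class LoanwordTagger(nn.Module):
    """Tag each token as loanword vs. native from fused word-level features."""

    def __init__(self, vocab_size, char_size, pos_size, dim=64):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.char_emb = nn.Embedding(char_size, 16)
        self.char_rnn = nn.LSTM(16, dim // 2, batch_first=True, bidirectional=True)
        self.pos_emb = nn.Embedding(pos_size, 8)
        fused = dim + dim + 8 + 1  # word + char + POS + pronunciation similarity
        self.encoder = nn.LSTM(fused, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim, 2)

    def forward(self, words, chars, pos, pron_sim):
        # chars: (batch, seq, max_chars); pron_sim: (batch, seq, 1)
        b, s, c = chars.shape
        char_states, _ = self.char_rnn(self.char_emb(chars.view(b * s, c)))
        char_feat = char_states[:, -1, :].view(b, s, -1)  # last step of the char LSTM
        fused = torch.cat([self.word_emb(words), char_feat,
                           self.pos_emb(pos), pron_sim], dim=-1)
        hidden, _ = self.encoder(fused)
        return self.out(hidden)  # per-token logits
```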
APA, Harvard, Vancouver, ISO, and other styles
35

Shi, Xiayang, Xinyi Liu, Zhenqiang Yu, Pei Cheng, and Chun Xu. "Extracting Parallel Sentences from Low-Resource Language Pairs with Minimal Supervision." Journal of Physics: Conference Series 2171, no. 1 (2022): 012044. http://dx.doi.org/10.1088/1742-6596/2171/1/012044.

Full text
Abstract:
At present, machine translation depends on parallel sentence corpora, and the number of parallel sentences affects translation performance, especially for low-resource corpora. In recent years, using non-parallel corpora to learn cross-lingual word representations, as a low-resource and lightly supervised way to obtain bilingual sentence pairs, has provided a new idea. In this paper, we propose a new method. First, we create cross-domain mappings from a small amount of monolingual data. Then a classifier is constructed to extract bilingual parallel sentence pairs. Finally, we demonstrate the effectiveness of our method on the Uyghur-Chinese low-resource language pair using machine translation, and achieve good results.
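A simplified sketch of the extraction step described above, assuming sentence vectors already live in a shared cross-lingual space: score candidate pairs with simple similarity features and train a lightweight classifier on a handful of labelled pairs. The toy vectors, features, and threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def pair_features(src, tgt):
    # Cosine similarity in the shared space plus a norm-ratio feature.
    return [cosine(src, tgt), np.linalg.norm(src) / (np.linalg.norm(tgt) + 1e-9)]

# Toy data: parallel pairs are near-duplicates of the source vectors, noise pairs are not.
src = rng.normal(size=(20, 50))
tgt_parallel = src + 0.1 * rng.normal(size=src.shape)
tgt_random = rng.normal(size=(20, 50))

X = np.array([pair_features(s, t) for s, t in zip(src, tgt_parallel)]
             + [pair_features(s, t) for s, t in zip(src, tgt_random)])
y = np.array([1] * 20 + [0] * 20)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]
kept = [i for i, p in enumerate(probs) if p > 0.9]  # high-confidence parallel pairs
```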
APA, Harvard, Vancouver, ISO, and other styles
36

Sabouri, Sadra, Elnaz Rahmati, Soroush Gooran, and Hossein Sameti. "naab: A ready-to-use plug-and-play corpus for Farsi." Journal of Artificial Intelligence, Applications, and Innovations 1, no. 2 (2024): 1–8. https://doi.org/10.61838/jaiai.1.2.1.

Full text
Abstract:
The rise of large language models (LLMs) has transformed numerous natural language processing (NLP) tasks, yet their performance in low and mid-resource languages, such as Farsi, still lags behind resource-rich languages like English. To address this gap, we introduce Naab, the largest publicly available, cleaned, and ready-to-use Farsi textual corpus. Naab consists of 130GB of data, comprising over 250 million paragraphs and 15 billion words. Named after the Farsi word ناب (meaning "pure" or "high-grade"), this corpus is openly accessible via Hugging Face, offering researchers a valuable resource for Farsi NLP tasks. In addition to naab, we provide naab-raw, an unprocessed version of the dataset, along with a pre-processing toolkit that allows users to clean their custom corpora. These resources empower NLP researchers and practitioners, particularly those focusing on low-resource languages, to improve the performance of LLMs in their respective domains and bridge the gap between resource-rich and resource-poor languages.
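For readers who want a starting point, a short sketch of streaming such a corpus for pretraining; the Hugging Face dataset identifier ("SLPL/naab") and the "text" field name are assumptions inferred from the abstract, not verified here.

```python
from datasets import load_dataset

# Stream the corpus so the full 130GB never has to fit in memory.
corpus = load_dataset("SLPL/naab", split="train", streaming=True)  # ID assumed

for i, record in enumerate(corpus):
    paragraph = record["text"]  # field name assumed
    # ... feed `paragraph` into a tokenizer / LM pretraining pipeline ...
    if i >= 2:
        break
```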
APA, Harvard, Vancouver, ISO, and other styles
37

Adjeisah, Michael, Guohua Liu, Douglas Omwenga Nyabuga, Richard Nuetey Nortey, and Jinling Song. "Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation." Computational Intelligence and Neuroscience 2021 (April 11, 2021): 1–10. http://dx.doi.org/10.1155/2021/6682385.

Full text
Abstract:
Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains enigmatic. This research contributes to the domain on a low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often perplexing to learn and understand what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve the MT performance in such low-resource language pairs, we propose to expand the training data by injecting synthetic-parallel corpus obtained by translating a monolingual corpus from the target language based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair engaging squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we extensively use three different sentence-level similarity metrics after round-trip translation. Experimental results on a diverse amount of available parallel corpus demonstrate that injecting pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach exhibits tremendous developments in BLEU and TER scores.
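A small numeric sketch of the filtering idea only: represent each synthetic sentence pair by a feature vector and drop pairs whose squared Mahalanobis distance from the trusted-data distribution exceeds a threshold. The features and cut-off are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def squared_mahalanobis(x, mean, cov_inv):
    d = x - mean
    return float(d @ cov_inv @ d)

rng = np.random.default_rng(1)
clean = rng.normal(loc=0.0, scale=1.0, size=(500, 3))      # trusted parallel pairs
synthetic = rng.normal(loc=0.5, scale=1.5, size=(200, 3))  # back-translated candidates

mean = clean.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(clean, rowvar=False))

threshold = 9.0  # illustrative cut-off on the squared distance
kept = [p for p in synthetic if squared_mahalanobis(p, mean, cov_inv) <= threshold]
print(f"kept {len(kept)} of {len(synthetic)} synthetic pairs")
```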
APA, Harvard, Vancouver, ISO, and other styles
38

Visser, Ruan, Trieko Grobler, and Marcel Dunaiski. "Insights into Low-Resource Language Modelling: Improving Model Performances for South African Languages." JUCS - Journal of Universal Computer Science 30, no. 13 (2024): 1849–71. https://doi.org/10.3897/jucs.118889.

Full text
Abstract:
To address the gap in natural language processing for Southern African languages, our paper presents an in-depth analysis of language model development under resource-constrained conditions. We investigate the interplay between model size, pretraining objectives, and multilingual dataset composition in the context of low-resource languages such as Zulu and Xhosa. In our approach, we initially pretrain language models from scratch on specific low-resource languages using a variety of model configurations, and incrementally add related languages to explore the effect of additional languages on the performance of these models. We demonstrate that smaller data volumes can be effectively leveraged, and that the choice of pretraining objective and multilingual dataset composition significantly influences model performance. Our monolingual and multilingual models exhibit competitive, and in some cases superior, performance compared to established multilingual models such as XLM-R-base and AfroXLM-R-base.
APA, Harvard, Vancouver, ISO, and other styles
40

Xiao, Jingxuan, and Jiawei Wu. "Transfer Learning for Cross-Language Natural Language Processing Models." Journal of Computer Technology and Applied Mathematics 1, no. 3 (2024): 30–38. https://doi.org/10.5281/zenodo.13366733.

Full text
Abstract:
Cross-language natural language processing (NLP) presents numerous challenges due to the wide array of linguistic structures and vocabulary found within each language. Transfer learning has proven successful at meeting these challenges by drawing upon knowledge gained in highly resourced languages to enhance performance in lower-resource ones. This paper investigates the application of transfer learning in cross-language NLP, exploring various methodologies, models, and their efficacy. More specifically, we investigate mechanisms related to model adaptation, fine-tuning techniques, and integration of multilingual data sources. Through experiments and analyses on tasks such as sentiment analysis, named entity recognition, and machine translation across multiple languages, we demonstrate how transfer learning can enhance model performance. Our experiments reveal significant increases in both prediction accuracy and generalization across low-resource languages, providing valuable insight into future research directions as well as global NLP deployment applications.
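A minimal sketch of the fine-tuning pattern this abstract surveys, using a multilingual encoder as the transfer vehicle; the model name, label count, and hyperparameters are illustrative assumptions rather than the authors' setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

# Tiny illustrative batch in a low-resource target language.
texts = ["example sentence one", "example sentence two"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # the model computes the loss internally
outputs.loss.backward()
optimizer.step()
```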
APA, Harvard, Vancouver, ISO, and other styles
41

Supriya, Musica, U. Dinesh Acharya, and Ashalatha Nayak. "Enhancing Neural Machine Translation Quality for Kannada–Tulu Language Pairs through Transformer Architecture: A Linguistic Feature Integration." Designs 8, no. 5 (2024): 100. http://dx.doi.org/10.3390/designs8050100.

Full text
Abstract:
The rise of intelligent systems demands good machine translation models that are less data-hungry and more efficient, especially for low- and extremely-low-resource languages with few or no data available. By integrating a linguistic feature to enhance the quality of translation, we have developed a generic Neural Machine Translation (NMT) model for Kannada–Tulu language pairs. The NMT model uses the Transformer architecture, a state-of-the-art model, for translating text from Kannada to Tulu and learns from parallel data. Kannada and Tulu are both low-resource Dravidian languages, with Tulu recognised as an extremely-low-resource language. Dravidian languages are morphologically rich and highly agglutinative, and only a few NMT models exist for Kannada–Tulu language pairs; these exhibit poor translation scores as they fail to capture the linguistic features of the languages. The proposed generic approach can benefit other low-resource Indic languages that have smaller parallel corpora for NMT tasks. Evaluation metrics like Bilingual Evaluation Understudy (BLEU), character-level F-score (chrF) and Word Error Rate (WER) are used to obtain improved translation scores for the linguistic-feature-embedded NMT model. These results hold promise for further experimentation with other low- and extremely-low-resource language pairs.
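As a pointer for replication, a brief sketch of computing the three metrics the abstract names, assuming the sacrebleu and jiwer packages; the hypothesis and reference strings are placeholders.

```python
import sacrebleu
import jiwer

hypotheses = ["the model translated this sentence"]
references = [["the model translated this sentence correctly"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)   # BLEU
chrf = sacrebleu.corpus_chrf(hypotheses, references)   # character-level F-score
wer = jiwer.wer(references[0][0], hypotheses[0])       # word error rate

print(f"BLEU={bleu.score:.2f}  chrF={chrf.score:.2f}  WER={wer:.2f}")
```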
APA, Harvard, Vancouver, ISO, and other styles
42

V Kadam, Ashlesha. "Natural Language Understanding of Low-Resource Languages in Voice Assistants: Advancements, Challenges and Mitigation Strategies." International Journal of Language, Literature and Culture 3, no. 5 (2023): 20–23. http://dx.doi.org/10.22161/ijllc.3.5.3.

Full text
Abstract:
This paper presents an exploration of low-resource languages and the specific challenges that arise when a voice assistant must understand them. While voice assistants have made significant strides in understanding mainstream languages, this paper focuses on extending that understanding to low-resource languages in order to maintain linguistic diversity and delight customers. The specific nuances of natural language understanding for these low-resource languages are discussed. The paper also proposes techniques to overcome some of the challenges voice assistants face in understanding low-resource languages. The proposed methods and future directions presented in this paper are poised to drive advancements in voice technology and promote inclusivity by ensuring that voice assistants are accessible to speakers of underrepresented languages.
APA, Harvard, Vancouver, ISO, and other styles
43

Zhu, ShaoLin, Xiao Li, YaTing Yang, Lei Wang, and ChengGang Mi. "A Novel Deep Learning Method for Obtaining Bilingual Corpus from Multilingual Website." Mathematical Problems in Engineering 2019 (January 10, 2019): 1–7. http://dx.doi.org/10.1155/2019/7495436.

Full text
Abstract:
Machine translation needs a large number of parallel sentence pairs to achieve good translation performance. However, the lack of parallel corpora heavily limits machine translation for low-resource language pairs. We propose a novel method that combines continuous word embeddings with deep learning to obtain parallel sentences. Since parallel sentences are invaluable for low-resource language pairs, we introduce cross-lingual semantic representations to induce bilingual signals. Our experiments show that we can achieve promising results for low-resource languages without external resources. Finally, we construct a state-of-the-art machine translation system for a low-resource language pair.
APA, Harvard, Vancouver, ISO, and other styles
44

Tela, Abrhalei, Abraham Woubie, and Ville Hautamäki. "Transferring monolingual model to low-resource language: the case of Tigrinya." Applied Computing and Intelligence 4, no. 2 (2024): 184–94. http://dx.doi.org/10.3934/aci.2024011.

Full text
Abstract:
In recent years, transformer models have achieved great success in natural language processing (NLP) tasks. Most current results are achieved using monolingual transformer models, where the model is pre-trained on a single-language unlabelled text corpus and then fine-tuned for the specific downstream task. However, the cost of pre-training a new transformer model is high for most languages. In this work, we propose a cost-effective transfer learning method to adapt a strong source-language model, trained on a large monolingual corpus, to a low-resource language. Using the XLNet language model, we demonstrate competitive performance with mBERT and a pre-trained target-language model on the cross-lingual sentiment (CLS) dataset and on a new sentiment analysis dataset for the low-resource language Tigrinya. With only 10k examples of the Tigrinya sentiment analysis dataset, English XLNet achieved a 78.88% F1-score, outperforming BERT and mBERT by 10% and 7%, respectively. More interestingly, fine-tuning the (English) XLNet model on the CLS dataset showed promising results compared to mBERT and even outperformed mBERT on one dataset of the Japanese language.
APA, Harvard, Vancouver, ISO, and other styles
45

Wu, Yike, Shiwan Zhao, Ying Zhang, Xiaojie Yuan, and Zhong Su. "When Pairs Meet Triplets: Improving Low-Resource Captioning via Multi-Objective Optimization." ACM Transactions on Multimedia Computing, Communications, and Applications 18, no. 3 (2022): 1–20. http://dx.doi.org/10.1145/3492325.

Full text
Abstract:
Image captioning for low-resource languages has attracted much attention recently. Researchers propose to augment the low-resource caption dataset into (image, rich-resource language, and low-resource language) triplets and develop the dual attention mechanism to exploit the existence of triplets in training to improve the performance. However, datasets in triplet form are usually small due to their high collecting cost. On the other hand, there are already many large-scale datasets, which contain one pair from the triplet, such as caption datasets in the rich-resource language and translation datasets from the rich-resource language to the low-resource language. In this article, we revisit the caption-translation pipeline of the translation-based approach to utilize not only the triplet dataset but also large-scale paired datasets in training. The caption-translation pipeline is composed of two models, one caption model of the rich-resource language and one translation model from the rich-resource language to the low-resource language. Unfortunately, it is not trivial to fully benefit from incorporating both the triplet dataset and paired datasets into the pipeline, due to the gap between the training and testing phases and the instability in the training process. We propose to jointly optimize the two models of the pipeline in an end-to-end manner to bridge the training and testing gap, and introduce two auxiliary training objectives to stabilize the training process. Experimental results show that the proposed method improves significantly over the state-of-the-art methods.
APA, Harvard, Vancouver, ISO, and other styles
46

Grönroos, Stig-Arne, Kristiina Jokinen, Katri Hiovain, Mikko Kurimo, and Sami Virpioja. "Low-Resource Active Learning of North Sámi Morphological Segmentation." Septentrio Conference Series, no. 2 (June 17, 2015): 20. http://dx.doi.org/10.7557/5.3465.

Full text
Abstract:
Many Uralic languages have a rich morphological structure but lack the tools of morphological analysis needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation of the North Sámi language with a large unannotated corpus and a small amount of human-annotated word forms selected using an active learning approach. For statistical learning, we use the semi-supervised Morfessor Baseline and FlatCat methods. After annotating 237 words with our active learning setup, we improve morph boundary recall by over 20% with no loss of precision.
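An illustrative sketch of the active-learning selection step only: given any segmentation model's cost function, pick the word forms the model is least certain about for human annotation. The uncertainty proxy and batch size are assumptions, not the authors' exact setup.

```python
def select_for_annotation(words, segment_cost, batch_size=20):
    """Return the words with the highest model cost (lowest confidence),
    i.e. the most informative candidates for manual segmentation."""
    return sorted(words, key=segment_cost, reverse=True)[:batch_size]

# Toy stand-in for a model's segmentation cost (e.g. a coding-length estimate).
def toy_cost(word):
    return 0.7 * len(word) + 0.1 * word.count("a")

unannotated = ["guolli", "beana", "mánná", "boazodoallu", "giella"]
print(select_for_annotation(unannotated, toy_cost, batch_size=2))
```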
APA, Harvard, Vancouver, ISO, and other styles
47

Chaka, Chaka. "Currently Available GenAI-Powered Large Language Models and Low-Resource Languages: Any Offerings? Wait Until You See." International Journal of Learning, Teaching and Educational Research 23, no. 12 (2024): 148–73. https://doi.org/10.26803/ijlter.23.12.9.

Full text
Abstract:
A lot of hype has accompanied the increasing number of generative artificial intelligence-powered large language models (LLMs). Similarly, much has been written about what currently available LLMs can and cannot do, including their benefits and risks, especially in higher education. However, few use cases have investigated the performance and generative capabilities of LLMs in low-resource languages. With this in mind, one of the purposes of the current study was to explore the extent to which seven currently available, free-to-use versions of LLMs (ChatGPT, Claude, Copilot, Gemini, GroqChat, Perplexity, and YouChat) perform in five low-resource languages (isiZulu, Sesotho, Yoruba, Māori, and Mi’kmaq) in their generative multilingual capabilities. Employing a common input prompt, in which the only change was to insert the name of a given low-resource language and English in each case, this study collected its datasets by inputting this common prompt into the seven LLMs. Three of the findings of this study are noteworthy. First, the seven LLMs displayed a significant lack of generative multilingual capabilities in the five low-resource languages. Second, they hallucinated and produced nonsensical, meaningless, and irrelevant responses in their low-resource language outputs. Third, their English responses were far better in quality, relevance, depth, detail, and nuance than their responses in the five low-resource languages. The paper ends by discussing the implications and conclusions of the study regarding LLMs’ generative capabilities in low-resource languages.
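A tiny sketch of the study design described above: one fixed prompt template instantiated per language so that outputs can be compared like-for-like. The template wording and the model-calling function are hypothetical placeholders, not any vendor's API.

```python
# Hypothetical prompt-construction loop; `query_llm` stands in for whichever
# chatbot interface is under evaluation and is not a real API call here.
TEMPLATE = "Write a short paragraph in {language} explaining why rain falls."
LANGUAGES = ["isiZulu", "Sesotho", "Yoruba", "Māori", "Mi'kmaq", "English"]

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in the chatbot client under evaluation")

prompts = {lang: TEMPLATE.format(language=lang) for lang in LANGUAGES}
# responses = {lang: query_llm(p) for lang, p in prompts.items()}
```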
APA, Harvard, Vancouver, ISO, and other styles
48

Murakami, Yohei. "Indonesia Language Sphere: an ecosystem for dictionary development for low-resource languages." Journal of Physics: Conference Series 1192 (March 2019): 012001. http://dx.doi.org/10.1088/1742-6596/1192/1/012001.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Pakray, Partha, Alexander Gelbukh, and Sivaji Bandyopadhyay. "Preface: Special issue on Natural Language Processing applications for low-resource languages." Natural Language Processing 31, no. 2 (2025): 181–82. https://doi.org/10.1017/nlp.2024.34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Chen, Xilun, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. "Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification." Transactions of the Association for Computational Linguistics 6 (December 2018): 557–70. http://dx.doi.org/10.1162/tacl_a_00039.

Full text
Abstract:
In recent years, great success has been achieved in sentiment classification for English, thanks in part to the availability of copious annotated resources. Unfortunately, most languages do not enjoy such an abundance of labeled data. To tackle the sentiment classification problem in low-resource languages without adequate annotated data, we propose an Adversarial Deep Averaging Network (ADAN) to transfer the knowledge learned from labeled data on a resource-rich source language to low-resource languages where only unlabeled data exist. ADAN has two discriminative branches: a sentiment classifier and an adversarial language discriminator. Both branches take input from a shared feature extractor to learn hidden representations that are simultaneously indicative for the classification task and invariant across languages. Experiments on Chinese and Arabic sentiment classification demonstrate that ADAN significantly outperforms state-of-the-art systems.
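A condensed sketch of the adversarial structure this abstract describes: a shared feature extractor feeding a sentiment classifier and a language discriminator, with a gradient-reversal layer so the extractor learns language-invariant features. Layer sizes and the averaging step are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class ADANSketch(nn.Module):
    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        # Deep-averaging feature extractor over (bilingual) word embeddings.
        self.extractor = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden), nn.ReLU())
        self.sentiment = nn.Linear(hidden, 2)   # positive / negative
        self.language = nn.Linear(hidden, 2)    # source / target language

    def forward(self, avg_embeddings, lambd=1.0):
        feats = self.extractor(avg_embeddings)
        sent_logits = self.sentiment(feats)
        lang_logits = self.language(GradReverse.apply(feats, lambd))
        return sent_logits, lang_logits

# Usage: average pre-trained bilingual word embeddings per document, train the
# sentiment head on labelled source-language data and the language head on both.
model = ADANSketch()
doc_vectors = torch.randn(4, 300)
sent_logits, lang_logits = model(doc_vectors)
```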
APA, Harvard, Vancouver, ISO, and other styles