
Journal articles on the topic 'Language models'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 journal articles for your research on the topic 'Language models.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Li, Hang. "Language models." Communications of the ACM 65, no. 7 (2022): 56–63. http://dx.doi.org/10.1145/3490443.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Begnarovich, Uralov Azamat. "The Inconsistency Of Language Models." American Journal of Social Science and Education Innovations 03, no. 09 (2021): 39–44. http://dx.doi.org/10.37547/tajssei/volume03issue09-09.

Full text
Abstract:
The article deals with the problem of disproportion in the morpheme units and patterns of linguistics. On the basis of this disproportion, information is given on the combined affixes formed in morphemes, the expanded forms, and the analytic and synthetic forms. The data draw on the views of the world's leading linguists, and the ideas are supported with examples. The formation of a particular linguistic model reflects a disproportion in the language system (meaning, function, and methodological features): confusion of meanings, multifunctionality, semantics, and competition in the use of forms (one form gains more and more privileges while another form becomes archaic).
APA, Harvard, Vancouver, ISO, and other styles
3

Shimi, G., C. Jerin Mahibha, and Durairaj Thenmozhi. "An Empirical Analysis of Language Detection in Dravidian Languages." Indian Journal Of Science And Technology 17, no. 15 (2024): 1515–26. http://dx.doi.org/10.17485/ijst/v17i15.765.

Full text
Abstract:
Objectives: Language detection is the process of identifying the language associated with a text. The proposed system aims to detect the Dravidian language associated with a given text using different machine learning and deep learning algorithms. The paper presents an empirical analysis of the results obtained using the different models. It also aims to evaluate the performance of a language-agnostic model for the purpose of language detection. Method: An empirical analysis of Dravidian language identification in social media text using machine learning and deep learning approaches with k-fold cross-validation has been implemented. The identification of Dravidian languages, including Tamil, Malayalam, Tamil code-mix, and Malayalam code-mix, is performed using both machine learning (ML) and deep learning algorithms. The machine learning algorithms used for language detection are Naive Bayes (NB), Multinomial Logistic Regression (MLR), Support Vector Machine (SVM), and Random Forest (RF). The supervised deep learning (DL) models used include BERT, mBERT, and language-agnostic models. Findings: The language-agnostic model outperforms all other models on the task of language detection in Dravidian languages. The results of both the ML and DL models are analyzed empirically with performance measures such as accuracy, precision, recall, and F1-score. The accuracy of the different machine learning algorithms varies from 85% to 89%. It is evident from the experimental results that the deep learning model outperformed them with an accuracy of 98%. Novelty: The proposed system emphasizes the use of a language-agnostic model for detecting the Dravidian language associated with a given text, achieving a promising accuracy of 98%, which is higher than existing methodologies. Keywords: Language, Machine learning, Deep learning, Transformer model, Encoder, Decoder
APA, Harvard, Vancouver, ISO, and other styles
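For readers who want a concrete picture of the classical baselines named in the abstract of entry 3 (Naive Bayes over text features), here is a minimal, hypothetical sketch of a character n-gram language-identification pipeline in scikit-learn. The toy samples and labels are invented placeholders, not the dataset or code of the cited study.

# Minimal character n-gram language-identification baseline (illustrative sketch only;
# the toy samples below are NOT the corpus used in the cited study).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["vanakkam nanba", "namaskaram suhruthe", "hello friend", "vanakkam da machi"]  # hypothetical posts
labels = ["tamil", "malayalam", "english", "tamil_codemix"]                             # hypothetical labels

# Character n-grams are robust to code-mixing and spelling variation.
model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("nb", MultinomialNB()),
])
model.fit(texts, labels)

print(model.predict(["namaskaram"]))  # should lean toward the Malayalam-like label

A deep learning or language-agnostic transformer model, as evaluated in the paper, would replace the TF-IDF features with learned multilingual sentence representations.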
4

Mezzoudj, Freha, and Abdelkader Benyettou. "An empirical study of statistical language models: n-gram language models vs. neural network language models." International Journal of Innovative Computing and Applications 9, no. 4 (2018): 189. http://dx.doi.org/10.1504/ijica.2018.095762.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Mezzoudj, Freha, and Abdelkader Benyettou. "An empirical study of statistical language models: n-gram language models vs. neural network language models." International Journal of Innovative Computing and Applications 9, no. 4 (2018): 189. http://dx.doi.org/10.1504/ijica.2018.10016827.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Babb, Robert G. "Language and models." ACM SIGSOFT Software Engineering Notes 13, no. 1 (1988): 43–45. http://dx.doi.org/10.1145/43857.43872.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Liu, X., M. J. F. Gales, and P. C. Woodland. "Paraphrastic language models." Computer Speech & Language 28, no. 6 (2014): 1298–316. http://dx.doi.org/10.1016/j.csl.2014.04.004.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Cerf, Vinton G. "Large Language Models." Communications of the ACM 66, no. 8 (2023): 7. http://dx.doi.org/10.1145/3606337.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Nederhof, Mark-Jan. "A General Technique to Train Language Models on Language Models." Computational Linguistics 31, no. 2 (2005): 173–85. http://dx.doi.org/10.1162/0891201054223986.

Full text
Abstract:
We show that under certain conditions, a language model can be trained on the basis of a second language model. The main instance of the technique trains a finite automaton on the basis of a probabilistic context-free grammar, such that the Kullback-Leibler distance between grammar and trained automaton is provably minimal. This is a substantial generalization of an existing algorithm to train an n-gram model on the basis of a probabilistic context-free grammar.
APA, Harvard, Vancouver, ISO, and other styles
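As a point of reference for the objective described in the abstract of entry 9, the Kullback-Leibler distance between the distribution p_G defined by the probabilistic context-free grammar and the distribution p_A defined by the trained automaton can be written, in standard textbook notation rather than the paper's own, as

    D_{\mathrm{KL}}(p_G \parallel p_A) = \sum_{w \in \Sigma^{*}} p_G(w) \, \log \frac{p_G(w)}{p_A(w)}.

Since the term \sum_{w} p_G(w) \log p_G(w) does not depend on the automaton, minimizing this distance amounts to choosing the automaton's transition probabilities to maximize the grammar-expected log-likelihood \sum_{w} p_G(w) \log p_A(w).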
10

Veres, Csaba. "Large Language Models are Not Models of Natural Language: They are Corpus Models." IEEE Access 10 (2022): 61970–79. http://dx.doi.org/10.1109/access.2022.3182505.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Xiao, Jingxuan, and Jiawei Wu. "Transfer Learning for Cross-Language Natural Language Processing Models." Journal of Computer Technology and Applied Mathematics 1, no. 3 (2024): 30–38. https://doi.org/10.5281/zenodo.13366733.

Full text
Abstract:
Cross-language natural language processing (NLP) presents numerous challenges due to the wide array of linguistic structures and vocabulary found within each language. Transfer learning has proven itself successful at meeting these challenges by drawing upon knowledge gained in highly resourced languages to enhance performance in lower resource ones. This paper investigates the application of transfer learning in cross-language NLP, exploring various methodologies, models and their efficacy. More specifically, we investigate mechanisms related to model adaptation, fine-tuning techniques and integration of multilingual data sources. Through experiments and analyses on tasks such as sentiment analysis, named entity recognition and machine translation across multiple languages, we demonstrate how transfer learning can enhance model performance. Our experiments reveal significant increases in both prediction accuracy and generalization across low-resource languages - providing valuable insight into future research directions as well as global NLP deployment applications.
APA, Harvard, Vancouver, ISO, and other styles
12

Alam, Salman. "Comparison of Various Models in the Context of Language Identification (Indo Aryan Languages)." International Journal of Science and Research (IJSR) 10, no. 3 (2021): 185–88. https://doi.org/10.21275/sr21303115028.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Amusa, Kamoli Akinwale, Tolulope Christiana Erinosho, Olufunke Olubusola Nuga, and Abdulmatin Olalekan Omotoso. "YorubaAI: Bridging Language Barrier with Advanced Language Models." Journal of Applied Artificial Intelligence 6, no. 1 (2025): 39–52. https://doi.org/10.48185/jaai.v6i1.1474.

Full text
Abstract:
YorubaAI addresses the digital divide caused by language barriers, particularly for Yoruba speakers who struggle to interact with advanced large language models (LLMs) like GPT-4, which primarily support high-resource languages. This study develops a system, named YorubaAI, for seamless communication with LLMs in the Yoruba language. YorubaAI enables users to input and receive responses in Yoruba, in both text and audio formats. To achieve this, a speech-to-text (STT) model is fine-tuned for automatic Yoruba speech recognition, while a text-to-speech (TTS) model is employed to convert Yoruba text into its spoken equivalent. Direct communication with an LLM in low-resource languages like Yoruba typically yields poor results. To prevent this, retrieval-augmented generation (RAG) is used to augment the LLM's existing knowledge with additional information. The RAG component is built by creating a database of questions and answers in Yoruba. This database serves as the primary knowledge base from which YorubaAI retrieves information relevant to the question asked. The contents of the question-and-answer database are converted into vector embeddings using Google's Language-Agnostic BERT Sentence Embedding (LaBSE) model, yielding numerical representations that capture the semantic meaning of the texts. The embeddings generated from the Yoruba question database are stored in a vector store and are essential for efficient search and retrieval. The two models (STT and TTS) were integrated with an LLM through a user-friendly interface built with the Gradio framework. The STT model achieved a word error rate of 13.06%, while the TTS model generated natural-sounding Yoruba speech. YorubaAI responded correctly to various queries in pure Yoruba syntax and thus successfully bridges the AI accessibility gap for Yoruba speakers.
APA, Harvard, Vancouver, ISO, and other styles
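To illustrate the retrieval step described in the abstract of entry 13, the sketch below embeds a toy Yoruba question-and-answer list with the LaBSE model from the sentence-transformers library and retrieves the closest stored question by cosine similarity. The example strings and the structure of the knowledge base are invented for illustration; this is not the YorubaAI codebase.

# Illustrative retrieval step with LaBSE embeddings
# (a sketch under stated assumptions, not the YorubaAI implementation).
from sentence_transformers import SentenceTransformer, util

# Hypothetical knowledge base of Yoruba question/answer pairs.
qa_pairs = [
    ("Kini oruko olu ilu Naijiria?", "Abuja ni olu ilu Naijiria."),
    ("Melo ni ojo ti o wa ninu ose kan?", "Ojo meje lo wa ninu ose kan."),
]

model = SentenceTransformer("sentence-transformers/LaBSE")
question_embeddings = model.encode([q for q, _ in qa_pairs], convert_to_tensor=True)

def retrieve(user_question: str) -> str:
    """Return the stored answer whose question is semantically closest to the query."""
    query_embedding = model.encode(user_question, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, question_embeddings)[0]
    best = int(scores.argmax())
    return qa_pairs[best][1]

print(retrieve("Kini oruko olu ilu orile-ede Naijiria?"))

In a full RAG pipeline the retrieved passage would then be placed into the LLM prompt as context before generating the final Yoruba answer.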
14

Mitra, Arijita, Nasim Ahmed, Payel Pramanik, and Sayantan Nandi. "Language Studies and Communication Models." International Journal of English Learning & Teaching Skills 3, no. 1 (2020): 1776–94. http://dx.doi.org/10.15864/ijelts.3110.

Full text
Abstract:
Language study and communication are very important and are used constantly in our daily lives. Learning a language is not just about grammar; it means learning expressions and learning about people and their culture. Language supplies the words when communication is verbal or written, so we can conclude that language is a method of communication. The aim of taking up this topic was to highlight the importance of communication in our lives, which can be achieved through the knowledge acquired by studying languages. Shaping one's ideas into reality requires the proper transmission of those ideas, which is where communication comes in. Moreover, it is now very important for every child to learn language properly and to communicate effectively in order to succeed in later life and attain prestigious positions.
APA, Harvard, Vancouver, ISO, and other styles
15

Liu, Xunying, James L. Hieronymus, Mark J. F. Gales, and Philip C. Woodland. "Syllable language models for Mandarin speech recognition: Exploiting character language models." Journal of the Acoustical Society of America 133, no. 1 (2013): 519–28. http://dx.doi.org/10.1121/1.4768800.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

De Coster, Mathieu, and Joni Dambre. "Leveraging Frozen Pretrained Written Language Models for Neural Sign Language Translation." Information 13, no. 5 (2022): 220. http://dx.doi.org/10.3390/info13050220.

Full text
Abstract:
We consider neural sign language translation: machine translation from signed to written languages using encoder–decoder neural networks. Translating sign language videos to written language text is especially complex because of the difference in modality between source and target language and, consequently, the required video processing. At the same time, sign languages are low-resource languages, their datasets dwarfed by those available for written languages. Recent advances in written language processing and success stories of transfer learning raise the question of how pretrained written language models can be leveraged to improve sign language translation. We apply the Frozen Pretrained Transformer (FPT) technique to initialize the encoder, decoder, or both, of a sign language translation model with parts of a pretrained written language model. We observe that the attention patterns transfer in zero-shot to the different modality and, in some experiments, we obtain higher scores (from 18.85 to 21.39 BLEU-4). Especially when gloss annotations are unavailable, FPTs can increase performance on unseen data. However, current models appear to be limited primarily by data quality and only then by data quantity, limiting potential gains with FPTs. Therefore, in further research, we will focus on improving the representations used as inputs to translation models.
APA, Harvard, Vancouver, ISO, and other styles
17

O'Rourke, Bernadette. "Language Revitalisation Models in Minority Language Contexts." Anthropological Journal of European Cultures 24, no. 1 (2015): 63–82. http://dx.doi.org/10.3167/ajec.2015.240105.

Full text
Abstract:
This article looks at the historicisation of the native speaker and ideologies of authenticity and anonymity in Europe's language revitalisation movements. It focuses specifically on the case of Irish in the Republic of Ireland and examines how the native speaker ideology and the opposing ideological constructs of authenticity and anonymity filter down to the belief systems and are discursively produced by social actors on the ground. For this I draw on data from ongoing fieldwork in the Republic of Ireland, drawing on interviews with a group of Irish language enthusiasts located outside the officially designated Irish-speaking Gaeltacht.
APA, Harvard, Vancouver, ISO, and other styles
18

G, Sajini. "Computational Evaluation of Language Models by Considering Various Scaling Properties for Processing Natural Languages." Journal of Advanced Research in Dynamical and Control Systems 12, SP7 (2020): 691–700. http://dx.doi.org/10.5373/jardcs/v12sp7/20202159.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Wu, Zhaofeng, William Merrill, Hao Peng, Iz Beltagy, and Noah A. Smith. "Transparency Helps Reveal When Language Models Learn Meaning." Transactions of the Association for Computational Linguistics 11 (2023): 617–34. http://dx.doi.org/10.1162/tacl_a_00565.

Full text
Abstract:
Many current NLP systems are built from language models trained to optimize unsupervised objectives on large amounts of raw text. Under what conditions might such a procedure acquire meaning? Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations (i.e., languages with strong transparency), both autoregressive and masked language models successfully learn to emulate semantic relations between expressions. However, when denotations are changed to be context-dependent with the language otherwise unmodified, this ability degrades. Turning to natural language, our experiments with a specific phenomenon—referential opacity—add to the growing body of evidence that current language models do not represent natural language semantics well. We show this failure relates to the context-dependent nature of natural language form-meaning mappings.
APA, Harvard, Vancouver, ISO, and other styles
20

Hayashi, Hiroaki, Zecong Hu, Chenyan Xiong, and Graham Neubig. "Latent Relation Language Models." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 7911–18. http://dx.doi.org/10.1609/aaai.v34i05.6298.

Full text
Abstract:
In this paper, we propose Latent Relation Language Models (LRLMs), a class of language models that parameterizes the joint distribution over the words in a document and the entities that occur therein via knowledge graph relations. This model has a number of attractive properties: it not only improves language modeling performance, but is also able to annotate the posterior probability of entity spans for a given text through relations. Experiments demonstrate empirical improvements over both word-based language models and a previous approach that incorporates knowledge graph information. Qualitative analysis further demonstrates the proposed model's ability to learn to predict appropriate relations in context.
APA, Harvard, Vancouver, ISO, and other styles
21

Tsvetkov, Yulia, and Chris Dyer. "Cross-Lingual Bridges with Models of Lexical Borrowing." Journal of Artificial Intelligence Research 55 (January 13, 2016): 63–93. http://dx.doi.org/10.1613/jair.4786.

Full text
Abstract:
Linguistic borrowing is the phenomenon of transferring linguistic constructions (lexical, phonological, morphological, and syntactic) from a “donor” language to a “recipient” language as a result of contacts between communities speaking different languages. Borrowed words are found in all languages, and—in contrast to cognate relationships—borrowing relationships may exist across unrelated languages (for example, about 40% of Swahili’s vocabulary is borrowed from the unrelated language Arabic). In this work, we develop a model of morpho-phonological transformations across languages. Its features are based on universal constraints from Optimality Theory (OT), and we show that compared to several standard—but linguistically more naïve—baselines, our OT-inspired model obtains good performance at predicting donor forms from borrowed forms with only a few dozen training examples, making this a cost-effective strategy for sharing lexical information across languages. We demonstrate applications of the lexical borrowing model in machine translation, using resource-rich donor language to obtain translations of out-of-vocabulary loanwords in a lower resource language. Our framework obtains substantial improvements (up to 1.6 BLEU) over standard baselines.
APA, Harvard, Vancouver, ISO, and other styles
22

Yogatama, Dani, Cyprien de Masson d’Autume, and Lingpeng Kong. "Adaptive Semiparametric Language Models." Transactions of the Association for Computational Linguistics 9 (2021): 362–73. http://dx.doi.org/10.1162/tacl_a_00371.

Full text
Abstract:
We present a language model that combines a large parametric neural network (i.e., a transformer) with a non-parametric episodic memory component in an integrated architecture. Our model uses extended short-term context by caching local hidden states—similar to transformer-XL—and global long-term memory by retrieving a set of nearest neighbor tokens at each timestep. We design a gating function to adaptively combine multiple information sources to make a prediction. This mechanism allows the model to use either local context, short-term memory, or long-term memory (or any combination of them) on an ad hoc basis depending on the context. Experiments on word-based and character-based language modeling datasets demonstrate the efficacy of our proposed method compared to strong baselines.
APA, Harvard, Vancouver, ISO, and other styles
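The gating mechanism described in the abstract of entry 22 can be pictured, in simplified form, as a learned convex combination of three next-token distributions. The sketch below is a schematic restatement under assumed names and shapes, not the authors' architecture or code.

# Schematic gating over three information sources (illustrative only).
import torch
import torch.nn.functional as F

vocab_size, hidden_size = 1000, 64
gate = torch.nn.Linear(hidden_size, 3)  # hypothetical gating network: one weight per source

def combine(hidden, p_local, p_cache, p_knn):
    """Mix local-context, short-term-cache, and long-term-memory distributions."""
    weights = F.softmax(gate(hidden), dim=-1)     # convex weights over the three sources
    return (weights[..., 0:1] * p_local
            + weights[..., 1:2] * p_cache
            + weights[..., 2:3] * p_knn)          # result is still a valid distribution

hidden = torch.randn(hidden_size)
p_local, p_cache, p_knn = (F.softmax(torch.randn(vocab_size), dim=-1) for _ in range(3))
p_next = combine(hidden, p_local, p_cache, p_knn)
print(p_next.sum())  # approximately 1.0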
23

Cockburn, Alexander, and Noam Chomsky. "Models, Nature, and Language." Grand Street, no. 50 (1994): 170. http://dx.doi.org/10.2307/25007794.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Lavrenko, Victor, and W. Bruce Croft. "Relevance-Based Language Models." ACM SIGIR Forum 51, no. 2 (2017): 260–67. http://dx.doi.org/10.1145/3130348.3130376.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Buckman, Jacob, and Graham Neubig. "Neural Lattice Language Models." Transactions of the Association for Computational Linguistics 6 (December 2018): 529–41. http://dx.doi.org/10.1162/tacl_a_00036.

Full text
Abstract:
In this work, we propose a new language modeling paradigm that has the ability to perform both prediction and moderation of information flow at multiple granularities: neural lattice language models. These models construct a lattice of possible paths through a sentence and marginalize across this lattice to calculate sequence probabilities or optimize parameters. This approach allows us to seamlessly incorporate linguistic intuitions — including polysemy and the existence of multiword lexical items — into our language model. Experiments on multiple language modeling tasks show that English neural lattice language models that utilize polysemous embeddings are able to improve perplexity by 9.95% relative to a word-level baseline, and that a Chinese model that handles multi-character tokens is able to improve perplexity by 20.94% relative to a character-level baseline.
APA, Harvard, Vancouver, ISO, and other styles
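The marginalization described in the abstract of entry 25 can be summarized, in generic notation rather than the paper's own, as

    P(s) = \sum_{\pi \in \Pi(s)} P(\pi),

where \Pi(s) is the set of lattice paths (for example, alternative segmentations into multiword tokens or alternative word senses) that realize the sentence s, and each path probability P(\pi) factorizes autoregressively over the items on the path. Training maximizes this marginal probability rather than the probability of any single path.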
26

Varona, A., and I. Torres. "Scaling Smoothed Language Models." International Journal of Speech Technology 8, no. 4 (2005): 341–61. http://dx.doi.org/10.1007/s10772-006-9047-5.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Schwenk, Holger. "Continuous space language models." Computer Speech & Language 21, no. 3 (2007): 492–518. http://dx.doi.org/10.1016/j.csl.2006.09.003.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Bengio, Yoshua. "Neural net language models." Scholarpedia 3, no. 1 (2008): 3881. http://dx.doi.org/10.4249/scholarpedia.3881.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Schütze, Hinrich, and Michael Walsh. "Half-Context Language Models." Computational Linguistics 37, no. 4 (2011): 843–65. http://dx.doi.org/10.1162/coli_a_00078.

Full text
Abstract:
This article investigates the effects of different degrees of contextual granularity on language model performance. It presents a new language model that combines clustering and half-contextualization, a novel representation of contexts. Half-contextualization is based on the half-context hypothesis that states that the distributional characteristics of a word or bigram are best represented by treating its context distribution to the left and right separately and that only directionally relevant distributional information should be used. Clustering is achieved using a new clustering algorithm for class-based language models that compares favorably to the exchange algorithm. When interpolated with a Kneser-Ney model, half-context models are shown to have better perplexity than commonly used interpolated n-gram models and traditional class-based approaches. A novel, fine-grained, context-specific analysis highlights those contexts in which the model performs well and those which are better treated by existing non-class-based models.
APA, Harvard, Vancouver, ISO, and other styles
30

Fan, Ju, Zihui Gu, Songyue Zhang, et al. "Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL." Proceedings of the VLDB Endowment 17, no. 11 (2024): 2750–63. http://dx.doi.org/10.14778/3681954.3681960.

Full text
Abstract:
Zero-shot natural language to SQL (NL2SQL) aims to generalize pretrained NL2SQL models to new environments (e.g., new databases and new linguistic phenomena) without any annotated NL2SQL samples from these environments. Existing approaches either use small language models (SLMs) like BART and T5, or prompt large language models (LLMs). However, SLMs may struggle with complex natural language reasoning, and LLMs may not precisely align schemas to identify the correct columns or tables. In this paper, we propose a ZeroNL2SQL framework, which divides NL2SQL into smaller sub-tasks and utilizes both SLMs and LLMs. ZeroNL2SQL first fine-tunes SLMs for better generalizability in SQL structure identification and schema alignment, producing an SQL sketch. It then uses LLMs' language reasoning capability to fill in the missing information in the SQL sketch. To support ZeroNL2SQL, we propose novel database serialization and question-aware alignment methods for effective sketch generation using SLMs. Additionally, we devise a multi-level matching strategy to recommend the most relevant values to LLMs, and select the optimal SQL query via an execution-based strategy. Comprehensive experiments show that ZeroNL2SQL achieves the best zero-shot NL2SQL performance on benchmarks, i.e., outperforming the state-of-the-art SLM-based methods by 5.5% to 16.4% and exceeding LLM-based methods by 10% to 20% on execution accuracy.
APA, Harvard, Vancouver, ISO, and other styles
31

Oh, Jiun, and Yong-Suk Choi. "Reusing Monolingual Pre-Trained Models by Cross-Connecting Seq2seq Models for Machine Translation." Applied Sciences 11, no. 18 (2021): 8737. http://dx.doi.org/10.3390/app11188737.

Full text
Abstract:
This work uses sequence-to-sequence (seq2seq) models pre-trained on monolingual corpora for machine translation. We pre-train two seq2seq models with monolingual corpora for the source and target languages, then combine the encoder of the source-language model and the decoder of the target-language model, i.e., the cross-connection. We add an intermediate layer between the pre-trained encoder and decoder to help them map to each other, since the modules are pre-trained completely independently. These monolingual pre-trained models can work as a multilingual pre-trained model, because one model can be cross-connected with another model pre-trained on any other language, while their capacity is not affected by the number of languages. We demonstrate that our method improves translation performance significantly over the random baseline. Moreover, we analyze the appropriate choice of the intermediate layer, the importance of each part of a pre-trained model, and how performance changes with the size of the bitext.
APA, Harvard, Vancouver, ISO, and other styles
32

Dong, Li. "Learning natural language interfaces with neural models." AI Matters 7, no. 2 (2021): 14–17. http://dx.doi.org/10.1145/3478369.3478375.

Full text
Abstract:
Language is the primary and most natural means of communication for humans. The learning curve of interacting with various services (e.g., digital assistants, and smart appliances) would be greatly reduced if we could talk to machines using human language. However, in most cases computers can only interpret and execute formal languages.
APA, Harvard, Vancouver, ISO, and other styles
33

Sharma Shria Verma, Dhananjai. "Automated Penetration Testing using Large Language Models." International Journal of Science and Research (IJSR) 13, no. 4 (2024): 1826–31. http://dx.doi.org/10.21275/sr24427043741.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

ALAHYANE, Latifa Mohamed. "APPLIED LINGUISTIC APPROACH TO TEACHING FOREIGN LANGUAGES THEORETICAL AND METHODOLOGICAL MODELS." International Journal of Humanities and Educational Research 03, no. 05 (2021): 371–92. http://dx.doi.org/10.47832/2757-5403.5-3.32.

Full text
Abstract:
The theoretical work in the field of foreign language learning in the 1950s and early 1960s remained tied to the practical side of language teaching. Moreover, the idea that foreign language teaching methodologies need a theory of learning has remained constant since the educational reform movements of the late nineteenth century. To come to terms with current developments in the field of foreign language learning, it is necessary to trace the recent history of the research carried out in this regard. We therefore focus in this article on tracking the most important theoretical foundations of foreign language teaching methods and on monitoring the evolution of language teaching and learning methods. This is done to distinguish between two approaches to language teaching: first, direct teaching, which excludes any overlap between the learned and the acquired language during foreign language instruction; and second, mediated teaching, in which the second language is taught through the first language. Through this, we monitor the cognitive cross-fertilization between acquiring the first language and learning the second by tracing the relationship between them. We list the most important assumptions underpinning approaches to foreign language teaching, and we examine the foundations on which each approach is based in order to identify their commonalities and contrasts. We then contribute to building a new conception of foreign language learning by making use of the act of translation inherent in the procedures adopted in most of these approaches. This is mainly evident in the difference over whether or not the first language should be adopted during the teaching and learning of the foreign language. Keywords: Applied Linguistics, First Language Acquisition, Approaches to Teaching Foreign Languages, Direct Teaching, Mediated Teaching
APA, Harvard, Vancouver, ISO, and other styles
35

Mozafari, Marzieh, Khouloud Mnassri, Reza Farahbakhsh, and Noel Crespi. "Offensive language detection in low resource languages: A use case of Persian language." PLOS ONE 19, no. 6 (2024): e0304166. http://dx.doi.org/10.1371/journal.pone.0304166.

Full text
Abstract:
THIS ARTICLE USES WORDS OR LANGUAGE THAT IS CONSIDERED PROFANE, VULGAR, OR OFFENSIVE BY SOME READERS. Different types of abusive content, such as offensive language, hate speech, and aggression, have become prevalent in social media, and many efforts have been dedicated to automatically detecting this phenomenon in resource-rich languages such as English, while low-resource languages have received far less attention. This is mainly due to the comparative lack of annotated data related to offensive language in low-resource languages, especially those spoken in Asian countries. To reduce the vulnerability of social media users in these regions, it is crucial to address the problem of offensive language in such low-resource languages. Hence, we present a new corpus of Persian offensive language consisting of 6,000 out of 520,000 randomly sampled micro-blog posts from X (Twitter) to deal with offensive language detection in Persian as a low-resource language in this area. We introduce a method for creating the corpus and annotating it according to the annotation practices of recent efforts on benchmark datasets in other languages, which results in categorizing offensive language and the target of the offense as well. We perform extensive experiments with three classifiers at different levels of annotation, using a number of classical machine learning (ML), deep learning (DL), and transformer-based neural network models, including monolingual and multilingual pre-trained language models. Furthermore, we propose an ensemble model integrating the aforementioned models to boost the performance of our offensive language detection task. Initial results on single models indicate that SVMs trained on character or word n-grams are the best-performing models, alongside the monolingual transformer-based pre-trained language model ParsBERT, in identifying offensive vs. non-offensive content, targeted vs. untargeted offense, and offense directed at an individual or a group. In addition, the stacking ensemble model outperforms the single models by a substantial margin, obtaining a 5% macro F1-score improvement at each of the three levels of annotation.
APA, Harvard, Vancouver, ISO, and other styles
36

Kalikova, Anna Mikhailovna, Maria Vladimirovna Volkova, Zulfia Kapizovna Tastemirova, Julia Evgenievna Bespalova, and Olga Borisovna Bagrintseva. "Structural differences of syntactic models in Russian and Chinese." SHS Web of Conferences 164 (2023): 00083. http://dx.doi.org/10.1051/shsconf/202316400083.

Full text
Abstract:
Over many centuries of scholarship, linguistic researchers and philosophers have tried to establish a rigorous definition of linguistic concepts. Theories of formal generative grammar provide the opportunity to build logically grounded abstract models. Such models allow the typological features of Chinese and Russian syntactic structures to be reflected in the most accessible way. According to the morphological classification of languages, the world's languages are divided into four morphological groups: a) inflectional languages, b) agglutinative languages, c) isolating languages, and d) incorporating languages. Chinese belongs to the group of isolating languages, as it makes little use of morphological inflection and relies heavily on word order within a sentence. The morphological structure of Russian, which belongs to the inflectional group, is opposed to it. The current paper aims to present the typological differences between the two languages in terms of generative linguistics. The research produces two typological schemes of the structural differences between the analyzed languages, and the analysis is supported by examples of language usage from both linguistic cultures.
APA, Harvard, Vancouver, ISO, and other styles
37

Mukhamadiyev, Abdinabi, Mukhriddin Mukhiddinov, Ilyos Khujayarov, Mannon Ochilov, and Jinsoo Cho. "Development of Language Models for Continuous Uzbek Speech Recognition System." Sensors 23, no. 3 (2023): 1145. http://dx.doi.org/10.3390/s23031145.

Full text
Abstract:
Automatic speech recognition systems with a large vocabulary and other natural language processing applications cannot operate without a language model. Most studies on pre-trained language models have focused on more popular languages such as English, Chinese, and various European languages, but there is no publicly available Uzbek speech dataset. Therefore, language models for low-resource languages need to be studied and created. The objective of this study is to address this limitation by developing a low-resource language model for the Uzbek language and understanding its linguistic characteristics. We propose an Uzbek language model named UzLM by examining the performance of statistical and neural-network-based language models that account for the unique features of the Uzbek language. Our Uzbek-specific linguistic representation allows us to construct a more robust UzLM, utilizing 80 million words from various sources while using the same number of or fewer training words than applied in previous studies. Roughly sixty-eight thousand different words and 15 million sentences were collected for the creation of this corpus. The experimental results of our tests on the continuous recognition of Uzbek speech show that, compared with manual encoding, the use of neural-network-based language models reduced the character error rate to 5.26%.
APA, Harvard, Vancouver, ISO, and other styles
38

Dorado, Rubén. "Statistical models for languaje representation." Revista Ontare 1, no. 1 (2015): 29. http://dx.doi.org/10.21158/23823399.v1.n1.2013.1208.

Full text
Abstract:
This paper discusses several models for the computational representation of language. First, some n-gram models based on Markov models are introduced. Second, a family of models known as exponential models is considered; this family in particular allows several features to be incorporated into the model. Third, a recent line of research, the probabilistic Bayesian approach, is discussed. In this kind of model, language is modeled as a probability distribution, and several distributions and stochastic processes, such as the Dirichlet distribution and the Pitman-Yor process, are used to approximate linguistic phenomena. Finally, the problem of the sparseness of language data and its common solution, known as smoothing, is discussed.
APA, Harvard, Vancouver, ISO, and other styles
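As a concrete reminder of the n-gram-with-smoothing idea surveyed in the abstract of entry 38, a bigram model with add-one (Laplace) smoothing estimates, in standard textbook formulation rather than the cited paper's notation,

    P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + |V|},

where c(\cdot) counts occurrences in the training corpus and |V| is the vocabulary size. Smoothing redistributes probability mass to unseen bigrams, which is exactly the sparseness problem the abstract refers to.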
39

Murray, Robert W. "On Models of Syllable Division." Revue québécoise de linguistique 18, no. 2 (2009): 151–69. http://dx.doi.org/10.7202/602657ar.

Full text
Abstract:
Picard (1983, 1987b) claims that his model of syllable division predicts the placement of a syllable boundary in any given sequence of segments for a particular language. In this article, I show that this model is inadequate in three ways: a) it does not take into consideration language-specific differences in syllable structure, particularly for sequences of the type VPLV and VPGV (where P = plosive, L = liquid, and G = glide), which can be syllabified V$PLV / V$PGV or VP$LV / VP$GV depending on language-specific factors; b) it fails to predict the correct placement of syllable boundaries for certain languages (e.g. Huichol); and c) it fails to take into consideration the existence of ambisyllabic segments.
APA, Harvard, Vancouver, ISO, and other styles
40

Oralbekova, Dina, Orken Mamyrbayev, Mohamed Othman, Dinara Kassymova, and Kuralai Mukhsina. "Contemporary Approaches in Evolving Language Models." Applied Sciences 13, no. 23 (2023): 12901. http://dx.doi.org/10.3390/app132312901.

Full text
Abstract:
This article provides a comprehensive survey of contemporary language modeling approaches within the realm of natural language processing (NLP) tasks. This paper conducts an analytical exploration of diverse methodologies employed in the creation of language models. This exploration encompasses the architecture, training processes, and optimization strategies inherent in these models. The detailed discussion covers various models ranging from traditional n-gram and hidden Markov models to state-of-the-art neural network approaches such as BERT, GPT, LLAMA, and Bard. This article delves into different modifications and enhancements applied to both standard and neural network architectures for constructing language models. Special attention is given to addressing challenges specific to agglutinative languages within the context of developing language models for various NLP tasks, particularly for Arabic and Turkish. The research highlights that contemporary transformer-based methods demonstrate results comparable to those achieved by traditional methods employing Hidden Markov Models. These transformer-based approaches boast simpler configurations and exhibit faster performance during both training and analysis. An integral component of the article is the examination of popular and actively evolving libraries and tools essential for constructing language models. Notable tools such as NLTK, TensorFlow, PyTorch, and Gensim are reviewed, with a comparative analysis considering their simplicity and accessibility for implementing diverse language models. The aim is to provide readers with insights into the landscape of contemporary language modeling methodologies and the tools available for their implementation.
APA, Harvard, Vancouver, ISO, and other styles
41

Kashyap, Gaurav. "Multilingual NLP: Techniques for Creating Models that Understand and Generate Multiple Languages with Minimal Resources." International Journal of Scientific Research in Engineering and Management 08, no. 12 (2024): 1–5. https://doi.org/10.55041/ijsrem7648.

Full text
Abstract:
Models that can process human language in a variety of applications have been developed as a result of the quick development of natural language processing (NLP). Scaling NLP technologies to support multiple languages with minimal resources is still a major challenge, even though many models work well in high-resource languages. By developing models that can comprehend and produce text in multiple languages, especially those with little linguistic information, multilingual natural language processing (NLP) seeks to overcome this difficulty. This study examines the methods used in multilingual natural language processing (NLP), such as data augmentation, transfer learning, and multilingual pre-trained models. It also talks about the innovations and trade-offs involved in developing models that can effectively handle multiple languages with little effort. Many low-resource languages have been underserved by the quick advances in natural language processing, which have mostly benefited high-resource languages. The methods for creating multilingual NLP models that can efficiently handle several languages with little resource usage are examined in this paper. We discuss unsupervised morphology-based approaches to expand vocabularies, the importance of community involvement in low-resource language technology, and the limitations of current multilingual models. With the creation of strong language models capable of handling a variety of tasks, the field of natural language processing has advanced significantly in recent years. But not all languages have benefited equally from the advancements, with high-resource languages like English receiving disproportionate attention. [9] As a result, there are huge differences in the performance and accessibility of natural language processing (NLP) systems for the languages spoken around the world, many of which are regarded as low-resource. Researchers have looked into a number of methods for developing multilingual natural language processing (NLP) models that can comprehend and produce text in multiple languages with little effort in order to rectify this imbalance. Using unsupervised morphology-based techniques to increase the vocabulary of low-resource languages is one promising strategy. Keywords: Multilingual NLP, Low-resource Languages, Morphology, Vocabulary Expansion, Creole Languages
APA, Harvard, Vancouver, ISO, and other styles
42

Ilyas, Mohammed. "Language quotient (LQ): new models of language learning." International Journal of ADVANCED AND APPLIED SCIENCES 3, no. 9 (2016): 44–50. http://dx.doi.org/10.21833/ijaas.2016.09.008.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Irtza, Saad, Vidhyasaharan Sethu, Eliathamby Ambikairajah, and Haizhou Li. "Using language cluster models in hierarchical language identification." Speech Communication 100 (June 2018): 30–40. http://dx.doi.org/10.1016/j.specom.2018.04.004.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Shi, Zhouxing, Yihan Wang, Fan Yin, Xiangning Chen, Kai-Wei Chang, and Cho-Jui Hsieh. "Red Teaming Language Model Detectors with Language Models." Transactions of the Association for Computational Linguistics 12 (2024): 174–89. http://dx.doi.org/10.1162/tacl_a_00639.

Full text
Abstract:
The prevalence and strong capability of large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. To prevent the potentially deceptive usage of LLMs, recent work has proposed algorithms to detect LLM-generated text and protect LLMs. In this paper, we investigate the robustness and reliability of these LLM detectors under adversarial attacks. We study two types of attack strategies: 1) replacing certain words in an LLM’s output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt. Different from previous works, we consider a challenging setting where the auxiliary LLM can also be protected by a detector. Experiments reveal that our attacks effectively compromise the performance of all detectors in the study with plausible generations, underscoring the urgent need to improve the robustness of LLM-generated text detection systems. Code is available at https://github.com/shizhouxing/LLM-Detector-Robustness.
APA, Harvard, Vancouver, ISO, and other styles
45

Lee, Chanhee, Kisu Yang, Taesun Whang, Chanjun Park, Andrew Matteson, and Heuiseok Lim. "Exploring the Data Efficiency of Cross-Lingual Post-Training in Pretrained Language Models." Applied Sciences 11, no. 5 (2021): 1974. http://dx.doi.org/10.3390/app11051974.

Full text
Abstract:
Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Even though language modeling is unsupervised and thus collecting data for it is relatively less expensive, it is still a challenging process for languages with limited resources. This results in great technological disparity between high- and low-resource languages for numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data efficient training of pretrained language models. It is achieved by formulating language modeling of low-resource languages as a domain adaptation task using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at a sequence level. To evaluate our method, we post-train a RoBERTa model pretrained in English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves the data efficiency by a factor of up to 32 compared to monolingual training.
APA, Harvard, Vancouver, ISO, and other styles
46

Alostad, Hana. "Large Language Models as Kuwaiti Annotators." Big Data and Cognitive Computing 9, no. 2 (2025): 33. https://doi.org/10.3390/bdcc9020033.

Full text
Abstract:
Stance detection for low-resource languages, such as the Kuwaiti dialect, poses a significant challenge in natural language processing (NLP) due to the scarcity of annotated datasets and specialized tools. This study addresses these limitations by evaluating the effectiveness of open large language models (LLMs) in automating stance detection through zero-shot and few-shot prompt engineering, with a focus on the potential of open-source models to achieve performance levels comparable to those of closed-source alternatives. We also highlight the critical distinctions between zero- and few-shot learning, emphasizing their significance for addressing the challenges posed by low-resource languages. Our evaluation involved testing 11 LLMs on a manually labeled dataset of social media posts, including GPT-4o, Gemini Pro 1.5, Mistral-Large, Jais-30B, and AYA-23. As expected, closed-source models such as GPT-4o, Gemini Pro 1.5, and Mistral-Large demonstrated superior performance, achieving maximum F1 scores of 95.4%, 95.0%, and 93.2%, respectively, in few-shot scenarios with English as the prompt template language. However, open-source models such as Jais-30B and AYA-23 achieved competitive results, with maximum F1 scores of 93.0% and 93.1%, respectively, under the same conditions. Furthermore, statistical analysis using ANOVA and Tukey’s HSD post hoc tests revealed no significant differences in overall performance among GPT-4o, Gemini Pro 1.5, Mistral-Large, Jais-30B, and AYA-23. This finding underscores the potential of open-source LLMs as cost-effective and privacy-preserving alternatives for low-resource language annotation. This is the first study comparing LLMs for stance detection in the Kuwaiti dialect. Our findings highlight the importance of prompt design and model consistency in improving the quality of annotations and pave the way for NLP solutions for under-represented Arabic dialects.
APA, Harvard, Vancouver, ISO, and other styles
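To make the zero-shot versus few-shot distinction in the abstract of entry 46 concrete, the snippet below builds both kinds of stance-detection prompt from toy examples. The wording, label set, and demonstrations are invented for illustration and are not the prompts used in the cited study.

# Building zero-shot and few-shot stance-detection prompts (illustrative only).
LABELS = ["favor", "against", "neutral"]  # hypothetical label set

def zero_shot_prompt(post: str, topic: str) -> str:
    # No demonstrations: the model must rely on the instruction alone.
    return (f"Classify the stance of the following post toward '{topic}' "
            f"as one of {LABELS}.\nPost: {post}\nStance:")

def few_shot_prompt(post: str, topic: str, examples: list) -> str:
    # A handful of labeled demonstrations precede the query.
    demos = "\n".join(f"Post: {p}\nStance: {s}" for p, s in examples)
    return (f"Classify the stance of each post toward '{topic}' "
            f"as one of {LABELS}.\n{demos}\nPost: {post}\nStance:")

examples = [("This decision helps everyone.", "favor"),
            ("This is a terrible idea.", "against")]
print(few_shot_prompt("Not sure how I feel about it.", "the new policy", examples))

The resulting string would be sent to the chosen LLM; the study's finding is that the choice between these two prompt styles, and the prompt language, materially affects annotation quality.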
47

Tantug, Ahmet Cüneyd. "Document Categorization with Modified Statistical Language Models for Agglutinative Languages." International Journal of Computational Intelligence Systems 3, no. 5 (2010): 632. http://dx.doi.org/10.2991/ijcis.2010.3.5.12.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Ayres‐Bennett, Wendy. "Researching Language Standards and Standard Languages: Theories, Models and Methods." Transactions of the Philological Society 122, no. 3 (2024): 496–503. https://doi.org/10.1111/1467-968x.12298.

Full text
Abstract:
The title of Jim Adams's rich and interesting paper clearly states the key question at the heart of his analysis: ‘Was classical (late republican) Latin a “standard language”?’. In this article, I contextualise some of his answers to this and other related questions he raises by situating them in the context of theoretical discussions of standardisation and recent explorations of its processes and outcomes. In recent years there has been extensive research on linguistic standardisation. This research has broadened the scope of consideration from the now stock examples of modern Western European languages to minoritized languages, multilingual situations and stateless languages, raising the question of whether traditional models and conceptions of standardisation often associated with the creation of a nation‐state can equally be applied to other contexts, including Latin during the late republican and early imperial periods.
APA, Harvard, Vancouver, ISO, and other styles
49

Feng, Hui. "Different languages, different cultures, different language ideologies, different linguistic models." Journal of Multicultural Discourses 4, no. 2 (2009): 151–64. http://dx.doi.org/10.1080/17447140802283191.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Tantug, Ahmet Cüneyd. "Document Categorization with Modified Statistical Language Models for Agglutinative Languages." International Journal of Computational Intelligence Systems 3, no. 5 (2010): 632–45. http://dx.doi.org/10.1080/18756891.2010.9727729.

Full text
APA, Harvard, Vancouver, ISO, and other styles