
Journal articles on the topic 'N-gram language models'

Consult the top 50 journal articles for your research on the topic 'N-gram language models.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

LLORENS, DAVID, JUAN MIGUEL VILAR, and FRANCISCO CASACUBERTA. "FINITE STATE LANGUAGE MODELS SMOOTHED USING n-GRAMS." International Journal of Pattern Recognition and Artificial Intelligence 16, no. 03 (May 2002): 275–89. http://dx.doi.org/10.1142/s0218001402001666.

Abstract:
We address the problem of smoothing the probability distribution defined by a finite state automaton. Our approach extends the ideas employed for smoothing n-gram models. This extension is obtained by interpreting n-gram models as finite state models. The experiments show that our smoothing improves perplexity over smoothed n-grams and Error Correcting Parsing techniques.
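Illustrative aside (not the authors' method): a minimal Python sketch of a bigram model viewed as a finite state model, in the spirit of the abstract's interpretation; add-alpha smoothing stands in for the smoothing schemes the paper studies, and the toy corpus is an assumption.

from collections import defaultdict

def bigram_automaton(sentences, alpha=1.0):
    # Each state is the previous word; each arc carries a smoothed
    # transition probability (add-alpha smoothing as a simple stand-in).
    counts = defaultdict(lambda: defaultdict(int))
    vocab = {"</s>"}
    for sent in sentences:
        words = ["<s>"] + sent + ["</s>"]
        vocab.update(sent)
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    V = len(vocab)
    automaton = {}
    for state, nxt in counts.items():
        total = sum(nxt.values())
        automaton[state] = {w: (nxt.get(w, 0) + alpha) / (total + alpha * V)
                            for w in vocab}
    return automaton

arcs = bigram_automaton([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(round(arcs["the"]["cat"], 3))   # P(cat | state "the")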
2

MEMUSHAJ, ALKET, and TAREK M. SOBH. "USING GRAPHEME n-GRAMS IN SPELLING CORRECTION AND AUGMENTATIVE TYPING SYSTEMS." New Mathematics and Natural Computation 04, no. 01 (March 2008): 87–106. http://dx.doi.org/10.1142/s1793005708000970.

Abstract:
Probabilistic language models have gained popularity in Natural Language Processing due to their ability to successfully capture language structures and constraints with computational efficiency. Probabilistic language models are flexible and easily adapted to language changes over time as well as to some new languages. Probabilistic language models can be trained and their accuracy is strongly related to the availability of large text corpora. In this paper, we investigate the usability of grapheme probabilistic models, specifically grapheme n-gram models, in spellchecking as well as augmentative typing systems. Grapheme n-gram models require substantially smaller training corpora and that is one of the main drivers for this work, in which we build grapheme n-gram language models for the Albanian language. There are presently no available Albanian language corpora to be used for probabilistic language modeling. Our technique attempts to augment spellchecking and typing systems by utilizing grapheme n-gram language models to improve suggestion accuracy in spellchecking and augmentative typing systems. Our technique can be implemented in a standalone tool or incorporated in another tool to offer additional selection/scoring criteria.
3

Mezzoudj, Freha, and Abdelkader Benyettou. "An empirical study of statistical language models: n-gram language models vs. neural network language models." International Journal of Innovative Computing and Applications 9, no. 4 (2018): 189. http://dx.doi.org/10.1504/ijica.2018.095762.

4

Mezzoudj, Freha, and Abdelkader Benyettou. "An empirical study of statistical language models: n-gram language models vs. neural network language models." International Journal of Innovative Computing and Applications 9, no. 4 (2018): 189. http://dx.doi.org/10.1504/ijica.2018.10016827.

5

Takase, Sho, Jun Suzuki, and Masaaki Nagata. "Character n-Gram Embeddings to Improve RNN Language Models." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 5074–82. http://dx.doi.org/10.1609/aaai.v33i01.33015074.

Abstract:
This paper proposes a novel Recurrent Neural Network (RNN) language model that takes advantage of character information. We focus on character n-grams based on research in the field of word embedding construction (Wieting et al. 2016). Our proposed method constructs word embeddings from character n-gram embeddings and combines them with ordinary word embeddings. We demonstrate that the proposed method achieves the best perplexities on the language modeling datasets: Penn Treebank, WikiText-2, and WikiText-103. Moreover, we conduct experiments on application tasks: machine translation and headline generation. The experimental results indicate that our proposed method also positively affects these tasks.
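Illustrative aside (a sketch, not the paper's implementation): one way to form a word representation by adding the word's own embedding to the sum of its character n-gram embeddings, which could then feed an RNN language model; the embedding size and random initialisation are assumptions.

import numpy as np

def char_ngrams(word, n=3):
    # Character n-grams of a word, with boundary markers.
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

rng = np.random.default_rng(0)
dim = 8                                   # toy embedding size (assumption)
word_emb = {"language": rng.normal(size=dim)}
ngram_emb = {}                            # lazily created n-gram vectors

def embed(word):
    # Word representation = ordinary word vector + sum of its char n-gram vectors.
    vec = word_emb.get(word, np.zeros(dim)).copy()
    for g in char_ngrams(word):
        vec += ngram_emb.setdefault(g, rng.normal(size=dim))
    return vec

print(embed("language").shape)            # (8,) -- input vector for an RNN LM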
6

Santos, André L., Gonçalo Prendi, Hugo Sousa, and Ricardo Ribeiro. "Stepwise API usage assistance using n-gram language models." Journal of Systems and Software 131 (September 2017): 461–74. http://dx.doi.org/10.1016/j.jss.2016.06.063.

7

Nederhof, Mark-Jan. "A General Technique to Train Language Models on Language Models." Computational Linguistics 31, no. 2 (June 2005): 173–85. http://dx.doi.org/10.1162/0891201054223986.

Abstract:
We show that under certain conditions, a language model can be trained on the basis of a second language model. The main instance of the technique trains a finite automaton on the basis of a probabilistic context-free grammar, such that the Kullback-Leibler distance between grammar and trained automaton is provably minimal. This is a substantial generalization of an existing algorithm to train an n-gram model on the basis of a probabilistic context-free grammar.
8

Crego, Josep M., and François Yvon. "Factored bilingual n-gram language models for statistical machine translation." Machine Translation 24, no. 2 (June 2010): 159–75. http://dx.doi.org/10.1007/s10590-010-9082-5.

9

Lin, Jimmy, and W. John Wilbur. "Modeling actions of PubMed users with n-gram language models." Information Retrieval 12, no. 4 (September 12, 2008): 487–503. http://dx.doi.org/10.1007/s10791-008-9067-7.

10

GUO, YUQING, HAIFENG WANG, and JOSEF VAN GENABITH. "Dependency-based n-gram models for general purpose sentence realisation." Natural Language Engineering 17, no. 4 (November 29, 2010): 455–83. http://dx.doi.org/10.1017/s1351324910000288.

Abstract:
This paper presents a general-purpose, wide-coverage, probabilistic sentence generator based on dependency n-gram models. This is particularly interesting as many semantic or abstract syntactic input specifications for sentence realisation can be represented as labelled bi-lexical dependencies or typed predicate-argument structures. Our generation method captures the mapping between semantic representations and surface forms by linearising a set of dependencies directly, rather than via the application of grammar rules as in more traditional chart-style or unification-based generators. In contrast to conventional n-gram language models over surface word forms, we exploit structural information and various linguistic features inherent in the dependency representations to constrain the generation space and improve the generation quality. A series of experiments shows that dependency-based n-gram models generalise well to different languages (English and Chinese) and representations (LFG and CoNLL). Compared with state-of-the-art generation systems, our general-purpose sentence realiser is highly competitive with the added advantages of being simple, fast, robust and accurate.
11

Sennrich, Rico. "Modelling and Optimizing on Syntactic N-Grams for Statistical Machine Translation." Transactions of the Association for Computational Linguistics 3 (December 2015): 169–82. http://dx.doi.org/10.1162/tacl_a_00131.

Abstract:
The role of language models in SMT is to promote fluent translation output, but traditional n-gram language models are unable to capture fluency phenomena between distant words, such as some morphological agreement phenomena, subcategorisation, and syntactic collocations with string-level gaps. Syntactic language models have the potential to fill this modelling gap. We propose a language model for dependency structures that is relational rather than configurational and thus particularly suited for languages with a (relatively) free word order. It is trainable with Neural Networks, and not only improves over standard n-gram language models, but also outperforms related syntactic language models. We empirically demonstrate its effectiveness in terms of perplexity and as a feature function in string-to-tree SMT from English to German and Russian. We also show that using a syntactic evaluation metric to tune the log-linear parameters of an SMT system further increases translation quality when coupled with a syntactic language model.
12

Doval, Yerai, and Carlos Gómez-Rodríguez. "Comparing neural- and N-gram-based language models for word segmentation." Journal of the Association for Information Science and Technology 70, no. 2 (December 2, 2018): 187–97. http://dx.doi.org/10.1002/asi.24082.

13

Taranukha, V. "Ways to Improve N-Gram Language Models for OCR and Speech Recognition of Slavic Languages." Advanced Science Journal 2014, no. 4 (March 31, 2014): 65–69. http://dx.doi.org/10.15550/asj.2014.04.065.

14

Long, Qiang, Wei Wang, Jinfu Deng, Song Liu, Wenhao Huang, Fangying Chen, and Sifan Liu. "A distributed system for large-scale n-gram language models at Tencent." Proceedings of the VLDB Endowment 12, no. 12 (August 2019): 2206–17. http://dx.doi.org/10.14778/3352063.3352136.

15

Wang, Rui, Masao Utiyama, Isao Goto, Eiichiro Sumita, Hai Zhao, and Bao-Liang Lu. "Converting Continuous-Space Language Models into N-gram Language Models with Efficient Bilingual Pruning for Statistical Machine Translation." ACM Transactions on Asian and Low-Resource Language Information Processing 15, no. 3 (March 8, 2016): 1–26. http://dx.doi.org/10.1145/2843942.

16

Huang, Fei, Arun Ahuja, Doug Downey, Yi Yang, Yuhong Guo, and Alexander Yates. "Learning Representations for Weakly Supervised Natural Language Processing Tasks." Computational Linguistics 40, no. 1 (March 2014): 85–120. http://dx.doi.org/10.1162/coli_a_00167.

Abstract:
Finding the right representations for words is critical for building accurate NLP systems when domain-specific labeled data for the task is scarce. This article investigates novel techniques for extracting features from n-gram models, Hidden Markov Models, and other statistical language models, including a novel Partial Lattice Markov Random Field model. Experiments on part-of-speech tagging and information extraction, among other tasks, indicate that features taken from statistical language models, in combination with more traditional features, outperform traditional representations alone, and that graphical model representations outperform n-gram models, especially on sparse and polysemous words.
17

XIONG, DEYI, and MIN ZHANG. "Backward and trigger-based language models for statistical machine translation." Natural Language Engineering 21, no. 2 (July 24, 2013): 201–26. http://dx.doi.org/10.1017/s1351324913000168.

Abstract:
The language model is one of the most important knowledge sources for statistical machine translation. In this article, we present two extensions to standard n-gram language models in statistical machine translation: a backward language model that augments the conventional forward language model, and a mutual information trigger model which captures long-distance dependencies that go beyond the scope of standard n-gram language models. We introduce algorithms to integrate the two proposed models into two kinds of state-of-the-art phrase-based decoders. Our experimental results on Chinese/Spanish/Vietnamese-to-English show that both models are able to significantly improve translation quality in terms of BLEU and METEOR over a competitive baseline.
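Illustrative aside (a simplified sketch, not the paper's models): a backward bigram language model is just a forward model trained on reversed sentences; here the two directions are combined with equal weights and add-one smoothing, both assumptions for illustration.

import math
from collections import defaultdict

def train_bigram(sentences, alpha=1.0):
    counts, vocab = defaultdict(lambda: defaultdict(int)), {"</s>"}
    for s in sentences:
        words = ["<s>"] + s + ["</s>"]
        vocab.update(s)
        for p, c in zip(words, words[1:]):
            counts[p][c] += 1
    V = len(vocab)
    def logprob(sentence):
        # Add-one smoothed bigram log-probability of the sentence.
        words = ["<s>"] + sentence + ["</s>"]
        total = 0.0
        for p, c in zip(words, words[1:]):
            ctx = counts[p]
            total += math.log((ctx[c] + alpha) / (sum(ctx.values()) + alpha * V))
        return total
    return logprob

train = [["the", "cat", "sat"], ["the", "dog", "barked"]]
forward = train_bigram(train)
backward = train_bigram([list(reversed(s)) for s in train])

sent = ["the", "cat", "barked"]
score = 0.5 * forward(sent) + 0.5 * backward(list(reversed(sent)))
print(score)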
18

Schütze, Hinrich, and Michael Walsh. "Half-Context Language Models." Computational Linguistics 37, no. 4 (December 2011): 843–65. http://dx.doi.org/10.1162/coli_a_00078.

Abstract:
This article investigates the effects of different degrees of contextual granularity on language model performance. It presents a new language model that combines clustering and half-contextualization, a novel representation of contexts. Half-contextualization is based on the half-context hypothesis that states that the distributional characteristics of a word or bigram are best represented by treating its context distribution to the left and right separately and that only directionally relevant distributional information should be used. Clustering is achieved using a new clustering algorithm for class-based language models that compares favorably to the exchange algorithm. When interpolated with a Kneser-Ney model, half-context models are shown to have better perplexity than commonly used interpolated n-gram models and traditional class-based approaches. A novel, fine-grained, context-specific analysis highlights those contexts in which the model performs well and those which are better treated by existing non-class-based models.
19

Rahman, M. D. Riazur, M. D. Tarek Habib, M. D. Sadekur Rahman, Gazi Zahirul Islam, and M. D. Abbas Ali Khan. "An exploratory research on grammar checking of Bangla sentences using statistical language models." International Journal of Electrical and Computer Engineering (IJECE) 10, no. 3 (June 1, 2020): 3244. http://dx.doi.org/10.11591/ijece.v10i3.pp3244-3252.

Abstract:
N-gram based language models are very popular and extensively used statistical methods for solving various natural language processing problems including grammar checking. Smoothing is one of the most effective techniques used in building a language model to deal with the data sparsity problem. Kneser-Ney is one of the most prominently used and successful smoothing techniques for language modelling. In our previous work, we presented a Witten-Bell smoothing based language modelling technique for checking grammatical correctness of Bangla sentences which showed promising results outperforming previous methods. In this work, we proposed an improved method using a Kneser-Ney smoothing based n-gram language model for grammar checking and performed a comparative performance analysis between Kneser-Ney and Witten-Bell smoothing techniques for the same purpose. We also provided an improved technique for calculating the optimum threshold which further enhanced the results. Our experimental results show that Kneser-Ney outperforms Witten-Bell as a smoothing technique when used with n-gram LMs for checking grammatical correctness of Bangla sentences.
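Illustrative aside (a minimal sketch, not the paper's Bangla system): an interpolated Kneser-Ney bigram model with a per-word log-probability threshold used to flag a sentence; the discount, the threshold, and the toy English data are assumptions.

import math
from collections import defaultdict

class KNBigram:
    # Minimal interpolated Kneser-Ney bigram model (sketch).
    def __init__(self, sentences, discount=0.75):
        self.d = discount
        self.big = defaultdict(lambda: defaultdict(int))
        self.cont = defaultdict(set)          # left contexts of each word
        for s in sentences:
            words = ["<s>"] + s + ["</s>"]
            for p, c in zip(words, words[1:]):
                self.big[p][c] += 1
                self.cont[c].add(p)
        self.total_types = sum(len(f) for f in self.big.values())

    def prob(self, prev, word):
        follow = self.big[prev]
        total = sum(follow.values())
        p_cont = len(self.cont[word]) / self.total_types or 1e-9
        if total == 0:
            return p_cont
        lam = self.d * len(follow) / total
        return max(follow[word] - self.d, 0) / total + lam * p_cont

    def flag_ungrammatical(self, sentence, threshold=-6.0):
        # Flag the sentence if its average per-word log-probability is too low.
        words = ["<s>"] + sentence + ["</s>"]
        avg = sum(math.log(self.prob(p, c))
                  for p, c in zip(words, words[1:])) / (len(words) - 1)
        return avg < threshold

lm = KNBigram([["he", "goes", "home"], ["she", "goes", "home"]])
print(lm.flag_ungrammatical(["he", "go", "home"]))   # True for this toy model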
20

Nowakowski, Karol, Michal Ptaszynski, and Fumito Masui. "MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language." Information 10, no. 10 (October 16, 2019): 317. http://dx.doi.org/10.3390/info10100317.

Abstract:
Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter—a fast word segmentation algorithm, which reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. In order to validate our method in a low-resource scenario involving extremely sparse data, we tested it with a small corpus of text in the critically endangered language of the Ainu people living in northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results we obtained demonstrate the high performance of our algorithm, comparable with the other best-performing models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other Natural Language Processing tasks, such as speech recognition.
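Illustrative aside (a simplified reading of the shortest-sequence idea, not the MiNgMatch implementation): a dynamic program that covers the input with the fewest lexicon entries; the toy lexicon is an assumption.

def min_ngram_segmentation(text, lexicon):
    # Segment 'text' into the fewest entries from 'lexicon' (a set of strings).
    # Returns the list of pieces, or None if no full cover exists.
    n = len(text)
    best = [None] * (n + 1)       # best[i] = minimal piece list covering text[:i]
    best[0] = []
    max_len = max(map(len, lexicon))
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if best[j] is not None and piece in lexicon and \
               (best[i] is None or len(best[j]) + 1 < len(best[i])):
                best[i] = best[j] + [piece]
    return best[n]

lexicon = {"irankarapte", "iran", "karap", "te", "kar", "apte"}   # toy entries
print(min_ngram_segmentation("irankarapte", lexicon))             # ['irankarapte']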
21

BERTOLAMI, ROMAN, and HORST BUNKE. "INTEGRATION OF n-GRAM LANGUAGE MODELS IN MULTIPLE CLASSIFIER SYSTEMS FOR OFFLINE HANDWRITTEN TEXT LINE RECOGNITION." International Journal of Pattern Recognition and Artificial Intelligence 22, no. 07 (November 2008): 1301–21. http://dx.doi.org/10.1142/s0218001408006855.

Abstract:
Current multiple classifier systems for unconstrained handwritten text recognition do not provide a straightforward way to utilize language model information. In this paper, we describe a generic method to integrate a statistical n-gram language model into the combination of multiple offline handwritten text line recognizers. The proposed method first builds a word transition network and then rescores this network with an n-gram language model. Experimental evaluation conducted on a large dataset of offline handwritten text lines shows that the proposed approach improves the recognition accuracy over a reference system as well as over the original combination method that does not include a language model.
22

MASUMURA, Ryo, Taichi ASAMI, Takanobu OBA, Hirokazu MASATAKI, Sumitaka SAKAUCHI, and Satoshi TAKAHASHI. "N-gram Approximation of Latent Words Language Models for Domain Robust Automatic Speech Recognition." IEICE Transactions on Information and Systems E99.D, no. 10 (2016): 2462–70. http://dx.doi.org/10.1587/transinf.2016slp0014.

23

Shahrivari, Saeed, Saeed Rahmani, and Hooman Keshavarz. "AUTOMATIC TAGGING OF PERSIAN WEB PAGES BASED ON N-GRAM LANGUAGE MODELS USING MAPREDUCE." ICTACT Journal on Soft Computing 05, no. 04 (July 1, 2015): 1003–8. http://dx.doi.org/10.21917/ijsc.2015.0140.

24

Dorado, Rubén. "Statistical models for languaje representation." Revista Ontare 1, no. 1 (September 16, 2015): 29. http://dx.doi.org/10.21158/23823399.v1.n1.2013.1208.

Abstract:
This paper discusses several models for the computational representation of language. First, some n-gram models that are based on Markov models are introduced. Second, a family of models known as the exponential models is taken into account. This family in particular allows the incorporation of several features to model. Third, a recent current of research, the probabilistic Bayesian approach, is discussed. In this kind of model, language is modeled as a probabilistic distribution. Several distributions and probabilistic processes, such as the Dirichlet distribution and the Pitman-Yor process, are used to approximate the linguistic phenomena. Finally, the problem of sparseness of the language and its common solution known as smoothing is discussed.
25

Paul, Baltescu, Blunsom Phil, and Hoang Hieu. "OxLM: A Neural Language Modelling Framework for Machine Translation." Prague Bulletin of Mathematical Linguistics 102, no. 1 (September 11, 2014): 81–92. http://dx.doi.org/10.2478/pralin-2014-0016.

Abstract:
This paper presents an open source implementation of a neural language model for machine translation. Neural language models deal with the problem of data sparsity by learning distributed representations for words in a continuous vector space. The language modelling probabilities are estimated by projecting a word's context in the same space as the word representations and by assigning probabilities proportional to the distance between the words and the context's projection. Neural language models are notoriously slow to train and test. Our framework is designed with scalability in mind and provides two optional techniques for reducing the computational cost: the so-called class decomposition trick and a training algorithm based on noise contrastive estimation. Our models may be extended to incorporate direct n-gram features to learn weights for every n-gram in the training data. Our framework comes with wrappers for the cdec and Moses translation toolkits, allowing our language models to be incorporated as normalized features in their decoders (inside the beam search).
26

Pelemans, Joris, Noam Shazeer, and Ciprian Chelba. "Sparse Non-negative Matrix Language Modeling." Transactions of the Association for Computational Linguistics 4 (December 2016): 329–42. http://dx.doi.org/10.1162/tacl_a_00102.

Abstract:
We present Sparse Non-negative Matrix (SNM) estimation, a novel probability estimation technique for language modeling that can efficiently incorporate arbitrary features. We evaluate SNM language models on two corpora: the One Billion Word Benchmark and a subset of the LDC English Gigaword corpus. Results show that SNM language models trained with n-gram features are a close match for the well-established Kneser-Ney models. The addition of skip-gram features yields a model that is in the same league as the state-of-the-art recurrent neural network language models, as well as complementary: combining the two modeling techniques yields the best known result on the One Billion Word Benchmark. On the Gigaword corpus further improvements are observed using features that cross sentence boundaries. The computational advantages of SNM estimation over both maximum entropy and neural network estimation are probably its main strength, promising an approach that has large flexibility in combining arbitrary features and yet scales gracefully to large amounts of data.
27

Zitouni, Imed. "Backoff hierarchical class n-gram language models: effectiveness to model unseen events in speech recognition." Computer Speech & Language 21, no. 1 (January 2007): 88–104. http://dx.doi.org/10.1016/j.csl.2006.01.001.

28

Bessou, Sadik, and Racha Sari. "Efficient Discrimination between Arabic Dialects." Recent Advances in Computer Science and Communications 13, no. 4 (October 19, 2020): 725–30. http://dx.doi.org/10.2174/2213275912666190716115604.

Abstract:
Background: With the explosion of communication technologies and the accompanying pervasive use of social media, we notice an outstanding proliferation of posts, reviews, comments, and other forms of expressions in different languages. This content attracted researchers from different fields; economics, political sciences, social sciences, psychology and particularly language processing. One of the prominent subjects is the discrimination between similar languages and dialects using natural language processing and machine learning techniques. The problem is usually addressed by formulating the identification as a classification task. Methods: The approach is based on machine learning classification methods to discriminate between Modern Standard Arabic (MSA) and four regional Arabic dialects: Egyptian, Levantine, Gulf and North-African. Several models were trained to discriminate between the studied dialects in large corpora mined from online Arabic newspapers and manually annotated. Results: Experimental results showed that n-gram features could substantially improve performance. Logistic regression based on character and word n-gram model using Count Vectors identified the handled dialects with an overall accuracy of 95%. Best results were achieved with Linear Support vector classifier using TF-IDF Vectors trained by character-based uni-gram, bi-gram, trigram, and word-based uni-gram, bi-gram with an overall accuracy of 95.1%. Conclusion: The results showed that n-gram features could substantially improve performance. Additionally, we noticed that the kind of data representation could provide a significant performance boost compared to simple representation.
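Illustrative aside (a sketch of the feature/classifier combination described, not the authors' exact setup): character and word n-gram TF-IDF features feeding a linear SVM in scikit-learn; the placeholder texts, labels, and n-gram ranges are assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

# Character 1-3 gram and word 1-2 gram TF-IDF features, as in the abstract.
features = make_union(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
)
clf = make_pipeline(features, LinearSVC())

texts = ["msa sample text", "egyptian sample text",
         "gulf sample text", "levantine sample text"]   # placeholder data
labels = ["MSA", "EGY", "GLF", "LEV"]
clf.fit(texts, labels)
print(clf.predict(["a new post to classify"]))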
29

Takahashi, Shuntaro, and Kumiko Tanaka-Ishii. "Evaluating Computational Language Models with Scaling Properties of Natural Language." Computational Linguistics 45, no. 3 (September 2019): 481–513. http://dx.doi.org/10.1162/coli_a_00355.

Abstract:
In this article, we evaluate computational models of natural language with respect to the universal statistical behaviors of natural language. Statistical mechanical analyses have revealed that natural language text is characterized by scaling properties, which quantify the global structure in the vocabulary population and the long memory of a text. We study whether five scaling properties (given by Zipf’s law, Heaps’ law, Ebeling’s method, Taylor’s law, and long-range correlation analysis) can serve for evaluation of computational models. Specifically, we test n-gram language models, a probabilistic context-free grammar, language models based on Simon/Pitman-Yor processes, neural language models, and generative adversarial networks for text generation. Our analysis reveals that language models based on recurrent neural networks with a gating mechanism (i.e., long short-term memory; a gated recurrent unit; and quasi-recurrent neural networks) are the only computational models that can reproduce the long memory behavior of natural language. Furthermore, through comparison with recently proposed model-based evaluation methods, we find that the exponent of Taylor’s law is a good indicator of model quality.
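Illustrative aside (a sketch of one of the five scaling properties mentioned, not the paper's evaluation pipeline): estimating the Zipf exponent of a corpus by a least-squares fit in log-log space; the toy corpus is an assumption.

import numpy as np
from collections import Counter

def zipf_exponent(tokens):
    # Fit log(frequency) ~ -a * log(rank) + b and return the exponent a.
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

corpus = ("the cat sat on the mat and the dog sat on the log " * 50).split()
print(zipf_exponent(corpus))   # ~1 for natural text; a toy corpus is only suggestive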
30

Arthur O. Santos, Flávio, Thiago Dias Bispo, Hendrik Teixeira Macedo, and Cleber Zanchettin. "Morphological Skip-Gram: Replacing FastText characters n-gram with morphological knowledge." Inteligencia Artificial 24, no. 67 (February 20, 2021): 1–17. http://dx.doi.org/10.4114/intartif.vol24iss67pp1-17.

Abstract:
Natural language processing systems have attracted much interest from industry. This branch of study is composed of some applications such as machine translation, sentiment analysis, named entity recognition, question answering, and others. Word embeddings (i.e., continuous word representations) are an essential module for those applications, generally used as the word representation fed to machine learning models. Some popular methods to train word embeddings are GloVe and Word2Vec. They achieve good word representations, despite limitations: both ignore morphological information of the words and consider only one representation vector for each word. This approach implies the word embeddings do not consider different word contexts properly and are unaware of their inner structure. To mitigate this problem, the other word embeddings method, FastText, represents each word as a bag of character n-grams. Hence, a continuous vector describes each n-gram, and the final word representation is the sum of its character n-gram vectors. Nevertheless, the use of all character n-grams of a word is a poor approach since some n-grams have no semantic relation with their words and increase the amount of potentially useless information. This approach also increases the training time. In this work, we propose a new method for training word embeddings, and its goal is to replace the FastText bag of character n-grams with a bag of word morphemes through the morphological analysis of the word. Thus, words with similar context and morphemes are represented by vectors close to each other. To evaluate our new approach, we performed intrinsic evaluations considering 15 different tasks, and the results show a competitive performance compared to FastText. Moreover, the proposed model is 40% faster than FastText in the training phase. We also outperform the baseline approaches in extrinsic evaluations through hate speech detection and NER tasks using different scenarios.
31

WANG, XIAOLONG, DANIEL S. YEUNG, JAMES N. K. LIU, ROBERT LUK, and XUAN WANG. "A HYBRID LANGUAGE MODEL BASED ON STATISTICS AND LINGUISTIC RULES." International Journal of Pattern Recognition and Artificial Intelligence 19, no. 01 (February 2005): 109–28. http://dx.doi.org/10.1142/s0218001405003934.

Abstract:
Language modeling is a current research topic in many domains including speech recognition, optical character recognition, handwriting recognition, machine translation and spelling correction. There are two main types of language models, the mathematical and the linguistic. The most widely used mathematical language model is the n-gram model inferred from statistics. This model has three problems: long distance restriction, recursive nature and partial language understanding. Language models based on linguistics present many difficulties when applied to large scale real texts. We present here a new hybrid language model that combines the advantages of the n-gram statistical language model with those of a linguistic language model which makes use of grammatical or semantic rules. Using suitable rules, this hybrid model can solve problems such as long distance restriction, recursive nature and partial language understanding. The new language model has been effective in experiments and has been incorporated in Chinese sentence input products for Windows and Macintosh OS.
32

GuoDong, Z., and L. KimTeng. "Interpolation of n-gram and mutual-information based trigger pair language models for Mandarin speech recognition." Computer Speech & Language 13, no. 2 (April 1999): 125–41. http://dx.doi.org/10.1006/csla.1998.0118.

33

Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. "Enriching Word Vectors with Subword Information." Transactions of the Association for Computational Linguistics 5 (December 2017): 135–46. http://dx.doi.org/10.1162/tacl_a_00051.

Abstract:
Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
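Illustrative aside (a sketch of the bag-of-character-n-grams representation described, omitting the hashing trick and the skipgram training loop of the actual method): a word vector is the sum of the vectors of its character n-grams plus the whole-word symbol; sizes and initialisation are assumptions.

import numpy as np

def subword_ngrams(word, n_min=3, n_max=6):
    # Character n-grams of '<word>', plus the full word itself.
    w = "<" + word + ">"
    grams = {w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)
    return grams

rng = np.random.default_rng(0)
dim = 10
ngram_vectors = {}     # in the real method these live in a hashed embedding matrix

def word_vector(word):
    # Word vector = sum of its character n-gram vectors.
    return sum(ngram_vectors.setdefault(g, rng.normal(size=dim))
               for g in subword_ngrams(word))

print(np.dot(word_vector("where"), word_vector("whereby")))  # shared n-grams overlap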
34

TACHBELIE, MARTHA YIFIRU, SOLOMON TEFERRA ABATE, and WOLFGANG MENZEL. "Using morphemes in language modeling and automatic speech recognition of Amharic." Natural Language Engineering 20, no. 2 (December 12, 2012): 235–59. http://dx.doi.org/10.1017/s1351324912000356.

Abstract:
This paper presents morpheme-based language models developed for Amharic (a morphologically rich Semitic language) and their application to a speech recognition task. A substantial reduction in the out of vocabulary rate has been observed as a result of using subwords or morphemes. Thus a severe problem of morphologically rich languages has been addressed. Moreover, lower perplexity values have been obtained with morpheme-based language models than with word-based models. However, when comparing the quality based on the probability assigned to the test sets, word-based models seem to fare better. We have studied the utility of morpheme-based language models in speech recognition systems and found that the performance of a relatively small vocabulary (5k) speech recognition system improved significantly as a result of using morphemes as language modeling and dictionary units. However, as the size of the vocabulary increases (20k or more) the morpheme-based systems suffer from acoustic confusability and did not achieve a significant improvement over a word-based system with an equivalent vocabulary size even with the use of higher order (quadrogram) n-gram language models.
35

FLOR, MICHAEL. "A fast and flexible architecture for very large word n-gram datasets." Natural Language Engineering 19, no. 1 (January 10, 2012): 61–93. http://dx.doi.org/10.1017/s1351324911000349.

Abstract:
This paper presents TrendStream, a versatile architecture for very large word n-gram datasets. Designed for speed, flexibility, and portability, TrendStream uses a novel trie-based architecture, features lossless compression, and provides optimization for both speed and memory use. In addition to literal queries, it also supports fast pattern matching searches (with wildcards or regular expressions), on the same data structure, without any additional indexing. Language models are updateable directly in the compiled binary format, allowing rapid encoding of existing tabulated collections, incremental generation of n-gram models from streaming text, and merging of encoded compiled files. This architecture offers flexible choices for loading and memory utilization: fast memory-mapping of a multi-gigabyte model, or on-demand partial data loading with very modest memory requirements. The implemented system runs successfully on several different platforms, under different operating systems, even when the n-gram model file is much larger than available memory. Experimental evaluation results are presented with the Google Web1T collection and the Gigaword corpus.
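Illustrative aside (a toy of the underlying idea only; TrendStream's compression, binary format, and memory mapping are not reproduced): storing n-gram counts in a trie keyed word by word, which shares prefixes and supports simple pattern queries.

from collections import defaultdict

def make_node():
    return {"count": 0, "children": defaultdict(make_node)}

root = make_node()

def add_ngram(ngram, count=1):
    # Walk (and create) one trie node per word of the n-gram.
    node = root
    for word in ngram:
        node = node["children"][word]
    node["count"] += count

def get_count(ngram):
    node = root
    for word in ngram:
        if word not in node["children"]:
            return 0
        node = node["children"][word]
    return node["count"]

add_ngram(("new", "york", "city"), 42)
add_ngram(("new", "york", "times"), 17)
print(get_count(("new", "york", "city")))                              # 42
print(list(root["children"]["new"]["children"]["york"]["children"]))   # pattern "new york *"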
36

Dorado, Ruben. "Smoothing methods for the treatment of digital texts." Revista Ontare 2, no. 1 (September 17, 2015): 42. http://dx.doi.org/10.21158/23823399.v2.n1.2014.1234.

Abstract:
This article describes the exploration task known as smoothing for statistical language representation. It also reviews some of the state-of-the-art methods that improve the representation of language in a statistical way. Specifically, these reported methods improve statistical models known as N-gram models. This paper also shows a method to measure models in order to compare them.
37

Chang, Harry M. "Constructing n-gram rules for natural language models through exploring the limitation of the Zipf–Mandelbrot law." Computing 91, no. 3 (October 2, 2010): 241–64. http://dx.doi.org/10.1007/s00607-010-0116-x.

38

Singh, Umrinderpal. "A Comparison of Phrase Based and Word based Language Model for Punjabi." International Journal of Advanced Research in Computer Science and Software Engineering 7, no. 7 (July 30, 2017): 444. http://dx.doi.org/10.23956/ijarcsse/v7i7/0232.

Abstract:
A language model provides a connection to the decoding process to determine a precise word from several available options in the information base or phrase table. The language model can be generated using an n-gram approach. Various language models and smoothing procedures are available to determine this model, such as the unigram, bigram, trigram, interpolation, and backoff language models. We have done some experiments with different language models where we have used phrases in place of words as the smallest unit. Experiments have shown that a phrase-based language model yields more accurate results compared to a simple word-based model. We have also done some experiments with a machine translation system where we have used a phrase-based language model rather than a word-based model, and the system yielded a great improvement.
39

Smywinski-Pohl, Aleksander, and Bartosz Ziółko. "Application of Morphosyntactic and Class-Based Language Models in Automatic Speech Recognition of Polish." International Journal on Artificial Intelligence Tools 25, no. 02 (April 2016): 1650006. http://dx.doi.org/10.1142/s0218213016500068.

Abstract:
In this paper we investigate the usefulness of morphosyntactic information as well as clustering in modeling Polish for automatic speech recognition. Polish is an inflectional language, thus we investigate the usefulness of an N-gram model based on morphosyntactic features. We present how individual types of features influence the model and which types of features are best suited for building a language model for automatic speech recognition. We compared the results of applying them with a class-based model that is automatically derived from the training corpus. We show that our approach towards clustering performs significantly better than frequently used SRI LM clustering method. However, this difference is apparent only for smaller corpora.
40

MAUČEC, MIRJAM SEPESY, TOMAŽ ROTOVNIK, ZDRAVKO KAČIČ, and JANEZ BREST. "USING DATA-DRIVEN SUBWORD UNITS IN LANGUAGE MODEL OF HIGHLY INFLECTIVE SLOVENIAN LANGUAGE." International Journal of Pattern Recognition and Artificial Intelligence 23, no. 02 (March 2009): 287–312. http://dx.doi.org/10.1142/s0218001409007119.

Abstract:
This paper presents the results of a study on modeling the highly inflective Slovenian language. We focus on creating a language model for a large vocabulary speech recognition system. A new data-driven method is proposed for the induction of inflectional morphology into language modeling. The research focus is on data sparsity, which results from the complex morphology of the language. The idea of using subword units is examined. An attempt is made to figure out the segmentation of words into two subword units: stems and endings. No prior knowledge of the language is used. The subword units should fit into the frameworks of the probabilistic language models. A morphologically correct decomposition of words is not being sought, but searching for a decomposition which yields the minimum entropy of the training corpus. This entropy is approximated by using N-gram models. Despite some seemingly over-simplified assumption, the subword models improve the applicability of the language models for a sparse training corpus. The experiments were performed using the VEČER newswire text corpus as a training corpus. The test set was taken from the SNABI speech database, because the final models were evaluated in speech recognition experiments on SNABI speech database. Two different subword-based models are proposed and examined experimentally. The experiments demonstrate that subword-based models, which considerably reduce OOV rate, improve speech recognition WER when compared with standard word-based models, even though they increase test set perplexity. Subword-based models with improved perplexity, but which reduce the OOV rate much less than the previous ones, do not improve speech recognition results.
41

Tremblay, Antoine, Elissa Asp, Anne Johnson, Malgorzata Zarzycka Migdal, Tim Bardouille, and Aaron J. Newman. "What the Networks Tell us about Serial and Parallel Processing." Mental Lexicon 11, no. 1 (June 7, 2016): 115–60. http://dx.doi.org/10.1075/ml.11.1.06tre.

Abstract:
A large literature documenting facilitative effects for high frequency complex words and phrases has led to proposals that high frequency phrases may be stored in memory rather than constructed on-line from their component parts (similarly to high frequency complex words). To investigate this, we explored language processing during a novel picture description task. Using the magnetoencephalography (MEG) technique and generalised additive mixed-effects modelling, we characterised the effects of the frequency of use of single words as well as two-, three-, and four-word sequences (N-grams) on brain activity during the pre-production stage of unconstrained overt picture description. We expected amplitude responses to be modulated by N-gram frequency such that if N-grams were stored we would see a corresponding reduction or flattening in amplitudes as frequency increased. We found that while amplitude responses to increasing N-gram frequencies corresponded with our expectations about facilitation, the effect appeared at low frequency ranges and for single words only in the phonological network. We additionally found that high frequency N-grams elicited activity increases in some networks, which may be signs of competition or combination depending on the network. Moreover, this effect was not reliable for single word frequencies. These amplitude responses do not clearly support storage for high frequency multi-word sequences. To probe these unexpected results, we turned our attention to network topographies and the timing. We found that, with the exception of an initial ‘sentence’ network, all the networks aggregated peaks from more than one domain (e.g. semantics and phonology). Moreover, although activity moved serially from anterior ventral networks to dorsal posterior networks during processing, as expected in combinatorial accounts, sentence processing and semantic networks ran largely in parallel. Thus, network topographies and timing may account for (some) facilitative effects associated with frequency. We review literature relevant to the network topographies and timing and briefly discuss our results in relation to current processing and theoretical models.
42

Castro, Dayvid W., Ellen Souza, Douglas Vitório, Diego Santos, and Adriano L. I. Oliveira. "Smoothed n-gram based models for tweet language identification: A case study of the Brazilian and European Portuguese national varieties." Applied Soft Computing 61 (December 2017): 1160–72. http://dx.doi.org/10.1016/j.asoc.2017.05.065.

43

Xia, Yu Guo, and Ming Liang Gu. "Ensemble Learning Approach with Application to Chinese Dialect Identification." Applied Mechanics and Materials 333-335 (July 2013): 769–74. http://dx.doi.org/10.4028/www.scientific.net/amm.333-335.769.

Abstract:
In this paper we propose an ensemble learning based approach to identify Chinese dialects. This new method firstly uses Gaussian Mixture Models and N-gram language models to produce a set of base learners. Then two typical ensemble learning approaches, Bagging and AdaBoost, are used to combine the base learners to determine the dialect category. An ANN is selected as the weak learner. The experimental results show that the ensemble approach not only enhances the performance of the system greatly, but also reduces the contradiction between the training data and the number of parameters in the models.
44

Eyamin, Md Iftakher Alam, Md Tarek Habib, Muhammad Ifte Khairul Islam, Md Sadekur Rahman, and Md Abbas Ali Khan. "An investigative design of optimum stochastic language model for bangla autocomplete." Indonesian Journal of Electrical Engineering and Computer Science 13, no. 2 (February 1, 2019): 671. http://dx.doi.org/10.11591/ijeecs.v13.i2.pp671-676.

Abstract:
<p class="Abstract">Word completion and word prediction are two important phenomena in typing that have extreme effect on aiding disable people and students while using keyboard or other similar devices. Such autocomplete technique also helps students significantly during learning process through constructing proper keywords during web searching. A lot of works are conducted for English language, but for Bangla, it is still very inadequate as well as the metrics used for performance computation is not rigorous yet. Bangla is one of the mostly spoken languages (3.05% of world population) and ranked as seventh among all the languages in the world. In this paper, word prediction on Bangla sentence by using stochastic, i.e. <em>N</em>-gram based language models are proposed for autocomplete a sentence by predicting a set of words rather than a single word, which was done in previous work. A novel approach is proposed in order to find the optimum language model based on performance metric. In addition, for finding out better performance, a large Bangla corpus of different word types is used.</p>
45

Zhang, Lipeng, Peng Zhang, Xindian Ma, Shuqin Gu, Zhan Su, and Dawei Song. "A Generalized Language Model in Tensor Space." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 7450–58. http://dx.doi.org/10.1609/aaai.v33i01.33017450.

Abstract:
In the literature, tensors have been effectively used for capturing the context information in language models. However, the existing methods usually adopt relatively-low order tensors, which have limited expressive power in modeling language. Developing a higher-order tensor representation is challenging, in terms of deriving an effective solution and showing its generality. In this paper, we propose a language model named Tensor Space Language Model (TSLM), by utilizing tensor networks and tensor decomposition. In TSLM, we build a high-dimensional semantic space constructed by the tensor product of word vectors. Theoretically, we prove that such tensor representation is a generalization of the n-gram language model. We further show that this high-order tensor representation can be decomposed to a recursive calculation of conditional probability for language modeling. The experimental results on Penn Tree Bank (PTB) dataset and WikiText benchmark demonstrate the effectiveness of TSLM.
46

Futrell, Richard, Adam Albright, Peter Graff, and Timothy J. O’Donnell. "A Generative Model of Phonotactics." Transactions of the Association for Computational Linguistics 5 (December 2017): 73–86. http://dx.doi.org/10.1162/tacl_a_00047.

Abstract:
We present a probabilistic model of phonotactics, the set of well-formed phoneme sequences in a language. Unlike most computational models of phonotactics (Hayes and Wilson, 2008; Goldsmith and Riggle, 2012), we take a fully generative approach, modeling a process where forms are built up out of subparts by phonologically-informed structure building operations. We learn an inventory of subparts by applying stochastic memoization (Johnson et al., 2007; Goodman et al., 2008) to a generative process for phonemes structured as an and-or graph, based on concepts of feature hierarchy from generative phonology (Clements, 1985; Dresher, 2009). Subparts are combined in a way that allows tier-based feature interactions. We evaluate our models’ ability to capture phonotactic distributions in the lexicons of 14 languages drawn from the WOLEX corpus (Graff, 2012). Our full model robustly assigns higher probabilities to held-out forms than a sophisticated N-gram model for all languages. We also present novel analyses that probe model behavior in more detail.
47

Pakoci, Edvin, Branislav Popović, and Darko Pekar. "Using Morphological Data in Language Modeling for Serbian Large Vocabulary Speech Recognition." Computational Intelligence and Neuroscience 2019 (March 3, 2019): 1–8. http://dx.doi.org/10.1155/2019/5072918.

Abstract:
Serbian is in a group of highly inflective and morphologically rich languages that use a lot of different word suffixes to express different grammatical, syntactic, or semantic features. This kind of behaviour usually produces a lot of recognition errors, especially in large vocabulary systems—even when, due to good acoustical matching, the correct lemma is predicted by the automatic speech recognition system, often a wrong word ending occurs, which is nevertheless counted as an error. This effect is larger for contexts not present in the language model training corpus. In this manuscript, an approach which takes into account different morphological categories of words for language modeling is examined, and the benefits in terms of word error rates and perplexities are presented. These categories include word type, word case, grammatical number, and gender, and they were all assigned to words in the system vocabulary, where applicable. These additional word features helped to produce significant improvements in relation to the baseline system, both for n-gram-based and neural network-based language models. The proposed system can help overcome a lot of tedious errors in a large vocabulary system, for example, for dictation, both for Serbian and for other languages with similar characteristics.
48

Pino, Juan, Aurelien Waite, and William Byrne. "Simple and Efficient Model Filtering in Statistical Machine Translation." Prague Bulletin of Mathematical Linguistics 98, no. 1 (October 1, 2012): 5–24. http://dx.doi.org/10.2478/v10108-012-0005-x.

Abstract:
Data availability and distributed computing techniques have allowed statistical machine translation (SMT) researchers to build larger models. However, decoders need to be able to retrieve information efficiently from these models to be able to translate an input sentence or a set of input sentences. We introduce an easy to implement and general purpose solution to tackle this problem: we store SMT models as a set of key-value pairs in an HFile. We apply this strategy to two specific tasks: test set hierarchical phrase-based rule filtering and n-gram count filtering for language model lattice rescoring. We compare our approach to alternative strategies and show that its trade-offs in terms of speed, memory and simplicity are competitive.
49

Stolcke, Andreas, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. "Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech." Computational Linguistics 26, no. 3 (September 2000): 339–73. http://dx.doi.org/10.1162/089120100561737.

Abstract:
We describe a statistical approach for modeling dialogue acts in conversational speech, i.e., speech-act-like units such as Statement, Question, Backchannel, Agreement, Disagreement, and Apology. Our model detects and predicts dialogue acts based on lexical, collocational, and prosodic cues, as well as on the discourse coherence of the dialogue act sequence. The dialogue model is based on treating the discourse structure of a conversation as a hidden Markov model and the individual dialogue acts as observations emanating from the model states. Constraints on the likely sequence of dialogue acts are modeled via a dialogue act n-gram. The statistical dialogue grammar is combined with word n-grams, decision trees, and neural networks modeling the idiosyncratic lexical and prosodic manifestations of each dialogue act. We develop a probabilistic integration of speech recognition with dialogue modeling, to improve both speech recognition and dialogue act classification accuracy. Models are trained and evaluated using a large hand-labeled database of 1,155 conversations from the Switchboard corpus of spontaneous human-to-human telephone speech. We achieved good dialogue act labeling accuracy (65% based on errorful, automatically recognized words and prosody, and 71% based on word transcripts, compared to a chance baseline accuracy of 35% and human accuracy of 84%) and a small reduction in word recognition error.
50

Boudia, Mohamed Amine, Reda Mohamed Hamou, and Abdelmalek Amine. "A New Meta-Heuristic based on Human Renal Function for Detection and Filtering of SPAM." International Journal of Information Security and Privacy 9, no. 4 (October 2015): 26–58. http://dx.doi.org/10.4018/ijisp.2015100102.

Abstract:
E-mail is one of the most used communication methods due to its efficiency and profitability. In the last few years, undesirable emails (SPAM) have become widespread, making up an important part of the inbox. Consequently, several recent studies have provided evidence of the importance of detection and filtering of SPAM as a major interest for the Internet community. In the present paper, the authors propose and experiment with a new and original meta-heuristic based on the renal system for the detection and filtering of spam. The natural model of the renal system is taken as an inspiration for its purification of blood, the filtering of toxins, as well as the regulation of blood pressure. The messages are represented by both a bag-of-words and an N-gram method, which is independent of languages because an email can be received in any language. After that, the authors propose to use two models to apply a Bayesian classification on textual data: the Bernoulli or the Multinomial model.
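Illustrative aside (covering only the Bayesian classification step mentioned, not the renal-system meta-heuristic): character n-gram counts with a Multinomial naive Bayes classifier in scikit-learn; the placeholder messages and labels are assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB   # swap in BernoulliNB for the Bernoulli model
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now", "meeting agenda for monday",
          "free free prize click now", "please review the attached report"]
labels = ["spam", "ham", "spam", "ham"]

# Character n-grams keep the representation language-independent, as the abstract notes.
model = make_pipeline(CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                      MultinomialNB())
model.fit(emails, labels)
print(model.predict(["claim your free prize"]))   # likely ['spam']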