
Theses on the topic "Statistical linguistics"

Create an accurate citation in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 theses for your research on the topic "Statistical linguistics".

Next to each source in the reference list there is an "Add to bibliography" button. Click this button, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the academic publication in PDF format and read its abstract online whenever it is available in the metadata.

Explore theses on a wide variety of disciplines and organize your bibliography correctly.

1

Onnis, Luca. "Statistical language learning". Thesis, University of Warwick, 2003. http://wrap.warwick.ac.uk/54811/.

Full text
Abstract
Theoretical arguments based on the "poverty of the stimulus" have denied a priori the possibility that abstract linguistic representations can be learned inductively from exposure to the environment, given that the linguistic input available to the child is both underdetermined and degenerate. I reassess such learnability arguments by exploring a) the type and amount of statistical information implicitly available in the input in the form of distributional and phonological cues; b) psychologically plausible inductive mechanisms for constraining the search space; c) the nature of linguistic representations, algebraic or statistical. To do so I use three methodologies: experimental procedures, linguistic analyses based on large corpora of naturally occurring speech and text, and computational models implemented in computer simulations. In Chapters 1, 2, and 5, I argue that long-distance structural dependencies - traditionally hard to explain with simple distributional analyses based on n-gram statistics - can indeed be learned associatively provided the amount of intervening material is highly variable or invariant (the Variability effect). In Chapter 3, I show that simple associative mechanisms instantiated in Simple Recurrent Networks can replicate the experimental findings under the same conditions of variability. Chapter 4 presents successes and limits of such results across perceptual modalities (visual vs. auditory) and perceptual presentation (temporal vs. sequential), as well as the impact of long and short training procedures. In Chapter 5, I show that generalisation to abstract categories from stimuli framed in non-adjacent dependencies is also modulated by the Variability effect. In Chapter 6, I show that the putative separation of algebraic and statistical styles of computation based on successful speech segmentation versus unsuccessful generalisation experiments (as published in a recent Science paper) is premature and is the effect of a preference for phonological properties of the input. In Chapter 7, computer simulations of learning irregular constructions suggest that it is possible to learn from positive evidence alone, despite Gold's celebrated arguments on the unlearnability of natural languages. Evolutionary simulations in Chapter 8 show that irregularities in natural languages can emerge from full regularity and remain stable across generations of simulated agents. In Chapter 9 I conclude that the brain may be endowed with a powerful statistical device for detecting structure, generalising, segmenting speech, and recovering from overgeneralisations. The experimental and computational evidence gathered here suggests that statistical language learning is more powerful than heretofore acknowledged by the current literature.
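As an illustration of the statistical asymmetry usually invoked to explain the Variability effect described above, the sketch below (not material from the thesis) generates an artificial corpus of frame-filler strings and compares the non-adjacent transitional probability with ordinary adjacent bigram counts: as the set of intervening fillers grows, each adjacent bigram becomes rare while the non-adjacent dependency stays perfectly predictive. The grammar, filler counts, and corpus size are all invented for illustration.

```python
# Toy illustration (not the thesis's materials): non-adjacent dependency statistics
# in an artificial language of the form "a X b" vs "c X d", where the number of
# distinct intervening items X is varied.
import random
from collections import Counter

def make_corpus(n_sentences, n_fillers):
    """Generate frame-filler sentences with non-adjacent dependencies a_b and c_d."""
    fillers = [f"x{i}" for i in range(n_fillers)]
    frames = [("a", "b"), ("c", "d")]
    return [(left, random.choice(fillers), right)
            for left, right in (random.choice(frames) for _ in range(n_sentences))]

def nonadjacent_tp(corpus, left, right):
    """P(right | left), skipping the one intervening element."""
    left_count = sum(1 for l, _, _ in corpus if l == left)
    pair_count = sum(1 for l, _, r in corpus if l == left and r == right)
    return pair_count / left_count if left_count else 0.0

for n_fillers in (2, 24):  # low vs high variability of the intervening material
    corpus = make_corpus(1000, n_fillers)
    adjacent = Counter((l, m) for l, m, _ in corpus)
    print(f"{n_fillers:>2} fillers:",
          "P(b|a) skipping one word =", round(nonadjacent_tp(corpus, "a", "b"), 2),
          "| most frequent adjacent bigram count =", adjacent.most_common(1)[0][1])
```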
2

Zhang, Lidan, and 张丽丹. "Exploiting linguistic knowledge for statistical natural language processing". Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2011. http://hub.hku.hk/bib/B46506299.

Full text
3

White, Christopher Wm. "Some Statistical Properties of Tonality, 1650-1900". Thesis, Yale University, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3578472.

Full text
Abstract

This dissertation investigates the statistical properties present within corpora of common practice music, involving a data set of more than 8,000 works spanning from 1650 to 1900, and focusing specifically on the properties of the chord progressions contained therein.

In the first chapter, methodologies concerning corpus analysis are presented and contrasted with text-based methodologies. It is argued that corpus analyses not only can show large-scale trends within data, but can empirically test and formalize traditional or inherited music theories, while also modeling corpora as a collection of discursive and communicative materials. Concerning the idea of corpus analysis as an analysis of discourse, literature concerning musical communication and learning is reviewed, and connections between corpus analysis and statistical learning are explored. After making this connection, we explore several problems with models of musical communication (e.g., music's composers and listeners likely use different cognitive models for their respective production and interpretation) and several implications of connecting corpora to cognitive models (e.g., a model's dependency on a particular historical situation).

Chapter 2 provides an overview of literature concerning computational musical analysis. The divide between top-down systems and bottom-up systems is discussed, and examples of each are reviewed. The chapter ends with an examination of more recent applications of information theory in music analysis.

Chapter 3 considers various ways corpora can be grouped as well as the implications those grouping techniques have on notions of musical style. It is hypothesized that the evolution of musical style can be modeled through the interaction of corpus statistics, chronological eras, and geographic contexts. This idea is tested by quantifying the probabilities of various composers' chord progressions, and cluster analyses are performed on these data. Various ways to divide and group corpora are considered, modeled, and tested.

In the fourth chapter, this dissertation investigates notions of harmonic vocabulary and syntax, hypothesizing that music involves syntactic regularity in much the same way as occurs in spoken languages. This investigation first probes this hypothesis through a corpus analysis of the Bach chorales, identifying potential syntactic/functional categories using a Hidden Markov Model. The analysis produces a three-function model as well as models with higher numbers of functions. In the end, the data suggest that music does indeed involve regularities, while also arguing for a definition of chord function that adds subtlety to models used by traditional music theory. A number of implications are considered, including the interaction of chord frequency and chord function, and the preeminence of triads in the resulting syntactic models.

Chapter 5 considers a particularly difficult problem of corpus analysis as it relates to musical vocabulary and syntax: the variegated and complex musical surface. One potential algorithm for vocabulary reduction is presented. This algorithm attempts to change each chord within an n-gram to its subset or superset that maximizes the probability of that trigram occurring. When a corpus of common-practice music is processed using this algorithm, a standard tertian chord vocabulary results, along with a bigram chord syntax that adheres to our intuitions concerning standard chord function.
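A minimal sketch of the subset/superset substitution idea summarised in this chapter description: chords are modelled as sets of pitch classes, and each chord in a trigram is replaced by whichever subset or superset from a candidate vocabulary makes the trigram most frequent. The toy chords, corpus, and raw-count scoring are illustrative assumptions, not the dissertation's data or implementation.

```python
# Hedged sketch of trigram-guided chord vocabulary reduction.
from collections import Counter

def candidates(chord, vocabulary):
    """Subsets or supersets of `chord` that appear in the reference vocabulary."""
    return [v for v in vocabulary if v <= chord or v >= chord]

def reduce_trigram(trigram, trigram_counts, vocabulary):
    """Replace each chord by the candidate that maximizes the trigram's count."""
    reduced = list(trigram)
    for i, chord in enumerate(trigram):
        def score(cand):
            trial = tuple(reduced[:i] + [cand] + reduced[i + 1:])
            return trigram_counts[trial]
        reduced[i] = max(candidates(chord, vocabulary), key=score, default=chord)
    return tuple(reduced)

# Toy data: a C major sonority with an added sixth gets reduced toward the plain
# triad because the triad version of the trigram is better attested.
C, Cadd6, F, G = (frozenset({0, 4, 7}), frozenset({0, 4, 7, 9}),
                  frozenset({5, 9, 0}), frozenset({7, 11, 2}))
corpus = [(F, G, C)] * 9 + [(F, G, Cadd6)]
trigram_counts = Counter(corpus)
vocabulary = {C, F, G}  # a standard tertian vocabulary assumed as the target

print(reduce_trigram((F, G, Cadd6), trigram_counts, vocabulary) == (F, G, C))  # True
```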

In the sixth chapter, this study probes the notion of musical key as it concerns communication, suggesting that if musical practice is constrained by its point in history and progressions of chords exhibit syntactic regularities, then one should be able to build a key-finding model that learns to identify key by observing some historically situated corpus. Such a model is presented, and is trained on the music of a variety of different historical periods. The model then analyzes two famous moments of musical ambiguity: the openings of Beethoven's Eroica and Wagner's prelude to Tristan und Isolde. The results confirm that different corpus-trained models produce subtly different behavior.

The dissertation ends by considering several general and summarizing issues, for instance the notion that there are many historically-situated tonal models within Western music history, and that the difference between listening and compositional models likely accounts for the gap between the complex statistics of the tonal tradition and traditional concepts in music theory.

4

Arad, Iris. "A quasi-statistical approach to automatic generation of linguistic knowledge". Thesis, University of Manchester, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.358872.

Full text
5

McMahon, John George Gavin. "Statistical language processing based on self-organising word classification". Thesis, Queen's University Belfast, 1994. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.241417.

Full text
6

Clark, Stephen. "Class-based statistical models for lexical knowledge acquisition". Thesis, University of Sussex, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.341541.

Full text
Abstract
This thesis is about the automatic acquisition of a particular kind of lexical knowledge, namely the knowledge of which noun senses can fill the argument slots of predicates. The knowledge is represented using probabilities, which agrees with the intuition that there are no absolute constraints on the arguments of predicates, but that the constraints are satisfied to a certain degree; thus the problem of knowledge acquisition becomes the problem of probability estimation from corpus data. The problem with defining a probability model in terms of senses is that this involves a huge number of parameters, which results in a sparse data problem. The proposal here is to define a probability model over senses in a semantic hierarchy, and exploit the fact that senses can be grouped into classes consisting of semantically similar senses. A novel class-based estimation technique is developed, together with a procedure that determines a suitable class for a sense (given a predicate and argument position). The problem of determining a suitable class can be thought of as finding a suitable level of generalisation in the hierarchy. The generalisation procedure uses a statistical test to locate areas consisting of semantically similar senses, and, as well as being used for probability estimation, is also employed as part of a re-estimation algorithm for estimating sense frequencies from incomplete data. The rest of the thesis considers how the lexical knowledge can be used to resolve structural ambiguities, and provides empirical evaluations. The estimation techniques are first integrated into a parse selection system, using a probabilistic dependency model to rank the alternative parses for a sentence. Then, a PP-attachment task is used to provide an evaluation which is more focussed on the class-based estimation technique, and, finally, a pseudo disambiguation task is used to compare the estimation technique with alternative approaches.
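The class-based estimation idea can be sketched briefly: counts are aggregated over a class in a semantic hierarchy and then shared uniformly among the senses in that class, so an unseen but semantically plausible sense still receives probability mass. The tiny hand-built hierarchy and invented counts below stand in for the thesis's resources, and the generalisation class is supplied by hand rather than chosen by the statistical test described above.

```python
# Illustrative sketch (not the thesis's implementation) of class-based estimation.
from collections import Counter

# child -> parent links of a toy sense hierarchy
parent = {"apple": "fruit", "pear": "fruit", "beef": "meat",
          "fruit": "food", "meat": "food", "food": None}

def members(cls):
    """Leaf senses dominated by a class."""
    leaves = [s for s in parent if s not in parent.values()]
    def dominated(s):
        while s is not None:
            if s == cls:
                return True
            s = parent[s]
        return False
    return [s for s in leaves if dominated(s)]

# Observed counts of senses in the object slot of "eat" (sparse: "pear" unseen)
counts = Counter({("eat", "apple"): 7, ("eat", "beef"): 3})
total = sum(counts.values())

def class_based_prob(verb, sense, cls):
    """P(sense | verb) via P(class | verb), spread uniformly over the class."""
    assert sense in members(cls)  # the sense only matters through its class membership
    class_count = sum(counts[(verb, s)] for s in members(cls))
    return (class_count / total) / len(members(cls))

# An unseen but plausible object gets probability mass from its class
print(round(class_based_prob("eat", "pear", "fruit"), 3))  # 0.35
print(round(class_based_prob("eat", "pear", "food"), 3))   # ~0.333 (coarser class)
```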
7

Lakeland, Corrin. "Lexical approaches to backoff in statistical parsing". University of Otago. Department of Computer Science, 2006. http://adt.otago.ac.nz./public/adt-NZDU20060913.134736.

Full text
Abstract
This thesis develops a new method for predicting probabilities in a statistical parser so that more sophisticated probabilistic grammars can be used. A statistical parser uses a probabilistic grammar derived from a training corpus of hand-parsed sentences. The grammar is represented as a set of constructions - in a simple case these might be context-free rules. The probability of each construction in the grammar is then estimated by counting its relative frequency in the corpus. A crucial problem when building a probabilistic grammar is to select an appropriate level of granularity for describing the constructions being learned. The more constructions we include in our grammar, the more sophisticated a model of the language we produce. However, if too many different constructions are included, then our corpus is unlikely to contain reliable information about the relative frequency of many constructions. In existing statistical parsers two main approaches have been taken to choosing an appropriate granularity. In a non-lexicalised parser constructions are specified as structures involving particular parts-of-speech, thereby abstracting over individual words. Thus, in the training corpus two syntactic structures involving the same parts-of-speech but different words would be treated as two instances of the same event. In a lexicalised grammar the assumption is that the individual words in a sentence carry information about its syntactic analysis over and above what is carried by its part-of-speech tags. Lexicalised grammars have the potential to provide extremely detailed syntactic analyses; however, Zipf's law makes it hard for such grammars to be learned. In this thesis, we propose a method for optimising the trade-off between informative and learnable constructions in statistical parsing. We implement a grammar which works at a level of granularity in between single words and parts-of-speech, by grouping words together using unsupervised clustering based on bigram statistics. We begin by implementing a statistical parser to serve as the basis for our experiments. The parser, based on that of Michael Collins (1999), contains a number of new features of general interest. We then implement a model of word clustering, which we believe is the first to deliver vector-based word representations for an arbitrarily large lexicon. Finally, we describe a series of experiments in which the statistical parser is trained using categories based on these word representations.
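A rough sketch of the intermediate granularity proposed here: words are represented by vectors of left/right bigram co-occurrence counts and grouped by unsupervised clustering, giving categories coarser than individual words but finer than parts-of-speech. The toy corpus and the choice of k-means are assumptions made for illustration; they are not Lakeland's actual clustering model.

```python
# Sketch: cluster words by their bigram context vectors.
import numpy as np
from sklearn.cluster import KMeans

corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "a cat ate the fish . a dog ate the bone .").split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}

# Each word's vector: counts of its left neighbours and of its right neighbours
vectors = np.zeros((len(vocab), 2 * len(vocab)))
for left, right in zip(corpus, corpus[1:]):
    vectors[index[right], index[left]] += 1                  # left context of `right`
    vectors[index[left], len(vocab) + index[right]] += 1     # right context of `left`

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(vectors)
clusters = {}
for word, label in zip(vocab, labels):
    clusters.setdefault(label, []).append(word)
print(clusters)  # e.g. {cat, dog} and {mat, rug, fish, bone} tend to group together
```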
8

Stymne, Sara. "Compound Processing for Phrase-Based Statistical Machine Translation". Licentiate thesis, Linköping : Department of Computer and Information Science, Linköpings universitet, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-51416.

Full text
9

Yamangil, Elif. "Rich Linguistic Structure from Large-Scale Web Data". Thesis, Harvard University, 2013. http://dissertations.umi.com/gsas.harvard:11162.

Full text
Abstract
The past two decades have shown an unexpected effectiveness of Web-scale data in natural language processing. Even the simplest models, when paired with unprecedented amounts of unstructured and unlabeled Web data, have been shown to outperform sophisticated ones. It has been argued that the effectiveness of Web-scale data has undermined the necessity of sophisticated modeling or laborious data set curation. In this thesis, we argue for and illustrate an alternative view, that Web-scale data not only serves to improve the performance of simple models, but also can allow the use of qualitatively more sophisticated models that would not be deployable otherwise, leading to even further performance gains.
Engineering and Applied Sciences
10

Phillips, Aaron B. "Modeling Relevance in Statistical Machine Translation: Scoring Alignment, Context, and Annotations of Translation Instances". Research Showcase @ CMU, 2012. http://repository.cmu.edu/dissertations/134.

Full text
Abstract
Machine translation has advanced considerably in recent years, primarily due to the availability of larger datasets. However, one cannot rely on the availability of copious, high-quality bilingual training data. In this work, we improve upon the state-of-the-art in machine translation with an instance-based model that scores each instance of translation in the corpus. A translation instance reflects a source and target correspondence at one specific location in the corpus. The significance of this approach is that our model is able to capture that some instances of translation are more relevant than others. We have implemented this approach in Cunei, a new platform for machine translation that permits the scoring of instance-specific features. Leveraging per-instance alignment features, we demonstrate that Cunei can outperform Moses, a widely-used machine translation system. We then expand on this baseline system in three principal directions, each of which shows further gains. First, we score the source context of a translation instance in order to favor those that are most similar to the input sentence. Second, we apply similar techniques to score the target context of a translation instance and favor those that are most similar to the target hypothesis. Third, we provide a mechanism to mark-up the corpus with annotations (e.g. statistical word clustering, part-of-speech labels, and parse trees) and then exploit this information to create additional per-instance similarity features. Each of these techniques explicitly takes advantage of the fact that our approach scores each instance of translation on demand after the input sentence is provided and while the target hypothesis is being generated; similar extensions would be impossible or quite difficult in existing machine translation systems. Ultimately, this approach provides a more flexible framework for integration of novel features that adapts better to new data. In our experiments with German-English and Czech-English translation, the addition of instance-specific features consistently shows improvement.
11

Madden, Joshua. "A statistical analysis of high-traffic websites". Thesis, Kansas State University, 2014. http://hdl.handle.net/2097/17650.

Full text
Abstract
Master of Science
Department of Journalism and Mass Communications
Steven Smethers
Although scholars have increasingly recognized the important role of the Internet within the field of mass communications, little research has been done analyzing the behavior of individuals online. The success or failure of a site is often dependent on the number of visitors it receives (often called “traffic”) and this includes newspapers that are attempting to direct larger audiences to their websites. Theoretical arguments have been made for certain factors (region, social media presence, backlinks, etc.) having a positive correlation with traffic, but few, if any, statistical analyses have been done on traffic patterns. This study looks at a sample of approximately 300 high-traffic websites and forms several regression models in order to analyze which factors are most highly correlated with Internet traffic and what the nature of that correlation is.
12

Tweedie, Fiona Jane. "A statistical investigation into the provenance of De Doctrina Christiana, attributed to John Milton". Thesis, University of the West of England, Bristol, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.364078.

Full text
Abstract
The aim of this study is to conduct an objective investigation into the provenance of De Doctrina Christiana, a theological treatise attributed to Milton since its discovery in 1823. This attribution was questioned in 1991, provoking a series of papers, one of which makes a plea for an objective analysis, which I aim to supply. I begin by reviewing critically some techniques that have recently been applied to stylometry. They include methods from artificial intelligence, linguistics and statistics. The chapter concludes with an investigation into the QSUM technique, finding it to be invalid. As De Doctrina Christiana is written in neo-Latin, I examine previous work carried out in Latin, then turn to historical matters, including censorship and the physical characteristics of the manuscript. The text is the only theological work in the extant Milton canon. As genre as well as authorship affects style, I consider theories of genre which influence the choice of suitable control texts. Chapter seven deals with the methodology used in the study. The analysis follows in a hierarchical structure. I establish which techniques distinguish between Milton and the control texts while maintaining the internal consistency of the authors. It is found that the most-frequently-occurring words are good discriminators. I then use this technique to examine De Doctrina Christiana and the Milton and control texts. A clear difference is found between texts from polemic and exegetical genres, and samples from De Doctrina Christiana form into two groups. This heterogeneity forms the third part of the analysis. No apparent difference is found between sections of the text with different amanuenses, but the Epistle appears to be markedly more Miltonic than the rest. In addition, postulated insertions into chapter X of Book I appear to have a Miltonic influence. I conclude by examining the hypothesis of a Ramist ordering to the text.
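The core move of using the most frequently occurring words as discriminators can be sketched in a few lines: samples are reduced to relative-frequency profiles over the commonest words (dominated by function words) and compared by distance. The two short "texts" below are invented English stand-ins, not the Latin corpora or the hierarchical analysis design of the thesis.

```python
# Toy sketch of most-frequent-word stylometry; all samples are invented stand-ins.
from collections import Counter
import math

def profile(text, features):
    words = text.lower().split()
    counts = Counter(words)
    return [counts[w] / len(words) for w in features]

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

milton_sample = "of man and of that fruit whose mortal taste brought death and woe"
control_sample = "the arms and the man I sing who first from the shores of troy came"
disputed = "and of the fruit of that tree whose taste brought woe to man"

# Feature set: the most frequent words across all samples (function words dominate)
all_words = (milton_sample + " " + control_sample + " " + disputed).lower().split()
features = [w for w, _ in Counter(all_words).most_common(8)]

d_milton = distance(profile(disputed, features), profile(milton_sample, features))
d_control = distance(profile(disputed, features), profile(control_sample, features))
print("closer to Milton sample" if d_milton < d_control else "closer to control")
```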
13

Joelsson, Jakob. "Translationese and Swedish-English Statistical Machine Translation". Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-305199.

Full text
Abstract
This thesis investigates how well machine learned classifiers can identify translated text, and the effect translationese may have in Statistical Machine Translation -- all in a Swedish-to-English, and reverse, context. Translationese is a term used to describe the dialect of a target language that is produced when a source text is translated. The systems trained for this thesis are SVM-based classifiers for identifying translationese, as well as translation and language models for Statistical Machine Translation. The classifiers successfully identified translationese in relation to non-translated text, and to some extent, also what source language the texts were translated from. In the SMT experiments, variation of the translation model was what affected the results the most in the BLEU evaluation. Systems configured with non-translated source text and translationese target text performed better than their reversed counterparts. The language model experiments showed that those trained on known translationese and classified translationese performed better than known non-translated text, though classified translationese did not perform as well as the known translationese. Ultimately, the thesis shows that translationese can be identified by machine learned classifiers and may affect the results of SMT systems.
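A minimal sketch of an SVM-based translationese classifier in the spirit described above, using scikit-learn. The example sentences and labels are invented placeholders; the thesis trains on Swedish-English data with its own feature set.

```python
# Toy translationese classifier: bag-of-ngrams features + linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "He considered the matter carefully before answering.",      # original English
    "She walked home through the quiet streets after work.",     # original English
    "It is in this connection important to underline the fact.", # translated (stilted)
    "One can from this draw the conclusion that it is so.",      # translated (stilted)
]
labels = ["original", "original", "translated", "translated"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["It is from this possible to draw the conclusion."]))
```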
14

Dehdari, Jonathan. "A Neurophysiologically-Inspired Statistical Language Model". The Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1399071363.

Full text
15

Filali, Karim. "Multi-dynamic Bayesian networks for machine translation and NLP /". Thesis, Connect to this title online; UW restricted, 2007. http://hdl.handle.net/1773/6857.

Full text
16

Ntantiso, Mzamo. "Exploring the statistical equivalence of the English and Xhosa versions of the Woodcock-Muñoz Language Survey". Thesis, Nelson Mandela Metropolitan University, 2009. http://hdl.handle.net/10948/d1018620.

Full text
Abstract
This study explored the statistical equivalence of the adapted Xhosa and English versions of the Woodcock-Muñoz Language Survey (WMLS) by investigating group differences on each subscale, in terms of mean scores, index reliability, and item characteristics for two language groups. A convenience quota sampling technique was used to select 188 Xhosa (n = 188) and 198 English (n = 198) learners from Grades 6 and 7 living in rural and urban Eastern Cape. The WMLS Xhosa and English versions were administered to learners in their first languages. Significant mean group differences were found, but differences were not found on the reliability indices or mean item characteristics. This pointed in the direction of statistical equivalence. However, scrutiny of the item characteristics of the individual items per subscale indicated possible problems at an item level that need to be investigated further with differential functioning analyses. Thus, stringent DIF analyses were suggested for future research on DIF items before the versions of the WMLS can be considered equivalent.
17

Su, Kim Nam. "Statistical modeling of multiword expressions". Connect to thesis, 2008. http://repository.unimelb.edu.au/10187/3147.

Full text
Abstract
In natural languages, words can occur in single units called simplex words or in a group of simplex words that function as a single unit, called multiword expressions (MWEs). Although MWEs are similar to simplex words in their syntax and semantics, they pose their own sets of challenges (Sag et al. 2002). MWEs are arguably one of the biggest roadblocks in computational linguistics due to the bewildering range of syntactic, semantic, pragmatic and statistical idiomaticity they are associated with, and their high productivity. In addition, the large numbers in which they occur demand specialized handling. Moreover, dealing with MWEs has a broad range of applications, from syntactic disambiguation to semantic analysis in natural language processing (NLP) (Wacholder and Song 2003; Piao et al. 2003; Baldwin et al. 2004; Venkatapathy and Joshi 2006).
Our goals in this research are: to use computational techniques to shed light on the underlying linguistic processes giving rise to MWEs across constructions and languages; to generalize existing techniques by abstracting away from individual MWE types; and finally to exemplify the utility of MWE interpretation within general NLP tasks.
In this thesis, we target English MWEs due to resource availability. In particular, we focus on noun compounds (NCs) and verb-particle constructions (VPCs) due to their high productivity and frequency.
Challenges in processing noun compounds are: (1) interpreting the semantic relation (SR) that represents the underlying connection between the head noun and modifier(s); (2) resolving syntactic ambiguity in NCs comprising three or more terms; and (3) analyzing the impact of word sense on noun compound interpretation. Our basic approach to interpreting NCs relies on the semantic similarity of the NC components using firstly a nearest-neighbor method (Chapter 5), then verb semantics based on the observation that it is often an underlying verb that relates the nouns in NCs (Chapter 6), and finally semantic variation within NC sense collocations, in combination with bootstrapping (Chapter 7).
Challenges in dealing with verb-particle constructions are: (1) identifying VPCs in raw text data (Chapter 8); and (2) modeling the semantic compositionality of VPCs (Chapter 5). We place particular focus on identifying VPCs in context, and measuring the compositionality of unseen VPCs in order to predict their meaning. Our primary approach to the identification task is to adapt localized context information derived from linguistic features of VPCs to distinguish between VPCs and simple verb-PP combinations. To measure the compositionality of VPCs, we use semantic similarity among VPCs by testing the semantic contribution of each component.
Finally, we conclude the thesis with a chapter-by-chapter summary and outline of the findings of our work, suggestions of potential NLP applications, and a presentation of further research directions (Chapter 9).
18

Bijleveld, Henny. "Linguistiche analysis van neurogeen stotteren". Doctoral thesis, Universite Libre de Bruxelles, 1999. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/211864.

Full text
19

Robinson, Cory S. "A Statistical Approach to Syllabic Alliteration in the Odyssean Aeneid". BYU ScholarsArchive, 2014. https://scholarsarchive.byu.edu/etd/4199.

Full text
Abstract
William Clarke (1976) and Nathan Greenberg (1980) offer an objective framework for the study of alliteration in Latin poetry. However, their definition of alliteration as word-initial sound repetition in a verse is inconsistent with the syllabic nature both of the device itself and of the metrical structure. The present study reconciles this disparity in the first half of the Aeneid by applying a similar method to syllable-initial sound repetition. A chi-square test for goodness-of-fit reveals that the distributions of the voiceless obstruents [p], [t], [k], [k^w], [f], and [s] and the sonorants [m], [n], [l], and [r] differ significantly from a Poisson model. These sounds generally occur twice per verse more often than expected, and three or more times per verse less often than expected. This finding is largely consistent with existing observations about Vergil's style (e.g. Clarke, 1976; Greenberg, 1980; Wilkinson, 1963). The regular association of phonetic features with differences in distribution suggests phonetic motivation for the practice.
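The statistical test described here can be sketched in a few lines: per-verse counts of a given sound are binned, expected counts are derived from a Poisson model whose rate is the sample mean, and a chi-square goodness-of-fit test is applied. The counts below are fabricated for illustration and are not data from the Aeneid study.

```python
# Chi-square goodness-of-fit of per-verse sound counts against a Poisson model.
import numpy as np
from scipy.stats import poisson, chisquare

# Number of verses containing the sound 0, 1, 2, and 3-or-more times (invented)
observed = np.array([420, 310, 200, 70])
n_verses = observed.sum()
lam = (observed * np.array([0, 1, 2, 3])).sum() / n_verses  # rough mean estimate

# Expected verse counts under Poisson(lam), folding the tail into the last bin
probs = [poisson.pmf(k, lam) for k in range(3)]
probs.append(1.0 - sum(probs))
expected = n_verses * np.array(probs)

# ddof=1 because lambda was estimated from the same data
stat, p_value = chisquare(observed, expected, ddof=1)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```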
20

Urieli, Assaf. "Robust French syntax analysis : reconciling statistical methods and linguistic knowledge in the Talismane toolkit". PhD thesis, Université Toulouse le Mirail - Toulouse II, 2013. http://tel.archives-ouvertes.fr/tel-01058143.

Full text
Abstract
In this thesis we explore robust statistical syntax analysis for French. Our main concern is to explore methods whereby the linguist can inject linguistic knowledge and/or resources into the robust statistical engine in order to improve results for specific phenomena. We first explore the dependency annotation schema for French, concentrating on certain phenomena. Next, we look into the various algorithms capable of producing this annotation, and in particular the transition-based parsing algorithm used in the rest of this thesis. After exploring supervised machine learning algorithms for NLP classification problems, we present the Talismane toolkit for syntax analysis, built within the framework of this thesis, including four statistical modules - sentence boundary detection, tokenisation, pos-tagging and parsing - as well as the various linguistic resources used for the baseline model, including corpora, lexicons and feature sets. Our first experiments attempt various machine learning configurations in order to identify the best baseline. We then look into improvements made possible by beam search and beam propagation. Finally, we present a series of experiments aimed at correcting errors related to specific linguistic phenomena, using targeted features. One of our innovations is the introduction of rules that can impose or prohibit certain decisions locally, thus bypassing the statistical model. We explore the usage of rules for errors that the features are unable to correct. Finally, we look into the enhancement of targeted features by large-scale linguistic resources, and in particular a semi-supervised approach using a distributional semantic resource.
21

Wright, Christopher M. "Using Statistical Methods to Determine Geolocation Via Twitter". TopSCHOLAR®, 2014. http://digitalcommons.wku.edu/theses/1372.

Full text
Abstract
With the ever-expanding usage of social media websites such as Twitter, it is possible to use statistical inquiries to estimate the geographic location of a person using solely the content of their tweets. In a study done in 2010, Zhiyuan Cheng was able to detect the location of a Twitter user within 100 miles of their actual location 51% of the time. While this may seem like an already significant finding, that study was done while Twitter was still finding its footing. In 2010, Twitter had 75 million registered users; as of March 2013, Twitter has around 500 million unique users. In this thesis, my own dataset was collected and, using Excel macros, compared against Cheng's results to see whether the findings have changed over the three years since his study. If Cheng's 51% can be reproduced more efficiently using a simpler methodology, this could have a significant impact on Homeland Security and cyber security measures.
22

Chan, Oscar. "Prosodic features for a maximum entropy language model". University of Western Australia. School of Electrical, Electronic and Computer Engineering, 2008. http://theses.library.uwa.edu.au/adt-WU2008.0244.

Full text
Abstract
A statistical language model attempts to characterise the patterns present in a natural language as a probability distribution defined over word sequences. Typically, they are trained using word co-occurrence statistics from a large sample of text. In some language modelling applications, such as automatic speech recognition (ASR), the availability of acoustic data provides an additional source of knowledge. This contains, amongst other things, the melodic and rhythmic aspects of speech referred to as prosody. Although prosody has been found to be an important factor in human speech recognition, its use in ASR has been limited. The goal of this research is to investigate how prosodic information can be employed to improve the language modelling component of a continuous speech recognition system. Because prosodic features are largely suprasegmental, operating over units larger than the phonetic segment, the language model is an appropriate place to incorporate such information. The prosodic features and standard language model features are combined under the maximum entropy framework, which provides an elegant solution to modelling information obtained from multiple, differing knowledge sources. We derive features for the model based on perceptually transcribed Tones and Break Indices (ToBI) labels, and analyse their contribution to the word recognition task. While ToBI has a solid foundation in linguistic theory, the need for human transcribers conflicts with the statistical model's requirement for a large quantity of training data. We therefore also examine the applicability of features which can be automatically extracted from the speech signal. We develop representations of an utterance's prosodic context using fundamental frequency, energy and duration features, which can be directly incorporated into the model without the need for manual labelling. Dimensionality reduction techniques are also explored with the aim of reducing the computational costs associated with training a maximum entropy model. Experiments on a prosodically transcribed corpus show that small but statistically significant reductions to perplexity and word error rates can be obtained by using both manually transcribed and automatically extracted features.
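A hedged sketch of the general idea of combining lexical and prosodic evidence in a single conditional model, with logistic regression standing in as the usual realization of a maximum-entropy classifier. The feature names (previous word, F0 rise, pause duration) and all values are invented; they are not the thesis's ToBI-based or automatically extracted features.

```python
# Toy conditional model over the next token given lexical and prosodic context.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each sample: context features; label: the next token ("</s>" = utterance end)
X = [
    {"prev=the": 1, "f0_rise": 0.1, "pause_ms": 20},
    {"prev=the": 1, "f0_rise": 0.9, "pause_ms": 350},
    {"prev=a": 1,   "f0_rise": 0.2, "pause_ms": 30},
    {"prev=a": 1,   "f0_rise": 0.8, "pause_ms": 400},
]
y = ["cat", "</s>", "dog", "</s>"]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)
# A long pause and F0 rise should push the model toward the end-of-utterance token
print(model.predict([{"prev=the": 1, "f0_rise": 0.85, "pause_ms": 380}]))
```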
23

Jarman, Jay. "Combining Natural Language Processing and Statistical Text Mining: A Study of Specialized Versus Common Languages". Scholar Commons, 2011. http://scholarcommons.usf.edu/etd/3166.

Full text
Abstract
This dissertation focuses on developing and evaluating hybrid approaches for analyzing free-form text in the medical domain. This research draws on natural language processing (NLP) techniques that are used to parse and extract concepts based on a controlled vocabulary. Once important concepts are extracted, additional machine learning algorithms, such as association rule mining and decision tree induction, are used to discover classification rules for specific targets. This multi-stage pipeline approach is contrasted with traditional statistical text mining (STM) methods based on term counts and term-by-document frequencies. The aim is to create effective text analytic processes by adapting and combining individual methods. The methods are evaluated on an extensive set of real clinical notes annotated by experts to provide benchmark results. There are two main research questions for this dissertation. First, can information (specialized language) be extracted from clinical progress notes that will represent the notes without loss of predictive information? Second, can classifiers be built for clinical progress notes that are represented by specialized language? Three experiments were conducted to answer these questions by investigating some specific challenges with regard to extracting information from the unstructured clinical notes and classifying documents that are so important in the medical domain. The first experiment addresses the first research question by focusing on whether relevant patterns within clinical notes reside more in the highly technical medically-relevant terminology or in the passages expressed by common language. The results from this experiment informed the subsequent experiments. It also shows that predictive patterns are preserved by preprocessing text documents with a grammatical NLP system that separates specialized language from common language, and that this is an acceptable method of data reduction for the purpose of STM. Experiments two and three address the second research question. Experiment two focuses on applying rule-mining techniques to the output of the information extraction effort from experiment one, with the ultimate goal of creating rule-based classifiers. There are several contributions of this experiment. First, it uses a novel approach to create classification rules from specialized language and to build a classifier. The data is split by classification and then rules are generated. Second, several toolkits were assembled to create the automated process by which the rules were created. Third, this automated process created interpretable rules and, finally, the resulting model provided good accuracy. The resulting performance was slightly lower than that of the classifier from experiment one, but had the benefit of interpretable rules. Experiment three focuses on using decision tree induction (DTI) for a rule-discovery approach to classification, which also addresses the second research question. DTI is another rule-centric method for creating a classifier. The contributions of this experiment are that DTI can be used to create an accurate and interpretable classifier using specialized language. Additionally, the resulting rule sets are simple and easily interpretable, as well as created using a highly automated process.
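The decision-tree-induction step can be sketched with scikit-learn: documents are represented by the presence of concepts assumed to have already been extracted by an NLP stage, and a small tree of human-readable rules is induced for a target label. The concept names, labels, and documents are invented and involve no real clinical data.

```python
# Toy decision-tree induction over extracted-concept features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

documents = [
    {"concept:smoker": 1, "concept:cough": 1},
    {"concept:smoker": 1, "concept:dyspnea": 1},
    {"concept:nonsmoker": 1, "concept:cough": 1},
    {"concept:nonsmoker": 1, "concept:headache": 1},
]
labels = ["at_risk", "at_risk", "not_at_risk", "not_at_risk"]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(documents)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)

# The induced rules remain human-readable, which is the point of the approach
print(export_text(tree, feature_names=list(vec.get_feature_names_out())))
```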
24

Corradini, Ryan Arthur. "A Hybrid System for Glossary Generation of Feature Film Content for Language Learning". BYU ScholarsArchive, 2010. https://scholarsarchive.byu.edu/etd/2238.

Full text
Abstract
This report introduces a suite of command-line tools created to assist content developers with the creation of rich supplementary material to use in conjunction with feature films and other video assets in language teaching. The tools are intended to leverage open-source corpora and software (the OPUS OpenSubs corpus and the Moses statistical machine translation system, respectively), but are written in a modular fashion so that other resources could be leveraged in their place. The completed tool suite facilitates three main tasks, which together constitute this project. First, several scripts created for use in preparing linguistic data for the system are discussed. Next, a set of scripts are described that together leverage the strengths of both terminology management and statistical machine translation to provide candidate translation entries for terms of interest. Finally, a tool chain and methodology are given for enriching the terminological data store based on the output of the machine translation process, thereby enabling greater accuracy and efficiency with each subsequent application.
25

Packer, Thomas L. "Surface Realization Using a Featurized Syntactic Statistical Language Model". Diss., 2006. http://contentdm.lib.byu.edu/ETD/image/etd1195.pdf.

Full text
26

Murakami, Akira. "Individual variation and the role of L1 in the L2 development of English grammatical morphemes : insights from learner corpora". Thesis, University of Cambridge, 2014. https://www.repository.cam.ac.uk/handle/1810/254430.

Full text
Abstract
The overarching goal of the dissertation is to illustrate the relevance of learner corpus research to the field of second language acquisition (SLA). The possibility that learner corpora can be useful in mainstream SLA research has a significant implication given that they have not been systematically explored in relation to SLA theories. The thesis contributes to building a methodological framework for utilizing learner corpora in ways that benefit SLA and argues that learner corpus research contributes to other disciplines. This is achieved by a series of case studies that quantitatively analyze individual variation and the role of native language (L1) in second language (L2) development of English grammatical morphemes and explain the findings with existing SLA theories. The dissertation investigates the L2 development of morphemes based on two large-scale learner corpora. It first reviews the literature and points out that the L2 acquisition order of English grammatical morphemes that has been believed universal in SLA research may, in fact, vary across learners with different L1 backgrounds and that individual differences in morpheme studies have been relatively neglected in previous literature. The present research, thus, provides empirical evidence testing the universality of the order and the extent of individual differences. In the first study, the thesis investigates L1 influence on the L2 acquisition order of six English grammatical morphemes across seven L1 groups and five proficiency levels. Data drawn from approximately 12,000 essays from the Cambridge Learner Corpus establish clear L1 influence on this issue. The study also reveals that learners without the equivalent morpheme in L1 tend to achieve an accuracy level of below 90% with respect to the morpheme even at the highest proficiency level, and that morphemes requiring learners to learn to pay attention to the relevant distinctions in their acquisition show a stronger effect of L1 than those which only require new form-meaning mappings. The findings are interpreted under the framework of thinking-for-speaking proposed by Dan Slobin. Following the first study, the dissertation exploits EF-Cambridge Open Language Database (EFCamDat) and analyzes the developmental patterns of morphemes, L1 influence on the patterns, and the extent to which individual variation is observed in the development. Based on approximately 140,000 essays written by 46,700 learners of 10 L1 groups across a wide range of proficiency levels, the study found that (i) certain developmental patterns of accuracy are observed irrespective of target morphemes, (ii) inverted U-shaped development is rare irrespective of morphemes, (iii) proficiency influences the within-learner developmental patterns of morphemes, (iv) the developmental patterns at least slightly vary depending on morphemes, and (v) significant individual variation is observed in absolute accuracy, the accuracy difference between morphemes, and the rate of development. The findings are interpreted with dynamic systems theory (DST), a theory of development that has recently been applied to SLA research. The thesis further examines whether any systematic relationship is observed between the developmental patterns of morphemes. Although DST expects that their development is interlinked, the study did not find any strong relationships between the developmental patterns. However, it revealed a weak supportive relationship in the developmental pattern between articles and plural -s. That is, within individual learners, when the accuracy of articles increases, the accuracy of plural -s tends to increase as well, and vice versa.
27

Bernhardsson, Sebastian. "Structures in complex systems : Playing dice with networks and books". Doctoral thesis, Umeå universitet, Institutionen för fysik, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-27694.

Full text
Abstract
Complex systems are neither perfectly regular nor completely random. They consist of a multitude of players who, in many cases, play together in a way that makes their combined strength greater than the sum of their individual achievements. It is often very effective to represent these systems as networks where the actual connections between the players take on a crucial role. Networks exist all around us and are an important part of our world, from the protein machinery inside our cells to social interactions and man-made communication systems. Many of these systems have developed over a long period of time and are constantly undergoing changes driven by complicated microscopic events. These events are often too complicated for us to accurately resolve, making the world seem random and unpredictable. There are, however, ways of using this unpredictability in our favor by replacing the true events by much simpler stochastic rules giving effectively the same outcome. This allows us to capture the macroscopic behavior of the system, to extract important information about the dynamics of the system and learn about the reason for what we observe. Statistical mechanics gives the tools to deal with such large systems driven by underlying random processes under various external constraints, much like how intracellular networks are driven by random mutations under the constraint of natural selection. This similarity makes it interesting to combine the two and to apply some of the tools provided by statistical mechanics to biological systems. In this thesis, several null models are presented, with this viewpoint in mind, to capture and explain different types of structural properties of real biological networks. The most recent major transition in evolution is the development of language, both spoken and written. This thesis also brings up the subject of quantitative linguistics through the eyes of a physicist, here called linguaphysics. Also in this case the data is analyzed with an assumption of an underlying randomness. It is shown that some statistical properties of books, previously thought to be universal, turn out to exhibit author-specific size dependencies. A meta book theory is put forward which explains this dependency by describing the writing of a text as pulling a section out of a huge, individual, abstract mother book.
28

Botha, Gerrit Reinier. "Text-based language identification for the South African languages". Pretoria : [s.n.], 2007. http://upetd.up.ac.za/thesis/available/etd-090942008-133715/.

Full text
29

Eklund, Robert. "A Probabilistic Tagging Module Based on Surface Pattern Matching". Thesis, Stockholm University, Department of Computational Linguistics, Institute of Linguistics, 1993. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-135294.

Full text
Abstract
A problem with automatic tagging and lexical analysis is that it is never 100 % accurate. In order to arrive at better figures, one needs to study the character of what is left untagged by automatic taggers. In this paper untagged residue outputted by the automatic analyser SWETWOL (Karlsson 1992) at Helsinki is studied. SWETWOL assigns tags to words in Swedish texts mainly through dictionary lookup. The contents of the untagged residue files are described and discussed, and possible ways of solving different problems are proposed. One method of tagging residual output is proposed and implemented: the left-stripping method, through which untagged words are stripped of their left-most letters, searched in a dictionary, and if found, tagged according to the information found in the said dictionary. If the stripped word is not found in the dictionary, a match is searched in ending lexica containing statistical information about word classes associated with that particular word form (i.e., final letter cluster, be this a grammatical suffix or not), and the relative frequency of each word class. If a match is found, the word is given graduated tagging according to the statistical information in the ending lexicon. If a match is not found, the word is stripped of what is now its left-most letter and is recursively searched in a dictionary and ending lexica (in that order). The ending lexica employed in this paper are retrieved from a reversed version of Nusvensk Frekvensordbok (Allén 1970), and contain endings of between one and seven letters. The contents of the ending lexica are to a certain degree described and discussed. The programs working according to the principles described are run on files of untagged residual output. Appendices include, among other things, LISP source code, untagged and tagged files, the ending lexica containing one and two letter endings and excerpts from ending lexica containing three to seven letters.
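A minimal sketch of the left-stripping idea in Python (the thesis itself works in LISP, as the appendices indicate). The tiny dictionary and ending lexicon are invented stand-ins for SWETWOL's lexicon and the ending lexica derived from Nusvensk Frekvensordbok, and the lookup order is slightly simplified.

```python
# Toy left-stripping tagger: dictionary lookup, then ending lexica (longest
# ending first), then strip the left-most letter and try again.
dictionary = {"bil": ["NOUN"], "hus": ["NOUN"], "springa": ["VERB", "NOUN"]}
ending_lexicon = {  # ending -> relative frequency of word classes (assumed numbers)
    "are": {"NOUN": 0.7, "ADJ": 0.3},
    "de": {"VERB": 0.8, "ADJ": 0.2},
    "a": {"VERB": 0.6, "NOUN": 0.4},
}

def left_strip_tag(word, max_ending=7):
    current = word.lower()
    while current:
        if current in dictionary:                       # 1. dictionary lookup
            return {"word": word, "tags": dictionary[current], "via": current}
        for n in range(min(max_ending, len(current)), 0, -1):
            ending = current[-n:]                       # 2. ending lexica, longest first
            if ending in ending_lexicon:
                return {"word": word, "tags": ending_lexicon[ending], "via": "-" + ending}
        current = current[1:]                           # 3. strip left-most letter, retry
    return {"word": word, "tags": None, "via": None}

print(left_strip_tag("sportbil"))   # found via the dictionary entry "bil"
print(left_strip_tag("blinkade"))   # graduated tags via the ending "-de"
```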
30

Saers, Markus. "Translation as Linear Transduction : Models and Algorithms for Efficient Learning in Statistical Machine Translation". Doctoral thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-135704.

Full text
Abstract
Automatic translation has seen tremendous progress in recent years, mainly thanks to statistical methods applied to large parallel corpora. Transductions represent a principled approach to modeling translation, but existing transduction classes are either not expressive enough to capture structural regularities between natural languages or too complex to support efficient statistical induction on a large scale. A common approach is to severely prune search over a relatively unrestricted space of transduction grammars. These restrictions are often applied at different stages in a pipeline, with the obvious drawback of committing to irrevocable decisions that should not have been made. In this thesis we will instead restrict the space of transduction grammars to a space that is less expressive, but can be efficiently searched. First, the class of linear transductions is defined and characterized. They are generated by linear transduction grammars, which represent the natural bilingual case of linear grammars, as well as the natural linear case of inversion transduction grammars (and higher order syntax-directed transduction grammars). They are recognized by zipper finite-state transducers, which are equivalent to finite-state automata with four tapes. By allowing this extra dimensionality, linear transductions can represent alignments that finite-state transductions cannot, and by keeping the mechanism free of auxiliary storage, they become much more efficient than inversion transductions. Secondly, we present an algorithm for parsing with linear transduction grammars that allows pruning. The pruning scheme imposes no restrictions a priori, but guides the search to potentially interesting parts of the search space in an informed and dynamic way. Being able to parse efficiently allows learning of stochastic linear transduction grammars through expectation maximization. All the above work would be for naught if linear transductions were too poor a reflection of the actual transduction between natural languages. We test this empirically by building systems based on the alignments imposed by the learned grammars. The conclusion is that stochastic linear inversion transduction grammars learned from observed data stand up well to the state of the art.
31

Pettersson, Eva. "Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction". Doctoral thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-269753.

Full text
Abstract
Historical text constitutes a rich source of information for historians and other researchers in humanities. Many texts are however not available in an electronic format, and even if they are, there is a lack of NLP tools designed to handle historical text. In my thesis, I aim to provide a generic workflow for automatic linguistic analysis and information extraction from historical text, with spelling normalisation as a core component in the pipeline. In the spelling normalisation step, the historical input text is automatically normalised to a more modern spelling, enabling the use of existing taggers and parsers trained on modern language data in the succeeding linguistic analysis step. In the final information extraction step, certain linguistic structures are identified based on the annotation labels given by the NLP tools, and ranked in accordance with the specific information need expressed by the user. An important consideration in my implementation is that the pipeline should be applicable to different languages, time periods, genres, and information needs by simply substituting the language resources used in each module. Furthermore, the reuse of existing NLP tools developed for the modern language is crucial, considering the lack of linguistically annotated historical data combined with the high variability in historical text, making it hard to train NLP tools specifically aimed at analysing historical text. In my evaluation, I show that spelling normalisation can be a very useful technique for easy access to historical information content, even in cases where there is little (or no) annotated historical training data available. For the specific information extraction task of automatically identifying verb phrases describing work in Early Modern Swedish text, 91 out of the 100 top-ranked instances are true positives in the best setting.
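A small sketch of the spelling-normalisation step described above, assuming a hand-made substitution lexicon with a fuzzy match against a modern word list as fallback, so that modern taggers and parsers can be applied afterwards. The example spellings and lexica are invented and only loosely Swedish-flavoured; they are not the thesis's resources or method.

```python
# Toy normaliser: direct lookup, then fuzzy fallback against a modern vocabulary.
import difflib

normalisation_lexicon = {"hafva": "hava", "quinna": "kvinna"}
modern_vocabulary = ["hava", "kvinna", "och", "gick", "hem", "skog"]

def normalise(token):
    if token in normalisation_lexicon:                 # 1. direct lookup
        return normalisation_lexicon[token]
    close = difflib.get_close_matches(token, modern_vocabulary, n=1, cutoff=0.6)
    return close[0] if close else token                # 2. fuzzy fallback, else keep

historical_sentence = "quinnan gick hem genom skogen".split()
print([normalise(tok) for tok in historical_sentence])
# e.g. ['kvinna', 'gick', 'hem', 'genom', 'skog'] -- note the over-eager match on
# 'skogen', a known limitation of a naive fuzzy fallback
```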
32

Barnhart, Zachary. "A Comparative Analysis of Web-based Machine Translation Quality: English to French and French to English". Thesis, University of North Texas, 2012. https://digital.library.unt.edu/ark:/67531/metadc177176/.

Texto completo
Resumen
This study offers a partial replication of a 2006 study by Williams, which focused primarily on the analysis of the quality of translation produced by online software, namely Yahoo!® Babelfish, Freetranslation.com, and Google Translate. Since the data for the study by Williams were collected in 2004 and the data for the present study in 2012, this gives a lapse of eight years for a diachronic analysis of the differences in quality of the translations provided by these online services. At the time of the 2006 study by Williams, all three services used a rule-based translation system; in October 2007, however, Google Translate switched to a system that is entirely statistical in nature. Thus, the present study is also able to examine the differences in quality between contemporary statistical and rule-based approaches to machine translation.
33

Williams, Jake Ryland. "Lexical mechanics: Partitions, mixtures, and context". ScholarWorks @ UVM, 2015. http://scholarworks.uvm.edu/graddis/346.

Texto completo
Resumen
Highly structured for efficient communication, natural languages are complex systems. Unlike in their computational cousins, functions and meanings in natural languages are relative, frequently prescribed to symbols through unexpected social processes. Despite grammar and definition, the presence of metaphor can leave unwitting language users "in the dark," so to speak. This is not problematic, but rather an important operational feature of languages, since the lifting of meaning onto higher-order structures allows individuals to compress descriptions of regularly-conveyed information. This compressed terminology, often only appropriate when taken locally (in context), is beneficial in an enormous world of novel experience. However, what is natural for a human to process can be tremendously difficult for a computer. When a sequence of words (a phrase) is to be taken as a unit, suppose the choice of words in the phrase is subordinate to the choice of the phrase, i.e., there exists an inter-word dependence owed to membership within a common phrase. This word selection process is not one of independent selection, and so is capable of generating word-frequency distributions that are not accessible via independent selection processes. We have shown in Ch. 2 through analysis of thousands of English texts that empirical word-frequency distributions possess these word-dependence anomalies, while phrase-frequency distributions do not. In doing so, this study has also led to the development of a novel, general, and mathematical framework for the generation of frequency data for phrases, opening up the field of mass-preserving mesoscopic lexical analyses. A common oversight in many studies of the generation and interpretation of language is the assumption that separate discourses are independent. However, even when separate texts are each produced by means of independent word selection, it is possible for their composite distribution of words to exhibit dependence. Succinctly, different texts may use a common word or phrase for different meanings, and so exhibit disproportionate usages when juxtaposed. To support this theory, we have shown in Ch. 3 that the act of combining distinct texts to form large 'corpora' results in word-dependence irregularities. This not only settles a 15-year discussion, challenging the current major theory, but also highlights an important practice necessary for successful computational analysis---the retention of meaningful separations in language. We must also consider how language speakers and listeners navigate such a combinatorially vast space for meaning. Dictionaries (or, the collective editorial communities behind them) are smart. They know all about the lexical objects they define, but we ask about the latent information they hold, or should hold, about related, undefined objects. Based solely on the text as data, in Ch. 4 we build on our result in Ch. 2 and develop a model of context defined by the structural similarities of phrases. We then apply this model to define measures of meaning in a corpus-guided experiment, computationally detecting entries missing from a massive, collaborative online dictionary known as the Wiktionary.
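As a rough illustration of the word-versus-phrase contrast described above, the sketch below compares a word rank-frequency count with a phrase rank-frequency count obtained from a crude partition of text at punctuation. The partitioning rule is only an assumption for the example; it is not the partition model developed in the thesis.

import re
from collections import Counter

def word_frequencies(text):
    """Count individual words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def phrase_frequencies(text):
    """Partition the text at punctuation and count whole phrases as single units."""
    phrases = [p.strip().lower() for p in re.split(r"[.,;:!?]", text) if p.strip()]
    return Counter(phrases)

text = ("So to speak, the lifting of meaning onto higher-order structures "
        "allows individuals to compress descriptions, so to speak.")

for label, counts in (("words", word_frequencies(text)),
                      ("phrases", phrase_frequencies(text))):
    print(label, counts.most_common(3))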
34

Gispert, Ramis Adrià. "Introducing linguistic knowledge into statistical machine translation". Doctoral thesis, Universitat Politècnica de Catalunya, 2007. http://hdl.handle.net/10803/6902.

Texto completo
Resumen
This thesis is devoted to the study of the use of morphosyntactic information in the framework of stochastic machine translation systems, with the goal of improving their quality by incorporating linguistic information beyond the surface symbolic level of words.

The stochastic translation system used in this work follows a tuple-based approach; tuples are bilingual units that allow a joint-probability translation model to be estimated by combining, within a log-linear framework, n-gram chains and additional feature functions. A detailed study of this approach is presented, including its transformation from an X-gram implementation based on finite-state automata, more oriented towards speech translation, to the current n-gram solution oriented towards large-vocabulary text translation. The thesis also studies the training and decoding phases, as well as performance on different tasks (varying corpus size or language pair) and the main problems revealed by error analyses.

The thesis also investigates the incorporation of linguistic information specifically into word alignment. An extension, through verb form classification, of a co-occurrence-based word-to-word alignment algorithm is proposed, with positive results. Likewise, the impact on alignment and translation quality obtained through morphological tagging, lemmatisation, verb form classification and stemming of the parallel text is evaluated empirically.

Regarding the translation model, a treatment of verb forms through an additional instantiation model is proposed, and experiments are carried out in the English-to-Spanish direction. The thesis also introduces a language model over target-side morphological tags to address agreement problems. Finally, the impact of morphological derivation on the n-gram-based formulation of stochastic translation is studied, empirically evaluating the possible gain from morphological reduction strategies.
This Ph.D. thesis dissertation addresses the use of morphosyntactic information in order to improve the performance of Statistical Machine Translation (SMT) systems, providing them with additional linguistic information beyond the surface level of words from parallel corpora.
The statistical machine translation system in this work follows a tuple-based approach, modelling joint-probability translation models via a log-linear combination of bilingual n-grams with additional feature functions. A detailed study of the approach is conducted. This includes its initial development from a speech-oriented Finite-State Transducer architecture implementing X-grams towards a large-vocabulary, text-oriented n-gram implementation, training and decoding particularities, portability across language pairs and tasks, and the main difficulties revealed in error analyses.

The use of linguistic knowledge to improve word alignment quality is also studied. A co-occurrence-based one-to-one word alignment algorithm is extended with verb form classification, with successful results. Additionally, we evaluate the impact on word alignment and translation quality of Part-Of-Speech tagging, base forms, verb form classification and stemming, using state-of-the-art word alignment tools.



Furthermore, the thesis proposes a translation model tackling verb form generation through an additional verb instance model, reporting experiments on English-to-Spanish tasks. Agreement errors are addressed by incorporating a target-side Part-Of-Speech language model. Finally, we study the impact of morphological derivation on the Ngram-based SMT formulation, empirically evaluating the quality gain achievable via morphology reduction.
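The log-linear combination mentioned in the abstracts above can be illustrated in a few lines of Python: each candidate translation is scored as a weighted sum of feature functions (bilingual n-gram model, target language model, and so on) and the highest-scoring candidate wins. The feature values and weights below are invented for the example; in a real system they come from trained models and weight tuning.

def loglinear_score(features, weights):
    """Log-linear model: score = sum over i of lambda_i * h_i(candidate)."""
    return sum(weights[name] * value for name, value in features.items())

# Invented feature values (log-probabilities / penalties) for two candidate translations.
candidates = {
    "traduccion automatica estadistica": {
        "bilingual_ngram": -4.2, "target_lm": -6.1, "word_penalty": -3.0},
    "estadistica automatica traduccion": {
        "bilingual_ngram": -5.0, "target_lm": -9.4, "word_penalty": -3.0},
}
weights = {"bilingual_ngram": 1.0, "target_lm": 0.6, "word_penalty": 0.1}

best = max(candidates, key=lambda c: loglinear_score(candidates[c], weights))
print(best, loglinear_score(candidates[best], weights))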
35

Hoang, Hieu. "Improving statistical machine translation with linguistic information". Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5781.

Texto completo
Resumen
Statistical machine translation (SMT) should benefit from linguistic information to improve performance, but current state-of-the-art systems rely purely on data-driven models. There are several reasons why prior efforts to build linguistically annotated models have failed or not even been attempted. Firstly, the practical implementation often requires too much work to be cost effective. Secondly, where ad-hoc implementations have been created, they impose constraints that are too strict to be of general use. Lastly, many linguistically motivated approaches are language dependent, tackling peculiarities of certain languages that do not apply to other languages. This thesis successfully integrates linguistic information about part-of-speech tags, lemmas and phrase structure to improve MT quality. The major contributions of this thesis are: 1. We enhance the phrase-based model to incorporate linguistic information as additional factors in the word representation. The factored phrase-based model allows us to make use of different types of linguistic information in a systematic way within the predefined framework. We show how this model improves translation by as much as 0.9 BLEU for small German-English training corpora, and 0.2 BLEU for larger corpora. 2. We extend the factored model to the factored template model to focus on improving reordering. We show that by generalising translation with part-of-speech tags, we can improve performance by as much as 1.1 BLEU on a small French-English system. 3. Finally, we switch from the phrase-based model to a syntax-based model with the mixed syntax model. This allows us to transition from word-level approaches using factors to multiword linguistic information such as syntactic labels and shallow tags. The mixed syntax model uses source language syntactic information to inform translation. We show that the model is able to explain translation better, leading to a 0.8 BLEU improvement over the baseline hierarchical phrase-based model for a small German-English task. Also, the model requires only labels on continuous source spans and is not dependent on a tree structure; therefore, other types of syntactic information can be integrated into the model. We experimented with a shallow parser and saw a gain of 0.5 BLEU for the same dataset. Training with more data, we improve translation by 0.6 BLEU (1.3 BLEU out-of-domain) over the hierarchical baseline. During the development of these three models, we discovered that attempting to rigidly model translation as a linguistic transfer process results in degraded performance. However, by combining the advantages of standard SMT models with linguistically motivated models, we are able to achieve better translation performance. Our work shows the importance of balancing the specificity of linguistic information with the robustness of simpler models.
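The factored representation in contribution 1 can be pictured as follows: each token carries surface, lemma and part-of-speech factors, and translation can back off to a less sparse factor (here the POS sequence) when the surface phrase is unseen. The toy phrase tables are invented and the back-off rule is deliberately simplified; a full decoder integrates such factors into search rather than a plain lookup.

from typing import NamedTuple

class Token(NamedTuple):
    surface: str
    lemma: str
    pos: str

# Toy phrase tables keyed on different factors (invented entries).
SURFACE_TABLE = {("das", "haus"): "the house"}
POS_TABLE = {("ART", "NN"): "DET NOUN"}   # generalised template over POS tags

def translate_phrase(tokens):
    """Try the surface factor first, then back off to the POS factor."""
    surface_key = tuple(t.surface for t in tokens)
    if surface_key in SURFACE_TABLE:
        return SURFACE_TABLE[surface_key]
    pos_key = tuple(t.pos for t in tokens)
    return POS_TABLE.get(pos_key, "<unk>")

seen = [Token("das", "das", "ART"), Token("haus", "haus", "NN")]
unseen = [Token("ein", "ein", "ART"), Token("garten", "garten", "NN")]
print(translate_phrase(seen))    # surface match: 'the house'
print(translate_phrase(unseen))  # falls back to the POS template: 'DET NOUN'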
36

Zbib, Rabih M. (Rabih Mohamed) 1974. "Using linguistic knowledge in statistical machine translation". Thesis, Massachusetts Institute of Technology, 2010. http://hdl.handle.net/1721.1/62391.

Texto completo
Resumen
Thesis (Ph. D. in Information Technology)--Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 2010.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 153-162).
In this thesis, we present methods for using linguistically motivated information to enhance the performance of statistical machine translation (SMT). One of the advantages of the statistical approach to machine translation is that it is largely language-agnostic. Machine learning models are used to automatically learn translation patterns from data. SMT can, however, be improved by using linguistic knowledge to address specific areas of the translation process, where translations would be hard to learn fully automatically. We present methods that use linguistic knowledge at various levels to improve statistical machine translation, focusing on Arabic-English translation as a case study. In the first part, morphological information is used to preprocess the Arabic text for Arabic-to-English and English-to-Arabic translation, which reduces the gap in the complexity of the morphology between Arabic and English. The second method addresses the issue of long-distance reordering in translation to account for the difference in the syntax of the two languages. In the third part, we show how additional local context information on the source side is incorporated, which helps reduce lexical ambiguity. Two methods are proposed for using binary decision trees to control the amount of context information introduced. These methods are successfully applied to the use of diacritized Arabic source in Arabic-to-English translation. The final method combines the outputs of an SMT system and a Rule-based MT (RBMT) system, taking advantage of the flexibility of the statistical approach and the rich linguistic knowledge embedded in the rule-based MT system.
by Rabih M. Zbib.
Ph.D. in Information Technology
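The morphological preprocessing idea in the first part of the abstract above can be illustrated with a toy clitic splitter: common Arabic proclitics and enclitics (shown here in rough Buckwalter-style transliteration) are separated from the stem so that source token granularity moves closer to English. The lists and the greedy matching are purely illustrative; a real system would rely on a morphological analyser rather than string matching.

# Illustrative clitic lists in rough Buckwalter-style transliteration.
PROCLITICS = ["w+", "f+", "b+", "l+", "Al+"]
ENCLITICS = ["+hm", "+hA", "+h", "+y"]

def segment(token):
    """Greedily strip at most one proclitic and one enclitic, keeping them as tokens."""
    out = []
    for p in PROCLITICS:
        stem = p.rstrip("+")
        if token.startswith(stem) and len(token) > len(stem) + 1:
            out.append(p)
            token = token[len(stem):]
            break
    suffix = None
    for e in ENCLITICS:
        stem = e.lstrip("+")
        if token.endswith(stem) and len(token) > len(stem) + 1:
            suffix = e
            token = token[:-len(stem)]
            break
    out.append(token)
    if suffix:
        out.append(suffix)
    return out

print(segment("wktAbhm"))  # roughly 'and their book' -> ['w+', 'ktAb', '+hm']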
37

Lindgren, Anna. "Semi-Automatic Translation of Medical Terms from English to Swedish : SNOMED CT in Translation". Thesis, Linköpings universitet, Medicinsk informatik, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-69736.

Texto completo
Resumen
The Swedish National Board of Health and Welfare has been overseeing translations of the international clinical terminology SNOMED CT from English to Swedish. This study was performed to determine whether semi-automatic methods of translation could produce a satisfactory translation while requiring fewer resources than manual translation. Using the medical English-Swedish dictionary TermColl, translations of selected subsets of SNOMED CT were produced by way of translation memory and statistical translation. The resulting translations were evaluated via BLEU score, using translations provided by the Swedish National Board of Health and Welfare as reference, before being compared with each other. The results showed a strong advantage for statistical translation over the use of a translation memory; however, overall translation results were far from satisfactory.
The international clinical terminology SNOMED CT has been translated from English to Swedish under the responsibility of the Swedish National Board of Health and Welfare (Socialstyrelsen). This study was carried out to establish whether semi-automatic translation methods could produce sufficiently good translations with fewer resources than manual translation. The English-Swedish medical dictionary TermColl was used as the basis for translating subsets of SNOMED CT via translation memory and statistical translation. With Socialstyrelsen's translations as reference, the semi-automatic translations were scored with BLEU. The results showed that statistical translation gave considerably better results than translation with a translation memory, but overall the results were too poor for semi-automatic translation to be recommended in this case.
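The translation-memory side of the comparison can be sketched as a fuzzy lookup over previously translated segments: the closest stored source segment above a similarity threshold contributes its target. The memory entries below are invented examples (not TermColl or SNOMED CT data, and the Swedish is only approximate), and real systems use more sophisticated matching than a character-level ratio.

from difflib import SequenceMatcher

# Invented translation memory: English source segment -> Swedish target segment.
MEMORY = {
    "fracture of femur": "fraktur på lårbenet",
    "disorder of lung": "lungsjukdom",
}

def tm_lookup(source, threshold=0.8):
    """Return the stored translation of the most similar memory entry, if close enough."""
    best_src, best_ratio = None, 0.0
    for src in MEMORY:
        ratio = SequenceMatcher(None, source.lower(), src).ratio()
        if ratio > best_ratio:
            best_src, best_ratio = src, ratio
    if best_ratio >= threshold:
        return MEMORY[best_src], round(best_ratio, 2)
    return None, round(best_ratio, 2)

print(tm_lookup("Fracture of the femur"))   # close match, stored translation reused
print(tm_lookup("Diabetes mellitus"))       # no sufficiently similar entry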
38

Linardaki, Evita. "Linguistic and statistical extensions of data oriented parsing". Thesis, University of Essex, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.434401.

Texto completo
39

Panesar, Kulvinder. "Conversational artificial intelligence - demystifying statistical vs linguistic NLP solutions". Universitat Politécnica de Valéncia, 2020. http://hdl.handle.net/10454/18121.

Texto completo
Resumen
This paper aims to demystify the hype and attention surrounding chatbots and their association with conversational artificial intelligence. Both are slowly emerging as a real presence in our lives, driven by impressive technological developments in machine learning, deep learning and natural language understanding solutions. However, our question is what lies under the hood, and how far and to what extent chatbot and conversational artificial intelligence solutions can actually work. Natural language is the most easily understood knowledge representation for people, but certainly not the best for computers, because of its inherently ambiguous, complex and dynamic nature. We will critique the knowledge representation of heavily statistical chatbot solutions against linguistic alternatives. In order to react intelligently to the user, natural language solutions must critically consider other factors such as context, memory, intelligent understanding, previous experience, and personalised knowledge of the user. We will delve into the spectrum of conversational interfaces and focus on a strong artificial intelligence concept. This is explored via a text-based conversational software agent with a deep strategic role: to hold a conversation and provide the mechanisms needed to plan, decide what to do next, and manage the dialogue to achieve a goal. To demonstrate this, a deeply linguistically aware and knowledge-aware text-based conversational agent (LING-CSA) is presented as a proof of concept of a non-statistical conversational AI solution.
40

Osika, Anton. "Statistical analysis of online linguistic sentiment measures with financial applications". Thesis, KTH, Matematisk statistik, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177106.

Texto completo
Resumen
Gavagai is a company that uses different methods to aggregate sentiment towards specific topics from a large stream of documents published in real time. Gavagai wants to find a procedure to decide which way of measuring sentiment (sentiment measure) towards a topic is most useful in a given context. This work discusses what criteria are desirable for aggregating sentiment and derives and evaluates procedures to select "optimal" sentiment measures. Three novel models for selecting a set of sentiment measures that describe independent attributes of the aggregated data are evaluated. The models can be summarized as: maximizing the variance of the last principal component of the data; maximizing the differential entropy of the data; and, in the special case of selecting an additional sentiment measure, maximizing the unexplained variance conditional on the previous sentiment measures. When exogenous time-varying data concerning a topic is available, the data can be used to select the sentiment measure that best explains the data. With this goal in mind, the hypothesis that sentiment data can be used to predict financial volatility and political poll data is tested. The null hypothesis cannot be rejected. A framework for aggregating sentiment measures in a mathematically coherent way is summarized in a road map.
The company Gavagai uses different measures to estimate, in real time, sentiment from various streams of public documents. Gavagai wants to find a procedure that determines which measures fit best in a given context. This work discusses which criteria are desirable for measuring sentiment, and derives and evaluates procedures for selecting "optimal" sentiment measures. Three methods for selecting a group of measures that describe independent polarisations in text are proposed. They are based on: choosing measures for which principal component analysis indicates high dimensionality among the measures; choosing measures that maximise the total estimated differential entropy; and choosing a measure that has high conditional variance given the other polarisations. When exogenous time-varying data about a topic is available, this data can be used to compute which sentiment measures best describe the data. To investigate the potential of selecting sentiment measures in this way, the hypotheses that public sentiment measures can predict financial volatility and political opinion polls are tested. The null hypothesis cannot be rejected. A summary of how to aggregate sentiment in a consistently mathematically coherent way is presented, together with recommendations for future research.
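One of the selection criteria described above (maximising the variance of the last principal component) can be sketched in a few lines of numpy: among candidate subsets of measures, prefer the one whose smallest covariance eigenvalue is largest, which penalises keeping near-redundant measures together. The data below is randomly generated for illustration only; it is not Gavagai data.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Toy data: four daily "sentiment measures"; columns 0 and 1 are nearly redundant.
n = 200
base = rng.normal(size=(n, 3))
measures = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.05 * rng.normal(size=n),   # near-duplicate of column 0
    base[:, 1],
    base[:, 2],
])

def last_pc_variance(X):
    """Variance of the last principal component = smallest eigenvalue of the covariance."""
    return np.linalg.eigvalsh(np.cov(X, rowvar=False))[0]

def select(measures, k):
    """Pick the k columns whose last principal component retains the most variance."""
    return max(combinations(range(measures.shape[1]), k),
               key=lambda cols: last_pc_variance(measures[:, list(cols)]))

print(select(measures, 3))   # avoids keeping both near-duplicate columns together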
41

Herrmann, Teresa [Verfasser] and A. [Akademischer Betreuer] Waibel. "Linguistic Structure in Statistical Machine Translation / Teresa Herrmann. Betreuer: A. Waibel". Karlsruhe : KIT-Bibliothek, 2015. http://d-nb.info/1102250155/34.

Texto completo
42

Kliegl, Reinhold. "Publication Statistics Show Collaboration, Not Competition". Universität Potsdam, 2008. http://opus.kobv.de/ubp/volltexte/2011/5719/.

Texto completo
43

Rayson, Paul Edward. "Matrix : a statistical method and software tool for linguistic analysis through corpus comparison". Thesis, Lancaster University, 2003. http://eprints.lancs.ac.uk/12287/.

Texto completo
44

Kearsley, Logan R. "A Hybrid Approach to Cross-Linguistic Tokenization: Morphology with Statistics". BYU ScholarsArchive, 2016. https://scholarsarchive.byu.edu/etd/5984.

Texto completo
Resumen
Tokenization, or word boundary detection, is a critical first step for most NLP applications. This is often given little attention in English and other languages which use explicit spaces between written words, but standard orthographies for many languages lack explicit markers. Tokenization systems for such languages are usually engineered on an individual basis, with little re-use. The human ability to decode any written language, however, suggests that a general algorithm exists. This thesis presents simple morphologically-based and statistical methods for identifying word boundaries in multiple languages. Statistical methods tend to over-predict, while lexical and morphological methods fail when encountering unknown words. I demonstrate that a generic hybrid approach to tokenization using both morphological and statistical information generalizes well across multiple languages and improves performance over morphological or statistical methods alone, and show that it can be used for efficient tokenization of English, Korean, and Arabic.
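A toy version of the hybrid idea might look like the sketch below: a lexicon-driven longest-match pass proposes boundaries where known words end, a statistical pass proposes boundaries where the character bigram spanning a position is rare in training text, and the two boundary sets are combined. The lexicon, training text and combination rule are invented for the example and are far simpler than the thesis's morphological and statistical components.

from collections import Counter

LEXICON = {"the", "cat", "sat", "on", "mat"}
TRAINING = "the cat sat on the mat"

# Word-internal character-bigram counts: a rare bigram across a position
# suggests a word boundary (a crude stand-in for the statistical component).
internal = Counter()
for word in TRAINING.split():
    for a, b in zip(word, word[1:]):
        internal[a + b] += 1

def statistical_boundaries(text, threshold=1):
    return {i for i in range(1, len(text)) if internal[text[i - 1] + text[i]] < threshold}

def lexical_boundaries(text):
    """Greedy longest-match against the lexicon, recording where matches end."""
    bounds, i = set(), 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in LEXICON:
                bounds.add(j)
                i = j
                break
        else:
            i += 1
    return bounds

def tokenize(text):
    bounds = sorted(lexical_boundaries(text) | statistical_boundaries(text) | {len(text)})
    tokens, prev = [], 0
    for b in bounds:
        if b > prev:
            tokens.append(text[prev:b])
            prev = b
    return tokens

print(tokenize("thecatsatonthemat"))   # ['the', 'cat', 'sat', 'on', 'the', 'mat']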
45

Tolle, Kristin M. "Domain-independent semantic concept extraction using corpus linguistics, statistics and artificial intelligence techniques". Diss., The University of Arizona, 2003. http://hdl.handle.net/10150/280502.

Texto completo
Resumen
For this dissertation two software applications were developed and three experiments were conducted to evaluate the viability of a unique approach to medical information extraction. The first system, the AZ Noun Phraser, was designed as a concept extraction tool. The second application, ANNEE, is a neural net-based entity extraction (EE) system. These two systems were combined to perform concept extraction and semantic classification specifically for use in medical document retrieval systems. The goal of this research was to create a system that automatically (without human interaction) enabled semantic type assignment, such as gene name and disease, to concepts extracted from unstructured medical text documents. Improving conceptual analysis of search phrases has been shown to improve the precision of information retrieval systems. Enabling this capability in the field of medicine can aid medical researchers, doctors and librarians in locating information, potentially improving healthcare decision-making. Due to the flexibility and non-domain specificity of the implementation, these applications have also been successfully deployed in other text retrieval experimentation for law enforcement (Atabakhsh et al., 2001; Hauck, Atabakhsh, Ongvasith, Gupta, & Chen, 2002), medicine (Tolle & Chen, 2000), query expansion (Leroy, Tolle, & Chen, 2000), web document categorization (Chen, Fan, Chau, & Zeng, 2001), Internet spiders (Chau, Zeng, & Chen, 2001), collaborative agents (Chau, Zeng, Chen, Huang, & Hendriawan, 2002), competitive intelligence (Chen, Chau, & Zeng, 2002), and Internet chat-room data visualization (Zhu & Chen, 2001).
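The concept-extraction step can be pictured with a small noun-phrase chunker over POS-tagged tokens. The tag pattern (optional determiner, adjectives, then one or more nouns) and the example tags are illustrative simplifications, not the actual grammar used by the AZ Noun Phraser.

def noun_phrases(tagged):
    """Collect maximal runs matching (DT)? (JJ)* (NN)+ over (token, POS) pairs."""
    phrases, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        first_noun = j
        while j < len(tagged) and tagged[j][1].startswith("NN"):
            j += 1
        if j > first_noun:                      # at least one noun head
            phrases.append(" ".join(tok for tok, _ in tagged[i:j]))
            i = j
        else:
            i += 1
    return phrases

sentence = [("the", "DT"), ("hereditary", "JJ"), ("breast", "NN"), ("cancer", "NN"),
            ("gene", "NN"), ("was", "VBD"), ("identified", "VBN")]
print(noun_phrases(sentence))   # ['the hereditary breast cancer gene']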
46

Xu, Yushi Ph D. Massachusetts Institute of Technology. "Combining linguistics and statistics for high-quality limited domain English-Chinese machine translation". Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/44726.

Texto completo
Resumen
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.
Includes bibliographical references (p. 86-87).
Second language learning is a compelling activity in today's global markets. This thesis focuses on the critical technology necessary to produce a computer spoken translation game for learning Mandarin Chinese in a relatively broad travel domain. Three main aspects are addressed: efficient Chinese parsing, high-quality English-Chinese machine translation, and how these technologies can be integrated into a translation game system. In the language understanding component, the TINA parser is enhanced with bottom-up and long-distance constraint features. The results showed that with these features, the Chinese grammar ran ten times faster and covered 15% more of the test set. In the machine translation component, a method combining linguistic and statistical systems is introduced. English-Chinese translation is done via an intermediate language, "Zhonglish": English-Zhonglish translation is accomplished by a parse-and-paraphrase paradigm using hand-coded rules, mainly for structural reconstruction, while Zhonglish-Chinese translation is accomplished by a standard phrase-based statistical machine translation system, mostly performing word sense disambiguation and lexicon mapping. We evaluated on an independent test set from the IWSLT travel-domain spoken language corpus. Substantial improvements were achieved for GIZA alignment crossover: we obtained a 45% decrease in crossovers compared to a traditional phrase-based statistical MT system. Furthermore, the BLEU score improved by 2 points. Finally, a framework for the translation game system is described, and the feasibility of integrating the components to produce reference translations and to automatically assess students' translations is verified.
by Yushi Xu.
S.M.
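The alignment-crossover figure quoted in the abstract can be computed with a very small function: given word-alignment links as (source index, target index) pairs, count how many pairs of links cross. The links below are invented for illustration.

from itertools import combinations

def crossovers(links):
    """Count crossing pairs among word-alignment links (src_index, tgt_index)."""
    return sum(1 for (s1, t1), (s2, t2) in combinations(links, 2)
               if (s1 - s2) * (t1 - t2) < 0)

monotone = [(0, 0), (1, 1), (2, 2)]
reordered = [(0, 2), (1, 0), (2, 1)]
print(crossovers(monotone), crossovers(reordered))   # 0 2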
47

Xu, Jia [Verfasser]. "Sequence segmentation for statistical machine translation / Jia Xu". Aachen : Hochschulbibliothek der Rheinisch-Westfälischen Technischen Hochschule Aachen, 2010. http://d-nb.info/1015180108/34.

Texto completo
48

Baker, David Ian. "Shepherd of Hermas : a socio-rhetorical and statistical-linguistic study of authorship and community concerns". Thesis, Cardiff University, 2006. http://orca.cf.ac.uk/56076/.

Texto completo
Resumen
The Shepherd of Hermas, hereafter simply referred to as The Shepherd, is a long document that was highly prized in the early church. It gives an account of the visions and dreams experienced by the main character, Hermas. This gives the general reader the impression that the text belongs to the genre of apocalypse. While Hermas 'sees' angelic figures and the visions are explained by a spiritual guide, it lacks the visions of heaven that are central to other apocalyptic literature, as well as end-of-the-world catastrophic occurrences. Consequently, The Shepherd cannot be considered apocalyptic, or even pseudo-apocalyptic. The genre of The Shepherd is considered in a later chapter of this thesis. A description of the narrative structure is given later in this introduction.
49

FRAGOSO, LUANE DA COSTA PINTO LINS. "INTEGRATION OF LINGUISTIC AND GRAPHIC INFORMATION IN MULTIMODAL COMPREHENSION OF STATISTICAL GRAPHS: A PSYCHOLINGUISTIC ASSESSMENT". PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2015. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=25595@1.

Texto completo
Resumen
PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO
COORDENAÇÃO DE APERFEIÇOAMENTO DO PESSOAL DE ENSINO SUPERIOR
PROGRAMA DE SUPORTE À PÓS-GRADUAÇÃO DE INSTS. DE ENSINO
This thesis aims to investigate the mapping between the content of sentences and the content presented in graphs in the process of multimodal comprehension. An experimental approach is adopted, based on the theoretical and methodological contributions of Cognitive Psychology and Psycholinguistics, combined with discussions relevant to Mathematics Education and with studies on multimodality and literacy. Two proposals concerning the integration of linguistic and visual information are considered: one linked to Jackendoff's (1996) representational modularity hypothesis, which defends the idea of interface modules of a hybrid nature, and an alternative proposal, assumed in the present work, according to which both linguistic and visual processing generate abstract/propositional representations that are integrated at a conceptual interface. We sought to verify (i) whether top-down factors such as prior knowledge of the subject affect this integration and (ii) to what extent linguistic information sets up expectations about the information expressed in the graph. Two sentence-picture comparison experiments were conducted with column and line graphs, using the psyscope software, and one involving line graphs with the eye-tracking technique. No evidence of top-down effects was found in the column graph experiment. However, significant response time effects were obtained for other factors, namely graph correctness, the lexical expression used to compare graph items (e.g. larger vs. smaller) and the number of items mentioned in the sentence to be located in the graph. In the two line graph experiments, the independent variables were (i) congruency (line congruent/incongruent with respect to the verb, e.g. a line sloping up or down vs. the verb 'rise') and (ii) correctness of the graph in expressing the content of the sentence, manipulated through changes in the line and in the ordering (ascending/descending) of temporal information on the x axis. In the psyscope experiment, the results indicated no difficulty in judging sentence/graph compatibility when congruency and correctness did not diverge. For response time, there were main effects of congruency and correctness, with shorter times associated, respectively, with the conditions in which the line was congruent with the verb and the graph was correct. There was also an interaction effect between the variables. In the eye-tracking experiment, accuracy rates, number of fixations, total fixation duration and gaze path in the demarcated areas of interest were analysed. Regarding accuracy, as in the psyscope experiment, greater processing difficulty was associated with the incongruent correct condition, in which there is a break of expectation regarding the position of the line (vs. the verb) and the usual way graphs are organised on the x axis. As for eye movements, in the graph area a greater number of fixations and a longer total fixation duration were observed in the correct conditions; in the sentence area, these conditions showed the opposite results. As for the gaze path, the data suggest that linguistic information is accessed first, guiding the reading of the graph. Taking the results together, it can be stated that the integration cost is determined by the compatibility (or not) between the propositions generated by the linguistic and visual modules.
This thesis aims at investigating the mapping between sentential content and the content presented in graphs in the multimodal comprehension process. We assume an experimental approach, based on theoretical and methodological contributions from Cognitive Psychology and Psycholinguistics, as well as on literacy and multimodality studies. Two proposals concerning the integration of linguistic and visual information are considered: one linked to Jackendoff's (1996) representational modularity hypothesis, in which the idea of interface modules of a hybrid nature is defended, and an alternative one according to which linguistic and visual processing generate propositional/abstract representations that are integrated at a conceptual interface. We tried to check (i) whether top-down aspects such as prior knowledge can affect this integration and (ii) to what extent linguistic information may create expectations about the information expressed in the graph. Sentence-picture comparison experiments were conducted with line and column graphs using the psyscope software, and another one concerning line graphs with the eye-tracking technique. Top-down effects were not found in the column graph experiment. However, significant response time effects were found for other aspects, such as graph accuracy, the lexical expression used to compare graph elements (larger vs. smaller, for example) and the number of elements in the sentence that must be found in the graph. In both experiments with line graphs, the independent variables were (i) congruency (congruent/incongruent line in relation to the verb, e.g. a line up or down vs. the verb increase) and (ii) accuracy of the graph in expressing the content of the sentence, manipulated through changes in the line and in the order of temporal information (ascending/descending) on the x axis. In the psyscope experiment, there was no difficulty in judging sentence-picture compatibility when congruency and correctness did not diverge. Concerning response time, there were effects of congruency and correctness, with shorter times associated, respectively, with the conditions in which the line was congruent with the verb and the graph was correct. There was also an interaction effect. In the eye-tracking experiment, accuracy rates, number of fixations, total fixation duration and the scanpath in areas of interest were analysed. In relation to accuracy rates, as in the psyscope experiment, greater processing difficulty was associated with the incongruent/incorrect condition, in which there is a break in the expectation related to the line position (vs. the verb) and the usual organisation of the elements displayed on the x axis. Concerning eye movements, in the graph area the number of fixations and the total fixation duration were higher in the correct conditions; in the sentence area, the results were the opposite. Analysing the scanpath, the data suggest that linguistic information is accessed first, guiding the graph reading. To conclude, it is possible to state that the cost of integration is determined by the compatibility (or not) between the propositions from the linguistic and visual modules.
50

Hasan, Saša [Verfasser]. "Triplet lexicon models for statistical machine translation / Sasa Hasan". Aachen : Hochschulbibliothek der Rheinisch-Westfälischen Technischen Hochschule Aachen, 2012. http://d-nb.info/1028004060/34.

Texto completo
