Dissertations / Theses on the topic 'Natural language processing (Computer science) Computational linguistics'

Consult the top 50 dissertations / theses for your research on the topic 'Natural language processing (Computer science) Computational linguistics.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Vaillette, Nathan. "Logical specification of finite-state transductions for natural language processing." Columbus, Ohio : Ohio State University, 2004. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1072058657.

Abstract:
Thesis (Ph. D.)--Ohio State University, 2004.
Title from first page of PDF file. Document formatted into pages; contains xv, 253 p.; also includes graphics. Includes abstract and vita. Advisor: Chris Brew, Dept. of Linguistics. Includes bibliographical references (p. 245-253).
2

Jarmasz, Mario. ""Roget's Thesaurus" as a lexical resource for natural language processing." Thesis, University of Ottawa (Canada), 2003. http://hdl.handle.net/10393/26493.

Abstract:
This dissertation presents an implementation of an electronic lexical knowledge base that uses the 1987 Penguin edition of Roget's Thesaurus as the source for its lexical material---the first implementation of a computerized Roget's to use an entire current edition. It explains the steps necessary for taking a machine-readable file and transforming it into a tractable system. Roget's organization is studied in detail and contrasted with WordNet's. We show two applications of the computerized Thesaurus: computing semantic similarity between words and phrases, and building lexical chains in a text. The experiments are performed using well-known benchmarks and the results are compared to those of other systems that use Roget's, WordNet and statistical techniques. Roget's has turned out to be an excellent resource for measuring semantic similarity; lexical chains are easily built but more difficult to evaluate. We also explain ways in which Roget's Thesaurus and WordNet can be combined.
3

Berman, Lucy. "Lewisian Properties and Natural Language Processing: Computational Linguistics from a Philosophical Perspective." Scholarship @ Claremont, 2019. https://scholarship.claremont.edu/cmc_theses/2200.

Abstract:
Nothing seems more obvious than that our words have meaning. When people speak to each other, they exchange information through the use of a particular set of words. The words they say to each other, moreover, are about something. Yet this relation of “aboutness,” known as “reference,” is not quite as simple as it appears. In this thesis I will present two opposing arguments about the nature of our words and how they relate to the things around us. First, I will present Hilary Putnam’s argument, in which he examines the indeterminacy of reference, forcing us to conclude that we must abandon metaphysical realism. While Putnam considers his argument to be a refutation of non-epistemicism, David Lewis takes it to be a reductio, claiming Putnam’s conclusion is incredible. I will present Lewis’s response to Putnam, in which he accepts the challenge of demonstrating how Putnam’s argument fails and rescuing us from the abandonment of realism. In order to explain the determinacy of reference, Lewis introduces the concept of “natural properties.” In the final chapter of this thesis, I will propose another use for Lewisian properties. Namely, that of helping to minimize the gap between natural language processing and human communication.
4

Keller, Thomas Anderson. "Comparison and Fine-Grained Analysis of Sequence Encoders for Natural Language Processing." Thesis, University of California, San Diego, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10599339.

Abstract:

Most machine learning algorithms require a fixed length input to be able to perform commonly desired tasks such as classification, clustering, and regression. For natural language processing, the inherently unbounded and recursive nature of the input poses a unique challenge when deriving such fixed length representations. Although today there is a general consensus on how to generate fixed length representations of individual words which preserve their meaning, the same cannot be said for sequences of words in sentences, paragraphs, or documents. In this work, we study the encoders commonly used to generate fixed length representations of natural language sequences, and analyze their effectiveness across a variety of high- and low-level tasks including sentence classification and question answering. Additionally, we propose novel improvements to the existing Skip-Thought and End-to-End Memory Network architectures and study their performance on both the original and auxiliary tasks. Ultimately, we show that the setting in which the encoders are trained, and the corpus used for training, have a greater influence on the final learned representation than the underlying sequence encoders themselves.
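
To make the idea of a fixed-length sequence representation concrete, below is a minimal sketch of the simplest baseline encoder: mean-pooling word embeddings into a single vector. The toy vocabulary, random embeddings, and dimensionality are illustrative assumptions, not details taken from the thesis.

```python
# Minimal sketch: encode a variable-length sentence as a fixed-length vector by
# mean-pooling word embeddings (a common baseline sequence encoder).
import numpy as np

EMBED_DIM = 8
rng = np.random.default_rng(0)

# Stand-in embedding table; in practice these would be pre-trained word vectors.
vocab = ["the", "cat", "sat", "on", "mat", "dogs", "bark"]
embeddings = {word: rng.normal(size=EMBED_DIM) for word in vocab}

def encode(sentence: str) -> np.ndarray:
    """Average the embeddings of known words into one fixed-length vector."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(EMBED_DIM)
    return np.mean(vectors, axis=0)

# Sentences of different lengths map to vectors of identical shape, which
# downstream classifiers, clustering, or regression models require.
print(encode("The cat sat on the mat").shape)  # (8,)
print(encode("Dogs bark").shape)               # (8,)
```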

5

Pham, Son Bao (Computer Science & Engineering, Faculty of Engineering, UNSW). "Incremental knowledge acquisition for natural language processing." Awarded by: University of New South Wales, School of Computer Science and Engineering, 2006. http://handle.unsw.edu.au/1959.4/26299.

Abstract:
Linguistic patterns have been used widely in shallow methods to develop numerous NLP applications. Approaches for acquiring linguistic patterns can be broadly categorised into three groups: supervised learning, unsupervised learning and manual methods. In supervised learning approaches, a large annotated training corpus is required for the learning algorithms to achieve decent results. However, annotated corpora are expensive to obtain and usually available only for established tasks. Unsupervised learning approaches usually start with a few seed examples and gather some statistics based on a large unannotated corpus to detect new examples that are similar to the seed ones. Most of these approaches either populate lexicons for predefined patterns or learn new patterns for extracting general factual information; hence they are applicable to only a limited number of tasks. Manually creating linguistic patterns has the advantage of utilising an expert's knowledge to overcome the scarcity of annotated data. In tasks with no annotated data available, the manual way seems to be the only choice. One typical problem that occurs with manual approaches is that the combination of multiple patterns, possibly being used at different stages of processing, often causes unintended side effects. Existing approaches, however, do not focus on the practical problem of acquiring those patterns but rather on how to use linguistic patterns for processing text. A systematic way to support the process of manually acquiring linguistic patterns in an efficient manner is long overdue. This thesis presents KAFTIE, an incremental knowledge acquisition framework that strongly supports experts in creating linguistic patterns manually for various NLP tasks. KAFTIE addresses difficulties in manually constructing knowledge bases of linguistic patterns, or rules in general, often faced in existing approaches by: (1) offering a systematic way to create new patterns while ensuring they are consistent; (2) alleviating the difficulty in choosing the right level of generality when creating a new pattern; (3) suggesting how existing patterns can be modified to improve the knowledge base's performance; (4) making the effort in creating a new pattern, or modifying an existing pattern, independent of the knowledge base's size. KAFTIE, therefore, makes it possible for experts to efficiently build large knowledge bases for complex tasks. This thesis also presents the KAFDIS framework for discourse processing using new representation formalisms: the level-of-detail tree and the discourse structure graph.
6

Schäfer, Ulrich. "Integrating deep and shallow natural language processing components : representations and hybrid architectures /." Saarbrücken : German Research Center for Artificial Intelligence : Saarland University, Dept. of Computational Linguistics and Phonetics, 2007. http://www.loc.gov/catdir/toc/fy1001/2008384333.html.

7

Mahamood, Saad Ali. "Generating affective natural language for parents of neonatal infants." Thesis, University of Aberdeen, 2010. http://digitool.abdn.ac.uk:80/webclient/DeliveryManager?pid=158569.

Abstract:
The thesis presented here describes original research in the field of Natural Language Generation (NLG). NLG is the subfield of artificial intelligence concerned with the automatic production of documents from underlying data. This thesis focuses on developing novel methods for generating text that take the recipient's level of stress into consideration as a factor for adapting the resulting textual output. This consideration was particularly salient because of the domain in which the research was conducted: providing information to parents of pre-term infants during neonatal intensive care (NICU), a highly technical and stressful environment in which emotional sensitivity must be shown towards the nature of the information presented. We investigated the emotional and informational needs of these parents through an extensive review of the past literature and two separate research studies with former and current NICU parents. The NLG system built for this research, called BabyTalk Family (BT-Family), produces for parents a textual summary of the medical events that have occurred for a baby in the NICU in the last twenty-four hours. The novelty of this system is that it is capable of estimating the recipient's level of stress and, by using several affective NLG strategies, can tailor its output for a stressed audience, unlike traditional NLG systems whose output remains unchanged regardless of the emotional state of the recipient. The key innovation is the integration of several affective strategies in the Document Planner for tailoring textual output to stressed recipients. BT-Family's output was evaluated with thirteen parents who had previously had a baby in neonatal care. We developed an evaluation methodology involving a direct comparison between stressed and unstressed text for the same medical scenario on variables such as preference, understandability, helpfulness, and emotional appropriateness. The results showed that the parents overwhelmingly preferred the stressed text on all of the variables measured.
8

Kozlowski, Raymond. "Uniform multilingual sentence generation using flexible lexico-grammatical resources." Access to citation, abstract and download form provided by ProQuest Information and Learning Company; downloadable PDF file 0.93 Mb., 213 p, 2006. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&res_dat=xri:pqdiss&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&rft_dat=xri:pqdiss:3200536.

9

Carpuat, Marine Jacinthe. "Word sense alignment using bilingual corpora /." View Abstract or Full-Text, 2002. http://library.ust.hk/cgi/db/thesis.pl?ELEC%202002%20CARPUA.

Abstract:
Thesis (M. Phil.)--Hong Kong University of Science and Technology, 2002.
Includes bibliographical references (leaves 43-44). Also available in electronic version. Access restricted to campus users.
10

Petersen, Sarah E. "Natural language processing tools for reading level assessment and text simplification for bilingual education /." Thesis, Connect to this title online; UW restricted, 2007. http://hdl.handle.net/1773/6906.

11

Crocker, Matthew Walter. "A principle-based system for natural language analysis and translation." Thesis, University of British Columbia, 1988. http://hdl.handle.net/2429/27863.

Abstract:
Traditional views of grammatical theory hold that languages are characterised by sets of constructions. This approach entails the enumeration of all possible constructions for each language being described. Current theories of transformational generative grammar have established an alternative position. Specifically, Chomsky's Government-Binding theory proposes a system of principles which are common to human language. Such a theory is referred to as a "Universal Grammar" (UG). Associated with the principles of grammar are parameters of variation which account for the diversity of human languages. The grammar for a particular language is known as a "Core Grammar", and is characterised by an appropriately parametrised instance of UG. Despite these advances in linguistic theory, construction-based approaches have remained the status quo within the field of natural language processing. This thesis investigates the possibility of developing a principle-based system which reflects the modular nature of the linguistic theory. That is, rather than stipulating the possible constructions of a language, a system is developed which uses the principles of grammar and language specific parameters to parse language. Specifically, a system is presented which performs syntactic analysis and translation for a subset of English and German. The cross-linguistic nature of the theory is reflected by the system, which can be considered a procedural model of UG.
Faculty of Science, Department of Computer Science (Graduate).
12

Lin, Jing. "Using a rewriting system to model individual writing styles." Thesis, University of Aberdeen, 2012. http://digitool.abdn.ac.uk:80/webclient/DeliveryManager?pid=186641.

Abstract:
Each individual has a distinctive writing style, but natural language generation systems produce text with much less variety. Is it possible to produce more human-like text from natural language generation systems by mimicking the style of particular authors? We start by analysing the text of real authors. We collect a corpus of texts from a single genre (food recipes) with each text identified with its author, and summarise a variety of writing features in these texts. Each author's writing style is the combination of a set of features. Analysis of the writing features shows that not only does each individual author write differently but the differences are consistent over the whole of their corpus. Hence we conclude that authors do keep a consistent style consisting of a variety of different features. When we discuss notions such as the style and meaning of texts, we are referring to the reaction that readers have to them. It is important, therefore, in the field of computational linguistics to experiment by showing texts to people and assessing their interpretation of the texts. In our research we move from simple discussion and statistical analysis of the properties of text and NLG systems to performing experiments to verify the actual impact that lexical preference has on real readers. Through experiments that require participants to follow a recipe and prepare food, we conclude that it is possible to alter the lexicon of a recipe without altering the actions performed by the cook, hence that word choice is an aspect of style rather than semantics, and also that word choice is one of the writing features employed by readers in identifying the author of a text. Among all writing features, individual lexical preference is very important both for analysing and generating texts, so we choose individual lexical choice as our principal topic of research. Using a modified version of distributional similarity (DS) helps us to choose words used by individual authors without the limitations of many other solutions, such as the need for a pre-built thesaurus. We present an algorithm for analysis and rewriting, and assess the results. Based on the results we propose some further improvements.
13

Buys, Jan Moolman. "Incremental generative models for syntactic and semantic natural language processing." Thesis, University of Oxford, 2017. https://ora.ox.ac.uk/objects/uuid:a9a7b5cf-3bb1-4e08-b109-de06bf387d1d.

Abstract:
This thesis investigates the role of linguistically-motivated generative models of syntax and semantic structure in natural language processing (NLP). Syntactic well-formedness is crucial in language generation, but most statistical models do not account for the hierarchical structure of sentences. Many applications exhibiting natural language understanding rely on structured semantic representations to enable querying, inference and reasoning. Yet most semantic parsers produce domain-specific or inadequately expressive representations. We propose a series of generative transition-based models for dependency syntax which can be applied as both parsers and language models while being amenable to supervised or unsupervised learning. Two models are based on Markov assumptions commonly made in NLP: The first is a Bayesian model with hierarchical smoothing, the second is parameterised by feed-forward neural networks. The Bayesian model enables careful analysis of the structure of the conditioning contexts required for generative parsers, but the neural network is more accurate. As a language model the syntactic neural model outperforms both the Bayesian model and n-gram neural networks, pointing to the complementary nature of distributed and structured representations for syntactic prediction. We propose approximate inference methods based on particle filtering. The third model is parameterised by recurrent neural networks (RNNs), dropping the Markov assumptions. Exact inference with dynamic programming is made tractable here by simplifying the structure of the conditioning contexts. We then shift the focus to semantics and propose models for parsing sentences to labelled semantic graphs. We introduce a transition-based parser which incrementally predicts graph nodes (predicates) and edges (arguments). This approach is contrasted against predicting top-down graph traversals. RNNs and pointer networks are key components in approaching graph parsing as an incremental prediction problem. The RNN architecture is augmented to condition the model explicitly on the transition system configuration. We develop a robust parser for Minimal Recursion Semantics, a linguistically-expressive framework for compositional semantics which has previously been parsed only with grammar-based approaches. Our parser is much faster than the grammar-based model, while the same approach improves the accuracy of neural Abstract Meaning Representation parsing.
14

Turner, Elise Hill. "Integrating intention and convention to organize problem solving dialogues." Diss., Georgia Institute of Technology, 1989. http://hdl.handle.net/1853/9248.

15

Rodriguez, Paul Fabian. "Mathematical foundations of simple recurrent networks /." Diss., Connect to a 24 p. preview or request complete full text in PDF format. Access restricted to UC campuses, 1999. http://wwwlib.umi.com/cr/ucsd/fullcit?p9935464.

16

Wong, Jimmy Pui Fung. "The use of prosodic features in Chinese speech recognition and spoken language processing /." View Abstract or Full-Text, 2003. http://library.ust.hk/cgi/db/thesis.pl?ELEC%202003%20WONG.

Abstract:
Thesis (M.Phil.)--Hong Kong University of Science and Technology, 2003.
Includes bibliographical references (leaves 97-101). Also available in electronic version. Access restricted to campus users.
17

Lakeland, Corrin. "Lexical approaches to backoff in statistical parsing." University of Otago. Department of Computer Science, 2006. http://adt.otago.ac.nz./public/adt-NZDU20060913.134736.

Abstract:
This thesis develops a new method for predicting probabilities in a statistical parser so that more sophisticated probabilistic grammars can be used. A statistical parser uses a probabilistic grammar derived from a training corpus of hand-parsed sentences. The grammar is represented as a set of constructions - in a simple case these might be context-free rules. The probability of each construction in the grammar is then estimated by counting its relative frequency in the corpus. A crucial problem when building a probabilistic grammar is to select an appropriate level of granularity for describing the constructions being learned. The more constructions we include in our grammar, the more sophisticated a model of the language we produce. However, if too many different constructions are included, then our corpus is unlikely to contain reliable information about the relative frequency of many constructions. In existing statistical parsers two main approaches have been taken to choosing an appropriate granularity. In a non-lexicalised parser constructions are specified as structures involving particular parts-of-speech, thereby abstracting over individual words. Thus, in the training corpus two syntactic structures involving the same parts-of-speech but different words would be treated as two instances of the same event. In a lexicalised grammar the assumption is that the individual words in a sentence carry information about its syntactic analysis over and above what is carried by its part-of-speech tags. Lexicalised grammars have the potential to provide extremely detailed syntactic analyses; however, Zipf's law makes it hard for such grammars to be learned. In this thesis, we propose a method for optimising the trade-off between informative and learnable constructions in statistical parsing. We implement a grammar which works at a level of granularity in between single words and parts-of-speech, by grouping words together using unsupervised clustering based on bigram statistics. We begin by implementing a statistical parser to serve as the basis for our experiments. The parser, based on that of Michael Collins (1999), contains a number of new features of general interest. We then implement a model of word clustering, which we believe is the first to deliver vector-based word representations for an arbitrarily large lexicon. Finally, we describe a series of experiments in which the statistical parser is trained using categories based on these word representations.
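
To illustrate the intermediate level of granularity described above, the following minimal sketch clusters words by the similarity of their bigram co-occurrence vectors. The toy corpus, the use of scikit-learn's KMeans, and the number of clusters are assumptions made for illustration, not the clustering model implemented in the thesis.

```python
# Minimal sketch: cluster words by their bigram co-occurrence vectors to get a
# granularity between raw words and part-of-speech tags (usable for backoff).
import numpy as np
from sklearn.cluster import KMeans

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased a dog",
]

tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# One row per word: counts of left neighbours followed by counts of right neighbours.
vectors = np.zeros((len(vocab), 2 * len(vocab)))
for sent in tokens:
    for left, right in zip(sent, sent[1:]):
        vectors[index[right], index[left]] += 1               # left-neighbour count
        vectors[index[left], len(vocab) + index[right]] += 1  # right-neighbour count

# Each cluster acts as a coarse word class for estimating rule probabilities.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
for word, label in sorted(zip(vocab, kmeans.labels_), key=lambda x: x[1]):
    print(label, word)
```
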
18

Kočiský, Tomáš. "Deep learning for reading and understanding language." Thesis, University of Oxford, 2017. http://ora.ox.ac.uk/objects/uuid:cc45e366-cdd8-495b-af42-dfd726700ff0.

Abstract:
This thesis presents novel tasks and deep learning methods for machine reading comprehension and question answering with the goal of achieving natural language understanding. First, we consider a semantic parsing task where the model understands sentences and translates them into a logical form or instructions. We present a novel semi-supervised sequential autoencoder that considers language as a discrete sequential latent variable and semantic parses as the observations. This model allows us to leverage synthetically generated unpaired logical forms, and thereby alleviate the lack of supervised training data. We show the semi-supervised model outperforms a supervised model when trained with the additional generated data. Second, reading comprehension requires integrating information and reasoning about events, entities, and their relations across a full document. Question answering is conventionally used to assess reading comprehension ability, in both artificial agents and children learning to read. We propose a new, challenging, supervised reading comprehension task. We gather a large-scale dataset of news stories from the CNN and Daily Mail websites with Cloze-style questions created from the highlights. This dataset allows for the first time training deep learning models for reading comprehension. We also introduce novel attention-based models for this task and present qualitative analysis of the attention mechanism. Finally, following the recent advances in reading comprehension in both models and task design, we further propose a new task for understanding complex narratives, NarrativeQA, consisting of full texts of books and movie scripts. We collect human written questions and answers based on high-level plot summaries. This task is designed to encourage development of models for language understanding; it is designed so that successfully answering their questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience. We show that although humans solve the tasks easily, standard reading comprehension models struggle on the tasks presented here.
19

Hermann, Karl Moritz. "Distributed representations for compositional semantics." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:1c995f84-7e10-43b0-a801-1c8bbfb53e76.

Abstract:
The mathematical representation of semantics is a key issue for Natural Language Processing (NLP). A lot of research has been devoted to finding ways of representing the semantics of individual words in vector spaces. Distributional approaches—meaning distributed representations that exploit co-occurrence statistics of large corpora—have proved popular and successful across a number of tasks. However, natural language usually comes in structures beyond the word level, with meaning arising not only from the individual words but also the structure they are contained in at the phrasal or sentential level. Modelling the compositional process by which the meaning of an utterance arises from the meaning of its parts is an equally fundamental task of NLP. This dissertation explores methods for learning distributed semantic representations and models for composing these into representations for larger linguistic units. Our underlying hypothesis is that neural models are a suitable vehicle for learning semantically rich representations and that such representations in turn are suitable vehicles for solving important tasks in natural language processing. The contribution of this thesis is a thorough evaluation of our hypothesis, as part of which we introduce several new approaches to representation learning and compositional semantics, as well as multiple state-of-the-art models which apply distributed semantic representations to various tasks in NLP. Part I focuses on distributed representations and their application. In particular, in Chapter 3 we explore the semantic usefulness of distributed representations by evaluating their use in the task of semantic frame identification. Part II describes the transition from semantic representations for words to compositional semantics. Chapter 4 covers the relevant literature in this field. Following this, Chapter 5 investigates the role of syntax in semantic composition. For this, we discuss a series of neural network-based models and learning mechanisms, and demonstrate how syntactic information can be incorporated into semantic composition. This study allows us to establish the effectiveness of syntactic information as a guiding parameter for semantic composition, and answer questions about the link between syntax and semantics. Following these discoveries regarding the role of syntax, Chapter 6 investigates whether it is possible to further reduce the impact of monolingual surface forms and syntax when attempting to capture semantics. Asking how machines can best approximate human signals of semantics, we propose multilingual information as one method for grounding semantics, and develop an extension to the distributional hypothesis for multilingual representations. Finally, Part III summarizes our findings and discusses future work.
20

Gomes de Oliveira, Rodrigo. "Geographic referring expressions : doing geometry with words." Thesis, University of Aberdeen, 2017. http://digitool.abdn.ac.uk:80/webclient/DeliveryManager?pid=232615.

21

Pérez-Rosas, Verónica. "Exploration of Visual, Acoustic, and Physiological Modalities to Complement Linguistic Representations for Sentiment Analysis." Thesis, University of North Texas, 2014. https://digital.library.unt.edu/ark:/67531/metadc699996/.

Abstract:
This research is concerned with the identification of sentiment in multimodal content. This is of particular interest given the increasing presence of subjective multimodal content on the web and other sources, which contains a rich and vast source of people's opinions, feelings, and experiences. Despite the need for tools that can identify opinions in the presence of diverse modalities, most current methods for sentiment analysis are designed for textual data only, and few attempts have been made to address this problem. The dissertation investigates techniques for augmenting linguistic representations with acoustic, visual, and physiological features. The potential benefits of using these modalities include linguistic disambiguation, visual grounding, and the integration of information about people's internal states. The main goal of this work is to build computational resources and tools that allow sentiment analysis to be applied to multimodal data. This thesis makes three important contributions. First, it shows that modalities such as audio, video, and physiological data can be successfully used to improve existing linguistic representations for sentiment analysis. We present a method that integrates linguistic features with features extracted from these modalities. Features are derived from verbal statements, audiovisual recordings, thermal recordings, and physiological sensor signals. The resulting multimodal sentiment analysis system is shown to significantly outperform the use of language alone. Using this system, we were able to predict the sentiment expressed in video reviews and also the sentiment experienced by viewers while exposed to emotionally loaded content. Second, the thesis provides evidence of the portability of the developed strategies to other affect recognition problems. We provided support for this by studying the deception detection problem. Third, this thesis contributes several multimodal datasets that will enable further research in sentiment and deception detection.
22

Botha, Gerrit Reinier. "Text-based language identification for the South African languages." Pretoria : [s.n.], 2007. http://upetd.up.ac.za/thesis/available/etd-090942008-133715/.

23

Grefenstette, Edward Thomas. "Category-theoretic quantitative compositional distributional models of natural language semantics." Thesis, University of Oxford, 2013. http://ora.ox.ac.uk/objects/uuid:d7f9433b-24c0-4fb5-925b-d8b3744b7012.

Abstract:
This thesis is about the problem of compositionality in distributional semantics. Distributional semantics presupposes that the meanings of words are a function of their occurrences in textual contexts. It models words as distributions over these contexts and represents them as vectors in high dimensional spaces. The problem of compositionality for such models concerns itself with how to produce distributional representations for larger units of text (such as a verb and its arguments) by composing the distributional representations of smaller units of text (such as individual words). This thesis focuses on a particular approach to this compositionality problem, namely using the categorical framework developed by Coecke, Sadrzadeh, and Clark, which combines syntactic analysis formalisms with distributional semantic representations of meaning to produce syntactically motivated composition operations. This thesis shows how this approach can be theoretically extended and practically implemented to produce concrete compositional distributional models of natural language semantics. It furthermore demonstrates that such models can perform on par with, or better than, other competing approaches in the field of natural language processing. There are three principal contributions to computational linguistics in this thesis. The first is to extend the DisCoCat framework on the syntactic front and semantic front, incorporating a number of syntactic analysis formalisms and providing learning procedures allowing for the generation of concrete compositional distributional models. The second contribution is to evaluate the models developed from the procedures presented here, showing that they outperform other compositional distributional models present in the literature. The third contribution is to show how using category theory to solve linguistic problems forms a sound basis for research, illustrated by examples of work on this topic, that also suggest directions for future research.
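
For reference, the canonical composition for a transitive sentence in the DisCoCat framework of Coecke, Sadrzadeh, and Clark can be written as below; this is the textbook formulation from the literature, not the specific extensions and learning procedures developed in this thesis.

$$
\overrightarrow{\textit{subj verb obj}}
  \;=\; (\epsilon_N \otimes 1_S \otimes \epsilon_N)
  \bigl(\overrightarrow{\textit{subj}} \otimes \overline{\textit{verb}} \otimes \overrightarrow{\textit{obj}}\bigr),
\qquad
\overline{\textit{verb}} \in N \otimes S \otimes N .
$$

In components, the i-th coordinate of the sentence vector is the contraction $\sum_{j,k} \textit{subj}_j \, \textit{verb}_{jik} \, \textit{obj}_k$: the verb tensor's noun indices are consumed by the subject and object vectors, leaving a vector in the sentence space $S$.
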
24

Liebscher, Robert Aubrey. "Temporal, categorical, and bibliographical context of scientific texts : interactions and applications /." Diss., Connect to a 24 p. preview or request complete full text in PDF format. Access restricted to UC campuses, 2005. http://wwwlib.umi.com/cr/ucsd/fullcit?p3207704.

25

Enss, Matthew. "An Investigation of Word Sense Disambiguation for Improving Lexical Chaining." Thesis, University of Waterloo, 2006. http://hdl.handle.net/10012/2938.

Abstract:
This thesis investigates how word sense disambiguation affects lexical chains, as well as proposing an improved model for lexical chaining in which word sense disambiguation is performed prior to lexical chaining. A lexical chain is a set of words from a document that are related in meaning. Lexical chains can be used to identify the dominant topics in a document, as well as where changes in topic occur. This makes them useful for applications such as topic segmentation and document summarization.

However, polysemous words are an inherent problem for algorithms that find lexical chains as the intended meaning of a polysemous word must be determined before its semantic relations to other words can be determined. For example, the word "bank" should only be placed in a chain with "money" if in the context of the document "bank" refers to a place that deals with money, rather than a river bank. The process by which the intended senses of polysemous words are determined is word sense disambiguation. To date, lexical chaining algorithms have performed word sense disambiguation as part of the overall process of building lexical chains. Because the intended senses of polysemous words must be determined before words can be properly chained, we propose that word sense disambiguation should be performed before lexical chaining occurs. Furthermore, if word sense disambiguation is performed prior to lexical chaining, then it can be done with any available disambiguation method, without regard to how lexical chains will be built afterwards. Therefore, the most accurate available method for word sense disambiguation should be applied prior to the creation of lexical chains.

We perform an experiment to demonstrate the validity of the proposed model. We compare the lexical chains produced in two cases:
  1. Lexical chaining is performed as normal on a corpus of documents that has not been disambiguated.
  2. Lexical chaining is performed on the same corpus, but all the words have been correctly disambiguated beforehand.
We show that the lexical chains created in the second case are more correct than the chains created in the first. This result demonstrates that accurate word sense disambiguation performed prior to the creation of lexical chains does lead to better lexical chains being produced, confirming that our model for lexical chaining is an improvement upon previous approaches.
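
A minimal sketch of the proposed disambiguate-first pipeline is shown below: senses are assigned with NLTK's Lesk baseline, and words are then chained when their chosen WordNet senses are sufficiently similar. The example sentence, the choice of Lesk, and the 0.2 similarity threshold are assumptions for illustration, not the experimental setup of the thesis.

```python
# Minimal sketch: word sense disambiguation first, then lexical chaining by
# linking words whose chosen WordNet senses are related.
# Requires the NLTK 'wordnet' corpus to be downloaded.
from nltk.wsd import lesk

text = "the bank approved the loan and the money was transferred the next day"
context = text.split()
targets = ["bank", "loan", "money", "day"]

# Step 1: disambiguate each target word in context. Any WSD method could be
# plugged in here; NLTK's Lesk implementation is just a readily available baseline.
senses = {w: lesk(context, w, pos="n") for w in targets}

# Step 2: greedily chain words whose disambiguated senses are close in WordNet.
THRESHOLD = 0.2  # assumed cut-off on WordNet path similarity
chains = []
for word, sense in senses.items():
    if sense is None:
        continue
    for chain in chains:
        similarity = sense.path_similarity(senses[chain[0]])
        if similarity is not None and similarity >= THRESHOLD:
            chain.append(word)
            break
    else:
        chains.append([word])

print(chains)  # each inner list is one lexical chain
```
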
26

Boyd, Adriane Amelia. "Detecting and Diagnosing Grammatical Errors for Beginning Learners of German: From Learner Corpus Annotation to Constraint Satisfaction Problems." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1325170396.

27

Ofoghi, Bahadorreza. "Enhancing factoid question answering using frame semantic-based approaches." University of Ballarat, 2009. http://innopac.ballarat.edu.au/record=b1503070.

Abstract:
FrameNet is used to enhance the performance of semantic QA systems. FrameNet is a linguistic resource that encapsulates Frame Semantics and provides scenario-based generalizations over lexical items that share similar semantic backgrounds.
Doctor of Philosophy
28

Shockley, Darla Magdalene. "Email Thread Summarization with Conditional Random Fields." The Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1268159269.

29

Pang, Bo. "Handwriting Chinese character recognition based on quantum particle swarm optimization support vector machine." Thesis, University of Macau, 2018. http://umaclib3.umac.mo/record=b3950620.

30

Bihi, Ahmed. "Analysis of similarity and differences between articles using semantics." Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-34843.

Abstract:
Adding semantic analysis to the process of comparing news articles enables a deeper level of analysis than traditional keyword matching. In this bachelor's thesis, we have implemented, compared, and evaluated three commonly used approaches to document-level similarity. The three similarity measures selected were keyword matching, TF-IDF vector distance, and Latent Semantic Indexing. Each method was evaluated on a coherent set of news articles in which the majority of the articles were written about Donald Trump and the American election of the 9th of November 2016; several control articles about random topics were included in the set. TF-IDF vector distance combined with cosine similarity, and Latent Semantic Indexing, gave the best results on the set of articles by separating the control articles from the Trump articles. Keyword matching and TF-IDF distance using Euclidean distance did not separate the Trump articles from the control articles. We also implemented and performed sentiment analysis on the set of news articles using the classes positive, negative and neutral, and then validated the results against human readers classifying the same articles. With this sentiment analysis implementation, we obtained a high correlation with human readers (100%).
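
As a reference for the best-performing configuration described above, the following is a minimal sketch of TF-IDF vectors compared with cosine similarity using scikit-learn; the toy articles are invented for illustration and are not the thesis corpus.

```python
# Minimal sketch: document-level similarity via TF-IDF vectors and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "Donald Trump wins the American election after a long campaign",
    "Trump declared winner of the US presidential election",
    "New recipe ideas for a quick weeknight dinner",  # control article
]

# Rows = articles, columns = weighted terms.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(articles)

# Pairwise cosine similarities: related articles score well above the control one.
similarity = cosine_similarity(tfidf)
print(similarity.round(2))
```
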
31

Zechner, Niklas. "A novel approach to text classification." Doctoral thesis, Umeå universitet, Institutionen för datavetenskap, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-138917.

Abstract:
This thesis explores the foundations of text classification, using both empirical and deductive methods, with a focus on author identification and syntactic methods. We strive for a thorough theoretical understanding of what affects the effectiveness of classification in general.  To begin with, we systematically investigate the effects of some parameters on the accuracy of author identification. How is the accuracy affected by the number of candidate authors, and the amount of data per candidate? Are there differences in how methods react to the changes in parameters? Using the same techniques, we see indications that methods previously thought to be topic-independent might not be so, but that syntactic methods may be the best option for avoiding topic dependence. This means that previous studies may have overestimated the power of lexical methods. We also briefly look for ways of spotting which particular features might be the most effective for classification. Apart from author identification, we apply similar methods to identifying properties of the author, including age and gender, and attempt to estimate the number of distinct authors in a text sample. In all cases, the techniques are proven viable if not overwhelmingly accurate, and we see that lexical and syntactic methods give very similar results.  In the final parts, we see some results of automata theory that can be of use for syntactic analysis and classification. First, we generalise a known algorithm for finding a list of the best-ranked strings according to a weighted automaton, to doing the same with trees and a tree automaton. This result can be of use for speeding up parsing, which often runs in several steps, where each step needs several trees from the previous as input. Second, we use a compressed version of deterministic finite automata, known as failure automata, and prove that finding the optimal compression is NP-complete, but that there are efficient algorithms for finding good approximations. Third, we find and prove the derivatives of regular expressions with cuts. Derivatives are an operation on expressions to calculate the remaining expression after reading a given symbol, and cuts are an extension to regular expressions found in many programming languages. Together, these findings may be able to improve on the syntactic analysis which we have seen is a valuable tool for text classification.
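
Since the abstract defines regular-expression derivatives only in passing, the sketch below shows classic Brzozowski derivatives for plain regular expressions, without the cut operator that the thesis extends them with; the representation and test expression are illustrative assumptions.

```python
# Minimal sketch: Brzozowski derivatives of regular expressions. The derivative
# of r with respect to a symbol a is a regex matching the remainders of the
# strings in L(r) that begin with a; a word matches r if the expression left
# after taking derivatives for all its symbols accepts the empty string.
from dataclasses import dataclass


class Regex:
    pass

@dataclass(frozen=True)
class Empty(Regex):      # matches nothing
    pass

@dataclass(frozen=True)
class Eps(Regex):        # matches only the empty string
    pass

@dataclass(frozen=True)
class Char(Regex):
    c: str

@dataclass(frozen=True)
class Alt(Regex):
    left: Regex
    right: Regex

@dataclass(frozen=True)
class Cat(Regex):
    left: Regex
    right: Regex

@dataclass(frozen=True)
class Star(Regex):
    inner: Regex


def nullable(r: Regex) -> bool:
    """True if r accepts the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.left) or nullable(r.right)
    if isinstance(r, Cat):
        return nullable(r.left) and nullable(r.right)
    return False  # Empty, Char

def deriv(r: Regex, a: str) -> Regex:
    """Derivative of r with respect to the symbol a."""
    if isinstance(r, Char):
        return Eps() if r.c == a else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.left, a), deriv(r.right, a))
    if isinstance(r, Cat):
        first = Cat(deriv(r.left, a), r.right)
        return Alt(first, deriv(r.right, a)) if nullable(r.left) else first
    if isinstance(r, Star):
        return Cat(deriv(r.inner, a), r)
    return Empty()  # Empty, Eps

def matches(r: Regex, word: str) -> bool:
    for a in word:
        r = deriv(r, a)
    return nullable(r)

regex = Star(Cat(Char("a"), Char("b")))               # (ab)*
print(matches(regex, "abab"), matches(regex, "aba"))  # True False
```
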
32

Tabassum Binte Jafar, Jeniya. "Information Extraction From User Generated Noisy Texts." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1606315356821532.

33

Shivade, Chaitanya P. "How sick are you? Methods for extracting textual evidence to expedite clinical trial screening." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1462810822.

34

Sil, Avirup. "Entity Information Extraction using Structured and Semi-structured resources." Diss., Temple University Libraries, 2014. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/272966.

Abstract:
Computer and Information Science
Ph.D.
Among all the tasks that exist in Information Extraction, Entity Linking, also referred to as entity disambiguation or entity resolution, is a new and important problem which has recently caught the attention of a lot of researchers in the Natural Language Processing (NLP) community. The task involves linking/matching a textual mention of a named-entity (like a person or a movie-name) to an appropriate entry in a database (e.g. Wikipedia or IMDB). If the database does not contain the entity it should return NIL (out-of-database) value. Existing techniques for linking named entities in text mostly focus on Wikipedia as a target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. In this dissertation, we introduce a new framework, called Open-Database Entity Linking (Open-DB EL), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. In experiments on two domains, our Open-DB EL strategies outperform a state-of-the-art Wikipedia EL system by over 25% in accuracy. Existing approaches typically perform EL using a pipeline architecture: they use a Named-Entity Recognition (NER) system to find the boundaries of mentions in text, and an EL system to connect the mentions to entries in structured or semi-structured repositories like Wikipedia. However, the two tasks are tightly coupled, and each type of system can benefit significantly from the kind of information provided by the other. We propose and develop a joint model for NER and EL, called NEREL, that takes a large set of candidate mentions from typical NER systems and a large set of candidate entity links from EL systems, and ranks the candidate mention-entity pairs together to make joint predictions. In NER and EL experiments across three datasets, NEREL significantly outperforms or comes close to the performance of two state-of-the-art NER systems, and it outperforms 6 competing EL systems. On the benchmark MSNBC dataset, NEREL, provides a 60% reduction in error over the next best NER system and a 68% reduction in error over the next-best EL system. We also extend the idea of using semi-structured resources to a relatively less explored area of entity information extraction. Most previous work on information extraction from text has focused on named-entity recognition, entity linking, and relation extraction. Much less attention has been paid to extracting the temporal scope for relations between named-entities; for example, the relation president-Of (John F. Kennedy, USA) is true only in the time-frame (January 20, 1961 - November 22, 1963). In this dissertation we present a system for temporal scoping of relational facts, called TSRF which is trained on distant supervision based on the largest semi-structured resource available: Wikipedia. TSRF employs language models consisting of patterns automatically bootstrapped from sentences collected from Wikipedia pages that contain the main entity of a page and slot-fillers extracted from the infobox tuples. This proposed system achieves state-of-the-art results on 6 out of 7 relations on the benchmark Text Analysis Conference (TAC) 2013 dataset for the task of temporal slot filling (TSF). Overall, the system outperforms the next best system that participated in the TAC evaluation by 10 points on the TAC-TSF evaluation metric.
Temple University--Theses
35

Cimiano, Philipp. "Ontology learning and population from text : algorithms, evaluation and applications /." New York, NY : Springer, 2006. http://www.loc.gov/catdir/enhancements/fy0824/2006931701-d.html.

36

Moncecchi, Guillermo. "Détection du langage spéculatif dans la littérature scientifique." PhD thesis, Université de Nanterre - Paris X, 2013. http://tel.archives-ouvertes.fr/tel-00800552.

Abstract:
This thesis proposes a methodology for solving certain classification problems, notably those concerning sequential classification in Natural Language Processing tasks. To improve the results of the classification task, we propose an iterative, error-based approach that integrates expert knowledge, represented as "knowledge rules", into the learning process. We applied the methodology to two tasks related to the detection of speculation ("hedging") in scientific literature: the detection of speculative text segments ("hedge cue identification") and the detection of the scope of those segments ("hedge cue scope detection"). The results are promising: for the first task, we improved the baseline F-score by 2.5 points by integrating data on the co-occurrence of speculative segments. For the second task, integrating syntactic information and rules for syntactic pruning improved the classification results from 0.712 to 0.835 (F-score). Compared with state-of-the-art methods, the results are very good and suggest that the approach of improving classifiers based solely on errors made on a corpus can also be applied to other, similar tasks. Moreover, this thesis proposes a class schema for representing the analysis of a sentence in a single structure that integrates the results of different linguistic analyses. This makes it easier to manage the iterative classifier-improvement process, in which different sets of learning features are used at each iteration. We also propose storing the features in a relational model instead of the usual textual structures, in order to facilitate the analysis and manipulation of the learned data.
37

Sadid-Al-Hasan, Sheikh (University of Lethbridge, Faculty of Arts and Science). "Answering complex questions : supervised approaches." Thesis, Lethbridge, Alta. : University of Lethbridge, Dept. of Mathematics and Computer Science, c2009, 2009. http://hdl.handle.net/10133/2478.

Abstract:
The term “Google” has become a verb for most of us. Search engines, however, have certain limitations. For example ask it for the impact of the current global financial crisis in different parts of the world, and you can expect to sift through thousands of results for the answer. This motivates the research in complex question answering where the purpose is to create summaries of large volumes of information as answers to complex questions, rather than simply offering a listing of sources. Unlike simple questions, complex questions cannot be answered easily as they often require inferencing and synthesizing information from multiple documents. Hence, this task is accomplished by the query-focused multidocument summarization systems. In this thesis we apply different supervised learning techniques to confront the complex question answering problem. To run our experiments, we consider the DUC-2007 main task. A huge amount of labeled data is a prerequisite for supervised training. It is expensive and time consuming when humans perform the labeling task manually. Automatic labeling can be a good remedy to this problem. We employ five different automatic annotation techniques to build extracts from human abstracts using ROUGE, Basic Element (BE) overlap, syntactic similarity measure, semantic similarity measure and Extended String Subsequence Kernel (ESSK). The representative supervised methods we use are Support Vector Machines (SVM), Conditional Random Fields (CRF), Hidden Markov Models (HMM) and Maximum Entropy (MaxEnt). We annotate DUC-2006 data and use them to train our systems, whereas 25 topics of DUC-2007 data set are used as test data. The evaluation results reveal the impact of automatic labeling methods on the performance of the supervised approaches to complex question answering. We also experiment with two ensemble-based approaches that show promising results for this problem domain.
x, 108 leaves : ill. ; 29 cm
38

Hale, Scott A. "Global connectivity, information diffusion, and the role of multilingual users in user-generated content platforms." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:3040a250-c526-4f10-aa9b-25117fd4dea2.

Abstract:
Internet content and Internet users are becoming more linguistically diverse as more people speaking different languages come online and produce content on user-generated content platforms. Several platforms have emerged as truly global platforms with users speaking many different languages and coming from around the world. It is now possible to study human behavior on these platforms using the digital trace data the platforms make available about the content people are authoring. Network literature suggests that people cluster together by language, but also that there is a small average path length between any two people on most Internet platforms (including two speakers of different languages). If so, multilingual users may play critical roles as bridges or brokers on these platforms by connecting clusters of monolingual users together across languages. The large differences in the content available in different languages online underscores the importance of such roles. This thesis studies the roles of multilingual users and platform design on two large, user-generated content platforms: Wikipedia and Twitter. It finds that language has a strong role structuring each platform, that multilingual users do act as linguistic bridges subject to certain limitations, that the size of a language correlates with the roles its speakers play in cross-language connections, and that there is a correlation between activity and multilingualism. In contrast to the general understanding in linguistics of high levels of multilingualism offline, this thesis finds relatively low levels of multilingualism on Twitter (11%) and Wikipedia (15%). The findings have implications for both platform design and social network theory. The findings suggest design strategies to increase multilingualism online through the identification and promotion of multilingual starter tasks, the discovery of related other-language information, and the promotion of user choice in linguistic filtering. While weak-ties have received much attention in the social networks literature, cross-language ties are often not distinguished from same-language weak ties. This thesis finds that cross-language ties are similar to same-language weak ties in that both connect distant parts of the network, have limited bandwidth, and yet transfer a non-trivial amount of information when considered in aggregate. At the same time, cross-language ties are distinct from same-language weak ties for the purposes of information diffusion. In general cross-language ties are smaller in number than same-language ties, but each cross-language tie may convey more diverse information given the large differences in the content available in different languages and the relative ease with which a multilingual speaker may access content in multiple languages compared to a monolingual speaker.
39

Buys, Jan Moolman. "Probabilistic tree transducers for grammatical error correction." Thesis, Stellenbosch : Stellenbosch University, 2013. http://hdl.handle.net/10019.1/85592.

Abstract:
Thesis (MSc)--Stellenbosch University, 2013.
ENGLISH ABSTRACT: We investigate the application of weighted tree transducers to correcting grammatical errors in natural language. Weighted finite-state transducers (FST) have been used successfully in a wide range of natural language processing (NLP) tasks, even though the expressiveness of the linguistic transformations they perform is limited. Recently, there has been an increase in the use of weighted tree transducers and related formalisms that can express syntax-based natural language transformations in a probabilistic setting. The NLP task that we investigate is the automatic correction of grammar errors made by English language learners. In contrast to spelling correction, which can be performed with a very high accuracy, the performance of grammar correction systems is still low for most error types. Commercial grammar correction systems mostly use rule-based methods. The most common approach in recent grammatical error correction research is to use statistical classifiers that make local decisions about the occurrence of specific error types. The approach that we investigate is related to a number of other approaches inspired by statistical machine translation (SMT) or based on language modelling. Corpora of language learner writing annotated with error corrections are used as training data. Our baseline model is a noisy-channel FST model consisting of an n-gram language model and a FST error model, which performs word insertion, deletion and replacement operations. The tree transducer model we use to perform error correction is a weighted top-down tree-to-string transducer, formulated to perform transformations between parse trees of correct sentences and incorrect sentences. Using an algorithm developed for syntax-based SMT, transducer rules are extracted from training data of which the correct version of sentences have been parsed. Rule weights are also estimated from the training data. Hypothesis sentences generated by the tree transducer are reranked using an n-gram language model. We perform experiments to evaluate the performance of different configurations of the proposed models. In our implementation an existing tree transducer toolkit is used. To make decoding time feasible sentences are split into clauses and heuristic pruning is performed during decoding. We consider different modelling choices in the construction of transducer rules. The evaluation of our models is based on precision and recall. Experiments are performed to correct various error types on two learner corpora. The results show that our system is competitive with existing approaches on several error types.
AFRIKAANSE OPSOMMING: We investigate the application of weighted tree automata to automatically correct grammatical errors in natural language. Weighted finite-state automata are used successfully in a wide range of natural language processing tasks, although the expressiveness of the linguistic transformations they perform is limited. Recently there has been an increase in the use of weighted tree automata and related formalisms that represent syntactic transformations of natural language in a probabilistic framework. The natural language processing application that we investigate is the automatic correction of language errors made by learners of English. While spell checking in English can be done with very high accuracy, the performance of language correction systems is still relatively weak for most error types. Commercial language correction systems predominantly use rule-based methods. The most common approach in recent research on grammatical error correction is to use statistical classifiers that make local decisions about the occurrence of specific error types. The approach that we investigate is related to a number of other approaches inspired by statistical machine translation or based on language modelling. Corpora of language-learner writing annotated with error corrections are used as training data. Our baseline system is a noisy-channel finite-state automaton model consisting of an n-gram language model and an error model that performs insertion, deletion and replacement operations at word level. The tree automaton model that we use for grammatical error correction is a weighted top-down tree-to-string transducer formulated to perform transformations between syntax trees of correct sentences and erroneous sentences. An algorithm developed for syntax-based statistical machine translation is used to extract rules from the training data, in which the correct versions of the sentences have been parsed. Rule weights are also estimated from the training data. Hypothesis sentences generated by the tree transducer are reranked using an n-gram language model. We perform experiments to evaluate the effectiveness of different configurations of the proposed models. In our implementation an existing tree transducer software package is used. To reduce decoding time, sentences are split into clauses and the search space is pruned heuristically. We consider various modelling choices in the construction of transducer rules. The evaluation of our models is based on precision and recall. Experiments are performed to correct various error types on two learner corpora. The results show that our model is competitive with existing approaches on several error types.
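To give a rough sense of the noisy-channel baseline described in the English abstract, the sketch below scores candidate corrections by combining an n-gram language-model log-probability with a word-level error model and picks the highest-scoring candidate. The bigram counts, edit probabilities, and confusion sets are invented placeholders; the thesis's actual baseline is built from weighted FSTs, and its main model uses a tree-to-string transducer toolkit, neither of which is reproduced here.

```python
# Sketch of a noisy-channel scorer for grammatical error correction:
# best correction = argmax_c  log P_LM(c) + log P_err(observed | c).
# The bigram counts and edit probabilities are toy placeholders, not the
# models trained in the thesis.
import math
from itertools import product

# Toy bigram language model with add-one smoothing over a tiny vocabulary.
BIGRAMS = {("she", "has"): 10, ("she", "have"): 1,
           ("has", "gone"): 8, ("have", "gone"): 1, ("<s>", "she"): 6}
VOCAB = {w for pair in BIGRAMS for w in pair}

def lm_logprob(words):
    score, prev = 0.0, "<s>"
    for w in words:
        score += math.log((BIGRAMS.get((prev, w), 0) + 1)
                          / (sum(BIGRAMS.values()) + len(VOCAB)))
        prev = w
    return score

# Error model: probability of the learner writing the observed word given
# the intended word (identity is most likely, substitution is penalised).
def err_logprob(observed, intended):
    return sum(math.log(0.9 if o == i else 0.1)
               for o, i in zip(observed, intended))

def correct(observed, confusion_sets):
    # Enumerate candidate corrections from per-position confusion sets.
    candidates = product(*[confusion_sets.get(w, [w]) for w in observed])
    return max(candidates, key=lambda c: lm_logprob(c) + err_logprob(observed, c))

print(correct(("she", "have", "gone"), {"have": ["have", "has"]}))
# -> ('she', 'has', 'gone') under these toy parameters
```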
APA, Harvard, Vancouver, ISO, and other styles
40

Packer, Thomas L. "Surface Realization Using a Featurized Syntactic Statistical Language Model." Diss., CLICK HERE for online access, 2006. http://contentdm.lib.byu.edu/ETD/image/etd1195.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Schwartz, Hansen A. "The acquisition of lexical knowledge from the web for aspects of semantic interpretation." Doctoral diss., University of Central Florida, 2011. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/5028.

Full text
Abstract:
This work investigates the effective acquisition of lexical knowledge from the Web to perform semantic interpretation. The Web provides an unprecedented amount of natural language from which to gain knowledge useful for semantic interpretation. The knowledge acquired is described as common sense knowledge, information one uses in his or her daily life to understand language and perception. Novel approaches are presented for both the acquisition of this knowledge and the use of the knowledge in semantic interpretation algorithms. The goal is to increase accuracy over other automatic semantic interpretation systems, and in turn enable stronger real-world applications such as machine translation, advanced Web search, sentiment analysis, and question answering. The major contributions of this dissertation consist of two methods of acquiring lexical knowledge from the Web, namely a database of common sense knowledge and Web selectors. The first method is a framework for acquiring a database of concept relationships. To acquire this knowledge, relationships between nouns are found on the Web and analyzed over WordNet using information theory, producing information about concepts rather than ambiguous words. For the second contribution, words called Web selectors are retrieved which take the place of an instance of a target word in its local context. The selectors serve for the system to learn the types of concepts that the sense of a target word should be similar to. Web selectors are acquired dynamically as part of a semantic interpretation algorithm, while the relationships in the database are useful to stand-alone programs. A final contribution of this dissertation concerns a novel semantic similarity measure and an evaluation of similarity and relatedness measures on tasks of concept similarity. Such tasks are useful when applying acquired knowledge to semantic interpretation. Applications to word sense disambiguation, an aspect of semantic interpretation, are used to evaluate the contributions. Disambiguation systems which utilize semantically annotated training data are considered supervised. The algorithms of this dissertation are considered minimally supervised; they do not require training data created by humans, though they may use human-created data sources. In the case of evaluating a database of common sense knowledge, integrating the knowledge into an existing minimally-supervised disambiguation system significantly improved results -- a 20.5% error reduction. Similarly, the Web selectors disambiguation system, which acquires knowledge directly as part of the algorithm, achieved results comparable with top minimally-supervised systems, an F-score of 80.2% on a standard noun disambiguation task. This work enables the study of many subsequent related tasks for improving semantic interpretation and its application to real-world technologies. Other aspects of semantic interpretation, such as semantic role labeling, could utilize the same methods presented here for word sense disambiguation. As the Web continues to grow, the capabilities of the systems in this dissertation are expected to increase. Although the Web selectors system achieves great results, a study in this dissertation shows likely improvements from acquiring more data. Furthermore, the methods for acquiring a database of common sense knowledge could be applied in a more exhaustive fashion for other types of common sense knowledge. Finally, perhaps the greatest benefits from this work will come from the enabling of real-world technologies that utilize semantic interpretation.
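The Web selectors idea can be sketched in a few lines for the disambiguation step alone: each WordNet sense of the target word is scored by its similarity to the selector words, and the best-scoring sense is chosen. In the sketch below the selectors are hard-coded and path similarity is used as a stand-in measure, whereas the dissertation acquires selectors from Web queries and evaluates several similarity measures; the snippet also assumes the NLTK WordNet data is installed.

```python
# Sketch of selector-based word sense disambiguation:
# score each sense of the target by aggregate similarity to selector words.
# Selectors are hard-coded here; in the dissertation they are harvested from
# the Web as words that can replace the target in its local context.
# Requires: pip install nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def disambiguate(target, selectors):
    best_sense, best_score = None, float("-inf")
    for sense in wn.synsets(target, pos=wn.NOUN):
        score = 0.0
        for sel in selectors:
            # take the best pairwise similarity over the selector's senses
            sims = [sense.path_similarity(s) or 0.0
                    for s in wn.synsets(sel, pos=wn.NOUN)]
            score += max(sims, default=0.0)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense, best_score

# "bass" in a sentence like "He caught a huge bass in the lake":
# plausible Web selectors would be other things one catches in a lake.
sense, score = disambiguate("bass", ["trout", "salmon", "fish"])
print(sense, sense.definition(), round(score, 2))
```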
ID: 029808979. System requirements: World Wide Web browser and PDF reader. Mode of access: World Wide Web. Thesis (Ph.D.)--University of Central Florida, 2011. Includes bibliographical references (p. 141-160).
Ph.D.
Doctorate
Electrical Engineering and Computer Science
Engineering and Computer Science
APA, Harvard, Vancouver, ISO, and other styles
42

Engelbrecht, Herman Arnold. "Automatic phoneme recognition of South African English." Thesis, Stellenbosch : Stellenbosch University, 2004. http://hdl.handle.net/10019.1/49867.

Full text
Abstract:
Thesis (MEng)--University of Stellenbosch, 2004.
ENGLISH ABSTRACT: Automatic speech recognition applications have been developed for many languages in other countries, but not much research has been conducted on developing Human Language Technology (HLT) for S.A. languages. Research has been performed on informally gathered speech data, but until now a speech corpus that could be used to develop HLT for S.A. languages did not exist. With the development of the African Speech Technology Speech Corpora, it has now become possible to develop commercial applications of HLT. The two main objectives of this work are the accurate modelling of phonemes, suitable for the purposes of large-vocabulary continuous speech recognition (LVCSR), and the evaluation of the untried S.A. English speech corpus. Three different aspects of phoneme modelling were investigated by performing isolated phoneme recognition on the NTIMIT speech corpus. The three aspects were signal processing, statistical modelling of HMM state distributions, and context-dependent phoneme modelling. Research has shown that the use of phonetic context when modelling phonemes forms an integral part of most modern LVCSR systems. To facilitate the context-dependent phoneme modelling, a method of constructing robust and accurate models using decision-tree-based state clustering techniques is described. The strength of this method is its ability to construct accurate models of contexts that did not occur in the training data. The method incorporates linguistic knowledge about the phonetic context, in conjunction with the training data, to decide which phoneme contexts are similar and should share model parameters. As LVCSR typically consists of the continuous recognition of spoken words, the context-dependent and context-independent phoneme models that were created for the isolated recognition experiments are evaluated by performing continuous phoneme recognition. The phoneme recognition experiments are performed, without the aid of a grammar or language model, on the S.A. English corpus. As the S.A. English corpus is newly created, no previous research exists to which the continuous recognition results can be compared. Therefore, it was necessary to create comparable baseline results by performing continuous phoneme recognition on the NTIMIT corpus. It was found that acceptable recognition accuracy was obtained on both the NTIMIT and S.A. English corpora. Furthermore, the results on S.A. English were 2-6% better than the results on NTIMIT, indicating that the S.A. English corpus is of a high enough quality that it can be used for the development of HLT.
AFRIKAANSE OPSOMMING: Automatic speech recognition has been developed for other languages in other countries, but not much research has been done to develop human language technology (HLT) for South African languages. Research has been done on informally collected speech, but until now there was no speech database that could be used for the development of HLT for S.A. languages. With the development of the African Speech Technology Speech Corpora, it has become possible to develop HLT suitable for commercial purposes. The two main goals of this thesis are the accurate modelling of phonemes, suitable for large-vocabulary continuous speech recognition (LVCSR), as well as the evaluation of the S.A. English speech database. Three aspects of phoneme modelling are investigated by performing isolated phoneme recognition on the NTIMIT speech database. The three aspects investigated are signal processing, statistical modelling of the HMM state distributions, and context-dependent phoneme modelling. Research has shown that the use of phonetic context forms an integral part of most modern LVCSR systems. It is therefore necessary to be able to build robust and accurate context-dependent models. For this purpose a decision-tree-based clustering technique is described. The technique is also able to build accurate models of contexts that did not occur in the training data. To decide which phonetic contexts are similar and should therefore share model parameters, the technique uses the training data and incorporates linguistic knowledge about the phonetic contexts. Because LVCSR typically involves the continuous recognition of words, the context-dependent and context-independent models built for the isolated phoneme recognition experiments are evaluated by means of continuous phoneme recognition. The continuous phoneme recognition experiments are performed on the S.A. English database, without the help of a language model or grammar. Because the S.A. English database is new, there is no other research against which the results can be compared. It was therefore necessary to generate continuous phoneme recognition results on the NTIMIT database, against which the S.A. English results can be compared. The results indicate acceptable phoneme recognition on both the NTIMIT and S.A. English databases. The results on S.A. English are even 2-6% better than the results on NTIMIT, which indicates that the S.A. English speech database is suitable for the development of HLT.
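The decision-tree-based state clustering mentioned in the abstract can be sketched as a greedy procedure: pool the context-dependent variants of a phone state and repeatedly apply the phonetic-context question whose yes/no split yields the largest gain in log-likelihood under single-Gaussian assumptions. The phone contexts, questions, and one-dimensional "acoustic" samples below are invented for illustration; a real system ties full HMM state distributions estimated from speech data.

```python
# Sketch of decision-tree-based state clustering for context-dependent phones.
# Each triphone state contributes 1-D toy samples; a split's gain is the
# increase in log-likelihood when the two halves each get their own single
# Gaussian instead of sharing the parent's.
import math

def loglik(samples):
    # Log-likelihood of samples under their own maximum-likelihood Gaussian.
    n = len(samples)
    if n < 2:
        return 0.0
    mean = sum(samples) / n
    var = max(sum((x - mean) ** 2 for x in samples) / n, 1e-4)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

# Toy triphone states of the phone /ae/ with (left, phone, right) contexts.
states = {
    ("b", "ae", "t"): [1.0, 1.1, 0.9],
    ("p", "ae", "t"): [1.0, 1.2, 1.1],
    ("m", "ae", "t"): [2.0, 2.1, 1.9],
    ("n", "ae", "n"): [2.2, 2.0, 2.1],
}
# Linguistic questions about the phonetic context (invented for the sketch).
questions = {
    "left_is_nasal":  lambda l, r: l in {"m", "n"},
    "right_is_nasal": lambda l, r: r in {"m", "n"},
    "left_is_voiced": lambda l, r: l in {"b", "m", "n"},
}

def best_split(cluster):
    parent = loglik([x for ctx in cluster for x in states[ctx]])
    best = None
    for name, q in questions.items():
        yes = [c for c in cluster if q(c[0], c[2])]
        no = [c for c in cluster if not q(c[0], c[2])]
        if not yes or not no:
            continue
        gain = (loglik([x for c in yes for x in states[c]]) +
                loglik([x for c in no for x in states[c]]) - parent)
        if best is None or gain > best[1]:
            best = (name, gain, yes, no)
    return best

# With these toy samples the left-nasal question separates the data best.
name, gain, yes, no = best_split(list(states))
print(f"split on {name} (gain {gain:.2f}): {yes} | {no}")
```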
APA, Harvard, Vancouver, ISO, and other styles
43

Newman-Griffis, Denis R. "Capturing Domain Semantics with Representation Learning: Applications to Health and Function." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587658607378958.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Botha, Jan Abraham. "Probabilistic modelling of morphologically rich languages." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:8df7324f-d3b8-47a1-8b0b-3a6feb5f45c7.

Full text
Abstract:
This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex language well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes.
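One ingredient of the distributed language model described above is that a word's representation is composed from vectors of its sub-word elements, so that morphologically related words share parameters. The sketch below shows only that additive composition step, with invented segmentations and random vectors; in the thesis the morpheme vectors are learned inside the language model rather than drawn at random.

```python
# Sketch of additive morpheme-based word representations: a word's vector is
# the sum of vectors for its (pre-segmented) morphemes, so that rare inflected
# forms share parameters with morphologically related words.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

# Hypothetical morphological segmentations (e.g. from an unsupervised segmenter).
segmentations = {
    "unlockable": ["un", "lock", "able"],
    "unlocking":  ["un", "lock", "ing"],
    "relockable": ["re", "lock", "able"],
}
morphemes = {m for parts in segmentations.values() for m in parts}
morph_vec = {m: rng.normal(size=DIM) for m in morphemes}

def word_vector(word):
    # Compose the word from its morpheme vectors; a whole-word fallback
    # for unsegmented words is omitted here.
    return sum(morph_vec[m] for m in segmentations[word])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1, v2, v3 = (word_vector(w) for w in segmentations)
print("unlockable ~ unlocking :", round(cosine(v1, v2), 2))  # share "un", "lock"
print("unlockable ~ relockable:", round(cosine(v1, v3), 2))  # share "lock", "able"
```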
APA, Harvard, Vancouver, ISO, and other styles
45

Stoia, Laura Cristina. "Noun phrase generation for situated dialogs." Columbus, Ohio : Ohio State University, 2007. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1196196971.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Bartl, Eduard. "Mathematical foundations of graded knowledge spaces." Diss., Online access via UMI:, 2009.

Find full text
Abstract:
Thesis (Ph. D.)--State University of New York at Binghamton, Thomas J. Watson School of Engineering and Applied Science, Department of Systems Science and Industrial Engineering, 2009.
Includes bibliographical references.
APA, Harvard, Vancouver, ISO, and other styles
47

Dhyani, Dushyanta Dhyani. "Boosting Supervised Neural Relation Extraction with Distant Supervision." The Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu1524095334803486.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Hughes, Cameron A. "Epistemic Structures of Interrogative Domains." Youngstown State University / OhioLINK, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=ysu1227285777.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Lin, Chi-San Althon. "Syntax-driven argument identification and multi-argument classification for semantic role labeling." The University of Waikato, 2007. http://hdl.handle.net/10289/2602.

Full text
Abstract:
Semantic role labeling is an important stage in systems for Natural Language Understanding. The basic problem is one of identifying who did what to whom for each predicate in a sentence. Thus labeling is a two-step process: identify constituent phrases that are arguments to a predicate, then label those arguments with appropriate thematic roles. Existing systems for semantic role labeling use machine learning methods to assign roles one at a time to candidate arguments. There are several drawbacks to this general approach. First, more than one candidate can be assigned the same role, which is undesirable. Second, the search for each candidate argument is exponential with respect to the number of words in the sentence. Third, single-role assignment cannot take advantage of dependencies known to exist between semantic roles of predicate arguments, such as their relative juxtaposition. And fourth, execution times for existing algorithms are excessive, making them unsuitable for real-time use. This thesis seeks to obviate these problems by approaching semantic role labeling as a multi-argument classification process. It observes that the only valid arguments to a predicate are unembedded constituent phrases that do not overlap that predicate. Given that semantic role labeling occurs after parsing, this thesis proposes an algorithm that systematically traverses the parse tree when looking for arguments, thereby eliminating the vast majority of impossible candidates. Moreover, instead of assigning semantic roles one at a time, an algorithm is proposed to assign all labels simultaneously, leveraging dependencies between roles and eliminating the problem of duplicate assignment. Experimental results are provided as evidence to show that a combination of the proposed argument identification and multi-argument classification algorithms outperforms all existing systems that use the same syntactic information.
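The argument identification step described in the abstract restricts candidates to unembedded constituents that do not overlap the predicate. The sketch below shows one way to enumerate such candidates by walking up from the predicate in an NLTK parse tree and collecting the siblings of each node on the path to the root; the toy sentence and traversal are illustrative stand-ins rather than the thesis's implementation.

```python
# Sketch of syntax-driven argument identification: starting at the predicate,
# walk up the parse tree and collect the siblings of each node on the path to
# the root. These are exactly the maximal constituents that do not contain
# (overlap) the predicate, so embedded and overlapping candidates are skipped.
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (DT The) (NN cat)) "
    "   (VP (VBD chased) (NP (DT the) (NN mouse)) "
    "       (PP (IN into) (NP (DT the) (NN garden)))))")

def candidate_arguments(tree, predicate_leaf_index):
    # Position of the predicate's POS node (parent of the leaf).
    pos = tree.leaf_treeposition(predicate_leaf_index)[:-1]
    candidates = []
    while len(pos) > 0:
        parent = pos[:-1]
        for i, sibling in enumerate(tree[parent]):
            if i != pos[-1] and isinstance(sibling, Tree):
                candidates.append(sibling)
        pos = parent
    return candidates

# Predicate "chased" is the third leaf (index 2).
for constituent in candidate_arguments(parse, 2):
    print(constituent.label(), " ".join(constituent.leaves()))
# -> NP "the mouse", PP "into the garden", NP "The cat"
```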
APA, Harvard, Vancouver, ISO, and other styles
50

Wijeratne, Sanjaya. "A Framework to Understand Emoji Meaning: Similarity and Sense Disambiguation of Emoji using EmojiNet." Wright State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=wright1547506375922938.

Full text
APA, Harvard, Vancouver, ISO, and other styles