Dissertations / Theses on the topic 'Natural Language Processing (NLP)'

Consult the top 50 dissertations / theses for your research on the topic 'Natural Language Processing (NLP).'


1

Hellmann, Sebastian. "Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data." Doctoral thesis, Universitätsbibliothek Leipzig, 2015. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-157932.

Full text
Abstract:
This thesis is a compendium of scientific works and engineering specifications that have been contributed to a large community of stakeholders to be copied, adapted, mixed, built upon and exploited in any way possible to achieve a common goal: Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data.

The explosion of information technology in the last two decades has led to a substantial growth in the quantity, diversity and complexity of web-accessible linguistic data. These resources become even more useful when linked with each other, and the last few years have seen the emergence of numerous approaches in various disciplines concerned with linguistic resources and NLP tools. It is the challenge of our time to store, interlink and exploit this wealth of data accumulated in more than half a century of computational linguistics, of empirical, corpus-based study of language, and of computational lexicography in all its heterogeneity. The vision of the Giant Global Graph (GGG) was conceived by Tim Berners-Lee, aiming to connect all data on the Web and to allow the discovery of new relations between this openly accessible data. This vision has been pursued by the Linked Open Data (LOD) community, where the cloud of published datasets comprises 295 data repositories and more than 30 billion RDF triples (as of September 2011). RDF is based on globally unique and accessible URIs and was specifically designed to establish links between such URIs (or resources). This is captured in the Linked Data paradigm, which postulates four rules: (1) referred entities should be designated by URIs, (2) these URIs should be resolvable over HTTP, (3) data should be represented by means of standards such as RDF, and (4) a resource should include links to other resources. Although it is difficult to precisely identify the reasons for the success of the LOD effort, advocates generally argue that open licenses as well as open access are key enablers for the growth of such a network, as they provide a strong incentive for collaboration and contribution by third parties. In his keynote at BNCOD 2011, Chris Bizer argued that with RDF the overall data integration effort can be “split between data publishers, third parties, and the data consumer”, a claim that can be substantiated by observing the evolution of many large data sets constituting the LOD cloud. As noted in the acknowledgement section, parts of this thesis have received extensive feedback from other scientists, practitioners and industry in many different ways. The main contributions of this thesis are summarized here.

Part I – Introduction and Background. During his keynote at the Language Resource and Evaluation Conference in 2012, Sören Auer stressed the decentralized, collaborative, interlinked and interoperable nature of the Web of Data. The keynote provides strong evidence that Semantic Web technologies such as Linked Data are on their way to becoming mainstream for the representation of language resources. The jointly written companion publication for the keynote was later extended as a book chapter in The People’s Web Meets NLP and serves as the basis for “Introduction” and “Background”, outlining some stages of the Linked Data publication and refinement chain. Both chapters stress the importance of open licenses and open access as enablers for collaboration and the ability to interlink data on the Web as a key feature of RDF, and provide a discussion of scalability issues and decentralization.
Furthermore, we elaborate on how conceptual interoperability can be achieved by (1) re-using vocabularies, (2) agile ontology development, (3) meetings to refine and adapt ontologies and (4) tool support to enrich ontologies and match schemata.

Part II - Language Resources as Linked Data. “Linked Data in Linguistics” and “NLP & DBpedia, an Upward Knowledge Acquisition Spiral” summarize the results of the Linked Data in Linguistics (LDL) Workshop in 2012 and the NLP & DBpedia Workshop in 2013 and give a preview of the MLOD special issue. In total, five proceedings – three published at CEUR (OKCon 2011, WoLE 2012, NLP & DBpedia 2013), one Springer book (Linked Data in Linguistics, LDL 2012) and one journal special issue (Multilingual Linked Open Data, MLOD, to appear) – have been (co-)edited to create incentives for scientists to convert and publish Linked Data and thus to contribute open and/or linguistic data to the LOD cloud. Based on the disseminated call for papers, 152 authors contributed one or more accepted submissions to our venues and 120 reviewers were involved in peer-reviewing. “DBpedia as a Multilingual Language Resource” and “Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Linked Data Cloud” contain this thesis’ contribution to the DBpedia Project, made in order to further increase the size and inter-linkage of the LOD Cloud with lexical-semantic resources. Our contribution comprises extracted data from Wiktionary (an online, collaborative dictionary similar to Wikipedia) in more than four languages (now six) as well as language-specific versions of DBpedia, including a quality assessment of inter-language links between Wikipedia editions and internationalized content negotiation rules for Linked Data. In particular, the work described in created the foundation for a DBpedia Internationalisation Committee, with members from over 15 different languages, with the common goal of pushing DBpedia as a free and open multilingual language resource.

Part III - The NLP Interchange Format (NIF). “NIF 2.0 Core Specification”, “NIF 2.0 Resources and Architecture” and “Evaluation and Related Work” constitute one of the main contributions of this thesis. The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. The core specification is included in and describes which URI schemes and RDF vocabularies must be used for (parts of) natural language texts and annotations in order to create an RDF/OWL-based interoperability layer with NIF built upon Unicode Code Points in Normal Form C. In , classes and properties of the NIF Core Ontology are described to formally define the relations between text, substrings and their URI schemes. contains the evaluation of NIF. In a questionnaire, we asked questions to 13 developers using NIF. UIMA, GATE and Stanbol are extensible NLP frameworks, and NIF was not yet able to provide off-the-shelf NLP domain ontologies for all possible domains, but only for the plugins used in this study. After inspecting the software, the developers nevertheless agreed that NIF is adequate for providing generic RDF output using literal objects for annotations. All developers were able to map the internal data structure to NIF URIs to serialize RDF output (Adequacy).
The development effort in hours (ranging between 3 and 40 hours) as well as the number of code lines (ranging between 110 and 445) suggest that the implementation of NIF wrappers is easy and fast for an average developer. Furthermore, the evaluation contains a comparison to other formats and an evaluation of the available URI schemes for web annotation. In order to collect input from the wide group of stakeholders, a total of 16 presentations were given with extensive discussions and feedback, which has led to a constant improvement of NIF from 2010 until 2013. After the release of NIF (Version 1.0) in November 2011, a total of 32 vocabulary employments and implementations for different NLP tools and converters were reported (8 by the (co-)authors, including the Wiki-link corpus, 13 by people participating in our survey and 11 more of which we have heard). Several roll-out meetings and tutorials were held (e.g. in Leipzig and Prague in 2013) and are planned (e.g. at LREC 2014).

Part IV - The NLP Interchange Format in Use. “Use Cases and Applications for NIF” and “Publication of Corpora using NIF” describe 8 concrete instances where NIF has been successfully used. One major contribution in is the usage of NIF as the recommended RDF mapping in the Internationalization Tag Set (ITS) 2.0 W3C standard and the conversion algorithms from ITS to NIF and back. One outcome of the discussions in the standardization meetings and telephone conferences for ITS 2.0 was the conclusion that there was no alternative RDF format or vocabulary other than NIF with the required features to fulfill the working group charter. Five further uses of NIF are described for the Ontology of Linguistic Annotations (OLiA), the RDFaCE tool, the Tiger Corpus Navigator, the OntosFeeder and visualisations of NIF using the RelFinder tool. These 8 instances provide an implemented proof-of-concept of the features of NIF. starts with describing the conversion and hosting of the huge Google Wikilinks corpus with 40 million annotations for 3 million web sites. The resulting RDF dump contains 477 million triples in a 5.6 GB compressed dump file in Turtle syntax. describes how NIF can be used to publish extracted facts from news feeds in the RDFLiveNews tool as Linked Data.

Part V - Conclusions. provides lessons learned for NIF, conclusions and an outlook on future work. Most of the contributions are already summarized above. One particular aspect worth mentioning is the increasing number of NIF-formatted corpora for Named Entity Recognition (NER) that have come into existence after the publication of the main NIF paper, Integrating NLP using Linked Data, at ISWC 2013. These include the corpora converted by Steinmetz, Knuth and Sack for the NLP & DBpedia workshop and an OpenNLP-based CoNLL converter by Brümmer. Furthermore, we are aware of three LREC 2014 submissions that leverage NIF: NIF4OGGD - NLP Interchange Format for Open German Governmental Data, N^3 – A Collection of Datasets for Named Entity Recognition and Disambiguation in the NLP Interchange Format and Global Intelligent Content: Active Curation of Language Resources using Linked Data, as well as an early implementation of a GATE-based NER/NEL evaluation framework by Dojchinovski and Kliegr. Further funding for the maintenance, interlinking and publication of Linguistic Linked Data as well as support and improvements of NIF is available via the expiring LOD2 EU project, as well as the CSA EU project called LIDER, which started in November 2013.
Based on the evidence of successful adoption presented in this thesis, we can expect a decent to high chance of reaching critical mass of Linked Data technology as well as the NIF standard in the field of Natural Language Processing and Language Resources.
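To make the offset-based annotation model described above more concrete, here is a minimal, hypothetical sketch in Python with rdflib of what a NIF-style annotation can look like. The namespace, class and property names follow the published NIF Core ontology and the ITS 2.0 RDF vocabulary as commonly documented; the document URI, example text and DBpedia link are invented for illustration and are not taken from the thesis.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

# Namespaces as published for NIF Core and ITS 2.0 RDF (treated here as assumptions).
NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")

text = "Leipzig is a city in Germany."
base = "http://example.org/doc1#"   # hypothetical document URI

g = Graph()
g.bind("nif", NIF)
g.bind("itsrdf", ITSRDF)

# The whole string becomes a nif:Context resource with an offset-based URI.
ctx = URIRef(base + f"char=0,{len(text)}")
g.add((ctx, RDF.type, NIF.Context))
g.add((ctx, NIF.isString, Literal(text)))

# Annotate the substring "Leipzig" (characters 0-7) and link it to a DBpedia resource.
mention = URIRef(base + "char=0,7")
g.add((mention, RDF.type, NIF.String))
g.add((mention, NIF.referenceContext, ctx))
g.add((mention, NIF.anchorOf, Literal("Leipzig")))
g.add((mention, NIF.beginIndex, Literal(0, datatype=XSD.nonNegativeInteger)))
g.add((mention, NIF.endIndex, Literal(7, datatype=XSD.nonNegativeInteger)))
g.add((mention, ITSRDF.taIdentRef, URIRef("http://dbpedia.org/resource/Leipzig")))

print(g.serialize(format="turtle"))
```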
APA, Harvard, Vancouver, ISO, and other styles
2

NOZZA, DEBORA. "Deep Learning for Feature Representation in Natural Language Processing." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2018. http://hdl.handle.net/10281/241185.

Full text
Abstract:
The huge amount of textual user-generated content on the Web has grown incredibly in the last decade, creating relevant new opportunities for different real-world applications and domains. To overcome the difficulties of dealing with this large volume of unstructured data, the research field of Natural Language Processing has provided efficient solutions by developing computational models able to understand and interpret human natural language without any (or almost any) human intervention. The field has gained further computational efficiency and performance from the advent of recent machine learning research lines concerned with Deep Learning. In particular, this thesis focuses on a specific class of Deep Learning models devoted to learning high-level and meaningful representations of input data in unsupervised settings, by computing multiple non-linear transformations of increasing complexity and abstraction. Indeed, learning expressive representations from the data is a crucial step in Natural Language Processing, because it involves the transformation from discrete symbols (e.g. characters) to a machine-readable representation as real-valued vectors, which should encode the semantic and syntactic meanings of the language units. The first research direction of this thesis aims to give evidence that enhancing Natural Language Processing models with representations obtained by unsupervised Deep Learning models can significantly improve the computational abilities of making sense of large volumes of user-generated text. In particular, this thesis addresses tasks that are crucial for understanding what the text is talking about, by extracting and disambiguating the named entities (Named Entity Recognition and Linking), and which opinion the user is expressing, dealing also with irony (Sentiment Analysis and Irony Detection). For each task, this thesis proposes a novel Natural Language Processing model enhanced by the data representation obtained by Deep Learning. As a second research direction, this thesis investigates the development of a novel Deep Learning model for learning a meaningful textual representation that takes into account the relational structure underlying user-generated content. The inferred representation comprises both textual and relational information. Once the data representation is obtained, it can be exploited by off-the-shelf machine learning algorithms in order to perform different Natural Language Processing tasks. In conclusion, the experimental investigations reveal that models able to incorporate high-level features obtained by Deep Learning show significant performance gains and improved generalization abilities. Further improvements can also be achieved by models able to take into account the relational information in addition to the textual content.
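As a rough illustration of the pipeline the abstract describes (unsupervised representation learning feeding a supervised NLP model), the following sketch trains Word2Vec embeddings with gensim on a toy unlabeled corpus and feeds averaged vectors to a scikit-learn classifier. The corpus, labels and hyperparameters are placeholders; the thesis' actual models are more sophisticated.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Tokenized unlabeled corpus; in practice this would be large-scale user-generated text.
unlabeled = [
    ["the", "battery", "life", "is", "amazing"],
    ["terrible", "screen", "and", "poor", "battery"],
    ["i", "love", "this", "phone"],
    ["worst", "purchase", "ever"],
]

# Step 1: learn word representations without supervision.
w2v = Word2Vec(sentences=unlabeled, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

def embed(tokens):
    """Average word vectors as a simple sentence representation."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

# Step 2: feed the learned features to a supervised model (toy sentiment labels, 1 = positive).
X = np.vstack([embed(s) for s in unlabeled])
y = [1, 0, 1, 0]
clf = LogisticRegression().fit(X, y)
print(clf.predict([embed(["amazing", "battery"])]))
```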
APA, Harvard, Vancouver, ISO, and other styles
3

Panesar, Kulvinder. "Natural language processing (NLP) in Artificial Intelligence (AI): a functional linguistic perspective." Vernon Press, 2020. http://hdl.handle.net/10454/18140.

Full text
Abstract:
This chapter encapsulates the multi-disciplinary nature that facilitates NLP in AI and reports on a linguistically orientated conversational software agent (CSA) framework (Panesar 2017) sensitive to natural language processing (NLP), that is, language in the agent environment. We present a novel computational approach that uses the functional linguistic theory of Role and Reference Grammar (RRG) as the linguistic engine. Viewing language as action, utterances change the state of the world, and hence speakers' and hearers' mental states change as a result of these utterances. The plan-based method of discourse management (DM) using the BDI model architecture is deployed to support greater conversational complexity. This CSA investigates the integration, intersection and interface of the language, knowledge, speech act constructions (SAC) as a grammatical object, and the sub-model of BDI and DM for NLP. We present an investigation into the intersection and interface between our linguistic and knowledge (belief base) models for both dialogue management and planning. The architecture has three-phase models: (1) a linguistic model based on RRG; (2) an Agent Cognitive Model (ACM) with (a) a knowledge representation model employing conceptual graphs (CGs) serialised to Resource Description Framework (RDF) and (b) a planning model underpinned by BDI concepts, intentionality and rational interaction; and (3) a dialogue model employing common ground. The use of RRG as a linguistic engine for the CSA was successful. We identify the complexity of the semantic gap between internal representations and give details of a conceptual bridging solution.
APA, Harvard, Vancouver, ISO, and other styles
4

Välme, Emma, and Lea Renmarker. "Accelerating Sustainability Report Assessment with Natural Language Processing." Thesis, Uppsala universitet, Avdelningen för visuell information och interaktion, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-445912.

Full text
Abstract:
Corporations are expected to be transparent about their sustainability impact and to keep their stakeholders informed about how large the impact on the environment is, as well as their work on reducing that impact. This transparency is provided in a, usually voluntary, sustainability report in addition to the already required financial report. With new regulations for mandatory sustainability reporting in Sweden, comprehensive and complete guidelines for corporations to follow are insufficient and the reports tend to be extensive. The reports are therefore hard to assess in terms of how well the reporting is actually done. The Sustainability Reporting Maturity Grid (SRMG) is an assessment tool introduced by Cöster et al. (2020) used for assessing the quality of sustainability reporting. Today, the assessment is performed manually, which has proven to be both time-consuming and to result in varying assessments, affected by individual interpretation of the content. This thesis explores how assessment time and grading with the SRMG can be improved by applying Natural Language Processing (NLP) to sustainability documents, resulting in a condensed assessment method: the Prototype. The Prototype intends to facilitate and speed up the assessment process. The first step towards developing the Prototype was to decide which of three Machine Learning models is most suitable: Naïve Bayes (NB), Support Vector Machines (SVM), or Bidirectional Encoder Representations from Transformers (BERT). This decision was supported by analyzing the accuracy of each model for the respective criteria in the SRMG, where BERT showed strong classification ability with an average accuracy of 96.8%. Results from the user evaluation of the Prototype indicated that the assessment time can be halved using the Prototype, with an initial average of 40 minutes decreased to 20 minutes. However, the results further showed a decreased average grading and an increased variation in assessment. The results indicate that applying NLP could be successful, but to get a more competitive Prototype, a more nuanced dataset must be developed, giving the model more space to detect patterns in the data.
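A hedged sketch of the kind of model comparison the abstract mentions, using scikit-learn to cross-validate Naïve Bayes and a linear SVM on invented sustainability-report sentences with placeholder labels; the real SRMG criteria, data and the BERT fine-tuning step are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Hypothetical report sentences and binary labels (1 = concrete/quantified disclosure).
texts = [
    "We report our scope 1 and 2 emissions annually.",
    "Our board reviews sustainability targets quarterly.",
    "No environmental data is disclosed this year.",
    "Sustainability is important to us.",
    "We aim to reduce water use by 20% by 2025.",
    "We care deeply about the planet.",
]
labels = [1, 1, 0, 0, 1, 0]

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    # TF-IDF over word unigrams and bigrams feeds each classifier.
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    scores = cross_val_score(pipe, texts, labels, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```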
APA, Harvard, Vancouver, ISO, and other styles
5

Djoweini, Camran, and Henrietta Hellberg. "Approaches to natural language processing in app development." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-230167.

Full text
Abstract:
Natural language processing is an evolving field that is not yet fully established. High demand for natural language processing in applications creates a need for good development tools and for implementation approaches suited to the engineers behind the applications. This project approaches the field from an engineering point of view to survey the approaches, tools, and techniques that are readily available today for developing natural language processing support. The natural language processing sub-area of information retrieval was examined through a case study, where prototypes were developed to gain a deeper understanding of the tools and techniques used for such tasks from an engineering point of view. We found that there are two major approaches to developing natural language processing support for applications: high-level and low-level approaches. A categorization of tools and frameworks belonging to the two approaches is presented, as well as the source code, documentation and evaluations of two prototypes developed as part of the research. The choice of approach, tools and techniques should be based on the specifications and requirements of the final product, and both levels have their own pros and cons. The results of the report are, to a large extent, generalizable, as many different natural language processing tasks can be solved using similar solutions even if their goals vary.
APA, Harvard, Vancouver, ISO, and other styles
6

Sætre, Rune. "GeneTUC: Natural Language Understanding in Medical Text." Doctoral thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2006. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-545.

Full text
Abstract:

Natural Language Understanding (NLU) is a 50-year-old research field, but its application to molecular biology literature (BioNLU) is less than 10 years old. After the complete human genome sequence was published by the Human Genome Project and Celera in 2001, there has been an explosion of research, shifting the NLU focus from domains like news articles to the domain of molecular biology and medical literature. BioNLU is needed because almost 2000 new articles are published and indexed every day, and biologists need to keep track of existing knowledge relevant to their own research. So far, BioNLU results are not as good as in other NLU domains, so more research is needed to solve the challenges of creating useful NLU applications for biologists.

The work in this PhD thesis is a “proof of concept”. It is the first to show that an existing Question Answering (QA) system can be successfully applied in the hard BioNLU domain, after the essential challenge of unknown entities is solved. The core contribution is a system that discovers and classifies unknown entities and relations between them automatically. The World Wide Web (through Google) is used as the main resource, and the performance is almost as good as other named entity extraction systems, but the advantage of this approach is that it is much simpler and requires less manual labor than any of the other comparable systems.

The first paper in this collection gives an overview of the field of NLU and shows how the Information Extraction (IE) problem can be formulated with Local Grammars. The second paper uses Machine Learning to automatically recognize protein names based on features from the GSearch Engine. In the third paper, GSearch is substituted with Google, and the task is to extract all unknown names belonging to one of 273 biomedical entity classes, such as genes, proteins and processes. After getting promising results with Google, the fourth paper shows that this approach can also be used to retrieve interactions or relationships between the named entities. The fifth paper describes an online implementation of the system and shows that the method scales well to a larger set of entities.

The final paper concludes the “proof of concept” research and shows that the performance of the original GeneTUC NLU system has increased from handling 10% of the sentences in a large collection of abstracts in 2001 to 50% in 2006. This is still not good enough to create a commercial system, but it is believed that another 40% performance gain can be achieved by importing more verb templates into GeneTUC, just as nouns were imported during this work. Work has already begun on this, in the form of a local Master's thesis.

APA, Harvard, Vancouver, ISO, and other styles
7

Andrén, Samuel, and William Bolin. "NLIs over APIs : Evaluating Pattern Matching as a way of processing natural language for a simple API." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186429.

Full text
Abstract:
This report explores the feasibility of using pattern matching to implement a robust Natural Language Interface (NLI) over a limited Application Programming Interface (API). Because APIs are used to such a great extent today, often in mobile applications, it becomes increasingly important to find simple ways of making them accessible to end users. A very intuitive way to access information via an API is using natural language. Therefore, this study first explores the possibility of building a corpus of the most common phrases used for a particular API. It then examines how those phrases adhere to patterns, and how these patterns can be used to extract meaning from a phrase. Finally, it evaluates an implementation of an NLI using a pattern matching system based on those patterns. The corpus construction shows that although the number of unique phrases used with our API seems to increase quite steadily, the number of patterns those phrases follow quickly converges to a constant. This implies that it is possible to use these patterns to create an NLI that is robust enough to query an API effectively. The evaluation of the pattern matching system indicates that this technique can be used to successfully extract information from a phrase if its pattern is known by the system.
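The following is a minimal, illustrative sketch of the pattern-matching idea: regular-expression patterns derived from common phrasings are mapped to API request templates. The patterns and the API endpoints below are invented; the thesis builds its patterns from a collected corpus of real user phrases.

```python
import re

# A few hypothetical phrase patterns for a weather-style API.
PATTERNS = [
    (re.compile(r"^what is the (?P<field>weather|temperature) in (?P<city>[\w ]+)\??$", re.I),
     "GET /v1/{field}?city={city}"),
    (re.compile(r"^will it rain in (?P<city>[\w ]+) (?P<day>today|tomorrow)\??$", re.I),
     "GET /v1/rain?city={city}&day={day}"),
]

def phrase_to_request(phrase: str):
    """Match a natural-language phrase against known patterns and build an API request."""
    for pattern, template in PATTERNS:
        m = pattern.match(phrase.strip())
        if m:
            slots = {k: v.lower().replace(" ", "%20") for k, v in m.groupdict().items()}
            return template.format(**slots)
    return None  # unknown pattern: fall back to a default answer or ask the user

print(phrase_to_request("What is the temperature in Stockholm?"))
# -> GET /v1/temperature?city=stockholm
```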
APA, Harvard, Vancouver, ISO, and other styles
8

Wallner, Vanja. "Mapping medical expressions to MedDRA using Natural Language Processing." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-426916.

Full text
Abstract:
Pharmacovigilance, also referred to as drug safety, is an important science for identifying risks related to medicine intake. Side effects of medicine can be caused by, for example, interactions, high dosage and misuse. In order to find patterns in what causes the unwanted effects, information needs to be gathered and mapped to predefined terms. This mapping is today done manually by experts, which can be a very difficult and time-consuming task. The aim of this thesis is to automate the process of mapping side effects by using machine learning techniques. The model was developed using information from pre-existing mappings of verbatim expressions of side effects. The final model made use of the pre-trained language model BERT, which has achieved state-of-the-art results within the NLP field. When evaluated on the test set, the final model achieved an accuracy of 80.21%. Some verbatims were found to be very difficult for our model to classify, mainly because of ambiguity or a lack of information contained in the verbatim. As it is very important for the mappings to be done correctly, a threshold was introduced which left the verbatims that were most difficult to classify for manual mapping. This process could still be improved, however, as suggested terms generated by the model could be used as support for the specialist responsible for the manual mapping.
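A small sketch of the thresholding idea described above: a text classifier maps verbatim expressions to preferred terms, and predictions below a confidence threshold are deferred to an expert together with ranked suggestions. For brevity this uses a character n-gram logistic-regression model rather than BERT, and the verbatims, terms and threshold are invented placeholders, not MedDRA data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy verbatim -> preferred-term examples (placeholders, not real MedDRA data).
verbatims = ["splitting headache", "head ache", "felt dizzy", "dizziness all day",
             "pain in the head", "room was spinning"]
terms     = ["Headache", "Headache", "Dizziness", "Dizziness", "Headache", "Dizziness"]

model = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                      LogisticRegression(max_iter=1000))
model.fit(verbatims, terms)

THRESHOLD = 0.7  # below this confidence, defer to a human expert

def map_verbatim(text: str):
    probs = model.predict_proba([text])[0]
    best = probs.argmax()
    if probs[best] < THRESHOLD:
        # Return ranked suggestions as support for manual mapping.
        ranked = sorted(zip(model.classes_, probs), key=lambda p: -p[1])
        return {"decision": "manual review", "suggestions": ranked[:3]}
    return {"decision": model.classes_[best], "confidence": float(probs[best])}

print(map_verbatim("terrible headache"))
print(map_verbatim("strange tingling"))
```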
APA, Harvard, Vancouver, ISO, and other styles
9

Woldemariam, Yonas Demeke. "Natural language processing in cross-media analysis." Licentiate thesis, Umeå universitet, Institutionen för datavetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-147640.

Full text
Abstract:
A cross-media analysis framework is an integrated multi-modal platform where a media resource containing different types of data, such as text, images, audio and video, is analyzed with metadata extractors working jointly to contextualize the media resource. It generally provides cross-media analysis and automatic annotation, metadata publication and storage, and search and recommendation services. Such services allow on-line content providers to semantically enhance a media resource with the extracted metadata representing its hidden meanings and to make it more efficiently searchable. Within the architecture of such frameworks, Natural Language Processing (NLP) infrastructures cover a substantial part. The NLP infrastructures include text analysis components such as a parser, named entity extraction and linking, sentiment analysis and automatic speech recognition. Since NLP tools and techniques are originally designed to operate in isolation, integrating them in cross-media frameworks and analyzing textual data extracted from multimedia sources is very challenging. In particular, the text extracted from audio-visual content lacks linguistic features that could provide important clues for text analysis components. Thus, there is a need to develop various techniques to meet the requirements and design principles of the frameworks. In this thesis, we explore the development of methods and models that satisfy the text and speech analysis requirements posed by cross-media analysis frameworks. The developed methods allow the frameworks to extract linguistic knowledge of various types and to predict information such as sentiment and competence. We also attempt to enhance the multilingualism of the frameworks by designing an analysis pipeline that includes speech recognition, transliteration and named entity recognition for Amharic, which also makes Amharic content on the web more efficiently accessible. The method can potentially be extended to support other under-resourced languages.
APA, Harvard, Vancouver, ISO, and other styles
10

Huang, Fei. "Improving NLP Systems Using Unconventional, Freely-Available Data." Diss., Temple University Libraries, 2013. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/221031.

Full text
Abstract:
Sentence labeling is a type of pattern recognition task that involves the assignment of a categorical label to each member of a sentence of observed words. Standard supervised sentence-labeling systems often have poor generalization: because they only use words as features in their prediction tasks, it is difficult to estimate parameters for words which appear in the test set but seldom (or never) appear in the training set. Representation learning is a promising technique for discovering features that allow a supervised classifier to generalize from a source domain dataset to arbitrary new domains. We demonstrate that features which are learned from distributional representations of unlabeled data can be used to improve performance on out-of-vocabulary words and help the model to generalize. We also argue that it is important for a representation learner to be able to incorporate expert knowledge during its search for helpful features. We investigate techniques for building open-domain sentence labeling systems that approach the ideal of a system whose accuracy is high and consistent across domains. In particular, we investigate unsupervised techniques for language model representation learning that provide new features which are stable across domains, in that they are predictive in both the training and out-of-domain test data. In experiments, our best system with the proposed techniques reduces error by as much as 11.4% relative to the previous system using traditional representations on the Part-of-Speech tagging task. Moreover, we leverage the Posterior Regularization framework and develop an architecture for incorporating biases from prior knowledge into representation learning. We investigate three types of biases: entropy bias, distance bias and predictive bias. Experiments on two domain adaptation tasks show that our biased learners identify significantly better sets of features than unbiased learners. This results in a relative reduction in error of more than 16% for both tasks with respect to existing state-of-the-art representation learning techniques. We also extend the idea of using additional unlabeled data to improve the system's performance on a different NLP task, word alignment. Traditional word alignment only takes a sentence-level aligned parallel corpus as input and generates the word-level alignments. However, with the increasing integration of different cultures, more and more people are competent in multiple languages, and they often use elements of multiple languages in conversations. Linguistic Code Switching (LCS) is such a situation, where two or more languages show up in the context of a single conversation. Traditional machine translation (MT) systems treat LCS data as noise, or just as regular sentences. However, if LCS data is processed intelligently, it can provide a useful signal for training word alignment and MT models. In this work, we first extract constraints from this code switching data and then incorporate them into a word alignment model training procedure. We also show that by using the code switching data, we can jointly train a word alignment model and a language model using co-training. Our techniques for incorporating LCS data improve the BLEU score by 2.64 over a baseline MT system trained using only standard sentence-aligned corpora.
APA, Harvard, Vancouver, ISO, and other styles
11

Karlin, Ievgen. "An Evaluation of NLP Toolkits for Information Quality Assessment." Thesis, Linnéuniversitetet, Institutionen för datavetenskap, fysik och matematik, DFM, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-22606.

Full text
Abstract:
Documentation is often the first source that can help a user solve problems or explain the conditions of use of a product. That is why it should be clear and understandable. But what does “understandable” mean? And how can we detect whether some text is unclear? This thesis answers those questions. The main idea of the current work is to measure the clarity of text information using natural language processing capabilities. There are three main steps to achieve this goal: defining criteria for poor clarity of text information, evaluating different natural language toolkits and finding one suitable for us, and implementing a prototype system that, given a text, measures its clarity. The thesis project is planned to be included in VizzAnalyzer (a quality analysis tool that processes information at the structural level), and its main task is to perform a clarity analysis of text information extracted by VizzAnalyzer from different XML files.
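As an illustration only, a prototype of the kind sketched above might compute simple surface indicators of clarity; the metrics below (sentence length, passive-voice hits, vague words) are assumptions chosen for demonstration and are not the criteria defined in the thesis.

```python
import re

VAGUE_WORDS = {"some", "various", "appropriate", "etc", "stuff", "thing", "several"}

def clarity_report(text: str) -> dict:
    """Compute a few crude clarity indicators for a piece of documentation."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    avg_len = len(words) / max(len(sentences), 1)
    passive_hits = len(re.findall(r"\b(?:is|are|was|were|been|be)\s+\w+ed\b", text.lower()))
    vague_hits = sum(1 for w in words if w in VAGUE_WORDS)
    return {
        "sentences": len(sentences),
        "avg_sentence_length": round(avg_len, 1),
        "passive_constructions": passive_hits,
        "vague_words": vague_hits,
    }

sample = ("The device should be configured appropriately before various things are connected. "
          "Press the power button for three seconds.")
print(clarity_report(sample))
```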
APA, Harvard, Vancouver, ISO, and other styles
12

Boulanger, Hugo. "Data augmentation and generation for natural language processing." Electronic Thesis or Diss., université Paris-Saclay, 2023. http://www.theses.fr/2023UPASG019.

Full text
Abstract:
More and more fields are looking to automate part of their process. Automatic language processing contains methods for extracting information from texts. These methods can use machine learning. Machine learning requires annotated data to perform information extraction. Applying these methods to new domains requires obtaining annotated data related to the task. In this thesis, our goal is to study generation methods to improve the performance of models learned with low amounts of data. Different generation methods, with and without machine learning, are explored and used to generate the data needed to learn sequence labeling models. The first method explored is pattern filling. This data generation method produces annotated data by combining sentences with slots, or patterns, with mentions. We have shown that this method improves the performance of labeling models with tiny amounts of data. The amount of data needed to use this method is also studied. The second approach tested is the use of language models for text generation alongside a semi-supervised learning method for tagging. The semi-supervised learning method used is tri-training, which serves to add labels to the generated data. Tri-training is tested on several generation methods using different pre-trained language models. We proposed a version of tri-training called generative tri-training, where the generation is not done in advance but during the tri-training process and takes advantage of it. We tested the performance of the models trained during the semi-supervision process and of the models trained on the data it produces. In most cases, the data produced match the performance of the models trained with semi-supervision. This method improves performance at all the tested data levels with respect to the models without augmentation. The third avenue of study combines aspects of the previous approaches. For this purpose, different approaches are tested. The use of language models to do sentence replacement in the manner of the pattern-filling generation method is unsuccessful. Using a set of data coming from the different generation methods is tested, and does not outperform the best single method. Finally, applying the pattern-filling method to the data generated with the tri-training is tested and does not improve the results obtained with the tri-training. While much remains to be studied, we have highlighted simple methods, such as pattern filling, and more complex ones, such as the use of supervised learning with sentences generated by a language model, to improve the performance of labeling models through the generation of annotated data.
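A minimal sketch of the pattern-filling idea described in the abstract: sentence patterns with slots are combined with mention lists to produce annotated sequence-labelling examples with BIO tags. The domain, patterns and mentions below are invented for illustration.

```python
import itertools
import random

# Hypothetical patterns with slots and mention lists for a travel-booking domain.
patterns = [
    "book a flight from {origin} to {destination}",
    "show me trains to {destination} on {date}",
]
mentions = {
    "origin": ["Paris", "Lyon"],
    "destination": ["Berlin", "New York"],
    "date": ["Monday", "June 3rd"],
}

def fill(pattern: str):
    """Yield (tokens, BIO tags) pairs for every combination of slot fillers."""
    slots = [s for s in mentions if "{" + s + "}" in pattern]
    for values in itertools.product(*(mentions[s] for s in slots)):
        tokens, tags = [], []
        for piece in pattern.split():
            if piece.startswith("{") and piece.strip("{}") in slots:
                slot = piece.strip("{}")
                filler = values[slots.index(slot)].split()
                tokens += filler
                tags += ["B-" + slot] + ["I-" + slot] * (len(filler) - 1)
            else:
                tokens.append(piece)
                tags.append("O")
        yield tokens, tags

augmented = [example for p in patterns for example in fill(p)]
random.shuffle(augmented)
print(augmented[0])
```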
APA, Harvard, Vancouver, ISO, and other styles
13

Lager, Adam. "Improving Solr search with Natural Language Processing : An NLP implementation for information retrieval in Solr." Thesis, Linköpings universitet, Programvara och system, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-177790.

Full text
Abstract:
The field of AI is emerging fast, and institutions and companies are pushing the limits of what is possible. Natural Language Processing is a branch of AI where the goal is to understand human speech and/or text. Here, this technology is used to improve the full-text search engine Solr, which is built on an inverted index. Solr is open source and has integrated OpenNLP, making it a suitable choice for these kinds of operations. NLP-enabled Solr showed great results compared to the Solr configuration currently running on the systems: while NLP-Solr was slightly worse in terms of precision, it excelled at recall and at returning the correct documents.
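For context, a client-side sketch of querying a Solr core over its standard HTTP select API is shown below; the host, core name and field names are assumptions, and the OpenNLP-based analysis discussed in the thesis would live in Solr's schema and analysis chain rather than in this client code.

```python
import requests

SOLR_URL = "http://localhost:8983/solr/documents/select"   # assumed host and core name

def search(query: str, rows: int = 5):
    """Run a full-text query against Solr and return (score, title) pairs."""
    params = {
        "q": f"body_text:({query})",   # 'body_text' is a hypothetical field; with OpenNLP
        "fl": "id,title,score",        # analysis it would be processed at index time
        "rows": rows,
    }
    resp = requests.get(SOLR_URL, params=params, timeout=10)
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    return [(d.get("score"), d.get("title")) for d in docs]

if __name__ == "__main__":
    for score, title in search("natural language processing"):
        print(f"{score:.3f}  {title}")
```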
APA, Harvard, Vancouver, ISO, and other styles
14

Panesar, Kulvinder. "Conversational artificial intelligence - demystifying statistical vs linguistic NLP solutions." Universitat Politécnica de Valéncia, 2020. http://hdl.handle.net/10454/18121.

Full text
Abstract:
This paper aims to demystify the hype and attention surrounding chatbots and their association with conversational artificial intelligence. Both are slowly emerging as a real presence in our lives, driven by the impressive technological developments in machine learning, deep learning and natural language understanding solutions. However, what is under the hood, and how far and to what extent chatbot/conversational artificial intelligence solutions can work, is our question. Natural language is the most easily understood knowledge representation for people, but certainly not the best for computers because of its inherently ambiguous, complex and dynamic nature. We will critique the knowledge representation of heavily statistical chatbot solutions against linguistic alternatives. In order to react intelligently to the user, natural language solutions must critically consider other factors such as context, memory, intelligent understanding, previous experience, and personalized knowledge of the user. We will delve into the spectrum of conversational interfaces and focus on a strong artificial intelligence concept. This is explored via a text-based conversational software agent with a deep strategic role: to hold a conversation and enable the mechanisms needed to plan, to decide what to do next, and to manage the dialogue to achieve a goal. To demonstrate this, a deeply linguistically aware and knowledge-aware text-based conversational agent (LING-CSA) presents a proof of concept of a non-statistical conversational AI solution.
APA, Harvard, Vancouver, ISO, and other styles
15

Coppola, Gregory Francis. "Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing." Thesis, University of Edinburgh, 2015. http://hdl.handle.net/1842/10451.

Full text
Abstract:
The development of distributed training strategies for statistical prediction functions is important for applications of machine learning, generally, and the development of distributed structured prediction training strategies is important for natural language processing (NLP), in particular. With ever-growing data sets this is, first, because, it is easier to increase computational capacity by adding more processor nodes than it is to increase the power of individual processor nodes, and, second, because data sets are often collected and stored in different locations. Iterative parameter mixing (IPM) is a distributed training strategy in which each node in a network of processors optimizes a regularized average loss objective on its own subset of the total available training data, making stochastic (per-example) updates to its own estimate of the optimal weight vector, and communicating with the other nodes by periodically averaging estimates of the optimal vector across the network. This algorithm has been contrasted with a close relative, called here the single-mixture optimization algorithm, in which each node stochastically optimizes an average loss objective on its own subset of the training data, operating in isolation until convergence, at which point the average of the independently created estimates is returned. Recent empirical results have suggested that this IPM strategy produces better models than the single-mixture algorithm, and the results of this thesis add to this picture. The contributions of this thesis are as follows. The first contribution is to produce and analyze an algorithm for decentralized stochastic optimization of regularized average loss objective functions. This algorithm, which we call the distributed regularized dual averaging algorithm, improves over prior work on distributed dual averaging by providing a simpler algorithm (used in the rest of the thesis), better convergence bounds for the case of regularized average loss functions, and certain technical results that are used in the sequel. The central contribution of this thesis is to give an optimization-theoretic justification for the IPM algorithm. While past work has focused primarily on its empirical test-time performance, we give a novel perspective on this algorithm by showing that, in the context of the distributed dual averaging algorithm, IPM constitutes a convergent optimization algorithm for arbitrary convex functions, while the single-mixture distribution algorithm is not. Experiments indeed confirm that the superior test-time performance of models trained using IPM, compared to single-mixture, correlates with better optimization of the objective value on the training set, a fact not previously reported. Furthermore, our analysis of general non-smooth functions justifies the use of distributed large-margin (support vector machine [SVM]) training of structured predictors, which we show yields better test performance than the IPM perceptron algorithm, the only version of the IPM to have previously been given a theoretical justification. Our results confirm that IPM training can reach the same level of test performance as a sequentially trained model and can reach better accuracies when one has a fixed budget of training time. Finally, we use the reduction in training time that distributed training allows to experiment with adding higher-order dependency features to a state-of-the-art phrase-structure parsing model. 
We demonstrate that adding these features improves out-of-domain parsing results of even the strongest phrase-structure parsing models, yielding a new state-of-the-art for the popular train-test pairs considered. In addition, we show that a feature-bagging strategy, in which component models are trained separately and later combined, is sometimes necessary to avoid feature under-training and get the best performance out of large feature sets.
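A compact numpy sketch of the iterative parameter mixing scheme described above: each simulated node performs stochastic updates on its own data shard, and the weight vectors are periodically averaged across the network. Logistic loss on synthetic data stands in here for the structured large-margin objectives studied in the thesis, so treat this as an illustration of the protocol, not the thesis' training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data, split across simulated processor nodes (shards).
X = rng.normal(size=(600, 20))
true_w = rng.normal(size=20)
y = (X @ true_w > 0).astype(float)
shards = np.array_split(np.arange(600), 4)

def local_epoch(w, idx, lr=0.1, lam=1e-3):
    """One pass of stochastic updates on a node's own shard (logistic loss + L2)."""
    w = w.copy()
    for i in rng.permutation(idx):
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= lr * ((p - y[i]) * X[i] + lam * w)
    return w

w = np.zeros(20)
for epoch in range(10):                                # mixing rounds
    locals_ = [local_epoch(w, idx) for idx in shards]  # run in parallel in a real system
    w = np.mean(locals_, axis=0)                       # IPM step: average across the network
    acc = ((X @ w > 0) == y.astype(bool)).mean()
    print(f"round {epoch + 1}: accuracy {acc:.3f}")
```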
APA, Harvard, Vancouver, ISO, and other styles
16

Riedel, Sebastian. "Efficient prediction of relational structure and its application to natural language processing." Thesis, University of Edinburgh, 2009. http://hdl.handle.net/1842/4167.

Full text
Abstract:
Many tasks in Natural Language Processing (NLP) require us to predict a relational structure over entities. For example, in Semantic Role Labelling we try to predict the ’semantic role’ relation between a predicate verb and its argument constituents. Often NLP tasks not only involve related entities but also relations that are stochastically correlated. For instance, in Semantic Role Labelling the roles of different constituents are correlated: we cannot assign the agent role to one constituent if we have already assigned this role to another. Statistical Relational Learning (also known as First Order Probabilistic Logic) allows us to capture the aforementioned nature of NLP tasks because it is based on the notions of entities, relations and stochastic correlations between relationships. It is therefore often straightforward to formulate an NLP task using a First Order probabilistic language such as Markov Logic. However, the generality of this approach comes at a price: the process of finding the relational structure with highest probability, also known as maximum a posteriori (MAP) inference, is often inefficient, if not intractable. In this work we seek to improve the efficiency of MAP inference for Statistical Relational Learning. We propose a meta-algorithm, namely Cutting Plane Inference (CPI), that iteratively solves small subproblems of the original problem using any existing MAP technique and inspects parts of the problem that are not yet included in the current subproblem but could potentially lead to an improved solution. Our hypothesis is that this algorithm can dramatically improve the efficiency of existing methods while remaining at least as accurate. We frame the algorithm in Markov Logic, a language that combines First Order Logic and Markov Networks. Our hypothesis is evaluated using two tasks: Semantic Role Labelling and Entity Resolution. It is shown that the proposed algorithm improves the efficiency of two existing methods by two orders of magnitude and leads an approximate method to more probable solutions. We also show that CPI, at convergence, is guaranteed to be at least as accurate as the method used within its inner loop. Another core contribution of this work is a theoretical and empirical analysis of the boundary conditions of Cutting Plane Inference. We describe cases when Cutting Plane Inference will definitely be difficult (because it instantiates large networks or needs many iterations) and when it will be easy (because it instantiates small networks and needs only a few iterations).
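A toy sketch of the cutting-plane meta-loop described in the abstract: solve a MAP subproblem with whatever inference routine is available, check which ground constraints the solution violates, add them to the subproblem, and repeat. The scoring function, the greedy solver and the single exclusivity constraint below are invented stand-ins for Markov Logic inference, not the thesis' implementation.

```python
def cutting_plane_map(score, variables, all_constraints, solve):
    """Generic cutting-plane loop: solve over the active constraints, then add any
    ground constraints the solution violates, until no new violations remain."""
    active = set()
    while True:
        assignment = solve(score, variables, active)
        violated = {c for c in all_constraints if not c(assignment)}
        if violated <= active:
            return assignment          # no new violated constraints: done
        active |= violated

# Toy problem: label 3 mentions, with an SRL-style constraint that at most one is "agent".
variables = ["m1", "m2", "m3"]
scores = {"m1": {"agent": 2.0, "other": 1.0},
          "m2": {"agent": 1.8, "other": 1.5},
          "m3": {"agent": 0.5, "other": 1.0}}

def solve(score, variables, active):
    # Naive MAP stand-in: greedy local choice, then repair against the active constraints.
    assignment = {v: max(score[v], key=score[v].get) for v in variables}
    for c in active:
        assignment = c.repair(assignment)
    return assignment

class AtMostOneAgent:
    def __call__(self, assignment):
        return sum(1 for label in assignment.values() if label == "agent") <= 1
    def repair(self, assignment):
        agents = sorted((v for v in assignment if assignment[v] == "agent"),
                        key=lambda v: scores[v]["agent"], reverse=True)
        for v in agents[1:]:
            assignment[v] = "other"   # keep only the highest-scoring agent
        return assignment

print(cutting_plane_map(scores, variables, {AtMostOneAgent()}, solve))
```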
APA, Harvard, Vancouver, ISO, and other styles
17

Fernquist, Johan. "Detection of deceptive reviews : using classification and natural language processing features." Thesis, Uppsala universitet, Institutionen för teknikvetenskaper, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-306956.

Full text
Abstract:
With the great growth of open forums online where anyone can give their opinion on everything, the Internet has become a place where people are trying to mislead others. By assuming that there is a correlation between a deceptive text's purpose and the way the text is written, our goal with this thesis was to develop a model for detecting these fake texts by taking advantage of this correlation. Our approach was to use classification together with three different feature types: term frequency-inverse document frequency, word2vec and probabilistic context-free grammar. We have managed to develop a model which has improved all results known to us for two different datasets. With machine translation, we have found that it is possible to hide the stylometric footprints and the characteristics of deceptive texts, making it possible to slightly decrease the accuracy of a classifier and still convey a message. Finally, we investigated whether it was possible to train and test our model on data from different sources, and we achieved an accuracy hardly better than chance. That indicated the resulting model is not versatile enough to be used on kinds of deceptive texts other than those it has been trained on.
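As an illustration of the first of the three feature types (tf-idf), a minimal scikit-learn pipeline of the same general shape might look as follows; the toy reviews and labels are invented, and the word2vec and PCFG features used in the thesis are omitted.

# Minimal tf-idf + classifier baseline for deceptive-review detection.
# The tiny corpus and labels are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "Absolutely perfect hotel, best stay of my life, everything was amazing!",
    "The room was clean and the staff helpful, though breakfast was average.",
    "Unbelievable experience, I will tell everyone I know to come here!!!",
    "Check-in took ten minutes and the wifi dropped twice during the week.",
]
labels = [1, 0, 1, 0]  # 1 = deceptive, 0 = truthful (toy labels)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
model.fit(reviews, labels)
print(model.predict(["This was the most wonderful, flawless hotel ever!"]))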
APA, Harvard, Vancouver, ISO, and other styles
18

Alkathiri, Abdul Aziz. "Decentralized Large-Scale Natural Language Processing Using Gossip Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-281277.

Full text
Abstract:
The field of Natural Language Processing in machine learning has seen rising popularity and use in recent years. The nature of Natural Language Processing, which deals with natural human language and computers, has led to the research and development of many algorithms that produce word embeddings. One of the most widely used of these algorithms is Word2Vec. With the abundance of data generated by users and organizations and the complexity of machine learning and deep learning models, performing training using a single machine becomes unfeasible. The advancement in distributed machine learning offers a solution to this problem. Unfortunately, due to reasons concerning data privacy and regulations, in some real-life scenarios the data must not leave its local machine. This limitation has led to the development of techniques and protocols that are massively parallel and data-private. The most popular of these protocols is federated learning. However, due to its centralized nature, it still poses some security and robustness risks. Consequently, this led to the development of massively parallel, data-private, decentralized approaches, such as gossip learning. In the gossip learning protocol, every once in a while each node in the network randomly chooses a peer for information exchange, which eliminates the need for a central node. This research intends to test the viability of gossip learning for large-scale, real-world applications. In particular, it focuses on the implementation and evaluation of a Natural Language Processing application using gossip learning. The results show that the application of Word2Vec in a gossip learning framework is viable and yields comparable results to its non-distributed, centralized counterpart for various scenarios, with an average loss in quality of 6.904%.
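A toy sketch of one gossip round conveys the shape of the protocol described above: each node pushes its current model to a randomly chosen peer, and the receiver merges it by parameter averaging. This is a simplification for illustration, not the Word2Vec-specific merging used in the thesis.

# Toy sketch of a gossip learning round: every node sends its current model
# to a randomly chosen peer, and the receiver merges it with its own by
# parameter averaging. Real gossip learning with Word2Vec involves far more
# machinery (local training, ageing, compression); this only shows the
# protocol shape.
import random
import numpy as np

class Node:
    def __init__(self, node_id, dim=8):
        self.node_id = node_id
        self.model = np.random.rand(dim)      # stand-in for model parameters

    def merge(self, other_model):
        self.model = (self.model + other_model) / 2.0

def gossip_round(nodes):
    for node in nodes:
        peer = random.choice([n for n in nodes if n is not node])
        peer.merge(node.model)                # push model to a random peer

nodes = [Node(i) for i in range(10)]
for _ in range(50):
    gossip_round(nodes)
# After enough rounds the local models drift towards a common average.
print(np.std([n.model for n in nodes], axis=0).max())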
APA, Harvard, Vancouver, ISO, and other styles
19

Ruberg, Nicolaas. "Bert goes sustainable: an NLP approach to ESG financing." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amslaurea.unibo.it/24787/.

Full text
Abstract:
Environmental, Social, and Governance (ESG) factors are a strategic topic for investors and financing institutions like the Brazilian Development Bank (BNDES). Currently, the bank's experts are developing a framework based on those factors to assess companies' sustainable financing. We identify an opportunity to use Natural Language Processing (NLP) in this development. This opportunity arises from the observation that a critical document for the ESG analysis is the company annual activity report. This document undergoes manual screening; later it is decomposed and its parts are redirected to specialists for analysis. Therefore, the screening process would largely benefit from NLP to automate the classification of text excerpts from the annual report. The proposed solution is based on different Bidirectional Encoder Representations from Transformers (BERT) architectures, which rely on the attention mechanism to achieve optimal results on sentence-level analysis tasks. We devised a text classification task to enable the analysis of excerpts from the annual activity report of companies considering three categories, according to the ESG reference standard, the Global Reporting Initiative (GRI). To establish a benchmark, we implemented a baseline solution using a classic NLP approach, Naïve Bayes, which achieved 51% accuracy and a 50.33% F1-score. RoBERTa and BERT-large achieved 88% accuracy and almost 85% F1-score, the best results obtained from our experiments with different BERT architectures. ALBERT also proved to be a possible alternative for limited-memory devices, with 85% accuracy and a 78.5924% F1-score. Finally, we experimented with a multilingual setup that would be interesting for a scenario where the BNDES wants a more generic model that can analyze English or Portuguese annual reports. The multilingual BERT model reached almost 86% accuracy and an 81.18% F1-score.
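A minimal sketch of scoring a report excerpt with a BERT-style classifier through the Hugging Face transformers API is given below; the checkpoint name and the three GRI-inspired labels are placeholders, and the fine-tuning on labelled data performed in the thesis is not shown (an untrained classification head predicts essentially at random).

# Minimal sketch of classifying a report excerpt with a BERT-style model via
# Hugging Face transformers. The checkpoint name and the three labels are
# placeholders; the thesis fine-tuned RoBERTa/BERT-large on labelled GRI data,
# which is not reproduced here.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["environmental", "social", "governance"]      # illustrative classes
checkpoint = "bert-base-multilingual-cased"             # placeholder model

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(labels)
)

excerpt = "In 2020 the company reduced its direct CO2 emissions by 12%."
inputs = tokenizer(excerpt, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[int(logits.argmax(dim=-1))])   # untrained head: output is arbitrary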
APA, Harvard, Vancouver, ISO, and other styles
20

Giménez, Fayos María Teresa. "Natural Language Processing using Deep Learning in Social Media." Doctoral thesis, Universitat Politècnica de València, 2021. http://hdl.handle.net/10251/172164.

Full text
Abstract:
[EN] In the last years, Deep Learning (DL) has revolutionised the potential of automatic systems that handle Natural Language Processing (NLP) tasks. We have witnessed a tremendous advance in the performance of these systems; nowadays, we find NLP systems embedded ubiquitously, determining the intent of the text we write, the sentiment of our tweets or our political views, to cite some examples. In this thesis, we proposed several NLP models for addressing tasks that deal with social media text. Concretely, this work is focused mainly on Sentiment Analysis and Personality Recognition tasks. Sentiment Analysis, one of the leading problems in NLP, consists of determining the polarity of a text; it is a well-known task for which the number of resources and models proposed is vast. In contrast, Personality Recognition is a breakthrough task that aims to determine the users' personality from their writing style; it is more of a niche task with fewer ad-hoc resources, but with great potential. Although the principal focus of this work was the development of Deep Learning models, we have also proposed models based on linguistic resources and classical Machine Learning models. Moreover, in this more straightforward setup, we have explored the nuances of different language devices, such as the impact of emotions on the correct classification of the sentiment expressed in a text. Afterwards, DL models were developed, particularly Convolutional Neural Networks (CNNs), to address the previously described tasks. In the case of Personality Recognition, we explored both the classical Machine Learning and the Deep Learning approaches, which allowed us to compare the models under the same circumstances. Notably, NLP has evolved dramatically in the last years through the development of public evaluation campaigns, where multiple research teams compare the performance of their approaches under the same conditions. Most of the models presented here were either assessed in an evaluation campaign or used the setup of a previously held one. Recognising the importance of this effort, we curated and developed an evaluation campaign for classifying political tweets. In addition, as we advanced in the development of this work, we decided to study in depth how CNNs are applied to NLP tasks. Two lines of work were explored in this regard. Firstly, we proposed a semantic-based padding method for CNNs, which addresses how to represent text more appropriately for solving NLP tasks. Secondly, a theoretical framework was introduced for tackling one of the most frequent criticisms of Deep Learning: the lack of interpretability. This framework seeks to visualise what lexical patterns, if any, the CNN is learning in order to classify a sentence. In summary, the main achievements presented in this thesis are:
- The organisation of an evaluation campaign for topic classification of texts gathered from social media.
- The proposal of several Machine Learning models tackling the Sentiment Analysis task on social media, together with a study of the impact of linguistic devices such as figurative language on the task.
- The development of a model for inferring the personality of a developer given the source code that they have written.
- The study of Personality Recognition from social media following two different approaches, models based on machine learning algorithms with handcrafted features and models based on CNNs, with both approaches proposed and compared.
- The introduction of new semantic-based paddings for optimising how text is represented in CNNs.
- The definition of a theoretical framework to provide interpretable information about what CNNs are learning internally.
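For readers unfamiliar with convolutional text classifiers of the general kind discussed above, a compact PyTorch sketch follows; it uses ordinary zero padding rather than the semantic-based padding proposed in the thesis, and all dimensions and data are illustrative.

# Compact CNN text classifier of the general kind discussed above.
# Uses plain zero padding, *not* the semantic-based padding proposed in the
# thesis; vocabulary size, dimensions and data are illustrative.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, num_classes=2,
                 kernel_sizes=(3, 4, 5), num_filters=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, emb, seq)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))         # (batch, num_classes)

model = TextCNN()
dummy_batch = torch.randint(1, 5000, (8, 40))            # 8 fake tweets, 40 tokens
print(model(dummy_batch).shape)                          # torch.Size([8, 2])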
Giménez Fayos, MT. (2021). Natural Language Processing using Deep Learning in Social Media [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/172164
TESIS
APA, Harvard, Vancouver, ISO, and other styles
21

Hänig, Christian. "Unsupervised Natural Language Processing for Knowledge Extraction from Domain-specific Textual Resources." Doctoral thesis, Universitätsbibliothek Leipzig, 2013. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-112706.

Full text
Abstract:
This thesis aims to develop a Relation Extraction algorithm to extract knowledge out of automotive data. While most approaches to Relation Extraction are only evaluated on newspaper data dealing with general relations from the business world, their applicability to other data sets is not well studied. Part I of this thesis deals with the theoretical foundations of Information Extraction algorithms. Text mining cannot be seen as the simple application of data mining methods to textual data. Instead, sophisticated methods have to be employed to accurately extract knowledge from text, which can then be mined using statistical methods from the field of data mining. Information Extraction itself can be divided into two subtasks: Entity Detection and Relation Extraction. The detection of entities is very domain-dependent due to terminology, abbreviations and general language use within the given domain. Thus, this task has to be solved for each domain employing thesauri or another type of lexicon. Supervised approaches to Named Entity Recognition will not achieve reasonable results unless they have been trained for the given type of data. The task of Relation Extraction can basically be approached by pattern-based and kernel-based algorithms. The latter achieve state-of-the-art results on newspaper data and point out the importance of linguistic features. In order to analyze relations contained in textual data, syntactic features like part-of-speech tags and syntactic parses are essential. Chapter 4 presents machine learning approaches and linguistic foundations that are essential for syntactic annotation of textual data and Relation Extraction. Chapter 6 analyzes the performance of state-of-the-art algorithms for POS tagging, syntactic parsing and Relation Extraction on automotive data. The findings are: supervised methods trained on newspaper corpora do not achieve accurate results when applied to automotive data. There are various reasons for this. Besides low-quality text, the nature of automotive relations poses the main challenge. Automotive relation types of interest (e.g. component – symptom) are rather arbitrary compared to well-studied relation types like is-a or is-head-of. In order to achieve acceptable results, algorithms have to be trained directly on this kind of data. As the manual annotation of data for each language and data type is too costly and inflexible, unsupervised methods are the ones to rely on. Part II deals with the development of dedicated algorithms for all three essential tasks. Unsupervised POS tagging (Chapter 7) is a well-studied task and algorithms achieving accurate tagging exist. None of them disambiguates high-frequency words; only out-of-lexicon words are disambiguated. Most high-frequency words bear syntactic information and thus it is very important to differentiate between their different functions. Domain languages in particular contain ambiguous, high-frequency words bearing semantic information (e.g. pump). In order to improve POS tagging, an algorithm for disambiguation is developed and used to enhance an existing state-of-the-art tagger. This approach is based on context clustering, which is used to detect a word type's different syntactic functions. Evaluation shows that tagging accuracy is raised significantly. An approach to unsupervised syntactic parsing (Chapter 8) is developed in order to satisfy the requirements of Relation Extraction. These requirements include high-precision results on nominal and prepositional phrases, as they contain the entities relevant for Relation Extraction. Furthermore, accurate shallow parsing is more desirable than deep binary parsing, as it facilitates Relation Extraction more than deep parsing. Endocentric and exocentric constructions can be distinguished and improve proper phrase labeling. unsuParse is based on preferred positions of word types within phrases to detect phrase candidates. Iterating the detection of simple phrases successively induces deeper structures. The proposed algorithm fulfills all demanded criteria and achieves competitive results on standard evaluation setups. Syntactic Relation Extraction (Chapter 9) is an approach exploiting syntactic statistics and text characteristics to extract relations between previously annotated entities. The approach is based on entity distributions given in a corpus and thus provides a possibility to extend text mining processes to new data in an unsupervised manner. Evaluation on two different languages and two different text types of the automotive domain shows that it achieves accurate results on repair order data. Results are less accurate on internet data, but the task of sentiment analysis and extraction of the opinion target can be mastered. Thus, the incorporation of internet data is possible and important, as it provides useful insight into the customer's thoughts. To conclude, this thesis presents a complete unsupervised workflow for Relation Extraction – except for the highly domain-dependent Entity Detection task – improving the performance of each of the involved subtasks compared to state-of-the-art approaches. Furthermore, this work applies Natural Language Processing methods and Relation Extraction approaches to real-world data, unveiling challenges that do not occur in high-quality newspaper corpora.
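The context-clustering idea behind the POS disambiguation step can be illustrated with a toy example: represent each occurrence of an ambiguous high-frequency word by its surrounding words and cluster the occurrences. The corpus and parameters below are invented; the thesis integrates this idea into a full unsupervised tagger rather than this standalone form.

# Toy illustration of context clustering for an ambiguous high-frequency word:
# each occurrence is represented by its neighbouring words and the occurrences
# are clustered into (hopefully) distinct syntactic/semantic functions.
# Corpus and parameters are invented for illustration.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the pump failed after the coolant leak",
    "the mechanic replaced the pump on monday",
    "please pump the brake pedal three times",
    "pump the tyre until it reaches two bar",
]
target = "pump"

# Context = the sentence with the target word removed (a crude window).
contexts = [" ".join(w for w in s.split() if w != target) for s in sentences]
X = CountVectorizer().fit_transform(contexts)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for sentence, cluster in zip(sentences, clusters):
    print(cluster, sentence)   # ideally noun vs. verb uses end up apart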
APA, Harvard, Vancouver, ISO, and other styles
22

Bhaduri, Sreyoshi. "NLP in Engineering Education - Demonstrating the use of Natural Language Processing Techniques for Use in Engineering Education Classrooms and Research." Diss., Virginia Tech, 2018. http://hdl.handle.net/10919/82202.

Full text
Abstract:
Engineering Education is a developing field, with new research and ideas constantly emerging and contributing to the ever-evolving nature of this discipline. Textual data (such as publications, open-ended questions on student assignments, and interview transcripts) form an important means of dialogue between the various stakeholders of the engineering community. Analysis of textual data demands a lot of time and resources. As a result, researchers end up spending a lot of time and effort in analyzing such text repositories. While there is a lot to be gained through in-depth research analysis of text data, some educators or administrators could benefit from an automated system which could reveal trends and present broader overviews for given datasets in more time- and resource-efficient ways. Analyzing datasets using Natural Language Processing is one solution to this problem. The purpose of my doctoral research was two-pronged: first, to describe the current state of use of Natural Language Processing as it applies to the broader field of Education, and second, to demonstrate the use of Natural Language Processing techniques for two Engineering Education specific contexts of instruction and research respectively. Specifically, my research includes three manuscripts: (1) a systematic review of existing publications on the use of Natural Language Processing in education research, (2) an automated classification system for open-ended student responses to gauge metacognition levels in engineering classrooms, and (3) using insights from Natural Language Processing techniques to facilitate exploratory analysis of a large interview dataset led by a novice researcher. A common theme across the three tasks was to explore the use of Natural Language Processing techniques to enable the computer to extract meaningful information from textual data for Engineering Education related contexts. Results from my first manuscript suggested that researchers in the broader fields of Education used Natural Language Processing for a wide range of tasks, primarily serving to automate instruction in terms of creating content for examinations, automated grading or intelligent tutoring purposes. In manuscripts two and three I implemented some of the Natural Language Processing techniques, such as part-of-speech tagging and tf-idf (term frequency-inverse document frequency), that were found (through my systematic review) to be used by researchers, to (a) develop an automated classification system for student responses to gauge their metacognitive levels and (b) conduct an exploratory, novice-led analysis of excerpts from interviews of students on career preparedness, respectively. Overall, the results of my research studies indicate that although the use of Natural Language Processing techniques in Engineering Education is not yet widespread, such research endeavors could facilitate research and practice in our field. In particular, this type of approach to textual data could be of use to practitioners in large engineering classrooms who are unable to devote large amounts of time to data analysis but would benefit from algorithmic systems that could quickly present a summary based on information processed from available text data.
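As a small illustration of the kind of techniques mentioned (part-of-speech tagging feeding simple features), the sketch below turns an invented open-ended student response into a handful of surface features; the feature set is illustrative and is not the coding scheme used in the dissertation.

# Sketch of turning an open-ended student response into part-of-speech based
# features. The response text is invented and the feature set is purely
# illustrative, not the dissertation's classification features.
from collections import Counter
import nltk

# The tagger resource name differs across NLTK versions; downloading both is harmless.
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)

response = ("I checked my free-body diagram twice because I was not sure "
            "whether I had included the friction force correctly.")

tokens = response.replace(".", " .").split()   # crude whitespace tokenization
tags = [tag for _, tag in nltk.pos_tag(tokens)]
tag_counts = Counter(tags)

features = {
    "n_tokens": len(tokens),
    "pronoun_ratio": (tag_counts["PRP"] + tag_counts["PRP$"]) / len(tokens),
    "past_tense_verbs": tag_counts["VBD"],
    "modal_verbs": tag_counts["MD"],
}
print(features)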
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
23

Eriksson, Caroline, and Emilia Kallis. "NLP-Assisted Workflow Improving Bug Ticket Handling." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-301248.

Full text
Abstract:
Software companies spend a lot of resources on debugging, a process where previous solutions can help in solving current problems. The bug tickets containing this information are often time-consuming to read. To minimize the time spent on debugging and to make sure that the knowledge from prior solutions is kept in the company, an evaluation was made to see if summaries could make this process more efficient. Abstractive and extractive summarization models were tested for this task, and fine-tuning of the bert-extractive-summarizer was performed. The model-generated summaries were compared in terms of perceived quality, speed, similarity to each other, and summary length. The average description summary contained part of the description needed, and the solution found was either well documented or did not answer the problem at all. The fine-tuned extractive model and the abstractive model BART provided good conditions for generating summaries containing all the information needed.
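A minimal way to generate an abstractive summary of a (made-up) bug-ticket description with a public BART checkpoint is shown below via the Hugging Face pipeline API; the fine-tuned bert-extractive-summarizer evaluated in the thesis is not reproduced here.

# Minimal abstractive summarization of an invented bug-ticket description
# using a public BART checkpoint via the Hugging Face pipeline API.
from transformers import pipeline

ticket = (
    "After upgrading to version 4.2 the export service intermittently times "
    "out when more than 10,000 rows are requested. Restarting the worker "
    "clears the issue for a few hours. Logs show the connection pool being "
    "exhausted because finished jobs never release their database handles."
)

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(ticket, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])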
APA, Harvard, Vancouver, ISO, and other styles
24

Lindén, Johannes. "Huvudtitel: Understand and Utilise Unformatted Text Documents by Natural Language Processing algorithms." Thesis, Mittuniversitetet, Avdelningen för informationssystem och -teknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-31043.

Full text
Abstract:
News companies have a need to automate and make the editors' process of writing about hot and new events more effective. Current technologies involve robotic programs that fill in values in templates, and website listeners that notify the editors when changes are made so that the editor can read up on the source change at the actual website. Editors can provide news faster and better if directly provided with abstracts of the external sources. This study applies deep learning algorithms to automatically formulate abstracts and tag sources with appropriate tags based on the context. The study is a full stack solution, which manages both the editors' need for speed and the training, testing and validation of the algorithms. Decision Tree, Random Forest, Multi-Layer Perceptron and phrase document vectors are used to evaluate the categorisation, and Recurrent Neural Networks are used to paraphrase unformatted texts. In the evaluation, a comparison between different models trained by the algorithms with a variation of parameters is done based on the F-score. The results show that the F-scores increase the more documents the training has and decrease the more categories the algorithm needs to consider. The Multi-Layer Perceptron performs best, followed by Random Forest and finally Decision Tree. The document length matters: when larger documents are considered during training, the score increases considerably. A user survey about the paraphrase algorithms shows that the paraphrase results are insufficient to satisfy editors' needs. It confirms a need for more memory to conduct longer experiments.
APA, Harvard, Vancouver, ISO, and other styles
25

Demmelmaier, Gustav, and Carl Westerberg. "Data Segmentation Using NLP: Gender and Age." Thesis, Uppsala universitet, Avdelningen för datalogi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-434622.

Full text
Abstract:
Natural language processing (NLP) opens the possibility for a computer to read, decipher, and interpret human languages and eventually use them in ways that enable yet further understanding of the interaction and communication between the human and the computer. When appropriate data is available, NLP makes it possible to determine not only the sentiment of a text but also information about the author behind an online post. Previously conducted studies show aspects of NLP potentially going deeper into subjective information, enabling author classification from text data. This thesis addresses the lack of demographic insights into online user data by studying language use in texts. It compares four popular yet diverse machine learning algorithms for gender and age segmentation. During the project, the age analysis was abandoned due to insufficient data. The online texts were analysed and quantified into 118 parameters based on linguistic differences. Using supervised learning, the researchers succeeded in correctly predicting the gender in 82% of the cases when analysing data from English online users. The training and test data may have some correlations, which is important to notice. Language is complex and, in this case, the more complex methods SVM and neural networks performed better than the less complex Naive Bayes and logistic regression.
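A few generic stylometric surface features of the kind that could be quantified from an online post are sketched below; the thesis uses 118 linguistic parameters, and the ones shown are illustrative stand-ins rather than the actual feature set.

# A handful of generic stylometric features of the kind quantified in such
# studies (the thesis uses 118 parameters; these are illustrative stand-ins).
import re

def stylometric_features(text: str) -> dict:
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    pronouns = {"i", "me", "my", "we", "you", "she", "he", "they"}
    return {
        "avg_word_length": sum(map(len, words)) / max(len(words), 1),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "pronoun_ratio": sum(w in pronouns for w in words) / max(len(words), 1),
        "exclamation_ratio": text.count("!") / max(len(sentences), 1),
    }

post = "I really loved this! My friends and I are definitely going back."
print(stylometric_features(post))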
APA, Harvard, Vancouver, ISO, and other styles
26

Aljadri, Sinan. "Chatbot : A qualitative study of users' experience of Chatbots." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-105434.

Full text
Abstract:
The aim of the present study has been to examine users' experience of chatbots from a business perspective and a consumer perspective. The study has also focused on highlighting what limitations a chatbot can have and possible improvements for future development. The study is based on a qualitative research method with semi-structured interviews that have been analyzed on the basis of a thematic analysis. The interview material has been analyzed based on previous research and various theoretical perspectives such as Artificial Intelligence (AI) and Natural Language Processing (NLP). The results of the study have shown that the experience of chatbots can differ between businesses that offer chatbots, which are more positive, and consumers who use them as customer service. Limitations and suggestions for improvements around chatbots are also a consistent result of the study.
APA, Harvard, Vancouver, ISO, and other styles
27

Kärde, Wilhelm. "Tool for linguistic quality evaluation of student texts." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186434.

Full text
Abstract:
Spell checkers are nowadays a common occurrence in most editors. A student writing an essay in school will often have a spell checker available. However, the feedback from a spell checker seldom correlates with the feedback from a teacher. One reason for this is that the teacher has more aspects on which to evaluate a text. The teacher will, as opposed to the spell checker, evaluate a text based on aspects such as genre adaptation, structure and word variation. This thesis evaluates how well those aspects translate to NLP (Natural Language Processing) and implements those that translate well into a rule-based solution called Granska.
APA, Harvard, Vancouver, ISO, and other styles
28

Hellmann, Sebastian [Verfasser], Klaus-Peter [Akademischer Betreuer] Fähnrich, Klaus-Peter [Gutachter] Fähnrich, Sören [Akademischer Betreuer] Auer, Jens [Akademischer Betreuer] Lehmann, and Hans [Gutachter] Uszkoreit. "Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data / Sebastian Hellmann ; Gutachter: Klaus-Peter Fähnrich, Hans Uszkoreit ; Klaus-Peter Fähnrich, Sören Auer, Jens Lehmann." Leipzig : Universitätsbibliothek Leipzig, 2015. http://d-nb.info/1239422202/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Khizra, Shufa. "Using Natural Language Processing and Machine Learning for Analyzing Clinical Notes in Sickle Cell Disease Patients." Wright State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=wright154759374321405.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Cao, Haoliang. "Automating Question Generation Given the Correct Answer." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-287460.

Full text
Abstract:
In this thesis, we propose an end-to-end deep learning model for a question generation task. Given a Wikipedia article written in English and a segment of text appearing in the article, the model can generate a simple question whose answer is the given text segment. The model is based on an encoder-decoder architecture. Our experiments show that a model with a fine-tuned BERT encoder and a self-attention decoder gives the best performance. We also propose an evaluation metric for the question generation task, which evaluates both the syntactic correctness and the relevance of the generated questions. According to our analysis of sampled data, the new metric is found to give better evaluation compared to other popular metrics for sequence-to-sequence tasks.
APA, Harvard, Vancouver, ISO, and other styles
31

Storby, Johan. "Information extraction from text recipes in a web format." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189888.

Full text
Abstract:
Searching the Internet for recipes to find interesting ideas for meals to prepare is getting increasingly popular. It can however be difficult to find a recipe for a dish that can be prepared with the items someone has available at home. In this thesis a solution to a part of that problem will be presented. This thesis will investigate a method for extracting the various parts of a recipe from the Internet in order to save them and build a searchable database of recipes where users can search for recipes based on the ingredients they have available. The system works for both English and Swedish and is able to identify both languages. This is a problem within Natural Language Processing and the subfield Information Extraction. To solve the Information Extraction problem, rule-based techniques based on Named Entity Recognition, Content Extraction and general rule-based extraction are used. The results indicate generally good but not flawless functionality. For English, the rule-based algorithm achieved an F1-score of 83.8% for ingredient identification, 94.5% for identification of cooking instructions, and an accuracy of 88.0% and 96.4% for cooking time and number of portions respectively. For Swedish, the ingredient identification worked slightly better but the other parts worked slightly worse. The results are comparable to the results of other similar methods and can hence be considered good; they are however not good enough for the system to be used independently without a supervising human.
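In the spirit of the rule-based techniques described above, a toy extractor for (quantity, unit, ingredient) triples from English ingredient lines might look as follows; the pattern and unit list are simplified far beyond what the thesis system covers.

# Toy rule-based extraction of (quantity, unit, ingredient) from English
# ingredient lines. The pattern and unit list are heavily simplified and are
# not the rules used in the thesis.
import re

UNITS = r"(?:g|kg|ml|dl|l|tbsp|tsp|cup|cups|pinch)"
PATTERN = re.compile(
    rf"^(?P<qty>\d+(?:[.,]\d+)?)?\s*(?P<unit>{UNITS})?\s*(?:of\s+)?(?P<name>.+)$",
    re.IGNORECASE,
)

lines = ["2 tbsp olive oil", "400 g chopped tomatoes", "1 pinch of salt", "fresh basil"]
for line in lines:
    match = PATTERN.match(line.strip())
    print({k: (v.strip() if v else None) for k, v in match.groupdict().items()})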
APA, Harvard, Vancouver, ISO, and other styles
32

Nahnsen, Thade. "Automation of summarization evaluation methods and their application to the summarization process." Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5278.

Full text
Abstract:
Summarization is the process of creating a more compact textual representation of a document or a collection of documents. In view of the vast increase in electronically available information sources in the last decade, filters such as automatically generated summaries are becoming ever more important to facilitate the efficient acquisition and use of required information. Different methods using natural language processing (NLP) techniques are being used to this end. One of the shallowest approaches is the clustering of available documents and the representation of the resulting clusters by one of the documents; an example of this approach is the Google News website. It is also possible to augment the clustering of documents with a summarization process, which would result in a more balanced representation of the information in the cluster, NewsBlaster being an example. However, while some systems are already available on the web, summarization is still considered a difficult problem in the NLP community. One of the major problems hampering the development of proficient summarization systems is the evaluation of the (true) quality of system-generated summaries. This is exemplified by the fact that the current state-of-the-art evaluation method to assess the information content of summaries, the Pyramid evaluation scheme, is a manual procedure. In this light, this thesis has three main objectives. 1. The development of a fully automated evaluation method. The proposed scheme is rooted in the ideas underlying the Pyramid evaluation scheme and makes use of deep syntactic information and lexical semantics. Its performance improves notably on previous automated evaluation methods. 2. The development of an automatic summarization system which draws on the conceptual idea of the Pyramid evaluation scheme and the techniques developed for the proposed evaluation system. The approach features the algorithm for determining the pyramid and bases importance on the number of occurrences of the variable-sized contributors of the pyramid as opposed to word-based methods exploited elsewhere. 3. The development of a text coherence component that can be used for obtaining the best ordering of the sentences in a summary.
APA, Harvard, Vancouver, ISO, and other styles
33

Luff, Robert. "The use of systems engineering principles for the integration of existing models and simulations." Thesis, Loughborough University, 2017. https://dspace.lboro.ac.uk/2134/26739.

Full text
Abstract:
With the rise in computational power, the prospect of simulating a complex engineering system with a high degree of accuracy and in a meaningful way is becoming a real possibility. Modelling and simulation have become ubiquitous throughout the engineering life cycle; as a consequence there are many thousands of existing models and simulations that are potential candidates for integration. This work is concerned with ascertaining whether systems engineering principles are of use in the support of virtual testing, from the desire to test, designing experiments, specifying simulations, selecting models and simulations, and integrating component parts, to verifying that the work is as specified and validating that any outcomes are meaningful. A novel representation of a systems engineering framework is proposed and forms the basis for the methods that were developed. It takes the core systems engineering principles and expresses them in a way that can be implemented in a variety of ways. An end-to-end process for virtual testing with the potential to use existing models and simulations is proposed; it provides structure and order to the testing task. A key part of the proposed process is the recognition that models and simulations requirements are different from those of the system being designed, and hence a modelling and simulation specific writing guide is produced. The automation of any engineering task has the potential to reduce the time to market of the final product; for this reason the potential of natural language processing technology to hasten the proposed processes was investigated. Two case studies were selected to test and demonstrate the potential of the novel approach, the first being an investigation into material selection for a squash ball, and the second being automotive in nature, concerned with combining steering and braking systems. The processes and methods indicated their potential value, especially in the automotive case study, where inconsistencies were identified that could otherwise have affected the successful integration. This capability, combined with the verification stages, improves the confidence of any model and simulation integration. The NLP proof-of-concept software also demonstrated that such technology has value in the automation of integration. With further testing and development there is the possibility to create a software package to guide engineers through the difficult task of virtual testing. Such a tool would have the potential to drastically reduce the time to market of complex products.
APA, Harvard, Vancouver, ISO, and other styles
34

Alsehaimi, Afnan Abdulrahman A. "Sentiment Analysis for E-book Reviews on Amazon to Determine E-book Impact Rank." University of Dayton / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=dayton1619109972210567.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Palm, Myllylä Johannes. "Domain Adaptation for Hypernym Discovery via Automatic Collection of Domain-Specific Training Data." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-157693.

Full text
Abstract:
Identifying semantic relations in natural language text is an important component of many knowledge extraction systems. This thesis studies the task of hypernym discovery, i.e. discovering terms that are related by the hypernymy (is-a) relation. Specifically, this thesis explores how state-of-the-art methods for hypernym discovery perform when applied in specific language domains. In recent times, state-of-the-art methods for hypernym discovery are mostly made up of supervised machine learning models that leverage distributional word representations such as word embeddings. These models require labeled training data in the form of term pairs that are known to be related by hypernymy. Such labeled training data is often not available when working with a specific language domain. This thesis presents experiments with an automatic training data collection algorithm. The algorithm leverages a pre-defined domain-specific vocabulary and the lexical resource WordNet to extract training pairs automatically. This thesis contributes by presenting experimental results from attempting to leverage such automatically collected domain-specific training data for the purpose of domain adaptation. Experiments are conducted in two different domains: one domain where there is a large amount of text data, and another domain where there is a much smaller amount of text data. Results show that the automatically collected training data has a positive impact on performance in both domains. The performance boost is most significant in the domain with a large amount of text data, with mean average precision increasing by up to 8 points.
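The general shape of collecting (term, hypernym) training pairs from WordNet for a small domain vocabulary can be sketched as below; the vocabulary is invented and the additional filtering steps of the thesis pipeline are not shown.

# Sketch of collecting (term, hypernym) training pairs from WordNet for a
# small, invented domain vocabulary. The thesis pipeline adds filtering and
# real domain-specific vocabularies that are not reproduced here.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

domain_vocabulary = ["aspirin", "ibuprofen", "insulin", "penicillin"]

training_pairs = set()
for term in domain_vocabulary:
    for synset in wn.synsets(term, pos=wn.NOUN):
        for hypernym in synset.hypernyms():
            for lemma in hypernym.lemma_names():
                training_pairs.add((term, lemma.replace("_", " ")))

for pair in sorted(training_pairs):
    print(pair)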
APA, Harvard, Vancouver, ISO, and other styles
36

Dagerman, Björn. "Semantic Analysis of Natural Language and Definite Clause Grammar using Statistical Parsing and Thesauri." Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-26142.

Full text
Abstract:
Services that rely on the semantic computation of users' natural linguistic inputs are becoming more frequent. Computing semantic relatedness between texts is problematic due to the inherent ambiguity of natural language. The purpose of this thesis was to show how a sentence could be compared to a predefined semantic Definite Clause Grammar (DCG). Furthermore, it should show how a DCG-based system could benefit from such capabilities. Our approach combines openly available specialized NLP frameworks for statistical parsing, part-of-speech tagging and word-sense disambiguation. We compute the semantic relatedness using a large lexical and conceptual-semantic thesaurus. Also, we extend an existing programming language for multimodal interfaces, which uses static predefined DCGs: COactive Language Definition (COLD). That is, every word that should be acceptable by COLD needs to be explicitly defined. By applying our solution, we show how our approach can remove dependencies on word definitions and improve grammar definitions in DCG-based systems.
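A minimal example of the kind of thesaurus-based relatedness computation mentioned above is to score an out-of-grammar word against the words a DCG already accepts using WordNet's Wu-Palmer similarity; the word lists below are invented and this is not the COLD integration itself.

# Minimal WordNet-based relatedness check: score an out-of-vocabulary word
# against the words a grammar already accepts and keep the best match.
# Word lists are invented examples, not COLD grammar content.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def best_match(word, accepted_words):
    best = (None, 0.0)
    for candidate in accepted_words:
        for s1 in wn.synsets(word, pos=wn.NOUN):
            for s2 in wn.synsets(candidate, pos=wn.NOUN):
                score = s1.wup_similarity(s2) or 0.0
                if score > best[1]:
                    best = (candidate, score)
    return best

accepted = ["automobile", "truck", "bicycle"]
print(best_match("car", accepted))    # e.g. ('automobile', 1.0)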
APA, Harvard, Vancouver, ISO, and other styles
37

Dai, Xiang. "Recognising Biomedical Names: Challenges and Solutions." Thesis, The University of Sydney, 2021. https://hdl.handle.net/2123/25482.

Full text
Abstract:
The growth rate in the amount of biomedical documents is staggering. Unlocking the information trapped in these documents can enable researchers and practitioners to operate confidently in the information world. Biomedical Named Entity Recognition (NER), the task of recognising biomedical names, is usually employed as the first step of the NLP pipeline. Standard NER models, based on the sequence tagging technique, are good at recognising short entity mentions in the generic domain. However, there are several open challenges in applying these models to recognise biomedical names:
● Biomedical names may contain complex inner structure (discontinuity and overlapping) which cannot be recognised using standard sequence tagging techniques;
● The training of NER models usually requires a large amount of labelled data, which is difficult to obtain in the biomedical domain; and,
● Commonly used language representation models are pre-trained on generic data; a domain shift therefore exists between these models and target biomedical data.
To deal with these challenges, we explore several research directions and make the following contributions: (1) we propose a transition-based NER model which can recognise discontinuous mentions; (2) we develop a cost-effective approach that nominates suitable pre-training data; and (3) we design several data augmentation methods for NER. Our contributions have obvious practical implications, especially when new biomedical applications are needed. Our proposed data augmentation methods can help the NER model achieve decent performance, requiring only a small amount of labelled data. Our investigation regarding selecting pre-training data can improve the model by incorporating language representation models which are pre-trained using in-domain data. Finally, our proposed transition-based NER model can further improve the performance by recognising discontinuous mentions.
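One simple, label-preserving augmentation of the general kind referred to in contribution (3) is to swap an entity mention in a token-labelled sentence for another mention of the same type; the sentence, BIO labels and mention inventory below are invented, and the thesis studies several augmentation methods beyond this one.

# One simple label-preserving augmentation for token-labelled NER data:
# swap an entity mention for another mention of the same type. The example
# sentence, BIO labels and mention inventory are invented.
import random

def replace_mention(tokens, labels, mention_bank, rng=random):
    # Find spans labelled B-X / I-X and replace one with a same-type mention.
    spans = []
    start = None
    for i, label in enumerate(labels + ["O"]):
        if label.startswith("B-"):
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "O" and start is not None:
            spans.append((start, i))
            start = None
    if not spans:
        return tokens, labels
    s, e = rng.choice(spans)
    ent_type = labels[s][2:]
    new_mention = rng.choice(mention_bank[ent_type]).split()
    new_labels = [f"B-{ent_type}"] + [f"I-{ent_type}"] * (len(new_mention) - 1)
    return tokens[:s] + new_mention + tokens[e:], labels[:s] + new_labels + labels[e:]

tokens = ["Severe", "chest", "pain", "was", "reported", "."]
labels = ["O", "B-Symptom", "I-Symptom", "O", "O", "O"]
bank = {"Symptom": ["shortness of breath", "abdominal pain", "headache"]}
print(replace_mention(tokens, labels, bank, random.Random(0)))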
APA, Harvard, Vancouver, ISO, and other styles
38

Marano, Federica. "Exploring formal models of linguistic data structuring. Enhanced solutions for knowledge management systems based on NLP applications." Doctoral thesis, Universita degli studi di Salerno, 2012. http://hdl.handle.net/10556/349.

Full text
Abstract:
2010 - 2011
The principal aim of this research is to describe to what extent formal models for linguistic data structuring are crucial in Natural Language Processing (NLP) applications. In this sense, we will pay particular attention to those Knowledge Management Systems (KMS) which are designed for the Internet, and also to the enhanced solutions they may require. In order to deal appropriately with these topics, we will describe how to achieve computational linguistics applications helpful to humans in establishing and maintaining an advantageous relationship with technologies, especially with those technologies which are based on or produce man-machine interactions in natural language. We will explore the positive relationship which may exist between well-structured Linguistic Resources (LR) and KMS, in order to state that if the information architecture of a KMS is based on the formalization of linguistic data, then the system works better and is more consistent. As for the topics we want to deal with, first of all it is indispensable to state that in order to structure efficient and effective Information Retrieval (IR) tools, understanding and formalizing natural language combinatory mechanisms seems to be the first operation to achieve, also because any piece of information produced by humans on the Internet is necessarily a linguistic act. Therefore, in this research work we will also discuss the NLP structuring of a linguistic formalization Hybrid Model, which we hope will prove to be a useful tool to support, improve and refine KMSs. More specifically, in section 1 we will describe how to structure language resources implementable inside KMSs, to what extent they can improve the performance of these systems, and how the problem of linguistic data structuring is dealt with by natural language formalization methods. In section 2 we will proceed with a brief review of computational linguistics, paying particular attention to specific software packages such as Intex, Unitex, NooJ, and Cataloga, which are developed according to the Lexicon-Grammar (LG) method, a linguistic theory established during the 1960s by Maurice Gross. In section 3 we will describe some specific works useful to monitor the state of the art in Linguistic Data Structuring Models, Enhanced Solutions for KMSs, and NLP Applications for KMSs. In section 4 we will cope with problems related to natural language formalization methods, describing mainly Transformational-Generative Grammar (TGG) and LG, plus other methods based on statistical approaches and ontologies. In section 5 we will propose a Hybrid Model usable in NLP applications in order to create effective enhanced solutions for KMSs. Specific features and elements of our hybrid model will be shown through some results on experimental research work. The case study we will present is a very complex NLP problem yet little explored in recent years, i.e. Multi Word Units (MWUs) treatment. In section 6 we will close our research by evaluating its results and presenting possible future work perspectives. [edited by author]
X n.s.
APA, Harvard, Vancouver, ISO, and other styles
39

Acosta, Andrew D. "Laff-O-Tron: Laugh Prediction in TED Talks." DigitalCommons@CalPoly, 2016. https://digitalcommons.calpoly.edu/theses/1667.

Full text
Abstract:
Did you hear where the thesis found its ancestors? They were in the "parent-thesis"! This joke, whether you laughed at it or not, contains a fascinating and mysterious quality: humor. Humor is something so incredibly human that if you squint, the two words can even look the same. As such, humor is not often considered something that computers can understand. But that doesn't mean we won't try to teach it to them. In this thesis, we propose the system Laff-O-Tron, which attempts to predict when the audience of a public speech will laugh by looking only at the text of the speech. To do this, we create a corpus of over 1700 TED Talks retrieved from the TED website. We then adapt various techniques used by researchers to identify humor in text, and investigate features that are specific to our public-speaking setting. Using supervised learning, we classify whether a chunk of text would cause the audience to laugh or not, based on these features. We examine the effects of each feature, classifier, and size of the text chunk provided. On a balanced data set, we predict laughter with up to 75% accuracy under our best conditions, around 70% under medium conditions, and 66% under our worst conditions. Computers with humor recognition capabilities would be useful in the fields of human-computer interaction and communications: humor can make a computer easier to interact with, and such a system can function as a tool to check whether humor was properly used in an advertisement or speech.
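A toy version of the supervised set-up described above (deciding whether a chunk of talk text precedes audience laughter) might look like the following bag-of-words baseline. The example chunks, labels and choice of classifier are invented for illustration; they are not the features or models evaluated in the thesis.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy chunks: 1 = audience laughed after the chunk, 0 = no laughter.
chunks = [
    "so I told my mother I wanted to be a comedian and she laughed at me",
    "the results show a significant increase in global temperature",
    "and that is when the llama ate my presentation notes",
    "we measured the voltage across the resistor three times",
]
labels = [1, 0, 1, 0]

# Bag-of-words unigrams and bigrams feeding a logistic regression classifier.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(chunks, labels)
print(model.predict(["then the llama asked for a raise"]))
```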
APA, Harvard, Vancouver, ISO, and other styles
40

Ramponi, Alan. "Knowledge Extraction from Biomedical Literature with Symbolic and Deep Transfer Learning Methods." Doctoral thesis, Università degli studi di Trento, 2021. http://hdl.handle.net/11572/310787.

Full text
Abstract:
The available body of biomedical literature is increasing at a high pace, exceeding the ability of researchers to promptly leverage this knowledge-rich amount of information. Despite the outstanding progress in natural language processing (NLP) observed in the past few years, current technological advances in the field mainly concern newswire and web texts, and do not directly translate into good performance on highly specialized domains such as biomedicine, due to linguistic variation at the surface, syntactic and semantic levels. Given the advances in NLP, the challenges the biomedical domain exhibits, and the explosive growth of biomedical knowledge currently being published, in this thesis we contribute to the biomedical NLP field by providing efficient means for extracting semantic relational information from biomedical literature texts. To this end, we make the following contributions towards the real-world adoption of knowledge extraction methods to support biomedicine: (i) we propose a symbolic high-precision biomedical relation extraction approach to reduce the time-consuming manual curation effort for extracted relational evidence (Chapter 3), (ii) we conduct a thorough cross-domain study to quantify the drop in performance of deep learning methods for biomedical edge detection, shedding light on the importance of linguistic varieties in biomedicine (Chapter 4), and (iii) we propose a fast and accurate end-to-end solution for biomedical event extraction, leveraging sequential transfer learning and multi-task learning, making it a viable approach for real-world large-scale scenarios (Chapter 5). We then outline the conclusions by highlighting challenges and providing future research directions in the field.
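Contribution (iii) relies on a shared model serving several tasks at once; the sketch below shows the general shape of such a shared-encoder, multi-head set-up in PyTorch. The dimensions, head names and toy inputs are assumptions for illustration, not the architecture used in the thesis.

```python
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    """A shared encoder feeding two task-specific heads (e.g. trigger and argument classification)."""
    def __init__(self, input_dim=768, hidden_dim=256, n_triggers=10, n_arguments=6):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.trigger_head = nn.Linear(hidden_dim, n_triggers)
        self.argument_head = nn.Linear(hidden_dim, n_arguments)

    def forward(self, x):
        h = self.encoder(x)                      # shared representation
        return self.trigger_head(h), self.argument_head(h)

# Toy usage: token embeddings (e.g. from a pre-trained encoder) of shape [batch, 768].
model = SharedEncoderMultiTask()
trigger_logits, argument_logits = model(torch.randn(4, 768))
loss = nn.CrossEntropyLoss()(trigger_logits, torch.tensor([0, 1, 2, 3])) \
     + nn.CrossEntropyLoss()(argument_logits, torch.tensor([0, 1, 0, 1]))
loss.backward()  # gradients flow into both heads and the shared encoder
```

Summing the two task losses is what lets the shared encoder learn representations useful for both heads, which is the core idea behind multi-task training in this kind of pipeline.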
APA, Harvard, Vancouver, ISO, and other styles
42

Lauly, Stanislas. "Exploration des réseaux de neurones à base d'autoencodeur dans le cadre de la modélisation des données textuelles." Thèse, Université de Sherbrooke, 2016. http://hdl.handle.net/11143/9461.

Full text
Abstract:
Since the mid-2000s, a new approach in machine learning, deep learning, has been gaining popularity. This approach has demonstrated its effectiveness in solving various problems, improving on the results obtained by other techniques that were then considered state of the art, as in object recognition and speech recognition. Given this, using deep networks in Natural Language Processing (NLP; in French, Traitement Automatique du Langage Naturel, TALN) is a logical next step. This thesis explores different neural network structures with the goal of modelling written text, concentrating on models that are simple, powerful and fast to train.
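As a rough illustration of the kind of model explored (and not of the thesis's specific architectures), the sketch below trains a tiny tied-weight autoencoder on toy bag-of-words document vectors with NumPy; all dimensions and data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden, n_docs, lr = 50, 8, 100, 0.1
X = (rng.random((n_docs, vocab_size)) < 0.1).astype(float)  # toy binary bag-of-words documents

W = rng.normal(0.0, 0.1, (vocab_size, hidden))  # tied weights: encoder W, decoder W.T
b_h, b_v = np.zeros(hidden), np.zeros(vocab_size)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    H = sigmoid(X @ W + b_h)            # encode documents into a low-dimensional code
    X_hat = sigmoid(H @ W.T + b_v)      # decode: reconstruct the bag-of-words vector
    err = X_hat - X                     # gradient of cross-entropy w.r.t. decoder pre-activation
    d_pre_h = (err @ W) * H * (1 - H)   # back-propagate through the encoder
    W -= lr * (err.T @ H + X.T @ d_pre_h) / n_docs
    b_v -= lr * err.mean(axis=0)
    b_h -= lr * d_pre_h.mean(axis=0)

loss = -np.mean(X * np.log(X_hat + 1e-9) + (1 - X) * np.log(1 - X_hat + 1e-9))
print(f"reconstruction cross-entropy after training: {loss:.3f}")
```

The hidden activations H act as a compact document representation, which is the basic ingredient that autoencoder-based text models build upon.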
APA, Harvard, Vancouver, ISO, and other styles
43

Piscaglia, Nicola. "Deep Learning for Natural Language Processing: Novel State-of-the-art Solutions in Summarisation of Legal Case Reports." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/20342/.

Full text
Abstract:
Deep neural networks are one of the major classification machines in machine learning. Several Deep Neural Networks (DNNs) have been developed and evaluated in recent years in recognition tasks, such as estimating a user base and estimating their interactivity. We present the best algorithms for extracting or summarising text using a deep neural network while allowing the workers to interpret the texts from the output speech. In this work both extractive and abstractive summarisation approaches have been applied. In particular, BERT (Base, Multilingual Cased) and a deep neural network composed by CNN and GRU layers have been used in the extraction-based summarisation while the abstraction-based one has been performed by applying the GPT-2 Transformer model. We show our models achieve high scores in syntactical terms while a human evaluation is still needed to judge the coherence, consistency and unreferenced harmonicity of speech. Our proposed work outperform the state of the art results for extractive summarisation on the Australian Legal Case Report Dataset. Our paper can be viewed as further demonstrating that our model can outperform the state of the art on a variety of extractive and abstractive summarisation tasks. Note: The abstract above was not written by the author; it was generated by providing part of the thesis introduction as input text to the pre-trained GPT-2 (Small) Transformer model used in this work, which had previously been fine-tuned for 4 epochs with the "NIPS 2015 Papers" dataset.
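For readers unfamiliar with the extractive setting, the snippet below shows a generic centroid-based extractive baseline: sentences are scored by TF-IDF similarity to the document centroid and the top ones are kept. It is only a baseline illustration with invented sentences, not the BERT- or CNN/GRU-based models used in this work.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extractive_summary(sentences, k=2):
    """Pick the k sentences most similar to the document centroid in TF-IDF space."""
    S = TfidfVectorizer().fit_transform(sentences)        # one row per sentence
    centroid = np.asarray(S.mean(axis=0))                 # document centroid (1 x vocab)
    scores = cosine_similarity(S, centroid).ravel()
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]            # keep original sentence order

# Toy legal-style sentences (invented).
doc = [
    "The appellant sought leave to appeal against the costs order.",
    "The weather on the day of the hearing was unremarkable.",
    "The court held that the primary judge erred in apportioning costs.",
    "Costs were awarded to the respondent at first instance.",
]
print(extractive_summary(doc, k=2))
```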
APA, Harvard, Vancouver, ISO, and other styles
44

Panesar, Kulvinder. "Functional linguistic based motivations for a conversational software agent." Cambridge Scholars Publishing, 2019. http://hdl.handle.net/10454/18134.

Full text
Abstract:
This chapter discusses a linguistically orientated model of a conversational software agent (CSA) framework (Panesar 2017) that is sensitive to natural language processing (NLP) concepts and to the levels of adequacy of a functional linguistic theory (LT). We discuss the relationship between NLP and knowledge representation (KR), and connect this with the goals of a linguistic theory (Van Valin and LaPolla 1997), in particular Role and Reference Grammar (RRG) (Van Valin Jr 2005). We debate the advantages of RRG and consider its fitness and computational adequacy. We present a design of a computational model of the linking algorithm that utilises a speech act construction as a grammatical object (Nolan 2014a, Nolan 2014b) and the sub-model of belief, desire and intentions (BDI) (Rao and Georgeff 1995). This model has been successfully implemented in software, using the Resource Description Framework (RDF), and we highlight some implementation issues that arose at the interface between language and knowledge representation (Panesar 2017).
The full text of this article will be released for public view at the end of the publisher embargo on 27 Sep 2024.
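To give a flavour of what encoding a speech-act construction as RDF triples can look like, the sketch below builds a few illustrative triples with rdflib. The namespace, property names and example utterance are invented for illustration and are not the ontology or vocabulary used in the cited work.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/csa#")   # hypothetical namespace for this sketch
g = Graph()
g.bind("ex", EX)

# An illustrative speech act: the user asserts that "the meeting starts at noon".
act = EX.speechAct1
g.add((act, RDF.type, EX.Assertion))
g.add((act, EX.hasSpeaker, EX.user42))
g.add((act, EX.hasUtterance, Literal("the meeting starts at noon")))
g.add((act, EX.hasPredicate, EX.start))
g.add((act, EX.hasActor, EX.meeting))

print(g.serialize(format="turtle"))
```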
APA, Harvard, Vancouver, ISO, and other styles
45

Packer, Thomas L. "Surface Realization Using a Featurized Syntactic Statistical Language Model." Diss., CLICK HERE for online access, 2006. http://contentdm.lib.byu.edu/ETD/image/etd1195.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Sidås, Albin, and Simon Sandberg. "Conversational Engine for Transportation Systems." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-176810.

Full text
Abstract:
Today's communication between operators and professional drivers takes place through direct conversations between the parties. This thesis project explores the possibility of supporting the operators by classifying the topic of incoming communications and identifying which entities are affected, using named entity recognition and topic classification. A synthetic training dataset was developed, and a NER model and a topic classification model were built and evaluated on it, achieving F1-scores of 71.4 and 61.8 respectively. These results are explained by the low variance of the synthetic dataset in comparison to a transcribed real-world dataset, which included anomalies not represented in the synthetic data. The models were integrated into the dialogue framework Emora to handle the back-and-forth communication and generate responses.
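A minimal version of the synthetic-data idea described above is sketched here: sentences are generated from hand-written templates whose slots are filled from entity lists, yielding BIO-tagged NER training examples. The templates, entity lists and tag names are invented and are not those used in the thesis.

```python
import random

# Invented templates and entity lists for a transportation-style domain.
TEMPLATES = [
    ("truck {vehicle} is delayed at {location}", {"vehicle": "VEHICLE", "location": "LOCATION"}),
    ("please reroute {vehicle} to {location}", {"vehicle": "VEHICLE", "location": "LOCATION"}),
]
FILLERS = {"vehicle": ["TR-12", "TR-7"], "location": ["terminal 3", "the north depot"]}

def generate_example(rng=random):
    """Fill a random template and emit a (tokens, BIO tags) training example."""
    template, slot_types = rng.choice(TEMPLATES)
    tokens, tags = [], []
    for word in template.split():
        if word.startswith("{") and word.endswith("}"):
            slot = word[1:-1]
            filler = rng.choice(FILLERS[slot]).split()
            tokens.extend(filler)
            tags.extend([f"B-{slot_types[slot]}"] + [f"I-{slot_types[slot]}"] * (len(filler) - 1))
        else:
            tokens.append(word)
            tags.append("O")
    return tokens, tags

dataset = [generate_example() for _ in range(5)]
for tokens, tags in dataset:
    print(list(zip(tokens, tags)))
```

The low variance noted in the abstract is visible even in this toy setup: every generated sentence follows one of a handful of fixed patterns, unlike transcribed real-world dialogue.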
APA, Harvard, Vancouver, ISO, and other styles
47

Mrkšić, Nikola. "Data-driven language understanding for spoken dialogue systems." Thesis, University of Cambridge, 2018. https://www.repository.cam.ac.uk/handle/1810/276689.

Full text
Abstract:
Spoken dialogue systems provide a natural conversational interface to computer applications. In recent years, the substantial improvements in the performance of speech recognition engines have helped shift the research focus to the next component of the dialogue system pipeline: the one in charge of language understanding. The role of this module is to translate user inputs into accurate representations of the user goal in the form that can be used by the system to interact with the underlying application. The challenges include the modelling of linguistic variation, speech recognition errors and the effects of dialogue context. Recently, the focus of language understanding research has moved to making use of word embeddings induced from large textual corpora using unsupervised methods. The work presented in this thesis demonstrates how these methods can be adapted to overcome the limitations of language understanding pipelines currently used in spoken dialogue systems. The thesis starts with a discussion of the pros and cons of language understanding models used in modern dialogue systems. Most models in use today are based on the delexicalisation paradigm, where exact string matching supplemented by a list of domain-specific rephrasings is used to recognise users' intents and update the system's internal belief state. This is followed by an attempt to use pretrained word vector collections to automatically induce domain-specific semantic lexicons, which are typically hand-crafted to handle lexical variation and account for a plethora of system failure modes. The results highlight the deficiencies of distributional word vectors which must be overcome to make them useful for downstream language understanding models. The thesis next shifts focus to overcoming the language understanding models' dependency on semantic lexicons. To achieve that, the proposed Neural Belief Tracking (NBT) model forsakes the use of standard one-hot n-gram representations used in Natural Language Processing in favour of distributed representations of user utterances, dialogue context and domain ontologies. The NBT model makes use of external lexical knowledge embedded in semantically specialised word vectors, obviating the need for domain-specific semantic lexicons. Subsequent work focuses on semantic specialisation, presenting an efficient method for injecting external lexical knowledge into word vector spaces. The proposed Attract-Repel algorithm boosts the semantic content of existing word vectors while simultaneously inducing high-quality cross-lingual word vector spaces. Finally, NBT models powered by specialised cross-lingual word vectors are used to train multilingual belief tracking models. These models operate across many languages at once, providing an efficient method for bootstrapping language understanding models for lower-resource languages with limited training data.
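The semantic specialisation step can be pictured with the heavily simplified sketch below: synonym pairs are pulled together, antonym pairs are pushed apart, and a regularisation term keeps each vector close to its distributional original. This is only a schematic of the idea; the actual Attract-Repel algorithm uses a max-margin objective with negative sampling, and the toy vocabulary and vectors here are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cheap", "inexpensive", "pricey", "expensive"]
vectors = {w: rng.normal(size=8) for w in vocab}       # stand-ins for pre-trained word vectors
original = {w: v.copy() for w, v in vectors.items()}

synonyms = [("cheap", "inexpensive"), ("pricey", "expensive")]   # attract pairs
antonyms = [("cheap", "expensive"), ("inexpensive", "pricey")]   # repel pairs
lr, reg = 0.05, 0.01

for step in range(100):
    for a, b in synonyms:                              # pull synonyms together
        diff = vectors[a] - vectors[b]
        vectors[a] -= lr * diff
        vectors[b] += lr * diff
    for a, b in antonyms:                              # push antonyms apart
        diff = vectors[a] - vectors[b]
        vectors[a] += lr * diff
        vectors[b] -= lr * diff
    for w in vocab:                                    # stay close to the original space
        vectors[w] -= reg * (vectors[w] - original[w])
        vectors[w] /= np.linalg.norm(vectors[w])       # unit length keeps updates stable

cos = lambda u, v: float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
print("cheap / inexpensive:", round(cos(vectors["cheap"], vectors["inexpensive"]), 2))
print("cheap / expensive:  ", round(cos(vectors["cheap"], vectors["expensive"]), 2))
```

After a few iterations the synonym pair ends up with a higher cosine similarity than the antonym pair, which is exactly the property that downstream belief trackers exploit.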
APA, Harvard, Vancouver, ISO, and other styles
48

Eriksson, Patrik, and Philip Wester. "Granskning av examensarbetesrapporter med IBM Watson molntjänster." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-232057.

Full text
Abstract:
Cloud services are one of the fastest-expanding fields of today. Companies such as Amazon, Google, Microsoft and IBM offer these cloud services in various forms. As the field progresses, the natural question arises: "What can you do with the technology today?" The technology offers scalability with respect to hardware usage and user demand, which is attractive to developers and companies. This thesis examines the applicability of cloud services by combining that question with another: "Is it possible to make an automated thesis examiner?" By narrowing the services down to the IBM Watson web services, the thesis's main question reads: "Is it possible to make an automated thesis examiner using IBM Watson?" The goal of this thesis was therefore to create an automated thesis examiner. The project used a modified version of Bunge's technological research method, in which one of the first steps was to create a definition of a software thesis examiner for student theses. An empirical study of the Watson services that seemed relevant from the literature study then followed. These empirical studies allowed a deeper understanding of the services' capabilities and boundaries. From these findings and the definition of a software thesis examiner for student theses, an idea of how to build and implement an automated thesis examiner was created. Most of IBM Watson's services were thoroughly evaluated, except for the Machine Learning service, which would have been studied further had time resources not been depleted. The project found the Watson web services useful in many cases but did not find a service that was well suited for thesis examination. Although the goal was not reached, this thesis investigated the Watson web services and can be used to improve understanding of their applicability and to inform future implementations that aim to meet the provided definition.
Cloud services are one of the fastest-developing areas today. Companies such as Amazon, Google, Microsoft and IBM provide these services in several forms. As development accelerates, the natural question arises: "What can be done with this technology today?" The technology offers scalability with respect to the hardware used and the number of users, which is attractive to developers and companies. This degree project tries to answer how cloud services can be used by combining this with the question "Is it possible to create an automated thesis report examiner?". By limiting the investigation to IBM Watson cloud services, the work mainly tries to answer the question "Is it possible to create an automated thesis report examiner with Watson cloud services?". The goal of the work was thus to create an automated thesis report examiner. The project followed a modified version of Bunge's technological research method, in which the first step was to create a definition of a software thesis report examiner, followed by an investigation of the Watson cloud services deemed relevant from the literature study. These were then examined further in an empirical study. Through the empirical studies, an understanding of the services' applicability and limitations was established, in order to map out how they could be used in an automated thesis report examiner. Most services were treated thoroughly, except Machine Learning, which would have needed further investigation had time resources not run out. The project shows that the Watson cloud services are useful but not perfectly suited to examining thesis reports. Although the goal was not reached, the Watson cloud services were investigated, which can provide an understanding of their usefulness and inform future implementations aimed at meeting the created definition.
APA, Harvard, Vancouver, ISO, and other styles
49

Baglodi, Venkatesh. "A Feature Structure Approach for Disambiguating Preposition Senses." NSUWorks, 2009. http://nsuworks.nova.edu/gscis_etd/83.

Full text
Abstract:
Word Sense Disambiguation (WSD) continues to be an open research problem in spite of recent advances in the NLP field, especially in machine learning. WSD for open-class words is well understood. However, WSD for closed-class structural words (such as prepositions) is not so well resolved, and their role in frame semantics seems to be a relatively unknown area. This research uses a new method to disambiguate preposition senses through a combined lookup of the FrameNet and TPP databases. Motivated by recent work by Popescu, Tonelli, & Pianta (2007), it extends the concept to provide a deterministic WSD of prepositions using the lexical information drawn from the sentences in a local context. While the primary goal of the research is to disambiguate preposition sense, the approach also assigns frames and roles to different sentence elements. The use of prepositions for frame and role assignment seems to be a largely unexplored area which could provide a new dimension to research in lexical semantics.
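The combined-lookup idea can be illustrated with the toy sketch below, which chooses a preposition sense by intersecting the frames associated with each candidate sense in a (hypothetical) TPP-style inventory with the frames evoked by nearby context words. The miniature inventories and the scoring rule are invented for illustration; they are not the actual FrameNet or TPP data, nor the thesis's algorithm.

```python
# Hypothetical miniature sense inventories (not real TPP or FrameNet data).
TPP_SENSES = {
    "in": {"in_temporal": {"Time"}, "in_spatial": {"Containment", "Locative"}},
    "by": {"by_agentive": {"Agent"}, "by_temporal": {"Time"}},
}
FRAME_LEXICON = {               # frames evoked by context words (toy)
    "morning": {"Time"}, "box": {"Containment"}, "author": {"Agent"}, "noon": {"Time"},
}

def disambiguate(preposition, context_words):
    """Choose the preposition sense whose frame set overlaps most with the context frames."""
    context_frames = set().union(*(FRAME_LEXICON.get(w, set()) for w in context_words))
    senses = TPP_SENSES[preposition]
    return max(senses, key=lambda s: len(senses[s] & context_frames))

print(disambiguate("in", ["the", "morning"]))   # -> in_temporal
print(disambiguate("in", ["the", "box"]))       # -> in_spatial
print(disambiguate("by", ["the", "author"]))    # -> by_agentive
```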
APA, Harvard, Vancouver, ISO, and other styles
50

Murray, Jonathan. "Finding Implicit Citations in Scientific Publications : Improvements to Citation Context Detection Methods." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-173913.

Full text
Abstract:
This thesis deals with the task of identifying implicit citations between scientific publications. Apart from being useful in their own right, the citations may be used as input to other problems, such as determining an author's sentiment towards a reference or summarizing a paper based on what others have written about it. We extend two recently proposed methods, a Machine Learning classifier and an iterative Belief Propagation algorithm. Both are implemented and evaluated on a common pre-annotated dataset. Several changes to the algorithms are then presented, incorporating new sentence features and different semantic text similarity measures, as well as combining the methods into a single classifier. Our main finding is that the introduction of new sentence features yields significantly improved F-scores for both approaches.
This thesis addresses the problem of finding implicit citations between scientific publications. Besides being interesting in their own right, these citations can be used in other problems, such as assessing an author's attitude towards a reference or summarising a report based on how it has been cited by others. We start from two recent methods, a machine-learning-based classifier and an iterative algorithm based on a graphical model. These are implemented and evaluated on a common pre-annotated dataset. A number of changes to the algorithms are presented in the form of new sentence features, different semantic text similarity measures and a way of combining the two methods. The main result of the work is that the new sentence features lead to markedly improved F-scores for both methods.
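As an illustration of what sentence features for implicit-citation detection can look like in practice, the sketch below computes a small, generic feature vector for a candidate sentence relative to an explicit citing sentence. The specific features, example sentences and thresholds are invented and are not the feature set proposed in the thesis.

```python
def sentence_features(sentence, explicit_citation_sentence, distance):
    """Generic features for deciding whether `sentence` implicitly continues a citation context."""
    tokens = sentence.lower().split()
    cited = set(explicit_citation_sentence.lower().split())
    return {
        "distance_from_explicit_citation": distance,
        "lexical_overlap_with_citing_sentence": len(set(tokens) & cited) / max(len(tokens), 1),
        "starts_with_connector": tokens[0] in {"however", "moreover", "furthermore", "their", "this"},
        "contains_third_person_pronoun": any(t in {"they", "their", "them"} for t in tokens),
        "contains_explicit_marker": "[" in sentence or "et" in tokens,
    }

citing = "Smith et al. [12] propose a graph-based model for citation analysis."
candidate = "Their model propagates labels between neighbouring sentences."
print(sentence_features(candidate, citing, distance=1))
```

Feature vectors of this kind can be fed to either a standard classifier or an iterative graph-based method, which is the comparison carried out in the thesis.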
APA, Harvard, Vancouver, ISO, and other styles