
Dissertations / Theses on the topic 'Semantic embeddings'



Consult the top 48 dissertations / theses for your research on the topic 'Semantic embeddings.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Malmberg, Jacob. "Evaluating semantic similarity using sentence embeddings." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291425.

Full text
Abstract:
Semantic similarity search is the task of searching for documents or sentences which contain semantically similar content to a user-submitted search term. This task is often carried out, for instance, when searching for information on the internet. To facilitate this, vector representations, referred to as embeddings, of both the documents to be searched and the search term must be created. Traditional approaches to creating embeddings include the term frequency-inverse document frequency (TF-IDF) algorithm. Modern approaches include neural networks, which have seen a large rise in popularity over the last few years. The BERT network, released in 2018, is a highly regarded neural network which can be used to create embeddings. Multiple variations of the BERT network have been created since its release, such as the Sentence-BERT network, which is explicitly designed to create sentence embeddings. This master's thesis is concerned with evaluating semantic similarity search using sentence embeddings produced by both traditional and modern approaches. Different experiments were carried out to contrast the different approaches used to create sentence embeddings. Since datasets designed explicitly for the types of experiments performed could not be located, commonly used datasets were modified. The results showed that the TF-IDF algorithm outperformed the neural network based approaches in almost all experiments. Among the neural networks evaluated, the Sentence-BERT network proved to be better than the BERT network. To create more generalizable results, datasets explicitly designed for the task are needed.
APA, Harvard, Vancouver, ISO, and other styles
2

Yu, Lu. "Semantic representation: from color to deep embeddings." Doctoral thesis, Universitat Autònoma de Barcelona, 2019. http://hdl.handle.net/10803/669458.

Full text
Abstract:
One of the fundamental problems of computer vision is representing images with compact, semantically relevant embeddings. These embeddings can then be used in a wide variety of applications, such as image retrieval, object detection, and video search. The main objective of this thesis is to study image embeddings from two aspects: color embeddings and deep embeddings. In the first part of the thesis we start from hand-crafted color embeddings. We propose a method to order additional color names according to their complementary nature with the basic eleven color names. This allows us to compute color name representations of arbitrary length with high discriminative power. Psychophysical experiments confirm that our proposed method outperforms baseline approaches. Secondly, we learn deep color embeddings from weakly labeled data by adding an attention strategy. The attention branch is able to correctly identify the relevant regions for each class. The advantage of our approach is that it can learn color names for specific domains for which no pixel-wise labels exist. In the second part of the thesis, we focus on deep embeddings. Firstly, we address the problem of compressing large embedding networks into small networks while maintaining similar performance. We propose to distill the metrics from a teacher network to a student network. Two new losses are introduced to model the communication from a deep teacher network to a small student network: one based on an absolute teacher, where the student aims to produce the same embeddings as the teacher, and one based on a relative teacher, where the distances between pairs of data points are communicated from the teacher to the student. In addition, various aspects of distillation have been investigated for embeddings, including hint and attention layers, semi-supervised learning, and cross-quality distillation. Finally, another aspect of deep metric learning, namely lifelong learning, is studied. We observe that some drift of the learned knowledge occurs during the training of new tasks. A method is introduced to estimate the semantic drift based on the drift experienced by the data of the current task during its training. Given this estimate, previous tasks can be compensated for the drift, thereby improving their performance. Furthermore, we show that embedding networks suffer significantly less from catastrophic forgetting than classification networks when learning new tasks.
APA, Harvard, Vancouver, ISO, and other styles
3

Moss, Adam. "Detecting Lexical Semantic Change Using Probabilistic Gaussian Word Embeddings." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-412539.

Full text
Abstract:
In this work, we test two novel methods of using word embeddings to detect lexical semantic change, attempting to overcome limitations associated with conventional approaches to this problem. Using a diachronic corpus spanning over a hundred years, we generate word embeddings for each decade with the intention of evaluating how meaning changes are represented in embeddings for the same word across time. Our approach differs from previous works in this field in that we encode words as probabilistic Gaussian distributions and bimodal probabilistic Gaussian mixtures, rather than conventional word vectors. We provide a discussion and analysis of our results, comparing the approaches we implemented with those used in previous works. We also conducted further analysis on whether additional information regarding the nature of semantic change could be discerned from particular qualities of the embeddings we generated for our experiments. In our results, we find that encoding words as probabilistic Gaussian embeddings can provide an enhanced degree of reliability with regard to detecting lexical semantic change. Furthermore, we are able to represent additional information regarding the nature of such changes through the variance of these embeddings. Encoding words as bimodal Gaussian mixtures, however, is generally unsuccessful for this task, as the mixtures prove not reliable enough at distinguishing between discrete senses to effectively detect and measure such changes. We provide potential explanations for the results we observe, and propose improvements that can be made to our approach to potentially improve performance.
APA, Harvard, Vancouver, ISO, and other styles
4

Montariol, Syrielle. "Models of diachronic semantic change using word embeddings." Electronic Thesis or Diss., université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG006.

Full text
Abstract:
In this thesis, we study lexical semantic change: temporal variations in the use and meaning of words, also called diachrony. These changes are carried by the way people use words, and mirror the evolution of various aspects of society such as its technological and cultural environment. We explore, compare and evaluate methods to build time-varying embeddings from a corpus in order to analyse language evolution. We focus on contextualised word embeddings using pre-trained language models such as BERT. We propose several approaches to extract and aggregate the contextualised representations of words over time, and quantify their level of semantic change. In particular, we address the practical aspects of these systems: the scalability of our approaches, with a view to applying them to large corpora or large vocabularies; their interpretability, by disambiguating the different uses of a word over time; and their applicability to concrete issues, for documents related to COVID-19 and corpora from the financial domain. We evaluate the efficiency of these methods quantitatively using several annotated corpora, and qualitatively by linking the detected semantic variations with real-life events and numerical data. Finally, we extend the task of semantic change detection beyond the temporal dimension. We adapt it to a bilingual setting, to study the joint evolution of a word and its translation in two corpora of different languages; and to a synchronic frame, to detect semantic variations across different sources or communities on top of the temporal variation.
APA, Harvard, Vancouver, ISO, and other styles
5

Shaik, Arshad. "Biomedical Semantic Embeddings: Using Hybrid Sentences to Construct Biomedical Word Embeddings and Their Applications." Thesis, University of North Texas, 2019. https://digital.library.unt.edu/ark:/67531/metadc1609064/.

Full text
Abstract:
Word embeddings are a useful technique that has shown enormous success in various NLP tasks, not only in the open domain but also in the biomedical domain. The biomedical domain provides various domain-specific resources and tools that can be exploited to improve the performance of these word embeddings. However, most of the research related to word embeddings in the biomedical domain focuses on the analysis of model architecture, hyper-parameters and input text. In this work, we use SemMedDB to design new sentences called 'semantic sentences'. We then use these sentences, in addition to biomedical text, as inputs to the word embedding model. This approach aims at introducing biomedical semantic types, as defined by UMLS, into the vector space of word embeddings. The semantically rich word embeddings presented here rival state-of-the-art biomedical word embeddings on both semantic similarity and relatedness metrics by up to 11%. We also demonstrate how these semantic types in word embeddings can be utilized.
APA, Harvard, Vancouver, ISO, and other styles
6

Munbodh, Mrinal. "Deriving A Better Metric To Assess the Quality of Word Embeddings Trained On Limited Specialized Corpora." University of Cincinnati / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1601995854965902.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Balzar, Ekenbäck Nils. "Evaluation of Sentence Representations in Semantic Text Similarity Tasks." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291334.

Full text
Abstract:
This thesis explores methods of constructing sentence representations for semantic text similarity using word embeddings, and benchmarks them against sentence-based evaluation test sets. Two methods were used to evaluate the representations: the STS Benchmark, and the STS Benchmark converted to a binary similarity task. Results showed that preprocessing of the word vectors could significantly boost performance in both tasks, and the study concludes that word embeddings still provide an acceptable solution for specific applications. The study also concluded that the dataset used might not be ideal for this type of evaluation, as the sentence pairs in general had a high lexical overlap. To tackle this, the study suggests that a paraphrasing dataset could act as a complement, but that further investigation would be needed.
APA, Harvard, Vancouver, ISO, and other styles
8

Zhou, Hanqing. "DBpedia Type and Entity Detection Using Word Embeddings and N-gram Models." Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/37324.

Full text
Abstract:
Nowadays, knowledge bases are used more and more in Semantic Web tasks, such as knowledge acquisition (Hellmann et al., 2013), disambiguation (Garcia et al., 2009) and named entity corpus construction (Hahm et al., 2014), to name a few. DBpedia is playing a central role on the linked open data cloud; therefore, the quality of this knowledge base is becoming a central point of focus. However, there are some issues with the quality of DBpedia. In particular, DBpedia suffers from three major types of problems: a) invalid types for entities, b) missing types for entities, and c) invalid entities in the resources’ description. In order to enhance the quality of DBpedia, it is important to detect these invalid types and resources, as well as complete missing types. The three main goals of this thesis are: a) invalid entity type detection in order to solve the problem of invalid DBpedia types for entities, b) automatic detection of the types of entities in order to solve the problem of missing DBpedia types for entities, and c) invalid entity detection in order to solve the problem of invalid entities in the resource description of a DBpedia entity. We compare several methods for the detection of invalid types, automatic typing of entities, and invalid entities detection in the resource descriptions. In particular, we compare different classification and clustering algorithms based on various sets of features: entity embedding features (Skip-gram and CBOW models) and traditional n-gram features. We present evaluation results for 358 DBpedia classes extracted from the DBpedia ontology. The main contribution of this work consists of the development of automatic invalid type detection, automatic entity typing, and automatic invalid entity detection methods using clustering and classification. Our results show that entity embedding models usually perform better than n-gram models, especially the Skip-gram embedding model.
APA, Harvard, Vancouver, ISO, and other styles
9

Felt, Paul L. "Facilitating Corpus Annotation by Improving Annotation Aggregation." BYU ScholarsArchive, 2015. https://scholarsarchive.byu.edu/etd/5678.

Full text
Abstract:
Annotated text corpora facilitate the linguistic investigation of language as well as the automation of natural language processing (NLP) tasks. NLP tasks include problems such as spam email detection, grammatical analysis, and identifying mentions of people, places, and events in text. However, constructing high quality annotated corpora can be expensive. Cost can be reduced by employing low-cost internet workers in a practice known as crowdsourcing, but the resulting annotations are often inaccurate, decreasing the usefulness of a corpus. This inaccuracy is typically mitigated by collecting multiple redundant judgments and aggregating them (e.g., via majority vote) to produce high quality consensus answers. We improve the quality of consensus labels inferred from imperfect annotations in a number of ways. We show that transfer learning can be used to derive benefit from out-dated annotations which would typically be discarded. We show that, contrary to popular preference, annotation aggregation models that take a generative data modeling approach tend to outperform those that take a conditional approach. We leverage this insight to develop csLDA, a novel annotation aggregation model that improves on the state of the art for a variety of annotation tasks. When data does not permit generative data modeling, we identify a conditional data modeling approach based on vector-space text representations that achieves state-of-the-art results on several unusual semantic annotation tasks. Finally, we identify a family of models capable of aggregating annotation data containing heterogeneous annotation types such as label frequencies and labeled features. We present a multiannotator active learning algorithm for this model family that jointly selects an annotator, data items, and annotation type.
APA, Harvard, Vancouver, ISO, and other styles
10

Landthaler, Jörg [Verfasser], Florian [Akademischer Betreuer] Matthes, Kevin D. [Gutachter] Ashley, and Florian [Gutachter] Matthes. "Improving Semantic Search in the German Legal Domain with Word Embeddings / Jörg Landthaler ; Gutachter: Kevin D. Ashley, Florian Matthes ; Betreuer: Florian Matthes." München : Universitätsbibliothek der TU München, 2020. http://d-nb.info/1216242410/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Landthaler, Jörg [Verfasser], Florian [Akademischer Betreuer] Matthes, Kevin D. [Gutachter] Ashley, and Florian [Gutachter] Matthes. "Improving Semantic Search in the German Legal Domain with Word Embeddings / Jörg Landthaler ; Gutachter: Kevin D. Ashley, Florian Matthes ; Betreuer: Florian Matthes." München : Universitätsbibliothek der TU München, 2020. http://nbn-resolving.de/urn:nbn:de:bvb:91-diss-20200605-1521744-1-5.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Tissier, Julien. "Improving methods to learn word representations for efficient semantic similarites computations." Thesis, Lyon, 2020. http://www.theses.fr/2020LYSES008.

Full text
Abstract:
Many natural language processing applications rely on word representations (also called word embeddings) to achieve state-of-the-art results. These numerical representations of the language should encode both syntactic and semantic information to perform well in downstream tasks. However, common models (word2vec, GloVe) use generic corpora like Wikipedia to learn them, and they therefore lack specific semantic information. Moreover, a large memory space is required to store them, because the number of representations to save can be in the order of a million. The topic of my thesis is to develop new learning algorithms that both improve the semantic information encoded within the representations and make them require less memory space for storage and for their application in NLP tasks. The first part of my work improves the semantic information contained in word embeddings. I developed dict2vec, a model that uses additional information from online lexical dictionaries when learning word representations. The dict2vec word embeddings perform ∼15% better than the embeddings learned by other models on word semantic similarity tasks. The second part of my work reduces the memory size of the embeddings. I developed an architecture based on an autoencoder to transform commonly used real-valued embeddings into binary embeddings, reducing their size in memory by 97% with only a loss of ∼2% in accuracy in downstream NLP tasks.
APA, Harvard, Vancouver, ISO, and other styles
13

Necşulescu, Silvia. "Automatic acquisition of lexical-semantic relations: gathering information in a dense representation." Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/374234.

Full text
Abstract:
Lexical-semantic relationships between words are key information for many NLP tasks, which require this knowledge in the form of lexical resources. This thesis addresses the acquisition of lexical-semantic relation instances. State-of-the-art systems rely on word pair representations based on patterns of contexts where two related words co-occur to detect their relation. This approach is hindered by data sparsity: even when mining very large corpora, not every semantically related word pair co-occurs, or does not co-occur frequently enough. In this work, we investigate novel representations to predict whether two words hold a lexical-semantic relation. Our intuition was that these representations should contain information about word co-occurrences combined with information about the meaning of the words involved in the relation. These two sources of information have to be the basis of a generalization strategy able to provide information even for words that do not co-occur.
APA, Harvard, Vancouver, ISO, and other styles
14

Åkerström, Joakim, and Aravena Carlos Peñaloza. "Semantiska modeller för syntetisk textgenerering - en jämförelsestudie." Thesis, Högskolan i Borås, Akademin för bibliotek, information, pedagogik och IT, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-13981.

Full text
Abstract:
The ability to express feelings in words, or to move others with them, has always been an admired and rare gift. This project involves creating a text generator able to write text in the style of remarkable men and women with this ability. This was done by training a neural network with quotes written by outstanding people such as Oscar Wilde, Mark Twain, and Charles Dickens. The network cooperates with two different semantic models, Word2Vec and One-Hot, and together the three components make up our text generator. With the generated texts we carried out a survey to collect students' opinions on the quality of the text produced by our generator. Upon examination of the results, we learned that most respondents found the texts coherent and fun to read. We also learned that the former semantic model performed better than the latter, though not by an order of magnitude.
APA, Harvard, Vancouver, ISO, and other styles
15

Silva, Allan de Barcelos. "O uso de recursos linguísticos para mensurar a semelhança semântica entre frases curtas através de uma abordagem híbrida." Universidade do Vale do Rio dos Sinos, 2017. http://www.repositorio.jesuita.org.br/handle/UNISINOS/6974.

Full text
Abstract:
Assessing Semantic Textual Similarity (STS) is one of the challenges in Natural Language Processing (NLP) and plays an increasingly important role in related applications. STS is a fundamental part of techniques and approaches in several areas, such as information retrieval, text classification, document clustering, translation, duplicate detection, and others. The literature describes experimentation almost exclusively in English, with priority given to probabilistic resources, while linguistic resources are explored only incipiently. Linguistic knowledge plays a fundamental role in the analysis of semantic textual similarity between short sentences, because exclusively probabilistic approaches fail in certain cases (e.g., identifying closely or distantly related sentences, or resolving anaphora) due to a lack of understanding of the language, a problem aggravated by the limited information available in short sentences. Therefore, it is vital to identify and apply linguistic resources to better understand what makes two or more sentences similar or not. The current work presents a hybrid approach in which distributed, lexical, and linguistic aspects are all used to evaluate semantic textual similarity between short sentences in Brazilian Portuguese. We evaluated the proposed approach on well-known and respected datasets from the literature (PROPOR 2016) and obtained good results.
APA, Harvard, Vancouver, ISO, and other styles
16

Lisena, Pasquale. "Knowledge-based music recommendation : models, algorithms and exploratory search." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS614.

Full text
Abstract:
Representing the information about music is a complex activity that involves different sub-tasks. This thesis manuscript mostly focuses on classical music, researching how to represent and exploit its information. The main goal is the investigation of strategies of knowledge representation and discovery applied to classical music, involving subjects such as Knowledge-Base population, metadata prediction, and recommender systems. We propose a complete workflow for the management of music metadata using Semantic Web technologies. We introduce a specialised ontology and a set of controlled vocabularies for the different concepts specific to music. Then, we present an approach for converting data, in order to go beyond the librarian practice currently in use, relying on mapping rules and interlinking with controlled vocabularies. Finally, we show how these data can be exploited. In particular, we study approaches based on embeddings computed on structured metadata, titles, and symbolic music for ranking and recommending music. Several demo applications have been realised for testing the previous approaches and resources.
APA, Harvard, Vancouver, ISO, and other styles
17

Nguyen, Gia Hung. "Modèles neuronaux pour la recherche d'information : approches dirigées par les ressources sémantiques." Thesis, Toulouse 3, 2018. http://www.theses.fr/2018TOU30233.

Full text
Abstract:
In this thesis, we focus on bridging the semantic gap between the document and query representations, and hence improving the matching performance. We propose to combine relational semantics from knowledge resources and distributed semantics of the corpus inferred by neural models. Our contributions consist of two main aspects: (1) Improving distributed representations of text for IR tasks. We propose two models that integrate relational semantics into the distributed representations: a) an offline model that combines two types of pre-trained representations to obtain a hybrid representation of the document; b) an online model that jointly learns distributed representations of documents, concepts and words. To better integrate relational semantics from knowledge resources, we propose two approaches to inject these relational constraints, one based on the regularization of the objective function, the other based on instances in the training text. (2) Exploiting neural networks for semantic matching of documents. We propose a neural model for document-query matching. Our neural model relies on: a) a representation of raw data that models the relational semantics of text by jointly considering objects and relations expressed in a knowledge resource, and b) an end-to-end neural architecture that learns the query-document relevance by leveraging the distributional and relational semantics of documents and queries.
APA, Harvard, Vancouver, ISO, and other styles
18

Callin, Jimmy. "Word Representations and Machine Learning Models for Implicit Sense Classification in Shallow Discourse Parsing." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-325876.

Full text
Abstract:
CoNLL 2015 featured a shared task on shallow discourse parsing. In 2016, the efforts continued with an increasing focus on sense classification. In the case of implicit sense classification, there was an interesting mix of traditional and modern machine learning classifiers using word representation models. In this thesis, we explore the performance of a number of these models, and investigate how they perform using a variety of word representation models. We show that there are large performance differences between word representation models for certain machine learning classifiers, while others are more robust to the choice of word representation model. We also show that with the right choice of word representation model, simple and traditional machine learning classifiers can reach competitive scores even when compared with modern neural network approaches.
APA, Harvard, Vancouver, ISO, and other styles
19

Scala, Simone. "Realizzazione di un motore di ricerca semantico basato sui Document Embedding." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2017.

Find full text
Abstract:
This thesis proposes the development of a semantic search engine able to extract content from a wide range of textual information. Search techniques are defined that can interpret the contextual meaning of queries formulated in natural language. The method used to carry out these searches is based on distributional semantic analysis, a topic of significant impact in the field of natural language processing. One of the most recent developments in this area has led to the formulation of methods known in the literature as Document Embedding, culminating, with Paragraph Vector and related technologies, in the ability to semantically map entire portions of text. These developments open new possibilities for implementing new techniques on which to build search engines. The contribution is therefore the realisation of an innovative question answering (QA) system based on the Document Embedding vector semantic model, able to automatically answer a question expressed in natural language. The experiment is conducted on a corpus of texts from Wikipedia. The effectiveness of this system is then measured experimentally using a test set of random questions (related to the corpus) that could be posed to a search engine. The results emerging from this analysis show that Document Embedding models are indeed suitable for the intended purpose.
APA, Harvard, Vancouver, ISO, and other styles
20

Wang, Run Fen. "Semantic Text Matching Using Convolutional Neural Networks." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-362134.

Full text
Abstract:
Semantic text matching is a fundamental task for many applications in Natural Language Processing (NLP). Traditional methods using term frequency-inverse document frequency (TF-IDF) to match exact words in documents have one strong drawback: TF-IDF is unable to capture semantic relations between closely related words, which leads to disappointing matching results. Neural networks have recently been used for various applications in NLP and have achieved state-of-the-art performance on many tasks. Recurrent Neural Networks (RNNs) have been tested on text classification and text matching, but did not gain any remarkable results, since RNNs work more effectively on short texts than on long documents. In this paper, Convolutional Neural Networks (CNNs) are applied to match texts in a semantic aspect. The model uses word embedding representations of two texts as inputs to the CNN, extracts the semantic features between the two texts, and outputs a score indicating how certain the CNN model is that they match. The results show that after some tuning of the parameters the CNN model could produce accuracy, precision, recall, and F1-scores all over 80%. This is a great improvement over the previous TF-IDF results, and further improvements could be made by using dynamic word vectors, better pre-processing of the data, generating larger and more feature-rich data sets, and further tuning of the parameters.
APA, Harvard, Vancouver, ISO, and other styles
21

Wang, Qian. "Zero-shot visual recognition via latent embedding learning." Thesis, University of Manchester, 2018. https://www.research.manchester.ac.uk/portal/en/theses/zeroshot-visual-recognition-via-latent-embedding-learning(bec510af-6a53-4114-9407-75212e1a08e1).html.

Full text
Abstract:
Traditional supervised visual recognition methods require a great number of annotated examples for each concerned class. The collection and annotation of visual data (e.g., images and videos) could be laborious, tedious and time-consuming when the number of classes involved is very large. In addition, there are such situations where the test instances are from novel classes for which training examples are unavailable in the training stage. These issues can be addressed by zero-shot learning (ZSL), an emerging machine learning technique enabling the recognition of novel classes. The key issue in zero-shot visual recognition is the semantic gap between visual and semantic representations. We address this issue in this thesis from three different perspectives: visual representations, semantic representations and the learning models. We first propose a novel bidirectional latent embedding framework for zero-shot visual recognition. By learning a latent space from visual representations and labelling information of the training examples, instances of different classes can be mapped into the latent space with the preserving of both visual and semantic relatedness, hence the semantic gap can be bridged. We conduct experiments on both object and human action recognition benchmarks to validate the effectiveness of the proposed ZSL framework. Then we extend the ZSL to the multi-label scenarios for multi-label zero-shot human action recognition based on weakly annotated video data. We employ a long short term memory (LSTM) neural network to explore the multiple actions underlying the video data. A joint latent space is learned by two component models (i.e. the visual model and the semantic model) to bridge the semantic gap. The two component embedding models are trained alternately to optimize the ranking based objectives. Extensive experiments are carried out on two multi-label human action datasets to evaluate the proposed framework. Finally, we propose alternative semantic representations for human actions towards narrowing the semantic gap from the perspective of semantic representation. A simple yet effective solution based on the exploration of web data has been investigated to enhance the semantic representations for human actions. The novel semantic representations are proved to benefit the zero-shot human action recognition significantly compared to the traditional attributes and word vectors. In summary, we propose novel frameworks for zero-shot visual recognition towards narrowing and bridging the semantic gap, and achieve state-of-the-art performance in different settings on multiple benchmarks.
APA, Harvard, Vancouver, ISO, and other styles
22

Norlund, Tobias. "The Use of Distributional Semantics in Text Classification Models : Comparative performance analysis of popular word embeddings." Thesis, Linköpings universitet, Datorseende, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-127991.

Full text
Abstract:
In the field of Natural Language Processing, supervised machine learning is commonly used to solve classification tasks such as sentiment analysis and text categorization. The classical way of representing the text has been to use the well known Bag-Of-Words representation. However lately low-dimensional dense word vectors have come to dominate the input to state-of-the-art models. While few studies have made a fair comparison of the models' sensibility to the text representation, this thesis tries to fill that gap. We especially seek insight in the impact various unsupervised pre-trained vectors have on the performance. In addition, we take a closer look at the Random Indexing representation and try to optimize it jointly with the classification task. The results show that while low-dimensional pre-trained representations often have computational benefits and have also reported state-of-the-art performance, they do not necessarily outperform the classical representations in all cases.
APA, Harvard, Vancouver, ISO, and other styles
25

Choudhary, Rishabh R. "Construction and Visualization of Semantic Spaces for Domain-Specific Text Corpora." University of Cincinnati / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1627666092811419.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Šůstek, Martin. "Word2vec modely s přidanou kontextovou informací." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2017. http://www.nusl.cz/ntk/nusl-363837.

Full text
Abstract:
This thesis is concerned with the explanation of word2vec models. Even though word2vec was introduced recently (2013), many researchers have already tried to extend, understand or at least use the model, because it provides surprisingly rich semantic information. This information is encoded in an N-dimensional vector representation and can be recalled by performing algebraic operations on the vectors. In addition, I suggest model modifications in order to obtain different word representations. To achieve that, I use public image datasets. This thesis also includes parts dedicated to a word2vec extension based on convolutional neural networks.
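The "algebraic operations" mentioned above are the well-known vector-arithmetic analogies. A minimal gensim sketch follows; the toy corpus only makes the call runnable, since the famous regularities (king - man + woman ≈ queen) emerge only from training on a large real corpus:

from gensim.models import Word2Vec

# Toy corpus; real word2vec training needs millions of tokens.
sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["a", "man", "walks"],
             ["a", "woman", "walks"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, seed=1)

# Analogy via vector arithmetic: king - man + woman -> ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))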
APA, Harvard, Vancouver, ISO, and other styles
27

Das, Manirupa. "Neural Methods Towards Concept Discovery from Text via Knowledge Transfer." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1572387318988274.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Romeo, Lauren Michele. "The Structure of the lexicon in the task of the automatic acquisition of lexical information." Doctoral thesis, Universitat Pompeu Fabra, 2015. http://hdl.handle.net/10803/325420.

Full text
Abstract:
La información de clase semántica de los nombres es fundamental para una amplia variedad de tareas del procesamiento del lenguaje natural (PLN), como la traducción automática, la discriminación de referentes en tareas como la detección y el seguimiento de eventos, la búsqueda de respuestas, el reconocimiento y la clasificación de nombres de entidades, la construcción y ampliación automática de ontologías, la inferencia textual, etc. Una aproximación para resolver la construcción y el mantenimiento de los léxicos de gran cobertura que alimentan los sistemas de PNL, una tarea muy costosa y lenta, es la adquisición automática de información léxica, que consiste en la inducción de una clase semántica relacionada con una palabra en concreto a partir de datos de su distribución obtenidos de un corpus. Precisamente, por esta razón, se espera que la investigación actual sobre los métodos para la producción automática de léxicos de alta calidad, con gran cantidad de información y con anotación de clase como el trabajo que aquí presentamos, tenga un gran impacto en el rendimiento de la mayoría de las aplicaciones de PNL. En esta tesis, tratamos la adquisición automática de información léxica como un problema de clasificación. Con este propósito, adoptamos métodos de aprendizaje automático para generar un modelo que represente los datos de distribución vectorial que, basados en ejemplos conocidos, permitan hacer predicciones de otras palabras desconocidas. Las principales preguntas de investigación que planteamos en esta tesis son: (i) si los datos de corpus proporcionan suficiente información para construir representaciones de palabras de forma eficiente y que resulten en decisiones de clasificación precisas y sólidas, y (ii) si la adquisición automática puede gestionar, también, los nombres polisémicos. Para hacer frente a estos problemas, realizamos una serie de validaciones empíricas sobre nombres en inglés. Nuestros resultados confirman que la información obtenida a partir de la distribución de los datos de corpus es suficiente para adquirir automáticamente clases semánticas, como lo demuestra un valor-F global promedio de 0,80 aproximadamente utilizando varios modelos de recuento de contextos y en datos de corpus de distintos tamaños. No obstante, tanto el estado de la cuestión como los experimentos que realizamos destacaron una serie de retos para este tipo de modelos, que son reducir la escasez de datos del vector y dar cuenta de la polisemia nominal en las representaciones distribucionales de las palabras. En este contexto, los modelos de word embedding (WE) mantienen la “semántica” subyacente en las ocurrencias de un nombre en los datos de corpus asignándole un vector. Con esta elección, hemos sido capaces de superar el problema de la escasez de datos, como lo demuestra un valor-F general promedio de 0,91 para las clases semánticas de nombres de sentido único, a través de una combinación de la reducción de la dimensionalidad y de números reales. Además, las representaciones de WE obtuvieron un rendimiento superior en la gestión de las ocurrencias asimétricas de cada sentido de los nombres de tipo complejo polisémicos regulares en datos de corpus. Como resultado, hemos podido clasificar directamente esos nombres en su propia clase semántica con un valor-F global promedio de 0,85. 
La principal aportación de esta tesis consiste en una validación empírica de diferentes representaciones de distribución utilizadas para la clasificación semántica de nombres junto con una posterior expansión del trabajo anterior, lo que se traduce en recursos léxicos y conjuntos de datos innovadores que están disponibles de forma gratuita para su descarga y uso.
Lexical semantic class information for nouns is critical for a broad variety of Natural Language Processing (NLP) tasks including, but not limited to, machine translation, discrimination of referents in tasks such as event detection and tracking, question answering, named entity recognition and classification, automatic construction and extension of ontologies, textual inference, etc. One approach to solve the costly and time-consuming manual construction and maintenance of large-coverage lexica to feed NLP systems is the Automatic Acquisition of Lexical Information, which involves the induction of a semantic class related to a particular word from distributional data gathered within a corpus. This is precisely why current research on methods for the automatic production of high-quality, information-rich, class-annotated lexica, such as the work presented here, is expected to have a high impact on the performance of most NLP applications. In this thesis, we address the automatic acquisition of lexical information as a classification problem. For this reason, we adopt machine learning methods to generate a model representing vectorial distributional data which, grounded on known examples, allows for the prediction of other, unknown words. The main research questions we investigate in this thesis are: (i) whether corpus data provides sufficient distributional information to build efficient word representations that result in accurate and robust classification decisions and (ii) whether automatic acquisition can also handle polysemous nouns. To tackle these problems, we conducted a number of empirical validations on English nouns. Our results confirmed that the distributional information obtained from corpus data is indeed sufficient to automatically acquire lexical semantic classes, demonstrated by an average overall F1-Score of almost 0.80 using diverse count-context models and on different-sized corpus data. Nonetheless, both the State of the Art and the experiments we conducted highlighted a number of challenges for this type of model, such as reducing vector sparsity and accounting for nominal polysemy in distributional word representations. In this context, Word Embeddings (WE) models maintain the “semantics” underlying the occurrences of a noun in corpus data by mapping it to a feature vector. With this choice, we were able to overcome the sparse data problem, demonstrated by an average overall F1-Score of 0.91 for single-sense lexical semantic noun classes, through a combination of reduced dimensionality and “real” numbers. In addition, the WE representations obtained a higher performance in handling the asymmetrical occurrences of each sense of regular polysemous complex-type nouns in corpus data. As a result, we were able to directly classify such nouns into their own lexical-semantic class with an average overall F1-Score of 0.85. The main contribution of this dissertation consists of an empirical validation of different distributional representations used for nominal lexical semantic classification along with a subsequent expansion of previous work, which results in novel lexical resources and data sets that have been made freely available for download and use.
APA, Harvard, Vancouver, ISO, and other styles
29

Stigeborn, Olivia. "Text ranking based on semantic meaning of sentences." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-300442.

Full text
Abstract:
Finding a suitable candidate-to-client match is an important part of a consultant company's work. It takes a lot of time and effort for the recruiters at the company to read possibly hundreds of resumes to find a suitable candidate. Natural language processing is capable of performing a ranking task where the goal is to rank the resumes so that the most suitable candidates are ranked the highest. This ensures that the recruiters are only required to look at the top-ranked resumes and can quickly get candidates out in the field. Previous research has used methods that count specific keywords in resumes and can decide whether a candidate has a certain experience or not. The main goal of this thesis is to use the semantic meaning of the text in the resumes to get a deeper understanding of a candidate's level of experience. It also evaluates whether the model can run on-device and whether the database can contain a mix of English and Swedish resumes. An algorithm was created that uses the word embedding model DistilRoBERTa, which is capable of capturing the semantic meaning of text. The algorithm was evaluated by generating job descriptions from the resumes by creating a summary of each resume. The run time, memory usage and the ranking the wanted candidate achieved were documented and used to analyze the results. When the candidate who was used to generate the job description was ranked in the top 10, the classification was considered correct. The accuracy was calculated using this method and an accuracy of 68.3% was achieved. The results show that the algorithm is capable of ranking resumes. The algorithm is able to rank both Swedish and English resumes, with an accuracy of 67.7% for Swedish resumes and 74.7% for English. The run time was fast enough at an average of 578 ms, but the memory usage was too large to make it possible to use the algorithm on-device. In conclusion, the semantic meaning of resumes can be used to rank them, and possible future work would be to combine this method with a method that counts keywords to investigate whether the accuracy would increase.
Att hitta en lämplig kandidat till kundmatchning är en viktig del av ett konsultföretags arbete. Det tar mycket tid och ansträngning för rekryterare på företaget att läsa eventuellt hundratals CV:n för att hitta en lämplig kandidat. Det finns språkteknologiska metoder för att rangordna CV:n med de mest lämpliga kandidaterna rankade högst. Detta säkerställer att rekryterare endast behöver titta på de topprankade CV:erna och snabbt kan få kandidater ut i fältet. Tidigare forskning har använt metoder som räknar specifika nyckelord i ett CV och är kapabla att avgöra om en kandidat har specifika erfarenheter. Huvudmålet med denna avhandling är att använda den semantiska innebörden av texten i CV:n för att få en djupare förståelse för en kandidats erfarenhetsnivå. Den utvärderar också om modellen kan köras på mobila enheter och om algoritmen kan rangordna CV:n oberoende av om CV:erna är på svenska eller engelska. En algoritm skapades som använder ordinbäddningsmodellen DistilRoBERTa som är kapabel att fånga textens semantiska betydelse. Algoritmen utvärderades genom att generera jobbeskrivningar från CV:n genom att skapa en sammanfattning av varje CV. Körtiden, minnesanvändningen och rankningen som den önskade kandidaten fick dokumenterades och användes för att analysera resultatet. När den kandidat som användes för att generera jobbeskrivningen rankades i topp 10 ansågs klassificeringen vara korrekt. Noggrannheten beräknades med denna metod och en noggrannhet på 68,3 % uppnåddes. Resultaten visar att algoritmen kan rangordna CV:n. Algoritmen kan rangordna både svenska och engelska CV:n med en noggrannhet på 67,7 % för svenska och 74,7 % för engelska. Körtiden var i genomsnitt 578 ms vilket skulle möjliggöra att algoritmen kan köras på mobila enheter men minnesanvändningen var för stor. Sammanfattningsvis kan den semantiska betydelsen av CV:n användas för att rangordna CV:n och ett eventuellt framtida arbete är att kombinera denna metod med en metod som räknar nyckelord för att undersöka hur noggrannheten skulle påverkas.
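A minimal sketch of this kind of semantic ranking is shown below, using the publicly available all-distilroberta-v1 sentence-transformers checkpoint as a stand-in (the thesis does not necessarily use this exact model) and hypothetical resume snippets:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-distilroberta-v1")

job_description = "Senior backend developer with Python and cloud experience"
resumes = [
    "Five years building Python microservices on AWS",
    "Graphic designer specialised in branding and print",
    "Embedded C developer for automotive systems",
]

# Encode everything into the same vector space and rank by cosine similarity.
job_emb = model.encode(job_description, convert_to_tensor=True)
res_emb = model.encode(resumes, convert_to_tensor=True)
scores = util.cos_sim(job_emb, res_emb)[0]

for score, resume in sorted(zip(scores.tolist(), resumes), reverse=True):
    print(f"{score:.3f}  {resume}")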
APA, Harvard, Vancouver, ISO, and other styles
30

Martignano, Alessandro. "Transfer learning nella classificazione di dati testuali gerarchici: approcci semantici basati su ontologie e word embeddings." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2019.

Find full text
Abstract:
In the era of digitisation and with the advent of the Semantic Web, there is an ever-growing need to classify and organise large volumes of unstructured natural-language data in order to extract from them information useful for building knowledge. However, the analysis and classification of such data is a non-trivial problem for an automatic system, as it is for a human being. To this end, we are nowadays witnessing a wide diffusion of natural language processing, text mining and sentiment analysis techniques, thanks also to recent important developments in the field of machine learning. The goal of this work is to present an alternative approach to the automatic classification of documents into hierarchical topic categories, based on the semantic relations that hold among the words composing them. These relations are drawn from a semantic knowledge base by means of purpose-built algorithms, thus making the determination of the degree of correlation completely independent of the topics used for training and even of the language. The proposed model differs from those in the classical literature in its reusability and generalisation properties, thanks to which it is able to operate, at classification time, even on topics and categories unknown at training time.
APA, Harvard, Vancouver, ISO, and other styles
31

Pierrejean, Bénédicte. "Qualitative evaluation of word embeddings : investigating the instability in neural-based models." Thesis, Toulouse 2, 2020. http://www.theses.fr/2020TOU20001.

Full text
Abstract:
La sémantique distributionnelle a récemment connu de grandes avancées avec l'arrivée des plongements de mots (word embeddings) basés sur des méthodes neuronales, qui ont rendu les modèles sémantiques plus accessibles en fournissant des méthodes d'entraînement rapides, efficaces et faciles à utiliser. Ces représentations denses d'unités lexicales basées sur l'analyse non supervisée de gros corpus sont de plus en plus utilisées dans diverses applications. Elles sont intégrées en tant que première couche dans les modèles d'apprentissage profond et sont également utilisées pour faire de l'observation qualitative en linguistique de corpus. Cependant, malgré leur popularité, il n'existe toujours pas de méthode d'évaluation des plongements de mots qui donne à la fois une vision globale et précise des différences existant entre plusieurs modèles. Dans cette thèse, nous proposons une méthodologie pour évaluer les plongements de mots. Nous fournissons également une étude détaillée des modèles entraînés avec la méthode word2vec. Dans la première partie de cette thèse, nous donnons un aperçu de l'évolution de la sémantique distributionnelle et passons en revue les différentes méthodes utilisées pour évaluer les plongements de mots. Par la suite, nous identifions les limites de ces méthodes et proposons de comparer les plongements de mots en utilisant une approche basée sur les voisins sémantiques. Nous expérimentons avec cette approche sur des modèles entraînés avec différents paramètres ou sur différents corpus. Étant donné la nature non déterministe des méthodes neuronales, nous reconnaissons les limites de cette approche et nous concentrons par la suite sur le problème de l'instabilité des voisins sémantiques dans les modèles de plongement de mots. Plutôt que d'éviter ce problème, nous choisissons de l'utiliser comme indice pour mieux comprendre les plongements de mots. Nous montrons que le problème d'instabilité n'affecte pas tous les mots de la même manière et que plusieurs traits linguistiques permettent d'expliquer une partie de ce phénomène. Ceci constitue un pas vers une meilleure compréhension du fonctionnement des modèles sémantiques vectoriels.
Distributional semantics has been revolutionized by neural-based word embedding methods such as word2vec, which made semantic models more accessible by providing fast, efficient and easy-to-use training methods. These dense representations of lexical units based on the unsupervised analysis of large corpora are more and more used in various types of applications. They are integrated as the input layer in deep learning models or they are used to draw qualitative conclusions in corpus linguistics. However, despite their popularity, there still exists no satisfying evaluation method for word embeddings that provides a global yet precise vision of the differences between models. In this PhD thesis, we propose a methodology to qualitatively evaluate word embeddings and provide a comprehensive study of models trained using word2vec. In the first part of this thesis, we give an overview of the evolution of distributional semantics and review the different methods that are currently used to evaluate word embeddings. We then identify the limits of the existing methods and propose to evaluate word embeddings using a different approach based on the variation of nearest neighbors. We experiment with the proposed method by evaluating models trained with different parameters or on different corpora. Because of the non-deterministic nature of neural-based methods, we acknowledge the limits of this approach and consider the problem of nearest-neighbor instability in word embedding models. Rather than avoiding this problem we embrace it and use it as a means to better understand word embeddings. We show that the instability problem does not impact all words in the same way and that several linguistic features are correlated with it. This is a step towards a better understanding of vector-based semantic models.
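The nearest-neighbor variation at the heart of this evaluation can be measured in a few lines: train the same word2vec configuration twice with different seeds and compare each word's neighbor lists. The sketch below uses gensim's tiny bundled toy corpus and a simple overlap score; both are assumptions for runnability, not the thesis's protocol:

from gensim.models import Word2Vec
from gensim.test.utils import common_texts  # tiny bundled toy corpus

def train(seed):
    # workers=1 keeps training deterministic for a given seed.
    return Word2Vec(common_texts, vector_size=50, min_count=1,
                    seed=seed, workers=1)

m1, m2 = train(1), train(2)

def neighbour_overlap(word, k=5):
    n1 = {w for w, _ in m1.wv.most_similar(word, topn=k)}
    n2 = {w for w, _ in m2.wv.most_similar(word, topn=k)}
    return len(n1 & n2) / k

# Words whose neighbourhoods change a lot across runs are "unstable".
for word in ["computer", "human", "system"]:
    print(word, neighbour_overlap(word))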
APA, Harvard, Vancouver, ISO, and other styles
32

Van, Tassel John Peter. "Femto-VHDL : the semantics of a subset of VHDL and its embedding in the HOL proof assistant." Thesis, University of Cambridge, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.308190.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Berndl, Emanuel [Verfasser], and Harald [Akademischer Betreuer] Kosch. "Embedding a Multimedia Metadata Model into a Workflow-driven Environment Using Idiomatic Semantic Web Technologies / Emanuel Berndl ; Betreuer: Harald Kosch." Passau : Universität Passau, 2019. http://d-nb.info/1192512022/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Warren, Jared. "Using Haskell to Implement Syntactic Control of Interference." Thesis, Kingston, Ont. : [s.n.], 2008. http://hdl.handle.net/1974/1237.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Bucher, Maxime. "Apprentissage et exploitation de représentations sémantiques pour la classification et la recherche d'images." Thesis, Normandie, 2018. http://www.theses.fr/2018NORMC250/document.

Full text
Abstract:
Dans cette thèse nous étudions différentes questions relatives à la mise en pratique de modèles d'apprentissage profond. En effet, malgré les avancées prometteuses de ces algorithmes en vision par ordinateur, leur emploi dans certains cas d'usage réels reste difficile. Une première difficulté est, pour des tâches de classification d'images, de rassembler pour des milliers de catégories suffisamment de données d'entraînement pour chacune des classes. C'est pourquoi nous proposons deux nouvelles approches adaptées à ce scénario d'apprentissage, appelé « zero-shot learning ». L'utilisation d'information sémantique pour modéliser les classes permet de définir les modèles par description, par opposition à une modélisation à partir d'un ensemble d'exemples, et rend possible la modélisation sans donnée de référence. L'idée fondamentale du premier chapitre est d'obtenir une distribution d'attributs optimale grâce à l'apprentissage d'une métrique, capable à la fois de sélectionner et de transformer la distribution des données originales. Dans le chapitre suivant, contrairement aux approches standards de la littérature qui reposent sur l'apprentissage d'un espace d'intégration commun, nous proposons de générer des caractéristiques visuelles à partir d'un générateur conditionnel. Une fois générés, ces exemples artificiels peuvent être utilisés conjointement avec des données réelles pour l'apprentissage d'un classifieur discriminant. Dans une seconde partie de ce manuscrit, nous abordons la question de l'intelligibilité des calculs pour les tâches de vision par ordinateur. En raison des nombreuses et complexes transformations des algorithmes profonds, il est difficile pour un utilisateur d'interpréter le résultat retourné. Notre proposition est d'introduire un « goulot d'étranglement sémantique » dans le processus de traitement. La représentation de l'image est exprimée entièrement en langage naturel, tout en conservant l'efficacité des représentations numériques. L'intelligibilité de la représentation permet à un utilisateur d'examiner sur quelle base l'inférence a été réalisée et ainsi d'accepter ou de rejeter la décision suivant sa connaissance et son expérience humaine.
In this thesis, we examine some practical difficulties of deep learning models. Indeed, despite the promising results in computer vision, implementing them in some situations raises some questions. For example, in classification tasks where thousands of categories have to be recognised, it is sometimes difficult to gather enough training data for each category. We propose two new approaches for this learning scenario, called "zero-shot learning". We use semantic information to model classes, which allows us to define models by description, as opposed to modelling from a set of examples. In the first chapter we propose to optimize a metric in order to transform the distribution of the original data and to obtain an optimal attribute distribution. In the following chapter, unlike the standard approaches of the literature that rely on the learning of a common embedding space, we propose to generate visual features from a conditional generator. The artificial examples can be used in addition to real data for learning a discriminative classifier. In the second part of this thesis, we address the question of computational intelligibility for computer vision tasks. Due to the many and complex transformations of deep learning algorithms, it is difficult for a user to interpret the returned prediction. Our proposition is to introduce what we call a "semantic bottleneck" in the processing pipeline, which is a crossing point at which the representation of the image is entirely expressed in natural language, while retaining the efficiency of numerical representations. This semantic bottleneck makes it possible to detect failure cases in the prediction process, so as to accept or reject the decision.
APA, Harvard, Vancouver, ISO, and other styles
36

Dergachyova, Olga. "Knowledge-based support for surgical workflow analysis and recognition." Thesis, Rennes 1, 2017. http://www.theses.fr/2017REN1S059/document.

Full text
Abstract:
L'assistance informatique est devenue une partie indispensable de la réalisation des procédures chirurgicales modernes. Le désir de créer une nouvelle génération de blocs opératoires intelligents a incité les chercheurs à explorer les problèmes de perception et de compréhension automatique de la situation chirurgicale. Dans ce contexte de prise de conscience de la situation, un domaine de recherche en plein essor adresse la reconnaissance automatique du flux chirurgical. De grands progrès ont été réalisés pour la reconnaissance des phases et des gestes chirurgicaux. Pourtant, il existe encore un vide entre ces deux niveaux de granularité dans la hiérarchie du processus chirurgical. Très peu de recherches se concentrent sur les activités chirurgicales portant des informations sémantiques vitales pour la compréhension de la situation. Deux facteurs importants entravent la progression. Tout d'abord, la reconnaissance et la prédiction automatique des activités chirurgicales sont des tâches très difficiles en raison de la courte durée d'une activité, de leur grand nombre et d'un flux de travail très complexe et d'une large variabilité. Deuxièmement, une quantité très limitée de données cliniques ne fournit pas suffisamment d'informations pour un apprentissage réussi et une reconnaissance précise. À notre avis, avant de reconnaître les activités chirurgicales, une analyse soigneuse des éléments qui composent l'activité est nécessaire pour choisir les bons signaux et les capteurs qui faciliteront la reconnaissance. Nous avons utilisé une approche d'apprentissage profond pour évaluer l'impact de différents éléments sémantiques de l'activité sur sa reconnaissance. Grâce à une étude approfondie, nous avons déterminé un ensemble minimal d'éléments suffisants pour une reconnaissance précise. Les informations sur la structure anatomique et l'instrument chirurgical sont de première importance. Nous avons également abordé le problème de la carence en matière de données en proposant des méthodes de transfert de connaissances à partir d'autres domaines ou chirurgies. Les méthodes de « word embedding » et d'apprentissage par transfert ont été proposées. Elles ont démontré leur efficacité sur la tâche de prédiction de l'activité suivante, offrant une augmentation de précision de 22 %. De plus, des observations pertinentes sur la pratique chirurgicale ont été faites au cours de l'étude.
Computer assistance has become an indispensable part of modern surgical procedures. The desire to create a new generation of intelligent operating rooms has incited researchers to explore the problems of automatic perception and understanding of surgical situations. Situation awareness includes automatic recognition of the surgical workflow. Great progress has been achieved in the recognition of surgical phases and gestures. Yet, there is still a blank between these two granularity levels in the hierarchy of the surgical process. Very little research is focused on surgical activities carrying important semantic information vital for situation understanding. Two important factors impede progress. First, automatic recognition and prediction of surgical activities is a highly challenging task due to the short duration of activities, their great number and a very complex workflow with a multitude of possible execution and sequencing ways. Secondly, the very limited amount of clinical data provides not enough information for successful learning and accurate recognition. In our opinion, before recognizing surgical activities, a careful analysis of the elements that compose an activity is necessary in order to choose the right signals and sensors that will facilitate recognition. We used a deep learning approach to assess the impact of different semantic elements of an activity on its recognition. Through an in-depth study we determined a minimal set of elements sufficient for accurate recognition. Information about the operated anatomical structure and the surgical instrument was shown to be the most important. We also addressed the problem of data deficiency by proposing methods for the transfer of knowledge from other domains or surgeries. Methods based on word embedding and transfer learning were proposed. They demonstrated their effectiveness on the task of next-activity prediction, offering a 22% increase in accuracy. In addition, pertinent observations about surgical practice were made during the study. In this work, we also addressed the problem of insufficient and improper validation of recognition methods. We proposed new validation metrics and approaches for assessing performance that connect methods to targeted applications and better characterize the capacities of a method. The work described in this thesis aims at clearing the obstacles blocking the progress of the domain and proposes a new perspective on the problem of surgical workflow recognition.
APA, Harvard, Vancouver, ISO, and other styles
37

Gränsbo, Gustav. "Word Clustering in an Interactive Text Analysis Tool." Thesis, Linköpings universitet, Interaktiva och kognitiva system, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-157497.

Full text
Abstract:
A central operation of users of the text analysis tool Gavagai Explorer is to look through a list of words and arrange them in groups. This thesis explores the use of word clustering to automatically arrange the words in groups intended to help users. A new word clustering algorithm is introduced, which attempts to produce word clusters tailored to be small enough for a user to quickly grasp the common theme of the words. The proposed algorithm computes similarities among words using word embeddings, and clusters them using hierarchical graph clustering. Multiple variants of the algorithm are evaluated in an unsupervised manner by analysing the clusters they produce when applied to 110 data sets previously analysed by users of Gavagai Explorer. A supervised evaluation is performed to compare clusters to the groups of words previously created by users of Gavagai Explorer. Results show that it was possible to choose a set of hyperparameters deemed to perform well across most data sets in the unsupervised evaluation. These hyperparameters also performed among the best on the supervised evaluation. It was concluded that the choice of word embedding and graph clustering algorithm had little impact on the behaviour of the algorithm. Rather, limiting the maximum size of clusters and filtering out similarities between words had a much larger impact on behaviour.
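The recipe sketched in the abstract (similarities from word embeddings, weak edges filtered out, then a clustering step) can be illustrated as follows; the vectors are random stand-ins and scikit-learn's agglomerative clustering stands in for the thesis's hierarchical graph clustering (the cluster-size cap is omitted):

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
words = ["price", "cost", "cheap", "support", "helpdesk", "agent"]
vectors = rng.normal(size=(len(words), 50))  # stand-in word embeddings

# Similarity graph with weak edges filtered out by a threshold.
sim = cosine_similarity(vectors)
sim[sim < 0.0] = 0.0

# Cluster on distances derived from the filtered similarities.
# (sklearn >= 1.2; older versions use affinity="precomputed" instead.)
dist = 1.0 - sim
labels = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average").fit_predict(dist)

for label in sorted(set(labels)):
    print(label, [w for w, l in zip(words, labels) if l == label])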
APA, Harvard, Vancouver, ISO, and other styles
38

Gkotse, Blerina. "Ontology-based Generation of Personalised Data Management Systems : an Application to Experimental Particle Physics." Thesis, Université Paris sciences et lettres, 2020. http://www.theses.fr/2020UPSLM017.

Full text
Abstract:
Ce travail de thèse vise à combler le fossé entre les domaines de la sémantique du Web et de la physique des particules expérimentales. En prenant comme cas d'utilisation un type spécifique d'expérience de physique, les expériences d'irradiation utilisées pour tester la résistance des composants au rayonnement, un modèle de domaine, ce qui, dans le domaine de la sémantique du Web, est appelé ontologie, a été créé pour décrire les principaux concepts de la gestion des données des expériences d'irradiation. Puis, en s'appuyant sur ce type de formalisation, une méthodologie a été conçue pour réaliser automatiquement la génération de systèmes de gestion de données fondés sur des ontologies ; elle a été utilisée pour générer des interfaces utilisateur pour l'ontologie IEDM introduite précédemment. Dans la dernière partie de ce travail de thèse, nous nous sommes penchés sur l'utilisation des préférences d'affichage des interfaces-utilisateur (UI), stockées en tant qu'instances d'une ontologie de description d'interfaces que nous avons développée pour enrichir IEDM. Nous introduisons une nouvelle méthode d'encodage de ces données, instances d'ontologie, en tant que vecteurs de plongement (``embeddings'') qui pourront être utilisés pour réaliser, à terme, des interfaces-utilisateur personnalisées
This thesis work aims at bridging the gap between the fields of Web Semantics and Experimental Particle Physics. Taking as a use case a specific type of physics experiment, namely the irradiation experiments used for assessing the resistance of components to radiation, a domain model, what in Web Semantics is called an ontology, has been created for describing the main concepts underlying the data management of irradiation experiments. Using such a formalisation, a methodology has been introduced for the automatic generation of data management systems based on ontologies, and used to generate a web application for IEDM, the previously introduced ontology. In the last part of this thesis work, we enrich IEDM with a UI-dedicated ontology for storing user-interface (UI) display preferences as ontology instances, and introduce a method that represents these instances as feature vectors (embeddings) for recommending personalised UIs.
APA, Harvard, Vancouver, ISO, and other styles
39

Aliane, Nourredine. "Evaluation des représentations vectorielles de mots." Thesis, Paris 8, 2019. http://www.theses.fr/2019PA080014.

Full text
Abstract:
Dans le traitement des langues, la représentation vectorielle des mots est une question clé, permettant l'emploi d'algorithmes basés sur des modèles mathématiques. Récemment ont émergé de nouvelles méthodes de vectorisation et leur évaluation est cruciale. Les évaluations actuelles portent surtout sur l'anglais, d’où le besoin d’évaluations multilingues. Notre travail porte sur la généralisation des évaluations, leur comparaison, l'élaboration d'évaluations nouvelles, et sur WordNet, ressource multilingue.Nous avons choisi 6 vectorisations : CBOW, SkipGram, GloVe, une plus ancienne comme base, et deux plus récentes. Les évaluations sont directes, évaluant avec un gold standard, ou indirectes, évaluant une application produite avec ces vectorisations. Comme méthode indirecte, nous prenons la catégorisation sémantique avec des algorithmes de clustering pour comparer les vectorisations sous-jacentes. Les algorithmes choisis sont : le plus utilisé (Kmeans), un neuronal (SOM) et un probabiliste (EM).Notre système applique les évaluations sur des corpus en anglais, français et arabe, et compare les vectorisations. Nous proposons 5 méthodes d'évaluation, dont 4 fondées sur WordNet, et un protocole d’évaluation par sondage. Nos résultats donnent trois classements des méthodes validés sur ces langues, s’accordant sur plusieurs points décisifs, et invalident certaines des évaluations existantes. Pour nos propres évaluations, le protocole est validé, et, de nos 5 méthodes, une a été invalidée (nous avons analysé les causes de l'échec), une a été validée pour l'anglais et le français, mais pas pour l'arabe, deux ont été validées sur les trois langues, et une reste à explorer
In Natural Language Processing, the vectorization of words is a key issue, enabling the use of algorithms based on mathematical models. Recently, new vectorization methods have emerged, and evaluating their quality is crucial. Current evaluations mostly target English, hence the need for multilingual evaluations. Our work covers the generalization of evaluation methods, their comparison, the design of new evaluations, and WordNet as a multilingual resource used for evaluation. We chose six vectorization methods: CBOW, SkipGram, GloVe, an older method as a baseline, and two more recent methods. Evaluations can be direct, comparing with some gold standard, or indirect, evaluating the result of an application produced with some vectorization. As an indirect method, we chose semantic clustering of words to compare the underlying vectorizations. The chosen clustering algorithms were: the most widely used one (Kmeans), a neural one (SOM) and a probabilistic one (EM). Our system applies the evaluation methods to large corpora in English, French and Arabic, then compares the underlying vectorizations. We propose five new evaluation methods, four of them based on WordNet, and one new protocol for polling. Our results yield three vectorization orderings validated on these languages, agreeing on several decisive points, and invalidate some existing evaluations. As for our own evaluations, the protocol is validated; of our five methods, one is invalidated (and the reason analyzed), one is validated for English and French but not Arabic, two are validated on the three languages, and one is left for further exploration.
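Such an indirect evaluation can be sketched in a few lines: cluster the words under each vectorization and score the clusters against a gold grouping. Everything below is a hypothetical stand-in (random embedding matrices, an invented gold clustering, purity as one simple comparison score among several possible):

import numpy as np
from sklearn.cluster import KMeans

def purity(labels_pred, labels_gold):
    # Fraction of words falling in the majority gold class of their cluster.
    total = 0
    for c in set(labels_pred):
        members = labels_gold[labels_pred == c]
        total += np.bincount(members).max()
    return total / len(labels_gold)

rng = np.random.default_rng(0)
gold = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])     # hypothetical semantic classes
embeddings = {"cbow": rng.normal(size=(9, 50)),   # stand-ins for real models
              "glove": rng.normal(size=(9, 50))}

for name, X in embeddings.items():
    pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(name, purity(pred, gold))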
APA, Harvard, Vancouver, ISO, and other styles
40

Lin, Wei-Rou, and 林瑋柔. "Learning and Exploring Sequential Visual-Semantic Embeddings from Visual Story Ordering." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/c4vzq4.

Full text
Abstract:
Master's thesis
National Taiwan University
Graduate Institute of Computer Science and Information Engineering
Academic year 106 (2017/18)
As more and more text-image intertwined stories, such as blog posts, are generated on the internet, we are curious about the similarities and differences between the information carried by the two modalities. Thus, we introduce the visual story ordering problem: to jointly order the images and text of a visual story. We handle the problem with a model that embeds text and images into the same space. We try several models to deal with the problem, including pairwise and listwise approaches, and employ the result of coreference resolution as a baseline. In addition, we implement a reader-processor-writer architecture and a self-attention mechanism. We further propose to decode with a bidirectional model using bidirectional beam search. We experiment with our methods on VIST, a visual storytelling dataset (Huang et al., 2016). The results show that bidirectional models outperform unidirectional models, and that models trained with images outperform models trained without images on a text-only subset. We also find that our embedding narrows the distance between images and their corresponding story sentences even though we do not align the two modalities directly. In the future, we can test the effectiveness of our model on different datasets, explore the bidirectional inference mechanism more deeply, and augment our model with more functionality by adapting existing models.
APA, Harvard, Vancouver, ISO, and other styles
41

Pombo, José Luís Fava de Matos. "Landing on the right job : a machine learning approach to match candidates with jobs applying semantic embeddings." Master's thesis, 2019. http://hdl.handle.net/10362/60405.

Full text
Abstract:
Project Work presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
Job application screening is a challenging and time-consuming task to execute manually. For recruiting companies such as Landing.Jobs it poses constraints on the ability to scale the business. Some systems have been built for assisting recruiters screening applications, but they tend to overlook the challenges related to natural language. On the other side, most people nowadays, especially in the IT sector, use the Internet to look for jobs; however, given the huge number of job postings online, it can be complicated for a candidate to short-list the right ones to apply to. In this work we test a collection of Machine Learning algorithms and, through the usage of cross-validation, we calibrate the most important hyper-parameters of each algorithm. The learning algorithms attempt to learn what makes a successful match between candidate profile and job requirements, using for training historical data of selected/rejected applications in the screening phase. The features we use for building our models include the similarities between the job requirements and the candidate profile in dimensions such as skills, profession and location, and a set of job features which intend to capture the experience level, salary expectations, among others. In a first set of experiments, our best results emerge from the application of the Multilayer Perceptron algorithm (also known as a feed-forward neural network). After this, we improve the skills-matching feature by applying techniques for semantically embedding required/offered skills in order to tackle problems such as synonyms and typos, which artificially degrade the similarity between job profile and candidate profile and degrade the overall quality of the results. Through the usage of the word2vec algorithm for embedding skills and a Multilayer Perceptron to learn the overall matching, we obtain our best results. We believe our results could be further improved by extending the idea of semantic embedding to other features and by finding candidates with job preferences similar to the target candidate's, building upon that a richer representation of the candidate profile. We consider that the final model we present in this work can be deployed in production as a first-level tool for doing the heavy lifting of screening all applications, then passing the top N matches for manual inspection. Also, the results of our model can be used to complement any recommendation system in place, by simply running the model encoding the profiles of all candidates in the database upon any new job opening and recommending the jobs to the candidates who yield the highest matching probability.
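A compressed sketch of that pipeline follows: word2vec over skill co-occurrences yields a semantic skills-similarity feature that is tolerant to synonyms, and an MLP learns the screening decision from the match features. The skill lists, feature layout and labels are toy assumptions, not Landing.Jobs' production data or model:

import numpy as np
from gensim.models import Word2Vec
from sklearn.neural_network import MLPClassifier

# Skill "sentences": skills listed together on the same profile or job ad.
skill_lists = [["python", "django", "aws"], ["python", "flask", "docker"],
               ["photoshop", "illustrator"], ["aws", "docker", "kubernetes"]]
w2v = Word2Vec(skill_lists, vector_size=25, min_count=1, seed=1, workers=1)

def skills_similarity(required, offered):
    # Cosine similarity between mean skill vectors; near-matches and
    # synonyms score well even without an exact string match.
    a = np.mean([w2v.wv[s] for s in required], axis=0)
    b = np.mean([w2v.wv[s] for s in offered], axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Match features per application: [skills similarity, seniority gap, remote ok]
X = np.array([[0.9, 0.0, 1.0], [0.4, 2.0, 0.0], [0.8, 1.0, 1.0], [0.2, 3.0, 0.0]])
y = np.array([1, 0, 1, 0])   # 1 = selected at screening, 0 = rejected

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=0).fit(X, y)
print(skills_similarity(["python", "aws"], ["python", "docker"]))
print(clf.predict_proba(X[:1]))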
APA, Harvard, Vancouver, ISO, and other styles
42

Silva, André de Vasconcelos Santos. "Sparse distributed representations as word embeddings for language understanding." Master's thesis, 2018. http://hdl.handle.net/10071/18245.

Full text
Abstract:
Word embeddings are vector representations of words that capture semantic and syntactic similarities between them. Similar words tend to have closer vector representations in an N-dimensional space considering, for instance, the Euclidean distance between the points associated with the word vector representations in a continuous vector space. This property makes word embeddings valuable in several Natural Language Processing tasks, from word analogy and similarity evaluation to the more complex text categorization, summarization or translation tasks. Typically, state-of-the-art word embeddings are dense vector representations, with low dimensionality varying from tens to hundreds of floating-point dimensions, usually obtained from unsupervised learning on considerable amounts of text data by training and optimizing an objective function of a neural network. This work presents a methodology to derive word embeddings as binary sparse vectors, i.e. word vector representations with high dimensionality, sparse representation and binary features (composed only of ones and zeros). The proposed methodology tries to overcome some disadvantages associated with state-of-the-art approaches, namely the size of the corpus needed for training the model, while presenting comparable evaluations in several Natural Language Processing tasks. Results show that high-dimensional sparse binary vector representations, obtained from a very limited amount of training data, achieve comparable performance in intrinsic similarity and categorization tasks, whereas in analogy tasks good results are obtained only for noun categories. Our embeddings outperformed eight state-of-the-art word embeddings in word similarity tasks, and two word embeddings in categorization tasks.
A designação word embeddings refere-se a representações vetoriais das palavras que capturam as similaridades semânticas e sintáticas entre estas. Palavras similares tendem a ser representadas por vetores próximos num espaço N dimensional considerando, por exemplo, a distância Euclidiana entre os pontos associados a estas representações vetoriais num espaço vetorial contínuo. Esta propriedade, torna as word embeddings importantes em várias tarefas de Processamento Natural da Língua, desde avaliações de analogia e similaridade entre palavras, às mais complexas tarefas de categorização, sumarização e tradução automática de texto. Tipicamente, as word embeddings são constituídas por vetores densos, de dimensionalidade reduzida. São obtidas a partir de aprendizagem não supervisionada, recorrendo a consideráveis quantidades de dados, através da otimização de uma função objetivo de uma rede neuronal. Este trabalho propõe uma metodologia para obter word embeddings constituídas por vetores binários esparsos, ou seja, representações vetoriais das palavras simultaneamente binárias (e.g. compostas apenas por zeros e uns), esparsas e com elevada dimensionalidade. A metodologia proposta tenta superar algumas desvantagens associadas às metodologias do estado da arte, nomeadamente o elevado volume de dados necessário para treinar os modelos, e simultaneamente apresentar resultados comparáveis em várias tarefas de Processamento Natural da Língua. Os resultados deste trabalho mostram que estas representações, obtidas a partir de uma quantidade limitada de dados de treino, obtêm performances consideráveis em tarefas de similaridade e categorização de palavras. Por outro lado, em tarefas de analogia de palavras apenas se obtém resultados consideráveis para a categoria gramatical dos substantivos. As word embeddings obtidas com a metodologia proposta, e comparando com o estado da arte, superaram a performance de oito word embeddings em tarefas de similaridade, e de duas word embeddings em tarefas de categorização de palavras.
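One classical way to obtain vectors with the three properties named above (high-dimensional, sparse, binary) is a random-indexing-style construction. The sketch below accumulates sparse random context signatures over a toy corpus and binarizes by keeping the most activated dimensions; the corpus and parameters are assumptions for illustration, not the thesis's exact method:

import numpy as np

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
dim, nnz = 1000, 10   # hypothetical: 1000-d signatures with 10 active bits
rng = np.random.default_rng(0)

vocab = sorted({w for s in corpus for w in s})
# Fixed sparse random signature ("index vector") per word.
signature = {w: rng.choice(dim, size=nnz, replace=False) for w in vocab}

# Each word accumulates the signatures of its in-sentence neighbours.
counts = {w: np.zeros(dim) for w in vocab}
for sent in corpus:
    for i, w in enumerate(sent):
        for j, ctx in enumerate(sent):
            if i != j:
                counts[w][signature[ctx]] += 1

def binarize(vec, keep=20):
    # Keep only the most activated dimensions -> sparse binary embedding.
    out = np.zeros(dim, dtype=np.uint8)
    top = np.argsort(vec)[-keep:]
    out[top] = vec[top] > 0
    return out

emb = {w: binarize(v) for w, v in counts.items()}
# Similarity as the overlap of active bits (here: "cat" vs "dog").
print(int(emb["cat"] @ emb["dog"]))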
APA, Harvard, Vancouver, ISO, and other styles
43

Beland-Leblanc, Samuel. "Learning discrete word embeddings to achieve better interpretability and processing efficiency." Thesis, 2020. http://hdl.handle.net/1866/25464.

Full text
Abstract:
L'omniprésente utilisation des plongements de mot dans le traitement des langues naturelles est la preuve de leur utilité et de leur capacité d'adaptation à une multitude de tâches. Cependant, leur nature continue est une importante limite en termes de calculs, de stockage en mémoire et d'interprétation. Dans ce travail de recherche, nous proposons une méthode pour apprendre directement des plongements de mot discrets. Notre modèle est une adaptation d'une nouvelle méthode de recherche pour base de données avec des techniques dernier cri en traitement des langues naturelles comme les Transformers et les LSTM. En plus d'obtenir des plongements nécessitant une fraction des ressources informatiques nécessaires à leur stockage et leur traitement, nos expérimentations suggèrent fortement que nos représentations apprennent des unités de base du sens dans l'espace latent qui sont analogues à des morphèmes. Nous appelons ces unités des sememes, qui, de l'anglais semantic morphemes, veut dire morphèmes sémantiques. Nous montrons que notre modèle a un grand potentiel de généralisation et qu'il produit des représentations latentes montrant de fortes relations sémantiques et conceptuelles entre les mots apparentés.
The ubiquitous use of word embeddings in Natural Language Processing is proof of their usefulness and adaptivity to a multitude of tasks. However, their continuous nature is prohibitive in terms of computation, storage and interpretation. In this work, we propose a method of learning discrete word embeddings directly. The model is an adaptation of a novel database searching method using state-of-the-art natural language processing techniques like Transformers and LSTMs. On top of obtaining embeddings requiring a fraction of the resources to store and process, our experiments strongly suggest that our representations learn basic units of meaning in latent space akin to lexical morphemes. We call these units sememes, i.e., semantic morphemes. We demonstrate that our model has a great generalization potential and outputs representations showing strong semantic and conceptual relations between related words.
APA, Harvard, Vancouver, ISO, and other styles
44

"Video2Vec: Learning Semantic Spatio-Temporal Embedding for Video Representations." Master's thesis, 2016. http://hdl.handle.net/2286/R.I.40765.

Full text
Abstract:
High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that maintain high-level information in the videos. Many video feature extraction algorithms have been proposed, such as STIP, HOG3D, and Dense Trajectories. These algorithms are often referred to as “handcrafted” features as they were deliberately designed based on some reasonable considerations. However, these algorithms may fail when dealing with high-level tasks or complex scene videos. Due to the success of using deep convolutional neural networks (CNNs) to extract global representations for static images, researchers have been using similar techniques to tackle video contents. Typical techniques first extract spatial features by processing raw images using deep convolutional architectures designed for static image classification. Then simple average, concatenation or classifier-based fusion/pooling methods are applied to the extracted features. I argue that features extracted in such ways do not acquire enough representative information since videos, unlike images, should be characterized as a temporal sequence of semantically coherent visual contents and thus need to be represented in a manner considering both semantic and spatio-temporal information. In this thesis, I propose a novel architecture to learn semantic spatio-temporal embedding for videos to support high-level video analysis. The proposed method encodes video spatial and temporal information separately by employing a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion) followed by their corresponding Fully Connected Gated Recurrent Unit (FC-GRU) encoders for capturing the longer-term temporal structure of the CNN features. The resultant spatio-temporal representation (a vector) is used to learn a mapping via a Fully Connected Multilayer Perceptron (FC-MLP) to the word2vec semantic embedding space, leading to a semantic interpretation of the video vector that supports high-level analysis. I evaluate the usefulness and effectiveness of this new video representation by conducting experiments on action recognition, zero-shot video classification, and semantic (word-to-video) video retrieval, using the UCF101 action recognition dataset.
Dissertation/Thesis
Masters Thesis Computer Science 2016
APA, Harvard, Vancouver, ISO, and other styles
45

Chen, Hsiao-Yi, and 陳曉毅. "A Semantic Search over Encrypted Cloud Data based on Word Embedding." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/7b4m86.

Full text
Abstract:
Master's thesis
National Taiwan University of Science and Technology
Department of Computer Science and Information Engineering
Academic year 107 (2018/19)
The services of cloud storage have become very popular in recent years. Given their low cost and high capacity, people are inclined to move their data from local computers to remote facilities such as cloud servers. The majority of the existing methods for searching data on the cloud concentrate on keyword-based search schemes. With the rise of information-security awareness, data owners want the data placed on the cloud server to remain private and not be snooped on by untrusted users, and users also want their query content not to be recorded by an untrusted server. Therefore, encrypting both data and queries is the most common approach. However, the encrypted ciphertext loses the relationships present in the original plaintext, which causes many difficulties for keyword search. In addition, most existing search methods cannot efficiently retrieve the information the user is really interested in from the user's query keywords. To address these problems, this study proposes a word-embedding-based semantic search scheme for searching documents on the cloud. The word embedding model is implemented by a neural network, which learns the semantic relationships between words in the corpus and expresses the words as vectors. Using the word embedding model, a document index vector and a query vector can be generated. The proposed scheme encrypts the query vector and the index vector into ciphertext, preserving the efficiency of the search while protecting the privacy of the user and the security of the document.
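The trick that lets a server rank over ciphertext in schemes of this family can be shown in a few lines. The sketch below is a simplified inner-product-preserving encryption in the spirit of secure kNN constructions, offered as an assumption about the general mechanism rather than the thesis's exact scheme: encrypting the index vector with an invertible secret matrix and the query with its inverse leaves the inner product, and hence the ranking, unchanged.

import numpy as np

rng = np.random.default_rng(0)
dim = 8

p = rng.normal(size=dim)   # document index vector (embedding)
q = rng.normal(size=dim)   # query vector (embedding)

# Secret key: a random matrix M (invertible with probability 1).
M = rng.normal(size=(dim, dim))

enc_p = M.T @ p               # encrypted index
enc_q = np.linalg.inv(M) @ q  # encrypted query (trapdoor)

# (M^T p) . (M^{-1} q) = p^T M M^{-1} q = p . q, so the server can rank
# documents without seeing p or q in the clear.
print(np.allclose(enc_p @ enc_q, p @ q))  # True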
APA, Harvard, Vancouver, ISO, and other styles
46

Ke, Hao, and 葛浩. "Some approaches of combining word embedding and lexicalresource for semantic relateness mesurement." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/87144345989462382585.

Full text
Abstract:
Master's thesis
National Taiwan University
Graduate Institute of Computer Science and Information Engineering
Academic year 104 (2015/16)
In this thesis, we propose three different approaches to measure semantic relatedness: (1) boost the performance of the GloVe word embedding by removing or transforming abnormal dimensions; (2) linearly combine the path information extracted from WordNet with the word embedding; (3) utilize the word embedding and twelve linguistic features extracted from WordNet as features for support vector regression. We conduct our experiments on six benchmark data sets. The evaluation measure computes the Pearson and Spearman correlation between the output of our methods and the ground truth. We report our results together with three state-of-the-art approaches. The experimental results show that our methods outperform the state-of-the-art approaches in most of the benchmark data sets.
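Approach (2) amounts to a weighted sum of two relatedness signals. A minimal sketch with NLTK's WordNet and random stand-in word vectors follows; the mixing weight alpha is illustrative, not the thesis's tuned value, and the WordNet corpus is assumed to be downloaded:

import numpy as np
from nltk.corpus import wordnet as wn

rng = np.random.default_rng(0)
# Stand-in embeddings; in practice these would be e.g. GloVe vectors.
vecs = {w: rng.normal(size=50) for w in ["car", "automobile", "banana"]}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def wordnet_path(w1, w2):
    # Best path similarity over all synset pairs of the two words.
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def relatedness(w1, w2, alpha=0.5):
    # Linear combination of corpus-based and lexicon-based signals.
    return alpha * cosine(vecs[w1], vecs[w2]) + (1 - alpha) * wordnet_path(w1, w2)

print(relatedness("car", "automobile"), relatedness("car", "banana"))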
APA, Harvard, Vancouver, ISO, and other styles
47

Hwang, Sung Ju. "Discriminative object categorization with external semantic knowledge." 2013. http://hdl.handle.net/2152/21320.

Full text
Abstract:
Visual object category recognition is one of the most challenging problems in computer vision. Even assuming that we can obtain a near-perfect instance-level representation with the advances in visual input devices and low-level vision techniques, object categorization still remains a difficult problem because it requires drawing boundaries between instances in a continuous world, where the boundaries are solely defined by human conceptualization. Object categorization is essentially a perceptual process that takes place in a human-defined semantic space. In this semantic space, the categories reside not in isolation but in relation to others. Some categories are similar, grouped, or co-occurring, and some are not. Despite this semantic nature of object categorization, however, most of today's automatic visual category recognition systems rely only on the category labels when training discriminative recognition models with statistical machine learning techniques. In many cases this can mislead the recognition model into learning incorrect associations between visual features and the semantic labels, essentially by overfitting to training-set biases, which limits the model's prediction power on new test instances. Using semantic knowledge has great potential to benefit object category recognition. First, semantic knowledge can guide the training model to learn correct associations between visual features and the categories. Second, semantics provide much richer information beyond the membership information given by the labels, in the form of inter-category and category-attribute distances, relations, and structures. Finally, semantic knowledge scales well, since the relations between categories grow richer as the number of categories increases. My goal in this thesis is to learn discriminative models for categorization that leverage semantic knowledge for object recognition, with a special focus on the semantic relationships among different categories and concepts. To this end, I explore three semantic sources, namely attributes, taxonomies, and analogies, and I show how to incorporate them into the original discriminative model as a form of structural regularization. In particular, for each form of semantic knowledge I present a feature learning approach that defines a semantic embedding to support the object categorization task. The regularization penalizes models that deviate from the known structures given by the semantic knowledge. The first semantic source I explore is attributes, which are human-describable semantic characteristics of an instance. While existing work treated them as mid-level features that did not introduce new information, I focus on their potential as a means to better guide the learning of object categories, by enforcing the object category classifiers to share features with attribute classifiers in a multitask feature learning framework. This approach essentially discovers the common low-dimensional features that support predictions in both semantic spaces. Then I move on to the semantic taxonomy, another valuable source of semantic knowledge. The merging and splitting criteria for the categories on a taxonomy are human-defined, and I aim to exploit this implicit semantic knowledge.
Specifically, I propose a tree of metrics (ToM) that learns metrics capturing granularity-specific similarities at different nodes of a given semantic taxonomy, and uses a regularizer to isolate granularity-specific disjoint features. This approach captures the intuition that the features used to discriminate the parent class should differ from the features used for the children classes. Such learned metrics can be used for hierarchical classification. A single taxonomy can be limiting in that its structure is not optimal for hierarchical classification, and there may exist no single optimal semantic taxonomy that perfectly aligns with visual distributions. Thus, I next propose a way to overcome this limitation by leveraging multiple taxonomies as semantic sources and combining the complementary information acquired across multiple semantic views and granularities. This allows us, for example, to synthesize semantics from both 'Biological'- and 'Appearance'-based taxonomies when learning the visual features. Finally, going beyond the previous two pairwise similarity-based models to more complex semantic relations, I exploit analogies, which encode the relational similarities between two related pairs of categories. Specifically, I use analogies to regularize a discriminatively learned semantic embedding space for categorization, such that the displacements between the two category embeddings in both category pairs of the analogy are enforced to be the same. Such a constraint allows a more confusable pair of categories to benefit from the clear separation in the matched pair of categories that shares the same relation; a sketch of this regularizer follows the abstract. All of these methods are evaluated on challenging public datasets and are shown to effectively improve recognition accuracy over purely discriminative models, while also guiding the recognition to be more consistent with human semantic perception. Further, the applications of the proposed methods are not limited to visual object categorization in computer vision; they can be applied to any classification problem where there exists some domain knowledge about the relationships or structures between the classes. Possible applications outside the visual recognition domain include document classification in natural language processing, and gene-based animal or protein classification in computational biology.
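The analogy constraint can be written down compactly: for each analogy (a : b) :: (c : d), the displacement between the two category embeddings of one pair is pushed toward the displacement of the other. A minimal PyTorch sketch, with the discriminative objective and trade-off weight left abstract:

```python
import torch

def analogy_regularizer(E, analogies):
    """Analogy-based structural regularizer (sketch).
    E: (num_categories, dim) learnable embedding matrix.
    analogies: list of (a, b, c, d) index tuples encoding (a : b) :: (c : d)."""
    loss = E.new_zeros(())
    for a, b, c, d in analogies:
        # Enforce equal displacement vectors for the two analogous pairs.
        diff = (E[a] - E[b]) - (E[c] - E[d])
        loss = loss + (diff ** 2).sum()
    return loss

# Usage sketch: total = discriminative_loss + lam * analogy_regularizer(E, A),
# where lam trades classification fit against semantic structure.
```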
APA, Harvard, Vancouver, ISO, and other styles
48

Ivensky, Ilya. "Prediction of Alzheimer's disease and semantic dementia from scene description: toward better language and topic generalization." Thesis, 2020. http://hdl.handle.net/1866/24317.

Full text
Abstract:
Data segmentation by the language and the topic of psycholinguistic tests increasingly becomes a significant obstacle to the generalization of predictive models. It limits our ability to understand the core of linguistic and cognitive dysfunction, because the models overfit the details of a particular language or topic. In this work, we study potential approaches to overcoming these limitations. We discuss the properties of various FastText word embedding models for English and French and propose a set of features derived from these properties. We show that despite the differences in the languages and the embedding algorithms, a universal, language-agnostic set of word-vector features can capture cognitive dysfunction. We argue that in the context of scarce data, hand-crafted word-vector features are a reasonable alternative to feature learning, which allows us to generalize over language and topic boundaries.
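A small sketch of how such hand-crafted, language-agnostic word-vector features might be computed from a FastText-style vector table. The concrete feature set here (mean vector plus a cosine-dispersion score) and the downstream linear classifier are illustrative assumptions, not the thesis's exact feature list:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def wordvec_features(tokens, ft):
    # ft: dict mapping word -> FastText vector. Features: the mean vector
    # of the scene description plus the mean pairwise cosine similarity,
    # a crude measure of how semantically dispersed the description is.
    vecs = np.stack([ft[t] for t in tokens if t in ft])
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    dispersion = float((unit @ unit.T).mean())
    return np.concatenate([vecs.mean(axis=0), [dispersion]])

# With a feature matrix X and diagnosis labels y, a simple linear model
# fits the scarce-data regime described above:
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```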
APA, Harvard, Vancouver, ISO, and other styles
