Dissertations / Theses on the topic 'Semantic embeddings'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 48 dissertations / theses for your research on the topic 'Semantic embeddings.'
Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Malmberg, Jacob. "Evaluating semantic similarity using sentence embeddings." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291425.
Semantic similarity search aims to find documents or sentences that are semantically similar to a user-specified query. This type of search is performed frequently, for example when users look for information on the internet. To enable it, vector representations must be created both of the documents to be searched and of the query itself. A common way to create these representations has been the term frequency - inverse document frequency (TF-IDF) algorithm. Modern approaches instead use neural networks, which have become very popular in recent years. The BERT network, released in 2018, is a well-regarded network that can be used to create vector representations. Many variants of BERT have been created, for example Sentence-BERT, which is explicitly designed to produce vector representations of sentences. This thesis evaluates semantic similarity search built on sentence representations produced by both traditional and modern approaches. Various experiments were performed to contrast the approaches. Since no datasets explicitly created for this type of experiment could be found, commonly used datasets were modified instead. The results showed that TF-IDF outperformed the neural network-based approaches in almost all experiments. Of the neural networks evaluated, Sentence-BERT performed better than BERT. Datasets explicitly designed for semantic similarity search are needed to produce more generalisable results.
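Sketched below, on invented toy data, is the TF-IDF baseline the thesis compares against: documents and the query are turned into TF-IDF vectors and ranked by cosine similarity. The corpus, tokeniser, and idf smoothing are illustrative assumptions, not the thesis implementation.

```python
# Toy sketch of the TF-IDF baseline: invented corpus and query,
# plain whitespace tokenisation, smoothed idf. Not the thesis code.
import math
from collections import Counter

docs = [
    "neural networks learn vector representations of sentences",
    "tf idf weights terms by frequency and rarity",
    "semantic search finds documents similar to a query",
]
query = "search for semantically similar documents"

def tokenize(text):
    return text.lower().split()

vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def idf(term):
    # smoothed inverse document frequency
    df = sum(term in tokenize(doc) for doc in docs)
    return math.log((1 + len(docs)) / (1 + df)) + 1

def tfidf(text):
    counts = Counter(tokenize(text))
    return [counts[term] * idf(term) for term in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

q_vec = tfidf(query)
# documents sorted by similarity to the query, most similar first
ranking = sorted(range(len(docs)),
                 key=lambda i: cosine(tfidf(docs[i]), q_vec), reverse=True)
```

On this toy corpus the third document, which shares several terms with the query, is ranked first; a sentence-embedding approach would instead compare dense vectors from a model such as Sentence-BERT.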
Yu, Lu. "Semantic representation: from color to deep embeddings." Doctoral thesis, Universitat Autònoma de Barcelona, 2019. http://hdl.handle.net/10803/669458.
One of the fundamental problems of computer vision is to represent images with compact, semantically relevant embeddings. These embeddings could then be used in a wide variety of applications, such as image retrieval, object detection, and video search. The main objective of this thesis is to study image embeddings from two aspects: color embeddings and deep embeddings. In the first part of the thesis we start from hand-crafted color embeddings. We propose a method to order additional color names according to their complementary nature with the basic eleven color names. This allows us to compute color name representations of arbitrary length with high discriminative power. Psychophysical experiments confirm that our proposed method outperforms baseline approaches. Secondly, we learn deep color embeddings from weakly labeled data by adding an attention strategy. The attention branch is able to correctly identify the relevant regions for each class. The advantage of our approach is that it can learn color names for specific domains for which no pixel-wise labels exist. In the second part of the thesis, we focus on deep embeddings. Firstly, we address the problem of compressing large embedding networks into small networks while maintaining similar performance. We propose to distil the metrics from a teacher network to a student network. Two new losses are introduced to model the communication from a deep teacher network to a small student network: one based on an absolute teacher, where the student aims to produce the same embeddings as the teacher, and one based on a relative teacher, where the distances between pairs of data points are communicated from the teacher to the student. In addition, various aspects of distillation have been investigated for embeddings, including hint and attention layers, semi-supervised learning, and cross-quality distillation. Finally, another aspect of deep metric learning, namely lifelong learning, is studied.
We observe that some drift of the learned knowledge occurs during the training of new tasks for metric learning. We introduce a method to estimate the semantic drift based on the drift experienced by data of the current task during its training. Given this estimate, previous tasks can be compensated for the drift, thereby improving their performance. Furthermore, we show that embedding networks suffer significantly less from catastrophic forgetting than classification networks when learning new tasks.
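The absolute-teacher and relative-teacher losses described in the abstract can be sketched as follows; the embeddings are invented toy data and this is an illustration, not the thesis code. A translated student shows why the relative loss is invariant to transformations that preserve pairwise distances.

```python
# Toy illustration (not the thesis code) of the two distillation losses:
# an absolute teacher matches student embeddings to teacher embeddings
# directly; a relative teacher only matches pairwise distances, so the
# student may even live in a differently arranged embedding space.
import math

def absolute_loss(student, teacher):
    # mean squared error between corresponding embedding coordinates
    n = sum(len(vec) for vec in student)
    return sum((a - b) ** 2
               for s, t in zip(student, teacher)
               for a, b in zip(s, t)) / n

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relative_loss(student, teacher):
    # mean squared error between the two pairwise distance matrices
    n = len(student)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum((dist(student[i], student[j]) - dist(teacher[i], teacher[j])) ** 2
               for i, j in pairs) / len(pairs)

teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]  # teacher shifted by (1, 1)

# shifting every embedding changes the absolute loss but preserves all
# pairwise distances, so the relative teacher is perfectly satisfied
assert absolute_loss(student, teacher) == 1.0
assert relative_loss(student, teacher) == 0.0
```

The design trade-off the two losses express: the absolute teacher pins the student to the teacher's exact coordinate system, while the relative teacher only constrains the geometry among samples.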
Moss, Adam. "Detecting Lexical Semantic Change Using Probabilistic Gaussian Word Embeddings." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-412539.
Montariol, Syrielle. "Models of diachronic semantic change using word embeddings." Electronic Thesis or Diss., université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG006.
In this thesis, we study lexical semantic change: temporal variations in the use and meaning of words, also called diachrony. These changes are carried by the way people use words, and mirror the evolution of various aspects of society such as its technological and cultural environment. We explore, compare, and evaluate methods for building time-varying embeddings from a corpus in order to analyse language evolution. We focus on contextualised word embeddings from pre-trained language models such as BERT. We propose several approaches to extract and aggregate the contextualised representations of words over time and to quantify their level of semantic change. In particular, we address the practical aspects of these systems: the scalability of our approaches, with a view to applying them to large corpora or large vocabularies; their interpretability, by disambiguating the different uses of a word over time; and their applicability to concrete issues, for documents related to COVID-19. We evaluate the efficiency of these methods quantitatively, using several annotated corpora, and qualitatively, by linking the detected semantic variations to real-life events and numerical data. Finally, we extend the task of semantic change detection beyond the temporal dimension. We adapt it to a bilingual setting, to study the joint evolution of a word and its translation in two corpora of different languages, and to a synchronic frame, to detect semantic variations across different sources or communities on top of the temporal variation.
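One aggregation strategy of the kind the abstract describes can be sketched as follows, with invented 2-d vectors standing in for the contextualised (e.g. BERT) representations of one word's occurrences: average the vectors within each time slice, then score semantic change as the cosine distance between slice means.

```python
# Toy sketch of one aggregation strategy for semantic change detection.
# The 2-d vectors are invented stand-ins for contextualised embeddings.
import math

def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# occurrences of one word in two time slices (invented vectors)
slice_1990 = [[1.0, 0.1], [0.9, 0.0]]
slice_2020 = [[0.1, 1.0], [0.0, 0.9]]

# semantic change score: distance between the averaged representations
change_score = cosine_distance(mean_vector(slice_1990), mean_vector(slice_2020))
```

A word whose usage is stable across slices would score near zero, while the near-orthogonal toy vectors above give a score close to one.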
Shaik, Arshad. "Biomedical Semantic Embeddings: Using Hybrid Sentences to Construct Biomedical Word Embeddings and Their Applications." Thesis, University of North Texas, 2019. https://digital.library.unt.edu/ark:/67531/metadc1609064/.
Munbodh, Mrinal. "Deriving A Better Metric To Assess the Quality of Word Embeddings Trained On Limited Specialized Corpora." University of Cincinnati / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1601995854965902.
Balzar Ekenbäck, Nils. "Evaluation of Sentence Representations in Semantic Text Similarity Tasks." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291334.
This thesis investigates methods for representing sentences in vector form for semantic text similarity and compares them on sentence-based test sets. Two methods were used to evaluate the representations: STS Benchmark, an established method for evaluating a language model's ability to assess semantic similarity, and STS Benchmark converted into a binary similarity task. The results showed that preprocessing the text and the word vectors could yield a significant improvement on these tasks. The study also concluded that the dataset used may not be ideal for this type of evaluation, since the sentence pairs largely had a high lexical overlap. As a complement, the study proposes a paraphrase dataset, which would require further work.
Zhou, Hanqing. "DBpedia Type and Entity Detection Using Word Embeddings and N-gram Models." Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/37324.
Felt, Paul L. "Facilitating Corpus Annotation by Improving Annotation Aggregation." BYU ScholarsArchive, 2015. https://scholarsarchive.byu.edu/etd/5678.
Landthaler, Jörg [Verfasser], Florian [Akademischer Betreuer] Matthes, Kevin D. [Gutachter] Ashley, and Florian [Gutachter] Matthes. "Improving Semantic Search in the German Legal Domain with Word Embeddings / Jörg Landthaler ; Gutachter: Kevin D. Ashley, Florian Matthes ; Betreuer: Florian Matthes." München : Universitätsbibliothek der TU München, 2020. http://d-nb.info/1216242410/34.
Tissier, Julien. "Improving methods to learn word representations for efficient semantic similarities computations." Thesis, Lyon, 2020. http://www.theses.fr/2020LYSES008.
Many natural language processing applications rely on word representations (also called word embeddings) to achieve state-of-the-art results. These numerical representations of language should encode both syntactic and semantic information to perform well in downstream tasks. However, common models (word2vec, GloVe) learn them from generic corpora such as Wikipedia, and they therefore lack specific semantic information. Moreover, storing them requires a large amount of memory, because the number of representations to save can be on the order of a million. The topic of my thesis is to develop new learning algorithms that improve the semantic information encoded within the representations while reducing the memory they require for storage, and to apply them in NLP tasks. The first part of my work improves the semantic information contained in word embeddings. I developed dict2vec, a model that uses additional information from online lexical dictionaries when learning word representations. The dict2vec word embeddings perform ∼15% better than the embeddings learned by other models on word semantic similarity tasks. The second part of my work reduces the memory size of the embeddings. I developed an architecture based on an autoencoder to transform commonly used real-valued embeddings into binary embeddings, reducing their size in memory by 97% with only a loss of ∼2% in accuracy on downstream NLP tasks.
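The intuition behind binary embeddings can be sketched as follows; sign binarisation is used here as a simpler stand-in for the thesis's autoencoder, and the vectors are invented toy data. Each real-valued dimension becomes one bit, so a 300-d float32 vector (~1200 bytes) shrinks to ~38 bytes, and similarity becomes a cheap Hamming distance.

```python
# Toy sketch of the binary-embedding idea. Sign binarisation stands in
# for the thesis's learned autoencoder; the vectors are invented.
def binarize(vec):
    # pack the sign of each dimension into the bits of one integer
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    # number of differing bits between two binary embeddings
    return bin(a ^ b).count("1")

king  = [0.5, -0.2, 0.8, 0.1, -0.6]
queen = [0.4, -0.1, 0.7, 0.2, -0.5]   # close to king
apple = [-0.3, 0.6, -0.2, -0.4, 0.5]  # far from both

# nearby real-valued vectors stay nearby after binarisation
assert hamming(binarize(king), binarize(queen)) < hamming(binarize(king), binarize(apple))
```

The learned autoencoder in the thesis plays the role that the fixed sign function plays here, which is how it keeps the accuracy loss small.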
Necşulescu, Silvia. "Automatic acquisition of lexical-semantic relations: gathering information in a dense representation." Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/374234.
Lexical-semantic relations between words are key information for many NLP tasks, which require this knowledge in the form of linguistic resources. This thesis addresses the acquisition of lexical-semantic relation instances. Current systems use pattern-based representations of the contexts in which two words co-occur to detect the relation holding between them. This approach suffers from data sparsity: even with very large corpora, there will be related word pairs that never co-occur, or that do not co-occur often enough. Our main goal has therefore been to propose new representations for predicting whether two words stand in a lexical-semantic relation. The intuition was that these new representations should combine information about context patterns with information about the meaning of the words involved in the relation. These two sources of information were to form the basis of a generalisation strategy that provides information even when the two words do not co-occur.
Åkerström, Joakim, and Aravena Carlos Peñaloza. "Semantiska modeller för syntetisk textgenerering - en jämförelsestudie." Thesis, Högskolan i Borås, Akademin för bibliotek, information, pedagogik och IT, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-13981.
The ability to express feelings in words, or to move others with them, has always been an admired and rare gift. This project involves creating a text generator able to write in the style of remarkable men and women who had this gift. This was done by training a neural network on quotes written by outstanding people such as Oscar Wilde, Mark Twain, and Charles Dickens. The neural network cooperates with two different semantic models, Word2Vec and one-hot encoding, and together they form our text generator. With the generated text we carried out a survey to collect students' opinions on the quality of the text produced by our generator. Upon examination of the results, we learned that most respondents thought the texts were coherent and fun to read, and that the former semantic model performed better than the latter, though not by an order of magnitude.
Silva, Allan de Barcelos. "O uso de recursos linguísticos para mensurar a semelhança semântica entre frases curtas através de uma abordagem híbrida." Universidade do Vale do Rio dos Sinos, 2017. http://www.repositorio.jesuita.org.br/handle/UNISINOS/6974.
Assessing Semantic Textual Similarity (STS) is one of the challenges in Natural Language Processing (NLP) and plays an increasingly important role in related applications. STS is a fundamental part of techniques and approaches in several areas, such as information retrieval, text classification, document clustering, translation, duplicate detection, and others. The literature describes experimentation almost exclusively in English, relying primarily on probabilistic resources and exploring linguistic ones only in an incipient way. Linguistics plays a fundamental role in the analysis of semantic textual similarity between short sentences, because exclusively probabilistic approaches fail in certain cases (e.g., distinguishing closely from distantly related sentences, or resolving anaphora) due to their lack of understanding of the language, a problem aggravated by how little non-linguistic information short sentences contain. It is therefore vital to identify and apply linguistic resources to better understand what makes two or more sentences similar or not. The current work presents a hybrid approach in which distributed, lexical, and linguistic aspects are all used to evaluate semantic textual similarity between short sentences in Brazilian Portuguese. We evaluated the proposed approach on well-known and respected datasets from the literature (PROPOR 2016) and obtained good results.
Lisena, Pasquale. "Knowledge-based music recommendation : models, algorithms and exploratory search." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS614.
Representing information about music is a complex activity that involves different sub-tasks. This thesis focuses mostly on classical music, researching how to represent and exploit its information. The main goal is the investigation of strategies for knowledge representation and discovery applied to classical music, involving subjects such as knowledge-base population, metadata prediction, and recommender systems. We propose a complete workflow for the management of music metadata using Semantic Web technologies. We introduce a specialised ontology and a set of controlled vocabularies for the different concepts specific to music. Then, we present an approach for converting data in order to go beyond the librarian practice currently in use, relying on mapping rules and on interlinking with controlled vocabularies. Finally, we show how these data can be exploited. In particular, we study approaches based on embeddings computed on structured metadata, titles, and symbolic music for ranking and recommending music. Several demo applications have been realised to test the previous approaches and resources.
Nguyen, Gia Hung. "Modèles neuronaux pour la recherche d'information : approches dirigées par les ressources sémantiques." Thesis, Toulouse 3, 2018. http://www.theses.fr/2018TOU30233.
In this thesis, we focus on bridging the semantic gap between document and query representations, and hence on improving matching performance. We propose to combine relational semantics from knowledge resources with distributed semantics of the corpus inferred by neural models. Our contributions have two main aspects. (1) Improving distributed representations of text for IR tasks. We propose two models that integrate relational semantics into distributed representations: a) an offline model that combines two types of pre-trained representations to obtain a hybrid representation of the document; b) an online model that jointly learns distributed representations of documents, concepts, and words. To better integrate relational semantics from knowledge resources, we propose two approaches to inject these relational constraints: one based on regularising the objective function, the other based on instances in the training text. (2) Exploiting neural networks for semantic matching of documents. We propose a neural model for document-query matching that relies on: a) a raw-data representation that models the relational semantics of text by jointly considering objects and relations expressed in a knowledge resource, and b) an end-to-end neural architecture that learns query-document relevance by leveraging the distributional and relational semantics of documents and queries.
Callin, Jimmy. "Word Representations and Machine Learning Models for Implicit Sense Classification in Shallow Discourse Parsing." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-325876.
Full textScala, Simone. "Realizzazione di un motore di ricerca semantico basato sui Document Embedding." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2017.
Wang, Run Fen. "Semantic Text Matching Using Convolutional Neural Networks." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-362134.
Wang, Qian. "Zero-shot visual recognition via latent embedding learning." Thesis, University of Manchester, 2018. https://www.research.manchester.ac.uk/portal/en/theses/zeroshot-visual-recognition-via-latent-embedding-learning(bec510af-6a53-4114-9407-75212e1a08e1).html.
Norlund, Tobias. "The Use of Distributional Semantics in Text Classification Models : Comparative performance analysis of popular word embeddings." Thesis, Linköpings universitet, Datorseende, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-127991.
Choudhary, Rishabh R. "Construction and Visualization of Semantic Spaces for Domain-Specific Text Corpora." University of Cincinnati / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1627666092811419.
Šůstek, Martin. "Word2vec modely s přidanou kontextovou informací." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2017. http://www.nusl.cz/ntk/nusl-363837.
Das, Manirupa. "Neural Methods Towards Concept Discovery from Text via Knowledge Transfer." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1572387318988274.
Romeo, Lauren Michele. "The Structure of the lexicon in the task of the automatic acquisition of lexical information." Doctoral thesis, Universitat Pompeu Fabra, 2015. http://hdl.handle.net/10803/325420.
Full textLa información de clase semántica de los nombres es fundamental para una amplia variedad de tareas del procesamiento del lenguaje natural (PLN), como la traducción automática, la discriminación de referentes en tareas como la detección y el seguimiento de eventos, la búsqueda de respuestas, el reconocimiento y la clasificación de nombres de entidades, la construcción y ampliación automática de ontologías, la inferencia textual, etc. Una aproximación para resolver la construcción y el mantenimiento de los léxicos de gran cobertura que alimentan los sistemas de PNL, una tarea muy costosa y lenta, es la adquisición automática de información léxica, que consiste en la inducción de una clase semántica relacionada con una palabra en concreto a partir de datos de su distribución obtenidos de un corpus. Precisamente, por esta razón, se espera que la investigación actual sobre los métodos para la producción automática de léxicos de alta calidad, con gran cantidad de información y con anotación de clase como el trabajo que aquí presentamos, tenga un gran impacto en el rendimiento de la mayoría de las aplicaciones de PNL. En esta tesis, tratamos la adquisición automática de información léxica como un problema de clasificación. Con este propósito, adoptamos métodos de aprendizaje automático para generar un modelo que represente los datos de distribución vectorial que, basados en ejemplos conocidos, permitan hacer predicciones de otras palabras desconocidas. Las principales preguntas de investigación que planteamos en esta tesis son: (i) si los datos de corpus proporcionan suficiente información para construir representaciones de palabras de forma eficiente y que resulten en decisiones de clasificación precisas y sólidas, y (ii) si la adquisición automática puede gestionar, también, los nombres polisémicos. Para hacer frente a estos problemas, realizamos una serie de validaciones empíricas sobre nombres en inglés. 
Nuestros resultados confirman que la información obtenida a partir de la distribución de los datos de corpus es suficiente para adquirir automáticamente clases semánticas, como lo demuestra un valor-F global promedio de 0,80 aproximadamente utilizando varios modelos de recuento de contextos y en datos de corpus de distintos tamaños. No obstante, tanto el estado de la cuestión como los experimentos que realizamos destacaron una serie de retos para este tipo de modelos, que son reducir la escasez de datos del vector y dar cuenta de la polisemia nominal en las representaciones distribucionales de las palabras. En este contexto, los modelos de word embedding (WE) mantienen la “semántica” subyacente en las ocurrencias de un nombre en los datos de corpus asignándole un vector. Con esta elección, hemos sido capaces de superar el problema de la escasez de datos, como lo demuestra un valor-F general promedio de 0,91 para las clases semánticas de nombres de sentido único, a través de una combinación de la reducción de la dimensionalidad y de números reales. Además, las representaciones de WE obtuvieron un rendimiento superior en la gestión de las ocurrencias asimétricas de cada sentido de los nombres de tipo complejo polisémicos regulares en datos de corpus. Como resultado, hemos podido clasificar directamente esos nombres en su propia clase semántica con un valor-F global promedio de 0,85. La principal aportación de esta tesis consiste en una validación empírica de diferentes representaciones de distribución utilizadas para la clasificación semántica de nombres junto con una posterior expansión del trabajo anterior, lo que se traduce en recursos léxicos y conjuntos de datos innovadores que están disponibles de forma gratuita para su descarga y uso.
Lexical semantic class information for nouns is critical for a broad variety of Natural Language Processing (NLP) tasks including, but not limited to, machine translation, discrimination of referents in tasks such as event detection and tracking, question answering, named entity recognition and classification, automatic construction and extension of ontologies, textual inference, etc. One approach to solve the costly and time-consuming manual construction and maintenance of large-coverage lexica to feed NLP systems is the Automatic Acquisition of Lexical Information, which involves the induction of a semantic class related to a particular word from distributional data gathered within a corpus. This is precisely why current research on methods for the automatic production of high- quality information-rich class-annotated lexica, such as the work presented here, is expected to have a high impact on the performance of most NLP applications. In this thesis, we address the automatic acquisition of lexical information as a classification problem. For this reason, we adopt machine learning methods to generate a model representing vectorial distributional data which, grounded on known examples, allows for the predictions of other unknown words. The main research questions we investigate in this thesis are: (i) whether corpus data provides sufficient distributional information to build efficient word representations that result in accurate and robust classification decisions and (ii) whether automatic acquisition can handle also polysemous nouns. To tackle these problems, we conducted a number of empirical validations on English nouns. Our results confirmed that the distributional information obtained from corpus data is indeed sufficient to automatically acquire lexical semantic classes, demonstrated by an average overall F1-Score of almost 0.80 using diverse count-context models and on different sized corpus data. 
Nonetheless, both the State of the Art and the experiments we conducted highlighted a number of challenges for this type of model, such as reducing vector sparsity and accounting for nominal polysemy in distributional word representations. In this context, Word Embeddings (WE) models maintain the “semantics” underlying the occurrences of a noun in corpus data by mapping it to a feature vector. With this choice, we were able to overcome the sparse data problem, demonstrated by an average overall F1-Score of 0.91 for single-sense lexical semantic noun classes, through a combination of reduced dimensionality and “real” numbers. In addition, the WE representations obtained a higher performance in handling the asymmetrical occurrences of each sense of regular polysemous complex-type nouns in corpus data. As a result, we were able to directly classify such nouns into their own lexical-semantic class with an average overall F1-Score of 0.85. The main contribution of this dissertation consists of an empirical validation of different distributional representations used for nominal lexical semantic classification, along with a subsequent expansion of previous work, which results in novel lexical resources and data sets that have been made freely available for download and use.
Stigeborn, Olivia. "Text ranking based on semantic meaning of sentences." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-300442.
Full textFinding a suitable candidate for customer matching is an important part of a consulting company's work. It takes considerable time and effort for the company's recruiters to read potentially hundreds of CVs to find a suitable candidate. There are natural language processing methods for ranking CVs with the most suitable candidates ranked highest. This ensures that recruiters only need to look at the top-ranked CVs and can quickly get candidates out into the field. Previous research has used methods that count specific keywords in a CV and can determine whether a candidate has specific experience. The main goal of this thesis is to use the semantic meaning of the text in the CVs to gain a deeper understanding of a candidate's level of experience. It also evaluates whether the model can run on mobile devices and whether the algorithm can rank CVs regardless of whether they are written in Swedish or English. An algorithm was created that uses the word embedding model DistilRoBERTa, which is capable of capturing the semantic meaning of text. The algorithm was evaluated by generating job descriptions from CVs through a summary of each CV. The running time, the memory usage and the rank assigned to the desired candidate were recorded and used to analyse the results. When the candidate used to generate the job description was ranked in the top 10, the classification was considered correct. The accuracy was computed with this method, and an accuracy of 68.3% was achieved. The results show that the algorithm is able to rank CVs. The algorithm can rank both Swedish and English CVs, with an accuracy of 67.7% for Swedish and 74.7% for English. The average running time was 578 ms, which would allow the algorithm to run on mobile devices, but the memory usage was too high.
In conclusion, the semantic meaning of CVs can be used to rank them, and possible future work is to combine this method with a keyword-counting method to investigate how the accuracy would be affected.
Martignano, Alessandro. "Transfer learning nella classificazione di dati testuali gerarchici: approcci semantici basati su ontologie e word embeddings." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2019.
Find full textPierrejean, Bénédicte. "Qualitative evaluation of word embeddings : investigating the instability in neural-based models." Thesis, Toulouse 2, 2020. http://www.theses.fr/2020TOU20001.
Full textDistributional semantics has been revolutionized by neural-based word embedding methods such as word2vec, which made semantic models more accessible by providing fast, efficient and easy-to-use training methods. These dense representations of lexical units, based on the unsupervised analysis of large corpora, are more and more used in various types of applications. They are integrated as the input layer in deep learning models, or they are used to draw qualitative conclusions in corpus linguistics. However, despite their popularity, there still exists no satisfying evaluation method for word embeddings that provides a global yet precise view of the differences between models. In this PhD thesis, we propose a methodology to qualitatively evaluate word embeddings and provide a comprehensive study of models trained using word2vec. In the first part of this thesis, we give an overview of the evolution of distributional semantics and review the different methods that are currently used to evaluate word embeddings. We then identify the limits of the existing methods and propose to evaluate word embeddings using a different approach, based on the variation of nearest neighbors. We experiment with the proposed method by evaluating models trained with different parameters or on different corpora. Because of the non-deterministic nature of neural-based methods, we acknowledge the limits of this approach and consider the problem of nearest-neighbor instability in word embedding models. Rather than avoiding this problem we embrace it and use it as a means to better understand word embeddings. We show that the instability problem does not impact all words in the same way and that several linguistic features are correlated with it. This is a step towards a better understanding of vector-based semantic models.
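The nearest-neighbor variation idea can be sketched as follows. This is an illustrative reconstruction, not the thesis's actual implementation: given two embedding matrices over the same vocabulary, the variation score for a word is the share of its top-k cosine neighbors that the two models do not have in common.

```python
import numpy as np

def top_k_neighbors(vectors, word_index, k):
    """Indices of the k nearest neighbors of one word by cosine similarity."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed[word_index]
    sims[word_index] = -np.inf          # exclude the word itself
    return set(np.argsort(sims)[-k:])

def neighbor_variation(model_a, model_b, word_index, k=25):
    """Share of top-k neighbors NOT shared by both models (0 = identical)."""
    na = top_k_neighbors(model_a, word_index, k)
    nb = top_k_neighbors(model_b, word_index, k)
    return 1.0 - len(na & nb) / k

# Toy demo: two tiny "models" over the same 6-word vocabulary.
rng = np.random.default_rng(0)
model_a = rng.normal(size=(6, 10))
model_b = model_a + rng.normal(scale=0.01, size=(6, 10))  # near-identical run
variation = neighbor_variation(model_a, model_b, word_index=0, k=3)
```

Averaging this score over the vocabulary gives a single instability figure for a pair of training runs, which is the kind of global-yet-precise comparison the thesis argues is missing from standard evaluations.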
Van, Tassel John Peter. "Femto-VHDL : the semantics of a subset of VHDL and its embedding in the HOL proof assistant." Thesis, University of Cambridge, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.308190.
Full textBerndl, Emanuel [Verfasser], and Harald [Akademischer Betreuer] Kosch. "Embedding a Multimedia Metadata Model into a Workflow-driven Environment Using Idiomatic Semantic Web Technologies / Emanuel Berndl ; Betreuer: Harald Kosch." Passau : Universität Passau, 2019. http://d-nb.info/1192512022/34.
Full textWarren, Jared. "Using Haskell to Implement Syntactic Control of Interference." Thesis, Kingston, Ont. : [s.n.], 2008. http://hdl.handle.net/1974/1237.
Full textBucher, Maxime. "Apprentissage et exploitation de représentations sémantiques pour la classification et la recherche d'images." Thesis, Normandie, 2018. http://www.theses.fr/2018NORMC250/document.
Full textIn this thesis, we examine some practical difficulties of deep learning models. Indeed, despite the promising results in computer vision, implementing them in some situations raises some questions. For example, in classification tasks where thousands of categories have to be recognised, it is sometimes difficult to gather enough training data for each category. We propose two new approaches for this learning scenario, called <>. We use semantic information to model classes, which allows us to define models by description, as opposed to modelling from a set of examples. In the first chapter, we propose to optimise a metric in order to transform the distribution of the original data and obtain an optimal attribute distribution. In the following chapter, unlike the standard approaches of the literature that rely on learning a common integration space, we propose to generate visual features from a conditional generator. The artificial examples can be used in addition to real data for learning a discriminant classifier. In the second part of this thesis, we address the question of computational intelligibility for computer vision tasks. Because of the many and complex transformations of deep learning algorithms, it is difficult for a user to interpret the returned prediction. Our proposal is to introduce what we call a <> in the processing pipeline, which is a crossing point at which the representation of the image is entirely expressed in natural language, while retaining the efficiency of numerical representations. This semantic bottleneck makes it possible to detect failure cases in the prediction process, so as to accept or reject the decision.
Dergachyova, Olga. "Knowledge-based support for surgical workflow analysis and recognition." Thesis, Rennes 1, 2017. http://www.theses.fr/2017REN1S059/document.
Full textComputer assistance has become an indispensable part of modern surgical procedures. The desire to create a new generation of intelligent operating rooms has incited researchers to explore problems of automatic perception and understanding of surgical situations. Situation awareness includes automatic recognition of surgical workflow. Great progress has been achieved in the recognition of surgical phases and gestures. Yet there is still a gap between these two granularity levels in the hierarchy of the surgical process. Very little research focuses on surgical activities, which carry important semantic information vital for situation understanding. Two important factors impede progress. First, automatic recognition and prediction of surgical activities is a highly challenging task due to the short duration of activities, their great number, and a very complex workflow with a multitude of possible execution and sequencing orders. Second, the very limited amount of clinical data does not provide enough information for successful learning and accurate recognition. In our opinion, before recognising surgical activities, a careful analysis of the elements that compose an activity is necessary in order to choose the right signals and sensors that will facilitate recognition. We used a deep learning approach to assess the impact of different semantic elements of an activity on its recognition. Through an in-depth study we determined a minimal set of elements sufficient for accurate recognition. Information about the operated anatomical structure and the surgical instrument was shown to be the most important. We also addressed the problem of data deficiency by proposing methods for the transfer of knowledge from other domains or surgeries. Methods based on word embedding and transfer learning were proposed. They demonstrated their effectiveness on the task of next-activity prediction, offering a 22% increase in accuracy. In addition, pertinent observations about surgical practice were made during the study.
In this work, we also addressed the problem of insufficient and improper validation of recognition methods. We proposed new validation metrics and approaches for assessing performance that connect methods to targeted applications and better characterise the capacities of a method. The work described in this thesis aims at clearing the obstacles blocking the progress of the domain and proposes a new perspective on the problem of surgical workflow recognition.
Gränsbo, Gustav. "Word Clustering in an Interactive Text Analysis Tool." Thesis, Linköpings universitet, Interaktiva och kognitiva system, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-157497.
Full textGkotse, Blerina. "Ontology-based Generation of Personalised Data Management Systems : an Application to Experimental Particle Physics." Thesis, Université Paris sciences et lettres, 2020. http://www.theses.fr/2020UPSLM017.
Full textThis thesis work aims at bridging the gap between the fields of the Semantic Web and Experimental Particle Physics. Taking as a use case a specific type of physics experiment, namely the irradiation experiments used for assessing the resistance of components to radiation, a domain model, what in the Semantic Web is called an ontology, has been created for describing the main concepts underlying the data management of irradiation experiments. Using such a formalisation, a methodology has been introduced for the automatic generation of data management systems based on ontologies, and used to generate a web application for IEDM, the previously introduced ontology. In the last part of this thesis work, by using user-interface (UI) display preferences stored as instances of a UI-dedicated ontology we introduced, we present a method that represents these ontology instances as feature vectors (embeddings) in order to recommend personalised UIs.
Aliane, Nourredine. "Evaluation des représentations vectorielles de mots." Thesis, Paris 8, 2019. http://www.theses.fr/2019PA080014.
Full textIn Natural Language Processing, the vectorization of words is a key step that enables the use of algorithms based on mathematical models. New methods have appeared recently, and evaluating their quality is a necessity. At present, evaluations are mostly carried out on English, which raises the question of multilingual evaluation. We worked on generalising methods, comparing them, devising new evaluations, and on WordNet as a multilingual resource used for evaluation. We chose six vectorization methods: CBOW, SkipGram, GloVe, an older method as a baseline, and two more recent methods. Evaluations can be direct, comparing with some gold standard, or indirect, evaluating the result of an application produced with some vectorization. As an indirect method, we chose semantic clustering of words for comparing the underlying vectorizations. The chosen clustering algorithms were the widely used K-means, a neural one (SOM) and a probabilistic one (EM). Our system applies evaluation methods to big corpora in English, French and Arabic, then compares the underlying vectorizations. We propose five new evaluation methods, four of them based on WordNet, and one new protocol for polling. Our results yield three different vectorization orderings that agree on decisive points, and invalidate some existing evaluations. As for our own evaluations, the protocol is validated; one method is invalidated and the reason analysed; one is validated for English and French, but not Arabic; two are validated on all three languages; and one is left for further exploration.
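An indirect evaluation of this kind could be sketched as follows. This is only an illustrative toy version, not the system described in the thesis: cluster word vectors with a minimal K-means and score the clustering against gold semantic classes using purity.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal K-means: returns a cluster label for every row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def purity(labels, gold):
    """Fraction of words covered by their cluster's majority gold class."""
    total = 0
    for c in np.unique(labels):
        total += np.bincount(gold[labels == c]).max()
    return total / len(gold)

# Toy vocabulary: two well-separated semantic classes in 5-D vector space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (10, 5)), rng.normal(3, 0.1, (10, 5))])
gold = np.array([0] * 10 + [1] * 10)
labels = kmeans(X, k=2)
score = purity(labels, gold)
```

Running the same scoring over clusterings produced from different vectorizations of the same vocabulary would give the kind of indirect comparison between embedding methods that the thesis pursues.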
Lin, Wei-Rou, and 林瑋柔. "Learning and Exploring Sequential Visual-Semantic Embeddings from Visual Story Ordering." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/c4vzq4.
Full text國立臺灣大學
資訊工程學研究所
106
As more and more text-image intertwined stories, such as blog posts, are generated on the internet, we are curious about the similarities and differences between the information carried by the two modalities. Thus, we introduce the visual story ordering problem: to order images and text in a visual story jointly, and we handle the problem with a model that trains text and images into the same embedding. We try several models to deal with the problem, including pairwise and listwise approaches. We employ the result of coreference resolution as a baseline. In addition, we implement a reader-processor-writer architecture and a self-attention mechanism. We further propose to decode in a bidirectional model with bidirectional beam search. We experimented with our methods on VIST, a visual storytelling dataset (Huang et al., 2016). The results show that bidirectional models outperform unidirectional models, whereas models trained with images outperform models trained without images on a text-only subset. We also found that our embedding narrows the distance between images and their corresponding story sentences, even though we do not align the two modalities directly. In the future, we can further test the effectiveness of our model on different datasets, explore the bidirectional inference mechanism more deeply, and augment our model with more functionality by adapting existing models.
Pombo, José Luís Fava de Matos. "Landing on the right job : a machine learning approach to match candidates with jobs applying semantic embeddings." Master's thesis, 2019. http://hdl.handle.net/10362/60405.
Full textJob application screening is a challenging and time-consuming task to execute manually. For recruiting companies such as Landing.Jobs it poses constraints on the ability to scale the business. Some systems have been built for assisting recruiters in screening applications, but they tend to overlook the challenges related to natural language. On the other side, most people nowadays, especially in the IT sector, use the Internet to look for jobs; however, given the huge amount of job postings online, it can be complicated for a candidate to short-list the right ones to apply to. In this work we test a collection of Machine Learning algorithms and, through the usage of cross-validation, we calibrate the most important hyper-parameters of each algorithm. The learning algorithms attempt to learn what makes a successful match between a candidate profile and job requirements, training on historical data of selected/rejected applications in the screening phase. The features we use for building our models include the similarities between the job requirements and the candidate profile in dimensions such as skills, profession and location, and a set of job features which intend to capture the experience level, salary expectations, among others. In a first set of experiments, our best results emerge from the application of the Multilayer Perceptron algorithm (also known as Feed-Forward Neural Networks). After this, we improve the skills-matching feature by applying techniques for semantically embedding required/offered skills in order to tackle problems such as synonyms and typos, which artificially lower the similarity between job profile and candidate profile and degrade the overall quality of the results. Through the usage of the word2vec algorithm for embedding skills and a Multilayer Perceptron to learn the overall matching, we obtain our best results.
We believe our results could be further improved by extending the idea of semantic embedding to other features, and by finding candidates with job preferences similar to the target candidate's and building upon that a richer representation of the candidate profile. We consider that the final model we present in this work can be deployed in production as a first-level tool for doing the heavy lifting of screening all applications, then passing the top N matches on for manual inspection. The results of our model can also be used to complement any recommendation system in place: upon any new job opening, simply run the model, which encodes the profiles of all candidates in the database, and recommend the jobs to the candidates with the highest matching probability.
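The skills-matching idea above can be illustrated with a toy sketch. The embedding values and skill names here are invented; the thesis trains word2vec on real skill data. The point is that near-synonymous skills land close together in the embedding space, so cosine similarity remains high where exact string matching would fail.

```python
import numpy as np

# Hypothetical toy skill embeddings; in practice these would come from
# word2vec trained on a corpus of real candidate and job profiles.
skill_vecs = {
    "python":     np.array([0.9, 0.1, 0.0]),
    "py":         np.array([0.85, 0.15, 0.0]),   # near-synonym of "python"
    "javascript": np.array([0.1, 0.9, 0.1]),
    "design":     np.array([0.0, 0.1, 0.9]),
}

def profile_vector(skills):
    """Average the embeddings of a skill list into one unit-length vector."""
    v = np.mean([skill_vecs[s] for s in skills], axis=0)
    return v / np.linalg.norm(v)

def skills_match(candidate, job):
    """Cosine similarity between candidate-skill and job-requirement vectors."""
    return float(profile_vector(candidate) @ profile_vector(job))

exact = skills_match(["python"], ["python"])
fuzzy = skills_match(["py"], ["python"])        # synonym still scores high
off   = skills_match(["design"], ["python"])    # unrelated skill scores low
```

A string-overlap feature would score the "py" vs "python" pair at zero; the embedded version keeps it close to an exact match, which is the degradation problem the thesis reports fixing.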
Silva, André de Vasconcelos Santos. "Sparse distributed representations as word embeddings for language understanding." Master's thesis, 2018. http://hdl.handle.net/10071/18245.
Full textThe term word embeddings refers to vector representations of words that capture their semantic and syntactic similarities. Similar words tend to be represented by nearby vectors in an N-dimensional space, considering, for example, the Euclidean distance between the points associated with these vector representations in a continuous vector space. This property makes word embeddings important in several Natural Language Processing tasks, from word analogy and similarity evaluations to the more complex tasks of text categorization, summarization and machine translation. Typically, word embeddings are dense, low-dimensional vectors. They are obtained through unsupervised learning on considerable amounts of data, by optimizing the objective function of a neural network. This work proposes a methodology for obtaining word embeddings consisting of sparse binary vectors, that is, vector representations of words that are simultaneously binary (e.g. composed only of zeros and ones), sparse and high-dimensional. The proposed methodology attempts to overcome some drawbacks of the state-of-the-art methodologies, namely the large volume of data required to train the models, while delivering comparable results on several Natural Language Processing tasks. The results of this work show that these representations, obtained from a limited amount of training data, achieve considerable performance in word similarity and categorization tasks. On the other hand, in word analogy tasks, considerable results are obtained only for the grammatical category of nouns.
Compared with the state of the art, the word embeddings obtained with the proposed methodology outperformed eight word embeddings in similarity tasks, and two word embeddings in word categorization tasks.
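One simple way to obtain sparse binary codes of the kind discussed in this entry is a random projection followed by a winner-take-all step; this is only an illustrative sketch under that assumption, not the author's actual methodology.

```python
import numpy as np

def to_sparse_binary(dense, n_bits, density=0.05):
    """Project dense vectors to high-dimensional sparse binary codes.

    A fixed random projection lifts each vector to `n_bits` dimensions,
    then only the top `density` fraction of dimensions is set to 1
    (a winner-take-all step, as in sparse distributed representations).
    """
    rng = np.random.default_rng(42)
    proj = rng.normal(size=(dense.shape[1], n_bits))
    lifted = dense @ proj
    k = max(1, int(density * n_bits))
    codes = np.zeros_like(lifted, dtype=np.uint8)
    top = np.argsort(lifted, axis=1)[:, -k:]     # indices of the k largest
    np.put_along_axis(codes, top, 1, axis=1)
    return codes

dense = np.random.default_rng(7).normal(size=(5, 50))   # 5 toy word vectors
codes = to_sparse_binary(dense, n_bits=2048)
```

Because the projection is fixed, similar dense vectors tend to activate overlapping sets of bits, so bit-overlap can stand in for cosine similarity while storage drops to a handful of active indices per word.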
Beland-Leblanc, Samuel. "Learning discrete word embeddings to achieve better interpretability and processing efficiency." Thesis, 2020. http://hdl.handle.net/1866/25464.
Full textThe ubiquitous use of word embeddings in Natural Language Processing is proof of their usefulness and adaptability to a multitude of tasks. However, their continuous nature is prohibitive in terms of computation, storage and interpretation. In this work, we propose a method of learning discrete word embeddings directly. The model is an adaptation of a novel database searching method using state-of-the-art natural language processing techniques such as Transformers and LSTMs. On top of obtaining embeddings requiring a fraction of the resources to store and process, our experiments strongly suggest that our representations learn basic units of meaning in latent space akin to lexical morphemes. We call these units sememes, i.e., semantic morphemes. We demonstrate that our model has great generalization potential and outputs representations showing strong semantic and conceptual relations between related words.
"Video2Vec: Learning Semantic Spatio-Temporal Embedding for Video Representations." Master's thesis, 2016. http://hdl.handle.net/2286/R.I.40765.
Full textDissertation/Thesis
Masters Thesis Computer Science 2016
Chen, Hsiao-Yi, and 陳曉毅. "A Semantic Search over Encrypted Cloud Data based on Word Embedding." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/7b4m86.
Full text國立臺灣科技大學
資訊工程系
107
Cloud storage services have become very popular in recent years. With the advantages of low cost and high capacity, people are inclined to move their data from local computers to remote facilities such as cloud servers. The majority of existing methods for searching data on the cloud concentrate on keyword-based search schemes. With the rise of information security awareness, data owners want the data placed on the cloud server to be kept private from untrusted users, and users also want their query content not to be recorded by an untrusted server. Therefore, encrypting the data and the queries is the most common approach. However, the encrypted ciphertext loses the relationships of the original plaintext, which causes many difficulties for keyword search. In addition, most existing search methods are unable to efficiently obtain, from the user's query keywords, the information the user is really interested in. To address these problems, this study proposes a word embedding based semantic search scheme for searching documents on the cloud. The word embedding model is implemented by a neural network, which learns the semantic relationships between words in the corpus and represents words as vectors. Using a word embedding model, a document index vector and a query vector can be generated. The proposed scheme encrypts the query vector and the index vector into ciphertext, which preserves the efficiency of the search while protecting the privacy of the user and the security of the documents.
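The plaintext side of such a scheme can be sketched as follows. This is a toy illustration with an invented embedding table, and the encryption step of the actual scheme is omitted: both documents and queries are mapped to normalised mean word vectors, and documents are ranked by cosine similarity to the query vector.

```python
import numpy as np

# Toy word-embedding table; in the thesis these vectors are learned
# by a neural network from a corpus.
emb = {
    "cloud":  np.array([1.0, 0.2, 0.0]),
    "server": np.array([0.9, 0.3, 0.1]),
    "cat":    np.array([0.0, 0.1, 1.0]),
    "pet":    np.array([0.1, 0.0, 0.9]),
}

def doc_vector(words):
    """Index (or query) vector: normalised mean of the word embeddings."""
    v = np.mean([emb[w] for w in words], axis=0)
    return v / np.linalg.norm(v)

docs = {"doc1": ["cloud", "server"], "doc2": ["cat", "pet"]}
index = {name: doc_vector(ws) for name, ws in docs.items()}

def search(query_words):
    """Rank documents by cosine similarity to the query vector."""
    q = doc_vector(query_words)
    return sorted(index, key=lambda d: float(index[d] @ q), reverse=True)

ranking = search(["cloud"])   # "doc1" is semantically closest
```

In the full scheme, both the index vectors and the query vector would be encrypted before being sent to the server, so that similarity can be computed without revealing the plaintext content.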
Ke, Hao, and 葛浩. "Some approaches of combining word embedding and lexical resource for semantic relatedness measurement." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/87144345989462382585.
Full text國立臺灣大學
資訊工程學研究所
104
In this thesis, we propose three different approaches to measuring semantic relatedness: (1) boost the performance of the GloVe word embedding by removing or transforming abnormal dimensions; (2) linearly combine the path information extracted from WordNet and the word embedding; (3) utilize the word embedding and twelve linguistic features extracted from WordNet as features for support vector regression. We conduct our experiments on six benchmark data sets. The evaluation measure computes the Pearson and Spearman correlations between the output of our methods and the ground truth. We report our results together with three state-of-the-art approaches. The experimental results show that our methods outperform the state-of-the-art approaches on most of the benchmark data sets.
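The second approach, linearly combining WordNet path information with embedding similarity, might look like the following sketch. The weighting and the vectors are invented, and `path_sim` stands in for a precomputed WordNet path score (e.g. obtainable with NLTK's WordNet interface):

```python
import numpy as np

def combined_relatedness(vec_a, vec_b, path_sim, alpha=0.5):
    """Linear mix of embedding cosine similarity and a WordNet path score.

    `path_sim` is assumed precomputed from WordNet; `alpha` weights the
    distributional source against the lexical-resource source.
    """
    cos = float(vec_a @ vec_b /
                (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
    return alpha * cos + (1.0 - alpha) * path_sim

# Hypothetical vectors and a hypothetical path score for two related nouns.
car, automobile = np.array([0.8, 0.6]), np.array([0.78, 0.62])
score = combined_relatedness(car, automobile, path_sim=1.0, alpha=0.5)
```

Tuning `alpha` on a held-out similarity benchmark would let the lexical resource compensate where the corpus-derived vectors are weak, which is the motivation for combining the two sources.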
Hwang, Sung Ju. "Discriminative object categorization with external semantic knowledge." 2013. http://hdl.handle.net/2152/21320.
Full text
Ivensky, Ilya. "Prediction of Alzheimer's disease and semantic dementia from scene description: toward better language and topic generalization." Thesis, 2020. http://hdl.handle.net/1866/24317.
Full textData segmentation by the language and the topic of psycholinguistic tests is increasingly becoming a significant obstacle to the generalization of predictive models. It limits our ability to understand the core of linguistic and cognitive dysfunction, because the models overfit the details of a particular language or topic. In this work, we study potential approaches to overcoming such limitations. We discuss the properties of various FastText word embedding models for English and French and propose a set of features derived from these properties. We show that despite the differences in the languages and the embedding algorithms, a universal language-agnostic set of word-vector features can capture cognitive dysfunction. We argue that in the context of scarce data, hand-crafted word-vector features are a reasonable alternative to feature learning, which allows us to generalize across language and topic boundaries.