Dissertations / Theses on the topic 'Open information extraction'

Consult the top 20 dissertations / theses for your research on the topic 'Open information extraction.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of each academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Xavier, Clarissa Castellã. "Learning non-verbal relations under open information extraction paradigm." Pontifícia Universidade Católica do Rio Grande do Sul, 2014. http://tede2.pucrs.br/tede2/handle/tede/5275.

Full text
Abstract:
Open Information Extraction (Open IE) is a relation extraction paradigm in which the target relationships cannot be specified in advance; it aims to overcome the limitations imposed by traditional IE methods, such as domain dependence and poor scalability. In order to extend Open IE to extract relationships that are not expressed by verbs from texts in English, we introduce CompIE, a component that learns relations expressed in noun compounds (NCs), such as (oil, extracted from, olive) from olive oil, or in adjective-noun pairs (ANs), such as (moon, that is, gorgeous) from gorgeous moon. CompIE's input is a text file, and its output is a set of triples describing binary relationships. The architecture comprises two main tasks: NC and AN Extraction (1) and NC and AN Interpretation (2). The first task generates a list of NCs and ANs from the input corpus. The second task performs the interpretation of the NCs and ANs and generates the triples that describe the relations extracted from the corpus. In order to study CompIE's feasibility, we perform an evaluation based on hypotheses, and we built a prototype to implement the strategies for validating each hypothesis. The results show that our solution achieves 89% Precision and demonstrate that CompIE reaches its goal of extending the Open IE paradigm by extracting relationships within NCs and ANs.
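To make the NC/AN extraction step concrete, here is a minimal Python sketch (using spaCy) of how candidate noun compounds and adjective-noun pairs might be pulled from text and turned into triples. It is not the CompIE implementation described above: the "that is" paraphrase follows the abstract's example, while the interpreting relation for noun compounds is left as a placeholder.

```python
# Minimal sketch (not CompIE): find adjective-noun pairs (ANs) and noun
# compounds (NCs) with spaCy and emit candidate triples. Assumes the
# en_core_web_sm model has been installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_candidates(text):
    triples = []
    for tok in nlp(text):
        if tok.dep_ == "amod":        # adjective-noun pair, e.g. "gorgeous moon"
            triples.append((tok.head.text, "that is", tok.text))
        elif tok.dep_ == "compound":  # noun compound, e.g. "olive oil"
            # CompIE learns an interpreting relation (e.g. "extracted from");
            # here only an underspecified placeholder is emitted.
            triples.append((tok.head.text, "RELATED-TO", tok.text))
    return triples

print(extract_candidates("She poured olive oil under a gorgeous moon."))
# e.g. [('oil', 'RELATED-TO', 'olive'), ('moon', 'that is', 'gorgeous')]
```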
APA, Harvard, Vancouver, ISO, and other styles
2

Xavier, Clarissa Castellã. "Learning non-verbal relations under open information extraction paradigm." Pontifícia Universidade Católica do Rio Grande do Sul, 2014. http://hdl.handle.net/10923/7073.

Full text
Abstract:
Open Information Extraction (Open IE) is a relation extraction paradigm in which the target relationships cannot be specified in advance; it aims to overcome the limitations imposed by traditional IE methods, such as domain dependence and poor scalability. In order to extend Open IE to extract relationships that are not expressed by verbs from texts in English, we introduce CompIE, a component that learns relations expressed in noun compounds (NCs), such as (oil, extracted from, olive) from olive oil, or in adjective-noun pairs (ANs), such as (moon, that is, gorgeous) from gorgeous moon. CompIE's input is a text file, and its output is a set of triples describing binary relationships. The architecture comprises two main tasks: NC and AN Extraction (1) and NC and AN Interpretation (2). The first task generates a list of NCs and ANs from the input corpus. The second task performs the interpretation of the NCs and ANs and generates the triples that describe the relations extracted from the corpus. In order to study CompIE's feasibility, we perform an evaluation based on hypotheses, and we built a prototype to implement the strategies for validating each hypothesis. The results show that our solution achieves 89% Precision and demonstrate that CompIE reaches its goal of extending the Open IE paradigm by extracting relationships within NCs and ANs.
APA, Harvard, Vancouver, ISO, and other styles
3

Xavier, Clarissa [Verfasser]. "Learning Non-Verbal Relations Under Open Information Extraction Paradigm / Clarissa Xavier." Munich : GRIN Verlag, 2015. http://d-nb.info/1097578720/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Gashteovski, Kiril [Verfasser], and Rainer [Akademischer Betreuer] Gemulla. "Compact open information extraction: methods, corpora, analysis / Kiril Gashteovski ; Betreuer: Rainer Gemulla." Mannheim : Universitätsbibliothek Mannheim, 2021. http://d-nb.info/123650285X/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Genest, Pierre-Yves. "Unsupervised open-world information extraction from unstructured and domain-specific document collections." Electronic Thesis or Diss., Lyon, INSA, 2024. http://www.theses.fr/2024ISAL0111.

Full text
Abstract:
The exponential growth in data generation has rendered the effective analysis of unstructured textual document collections a critical challenge. This PhD thesis aims to address this challenge by focusing on Information Extraction (IE), which encompasses four essential tasks: Named Entity Recognition (NER), Coreference Resolution (CR), Entity Linking (EL), and Relation Extraction (RE). These tasks collectively enable extracting and structuring knowledge from unformatted documents, facilitating its integration into structured databases for further analytical processes. Our contributions start with the creation of Linked-DocRED, the first large-scale, diverse, and manually annotated dataset for document-level IE. This dataset enriches the existing DocRED dataset with high-quality entity linking labels. Additionally, we propose a novel set of metrics for evaluating end-to-end IE models. The evaluation of baseline models on Linked-DocRED highlights the complexities and challenges inherent to document-level IE: cascading errors, long-context handling, and information scarcity.
We then introduce PromptORE, an unsupervised and open-world RE model. Adapting the prompt-tuning paradigm, PromptORE achieves relation embedding and clustering without requiring fine-tuning or hyperparameter tuning (a major weakness of previous baselines) and significantly outperforms state-of-the-art models. This method demonstrates the feasibility of extracting semantically coherent relation types in an open-world context. Further extending our prompt-based approach, we develop CITRUN for unsupervised and open-world NER. By employing contrastive learning with off-domain labeled data, CITRUN improves entity type embeddings, surpassing LLM-based unsupervised NERs and achieving competitive performance against more supervised zero-shot models. These advancements facilitate meaningful knowledge extraction from unstructured documents, addressing practical, real-world constraints and enhancing the applicability of IE models in industrial contexts.
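As a rough illustration of the prompt-based relation embedding and clustering idea, the following sketch embeds the slot between two entities with a masked language model and clusters the embeddings. It is not the authors' PromptORE code; the model name, prompt wording and toy sentences are assumptions made for the example.

```python
# Illustrative sketch of prompt-based relation embedding + clustering
# (not the authors' PromptORE code). Model name, prompt wording and the
# toy sentences are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from sklearn.cluster import KMeans

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def relation_embedding(sentence, head, tail):
    # Cloze prompt: the [MASK] slot between the entities stands for the relation.
    prompt = f"{sentence} {head} {tok.mask_token} {tail}."
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = mlm(**inputs, output_hidden_states=True).hidden_states[-1]
    mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0]
    return hidden[0, mask_pos]

examples = [
    ("Marie Curie was born in Warsaw.", "Marie Curie", "Warsaw"),
    ("Alan Turing was born in London.", "Alan Turing", "London"),
    ("Tesla was founded by Elon Musk.", "Tesla", "Elon Musk"),
]
X = torch.stack([relation_embedding(*ex) for ex in examples]).numpy()
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))  # unsupervised relation clusters
```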
APA, Harvard, Vancouver, ISO, and other styles
6

Ramsey, Marshall C., Thian-Huat Ong, and Hsinchun Chen. "Multilingual Input System for the Web - an Open Multimedia Approach of Keyboard and Handwriting Recognition for Chinese and Japanese." IEEE, 1998. http://hdl.handle.net/10150/105120.

Full text
Abstract:
Artificial Intelligence Lab, Department of MIS, University of Arizona. The basic building block of a multilingual information retrieval system is the input system. Chinese and Japanese characters pose great challenges for the conventional 101-key alphabet-based keyboard, because they are radical-based and number in the thousands. This paper reviews the development of various approaches and then presents a framework and working demonstrations of Chinese and Japanese input methods implemented in Java, which allow open deployment over the web to any platform. The demo includes both popular keyboard input methods and neural network handwriting recognition using a mouse or pen. This framework is able to accommodate future extension to other input media and languages of interest.
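As a toy illustration of a dictionary-based keyboard input method of the kind surveyed in the paper (not the authors' Java framework), the sketch below maps romanized input to candidate characters from which a user would pick; the tiny mapping table is invented for the example.

```python
# Toy dictionary-based input method (not the authors' Java system): romanized
# input maps to candidate characters, from which the user selects one.
# The tiny mapping table is a made-up example.
PINYIN_TABLE = {
    "ma": ["妈", "马", "吗"],
    "hao": ["好", "号"],
}

def candidates(romanization):
    return PINYIN_TABLE.get(romanization, [])

print(candidates("hao"))  # ['好', '号'] -- the user then picks the intended character
```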
APA, Harvard, Vancouver, ISO, and other styles
7

Ramsey, Marshall C., Thian-Huat Ong, and Hsinchun Chen. "Multilingual input system for the Web - an open multimedia approach of keyboard and handwritten recognition for Chinese and Japanese." IEEE, 1998. http://hdl.handle.net/10150/105350.

Full text
Abstract:
Artificial Intelligence Lab, Department of MIS, University of Arizona. The basic building block of a multilingual information retrieval system is the input system. Chinese and Japanese characters pose great challenges for the conventional 101-key alphabet-based keyboard, because they are radical-based and number in the thousands. This paper reviews the development of various approaches and then presents a framework and working demonstrations of Chinese and Japanese input methods implemented in Java, which allow open deployment over the web to any platform. The demo includes both popular keyboard input methods and neural network handwriting recognition using a mouse or pen. This framework is able to accommodate future extension to other input media and languages of interest.
APA, Harvard, Vancouver, ISO, and other styles
8

De, Wilde Max. "From Information Extraction to Knowledge Discovery: Semantic Enrichment of Multilingual Content with Linked Open Data." Doctoral thesis, Universite Libre de Bruxelles, 2015. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/218774.

Full text
Abstract:
Discovering relevant knowledge in unstructured text is not a trivial task. Search engines relying on full-text indexing of content reach their limits when confronted with poor quality, ambiguity, or multiple languages. Some of these shortcomings can be addressed by information extraction and related natural language processing techniques, but this still falls short of adequate knowledge representation. In this thesis, we defend a generic approach striving to be as language-independent, domain-independent, and content-independent as possible. To reach this goal, we propose to disambiguate terms with their corresponding identifiers in Linked Data knowledge bases, paving the way for full-scale semantic enrichment of textual content. The added value of our approach is illustrated with a comprehensive case study based on a trilingual historical archive, addressing constraints of data quality, multilingualism, and language evolution. A proof-of-concept implementation is also proposed in the form of a Multilingual Entity/Resource Combiner & Knowledge eXtractor (MERCKX), demonstrating to a certain extent the general applicability of our methodology to any language, domain, and type of content.
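A minimal sketch of the disambiguation-by-Linked-Data-identifier idea, not the MERCKX tool itself: a term is looked up in Wikidata and the top candidate's identifier is kept as a language-independent anchor. The choice of Wikidata and the naive top-candidate selection (which ignores context) are simplifying assumptions.

```python
# Sketch of term disambiguation against a Linked Data knowledge base
# (not MERCKX itself). Wikidata and the naive top-hit choice are assumptions.
import requests

def link_term(term, lang="en"):
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": lang,
        "format": "json",
        "limit": 1,
    }
    resp = requests.get("https://www.wikidata.org/w/api.php", params=params, timeout=10)
    hits = resp.json().get("search", [])
    return hits[0]["id"] if hits else None  # a Wikidata identifier such as "Q90"

print(link_term("Paris"))
```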
APA, Harvard, Vancouver, ISO, and other styles
9

FRESCHI, Sergio. "A Multidisciplinary Approach to the Reuse of Open Learning Resources." University of Sydney, 2008. http://hdl.handle.net/2123/3937.

Full text
Abstract:
Master of Engineering (Research). Educational standards are having a significant impact on e-Learning. They allow for better exchange of information among different organizations and institutions. They simplify reusing and repurposing learning materials. They give teachers the possibility of personalizing them according to the student's background and learning speed. Thanks to these standards, off-the-shelf content can be adapted to a particular student cohort's context and learning needs. The same course content can be presented in different languages. Overall, all the parties involved in the learning-teaching process (students, teachers and institutions) can benefit from these standards, and so online education can be improved. To materialize the benefits of standards, learning resources should be structured according to these standards. Unfortunately, a large number of existing e-Learning materials lack the intrinsic logical structure required, and further, when they do have the structure, they are not encoded as required. These problems make it virtually impossible to share these materials. This thesis addresses the following research question: How can we make the best use of existing open learning resources available on the Internet by taking advantage of educational standards and specifications, and thus improve content reusability? In order to answer this question, I combine different technologies, techniques and standards that make the sharing of publicly available learning resources possible in innovative ways. I developed and implemented a three-stage tool to tackle the above problem. By applying information extraction techniques and open e-Learning standards to legacy learning resources, the tool has been proven to improve content reusability. In so doing, it contributes to the understanding of how these technologies can be used in real scenarios and shows how online education can benefit from them. In particular, three main components were created which enable the conversion of unstructured educational content into a standards-compliant form in a systematic and automatic way. An increasing number of repositories with educational resources are available, including Wikiversity and the Massachusetts Institute of Technology OpenCourseWare. Wikiversity is an open repository containing over 6,000 learning resources in several disciplines and for all age groups [1]. I used the OpenCourseWare repository to evaluate the effectiveness of my software components and ideas. The results show that it is possible to create standards-compliant learning objects from publicly available web pages, improving their searchability, interoperability and reusability.
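As an illustration of the first step such a conversion tool has to perform, here is a small sketch that recovers a logical section structure from a legacy HTML learning resource. It is a simplification rather than the thesis's three-stage tool, and it stops at a plain list of (heading, text) sections instead of a standards-compliant learning object.

```python
# Sketch of recovering a logical structure from a legacy HTML learning page
# (an illustrative simplification, not the thesis's three-stage tool).
from bs4 import BeautifulSoup

def extract_sections(html):
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for heading in soup.find_all(["h1", "h2", "h3"]):
        body = []
        for sib in heading.find_next_siblings(True):   # following tags only
            if sib.name in ("h1", "h2", "h3"):
                break
            body.append(sib.get_text(" ", strip=True))
        sections.append((heading.get_text(strip=True), " ".join(body)))
    return sections

page = "<h1>Limits</h1><p>Intuition first.</p><h2>Definition</h2><p>Epsilon-delta.</p>"
print(extract_sections(page))
```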
APA, Harvard, Vancouver, ISO, and other styles
10

Del, Corro Luciano [Verfasser], and Rainer [Akademischer Betreuer] Gemulla. "Methods for open information extraction and sense disambiguation on natural language text / Luciano Del Corro. Betreuer: Rainer Gemulla." Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2016. http://d-nb.info/1081935006/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Gooch, P. "A modular, open-source information extraction framework for identifying clinical concepts and processes of care in clinical narratives." Thesis, City University London, 2012. http://openaccess.city.ac.uk/2112/.

Full text
Abstract:
In this thesis, a synthesis is presented of the knowledge models required by clinical information systems that provide decision support for longitudinal processes of care. Qualitative research techniques and thematic analysis are novelly applied to a systematic review of the literature on the challenges in implementing such systems, leading to the development of an original conceptual framework. The thesis demonstrates how these process-oriented systems make use of a knowledge base derived from workflow models and clinical guidelines, and argues that one of the major barriers to implementation is the need to extract explicit and implicit information from diverse resources in order to construct the knowledge base. Moreover, concepts in both the knowledge base and in the electronic health record (EHR) must be mapped to a common ontological model. However, the majority of clinical guideline information remains in text form, and much of the useful clinical information residing in the EHR resides in the free text fields of progress notes and laboratory reports. In this thesis, it is shown how natural language processing and information extraction techniques provide a means to identify and formalise the knowledge components required by the knowledge base. Original contributions are made in the development of lexico-syntactic patterns and the use of external domain knowledge resources to tackle a variety of information extraction tasks in the clinical domain, such as recognition of clinical concepts, events, temporal relations, term disambiguation and abbreviation expansion. Methods are developed for adapting existing tools and resources in the biomedical domain to the processing of clinical texts, and approaches to improving the scalability of these tools are proposed and evaluated. These tools and techniques are then combined in the creation of a novel approach to identifying processes of care in the clinical narrative. It is demonstrated that resolution of coreferential and anaphoric relations as narratively and temporally ordered chains provides a means to extract linked narrative events and processes of care from clinical notes. Coreference performance in discharge summaries and progress notes is largely dependent on correct identification of protagonist chains (patient, clinician, family relation), pronominal resolution, and string matching that takes account of experiencer, temporal, spatial, and anatomical context; whereas for laboratory reports additional, external domain knowledge is required. The types of external knowledge and their effects on system performance are identified and evaluated. Results are compared against existing systems for solving these tasks and are found to improve on them, or to approach the performance of recently reported, state-of-the-art systems. Software artefacts developed in this research have been made available as open-source components within the General Architecture for Text Engineering framework.
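A toy illustration of the lexico-syntactic pattern idea, not the GATE-based system described in the thesis: two hand-written patterns pull affirmed and negated findings out of a sentence from a progress note; the pattern inventory and the finding lexicon are invented for the example.

```python
# Toy lexico-syntactic patterns for clinical findings (not the thesis's
# GATE components); patterns and lexicon are invented for the example.
import re

FINDINGS = r"(chest pain|fever|nausea|shortness of breath)"
PATTERNS = [
    (re.compile(rf"\b(reports|presents with|complains of)\s+{FINDINGS}", re.I), "affirmed"),
    (re.compile(rf"\b(denies|no evidence of|without)\s+{FINDINGS}", re.I), "negated"),
]

def extract_findings(sentence):
    results = []
    for pattern, polarity in PATTERNS:
        for match in pattern.finditer(sentence):
            results.append((match.group(2).lower(), polarity))
    return results

note = "Patient presents with fever and denies chest pain."
print(extract_findings(note))  # [('fever', 'affirmed'), ('chest pain', 'negated')]
```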
APA, Harvard, Vancouver, ISO, and other styles
12

Hou, Jun. "Text mining with semantic annotation : using enriched text representation for entity-oriented retrieval, semantic relation identification and text clustering." Thesis, Queensland University of Technology, 2014. https://eprints.qut.edu.au/79206/1/Jun_Hou_Thesis.pdf.

Full text
Abstract:
This project is a step forward in the study of text mining, where enhanced text representation with semantic information plays a significant role. It develops effective methods for entity-oriented retrieval, semantic relation identification and text clustering utilizing semantically annotated data. These methods are based on an enriched text representation generated by introducing semantic information extracted from Wikipedia into the input text data. The proposed methods are evaluated against several state-of-the-art benchmark methods on real-life datasets. In particular, this thesis improves the performance of entity-oriented retrieval, identifies different lexical forms for an entity relation and handles clustering of documents with multiple feature spaces.
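A minimal sketch of the enriched-representation idea: documents are clustered on TF-IDF features augmented with semantic annotations. In the thesis the annotations come from Wikipedia; here they are supplied by hand, and the toy corpus is invented for the example.

```python
# Sketch of clustering documents on TF-IDF features augmented with semantic
# annotations (hand-supplied here; Wikipedia-derived in the thesis).
import scipy.sparse as sp
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Apple released a new phone.", "Bananas are rich in potassium.",
        "The phone maker unveiled a tablet.", "Citrus fruit is full of vitamin C."]
annotations = [["Apple_Inc."], ["Banana"], ["Apple_Inc."], ["Citrus"]]  # made-up stand-ins

text_X = TfidfVectorizer().fit_transform(docs)
anno_X = TfidfVectorizer(analyzer=lambda labels: labels).fit_transform(annotations)
X = sp.hstack([text_X, anno_X])                 # enriched representation

print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
```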
APA, Harvard, Vancouver, ISO, and other styles
13

Amad, Ashraf. "L’acquisition et l’extraction de connaissances dans un contexte patrimoniale peu documenté." Thesis, Paris 8, 2017. http://www.theses.fr/2017PA080101.

Full text
Abstract:
The importance of cultural heritage documentation increases in parallel with the risks to which this heritage is exposed, such as wars, uncontrolled urban development, natural disasters, neglect and inappropriate conservation techniques or strategies. In addition, documentation is a fundamental tool for the assessment, conservation, monitoring and management of cultural heritage. Consequently, this tool allows us to estimate the historical, scientific, social and economic value of this heritage. According to several international institutions dedicated to the preservation of cultural heritage, there is an urgent need to develop computer solutions to facilitate and support the documentation of poorly documented cultural heritage, especially in developing countries where there is a severe lack of resources. Among these countries, Palestine represents a relevant case study for this problem of under-documented heritage. To address this issue, we propose an approach for knowledge acquisition and extraction in a poorly documented context.
We take the Church of the Nativity in Palestine as a case study and put our theoretical approach into practice by developing a platform for the acquisition and extraction of heritage knowledge, based on a framework for cultural heritage documentation. Our solution is based on semantic technologies, which gives us the possibility, from the outset, to provide a rich ontological description, a better structuring of the information, a high level of interoperability and better automatic processing (machine readability) without additional effort. Additionally, our approach is evolutionary and reciprocal, because the acquisition of knowledge (in structured form) improves the extraction of heritage knowledge from unstructured text, and vice versa. Therefore, the interaction between the two components of our system, as well as the heritage knowledge itself, develops and improves over time, especially as our system uses manual contributions and expert validation of the automatic results (in both components) to optimize its performance.
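As an illustration of the semantic-technology representation the thesis builds on, the sketch below stores a few facts about the case-study monument as RDF triples with rdflib. The mini vocabulary under the example.org namespace is an assumption made up for the example, not the thesis's ontology.

```python
# Sketch of an RDF description of the case-study monument (the ex: vocabulary
# is invented for the example; it is not the thesis's ontology).
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/heritage/")
g = Graph()
church = EX["Church_of_the_Nativity"]

g.add((church, RDF.type, EX.Monument))
g.add((church, EX.locatedIn, EX.Bethlehem))
g.add((church, EX.constructionPeriod, Literal("4th century")))

print(g.serialize(format="turtle"))
```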
APA, Harvard, Vancouver, ISO, and other styles
14

Amad, Ashraf. "L’acquisition et l’extraction de connaissances dans un contexte patrimoniale peu documenté." Electronic Thesis or Diss., Paris 8, 2017. http://www.theses.fr/2017PA080101.

Full text
Abstract:
The importance of cultural heritage documentation increases in parallel with the risks to which this heritage is exposed, such as wars, uncontrolled urban development, natural disasters, neglect and inappropriate conservation techniques or strategies. In addition, documentation is a fundamental tool for the assessment, conservation, monitoring and management of cultural heritage. Consequently, this tool allows us to estimate the historical, scientific, social and economic value of this heritage. According to several international institutions dedicated to the preservation of cultural heritage, there is an urgent need to develop computer solutions to facilitate and support the documentation of poorly documented cultural heritage, especially in developing countries where there is a severe lack of resources. Among these countries, Palestine represents a relevant case study for this problem of under-documented heritage. To address this issue, we propose an approach for knowledge acquisition and extraction in a poorly documented context.
We take the Church of the Nativity in Palestine as a case study and put our theoretical approach into practice by developing a platform for the acquisition and extraction of heritage knowledge, based on a framework for cultural heritage documentation. Our solution is based on semantic technologies, which gives us the possibility, from the outset, to provide a rich ontological description, a better structuring of the information, a high level of interoperability and better automatic processing (machine readability) without additional effort. Additionally, our approach is evolutionary and reciprocal, because the acquisition of knowledge (in structured form) improves the extraction of heritage knowledge from unstructured text, and vice versa. Therefore, the interaction between the two components of our system, as well as the heritage knowledge itself, develops and improves over time, especially as our system uses manual contributions and expert validation of the automatic results (in both components) to optimize its performance.
APA, Harvard, Vancouver, ISO, and other styles
15

López, Massaguer Oriol 1972. "Development of informatic tools for extracting biomedical data from open and propietary data sources with predictive purposes." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/471540.

Full text
Abstract:
We developed new software tools to obtain information from public and private data sources in order to develop in silico toxicity models. The first of these tools is Collector, an Open Source application that generates "QSAR-ready" series of compounds annotated with bioactivities, extracting the data from the Open PHACTS platform using semantic web technologies. Collector was applied in the framework of the eTOX project to develop predictive models for toxicity endpoints. Additionally, we conceived, designed, implemented and tested a method to derive toxicity scorings suitable for predictive modelling, starting from in vivo preclinical repeated-dose studies generated by the pharmaceutical industry. This approach was tested by generating scorings for three hepatotoxicity endpoints: 'degenerative lesions', 'inflammatory liver changes' and 'non-neoplasic proliferative lesions'. The suitability of these scores was tested by comparing them with experimentally obtained point-of-departure doses, as well as by developing tentative QSAR models, obtaining acceptable results. Our method relies on ontology-based inference to extract information from ontology-annotated data stored in a relational database. Our method, as a whole, can be applied to other preclinical toxicity databases to generate toxicity scorings. Moreover, the ontology-based inference method on its own is applicable to any relational database annotated with ontologies.
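A toy sketch of turning in vivo findings into per-endpoint toxicity scores, in the spirit of the approach described above but not the thesis's method: the finding-to-endpoint mapping and the max-severity scoring rule are invented for the example, whereas the real work derives them by ontology-based inference.

```python
# Toy derivation of per-endpoint toxicity scores from in vivo findings.
# The finding-to-endpoint mapping and severities are invented; the thesis
# derives this information by ontology-based inference.
FINDING_TO_ENDPOINT = {
    "hepatocellular necrosis": ("degenerative lesions", 3),
    "hepatitis": ("inflammatory liver changes", 2),
    "bile duct hyperplasia": ("non-neoplastic proliferative lesions", 1),
}

def toxicity_scores(findings):
    scores = {}
    for finding in findings:
        endpoint, severity = FINDING_TO_ENDPOINT.get(finding, (None, 0))
        if endpoint is not None:
            scores[endpoint] = max(scores.get(endpoint, 0), severity)
    return scores

print(toxicity_scores(["hepatitis", "bile duct hyperplasia"]))
```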
APA, Harvard, Vancouver, ISO, and other styles
16

Artchounin, Daniel. "Tuning of machine learning algorithms for automatic bug assignment." Thesis, Linköpings universitet, Programvara och system, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-139230.

Full text
Abstract:
In software development projects, bug triage consists mainly of assigning bug reports to software developers or teams (depending on the project). The partial or total automation of this task would have a positive economic impact on many software projects. This thesis introduces a systematic four-step method to find some of the best configurations of several machine learning algorithms for the automatic bug assignment problem. These four steps are used, respectively, to select a combination of pre-processing techniques, a bug report representation, and a potential feature selection technique, and to tune several classifiers. The aforementioned method has been applied to three software projects: 66 066 bug reports of a proprietary project, 24 450 bug reports of Eclipse JDT and 30 358 bug reports of Mozilla Firefox. 619 configurations have been applied and compared on each of these three projects. In production, using the approach introduced in this work on the bug reports of the proprietary project would have increased the accuracy by up to 16.64 percentage points.
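As a small illustration of this kind of configuration search (not the thesis's exact 619-configuration protocol), the sketch below couples text preprocessing, feature selection and a classifier in a scikit-learn pipeline and lets a grid search pick the best combination on bug-report summaries; the toy data and parameter grid are invented for the example.

```python
# Sketch of a configuration search for automatic bug assignment (toy data and
# grid; not the thesis's exact protocol).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

reports = ["crash when opening editor", "editor freezes on save",
           "login page returns 500", "authentication token expires early"]
teams = ["ui", "ui", "backend", "backend"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2)),
    ("clf", LinearSVC()),
])
grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "select__k": [5, "all"],
    "clf__C": [0.1, 1.0],
}
search = GridSearchCV(pipe, grid, cv=2).fit(reports, teams)
print(search.best_params_, search.best_score_)
```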
APA, Harvard, Vancouver, ISO, and other styles
17

Şentürk, Sertan. "Computational analysis of audio recordings and music scores for the description and discovery of Ottoman-Turkish Makam music." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/402102.

Full text
Abstract:
This thesis addresses several shortcomings of current state-of-the-art methodologies in music information retrieval (MIR). In particular, it proposes several computational approaches to automatically analyze and describe music scores and audio recordings of Ottoman-Turkish makam music (OTMM). The main contributions of the thesis are the music corpus that has been created to carry out the research and the audio-score alignment methodology developed for the analysis of the corpus. In addition, several novel computational analysis methodologies are presented in the context of common MIR tasks of relevance for OTMM. Some example tasks are predominant melody extraction, tonic identification, tempo estimation, makam recognition, tuning analysis, structural analysis and melodic progression analysis. These methodologies become part of a complete system called Dunya-makam for the exploration of large corpora of OTMM. The thesis starts by presenting the created CompMusic Ottoman-Turkish makam music corpus. The corpus includes 2200 music scores, more than 6500 audio recordings, and accompanying metadata. The data has been collected, annotated and curated with the help of music experts. Using criteria such as completeness, coverage and quality, we validate the corpus and show its research potential. In fact, our corpus is the largest and most representative resource of OTMM that can be used for computational research. Several test datasets have also been created from the corpus to develop and evaluate the specific methodologies proposed for the different computational tasks addressed in the thesis. The part focusing on the analysis of music scores is centered on phrase- and section-level structural analysis. Phrase boundaries are automatically identified using an existing state-of-the-art segmentation methodology. Section boundaries are extracted using heuristics specific to the formatting of the music scores. Subsequently, a novel method based on graph analysis is used to establish similarities across these structural elements in terms of melody and lyrics, and to label the relations semiotically. The audio analysis section of the thesis reviews the state of the art for analysing the melodic aspects of performances of OTMM. It proposes adaptations of existing predominant melody extraction methods tailored to OTMM. It also presents improvements over pitch-distribution-based tonic identification and makam recognition methodologies. The audio-score alignment methodology is the core of the thesis. It addresses the culture-specific challenges posed by the musical characteristics, music-theory-related representations and oral praxis of OTMM. Based on several techniques such as subsequence dynamic time warping, the Hough transform and variable-length Markov models, the audio-score alignment methodology is designed to handle the structural differences between music scores and audio recordings. The method is robust to the presence of non-notated melodic expressions, tempo deviations within the music performances, and differences in tonic and tuning. The methodology utilizes the outputs of the score and audio analysis, and links the audio and the symbolic data. In addition, the alignment methodology is used to obtain score-informed descriptions of audio recordings.
The score-informed audio analysis not only simplifies the audio feature extraction steps that would otherwise require sophisticated audio processing approaches, but also substantially improves the performance compared with results obtained from state-of-the-art methods relying solely on audio data. The analysis methodologies presented in the thesis are applied to the CompMusic Ottoman-Turkish makam music corpus and integrated into a web application aimed at culture-aware music discovery. Some of the methodologies have already been applied to other music traditions such as Hindustani, Carnatic and Greek music. Following open research best practices, all the created data, software tools and analysis results are openly available. The methodologies, the tools and the corpus itself provide vast opportunities for future research in many fields such as music information retrieval, computational musicology and music education.
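One of the techniques named in the abstract, subsequence dynamic time warping, can be sketched compactly: find where a short score-derived pitch sequence best matches inside a longer performance-derived sequence. This is an illustrative stand-alone implementation, not the thesis's full alignment system, and the toy sequences stand in for real pitch tracks.

```python
# Stand-alone subsequence DTW: locate the best match of a short score-derived
# pitch sequence inside a longer performance-derived one (toy sequences here).
import numpy as np

def subsequence_dtw(query, target):
    n, m = len(query), len(target)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = 0.0                          # the match may start anywhere in the target
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - target[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    end = int(np.argmin(D[n, 1:])) + 1     # best end position of the match
    return D[n, end], end

score_pitches = [60, 62, 64, 62]
performance_pitches = [55, 57, 60, 62, 63, 62, 59]
print(subsequence_dtw(score_pitches, performance_pitches))  # (distance, end index)
```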
APA, Harvard, Vancouver, ISO, and other styles
18

Delli Bovi, Claudio. "Harnessing sense-level information for semantically augmented knowledge extraction." Doctoral thesis, 2018. http://hdl.handle.net/11573/1065614.

Full text
Abstract:
Nowadays, building accurate computational models for the semantics of language lies at the very core of Natural Language Processing and Artificial Intelligence. A first and foremost step in this respect consists in moving from word-based to sense-based approaches, in which operating explicitly at the level of word senses enables a model to produce more accurate and unambiguous results. At the same time, word senses create a bridge towards structured lexico-semantic resources, where the vast amount of available machine-readable information can help overcome the shortage of annotated data in many languages and domains of knowledge. This latter phenomenon, known as the knowledge acquisition bottleneck, is a crucial problem that hampers the development of large-scale, data-driven approaches for many Natural Language Processing tasks, especially when lexical semantics is directly involved. One of these tasks is Information Extraction, where an effective model has to cope with data sparsity, as well as with lexical ambiguity that can arise at the level of both arguments and relational phrases. Even in more recent Information Extraction approaches where semantics is implicitly modeled, these issues have not yet been addressed in their entirety. On the other hand, however, having access to explicit sense-level information is a very demanding task on its own, which can rarely be performed with high accuracy on a large scale. With this in mind, in this thesis we will tackle a two-fold objective: our first focus will be on studying fully automatic approaches to obtain high-quality sense-level information from textual corpora; then, we will investigate in depth where and how such sense-level information has the potential to enhance the extraction of knowledge from open text. In the first part of this work, we will explore three different disambiguation scenarios (semi-structured text, parallel text, and definitional text) and devise automatic disambiguation strategies that are not only capable of scaling to different corpus sizes and different languages, but that actually take advantage of a multilingual and/or heterogeneous setting to improve and refine their performance. As a result, we will obtain three sense-annotated resources that, when tested experimentally with a baseline system in a series of downstream semantic tasks (i.e. Word Sense Disambiguation, Entity Linking, Semantic Similarity), show very competitive performances on standard benchmarks against both manual and semi-automatic competitors. In the second part we will instead focus on Information Extraction, with an emphasis on Open Information Extraction (OIE), where issues like sparsity and lexical ambiguity are especially critical, and study how best to exploit sense-level information within the extraction process. We will start by showing that enforcing a deeper semantic analysis in a definitional setting enables a full-fledged extraction pipeline to compete with state-of-the-art approaches based on much larger (but noisier) data. We will then demonstrate how working at the sense level at the end of an extraction pipeline is also beneficial: indeed, by leveraging sense-based techniques, very heterogeneous OIE-derived data can be aligned semantically, and unified with respect to a common sense inventory.
Finally, we will briefly shift the focus to the more constrained setting of hypernym discovery, and study a sense-aware supervised framework for the task that is robust and effective, even when trained on heterogeneous OIE-derived hypernymic knowledge.
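As an illustration of attaching sense-level information to Open IE output (not the thesis's pipeline), the sketch below tags the arguments of a triple with their first-listed WordNet sense so that triples from heterogeneous extractions can be compared against a common sense inventory; most-frequent-sense tagging is a deliberately crude stand-in for real disambiguation.

```python
# Sketch of mapping OIE triple arguments to a common sense inventory with
# WordNet (most-frequent-sense tagging as a crude stand-in for real WSD).
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def first_sense(word):
    synsets = wn.synsets(word.replace(" ", "_"))
    return synsets[0].name() if synsets else word  # e.g. a synset id like 'xyz.n.01'

def sense_tag(triple):
    subj, rel, obj = triple
    return (first_sense(subj), rel, first_sense(obj))

# The first-listed sense may well be the wrong one (e.g. the music sense of
# "bass" for a sentence about fishing) -- exactly the ambiguity that proper
# sense-level disambiguation is meant to resolve.
print(sense_tag(("bass", "is caught in", "river")))
```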
APA, Harvard, Vancouver, ISO, and other styles
19

Patkar, Vivek, and Smita Chandra. "e-Research and the Ubiquitious Open Grid Digital Libraries of the Future." 2006. http://hdl.handle.net/10150/105624.

Full text
Abstract:
Libraries have traditionally facilitated each of the following elements of research: the production of new knowledge, its preservation, and its organization to make it accessible for use over the generations. In modern times, the library is constantly required to meet the challenges of the information explosion. Assimilating resources and restructuring practices to process the large data volumes held across the globe, in both print and digital form, therefore becomes very important. A recourse by libraries to the application of successive forms of what can be called Digital Library Technologies (DLT) has been the imperative. The Open Archives Initiative (OAI) is one recent development that is expected to assist libraries in partnering to set up virtual learning environments and integrate research on a near-universal scale. A future extension of this concept is envisaged to be Grid Computing. The technologies driving the 'Grid' would let people share computing power, databases, and other on-line tools securely across institutional and geographic boundaries without sacrificing local autonomy. Ushering in an era of the ubiquitous library supporting e-research is thus on the cards. This paper reviews the emerging technological changes and charts the future role for libraries, with special reference to India.
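The Open Archives Initiative mentioned above is realized through the OAI-PMH protocol; the sketch below harvests record titles from a repository. The endpoint URL is a placeholder, while the verb, metadataPrefix and Dublin Core namespace follow the OAI-PMH 2.0 specification.

```python
# Sketch of harvesting record titles over OAI-PMH 2.0. The endpoint URL is a
# placeholder; verb, metadataPrefix and the Dublin Core namespace follow the spec.
import requests
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

def harvest_titles(endpoint):
    resp = requests.get(endpoint,
                        params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
                        timeout=30)
    root = ET.fromstring(resp.content)
    return [t.text for t in root.iter(f"{DC}title")]

# print(harvest_titles("https://example.org/oai"))  # placeholder repository URL
```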
APA, Harvard, Vancouver, ISO, and other styles
20

Léchelle, William. "Protocoles d'évaluation pour l'extraction d'information libre." Thèse, 2019. http://hdl.handle.net/1866/22659.

Full text
APA, Harvard, Vancouver, ISO, and other styles