To see the other types of publications on this topic, follow the link: Texte informatif.

Dissertations / Theses on the topic 'Texte informatif'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Texte informatif.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Boulanger, Chantale, and Anne Lamontagne. "Pour une étude du texte informatif /." Thèse, Chicoutimi : Université du Québec à Chicoutimi, 1989. http://theses.uqac.ca.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Boulanger, Chantale. "Pour une étude du texte informatif." Thèse, Université du Québec à Trois-Rivières, 1989. http://constellation.uqac.ca/1564/1/1461671.pdf.

Full text
Abstract:
The study of the informative text involves two concepts: that of text and that of information. In the first chapter, we position the concept of text by grounding it in a definition that makes it operational and functional. Accordingly, we define the text as a chain of signs operating on the syntactic and semantic parameters and following two trajectories: horizontal or syntagmatic, through the linear combination of signs, and vertical or paradigmatic, through the selection of elements associated in paradigms. Whatever the path taken, the fundamental element of the text remains the sign, that is, the signifier-signified relation. Putting signs in relation to one another produces meaning, for each element and for the text as a whole. Our study revealed that semantic production (the passage from signification to meaning) is attributable to the action the reader exerts on the text. In other words, the reader, prompted by the textual cues in place, activates the mechanisms of the text, mechanisms that remain merely virtual without the reader's intervention.

The second chapter of this thesis deals with the concept of information. Unlike the informative text, the concept of information has been the subject of many studies, most of which, however, did not suit us, because they approached it from an angle other than ours: it was too often studied from a sociological point of view, whereas we needed to integrate it into the framework of communication. To define our concept, we therefore relied essentially on a theory well known to scientists, the mathematical theory of communication as developed by Claude Shannon. Given its technical character, we turned, in order to understand it better, to works that also treat this theory, such as Léon Brillouin's "La science et la théorie de l'information" and Elie Roubine's "Introduction à la théorie de la communication". We thought it relevant to situate Shannon's theory historically and to summarise it through its objectives and its problem statement. We defined information and the problem of communication in accordance with Shannon's theory, in which information is conceived as a quantity measurable in terms of the receiver's knowledge. By means of a formula that we illustrate symbolically, information is measured on the basis of probabilities: the probabilities that a message or an event will occur. On this basis, we distinguished two levels of information and noted affinities between Shannon's theory and another developed by Roman Jakobson. As for the problem of communication, we explained three notions from which it was elaborated: background noise, redundancy and filtering.

The reception of the text is the subject of our third chapter. We considered the reading of the text in terms of its reception as a message. In this respect, we related it to the mathematical theory of communication: we treated textual reception as a perception of signals whose processing amounts to a lifting of uncertainty, an evaluation of the probabilities of appearance of the elements the reader encounters. The activity of reading, which undoubtedly aims at understanding the text, appears as an interaction between its object and its subject, between the text and the reader. The processing proper of the received signals takes place during the comprehension process.

To illustrate the mechanisms inherent in this process, in the fourth chapter we based the study of the documentary "Le Québec d'une forêt à l'autre" on an analytical model developed by Frederiksen. This model rests on the two interactionist positions: on the one hand, it exploits textual mechanisms; on the other, it studies the cognitive processes solicited in the reader. On the textual side, we first segmented the text, which facilitates the establishment of inter-propositional relations. This step made it possible to determine which devices guarantee textual unity. We could then argue that the unity of the text is ensured by a cohesive force programmed by lexical, semantic or structural reuse, by contextual coreference, or by cohesion markers. The cognitive part assumed by the reader corresponds to assembly, which consists in producing the meaning of the text by relating the signs to one another and in constructing textual coherence from the cohesive cues. This leads us to consider inferential activity within the work of assembly, insofar as the reader obtains implicit information by correlating explicit information. This type of analysis thus supports the interactionist conception of the act of reading.

In our fifth and final chapter, we define the concept of informative text with reference to chapters I to IV and to the pedagogical guide Et.Si.Je. developed by Ghislain Bourque and Monique Noël-Gaudreault. We propose two definitions, one theoretical and one practical. From a theoretical point of view, the informative text is the result of putting the concept of text in relation with that of information, while from a practical point of view it can be recognised by a certain number of characteristics. To complete our definition, we also took care to distinguish the main existing categories of informative text. To close the chapter, we analysed a text to verify the applications of our definition.
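For readers who want the formula behind the claim that information "is measured on the basis of probabilities", here is a brief recap of Shannon's definitions (our illustration, not a formula reproduced from the thesis): the information content of a message x and the average information of a source X are

```latex
I(x) = -\log_2 p(x), \qquad H(X) = -\sum_{x} p(x)\,\log_2 p(x)
```

The rarer the message, the greater the lifting of uncertainty for the receiver, which is exactly the sense in which the abstract treats information as a quantity relative to the receiver's knowledge.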
APA, Harvard, Vancouver, ISO, and other styles
3

Donnelly, Karen. "Le développement du texte informatif en classe d'immersion au primaire." Thesis, Université Laval, 2013. http://www.theses.ulaval.ca/2013/29400/29400.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Mélançon, Julie. "Effets des habiletés métaphonologiques et métasyntaxiques sur la compréhension d'un texte informatif en 1re année du primaire." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1997. http://www.collectionscanada.ca/obj/s4/f2/dsk3/ftp04/mq25676.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Boganika, Luciane. "Le défi de l'éducation au Brésil et en France : Le processus de lecture des jeunes et des adultes en situation de réinsertion scolaire dans la perspective d'une reprise d'études." Thesis, Université Grenoble Alpes (ComUE), 2018. http://www.theses.fr/2018GREAL023/document.

Full text
Abstract:
For the school reintegration of young people and adults, Brazil runs the programme Educação de Jovens e Adultos (EJA), while France offers the Diplôme d'Accès aux Études Universitaires (DAEU). Each is a complete educational arrangement within the national education system, with official curricula, specific courses, a teaching staff, and diploma-granting examinations. Graduates of both programmes may apply for civil-service examinations; in Brazil they may also enter the selection processes of public and private universities, and in France they may enrol in institutions open to holders of the baccalauréat. In Brazil, however, the training is oriented towards the labour market, aiming at professional reintegration and mobility as well as personal development, whereas in France the main objective is the pursuit of further studies.

In this thesis we focus on the teaching of reading and seek, from a contrastive perspective, to evaluate the effects of each of these two systems on the learning of written comprehension; we also discuss the effects of the school model in which the readers did their initial learning, in particular their first learning of reading.

Our research protocol, administered during study sessions, asks readers to answer reading questionnaires on informative and journalistic texts drawn from online publications. To address our research problem we formulate the following questions: (1) How do these readers read? (2) Which elements of the text do they mobilise? (3) How are these elements taken up in their answers? (4) Which strategies are deployed during reading?

The protocol and its theoretical foundations draw on the work on reading and written comprehension carried out in the LIDILEM laboratory of Université Grenoble Alpes, in the spirit of the studies initiated by Michel Dabène (DABÈNE, FRIER & VISOZ, 1992), and on the work of the Brazilian research team led by Lúcia Cherem and Rosa Nery on question matrices and reading progression. Crossing these indicators (part of the text mobilised, strategy of appropriation of the text, identification of the argumentative path, of value judgements and of enunciative polyphony, formulation of the answer) allows us to reconstruct the coherence of the reading path, from the retrieval of information to the understanding of textual complexity through the analysis of argumentation, and to identify types of reading corresponding to degrees of development of the reading and comprehension process.

We also ask whether framing the work of reading within social thought (FREIRE, 1967) influences the quality of reading and written comprehension and the training of readers. We elaborate our answers with reference to researchers who, like Paulo Freire and Pierre Bourdieu, thought about the social question, and we hold with them that when reading is meaningful to the reader, the work of learning is more motivated and more coherent. Our particular attention to the online press article, which we take as the basic document of this research, is motivated by this perspective of bringing the social practices of writing into dialogue with the teaching and learning of reading and written comprehension.
APA, Harvard, Vancouver, ISO, and other styles
6

Svanberg, Kerstin. "Les sensations sensorielles qu’évoquent les vins - pénibles à exprimer et encore plus à traduire ? : À propos de la richesse des couleurs et des arômes du roi bordelais et de ses confrères." Thesis, Linnéuniversitetet, Institutionen för språk (SPR), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-89214.

Full text
Abstract:
The present study is based on a translation of a text about wine and wine-making from French into Swedish. The purpose has been to identify and find solutions for the challenges that arise in translating a text containing numerous descriptions of sensory sensations, i.e. tastes, smells, appearance and touch, in the field of wine. Based on a set of data retrieved from the two texts, our study focuses on the deviations between descriptions of wines and grape varietals with respect to their grammatical or linguistic categories in the two languages. We have looked at three different types of descriptions: simple descriptors, metaphors and similes. These have then been analyzed from a translation point of view in order to identify which tools are at the translator's disposal for rendering an idiomatic target text.
APA, Harvard, Vancouver, ISO, and other styles
7

Shrestha, Prajol. "Alignement inter-modalités de corpus comparable monolingue." Phd thesis, Université de Nantes, 2013. http://tel.archives-ouvertes.fr/tel-00909179.

Full text
Abstract:
The growing volume of electronic documents available as text or audio (newspapers, radio, television audio recordings, etc.) calls for automated tools for monitoring and navigation. It should be possible, for example, while reading an article in an online newspaper, to access radio broadcasts that correspond to the passage being read. Such fine-grained navigation across media requires aligning "passages" with similar content in documents from different monolingual, comparable modalities. Our work focuses on this problem of aligning short texts in a comparable monolingual and multimodal context. The problem is to define similarity between short texts and to determine which features of these texts help capture that similarity for the alignment process. We contribute to this problem in three parts. The first part attempts to define the similarity on which the alignment process rests. The second part develops a new text representation that facilitates the creation of the reference corpus used to evaluate alignment methods. The third contribution studies different alignment methods and the effect of their components on the alignment process; these components include different text representations, weighting schemes and similarity measures.
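As a minimal illustration of one such combination of components (our sketch, not the thesis's system): bag-of-words vectors with TF-IDF weights compared by cosine similarity, using scikit-learn as an assumed dependency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two short "passages" from different modalities, e.g. a news article
# sentence and an audio transcript segment (invented examples).
passages = [
    "The government announced a new forestry policy on Monday.",
    "On Monday the minister presented the new policy for forests.",
]

# TF-IDF weighting over a bag-of-words representation.
vectors = TfidfVectorizer().fit_transform(passages)

# Cosine similarity in [0, 1]; a threshold would decide alignment.
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"similarity = {score:.2f}")
```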
APA, Harvard, Vancouver, ISO, and other styles
8

Kavanagh, Judith. "The Text Analyzer: A tool for knowledge acquisition from texts." Thesis, University of Ottawa (Canada), 1995. http://hdl.handle.net/10393/10149.

Full text
Abstract:
The world is being inundated with knowledge at an ever-increasing rate. As intelligent beings and users of knowledge, we must find new ways to locate particular items of information in this huge reservoir of knowledge or we will soon be overwhelmed with enormous quantities of documents that no one any longer has time to read. The vast majority of knowledge is still being stored in conventional text written in natural language, such as books and articles, rather than in more "advanced" forms like knowledge bases. With more and more of these texts being stored on-line rather than solely in print, an opportunity exists to make use of the power of the computer to aid in the location and analysis of knowledge in on-line texts. We propose a tool to do this--the Text Analyzer. We have combined methods from computational linguistics and artificial intelligence to provide the users of the Text Analyzer with a variety of options for finding information in documents, verifying the consistency of this information, performing word and conceptual analyses and other operations. Parsing and indexing are not used in the Text Analyzer. The Text Analyzer can be connected to CODE4, a knowledge management system, so that a knowledge base can be constructed as knowledge is found in the text. We believe this tool will be especially useful for linguists, knowledge engineers, and document specialists.
APA, Harvard, Vancouver, ISO, and other styles
9

Rosell, Magnus. "Text Clustering Exploration : Swedish Text Representation and Clustering Results Unraveled." Doctoral thesis, KTH, Numerisk Analys och Datalogi, NADA, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-10129.

Full text
Abstract:
Text clustering divides a set of texts into clusters (parts), so that texts within each cluster are similar in content. It may be used to uncover the structure and content of unknown text sets as well as to give new perspectives on familiar ones. The main contributions of this thesis are an investigation of text representation for Swedish and some extensions of the work on how to use text clustering as an exploration tool. We have also done some work on synonyms and evaluation of clustering results.

Text clustering, at least such as it is treated here, is performed using the vector space model, which is commonly used in information retrieval. This model represents texts by the words that appear in them and considers texts similar in content if they share many words. Languages differ in what is considered a word. We have investigated the impact of some of the characteristics of Swedish on text clustering. Swedish has more morphological variation than for instance English. We show that it is beneficial to use the lemma form of words rather than the word forms. Swedish has a rich production of solid compounds. Most of the constituents of these are used on their own as words and in several different compounds. In fact, Swedish solid compounds often correspond to phrases or open compounds in other languages. Our experiments show that it is beneficial to split solid compounds into their parts when building the representation.

The vector space model does not regard word order. We have tried to extend it with nominal phrases in different ways. We have also tried to differentiate between homographs, words that look alike but mean different things, by augmenting all words with a tag indicating their part of speech. None of our experiments using phrases or part of speech information have shown any improvement over using the ordinary model.

Evaluation of text clustering results is very hard. What is a good partition of a text set is inherently subjective. External quality measures compare a clustering with a (manual) categorization of the same text set. The theoretical best possible value for a measure is known, but it is not obvious what a good value is: text sets differ in difficulty to cluster, and categorizations are more or less adapted to a particular text set. We describe how evaluation can be improved for cases where a text set has more than one categorization. In such cases the result of a clustering can be compared with the result for one of the categorizations, which we assume is a good partition. In some related work we have built a dictionary of synonyms. We use it to compare two different principles for automatic word relation extraction through clustering of words.

Text clustering can be used to explore the contents of a text set. We have developed a visualization method that aids such exploration, and implemented it in a tool called Infomat. It presents the representation matrix directly in two dimensions. When the order of texts and words is changed, for instance by clustering, distributional patterns appear that indicate similarities between texts and words. We have used Infomat to explore a set of free text answers about occupation from a questionnaire given to over 40 000 Swedish twins. The questionnaire also contained a closed answer regarding smoking. We compared several clusterings of the text answers to the closed answer, regarded as a categorization, by means of clustering evaluation. A recurring text cluster of high quality led us to formulate the hypothesis that "farmers smoke less than the average", which we later could verify by reading previous studies. This hypothesis generation method could be used on any set of texts that is coupled with data restricted to a limited number of possible values.
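A minimal sketch of the kind of pipeline the abstract describes, i.e. a vector space representation followed by clustering (scikit-learn is our assumed stand-in, not the thesis's implementation; the Swedish preprocessing the thesis recommends is omitted):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = [
    "skogsbruk och jordbruk på landsbygden",   # farming-related
    "bonden arbetar på gården med traktorn",
    "aktier och räntor på finansmarknaden",    # finance-related
    "banken höjde räntan i går",
]

# Vector space model: texts represented by weighted word occurrences.
# For Swedish, the thesis recommends lemmatising and splitting solid
# compounds first; that preprocessing step is skipped here.
matrix = TfidfVectorizer().fit_transform(texts)

# Partition the texts into two clusters by content similarity.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(matrix)
print(labels)  # e.g. [0 0 1 1]: farming texts vs. finance texts
```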
APA, Harvard, Vancouver, ISO, and other styles
10

Saint-Germain, Isabelle. "Le passage de l'article scientifique au texte vulgarisé : analyse de la structure, du contenu et de la rhétorique des textes." Sherbrooke : Université de Sherbrooke, 2004.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
11

Malanga, Paul R. "Using instruction and practice to improve recall of thematic and text-based information from orally presented texts /." The Ohio State University, 1997. http://rave.ohiolink.edu/etdc/view?acc_num=osu1487946776022526.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Öhrman, Frida. "Texter om QUASI-projektet : Att popularisera forskningsinformation." Thesis, Mälardalen University, Department of Innovation, Design and Product Development, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-343.

Full text
Abstract:

The QUASI project is an interdisciplinary research project involving universities in five European countries. The QUASI collaboration focuses on research on yeast cells. The project ended in May 2007, and its scientific results are to be explained to the general public. The purpose of my degree project has been to design easily understood texts about the QUASI project in order to inform the public about the project and its results. The texts are to be published on a web page. My research question has been: how do you simplify research information so that people without background knowledge of the subject can understand it well? I used several methods to arrive at the final result. By collecting literature and carrying out a questionnaire survey, I obtained material to work from. I then produced my texts, tested them on intended users, revised them, and tested them again. I also tested the pictures in the texts, and then had an illustrator produce new ones. According to the user tests, the texts I produced are easy for the target group to understand. I finished my texts on time and they were approved by the client, so the aim of the degree project was achieved. I have learned that popularising research is a very personal matter. Writing texts is a craft that takes time to learn. I have worked out an approach that works well for me, but I believe everyone has their own way of working. The most important thing is to test the material on intended end users in several rounds, really listen to what they have to say, and revise the texts accordingly.

APA, Harvard, Vancouver, ISO, and other styles
13

Moysset, Bastien. "Détection, localisation et typage de texte dans des images de documents hétérogènes par Réseaux de Neurones Profonds." Thesis, Lyon, 2018. http://www.theses.fr/2018LYSEI044/document.

Full text
Abstract:
Being able to automatically read the text in documents, both printed and handwritten, makes the information they convey accessible. For full-page transcription, detecting and localising the text lines is a crucial step. Traditional line detection methods rely on image-processing approaches, but they generalise poorly to heterogeneous datasets. In this thesis, we propose a deep neural network approach instead. We first propose a one-dimensional segmentation of text paragraphs into lines using a technique inspired by text recognition models, in which Connectionist Temporal Classification (CTC) is used to align the sequences implicitly. We then propose a neural network that directly predicts the coordinates of the boxes bounding the text lines. Adding a confidence score to these hypothesis boxes makes it possible to localise a varying number of objects. We predict objects locally in order to share network parameters across locations and thereby multiply the number of object examples each box predictor sees during training, which compensates for the small size of the available datasets. To recover the contextual information that carries knowledge of the document layout, we add multi-dimensional LSTM recurrent layers between the convolutional layers of our networks. We propose three full-page text recognition strategies that address the need for highly precise text line positions, and we show on the heterogeneous Maurdor dataset how our approach performs on multilingual documents that may be printed or handwritten, in French, English or Arabic, comparing favourably with state-of-the-art methods. Visualising the concepts learned by our neurons highlights the ability of the recurrent layers to convey contextual information.
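A generic sketch of the CTC idea used in the segmentation step (PyTorch stands in here; this is not the thesis's actual network, and the sizes are arbitrary):

```python
import torch
import torch.nn as nn

T, N, C = 50, 1, 20  # time steps, batch size, classes including blank

# Per-step scores that a convolutional/recurrent model would emit.
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)

# Target label sequence; CTC marginalises over all monotonic
# alignments, so no per-step position annotation is needed.
targets = torch.tensor([[3, 7, 12, 5, 9]])
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([5])

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients reach the model that produced the logits
```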
APA, Harvard, Vancouver, ISO, and other styles
14

Sætre, Rune. "GeneTUC: Natural Language Understanding in Medical Text." Doctoral thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2006. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-545.

Full text
Abstract:

Natural Language Understanding (NLU) is a 50-year-old research field, but its application to the molecular biology literature (BioNLU) is less than 10 years old. After the complete human genome sequence was published by the Human Genome Project and Celera in 2001, there has been an explosion of research, shifting the NLU focus from domains like news articles to the domain of molecular biology and medical literature. BioNLU is needed, since almost 2000 new articles are published and indexed every day, and biologists need to know about existing knowledge relevant to their own research. So far, BioNLU results are not as good as in other NLU domains, so more research is needed to build useful NLU applications for biologists.

The work in this PhD thesis is a “proof of concept”. It is the first to show that an existing Question Answering (QA) system can be successfully applied in the hard BioNLU domain, after the essential challenge of unknown entities is solved. The core contribution is a system that discovers and classifies unknown entities and relations between them automatically. The World Wide Web (through Google) is used as the main resource, and the performance is almost as good as other named entity extraction systems, but the advantage of this approach is that it is much simpler and requires less manual labor than any of the other comparable systems.

The first paper in this collection gives an overview of the field of NLU and shows how the Information Extraction (IE) problem can be formulated with Local Grammars. The second paper uses Machine Learning to automatically recognize protein names based on features from the GSearch Engine. In the third paper, GSearch is substituted with Google, and the task is to extract all unknown names belonging to one of 273 biomedical entity classes, such as genes, proteins, processes, etc. After obtaining promising results with Google, the fourth paper shows that this approach can also be used to retrieve interactions or relationships between the named entities. The fifth paper describes an online implementation of the system and shows that the method scales well to a larger set of entities.

The final paper concludes the “proof of concept” research, and shows that the performance of the original GeneTUC NLU system has increased from handling 10% of the sentences in a large collection of abstracts in 2001, to 50% in 2006. This is still not good enough to create a commercial system, but it is believed that another 40% performance gain can be achieved by importing more verb templates into GeneTUC, just like nouns were imported during this work. Work has already begun on this, in the form of a local Masters Thesis.

APA, Harvard, Vancouver, ISO, and other styles
15

Biedert, Ralf [Verfasser]. "Gaze-Based Human-Text Interaction/Text 2.0 / Ralf Biedert." München : Verlag Dr. Hut, 2014. http://d-nb.info/1050331605/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Richards, Eric D. "Goal information and text comprehension." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1999. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape10/PQDD_0003/MQ41764.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Ciravegna, Fabio. "User-defined information extraction from texts." Thesis, University of East Anglia, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.273293.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Adámek, Tomáš. "Metody stemmingu používané při dolování textu." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2010. http://www.nusl.cz/ntk/nusl-235547.

Full text
Abstract:
The main theme of this master's thesis is text mining. The document focuses on English texts and their automatic preprocessing. The main part of the thesis analyses various stemming algorithms (Lovins, Porter and Paice/Husk). Stemming is a procedure for automatically conflating semantically related terms via rule sets. The next part describes the design of an application supporting various stemming algorithms, based on the Java platform, the Swing graphics library and the MVC architecture. The following chapter describes the implementation of the application and of the stemming algorithms. The last part of the thesis reports experiments with the stemming algorithms and compares them with respect to the resulting text classification.
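For a flavour of what such rule sets do (a sketch using NLTK's implementations as a stand-in for the thesis's own Java application):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

words = ["connect", "connected", "connecting", "connections"]

porter = PorterStemmer()
lancaster = LancasterStemmer()  # implements the Paice/Husk algorithm

for w in words:
    # Both algorithms strip suffixes by rules; Paice/Husk is the
    # more aggressive of the two.
    print(w, porter.stem(w), lancaster.stem(w))
# All four forms conflate to the stem "connect" under Porter.
```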
APA, Harvard, Vancouver, ISO, and other styles
19

Pospíšil, Milan. "Extrakce sémantických vztahů z textu." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2010. http://www.nusl.cz/ntk/nusl-412824.

Full text
Abstract:
Many semi-structured documents exist today that we would like to convert into a structured form. The goal of this work is to create a system that makes this task more automated. This is a difficult problem, because most such documents are not generated by computer, so the system has to tolerate variation. Some semantic understanding is also needed, which is why we restrict ourselves to the domain of meeting minutes.
APA, Harvard, Vancouver, ISO, and other styles
20

Popescu, Ana-Maria. "Information extraction from unstructured web text /." Thesis, Connect to this title online; UW restricted, 2007. http://hdl.handle.net/1773/6935.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Hodgson, Grant Michael. "Suggesting Missing Information in Text Documents." BYU ScholarsArchive, 2018. https://scholarsarchive.byu.edu/etd/7296.

Full text
Abstract:
A key part of contract drafting involves thinking of issues that have not been addressed and adding language that addresses them. To assist attorneys with this task, we present a pipeline approach for identifying missing information within a contract section. The pipeline takes a contract section as input and includes 1) identifying sections similar to the input section in a corpus of contract sections; and 2) identifying and suggesting information from the similar sections that is missing from the input section. Using sentence embeddings and principal component analysis, this approach suggests sentences that are helpful for finishing a contract. Synthetic experiments and a user study show that these sentence suggestions are more useful than those of the state-of-the-art topic suggestion algorithm.
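A rough sketch of the two pipeline steps (TF-IDF similarity stands in here for the thesis's sentence embeddings; the corpus, the section texts, and the 0.3 threshold are invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus_sections = [
    "Either party may terminate this agreement with 30 days notice. "
    "Termination does not affect accrued payment obligations.",
    "The supplier shall maintain insurance. Liability is capped at fees paid.",
]
input_section = "Either party may terminate this agreement with 30 days notice."

vec = TfidfVectorizer().fit(corpus_sections + [input_section])
q = vec.transform([input_section])

# Step 1: retrieve the corpus section most similar to the input.
sims = cosine_similarity(q, vec.transform(corpus_sections))[0]
best = corpus_sections[int(np.argmax(sims))]

# Step 2: suggest sentences of that section that the input lacks.
for sent in best.split(". "):
    if cosine_similarity(q, vec.transform([sent]))[0, 0] < 0.3:
        print("Consider adding:", sent)
```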
APA, Harvard, Vancouver, ISO, and other styles
22

Viana, Hugo Henrique Amorim. "Automatic information retrieval through text-mining." Master's thesis, Faculdade de Ciências e Tecnologia, 2013. http://hdl.handle.net/10362/11308.

Full text
Abstract:
The dissertation presented for obtaining the Master’s Degree in Electrical Engineering and Computer Science, at Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia
Nowadays, a huge number of firms in the European Union catalogued as Small and Medium Enterprises (SMEs) employ a great portion of Europe's active workforce. Nonetheless, SMEs cannot afford to implement the methods or tools needed to systematically adopt innovation as part of their business process, and innovation is the engine of competitiveness in a globalised environment, especially in the current socio-economic situation. This thesis provides a platform that, when integrated with the ExtremeFactories (EF) project, helps SMEs become more competitive by means of a monitoring-schedule functionality. The thesis presents a text-mining platform that can schedule the gathering of information through keywords. Several implementation choices were made in developing the platform; the one deserving particular emphasis is the framework, Apache Lucene Core, which supplies an efficient text-mining tool and is used extensively for the purposes of the thesis.
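A toy version of the scheduled keyword-gathering idea (plain Python standing in for the Lucene-based implementation; the document list, keywords and one-second schedule are all invented for illustration):

```python
import sched
import time

documents = [
    "New funding programme for manufacturing SMEs announced",
    "Football results from the weekend league",
]
keywords = {"sme", "innovation", "manufacturing"}

def gather():
    # Report documents matching any monitored keyword.
    for doc in documents:
        if keywords & set(doc.lower().split()):
            print("match:", doc)

# Run the gathering job on a schedule (here: once, one second from now).
scheduler = sched.scheduler(time.time, time.sleep)
scheduler.enter(1, 1, gather)
scheduler.run()
```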
APA, Harvard, Vancouver, ISO, and other styles
23

Staab, Steffen. "Grading knowledge extracting degree information from texts /." Berlin ; Heidelberg : Springer, 2000. http://deposit.ddb.de/cgi-bin/dokserv?idn=965576841.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Cabral, Loni Grimm. "The role of metaphor in informative texts." reponame:Repositório Institucional da UFSC, 1994. https://repositorio.ufsc.br/xmlui/handle/123456789/157849.

Full text
Abstract:
Tese (doutorado) - Universidade Federal de Santa Catarina, Centro de Comunicação e Expressão
Informative texts in Portuguese and in English are analysed in order to observe the cohesive role of metaphor. The analysis is carried out within the framework of matching relations (Winter, 1986), labels (Francis, 1994), lexical repetition (Hoey, 1991), textual patterns (Winter and Hove, 1986), retrospection and prospection (Sinclair, 1992), and the macrofunctions of language (Halliday, 1985). Textual markers distinguishing interpersonal from ideational metaphors were observed. An examination of metaphor in the translation of informative texts confirms its cohesive role and its differentiation by function. Implications of the research for the teaching of reading and text analysis are presented.
APA, Harvard, Vancouver, ISO, and other styles
25

Tabassum, Binte Jafar Jeniya. "Information Extraction From User Generated Noisy Texts." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1606315356821532.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Nguyen, Minh Tien. "Détection de textes générés automatiquement." Thesis, Université Grenoble Alpes (ComUE), 2018. http://www.theses.fr/2018GREAM025/document.

Full text
Abstract:
Automatically generated text has been used on numerous occasions with distinct intentions, ranging from generated comments in an online discussion to much more mischievous tasks, such as manipulating bibliographic information. This thesis first introduces different methods of generating free text that resembles a certain topic, and how such texts can be used. We then tackle several research questions. The first is how, and with which method, to best detect a fully generated document. We then go one step further and address the possibility of detecting a couple of sentences or a short paragraph of automatically generated text, proposing a new method that computes sentence similarity from grammatical structure. The last question is how to detect an automatically generated document without any samples; this covers the case of a new generator, or of a generator from which it is impossible to collect samples. The thesis also deals with the industrial side of the work. A brief overview of the publishing workflow of a high-profile publisher is presented, followed by an analysis of how best to incorporate our detection method into the production workflow. In conclusion, this thesis sheds light on several important research questions about the possibility of detecting automatically generated texts in different settings. Beyond the research aspect, substantial engineering work in a real industrial environment demonstrates the importance of pairing hypothetical research with a real application.
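One crude detection signal, shown purely for illustration (the thesis's grammar-structure method is more involved; the sample strings and the trigram feature are ours): patchwork-style generators tend to reuse word n-grams far more often than human writers do.

```python
from collections import Counter

def repeated_trigram_ratio(text: str) -> float:
    """Share of word trigram occurrences belonging to repeated trigrams."""
    words = text.lower().split()
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in trigrams.values() if c > 1)
    return repeated / total

human = "the committee met on tuesday and discussed the annual budget in detail"
spun = "the results show the results show the results show that it works"

# A high ratio is one cue of low-quality generated text; a real
# detector combines many such features in a classifier.
print(repeated_trigram_ratio(human), repeated_trigram_ratio(spun))
```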
APA, Harvard, Vancouver, ISO, and other styles
27

Carlehed, Claes. "Kognitiva belastningar vid läsning och navigering i elektronisk text." Thesis, University of Skövde, Department of Computer Science, 1998. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-207.

Full text
Abstract:

Today, enormous amounts of electronic text are stored in databases at companies, institutions and similar organisations. These text collections are usually easy to access, easy to edit and easy to distribute. Unfortunately, problems arise when reading and navigating large documents. Cognitive load on short-term memory follows from the difficulty of getting an overview of large documents, and problems also arise with navigation within them. This often leads the reader to print the document in order to escape these and similar problems.

The present work was carried out in cooperation with Familjen Dafgård AB in Källby, where an intranet including large text-based databases is being developed. Studies were conducted with employees of the company to capture their views on navigation and reading in electronic documents.

The method used to shed light on how readers experience different modes of navigation in electronic text was a qualitative interview and a quantitative efficiency measurement of two ways of navigating documents: scrolling through the text or following links.

The results show that the employees mainly prefer linking over scrolling in long text documents. The time study shows a tendency for linking to be somewhat faster than scrolling.

APA, Harvard, Vancouver, ISO, and other styles
28

Reffle, Ulrich [Verfasser]. "Algorithmen und Methoden zur dokumentenspezifischen Analyse historischer und OCR-erfasster Texte / Ulrich Reffle." München : Verlag Dr. Hut, 2011. http://d-nb.info/1017353417/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Chétrit, Héloèise. "Ett verktyg för konstruktion av ontologier från text." Thesis, Linköping University, Department of Computer and Information Science, 2004. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-2228.

Full text
Abstract:

With the growth of information stored on the Internet, especially in the biological field, and with discoveries being made daily in this domain, scientists are faced with an overwhelming number of articles. Reading all published articles is a tedious and time-consuming process, so a way to summarise the information in the articles is needed. A solution is the derivation of an ontology representing the knowledge enclosed in the set of articles and allowing them to be browsed.

In this thesis we present the tool Ontolo, which allows to build an initial ontology of a domain by inserting a set of articles related to that domain in the system. The quality of the ontology construction has been tested by comparing our ontology results for keywords to the ones provided by the Gene Ontology for the same keywords.

The results obtained are quite promising for a first prototype of the system, as it finds many terms common to both ontologies from just a few hundred inserted articles.

APA, Harvard, Vancouver, ISO, and other styles
30

Staab, Steffen [Verfasser]. "Grading knowledge : extracting degree information from texts / Steffen Staab." Berlin, 2000. http://d-nb.info/965576841/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Seidel, Christian [Verfasser]. "System zur dynamischen Erfassung domänenspezifischer Texteinheiten und Texte mit Hilfe von Eigenvektormethoden / Christian Seidel." München : Verlag Dr. Hut, 2010. http://d-nb.info/1008331384/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Romsdorfer, Harald [Verfasser]. "Polyglot Text-to-Speech Synthesis : Text Analysis & Prosody Control / Harald Romsdorfer." Aachen : Shaker, 2009. http://d-nb.info/1156517354/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Leidner, Jochen Lothar. "Toponym resolution in text." Thesis, University of Edinburgh, 2007. http://hdl.handle.net/1842/1849.

Full text
Abstract:
Background. In the area of Geographic Information Systems (GIS), a shared discipline between informatics and geography, the term geo-parsing is used to describe the process of identifying names in text, which in computational linguistics is known as named entity recognition and classification (NERC). The term geo-coding is used for the task of mapping from implicitly geo-referenced datasets (such as structured address records) to explicitly geo-referenced representations (e.g., using latitude and longitude). However, present-day GIS systems provide no automatic geo-coding functionality for unstructured text. In Information Extraction (IE), processing of named entities in text has traditionally been seen as a two-step process comprising a flat text span recognition sub-task and an atomic classification sub-task; relating the text span to a model of the world has been ignored by evaluations such as MUC or ACE (Chinchor (1998); U.S. NIST (2003)). However, spatial and temporal expressions refer to events in space-time, and the grounding of events is a precondition for accurate reasoning. Thus, automatic grounding can improve many applications such as automatic map drawing (e.g. for choosing a focus) and question answering (e.g. for questions like How far is London from Edinburgh?, given a story in which both occur and can be resolved). Whereas temporal grounding has received considerable attention in the recent past (Mani and Wilson (2000); Setzer (2001)), robust spatial grounding has long been neglected. Concentrating on geographic names for populated places, I define the task of automatic Toponym Resolution (TR) as computing the mapping from occurrences of names for places as found in a text to a representation of the extensional semantics of the location referred to (its referent), such as a geographic latitude/longitude footprint. The task of mapping from names to locations is hard due to insufficient and noisy databases, and a large degree of ambiguity: common words need to be distinguished from proper names (geo/non-geo ambiguity), and the mapping between names and locations is ambiguous (London can refer to the capital of the UK or to London, Ontario, Canada, or to about forty other Londons on earth). In addition, names of places and the boundaries referred to change over time, and databases are incomplete.

Objective. I investigate how referentially ambiguous spatial named entities can be grounded, or resolved, with respect to an extensional coordinate model robustly on open-domain news text. I begin by comparing the few algorithms proposed in the literature, and, comparing semi-formal, reconstructed descriptions of them, I factor out a shared repertoire of linguistic heuristics (e.g. rules, patterns) and extra-linguistic knowledge sources (e.g. population sizes). I then investigate how to combine these sources of evidence to obtain a superior method. I also investigate the noise effect introduced by the named entity tagging step that toponym resolution relies on in a sequential system pipeline architecture.

Scope. In this thesis, I investigate a present-day snapshot of terrestrial geography as represented in the gazetteer defined and, accordingly, a collection of present-day news text. I limit the investigation to populated places; geo-coding of artifact names (e.g. airports or bridges) and compositional geographic descriptions (e.g. 40 miles SW of London, near Berlin), for instance, is not attempted. Historic change is a major factor affecting gazetteer construction and ultimately toponym resolution, but it is beyond the scope of this thesis.

Method. While a small number of previous attempts have been made to solve the toponym resolution problem, these were either not evaluated, or evaluation was done by manual inspection of system output instead of curating a reusable reference corpus. Since the relevant literature is scattered across several disciplines (GIS, digital libraries, information retrieval, natural language processing) and descriptions of algorithms are mostly given in informal prose, I attempt to describe them systematically and aim at a reconstruction in a uniform, semi-formal pseudo-code notation for easier re-implementation. A systematic comparison leads to an inventory of heuristics and other sources of evidence. In order to carry out a comparative evaluation procedure, an evaluation resource is required. Unfortunately, to date no gold standard has been curated in the research community. To this end, a reference gazetteer and an associated novel reference corpus with human-labeled referent annotation are created. These are subsequently used to benchmark a selection of the reconstructed algorithms and a novel re-combination of the heuristics catalogued in the inventory. I then compare the performance of the same TR algorithms under three different conditions, namely applying them to (i) the output of human named entity annotation, (ii) automatic annotation using an existing Maximum Entropy sequence tagging model, and (iii) a naïve toponym lookup procedure in a gazetteer.

Evaluation. The algorithms implemented in this thesis are evaluated in an intrinsic or component evaluation. To this end, we define a task-specific matching criterion to be used with traditional Precision (P) and Recall (R) evaluation metrics. This matching criterion is lenient with respect to numerical gazetteer imprecision in situations where one toponym instance is marked up with different gazetteer entries in the gold standard and the test set, respectively, but where these refer to the same candidate referent, caused by multiple near-duplicate entries in the reference gazetteer.

Main Contributions. The major contributions of this thesis are as follows:

• A new reference corpus in which instances of location named entities have been manually annotated with spatial grounding information for populated places, and an associated reference gazetteer, from which the assigned candidate referents are chosen. This reference gazetteer provides numerical latitude/longitude coordinates (such as 51°32′0″ North, 0°5′0″ West) as well as hierarchical path descriptions (such as London > UK) with respect to a worldwide-coverage geographic taxonomy constructed by combining several large, but noisy, gazetteers. The corpus contains news stories and comprises two sub-corpora: a subset of the REUTERS RCV1 news corpus used for the CoNLL shared task (Tjong Kim Sang and De Meulder (2003)), and a subset of the Fourth Message Understanding Contest (MUC-4; Chinchor (1995)), both available pre-annotated with a gold standard. This corpus will be made available as a reference evaluation resource;

• a new method and implemented system to resolve toponyms that is capable of robustly processing unseen text (open-domain online newswire text) and grounding toponym instances in an extensional model using longitude and latitude coordinates and hierarchical path descriptions, using internal (textual) and external (gazetteer) evidence;

• an empirical analysis of the relative utility of various heuristic biases and other sources of evidence with respect to the toponym resolution task when analysing free news genre text;

• a comparison between a replicated method as described in the literature, which functions as a baseline, and a novel algorithm based on minimality heuristics; and

• several exemplary prototypical applications to show how the resulting toponym resolution methods can be used to create visual surrogates for news stories, a geographic exploration tool for news browsing, geographically-aware document retrieval, and to answer spatial questions (How far...?) in an open-domain question answering system. These applications have only demonstrative character, as a thorough quantitative, task-based (extrinsic) evaluation of the utility of automatic toponym resolution is beyond the scope of this thesis and left for future work.
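A toy version of one heuristic from the kind of inventory described above (resolve each toponym to the candidate referent with the largest population; the gazetteer entries here are invented for illustration):

```python
# Toy gazetteer: toponym -> list of (referent, population, lat, lon).
GAZETTEER = {
    "London": [
        ("London > UK", 8_900_000, 51.51, -0.13),
        ("London > Ontario > Canada", 400_000, 42.98, -81.25),
    ],
}

def resolve(toponym: str):
    """Maximum-population heuristic: pick the most populous referent."""
    candidates = GAZETTEER.get(toponym, [])
    return max(candidates, key=lambda c: c[1], default=None)

print(resolve("London"))
# ('London > UK', 8900000, 51.51, -0.13)
```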
APA, Harvard, Vancouver, ISO, and other styles
34

Krishnan, Sharenya. "Text-Based Information Retrieval Using Relevance Feedback." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-53603.

Full text
Abstract:
Europeana, a freely accessible digital library intended to make Europe's cultural and scientific heritage available to the public, was founded by the European Commission in 2008. The goal was to deliver semantically enriched digital content with multilingual access. Although the amount of content grew, the project soon faced the problem of retrieving information that exists in unstructured form. To complement the Europeana portal services, ASSETS (Advanced Search Service and Enhanced Technological Solutions) was therefore introduced, with services that sought to improve the usability and accessibility of Europeana. My contribution is to study different text-based information retrieval models and their relevance feedback techniques, and to implement one simple model. The thesis gives a detailed overview of the information retrieval process, along with the implementation of the chosen relevance feedback strategy, which generates automatic query expansion. Finally, the thesis concludes with an analysis of the results obtained using relevance feedback, a discussion of the implemented model, and an assessment of future uses of this model, both as a continuation of my work and within ASSETS.
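The abstract does not name the feedback strategy beyond "automatic query expansion"; the classic Rocchio formulation is a representative example. A minimal sketch with an invented toy vocabulary:

import numpy as np

def rocchio_expand(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback: move the query vector towards
    relevant documents and away from non-relevant ones."""
    q = alpha * query_vec
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        q = q - gamma * np.mean(non_relevant, axis=0)
    return np.clip(q, 0.0, None)  # negative term weights are usually dropped

# Toy vocabulary: [europeana, heritage, digital, painting]
query = np.array([1.0, 0.0, 0.0, 0.0])
rel = np.array([[0.9, 0.8, 0.1, 0.0], [0.7, 0.6, 0.3, 0.0]])
nonrel = np.array([[0.1, 0.0, 0.0, 0.9]])
print(rocchio_expand(query, rel, nonrel))
# "heritage" now receives weight, i.e. the query has been automatically expanded.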
APA, Harvard, Vancouver, ISO, and other styles
35

Lanquillon, Carsten. "Enhancing text classification to improve information filtering." [S.l. : s.n.], 2001. http://deposit.ddb.de/cgi-bin/dokserv?idn=963801805.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Foster, Mary Ellen. "Automatically generating text to accompany information graphics." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1999. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape7/PQDD_0001/MQ45946.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Teufel, Simone. "Argumentative zoning : information extraction from scientific text." Thesis, University of Edinburgh, 1999. http://hdl.handle.net/1842/11456.

Full text
Abstract:
We present a new type of analysis for scientific text which we call Argumentative Zoning. We demonstrate that this type of text analysis can be used for generating user-tailored and task-tailored summaries or for performing more informative citation analyses. We also demonstrate that our type of analysis can be applied to unrestricted text, both automatically and by humans. The corpus we use for the analysis (80 conference papers in computational linguistics) is a difficult test bed; it shows great variation with respect to subdomain, writing style, register and linguistic expression. We present reliability studies which we performed on this corpus and for which we used two unrelated trained annotators. The definition of our seven categories (argumentative zones) is not specific to the domain, only to the text type; it is based on the typical argumentation to be found in scientific articles. It reflects the attribution of intellectual ownership in articles, expressions of the author's stance and typical statements about problem-solving processes. On the basis of sentential features, we use a Naive Bayesian model and an n-gram model over sentences to estimate a sentence's argumentative status, taking the hand-annotated corpus as training material. An alternative, symbolic system uses the features in a rule-based way. The general working hypothesis of this thesis is that empirical discourse studies can contribute to practical document management problems: the analysis of a significant amount of naturally occurring text is essential for discourse linguistic theories, and the application of a robust discourse and argumentation analysis can make text understanding techniques for practical document management more robust.
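As a rough illustration of the statistical side of this approach, the following sketch trains a Naive Bayes model to assign argumentative zones to sentences. The zone labels echo the spirit of the scheme, but the feature set (word n-grams) and the training examples are invented stand-ins for the sentential features described above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hand-labelled sentences; the labels follow the spirit of the scheme
# (e.g. AIM, OWN, BACKGROUND, CONTRAST), but these examples are invented.
sentences = [
    "In this paper we propose a new parsing model.",
    "Our system achieves a 12% error reduction.",
    "Smith (1990) introduced the underlying formalism.",
    "However, their approach fails on long sentences.",
]
zones = ["AIM", "OWN", "BACKGROUND", "CONTRAST"]

zoner = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
zoner.fit(sentences, zones)

print(zoner.predict(["We present a novel evaluation metric."]))  # e.g. ['AIM']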
APA, Harvard, Vancouver, ISO, and other styles
38

Murad, Masrah Azrifah Azmi. "Fuzzy text mining for intelligent information retrieval." Thesis, University of Bristol, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.416830.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Kyriakides, Alexandros 1977. "Supervised information retrieval for text and images." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/28426.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.
Includes bibliographical references (leaves 73-74).
We present a novel approach to choosing an appropriate image for a news story. Our method uses the caption of the image to retrieve a suitable image. We have developed a word-extraction engine called WordEx. WordEx uses supervised learning to predict which words in the text of a news story are likely to be present in the caption of an appropriate image. The words extracted by WordEx are then used to retrieve the image from a collection of images. On average, the number of words extracted by WordEx is 10% of the original story text. Therefore, this word-extraction engine can also be applied to text documents for feature reduction.
by Alexandros Kyriakides.
M.Eng.
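A minimal sketch of the supervised word-extraction idea behind WordEx, with an invented feature set and toy labels (the thesis's actual features and training data are not reproduced here):

from sklearn.linear_model import LogisticRegression

def word_features(word, position, story_len):
    """Simplified per-word features; the real WordEx features are richer."""
    return [
        len(word),                 # word length
        position / story_len,      # relative position in the story
        int(word[:1].isupper()),   # capitalised (possible named entity)
        int(word.isdigit()),       # numeric token
    ]

# Toy training data: label 1 means the word occurred in the image caption.
story = "Prime Minister visits flooded town after storm".split()
labels = [1, 1, 0, 1, 1, 0, 1]

X = [word_features(w, i, len(story)) for i, w in enumerate(story)]
clf = LogisticRegression().fit(X, labels)

new_story = "President opens new bridge in capital".split()
Xn = [word_features(w, i, len(new_story)) for i, w in enumerate(new_story)]
extracted = [w for w, p in zip(new_story, clf.predict(Xn)) if p == 1]
print(extracted)  # the small subset of words used to query the image collection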
APA, Harvard, Vancouver, ISO, and other styles
40

Smail, Nabila. "Contribution à l'analyse et à la recherche d'information en texte intégral : application de la transformée en ondelettes pour la recherche et l'analyse de textes." Phd thesis, Université Paris-Est, 2009. http://tel.archives-ouvertes.fr/tel-00504368.

Full text
Abstract:
The purpose of information retrieval systems is to ease access to a set of documents, so that the user can find those that are relevant, that is, those whose content best matches his or her information need. The quality of search results is measured by comparing the system's answers with the ideal answers the user expects to receive: the closer the system's answers are to those the user expects, the better the system is judged to perform. The earliest systems supported Boolean queries, i.e. searches in which only the presence or absence of a query term in a text determines whether it is selected. It was not until the late 1960s that the vector space model was applied to information retrieval problems. In both of these models, only the presence, absence, or frequency of words in the text carries information. Other information retrieval systems adopt this approach for modelling textual data and for computing the similarity between documents or against a query. SMART (System for the Mechanical Analysis and Retrieval of Text) [4] was one of the first retrieval systems to adopt this approach. Several improvements to information retrieval systems exploit the semantic relations that exist between the terms in a document: LSI (Latent Semantic Indexing) [5], for example, achieves this through analysis methods that measure the co-occurrence of two terms in the same context, while Hearst and Morris [6] use online thesauri to create semantic links between terms in a process of lexical chains. In this work we develop a new retrieval system that represents textual data as signals. This new form of representation subsequently allows us to apply numerous mathematical tools from signal theory, such as wavelet transforms, hitherto unused in the field of textual information retrieval.
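As an illustration of this signal view of text, the sketch below builds a binary occurrence signal for one term and decomposes it with a Haar wavelet using the PyWavelets package; the term-to-signal mapping is a deliberately simple assumption, not the encoding used in the thesis:

import numpy as np
import pywt  # PyWavelets

def text_to_signal(text, term):
    """Represent a document as a binary occurrence signal for one term:
    1 where the term appears, 0 elsewhere (a deliberately simple mapping)."""
    tokens = text.lower().split()
    return np.array([1.0 if t == term else 0.0 for t in tokens])

doc = ("information retrieval systems rank documents so that relevant "
       "documents appear before non relevant documents")
signal = text_to_signal(doc, "documents")

# Multi-level Haar decomposition: the approximation coefficients capture
# the coarse distribution of the term, the details capture local bursts.
coeffs = pywt.wavedec(signal, "haar", level=2)
approx, *details = coeffs
print(approx)  # coarse "where in the document" profile of the term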
APA, Harvard, Vancouver, ISO, and other styles
41

Jessop, David M. "Information extraction from chemical patents." Thesis, University of Cambridge, 2011. https://www.repository.cam.ac.uk/handle/1810/238302.

Full text
Abstract:
The automated extraction of semantic chemical data from the existing literature is demonstrated. For reasons of copyright, the work is focused on the patent literature, though the methods are expected to apply equally to other areas of the chemical literature. Hearst patterns are applied to the patent literature in order to discover hyponymic relations describing chemical species. The acquired relations are manually validated to determine the precision of the determined hypernyms (85.0%) and of the asserted hyponymic relations (94.3%). It is demonstrated that the system acquires relations that are not present in the ChEBI ontology, suggesting that it could function as a valuable aid to the ChEBI curators. The relations discovered by this process are formalised using the Web Ontology Language (OWL) to enable re-use. PatentEye, an automated system for the extraction of reactions from chemical patents and their conversion to Chemical Markup Language (CML), is presented. Chemical patents published by the European Patent Office over a ten-week period are used to demonstrate its capability: 4444 reactions are extracted with a precision of 78% and a recall of 64% with regard to determining the identity and amount of reactants employed, and an accuracy of 92% with regard to product identification. NMR spectra are extracted from the text using OSCAR3, which is further developed to greatly increase recall. The resulting system is presented as a significant advance towards the large-scale, automated extraction of high-quality reaction information. Extended Polymer Markup Language (EPML), a CML dialect for describing Markush structures as they are presented in the literature, is developed. Software to exemplify EPML and to enable substructure searching of EPML documents is presented. Further work is recommended to refine the language and code to publication quality before they are presented to the community.
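A minimal regex-based matcher for one classic Hearst pattern ("such as") conveys the idea; the thesis's chemistry-aware implementation is far more elaborate, and a real system would use a noun-phrase chunker to trim the hypernym span:

import re

# One classic Hearst pattern: "NP such as NP (, NP)* (and|or) NP"
PATTERN = re.compile(
    r"(\w[\w\s]*?)\s+such as\s+([\w\s,]+?)(?:\.|$)", re.IGNORECASE)

def hearst_such_as(sentence):
    """Yield (hypernym, hyponym) pairs from a 'such as' construction."""
    m = PATTERN.search(sentence)
    if not m:
        return []
    hypernym = m.group(1).strip()
    hyponyms = re.split(r",\s*|\s+and\s+|\s+or\s+", m.group(2))
    return [(hypernym, h.strip()) for h in hyponyms if h.strip()]

text = ("The mixture contains halogenated solvents such as chloroform, "
        "dichloromethane and carbon tetrachloride.")
for pair in hearst_such_as(text):
    print(pair)
# ('The mixture contains halogenated solvents', 'chloroform'), ... -- a real
# system would chunk the sentence so the hypernym is just "halogenated solvents".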
APA, Harvard, Vancouver, ISO, and other styles
42

Bundschus, Markus. "From Text to Knowledge." Diss., lmu, 2010. http://nbn-resolving.de/urn:nbn:de:bvb:19-118841.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Song, Min Song Il-Yeol. "Robust knowledge extraction over large text collections /." Philadelphia, Pa. : Drexel University, 2005. http://dspace.library.drexel.edu/handle/1860/495.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Käter, Thorsten. "Evaluierung des Text-Retrievalsystems "Intelligent Miner for Text" von IBM : eine Studie im Vergleich zur Evaluierung anderer Systeme /." [S.l. : s.n.], 1999. http://www.bsz-bw.de/cgi-bin/xvms.cgi?SWB8230685.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Hasegawa, Satoshi, Kumi Sato, Shohei Matsunuma, Masaru Miyao, and Kohei Okamoto. "Multilingual Disaster Information System : Information Delivery Using Graphic Text for Mobile Phones." Springer, 2005. http://hdl.handle.net/2237/8651.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Brifkany, Jan, and Yasini Anass El. "Text Recognition in Natural Images : A study in Text Detection." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-282935.

Full text
Abstract:
In recent years, many computer vision methods and solutions have been developed. By combining methods from different areas of computer vision, researchers have been able to build more advanced and sophisticated models. This report covers two categories, text detection and text recognition; these areas are defined, described, and analyzed in the results and discussion chapter. The report addresses an interesting and challenging topic, text recognition in natural images. It set out to assess the improvement in OCR accuracy after three image segmentation methods had been applied to the images. The methods used are maximally stable extremal regions (MSER) and geometric filtering based on geometric properties. The results showed that OCR with segmentation methods achieved better overall accuracy than OCR without them. It was also shown that images with horizontal text orientation yielded better accuracy under OCR with segmentation methods than images with multi-oriented text.
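A rough OpenCV sketch of the MSER-plus-geometric-filtering stage described above; the thresholds are illustrative guesses, not the values used in the report:

import cv2

def candidate_text_regions(image_path):
    """Detect MSER regions and keep those whose geometry is text-like."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(img)

    boxes = []
    for pts in regions:
        x, y, w, h = cv2.boundingRect(pts)
        aspect = w / float(h)
        # Illustrative geometric filters: characters are rarely extremely
        # elongated, tiny, or covering most of the frame.
        if 0.1 < aspect < 10 and 8 < h < img.shape[0] // 2:
            boxes.append((x, y, w, h))
    return boxes

# Regions that survive the filtering would then be cropped and passed
# to an OCR engine, e.g.:
# boxes = candidate_text_regions("sign.jpg")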
APA, Harvard, Vancouver, ISO, and other styles
47

Boynuegri, Akif. "Cross-lingual Information Retrieval On Turkish And English Texts." Master's thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/12611903/index.pdf.

Full text
Abstract:
In this thesis, cross-lingual information retrieval (CLIR) approaches are comparatively evaluated for Turkish and English texts. As a complementary study, knowledge-based methods for word sense disambiguation (WSD), one of the most important components of CLIR, are compared for Turkish words. Query-translation and sense-indexing-based CLIR approaches are used in this study. In the query translation approach, we use automatic and manual word sense disambiguation methods and the Google translation service during the translation of queries. In the sense indexing approach, documents are indexed according to the meanings of words instead of the words themselves, and retrieval is likewise performed according to the meanings of the query words. During the identification of the intended meaning of query terms, manual and automatic word sense disambiguation methods are used and compared to each other. Knowledge-based WSD methods that use different gloss enrichment techniques are compared for Turkish words. The Turkish WordNet is used as the primary knowledge base, with the English WordNet and the Turkish Wikipedia employed as enrichment resources. Meanings of words are identified more precisely by using the semantic relations defined in the WordNets and in the Turkish Wikipedia. Also, during the calculation of the semantic relatedness of senses, the cosine similarity metric is used as an alternative to the word overlap count, and its effects are observed for each WSD method.
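The contrast between word-overlap counting and cosine similarity over glosses can be sketched as follows; the glosses and context are invented, and a real system would use WordNet glosses enriched as described above:

import math
from collections import Counter

def overlap(gloss_a, gloss_b):
    """Lesk-style relatedness: count of shared words between two glosses."""
    return len(set(gloss_a.lower().split()) & set(gloss_b.lower().split()))

def cosine(gloss_a, gloss_b):
    """Cosine similarity between bag-of-words vectors of two glosses."""
    a, b = Counter(gloss_a.lower().split()), Counter(gloss_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

bank_money = "a financial institution that accepts deposits of money"
bank_river = "the sloping land beside a body of water"
context = "she deposited money at the institution yesterday"

print(overlap(context, bank_money), overlap(context, bank_river))  # 2 1
print(cosine(context, bank_money) > cosine(context, bank_river))   # True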
APA, Harvard, Vancouver, ISO, and other styles
48

Sabir, Ahmed. "Enhancing scene text recognition with visual context information." Doctoral thesis, Universitat Politècnica de Catalunya, 2020. http://hdl.handle.net/10803/670286.

Full text
Abstract:
This thesis addresses the problem of improving text spotting systems, which aim to detect and recognize text in unrestricted images (e.g. a street sign, an advertisement, a bus destination, etc.). The goal is to improve the performance of off-the-shelf vision systems by exploiting the semantic information derived from the image itself. The rationale is that knowing the content of the image, or the visual context, can help to decide which words are the correct candidate words. For example, the fact that an image shows a coffee shop makes it more likely that a word on a signboard reads as Dunkin and not unkind. We address this problem by drawing on successful developments in natural language processing and machine learning, in particular learning to re-rank and neural networks, to present post-processing frameworks that improve state-of-the-art text spotting systems without the need for costly data-driven re-training or tuning procedures. Discovering the degree of semantic relatedness of candidate words and their image context is a task related to assessing the semantic similarity between words or text fragments. However, semantic relatedness is more general than similarity (e.g. car, road, and traffic light are related but not similar) and requires certain adaptations. To meet the requirements of these broader perspectives of semantic similarity, we develop two approaches to learn the semantic relatedness of the spotted word and its environmental context: word-to-word (object) or word-to-sentence (caption). In the word-to-word approach, word-embedding-based re-rankers are developed. The re-ranker takes the words from the text spotting baseline and re-ranks them based on the visual context from the object classifier. For the second, an end-to-end neural approach is designed to exploit the image description (caption) at the sentence level as well as the word level (objects) and re-rank candidate words based not only on the visual context but also on the co-occurrence between them. As an additional contribution, to meet the requirements of data-driven approaches such as neural networks, we propose a visual context dataset for this task, in which the publicly available COCO-text dataset [Veit et al. 2016] has been extended with information about the scene (including the objects and places appearing in the image) to enable researchers to include the semantic relations between texts and scene in their text spotting systems, and to offer a common evaluation baseline for such approaches.
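A toy sketch of the word-to-word re-ranking idea: OCR candidates are re-scored by blending recognition confidence with embedding similarity to detected scene objects. The embeddings below are invented three-dimensional stand-ins for real pretrained word vectors:

import numpy as np

# Toy embeddings; a real re-ranker would load pretrained word vectors.
emb = {
    "dunkin": np.array([0.9, 0.8, 0.1]),
    "unkind": np.array([0.1, 0.2, 0.9]),
    "coffee": np.array([0.8, 0.9, 0.0]),
    "shop":   np.array([0.7, 0.6, 0.2]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(candidates, scene_objects, alpha=0.5):
    """Blend the OCR confidence with visual-context relatedness."""
    scored = []
    for word, ocr_conf in candidates:
        ctx = max(cos(emb[word], emb[o]) for o in scene_objects)
        scored.append((alpha * ocr_conf + (1 - alpha) * ctx, word))
    return sorted(scored, reverse=True)

# OCR slightly prefers "unkind", but the scene shows a coffee shop,
# so the re-ranker promotes "dunkin".
print(rerank([("unkind", 0.55), ("dunkin", 0.50)], ["coffee", "shop"]))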
APA, Harvard, Vancouver, ISO, and other styles
49

Tarczyńska, Anna. "Methods of Text Information Extraction in Digital Videos." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-2656.

Full text
Abstract:
Context. The huge number of existing digital video files calls for indexing that makes them easier for users to search. Such indexing can be provided by text information extraction. In this thesis we have analysed and compared methods of text information extraction in digital videos. Furthermore, we have evaluated them in a new context proposed by us, namely their usefulness for sports news indexing and information retrieval.

Objectives. The objectives of this thesis are as follows: providing a better understanding of the nature of text extraction; performing a systematic literature review on methods of text information extraction in digital videos of TV sports news; designing and executing an experiment in the testing environment; evaluating available and promising methods of text information extraction from digital video files in the proposed context of video sports news indexing and retrieval; and providing an adequate solution in that context.

Methods. This thesis uses three research methods: a Systematic Literature Review, Video Content Analysis with a checklist, and an Experiment. The Systematic Literature Review was used to study the nature of text information extraction, to establish the methods and challenges, and to specify an effective way of conducting the experiment. The Video Content Analysis was used to establish the context for the experiment. Finally, the experiment was conducted to answer the main research question: how useful are the methods of text information extraction for the indexation of video sports news and information retrieval?

Results. Through the Systematic Literature Review we identified 29 challenges of the text information extraction methods and 10 chains between them. We extracted 21 tools and 105 different methods, and analysed the relations between them. Through the Video Content Analysis we specified three groups of probability of text extraction from video, and 14 categories for providing video sports news indexation with a taxonomy hierarchy. We conducted the experiment on three video files, with 127 frames, 8970 characters and 1814 words, using the only available tool, MoCA. As a result, we reported 10 errors and proposed recommendations for each of them. We evaluated the tool according to the categories mentioned above, finding four advantages and nine disadvantages.

Conclusions. It is hard to compare the methods described in the literature, because the tools are not available for testing and have not been compared with each other. Furthermore, the values of the recall and precision measures depend strongly on the quality of the text contained in the video, so performing experiments on the same indexed database is necessary. Text information extraction is also time-consuming (because of the huge number of frames in video), and even a high character recognition rate gives a low word recognition rate. Therefore, the usefulness of text information extraction for video indexation is still low. Because most of the text information contained in video news is inserted in post-processing, the text could be captured at the root: during the processing of the original video, by the broadcasting company (e.g. by automatically saving inserted text in a separate file). Then text information extraction would not be necessary for managing the new video files.
APA, Harvard, Vancouver, ISO, and other styles
50

Odd, Joel, and Emil Theologou. "Utilize OCR text to extract receipt data and classify receipts with common Machine Learning algorithms." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-148350.

Full text
Abstract:
This study investigated whether it is feasible to use machine learning tools on OCR-extracted text data to classify receipts and extract specific data points. Two OCR tools were evaluated: the Azure Computer Vision API and the Google Drive REST API; the latter was the main OCR tool used in the project because of its impressive performance. The classification task was mainly to predict which of five given categories a receipt belongs to, with a more challenging sub-task of predicting specific subcategories within those five larger categories. The data points we were trying to extract were the date of purchase on the receipt and the total price of the transaction. The classification was mainly done with the help of scikit-learn, while the extraction of data points was achieved by a simple custom-made N-gram model. The results were promising, with about 94% cross-validation accuracy for classifying receipts by category with the help of a LinearSVC classifier. Our custom model was successful in 72% of cases for the price data point, while the results for extracting the date were less successful, with an accuracy of 50%, which we still consider very promising given the simplistic nature of the custom model.
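A minimal sketch in the spirit of the abstract: a TF-IDF plus LinearSVC pipeline for category prediction, and a naive regex stand-in for the custom N-gram extractor. The receipt texts and category names are invented:

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Invented OCR outputs; the real training data came from scanned receipts.
texts = [
    "ICA Supermarket milk bread total 84.50",
    "Shell fuel unleaded 95 total 612.00",
    "SJ train ticket Stockholm Linkoping total 345.00",
    "Coop groceries apples cheese total 129.90",
]
labels = ["groceries", "fuel", "travel", "groceries"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["Shell diesel fuel total 540.00"]))  # likely ['fuel']

def extract_total(ocr_text):
    """Naive stand-in for the thesis's N-gram model: take the number
    that follows the word 'total'."""
    m = re.search(r"total\s+(\d+[.,]\d{2})", ocr_text, re.IGNORECASE)
    return m.group(1) if m else None

print(extract_total("Shell diesel fuel total 540.00"))  # 540.00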
APA, Harvard, Vancouver, ISO, and other styles
