
Dissertations / Theses on the topic 'Texts classification'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Texts classification.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Berio, Luciano. "Text of Texts." Bärenreiter Verlag, 1998. https://slub.qucosa.de/id/qucosa%3A36791.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Pettersson, Fredrik. "Optimizing Deep Neural Networks for Classification of Short Texts." Thesis, Luleå tekniska universitet, Datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-76811.

Full text
Abstract:
This master's thesis investigates how a state-of-the-art (SOTA) deep neural network (NN) model can be created for a specific natural language processing (NLP) dataset, the effects of using different dimensionality reduction techniques on common pre-trained word embeddings, and how well this model generalizes to a secondary dataset. The research is motivated by two factors. One is that the construction of a machine learning (ML) text classification (TC) model is typically done around a specific dataset and often requires a lot of manual intervention. It is therefore hard to know exactly what procedures to implement for a specific dataset and how the result will be affected. The other reason is that, if the dimensionality of pre-trained embedding vectors can be lowered without losing accuracy, thus saving execution time, other techniques can be used during the time saved to achieve even higher accuracy. A handful of deep neural network architectures are used, namely a convolutional neural network (CNN), a long short-term memory neural network (LSTM) and a bidirectional LSTM (Bi-LSTM) architecture. These deep neural network architectures are combined with four different word embeddings: GoogleNews-vectors-negative300, glove.840B.300d, paragram_300_sl999 and wiki-news-300d-1M. Three main experiments are conducted in this thesis. In the first experiment, a top-performing TC model is created for a recent NLP competition held at Kaggle.com. Each implemented procedure is benchmarked on how the accuracy and execution time of the model are affected. In the second experiment, principal component analysis (PCA) and random projection (RP) are applied to the pre-trained word embeddings used in the top-performing model to investigate how the accuracy and execution time are affected when creating lower-dimensional embedding vectors. In the third experiment, the same model is benchmarked on a separate dataset (Sentiment140) to investigate how well it generalizes to other data and how each implemented procedure affects the accuracy compared to the original dataset. The first experiment results in a bidirectional LSTM model with the three embeddings glove, paragram and wiki-news concatenated together. The model is able to give predictions with an F1 score of 71%, which is good enough to reach 9th place out of 1,401 participating teams in the competition. In the second experiment, using PCA improves the execution time by 13% while lowering the dimensionality of the embeddings by 66% and losing only half a percentage point of F1 score. RP gave a constant accuracy of 66-67% regardless of the projected dimensions, compared to over 70% when using PCA. In the third experiment, the model gained around 12% accuracy from the initial to the final benchmarks, compared to 19% on the competition dataset. The best accuracy achieved on the Sentiment140 dataset is 86%, higher than the 71% achieved on the Quora dataset.
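As a rough illustration of the dimensionality-reduction step described above, the following Python sketch (assuming NumPy and scikit-learn) projects a 300-dimensional embedding matrix down to 100 dimensions with PCA, roughly the 66% reduction discussed in the abstract. The embedding matrix here is random stand-in data, not the GloVe/word2vec/paragram vectors used in the thesis.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
vocab_size, dim = 5000, 300                     # stand-in for a real pre-trained embedding matrix
embeddings = rng.normal(size=(vocab_size, dim))

pca = PCA(n_components=100)                     # 300 -> 100 dimensions (a ~66% reduction)
reduced = pca.fit_transform(embeddings)

print(reduced.shape)                            # (5000, 100)
print(pca.explained_variance_ratio_.sum())      # fraction of variance retained after the reduction

The reduced matrix would then be fed to the classifier in place of the full 300-dimensional vectors.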
APA, Harvard, Vancouver, ISO, and other styles
3

Starr, John Michael. "Quantitative analysis of the Aramaic Qumran texts." Thesis, University of Edinburgh, 2013. http://hdl.handle.net/1842/8947.

Full text
Abstract:
Hirbet-Qumran lies about 15 km south of Jericho. Between 1947 and 1956 eleven caves were discovered that contained thousands of fragments, mainly of prepared animal skin, representing approximately 900 texts, many of which are considered copies. Over 100 of these texts are in Aramaic, though many are short fragments. The provenance of these texts is uncertain, but the fact that copies of some of them are known from beyond Qumran indicates that not all the Aramaic texts found at Qumran are likely to have originated there, though Qumran may have been a site where they were copied. Hitherto, this Aramaic corpus has been referred to as a single entity linguistically, but increasingly it is recognized that heterogeneity of textual features is present. Such heterogeneity may provide clues as to the origins of different texts. The current classification of Qumran texts is in terms of the cave they were found in and the order in which they were found: it does not provide any information about textual relationships between texts. In this thesis I first review the literature to identify suitable quantitative criteria for classification of Aramaic Qumran texts. Second, I determine appropriate statistical methods to classify edited Aramaic Qumran texts according to quantitative textual criteria. Third, I establish ‘proof of principle’ by classifying Hebrew Bible books. Fourth, I classify the Aramaic Qumran texts as a corpus without reference to external reference texts. Fifth, I investigate the alignment of this internally-derived classification with external reference texts. Sixth, I perform confirmatory analyses to determine the number of different text-type groups within the corpus. Seventh, I use this classification to allocate previously unclassified texts to one of these text-types. Finally, I examine the textual characteristics of these different text-types and discuss what this can tell scholars about individual texts and the linguistic development of Aramaic during Second Temple Judaism as reflected by Qumran.
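The thesis's actual statistical procedures are not reproduced here, but the general idea of grouping texts by quantitative textual criteria can be sketched in Python (NumPy and SciPy assumed). The text labels, features and frequencies below are invented for illustration only.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

texts = ["Text A", "Text B", "Text C", "Text D"]           # hypothetical text labels
# rows: texts; columns: relative frequencies of some orthographic/morphological features
features = np.array([
    [0.12, 0.03, 0.40],
    [0.11, 0.04, 0.38],
    [0.02, 0.20, 0.10],
    [0.03, 0.18, 0.12],
])

Z = linkage(features, method="ward")                       # hierarchical clustering of the texts
groups = fcluster(Z, t=2, criterion="maxclust")            # cut the tree into two text-type groups
print(dict(zip(texts, groups)))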
APA, Harvard, Vancouver, ISO, and other styles
4

Nurse, Derek. "Historical texts from the Swahili coast." Swahili Forum 1 (1994) S. 47-85, 1994. https://ul.qucosa.de/id/qucosa%3A11607.

Full text
Abstract:
Between 1977 and 1980 I collected a number of texts on the northern Kenya coast. Most were tape-recorded by myself from oral performances; a few were written down or recorded by others. Most of the current collection consists of texts gathered in this way, plus the Mwiini material, provided by Chuck Kisseberth and originally provided or recorded in Barawa by M. I. Abasheikh, and the Bajuni "contemporary" verse, taken from a publicly available cassette recording by A. M. Msallam in the 1970s.
APA, Harvard, Vancouver, ISO, and other styles
5

Bayram, Ulya. "State-of-Mind Classification From Unstructured Texts Using Statistical Features and Lexical Network Features." University of Cincinnati / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1563274174606657.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Mazyad, Ahmad. "Contribution to automatic text classification : metrics and evolutionary algorithms." Thesis, Littoral, 2018. http://www.theses.fr/2018DUNK0487/document.

Full text
Abstract:
This thesis deals with natural language processing and text mining, at the intersection of machine learning and statistics. We are particularly interested in term weighting schemes (TWS) in the context of supervised learning, and specifically in the text classification (TC) task. In TC, the multi-label classification task has gained a lot of interest in recent years. Multi-label classification from textual data may be found in many modern applications, such as news classification, where the task is to find the categories that a newswire story belongs to (e.g., politics, Middle East, oil) based on its textual content; music genre classification (e.g., jazz, pop, oldies, traditional pop) based on customer reviews; film classification (e.g., action, crime, drama); and product classification (e.g., electronics, computers, accessories). Traditional classification algorithms are generally binary classifiers and are not suited to multi-label classification. The multi-label classification task is therefore transformed into multiple single-label binary tasks. However, this transformation introduces several issues. First, term distributions are only considered with respect to the positive and the negative categories (i.e., information on the correlations between terms and categories is lost). Second, it fails to consider any label dependency (i.e., information on existing correlations between classes is lost). Finally, since all categories but one are grouped into a single category (the negative category), the newly created tasks are imbalanced. This information is commonly used by supervised TWS to improve the effectiveness of the classification system. Hence, after presenting the process of multi-label text classification, and more particularly the TWS, we make an empirical comparison of these methods applied to the multi-label text classification task. We find that the superiority of the supervised methods over the unsupervised methods is still not clear. We then show that these methods are not fully adapted to the multi-label classification problem and that they ignore much statistical information that could be used to improve the classification results. Thus, we propose a new TWS based on information gain. This new method takes into consideration the term distribution not only with respect to the positive and the negative categories but also with respect to all other categories. Finally, aiming at finding specialized TWS that also solve the issue of imbalanced tasks, we study the benefits of using genetic programming to generate TWS for the text classification task. Unlike previous studies, we generate formulas by combining statistical information at a microscopic level (e.g., the number of documents that contain a specific term) instead of using complete TWS. Furthermore, we make use of categorical information (e.g., the number of categories in which a term occurs). Experiments are carried out to measure the impact of these methods on the performance of the model, and we show through these experiments that the results are positive.
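To make the idea of a supervised, information-gain-based term weight concrete, here is a small self-contained Python sketch. The documents and categories are toy examples, and the formula used (entropy of the class distribution minus its conditional entropy given term presence) is the standard information-gain definition, not the exact scheme proposed in the thesis.

import math
from collections import Counter

docs = [("oil prices rise", "politics"),
        ("new jazz album", "music"),
        ("pop singer tours", "music"),
        ("oil deal signed", "politics")]

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(term):
    labels = [lab for _, lab in docs]
    h_c = entropy(list(Counter(labels).values()))           # entropy of the category distribution
    with_t = [lab for txt, lab in docs if term in txt.split()]
    without_t = [lab for txt, lab in docs if term not in txt.split()]
    h_cond = 0.0
    for subset in (with_t, without_t):
        if subset:
            h_cond += len(subset) / len(docs) * entropy(list(Counter(subset).values()))
    return h_c - h_cond                                      # weight of the term across all categories

print(information_gain("oil"))   # 1.0: the term separates the categories perfectly
print(information_gain("new"))   # ~0.31: the term is less informative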
APA, Harvard, Vancouver, ISO, and other styles
7

Marsh, Gordon E. "Conceptualizing Musical Texts: Polytropy and the Aesthetics of Recent Music." Bärenreiter Verlag, 1998. https://slub.qucosa.de/id/qucosa%3A37170.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Nurse, Derek. "Historical texts from the Swahili coast (part 2)." Swahili Forum; 2 (1995), S. 41-72, 1995. https://ul.qucosa.de/id/qucosa%3A11618.

Full text
Abstract:
Historical texts from the Swahili coast (Swahili-English), in Upper Pokomo, Elwana, Mwiini, Bajuni, Pate, Amu, Shela, Matondoni and Mwani: Asili ya Mphokomu; Fumo Liongo; a story; proverbs and riddles; mashairi; Saidi Haji talking about poetry; Kiteko, a story; verse by M. A. Abdulkadir; women's political songs; an old woman reminisces; Mbaraka Msuri, a hadithi; ngano, a story.
APA, Harvard, Vancouver, ISO, and other styles
9

Boynukalin, Zeynep. "Emotion Analysis Of Turkish Texts By Using Machine Learning Methods." Master's thesis, METU, 2012. http://etd.lib.metu.edu.tr/upload/12614521/index.pdf.

Full text
Abstract:
Automatically analysing the emotion in texts is of increasing interest in today's research fields. The aim is to develop a machine that can detect the type of a user's emotion from his/her text. Emotion classification of English texts has been studied by several researchers and promising results have been achieved. In this thesis, an emotion classification study on Turkish texts is introduced. To the best of our knowledge, this is the first study on emotion analysis of Turkish texts. In English there exist some well-defined datasets for the purpose of emotion classification, but we could not find datasets in Turkish suitable for this study. Therefore, another important contribution is the generation of a new dataset in Turkish for emotion analysis. The dataset is generated by combining two types of sources. Several classification algorithms are applied on the dataset and the results are compared. Due to the nature of the Turkish language, new features are added to the existing methods to improve the success of the proposed method.
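A minimal sketch of this kind of experiment, comparing several classification algorithms on an emotion-labelled text set with cross-validation, could look as follows in Python with scikit-learn; the English toy sentences stand in for the Turkish dataset built in the thesis.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = ["I am so happy today", "this is wonderful news", "I feel very sad",
         "that was a terrible loss", "I am scared of the dark", "this frightens me"]
labels = ["joy", "joy", "sadness", "sadness", "fear", "fear"]

for clf in (MultinomialNB(), LinearSVC()):
    pipeline = make_pipeline(CountVectorizer(), clf)         # bag-of-words features + classifier
    scores = cross_val_score(pipeline, texts, labels, cv=2)
    print(type(clf).__name__, scores.mean())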
APA, Harvard, Vancouver, ISO, and other styles
10

Cooper, John Michael. "Words Without Songs? Of Texts, Titles, and Mendelssohn's «Lieder ohne Worte»." Bärenreiter Verlag, 1998. https://slub.qucosa.de/id/qucosa%3A37118.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

van, der Perre Athena. "From execration texts to quarry inscriptions." Universitätsbibliothek Leipzig, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-201900.

Full text
Abstract:
In recent years, 3D imaging has found its way into the world of Egyptology. This lecture will present two case studies where 3D technology is used for the documentation of hieratic inscriptions. The inscriptions, painted in (red) ochre or black paint, were applied on different carriers and required different methodologies. The Egyptian collection of the Royal Museums of Art and History (RMAH Brussels) contains a large number of small decorated and/or inscribed objects. Some of these objects are currently in poor condition (any operation carried out on them can result in considerable material losses), making it necessary to document them in such a way that future scholars can study them in detail without handling them. The EES Project therefore aims to create multispectral 3D images of these fragile objects with a multispectral ‘minidome’ acquisition system, based on the already existing multi-light Portable Light Dome (PLD). The texture/colour values on the created 2D+ and 3D models are interactive data based on a recording process with infrared, red, green, blue and ultraviolet light spectra. Software tools and enhancement filters have been developed which can deal with the different wavelengths in real time. This leads to an easy and cost-effective methodology which combines multispectral imaging with the actual relief characteristics and properties of the physical object. The system is transportable to any collection or excavation in the field. As a case study, the well-known Brussels "Execration Figurines" (Middle Kingdom, c. 1900 BC) were chosen. These figurines are made of unbaked clay and covered with hieratic texts listing names of foreign countries and rulers. The study of this type of collection is hampered mostly by the poor state of conservation of the objects, but also by the only partial preservation of the ink traces in visible light. The method has also been applied to other decorated objects of the RMAH collection, such as a Fayoum portrait, ostraca and decorated objects made of stone, wood and ceramics. The final goal will be to publish the newly created multispectral 3D images on Carmentis (www.carmentis.be), the online catalogue of the RMAH collection, making them accessible to scholars all over the world. The second case study presents the quarry inscriptions of the New Kingdom limestone quarries at Dayr Abu Hinnis (Middle Egypt). These gallery quarries contain hundreds of hieratic inscriptions written on the ceiling. The texts are mainly related to the general administration of the quarry area. In documenting the abundance of ceiling inscriptions and other graffiti, we had to decide on a practice that would allow us not only to capture the "content" but also to document the location and orientation of each record. Every inscription can be photographed in detail, but this is insufficient to give the reader access to vital information concerning the spatial distribution of the inscriptions, which may, for instance, relate to the progress of work. After experimenting with a variety of other methods, we adopted photogrammetric software for 3D modelling of photographs of the quarry ceilings, AGISOFT PHOTOSCAN, which uses structure-from-motion (SFM) algorithms to create three-dimensional images from a series of overlapping two-dimensional images. The ultimate goal of this labour-intensive process in the quarries is not the creation of pure three-dimensional models, but rather to generate an orthophoto of the entire ceiling of a quarry. Based on these images, each graffito can be analysed in context.
APA, Harvard, Vancouver, ISO, and other styles
12

Bayar, Mujdat. "Event Boundary Detection Using Web-casting Texts And Audio-visual Features." Master's thesis, METU, 2011. http://etd.lib.metu.edu.tr/upload/12613755/index.pdf.

Full text
Abstract:
We propose a method to detect events and event boundaries in soccer videos by using web-casting texts and audio-visual features. The events and their inaccurate time information given in web-casting texts need to be aligned with the visual content of the video. Most match reports presented by popular organizations such as uefa.com (the official site of Union of European Football Associations) provide the time information in minutes rather than seconds. We propose a robust method which is able to handle uncertainties in the time points of the events. As a result of our experiments, we claim that our method detects event boundaries satisfactorily for uncertain web-casting texts, and that the use of audio-visual features improves the performance of event boundary detection.
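The abstract does not spell out the alignment algorithm, but the basic idea of resolving a minute-level event time with an audio cue can be sketched as follows (Python with NumPy assumed; the locate_event function, window sizes and energy profile are hypothetical illustrations, not the thesis's method).

import numpy as np

def locate_event(event_minute, audio_energy, before=30, after=90):
    """audio_energy holds one value per second of the match video."""
    centre = event_minute * 60
    start = max(0, centre - before)
    end = min(len(audio_energy), centre + after)
    window = audio_energy[start:end]
    return start + int(np.argmax(window))        # second with the strongest crowd reaction

rng = np.random.default_rng(1)
energy = rng.random(120 * 60)                    # fake energy profile for a 120-minute video
energy[63 * 60 + 17] = 5.0                       # pretend the reaction to a goal peaks at 63:17
print(locate_event(63, energy))                  # -> 3797 (i.e., 63:17)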
APA, Harvard, Vancouver, ISO, and other styles
13

Holmer, Daniel. "Context matters : Classifying Swedish texts using BERT's deep bidirectional word embeddings." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166304.

Full text
Abstract:
When classifying texts using a linear classifier, the texts are commonly represented as feature vectors. Previous methods of representing features as vectors have been unable to capture the context of individual words in the texts, in theory leading to a poor representation of natural language. Bidirectional Encoder Representations from Transformers (BERT) uses a multi-headed self-attention mechanism to create deep bidirectional feature representations, able to model the whole context of all words in a sequence. A BERT model uses a transfer learning approach, where it is pre-trained on a large amount of data and can be further fine-tuned for several downstream tasks. This thesis uses one multilingual and two dedicated Swedish BERT models for the task of classifying Swedish texts as either easy-to-read or of standard complexity in their respective domains. The performance of the different models on the text classification task is then compared both with the feature representation methods used in earlier studies and across the BERT models themselves. The results show that all models performed better on the classification task than the previous methods of feature representation. Furthermore, the dedicated Swedish models show better performance than the multilingual model, with the Swedish model pre-trained on more diverse data outperforming the other.
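A minimal fine-tuning sketch along these lines, using the Hugging Face Transformers library, is shown below. The multilingual checkpoint named here is a stand-in (the dedicated Swedish BERT models used in the thesis would be loaded through the same from_pretrained call), and the two example sentences and their labels are invented.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["Det här är en lätt mening.", "Denna framställning uppvisar avsevärd komplexitet."]
labels = torch.tensor([0, 1])                    # 0 = easy-to-read, 1 = standard complexity

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)

outputs.loss.backward()                          # a full fine-tuning loop would repeat this step
print(outputs.logits.shape)                      # (2, 2): one logit pair per sentence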
APA, Harvard, Vancouver, ISO, and other styles
14

Grieshaber, Frank. "GODOT: graph of dated objects and texts: building a chronological gazetteer for antiquity." Epigraphy Edit-a-thon : editing chronological and geographic data in ancient inscriptions ; April 20-22, 2016 / edited by Monica Berti. Leipzig, 2016. Beitrag 6, 2016. https://ul.qucosa.de/id/qucosa%3A15468.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Naether, Franziska. "Magical Texts in Trismegistos: Ammianus Marcellinus on Oracles in Roman Egypt – or: what Impact had Christianity on Pagan Egyptian Divination?" Universität Leipzig, 2008. https://ul.qucosa.de/id/qucosa%3A33979.

Full text
Abstract:
Ammianus Marcellinus' remarks on oracle practices in the temple of Bes at Abydos serve as the starting point for this study. In RG 19, 12, 3-16 we learn that the unanswered copy of a ticket oracle remained in the temple archive, where it could subsequently be checked for content hostile to the emperor or the "state". This includes, among other things, the question of the time of the emperor's death, forbidden since AD 11. This statement is examined against the extant oracle questions from Egypt, and the source material available in the database on religion, ritual texts, magic and divination is presented. Particular attention is paid to the question of whether genuinely pagan rituals underwent a statistically observable transformation in the early Christian period.
APA, Harvard, Vancouver, ISO, and other styles
16

Feder, Frank. "Cataloguing and editing Coptic Biblical texts in an online database system." Universitätsbibliothek Leipzig, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-201570.

Full text
Abstract:
The Göttingen Virtual Manuscript Room (VMR) offers both an online digital repository for Coptic Biblical manuscripts (ideally, high-resolution images of every manuscript page, all metadata, etc.) and a digital edition of their texts, and ultimately even a critical edition of every biblical book of the Coptic Old Testament based on all available manuscripts. All text data will also be transferred into XML and linguistically annotated. In this way the VMR offers a full physical description of each manuscript and, at the same time, a full edition of its text and language data. Of course, the VMR can be used for manuscripts and texts other than Coptic too.
APA, Harvard, Vancouver, ISO, and other styles
17

Omar, Said S. H. "Yahya Ali Omar. 1998. Three prose texts in the Swahili of Mombasa." Universitätsbibliothek Leipzig, 2012. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-95051.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Caldo, Claudia Ozon. "Texto jurídico e procedimentos de reformulação discursiva." Universidade de São Paulo, 2013. http://www.teses.usp.br/teses/disponiveis/8/8146/tde-05062013-111316/.

Full text
Abstract:
This thesis, developed in the area of French Linguistic, Literary and Translation Studies of the Modern Languages Department of FFLCH/USP, lies at the intersection of Law and the Language Sciences, in a multi- and interdisciplinary perspective on legal discourse. The corpus, composed of legal texts in French and Brazilian Portuguese, has legal discourse as its object of study; in it, concepts of law have over time been modified and integrated into social practices. The comparative method was chosen as the best way to examine these transformations. The first text, chosen for its historical significance, the Declaration of the Rights of Man and of the Citizen (France, 1789), brings to light the concepts of "man", "equality" and "freedom". These same concepts reappear in the Universal Declaration of Human Rights (UN, 1948), in which more explicitly stated rights give greater protection to man. The third document, the Constitution of the Federative Republic of Brazil (Brazil, 1988), details in its 5th article all the rights granted to twentieth-century man, concepts which reappear implicitly in the Law on the Informatized System of the Judicial Process (Brazil, Law 11.419/2006). The observation of these documents raises the hypothesis that the concepts of "man", "freedom", "equality" and "justice" presented are no longer the same as in the first document, which points to a socio-historical, linguistic and discursive evolution. The reformulation even brings about the erasure of a concept. The link that connects the documents is thus the result of a process. The notions of polyphony and dialogism borrowed from Discourse Analysis are developed from Bakhtinian studies and from the argumentative procedures of the rhetoric renewed by Perelman, as well as the discursive reformulations involved. In the first chapter the documents are historically contextualized so that their emergence can be understood. The second, theoretical chapter draws on the Theory of Enunciation and on the concepts of audience, ethos and logos from the Theory of Argumentation. From a linguistic-discursive point of view, the reformulations are treated as the tools of these changes. Such transformations are woven from the relations between language and the Law. The concepts of "transplant", "circulation juridique" and "borrowing" proposed by Comparative Law take into account the differences between the French and Brazilian languages and cultures. The third chapter deals with legal discourse and proposes a classification of the different texts into sub-genres according to their purpose and place of production. The results of the analysis show that the course of time brings about an evolution of the concepts of "man", "freedom", "equality" and "justice", to the point that their material existence in the written text may even be erased, passing from the concrete to the abstract, since their content appears only implicitly in the proposition and publication of the Law on the Informatized System of the Judicial Process.
APA, Harvard, Vancouver, ISO, and other styles
19

Iltache, Samia. "Modélisation ontologique pour la recherche d'information : évaluation de la similarité sémantique de textes et application à la détection de plagiats." Thesis, Toulouse 2, 2018. http://www.theses.fr/2018TOU20121.

Full text
Abstract:
The expansion of the web and the development of information technologies have contributed to the proliferation of digital documents online. This availability of information has the advantage of making knowledge accessible to all, but it raises many problems regarding access to the relevant information that meets a user's need. A first problem is the extraction of useful information from what is available. A second problem concerns the appropriation of this knowledge, which sometimes results in plagiarism. The aim of this thesis is the development of a model that better characterizes documents in order to facilitate access to them and also to detect those presenting a risk of plagiarism. This model relies on domain ontologies for the classification of documents and for computing the similarity of documents belonging to the same domain. We are particularly interested in scientific papers, and specifically in their abstracts, which are short and relatively well-structured texts. The problem is therefore to determine how to assess the semantic proximity/similarity of two papers by examining their respective abstracts. Since a domain ontology gathers the knowledge relative to a given scientific domain, our process is based on two actions: (i) an automatic classification of documents into a domain selected from several candidate domains, which determines the meaning of a document from the global context in which its content is used; and (ii) a comparison of the texts performed on the basis of the construction of what we call the semantic perimeter of each abstract and on a mutual enrichment carried out when comparing the graphs of the abstracts. The semantic comparison of the abstracts relies on a segmentation of their respective content into zones, documentary units reflecting their logical structure; the similarity of the abstracts is then computed by comparing the conceptual graphs of the zones that play the same role.
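As a deliberately crude illustration of comparing two abstracts through the domain concepts they share (the thesis builds and mutually enriches conceptual graphs per zone, which is far richer than this), a flat concept-set overlap could be computed as follows in Python; the domain vocabulary and abstracts are invented.

domain_concepts = {"ontology", "classification", "semantic similarity",
                   "plagiarism", "information retrieval"}

def concepts(text):
    text = text.lower()
    return {c for c in domain_concepts if c in text}         # naive lookup-based concept extraction

abstract_a = "We use an ontology for document classification and semantic similarity."
abstract_b = "Semantic similarity between texts supports plagiarism detection."

shared = concepts(abstract_a) & concepts(abstract_b)
union = concepts(abstract_a) | concepts(abstract_b)
print(shared, len(shared) / len(union))                       # Jaccard-style proximity score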
APA, Harvard, Vancouver, ISO, and other styles
20

Blaschka, Karen, and Monica Berti. "Classical philology goes digital: working on textual phenomena of ancient texts: workshop, Klassische Philologie, Universität Potsdam, Februar 16 - 17, 2017." Universität Potsdam, 2017. https://ul.qucosa.de/id/qucosa%3A20930.

Full text
Abstract:
Digital technologies are constantly changing our daily lives, including the way scholars work. As a result, also Classics is currently subject to constant change. Greek and Latin sources are becoming available in a digital format. The result is that Classical texts are searchable and can be provided with metadata and analyzed to find specific structures. An important keyword in this new scholarly environment is “networking”, because there is a great potential for Classical Philology to collaborate with the Digital Humanities in creating useful tools for textual work. During our workshop scholars who represent several academic disciplines and institutions gathered to talk about their projects. We invited Digital Humanists who have experience with specific issues in Classical Philology and who presented methods and outcomes of their research. In order to enable intensive and efficient work concerning various topics and projects, the workshop was aimed at philologists whose research interests focus on specific phenomena of ancient texts (e.g., similes or quotations). The challenge of extracting and annotating textual data like similes and text reuses poses the same type of practical philological problems to Classicists. Therefore, the workshop provided insight in two main ways: First, in an introductory theoretical section, DH experts presented keynote lectures on specific topics; second, the focus of the workshop was to discuss project ideas with DH experts to explore and explain possibilities for digital implementation, and ideally to offer a platform for potential cooperation. The focus was explicitly on working together to explore ideas and challenges, based also on concrete practical examples. As a result of the workshop, some of the participants agreed on publishing online their abstracts and slides in order to share them with the community of Classicists and Digital Humanists. The publication has been made possible thanks to the generous support of the Open Science Office of the Library of the University of Leipzig.
APA, Harvard, Vancouver, ISO, and other styles
21

Yahaya, Alassan Mahaman Sanoussi. "Amélioration du système de recueils d'information de l'entreprise Semantic Group Company grâce à la constitution de ressources sémantiques." Thesis, Paris 10, 2017. http://www.theses.fr/2017PA100086/document.

Full text
Abstract:
Taking into account the semantic aspect of textual data in the classification task has become a real challenge over the last ten years. This difficulty is compounded by the fact that most of the data available on social networks are short texts, which in particular makes methods based on the bag-of-words representation inefficient. The approach proposed in this research project differs from the approaches proposed in previous work on the enrichment of short messages for three reasons. First, we do not use external knowledge bases such as Wikipedia, because the short messages processed by the company typically come from specific domains. Second, the data to be processed are not used to build the resources, because of the way the tool operates. Third, to our knowledge there is no work that, on the one hand, exploits structured data such as the company's data to build semantic resources and, on the other hand, measures the impact of enrichment on an interactive system for clustering text streams. In this thesis, we propose the creation of resources for enriching short messages in order to improve the performance of the semantic clustering tool of the company Succeed Together. The tool implements supervised and unsupervised classification methods. To build these resources, we use sequential data mining techniques.
APA, Harvard, Vancouver, ISO, and other styles
22

Lucarelli, Rita. "Images of eternity in 3D." Universitätsbibliothek Leipzig, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-201685.

Full text
Abstract:
By using the technique of photogrammetry for the 3D visualization of ancient Egyptian coffins decorated with magical texts and iconography, this project aims at building a new digital platform for an in-depth study of ancient Egyptian funerary culture and its media. It started in August 2015 with the support of a Mellon Fellowship for the Digital Humanities at UC Berkeley and has so far focused on ancient Egyptian coffins kept at the Phoebe A. Hearst Museum of Anthropology of UC Berkeley. The main outcome will be a digital platform that displays a coffin in 3D and lets users pan, rotate, and zoom in on the coffin, clicking on areas of text to highlight them and view an annotated translation together with other metadata (transcription of the hieroglyphic text, bibliography, textual variants, museological data, provenance, etc.).
APA, Harvard, Vancouver, ISO, and other styles
23

Myška, Vojtěch. "Rekurentní neuronové sítě pro klasifikaci textů." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2018. http://www.nusl.cz/ntk/nusl-376953.

Full text
Abstract:
This thesis deals with the design of neural networks for the classification of positive and negative texts. Development took place in the Python programming language. The deep neural network models were designed using the Keras high-level API and the TensorFlow numerical computation library. The computations were performed on a GPU supporting the CUDA architecture. The final outcome of the thesis is a language-independent neural network model for classifying texts at the character level, reaching up to 93.64% accuracy. Training and testing data were provided by multilingual and Yelp databases. The simulations were performed on 1,200,000 English and 12,000 Czech, German and Spanish texts.
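A character-level recurrent classifier of the general kind described above can be sketched in a few lines of Keras; the layer sizes, sequence length and random toy data are illustrative and are not the architecture or data of the thesis.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

max_len, n_chars = 200, 256            # texts encoded as sequences of byte/character ids

model = tf.keras.Sequential([
    layers.Embedding(input_dim=n_chars, output_dim=32),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),       # positive vs. negative text
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data: random character ids standing in for encoded reviews.
x = np.random.randint(0, n_chars, size=(8, max_len))
y = np.random.randint(0, 2, size=(8,))
model.fit(x, y, epochs=1, verbose=0)
print(model.predict(x[:1]).shape)                # (1, 1): probability of the positive class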
APA, Harvard, Vancouver, ISO, and other styles
24

Berti, Monica. "The Digital Marmor Parium." Epigraphy Edit-a-thon : editing chronological and geographic data in ancient inscriptions ; April 20-22, 2016 / edited by Monica Berti. Leipzig, 2016. Beitrag 4, 2016. https://ul.qucosa.de/id/qucosa%3A14455.

Full text
Abstract:
The Digital Marmor Parium is a project of the Alexander von Humboldt Chair of Digital Humanities at the University of Leipzig (http://www.dh.uni-leipzig.de/wo/dmp). The aim of this work is to produce a new digital edition of the so-called Marmor Parium (Parian Marble), a Hellenistic chronicle on a marble slab from the Greek island of Paros. The importance of the document lies in the fact that it preserves a Greek chronology (1581/80-299/98 BC) with a list of kings and archons accompanied by short references to historical events, mainly based on Athenian history.
APA, Harvard, Vancouver, ISO, and other styles
25

Beyer, Stefan, Camilla Di Biase-Dyson, and Nina Wagenknecht. "Annotating figurative language." Universitätsbibliothek Leipzig, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-201537.

Full text
Abstract:
Whereas past and current digital projects in ancient language studies have been concerned with the annotation of linguistic elements and metadata, there is now an increased interest in the annotation of elements above the linguistic level that are determined by context, like figurative language. Such projects bring their own set of problems (the automatisation of annotation is more difficult, for instance), but also allow us to develop new ways of examining the data. For this reason, we have attempted to take an already annotated database of Ancient Egyptian texts and develop a complementary tagging layer rather than starting from scratch with a new database. In this paper, we present our work in developing a metaphor annotation layer for the Late Egyptian text database of Projet Ramsès (Université de Liège) and in so doing address more general questions: 1) How to 'tailor-make' annotation layers to fit other databases? (Workflow) 2) How to make annotations that are flexible enough to be altered in the course of the annotation process? (Project design) 3) What kind of potential do such layers have for integration with existing and future annotations? (Sustainability)
APA, Harvard, Vancouver, ISO, and other styles
26

Maxey, Craig. "Free-text disease classification." Thesis, Monterey, California. Naval Postgraduate School, 2011. http://hdl.handle.net/10945/5554.

Full text
Abstract:
Modern medicine produces data with every patient interaction. While many data elements are easily captured and analyzed, the fundamental record of the patient/clinician interaction is captured in written free text. This thesis provides the foundation for the Military Health System to begin building an auto-classifier for ICD-9 diagnostic codes based on free-text clinician notes. Support Vector Machine models are fit to approximately 84,000 free-text records, providing a means to predict ICD-9 codes for other free-text records. While the research conducted in this thesis does not provide a consummate ICD-9 classification model, it does provide the foundation required for further, more detailed analysis.
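A minimal sketch of this kind of model, assuming scikit-learn, is shown below: TF-IDF features from free-text notes feed a linear SVM that predicts a diagnostic code. The notes and ICD-9 codes are fabricated examples, not Military Health System data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

notes = ["patient reports chest pain and shortness of breath",
         "persistent cough with low grade fever",
         "chest pain radiating to left arm",
         "productive cough and congestion for one week"]
codes = ["786.50", "786.2", "786.50", "786.2"]      # illustrative symptom codes

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(notes, codes)
print(clf.predict(["dry cough and mild fever"]))    # predicted code for an unseen note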
APA, Harvard, Vancouver, ISO, and other styles
27

Dzunic, Zoran Ph D. Massachusetts Institute of Technology. "Text structure-aware classification." Thesis, Massachusetts Institute of Technology, 2009. http://hdl.handle.net/1721.1/53315.

Full text
Abstract:
Bag-of-words representations are used in many NLP applications, such as text classification and sentiment analysis. These representations ignore relations across different sentences in a text and disregard the underlying structure of documents. In this work, we present a method for text classification that takes into account document structure and only considers segments that contain information relevant to the classification task. In contrast to previous work, which assumes that relevance annotation is given, we perform the relevance prediction in an unsupervised fashion. We develop a Conditional Bayesian Network model that incorporates relevance as a hidden variable of a target classifier. Relevance and label predictions are performed jointly, optimizing the relevance component for the best result of the target classifier. Our work demonstrates that incorporating structural information in document analysis yields significant performance gains over bag-of-words approaches on some NLP tasks.
APA, Harvard, Vancouver, ISO, and other styles
28

Barbosa, Alexandre Nunes. "Descoberta de conhecimento aplicado à base de dados textual de saúde." Universidade do Vale do Rio dos Sinos, 2012. http://www.repositorio.jesuita.org.br/handle/UNISINOS/4559.

Full text
Abstract:
This study proposes a process for investigating the content of a database composed of descriptive and pre-structured data from the health domain, more specifically the area of Rheumatology. For the investigation of the database, three sets of interest were composed. The first is formed by one class whose descriptive content relates only to the area of Rheumatology in general and another whose content belongs to other areas of medicine. The second and third sets were constituted after statistical analyses of the database: one formed by the descriptive content associated with the five most frequent ICD codes, and another formed by the descriptive content associated with the three most frequent ICD codes related exclusively to the area of Rheumatology. These sets were pre-processed with classic pre-processing techniques such as stopword removal and stemming. In order to extract patterns whose interpretation results in knowledge, classification and association techniques were applied to the sets of interest, aiming to relate the textual content that describes symptoms of diseases to the pre-structured content that defines the diagnosis of these diseases. These techniques were implemented by applying the Support Vector Machines classification algorithm and the Apriori algorithm for extracting association rules. For the development of this process, theoretical references on data mining were researched, together with a survey of scientific publications on text mining related to Electronic Medical Records, focusing on the content of the databases used, the pre-processing and mining techniques employed in the literature, and the reported results. The classification technique used in this study achieved over 80% accuracy, demonstrating the algorithm's capacity to correctly label health data related to the domain of interest. Associations between textual content and pre-structured content were also discovered which, according to expert analysis, may raise questions about the use of certain ICD codes at the data's place of origin.
APA, Harvard, Vancouver, ISO, and other styles
29

Baker, Simon. "Semantic text classification for cancer text mining." Thesis, University of Cambridge, 2018. https://www.repository.cam.ac.uk/handle/1810/275838.

Full text
Abstract:
Cancer researchers and oncologists benefit greatly from text mining major knowledge sources in biomedicine such as PubMed. Fundamentally, text mining depends on accurate text classification. In conventional natural language processing (NLP), this requires experts to annotate scientific text, which is costly and time-consuming, resulting in small labelled datasets. This leads to extensive feature engineering and handcrafting in order to fully utilise small labelled datasets, which is again time-consuming and not portable between tasks and domains. In this work, we explore emerging neural network methods to reduce the burden of feature engineering while outperforming the accuracy of conventional pipeline NLP techniques. We focus specifically on the cancer domain in terms of applications, where we introduce two NLP classification tasks and datasets: the first task is that of semantic text classification according to the Hallmarks of Cancer (HoC), which enables text mining of scientific literature assisted by a taxonomy that explains the processes by which cancer starts and spreads in the body; the second task concerns the exposure routes by which chemicals enter the body and may lead to exposure to carcinogens. We present several novel contributions. We introduce two new semantic classification tasks (the hallmarks and the exposure routes) at both sentence and document levels along with accompanying datasets, and implement and investigate a conventional pipeline NLP classification approach for both tasks, performing both intrinsic and extrinsic evaluation. We propose a new approach to classification using multilevel embeddings and apply this approach to several tasks; we subsequently apply deep learning methods to the task of hallmark classification and evaluate their outcome. Utilising our text classification methods, we develop two novel text mining tools targeting real-world cancer researchers. The first tool is a cancer hallmark text mining tool that identifies associations between a search query and cancer hallmarks; the second is a new literature-based discovery (LBD) system designed for the cancer domain. We evaluate both tools with end users (cancer researchers) and find that they demonstrate good accuracy and promising potential for cancer research.
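A much-simplified sketch of the multi-label set-up, using scikit-learn and two invented hallmark-like labels rather than the thesis's datasets or neural models, could look like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

sentences = ["The mutation promotes sustained proliferative signaling.",
             "Tumour cells evade apoptosis and resist cell death.",
             "The compound induces apoptosis and blocks proliferation."]
labels = [["proliferation"], ["cell death"], ["proliferation", "cell death"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                        # one binary column per label

clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(sentences, Y)
pred = clf.predict(["This pathway drives uncontrolled proliferation."])
print(mlb.inverse_transform(pred))                   # predicted labels (may be empty on such tiny data)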
APA, Harvard, Vancouver, ISO, and other styles
30

Tisserant, Guillaume. "Généralisation de données textuelles adaptée à la classification automatique." Thesis, Montpellier, 2015. http://www.theses.fr/2015MONTS231/document.

Full text
Abstract:
The classification of text documents is a long-standing task. Very early on, documents of many kinds were grouped together in order to centralise knowledge, and classification and indexing systems were created so that documents could be found easily according to readers' needs. With the growing number of documents and the arrival of computing and then the internet, building text classification systems has become a crucial issue. Textual data, however, are complex and rich in nature and difficult to process automatically. In this context, this thesis proposes an original methodology for organising textual information so as to make it easier to access. Our approaches to automatic text classification and to the extraction of semantic information make it possible to retrieve relevant information quickly and accurately. More precisely, this manuscript presents new forms of text representation that facilitate their processing for automatic classification tasks. A method for the partial generalisation of textual data (the GenDesc approach), based on statistical and morpho-syntactic criteria, is proposed. The thesis also addresses the construction of phrases and the use of semantic information to improve document representation. Through numerous experiments we demonstrate the relevance and genericity of our proposals, which improve classification results. Finally, in the context of fast-growing social networks, a method for the automatic generation of semantically meaningful hashtags is proposed. Our approach relies on statistical measures, semantic resources and the use of syntactic information. The generated hashtags can then be exploited for information retrieval tasks over large volumes of data.
APA, Harvard, Vancouver, ISO, and other styles
31

Olin, Per. "Evaluation of text classification techniques for log file classification." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166641.

Full text
Abstract:
System log files are filled with logged events, status codes, and other messages. By analyzing the log files, the system's current state can be determined and it can be established whether something went wrong during execution. Log file analysis has been studied for some time, and recent studies have shown state-of-the-art performance using machine learning techniques. In this thesis, document classification solutions were tested on log files in order to distinguish regular system runs from abnormal ones. To solve this task, supervised and unsupervised learning methods were combined: Doc2Vec was used to extract document features, and Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) based architectures were used for the classification task. With these models and preprocessing techniques, the tested models yielded an F1-score and accuracy above 95% when classifying log files.
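The following sketch illustrates the unsupervised feature-extraction step described above with gensim's Doc2Vec, feeding the inferred document vectors to a simple classifier (a logistic regression stand-in here, whereas the thesis uses CNN and LSTM architectures); the log lines, labels, and hyperparameters are illustrative assumptions:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Toy "log files": each document is the concatenated messages of one run.
runs = [
    "service started ok heartbeat ok shutdown clean",
    "service started ok heartbeat ok shutdown clean",
    "error timeout retry failed stacktrace abort",
    "error disk full write failed abort",
]
labels = [0, 0, 1, 1]  # 0 = regular run, 1 = abnormal run

corpus = [TaggedDocument(words=r.split(), tags=[i]) for i, r in enumerate(runs)]

# Unsupervised step: learn fixed-size document embeddings.
d2v = Doc2Vec(vector_size=32, min_count=1, epochs=50)
d2v.build_vocab(corpus)
d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.epochs)

# Supervised step: classify runs from their inferred vectors.
X = [d2v.infer_vector(r.split()) for r in runs]
clf = LogisticRegression().fit(X, labels)
print(clf.predict([d2v.infer_vector("error timeout abort".split())]))
```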
APA, Harvard, Vancouver, ISO, and other styles
32

Danuser, Hermann. "Der Text und die Texte. Über Singularisierung und Pluralisierung einer Kategorie." Bärenreiter Verlag, 1998. https://slub.qucosa.de/id/qucosa%3A36795.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Eriksson, Linus, and Kevin Frejdh. "Swedish biomedical text-miningand classification." Thesis, KTH, Hälsoinformatik och logistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-278067.

Full text
Abstract:
Manual classification of text is both time consuming and expensive. However, it is a necessity within the field of biomedicine, for example to be able to quantify biomedical data. In this study, two different approaches were researched regarding the possibility of using small amounts of training data to create text classification models that are able to understand and classify biomedical texts. The study investigated whether a specialized model should be considered a requirement for this purpose, or whether a generic model might suffice. The two models were based on publicly available versions: one specialized to understand English biomedical texts, and the other to understand ordinary Swedish texts. The Swedish model was introduced to a new field of texts while the English model had to work on translated Swedish texts. The results were quite low, but did indicate that the method with the Swedish model was more reliable, performing almost twice as well as its English counterpart. The study concluded that there is potential in using general models as a base and then tuning them for more specialized fields, even with small amounts of data. Keywords: NLP, text mining, biomedical texts, classification, labelling, models, BERT, machine learning, FIC, ICF.
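A minimal sketch of fine-tuning a pretrained BERT model for classification, in the spirit of the comparison described above; the checkpoint name, number of labels, and example sentence are assumptions for illustration, not the thesis's actual setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint: a general-domain Swedish BERT; the thesis compares
# such a model with an English biomedical BERT, but the exact checkpoints are
# not specified here.
checkpoint = "KB/bert-base-swedish-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

texts = ["patienten har nedsatt rörlighet i höger arm"]  # toy example sentence
labels = torch.tensor([1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # one fine-tuning step would follow with an optimizer
print(outputs.logits.argmax(dim=-1))
```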
APA, Harvard, Vancouver, ISO, and other styles
34

Prabowo, Rudy. "Ontology-based automatic text classification." Thesis, University of Wolverhampton, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.418665.

Full text
Abstract:
This research investigates to what extent ontologies can be used to achieve accurate classification performance for an automatic text classifier, called the Automatic Classification Engine (ACE). The task of the classifier is to classify Web pages with respect to the Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC) schemes. In particular, this research focuses on how to (1) build a set of ontologies which can provide a mechanism to enable machine reasoning; (2) define the mappings between the ontologies and the two classification schemes; and (3) implement an ontology-based classifier. The design and implementation of the classifier concentrate on developing an ontology-based classification model. Given a Web page, the classifier applies the model to carry out reasoning and determine terms from within the Web page which represent significant concepts. The classifier then uses the mappings to determine the associated DDC and LCC classes of the significant concepts, and assigns those classes to the Web page. The research also investigates a number of approaches which can be applied to extend the coverage of the ontologies in a semi-automatic way, since manually constructing ontologies is time consuming. The investigation leads to the design and implementation of a semi-automatic ontology construction system which can recognise new potential terms. By using an ontology editor, those new terms can be integrated into their associated ontologies. An experiment was conducted to validate the effectiveness of the classification model, in which the classifier classified a set of collections of Web pages. The performance of the classifier was measured in terms of its coverage and accuracy. The experimental evidence shows that the ontology-based automatic text classification approach achieved a better level of performance than the existing approaches.
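A highly simplified sketch of the term-to-scheme mapping idea described above is given below; the tiny ontology, the concept-to-class mappings, and the scoring rule are illustrative assumptions rather than the thesis's actual model:

```python
# Toy "ontology": surface terms grouped under concepts.
ontology = {
    "machine learning": {"neural network", "classifier", "training data"},
    "astronomy": {"telescope", "galaxy", "orbit"},
}

# Assumed mappings from concepts to DDC / LCC classes.
concept_to_classes = {
    "machine learning": {"DDC": "006.31", "LCC": "Q325.5"},
    "astronomy": {"DDC": "520", "LCC": "QB43"},
}

def classify(page_text: str):
    tokens = page_text.lower()
    # Count how many terms of each concept occur in the page.
    scores = {
        concept: sum(term in tokens for term in terms)
        for concept, terms in ontology.items()
    }
    best = max(scores, key=scores.get)
    return best, concept_to_classes[best]

print(classify("We trained a neural network classifier on labelled training data."))
```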
APA, Harvard, Vancouver, ISO, and other styles
35

Danielsson, Benjamin. "A Study on Text Classification Methods and Text Features." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-159992.

Full text
Abstract:
When it comes to the task of classification, the data used for training is the most crucial part. It follows that how this data is processed and presented to the classifier plays an equally important role. This thesis investigates the performance of multiple classifiers depending on the features that are used, the type of classes to classify and the optimization of said classifiers. The classifiers of interest are support-vector machines (SMO) and multilayer perceptrons (MLP); the features tested are word vector spaces and text complexity measures, along with principal component analysis (PCA) on the complexity measures. The features are created based on the Stockholm-Umeå Corpus (SUC) and DigInclude, a dataset containing standard and easy-to-read sentences. For the SUC dataset the classifiers attempted to classify texts into nine different text categories, while for the DigInclude dataset the sentences were classified into either standard or simplified classes. The classification tasks on the DigInclude dataset showed poor performance in all trials. The SUC dataset showed best performance when using SMO in combination with word vector spaces. Comparing the SMO classifier on the text complexity measures when using or not using PCA showed that performance was largely unchanged between the two, although not using PCA had slightly better performance.
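The sketch below illustrates the kind of comparison described above, training an SVM on numeric text-complexity features with and without PCA; the feature values and dimensions are fabricated for illustration and do not come from SUC or DigInclude:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))    # 20 hypothetical complexity measures per text
y = rng.integers(0, 2, size=200)  # two text categories

with_pca = make_pipeline(StandardScaler(), PCA(n_components=5), SVC())
without_pca = make_pipeline(StandardScaler(), SVC())

print("with PCA   :", cross_val_score(with_pca, X, y, cv=5).mean())
print("without PCA:", cross_val_score(without_pca, X, y, cv=5).mean())
```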
APA, Harvard, Vancouver, ISO, and other styles
36

Hirsch, Laurence Benjamin. "An evolutionary approach to text classification." Thesis, Royal Holloway, University of London, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.429566.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Lyra, Risto Matti Juhani. "Topical subcategory structure in text classification." Thesis, University of Sussex, 2019. http://sro.sussex.ac.uk/id/eprint/81340/.

Full text
Abstract:
Data sets with rich topical structure are common in many real world text classification tasks. A single data set often contains a wide variety of topics and, in a typical task, documents belonging to each class are dispersed across many of the topics. Often, a complex relationship exists between the topic a document discusses and the class label: positive or negative sentiment is expressed in documents from many different topics, but knowing the topic does not necessarily help in determining the sentiment label. We know from tasks such as Domain Adaptation that sentiment is expressed in different ways under different topics. Topical context can in some cases even reverse the sentiment polarity of words: to be sharp is a good quality for knives but bad for singers. This property can be found in many different document classification tasks. Standard document classification algorithms do not account for or take advantage of topical diversity; instead, classifiers are usually trained with the tacit assumption that topical diversity does not play a role. This thesis is focused on the interplay between the topical structure of corpora, how the target labels in a classification task distribute over the topics and how the topical structure can be utilised in building ensemble models for text classification. We show empirically that a dataset with rich topical structure can be problematic for single classifiers, and we develop two novel ensemble models to address the issues. We focus on two document classification tasks: document level sentiment analysis of product reviews and hierarchical categorisation of news text. For each task we develop a novel ensemble method that utilises topic models to address the shortcomings of traditional text classification algorithms. Our contribution is in showing empirically that the class association of document features is topic dependent. We show that using the topical context of documents for building ensembles is beneficial for some tasks, and present two new ensemble models for document classification. We also provide a fresh viewpoint for reasoning about the relationship of class labels, topical categories and document features.
APA, Harvard, Vancouver, ISO, and other styles
38

Anusha, Anusha. "Word Segmentation for Classification of Text." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-396969.

Full text
Abstract:
Compounding is a highly productive word-formation process in some languages and is often problematic for natural language processing applications. Word segmentation is the problem of splitting a string of written language into its component words. The purpose of this research is to carry out a comparative study of different word segmentation techniques and to identify the best technique for aiding keyword extraction from text. English was chosen as the language. Dictionary-based and machine learning approaches were used to split the compound words. This research also aims at evaluating the quality of a word segmentation by comparing it with a reference segmentation. Results indicated that dictionary-based word segmentation performed better than machine learning segmentation when technical words were involved. The results also suggest that, to improve text classification, improving the quality of the text alone is not enough.
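As a sketch of the dictionary-based approach mentioned above, the greedy longest-match segmenter below splits a compound by repeatedly taking the longest dictionary word from the left; the dictionary and example are illustrative assumptions:

```python
def max_match(compound: str, dictionary: set) -> list:
    """Greedy longest-match segmentation from left to right."""
    words, i = [], 0
    while i < len(compound):
        for j in range(len(compound), i, -1):        # longest candidate first
            if compound[i:j] in dictionary or j == i + 1:
                words.append(compound[i:j])          # fall back to one character
                i = j
                break
    return words

dictionary = {"data", "base", "database", "back", "up", "backup", "server"}
print(max_match("databasebackupserver", dictionary))
# ['database', 'backup', 'server']
```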
APA, Harvard, Vancouver, ISO, and other styles
39

Zechner, Niklas. "A novel approach to text classification." Doctoral thesis, Umeå universitet, Institutionen för datavetenskap, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-138917.

Full text
Abstract:
This thesis explores the foundations of text classification, using both empirical and deductive methods, with a focus on author identification and syntactic methods. We strive for a thorough theoretical understanding of what affects the effectiveness of classification in general.  To begin with, we systematically investigate the effects of some parameters on the accuracy of author identification. How is the accuracy affected by the number of candidate authors, and the amount of data per candidate? Are there differences in how methods react to the changes in parameters? Using the same techniques, we see indications that methods previously thought to be topic-independent might not be so, but that syntactic methods may be the best option for avoiding topic dependence. This means that previous studies may have overestimated the power of lexical methods. We also briefly look for ways of spotting which particular features might be the most effective for classification. Apart from author identification, we apply similar methods to identifying properties of the author, including age and gender, and attempt to estimate the number of distinct authors in a text sample. In all cases, the techniques are proven viable if not overwhelmingly accurate, and we see that lexical and syntactic methods give very similar results.  In the final parts, we see some results of automata theory that can be of use for syntactic analysis and classification. First, we generalise a known algorithm for finding a list of the best-ranked strings according to a weighted automaton, to doing the same with trees and a tree automaton. This result can be of use for speeding up parsing, which often runs in several steps, where each step needs several trees from the previous as input. Second, we use a compressed version of deterministic finite automata, known as failure automata, and prove that finding the optimal compression is NP-complete, but that there are efficient algorithms for finding good approximations. Third, we find and prove the derivatives of regular expressions with cuts. Derivatives are an operation on expressions to calculate the remaining expression after reading a given symbol, and cuts are an extension to regular expressions found in many programming languages. Together, these findings may be able to improve on the syntactic analysis which we have seen is a valuable tool for text classification.
APA, Harvard, Vancouver, ISO, and other styles
40

Dahlstedt, Olle. "Automatic Handwritten Text Detection and Classification." Thesis, Uppsala universitet, Avdelningen för visuell information och interaktion, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-453809.

Full text
Abstract:
As more and more organizations digitize their records, the need for automatic document processing software increases. In particular, the rise of the 'digital humanities' brings a new set of problems concerning how to digitize historical archival material in an efficient and accurate manner. The transcription of archival material, such as handwritten spreadsheets, into formats fit for research purposes is still expensive and plagued by tedious manual labor. Over the decades, research in handwritten text recognition has focused on text line extraction and recognition. In this thesis, we examine document images that contain complex details, contain more categories of text than handwriting, and contain handwritten text that is not easily separated into lines. The thesis examines the sub-problem of handwritten text segmentation in detail. We propose a broad definition of text segmentation that requires both text detection and text classification, since this enables us to detect multiple kinds of text within the same image. The aim is to design a system which can detect and identify both handwriting and machine-printed text within the same image. Working with photographs of spreadsheet documents from the years 1871-1951, a top-down, layout-agnostic image processing pipeline is developed. Different kinds of preprocessing are examined: to correct illumination and enhance contrast before binarization, and to detect and clear line contours. To achieve text region detection, we evaluate connected components labeling and MSER as region detectors, extracting textual and non-textual sub-images. On detected sub-images, we perform a Bag-of-Visual-Words quantization of k-means clustered feature descriptor vectors and perform categorical classification by training a Naïve Bayes classifier on the feature distances to the cluster centroids. Results include a novel two-stage illumination correction and contrast enhancement algorithm that improves document quality as a precursor to binarization, increasing the mean grayscale values of an image while retaining low grayscale variance. Region detectors are evaluated on images with different types of preprocessing, and the results show that clearing document outlines influences text region detection. Training on a small sample of sub-images, the categorical classification model proves viable for discrimination between machine-printed text and handwriting, enabling the use of this model for further recognition purposes.
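A minimal sketch of the region-detection step described above, using OpenCV's Otsu binarization and connected-components analysis to propose text-like regions; the file name and size thresholds are illustrative assumptions:

```python
import cv2

# Hypothetical input image of a handwritten spreadsheet page.
gray = cv2.imread("spreadsheet_page.png", cv2.IMREAD_GRAYSCALE)

# Otsu binarization (inverted so ink becomes foreground).
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Connected components with statistics: each component is a candidate region.
n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)

candidates = []
for i in range(1, n):  # label 0 is the background
    x, y, w, h, area = stats[i]
    if 20 < area < 5000 and h < 200:  # crude filters for text-sized blobs
        candidates.append((x, y, w, h))

print(f"{len(candidates)} candidate text regions")
```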
APA, Harvard, Vancouver, ISO, and other styles
41

Garcia, Constantino Matias. "On the use of text classification methods for text summarisation." Thesis, University of Liverpool, 2013. http://livrepository.liverpool.ac.uk/12957/.

Full text
Abstract:
This thesis describes research work undertaken in the fields of text and questionnaire mining. More specifically, the research work is directed at the use of text classification techniques for the purpose of summarising the free text part of questionnaires. In this thesis text summarisation is conceived of as a form of text classification, in that the classes assigned to text documents can be viewed as an indication (summarisation) of the main ideas of the original free text in a coherent and reduced form. The reason for considering this type of summary is that summarising unstructured free text, such as that found in questionnaires, is not deemed to be effective using conventional text summarisation techniques. Four approaches are described in the context of the classification summarisation of free text from different sources, focusing on the free text part of questionnaires. The first approach considers the use of standard classification techniques for text summarisation and was motivated by the desire to establish a benchmark with which the more specialised summarisation classification techniques presented later in this thesis could be compared. The second approach, called Classifier Generation Using Secondary Data (CGUSD), addresses the case where the available data is not considered sufficient for training purposes (or possibly where no data is available at all). The third approach, called Semi-Automated Rule Summarisation Extraction Tool (SARSET), presents a semi-automated classification technique to support document summarisation classification in which the domain experts are more involved in the classifier generation process; the idea is that this might serve to produce more effective summaries. The fourth is a hierarchical summarisation classification approach which assumes that text summarisation can be achieved using a classification approach whereby several class labels can be associated with documents, which then constitute the summarisation. For evaluation purposes three types of text were considered: (i) questionnaire free text, (ii) text from medical abstracts and (iii) text from news stories.
APA, Harvard, Vancouver, ISO, and other styles
42

Sinoara, Roberta Akemi. "Aspectos semânticos na representação de textos para classificação automática." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-10102018-143520/.

Full text
Abstract:
Text Mining applications are numerous and varied, since a huge amount of textual data is created daily. The quality of the final solution of a Text Mining process depends, among other factors, on the adopted text representation model. Despite the fact that syntactic and semantic relations influence natural language meaning, traditional text representation models are limited to words. The use of such models does not allow the differentiation of documents that use the same vocabulary but present different ideas about the same subject. The motivation of this work relies on the diversity of text classification applications, the potential of vector space model representations and the challenge of dealing with text semantics. With the general purpose of advancing the field of semantic representation of documents, we first conducted a systematic mapping study of semantics-concerned Text Mining studies and categorized classification problems according to their semantic complexity. Then, we approached semantic aspects of texts through the proposal, analysis, and evaluation of seven text representation models: (i) gBoED, which incorporates text semantics through the use of domain expressions; (ii) Uni-based, which takes advantage of word sense disambiguation and hypernym relations; (iii) SR-based Terms and SR-based Sentences, which make use of semantic role labels; (iv) NASARIdocs, Babel2Vec and NASARI+Babel2Vec, which take advantage of word sense disambiguation and embeddings of words and senses. We analyzed the expressiveness and interpretability of the proposed text representation models and evaluated their classification performance against different literature models. While the proposed models gBoED, Uni-based, SR-based Terms and SR-based Sentences have improved expressiveness, the proposals NASARIdocs, Babel2Vec and NASARI+Babel2Vec are latently enriched by the semantics of embeddings obtained from a large external corpus. This property has a positive impact on text classification performance.
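The idea of latently enriching a document representation with pre-trained embeddings, as in the Babel2Vec-style models above, can be sketched as follows; the tiny embedding table and documents are fabricated for illustration, whereas the thesis uses large pre-trained word and sense embeddings:

```python
import numpy as np

# Toy pre-trained embeddings (in practice these would come from a large corpus).
embeddings = {
    "film":   np.array([0.9, 0.1, 0.0]),
    "movie":  np.array([0.8, 0.2, 0.1]),
    "boring": np.array([0.1, 0.9, 0.0]),
    "great":  np.array([0.0, 0.1, 0.9]),
}

def doc_vector(text: str) -> np.ndarray:
    """Represent a document as the average of its known word embeddings."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

d1 = doc_vector("great movie")
d2 = doc_vector("boring film")
cosine = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cosine)  # documents sharing vocabulary can still end up far apart
```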
APA, Harvard, Vancouver, ISO, and other styles
43

Yi, Kwan 1963. "Text classification using a hidden Markov model." Thesis, McGill University, 2005. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=85214.

Full text
Abstract:
Text categorization (TC) is the task of automatically categorizing textual digital documents into pre-set categories by analyzing their contents. The purpose of this study is to develop an effective TC model to resolve the difficulty of automatic classification. Two primary goals are intended. First, a Hidden Markov Model (HMM) is proposed as a relatively new method for text categorization. HMMs have been applied to a wide range of applications in text processing such as text segmentation and event tracking, information retrieval, and information extraction. Few, however, have applied HMMs to TC. Second, the Library of Congress Classification (LCC) is adopted as the classification scheme for the HMM-based TC model for categorizing digital documents. LCC has been used in only a handful of experiments for the purpose of automatic classification. In the proposed framework, a general prototype for an HMM-based TC model is designed, and an experimental model based on the prototype is implemented so as to categorize digitalized documents into LCC. A sample of abstracts from the ProQuest Digital Dissertations database is used as the test-base. Dissertation abstracts, which are pre-classified by professional librarians, form an ideal test-base for evaluating the proposed model of automatic TC. For comparative purposes, a Naive Bayesian model, which has been extensively used in TC applications, is also implemented. Our experimental results show that the performance of our model surpasses that of the Naive Bayesian model, as measured by comparing the automatic classification of abstracts to the manual classification performed by professionals.
APA, Harvard, Vancouver, ISO, and other styles
44

Drusiani, Alberto. "Deep Learning Text Classification for Medical Diagnosis." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amslaurea.unibo.it/17281/.

Full text
Abstract:
ICD coding is the international standard for the classification of diseases and related disorders, drawn up by the World Health Organization. It was introduced to simplify the exchange of medical data, to speed up statistical analyses and to make insurance reimbursements efficient. The manual assignment of ICD-9-CM codes still requires a human effort that implies a considerable waste of resources, and for this reason several methods have been presented over the years to automate the process. In this thesis an approach is proposed for the automatic classification of medical diagnoses into ICD-9-CM codes using recurrent neural networks, in particular the LSTM module, and exploiting word embeddings. The results were satisfactory, as we were able to obtain better accuracy than Support Vector Machines, the most widely used traditional method. Furthermore, we have shown the effectiveness of domain-specific embedding models compared to general ones.
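A minimal sketch of an LSTM classifier over embedded diagnosis text, in the spirit of the approach described above; the vocabulary size, sequence length, number of ICD classes, and all hyperparameters are assumptions for illustration:

```python
import numpy as np
import tensorflow as tf

vocab_size, max_len, n_codes = 5000, 40, 10  # assumed sizes

# Toy integer-encoded diagnosis texts and hypothetical ICD-9-CM class labels.
X = np.random.randint(1, vocab_size, size=(64, max_len))
y = np.random.randint(0, n_codes, size=(64,))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),   # word embedding layer
    tf.keras.layers.LSTM(64),                     # recurrent encoder
    tf.keras.layers.Dense(n_codes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
print(model.predict(X[:1]).argmax(axis=-1))
```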
APA, Harvard, Vancouver, ISO, and other styles
45

Lanquillon, Carsten. "Enhancing text classification to improve information filtering." [S.l. : s.n.], 2001. http://deposit.ddb.de/cgi-bin/dokserv?idn=963801805.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Ball, Stephen Wayne. "Semantic web service generation for text classification." Thesis, University of Southampton, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.430674.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Li, Ming. "Sequence and text classification : features and classifiers." Thesis, University of East Anglia, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.426966.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Dernoncourt, Franck. "Sequential short-text classification with neural networks." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/111880.

Full text
Abstract:
Medical practice too often fails to incorporate recent medical advances. The two main reasons are that over 25 million scholarly medical articles have been published, and medical practitioners do not have the time to perform literature reviews. Systematic reviews aim at summarizing published medical evidence, but writing them requires tremendous human effort. In this thesis, we propose several natural language processing methods based on artificial neural networks to facilitate the completion of systematic reviews. In particular, we focus on short-text classification, to help authors of systematic reviews locate the desired information. We introduce several algorithms to perform sequential short-text classification, which outperform state-of-the-art algorithms. To facilitate the choice of hyperparameters, we present a method based on Gaussian processes. Lastly, we release PubMed 20k RCT, a new dataset for sequential sentence classification in randomized control trial abstracts.
APA, Harvard, Vancouver, ISO, and other styles
49

Wang, Xutao. "Chinese Text Classification Based On Deep Learning." Thesis, Mittuniversitetet, Avdelningen för informationssystem och -teknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-35322.

Full text
Abstract:
Text classification has always been a concern in the area of natural language processing, especially nowadays as data are becoming massive due to the development of the internet. The recurrent neural network (RNN) is one of the most popular methods for natural language processing thanks to its recurrent architecture, which gives it the ability to process serialized information. Meanwhile, the convolutional neural network (CNN) has shown its ability to extract features from visual imagery. This paper combines the advantages of RNN and CNN and proposes a model called BLSTM-C for Chinese text classification. BLSTM-C begins with a bidirectional long short-term memory (BLSTM) layer, a special kind of RNN, to produce a sequence output based on both past and future context. It then feeds this sequence to a CNN layer, which is used to extract features from the preceding sequence. We evaluate the BLSTM-C model on several tasks such as sentiment classification and category classification, and the results show our model's strong performance on these text tasks.
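A minimal sketch of a BLSTM-followed-by-CNN architecture of the kind described above, written with Keras; the vocabulary size, sequence length, filter sizes, and number of classes are illustrative assumptions rather than the paper's actual configuration:

```python
import numpy as np
import tensorflow as tf

vocab_size, max_len, n_classes = 8000, 50, 4  # assumed sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),
    # BLSTM layer: sequence output conditioned on past and future context.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    # CNN layer: extract local features from the BLSTM output sequence.
    tf.keras.layers.Conv1D(128, 5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Toy integer-encoded sentences and labels, just to show the training call.
X = np.random.randint(1, vocab_size, size=(32, max_len))
y = np.random.randint(0, n_classes, size=(32,))
model.fit(X, y, epochs=1, batch_size=8, verbose=0)
print(model.predict(X[:1]).argmax(axis=-1))
```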
APA, Harvard, Vancouver, ISO, and other styles
50

Zhihao, Cao. "Sensitive Text Icon Classification for Android Apps." Case Western Reserve University School of Graduate Studies / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=case1513561838121543.

Full text
APA, Harvard, Vancouver, ISO, and other styles