Dissertations / Theses on the topic 'Text summarization'

Consult the top 50 dissertations / theses for your research on the topic 'Text summarization.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Branavan, Satchuthananthavale Rasiah Kuhan. "High compression rate text summarization." Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/44368.

Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.
Includes bibliographical references (p. 95-97).
This thesis focuses on methods for condensing large documents into highly concise summaries, achieving compression rates on par with human writers. While the need for such summaries in the current age of information overload is increasing, the desired compression rate has thus far been beyond the reach of automatic summarization systems. The potency of our summarization methods is due to their in-depth modelling of document content in a probabilistic framework. We explore two types of document representation that capture orthogonal aspects of text content. The first represents the semantic properties mentioned in a document in a hierarchical Bayesian model. This method is used to summarize thousands of consumer reviews by identifying the product properties mentioned by multiple reviewers. The second representation captures discourse properties, modelling the connections between different segments of a document. This discriminatively trained model is employed to generate tables of contents for books and lecture transcripts. The summarization methods presented here have been incorporated into large-scale practical systems that help users effectively access information online.
by Satchuthananthavale Rasiah Kuhan Branavan.
S.M.
2

Linhares Pontes, Elvys. "Compressive Cross-Language Text Summarization." Thesis, Avignon, 2018. http://www.theses.fr/2018AVIG0232/document.

Abstract:
The popularization of social networks and digital documents has rapidly increased the information available on the Internet. However, this huge amount of data cannot be analyzed manually. Natural Language Processing (NLP) analyzes the interactions between computers and human languages in order to process and analyze natural language data. NLP techniques incorporate a variety of methods, including linguistics, semantics and statistics, to extract entities and relationships and to understand a document. Among several NLP applications, we are interested, in this thesis, in cross-language text summarization, which produces a summary in a language different from the language of the source documents. We also analyze other NLP tasks (word encoding representation, semantic similarity, sentence and multi-sentence compression) to generate more stable and informative cross-lingual summaries. Most NLP applications (including all types of text summarization) use some kind of similarity measure to analyze and compare the meaning of words, chunks, sentences and texts. One way to analyze this similarity is to generate a representation of these sentences that contains their meaning. The meaning of sentences is defined by several elements, such as the context of words and expressions, the order of words and the preceding information. Simple metrics, such as the cosine measure and the Euclidean distance, provide a measure of similarity between two sentences; however, they do not analyze the order of words or multi-word expressions. Analyzing these problems, we propose a neural network model that combines recurrent and convolutional neural networks to estimate the semantic similarity of a pair of sentences (or texts) based on the local and general contexts of words. Our model predicted better similarity scores than baselines by better analyzing the local and general meanings of words and multi-word expressions. In order to remove redundancies and non-relevant information from similar sentences, we propose a multi-sentence compression method that fuses similar sentences into correct, short compressions containing their main information. We model clusters of similar sentences as word graphs. Then, we apply an integer linear programming model that guides the compression of these clusters based on a list of keywords: we look for a path in the word graph that has good cohesion and contains the maximum number of keywords. Our approach outperformed baselines by generating more informative and more correct compressions for French, Portuguese and Spanish. Finally, we combine these methods to build a cross-language text summarization system. Our system is an {English, French, Portuguese, Spanish}-to-{English, French} cross-language text summarization framework that analyzes the information in both languages to identify the most relevant sentences. Inspired by compressive text summarization methods in monolingual analysis, we adapt our multi-sentence compression method to this problem in order to keep only the main information. Our system proves to be a good alternative for compressing redundant information and preserving relevant information, improving informativeness scores without losing grammatical quality for French-to-English cross-lingual summaries. Analyzing {English, French, Portuguese, Spanish}-to-{English, French} cross-lingual summaries, our system significantly outperforms state-of-the-art extractive baselines for all these languages. In addition, we analyze the cross-language summarization of transcript documents. Our approach achieved better and more stable ROUGE scores even for these documents, which contain grammatical errors and inaccurate or missing information.
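To make the word-graph compression described above concrete, here is a minimal Python sketch: the sentences of a cluster are fused into a directed word graph and the best keyword-covering path is read off as the compression. The thesis guides this step with an integer linear programming model; the exhaustive path search, scoring constants, and example data below are simplifying assumptions for illustration only.

```python
import networkx as nx

def build_word_graph(sentences):
    """Fuse sentences into a directed graph; identical word forms share a node.
    (Real systems disambiguate repeated words with position tags.)"""
    g = nx.DiGraph()
    for sent in sentences:
        prev = "<start>"
        for word in sent.lower().split():
            # edges traversed by several sentences become cheaper
            w = g.get_edge_data(prev, word, {"weight": 2.0})["weight"]
            g.add_edge(prev, word, weight=max(0.5, w - 0.5))
            prev = word
        g.add_edge(prev, "<end>", weight=1.0)
    return g

def compress(sentences, keywords, max_hops=12):
    g = build_word_graph(sentences)
    best, best_score = None, float("inf")
    for path in nx.all_simple_paths(g, "<start>", "<end>", cutoff=max_hops):
        words = path[1:-1]
        if not keywords <= set(words):   # keep keyword-covering paths only
            continue
        cost = sum(g[u][v]["weight"] for u, v in zip(path, path[1:]))
        score = cost / len(words)        # normalise by compression length
        if score < best_score:
            best, best_score = words, score
    return " ".join(best) if best else ""

cluster = ["the prime minister met the press on monday",
           "the prime minister spoke to reporters on monday"]
print(compress(cluster, keywords={"minister", "monday"}))
```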
3

Ozsoy, Makbule Gulcin. "Text Summarization Using Latent Semantic Analysis." Master's thesis, METU, 2011. http://etd.lib.metu.edu.tr/upload/12612988/index.pdf.

Abstract:
Text summarization solves the problem of presenting the information needed by a user in a compact form. There are different approaches in the literature to creating well-formed summaries. One of the newest methods in text summarization is Latent Semantic Analysis (LSA). In this thesis, different LSA-based summarization algorithms are explained and two new LSA-based summarization algorithms are proposed. The algorithms are evaluated on Turkish and English documents, and their performances are compared using their ROUGE scores.
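As an illustration of the LSA family of algorithms surveyed in the thesis, the sketch below follows the classic Gong and Liu selection scheme: build a TF-IDF sentence-term matrix, apply truncated SVD, and pick the strongest sentence in each latent topic. It is a minimal approximation, not the thesis's proposed algorithms.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def lsa_summary(sentences, n_topics=2):
    X = TfidfVectorizer().fit_transform(sentences)   # sentence-term matrix
    weights = TruncatedSVD(n_components=n_topics).fit_transform(X)
    # one sentence per latent topic: the strongest-weighted sentence wins
    chosen = {int(np.argmax(np.abs(weights[:, t]))) for t in range(n_topics)}
    return [sentences[i] for i in sorted(chosen)]

doc = ["cats are small domestic animals",
       "dogs are loyal domestic animals",
       "stock markets fell sharply on monday",
       "investors sold shares as markets fell"]
print(lsa_summary(doc))
```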
4

Mlynarski, Angela, and University of Lethbridge Faculty of Arts and Science. "Automatic text summarization in digital libraries." Thesis, Lethbridge, Alta. : University of Lethbridge, Faculty of Arts and Science, 2006, 2006. http://hdl.handle.net/10133/270.

Abstract:
A digital library is a collection of services and information objects for storing, accessing, and retrieving digital objects. Automatic text summarization presents salient information in a condensed form suitable for user needs. This thesis amalgamates digital libraries and automatic text summarization by extending the Greenstone Digital Library software suite to include the University of Lethbridge Summarizer. The tool generates summaries, nouns, and noun phrases for use as metadata for searching and browsing digital collections. Digital collections of newspapers, PDFs, and eBooks were created with summary metadata. PDF documents were processed the fastest at 1.8 MB/hr, followed by the newspapers at 1.3 MB/hr, with eBooks the slowest at 0.9 MB/hr. Qualitative analysis of four genres: newspaper, M.Sc. thesis, novel, and poetry, revealed that narrative newspaper text was the most suitable for automatically generated summaries. The other genres suffered from incoherence and information loss. Overall, summaries for digital collections are suitable when used with newspaper documents and unsuitable for other genres.
xiii, 142 leaves ; 28 cm.
5

Singi Reddy, Dinesh Reddy. "Comparative text summarization of product reviews." Thesis, Kansas State University, 2010. http://hdl.handle.net/2097/7031.

Abstract:
Master of Science
Department of Computing and Information Sciences
William H. Hsu
This thesis presents an approach to summarizing product reviews using comparative sentences and sentiment analysis. Specifically, we consider the problem of extracting and scoring features from natural language text for qualitative reviews in a particular domain. When shopping for a product, customers do not have time to learn about all products on the market. Similarly, manufacturers do not have proper written sources from which to learn about customer opinions. The only available techniques involve gathering customer opinions, often in text form, from e-commerce and social networking web sites and analyzing them, which is a costly and time-consuming process. In this work I address these issues by applying sentiment analysis, an automated method of finding the opinion stated by an author about some entity in a text document. I first gather information about smartphones from many e-commerce web sites. I then present a method to differentiate comparative sentences from normal sentences, form feature sets for each domain, and assign each feature of a product a numerical score along with a weight coefficient obtained by statistical machine learning; products are then ranked by linear combinations of their weighted feature scores. The thesis also explains what role comparative sentences play in summarizing a product. To find the polarity of each feature, a statistical algorithm is defined using a small-to-medium-sized data set. I then present my experimental environment and results, and conclude with a review of the claims and hypotheses stated at the outset. The approach is evaluated using manually annotated training data and data from domain experts. I also demonstrate empirically how different summarization algorithms can be derived from the technique provided by an annotator. Finally, I review diversified options for customers, such as providing alternate products for each feature, the top features of a product, and overall product rankings.
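The final ranking step described above reduces to a linear combination of per-feature sentiment scores and learned weights. The toy sketch below illustrates just that arithmetic; the products, features, scores, and weights are invented for the example.

```python
# feature_scores: {product: {feature: sentiment score}}; weights: {feature: w}
def rank_products(feature_scores, weights):
    totals = {p: sum(weights.get(f, 0.0) * s for f, s in fs.items())
              for p, fs in feature_scores.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

scores = {"phone_a": {"battery": 0.8, "camera": 0.4},
          "phone_b": {"battery": 0.2, "camera": 0.9}}
print(rank_products(scores, weights={"battery": 0.6, "camera": 0.4}))
```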
6

AGUIAR, C. Z. "Concept Maps Mining for Text Summarization." Universidade Federal do Espírito Santo, 2017. http://repositorio.ufes.br/handle/10/9846.

Abstract:
Concept maps are graphical tools for representing and constructing knowledge. Concepts and relations form the basis for learning, and concept maps have therefore been widely used in different situations and for different purposes in education, one of them being the representation of written text. Even a grammatically complex text can be represented by a concept map containing only concepts and relations that express what was originally stated in a more complicated form. However, the manual construction of a concept map demands considerable time and effort in identifying and structuring knowledge, especially when the map should represent not the concepts of the author's cognitive structure but the concepts expressed in a text. Several technological approaches have thus been proposed to facilitate the process of building concept maps from texts. This dissertation proposes a new approach for the automatic construction of concept maps as a summarization of scientific texts. The summarization aims to produce a concept map as a condensed representation of the text that preserves its diverse and most important characteristics. Summarization can ease the comprehension of texts, since students are trying to cope with the cognitive overload caused by the growing amount of textual information available today; this growth can also be harmful to knowledge construction. We therefore consider the hypothesis that summarizing a text as a concept map can convey the characteristics needed to assimilate the knowledge in the text, while reducing its complexity and the time needed to process it. In this context, we carried out a literature review covering the years 1994 to 2016 on approaches aimed at the automatic construction of concept maps from texts. From this review we built a categorization to better identify and analyze the features and characteristics of these technological approaches, and we sought to identify their limitations and gather the best characteristics of the related work to inform our approach. We present a Concept Map Mining process organized along four dimensions: Data Source Description, Domain Definition, Element Identification, and Map Visualization. With the aim of developing a computational architecture to automatically build concept maps as summarizations of academic texts, this research produced the public tool CMBuilder, an online tool for the automatic construction of concept maps from texts, as well as a Java API called ExtroutNLP, which contains libraries for information extraction and public services. To achieve the proposed goal, we directed efforts toward natural language processing and information retrieval; the main task is to extract propositions of the type (concept, relation, concept) from the text. Under this premise, the research introduces a pipeline comprising: grammar rules and depth-first search for extracting concepts and relations from text; preposition mapping, anaphora resolution, and named-entity exploitation for concept labeling; concept ranking based on element frequency analysis and map topology; and proposition summarization based on graph topology. The approach also proposes the use of supervised clustering and classification techniques, combined with a thesaurus, to define the domain of the text and build a conceptual vocabulary of domains. Finally, an objective analysis validating the accuracy of the ExtroutNLP library yields 0.65 precision over the corpus; a subjective analysis validating the quality of the concept maps built by CMBuilder yields 0.75/0.45 precision/recall for concepts and 0.57/0.23 precision/recall for relations in English, and 0.68/0.38 precision/recall for concepts and 0.41/0.19 precision/recall for relations in Portuguese. An experiment testing whether the concept maps summarized by CMBuilder aid comprehension of the subject addressed in a text reached 60% correct answers for maps extracted from short texts with multiple-choice questions and 77% correct answers for maps extracted from long texts with open-ended questions.
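As a much-simplified illustration of the core proposition-extraction task, the sketch below pulls (concept, relation, concept) triples from subject-verb-object patterns in a spaCy dependency parse. The dissertation's own pipeline relies on grammar rules, depth-first search, anaphora resolution, and named entities; this toy version, and its assumption that the en_core_web_sm model is installed, stands in for it only loosely.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_propositions(text):
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                # direct dependents of the verb supply candidate concepts
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.text, token.lemma_, o.text))
    return triples

print(extract_propositions("Concept maps represent knowledge. Students build maps."))
```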
7

Casamayor, Gerard. "Semantically-oriented text planning for automatic summarization." Doctoral thesis, Universitat Pompeu Fabra, 2021. http://hdl.handle.net/10803/671530.

Abstract:
Text summarization deals with the automatic creation of summaries from one or more documents, either by extracting fragments from the input text or by generating an abstract de novo. Research in recent years has become dominated by a new paradigm where summarization is addressed as a mapping from a sequence of tokens in an input document to a new sequence of tokens summarizing the input. Works following this paradigm apply supervised deep learning methods to learn sequence-to-sequence models from a large corpus of documents paired with human-crafted summaries. Despite impressive results in automatic quantitative evaluations, this approach to summarization also suffers from a number of drawbacks. One concern is that learned models tend to operate in a black-box fashion that prevents obtaining insights or results from intermediate analysis that could be applied to other tasks, an important consideration in many real-world scenarios where summaries are not the only desired output of a natural language processing system. Another significant drawback is that deep learning methods are largely constrained to languages and types of summary for which abundant corpora containing human-authored summaries are available. Albeit researchers are experimenting with transfer learning methods to overcome this problem, it is far from clear how effective these methods are and how to apply them to scenarios where summaries need to adapt to a query or to user preferences. In those cases where it is not practical to learn a sequence-to-sequence model, it is convenient to fall back to a more traditional formulation of summarization where the input documents are first analyzed, then a summary is planned by selecting and organizing contents, and the final summary is generated either extractively or abstractively, using natural language generation methods in the latter case. By separating linguistic analysis, planning and generation, it becomes possible to apply different approaches to each task. This thesis focuses on the text planning step. Drawing from past research in word sense disambiguation, text summarization and natural language generation, this thesis presents an unsupervised approach to planning the production of summaries. Following the observation that a common strategy for both disambiguation and summarization tasks is to rank candidate items (meanings, text fragments), we propose a strategy, at the core of our approach, that ranks candidate lexical meanings and individual words in a text. These ranks contribute towards the creation of a graph-based semantic representation from which we select non-redundant contents and organize them for inclusion in the summary. The overall approach is supported by lexicographic databases that provide cross-lingual and cross-domain knowledge, and by textual similarity methods used to compare meanings with each other and with the text. The methods presented in this thesis are tested on two separate tasks: disambiguation of word senses and named entities, and single-document extractive summarization of English texts. The evaluation of the disambiguation task shows that our approach produces useful results for tasks other than summarization, while evaluating in an extractive summarization setting allows us to compare our approach to existing summarization systems. While the results are inconclusive with respect to the state of the art in disambiguation and summarization systems, they hint at a large potential for our approach.
8

Reeve, Lawrence H. Han Hyoil. "Semantic annotation and summarization of biomedical text." Philadelphia, Pa.: Drexel University, 2007. http://hdl.handle.net/1860/1779.

9

Hassel, Martin. "Resource Lean and Portable Automatic Text Summarization." Doctoral thesis, Stockholm : Numerisk analys och datalogi Numerical Analysis and Computer Science, Kungliga Tekniska högskolan, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4414.

10

Lehto, Niko, and Mikael Sjödin. "Automatic text summarization of Swedish news articles." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-159972.

Abstract:
With an increasing amount of textual information available, there is also an increased need to make this information more accessible. Our paper describes a modified TextRank model and investigates the different methods available for using automatic text summarization to create summaries of Swedish news articles. To evaluate our model we focused on intrinsic evaluation methods, in part through content evaluation in the form of measuring referential clarity and non-redundancy, and in part through text quality evaluation measures, in the form of keyword retention and ROUGE evaluation. The results indicate that stemming and improved stop-word capabilities can have a positive effect on ROUGE scores. The addition of redundancy checks also seems to have a positive effect on avoiding repetition of information. Keyword retention decreased somewhat, however. Lastly, all methods had some trouble with dangling anaphora, showing a need for further work on anaphora resolution.
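For readers unfamiliar with the baseline being modified, a bare-bones TextRank-style extractor looks roughly like the sketch below: sentences become graph nodes, a word-overlap measure weights the edges, and PageRank selects the summary. The stemming, stop-word handling, and redundancy checks studied in the paper are deliberately left out.

```python
import math
import networkx as nx

def overlap(s1, s2):
    """Word-overlap similarity, normalised by sentence lengths."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    denom = math.log(len(w1) + 1) + math.log(len(w2) + 1)
    return len(w1 & w2) / denom if denom else 0.0

def textrank_summary(sentences, k=2):
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i, si in enumerate(sentences):
        for j in range(i + 1, len(sentences)):
            sim = overlap(si, sentences[j])
            if sim > 0:
                g.add_edge(i, j, weight=sim)
    scores = nx.pagerank(g, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

article = ["Sweden saw record rainfall on tuesday",
           "the rainfall broke records across Sweden",
           "local trains were delayed by flooding",
           "flooding delayed several local trains"]
print(textrank_summary(article))
```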
11

Kryściński, Wojciech. "Training Neural Models for Abstractive Text Summarization." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-236973.

Abstract:
Abstractive text summarization aims to condense long textual documents into a short, human-readable form while preserving the most important information from the source document. A common approach to training summarization models is maximum likelihood estimation with the teacher forcing strategy. Despite its popularity, this method has been shown to yield models with suboptimal performance at inference time. This work examines how using alternative, task-specific training signals affects the performance of summarization models. Two novel training signals are proposed and evaluated as part of this work. The first is a novelty metric measuring the overlap between n-grams in the summary and the summarized article. The second utilizes a discriminator model to distinguish human-written summaries from generated ones on a word-level basis. Empirical results show that using the mentioned metrics as rewards for policy gradient training yields significant performance gains measured by ROUGE scores, novelty scores and human evaluation.
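A plausible reading of the first training signal is sketched below: the fraction of summary n-grams that do not occur in the source article, which could then serve as a reward in policy-gradient training. The exact formulation in the thesis may differ; this is an illustrative assumption.

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(summary, article, n=3):
    """Fraction of summary n-grams that do not appear in the article."""
    s, a = summary.lower().split(), article.lower().split()
    s_grams = ngrams(s, n)
    if not s_grams:
        return 0.0
    return len(s_grams - ngrams(a, n)) / len(s_grams)

print(novelty("officials confirmed the new policy on friday",
              "on friday, city officials confirmed the new policy at a briefing"))
```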
12

Le, Thien-Hoa. "Neural Methods for Sentiment Analysis and Text Summarization." Electronic Thesis or Diss., Université de Lorraine, 2020. http://www.theses.fr/2020LORR0037.

Abstract:
This thesis focuses on two Natural Language Processing tasks that require to extract semantic information from raw texts: Sentiment Analysis and Text Summarization. This dissertation discusses issues and seeks to improve neural models on both tasks, which have become the dominant paradigm in the past several years. Accordingly, this dissertation is composed of two parts: the first part (Neural Sentiment Analysis) deals with the computational study of people's opinions, sentiments, and the second part (Neural Text Summarization) tries to extract salient information from a complex sentence and rewrites it in a human-readable form. Neural Sentiment Analysis. Similar to computer vision, numerous deep convolutional neural networks have been adapted to sentiment analysis and text classification tasks. However, unlike the image domain, these studies are carried on different input data types and on different datasets, which makes it hard to know if a deep network is truly needed. In this thesis, we seek to find elements to address this question, i.e. whether neural networks must compute deep hierarchies of features for textual data in the same way as they do in vision. We thus propose a new adaptation of the deepest convolutional architecture (DenseNet) for text classification and study the importance of depth in convolutional models with different atom-levels (word or character) of input. We show that deep models indeed give better performances than shallow networks when the text input is represented as a sequence of characters. However, a simple shallow-and-wide network outperforms the deep DenseNet models with word inputs. Besides, to further improve sentiment classifiers and contextualize them, we propose to model them jointly with dialog acts, which are a factor of explanation and correlate with sentiments but are nevertheless often ignored. We have manually annotated both dialogues and sentiments on a Twitter-like social medium, and train a multi-task hierarchical recurrent network on joint sentiment and dialog act recognition. We show that transfer learning may be efficiently achieved between both tasks, and further analyze some specific correlations between sentiments and dialogues on social media. Neural Text Summarization. Detecting sentiments and opinions from large digital documents does not always enable users of such systems to take informed decisions, as other important semantic information is missing. People also need the main arguments and supporting reasons from the source documents to truly understand and interpret the document. To capture such information, we aim at making the neural text summarization models more explainable. We propose a model that has better explainability properties and is flexible enough to support various shallow syntactic parsing modules. More specifically, we linearize the syntactic tree into the form of overlapping text segments, which are then selected with reinforcement learning (RL) and regenerated into a compressed form. Hence, the proposed model is able to handle both extractive and abstractive summarization. Further, we observe that RL-based models are becoming increasingly ubiquitous for many text summarization tasks. We are interested in better understanding what types of information is taken into account by such models, and we propose to study this question from the syntactic perspective. 
We thus provide a detailed comparison of both RL-based and syntax-aware approaches and of their combination along several dimensions that relate to the perceived quality of the generated summaries, such as the number of repetitions, sentence length, distribution of part-of-speech tags, relevance and grammaticality. We show that when there is a resource constraint (computation and memory), it is wise to train models only with RL and without any syntactic information, as they provide nearly as good results as syntax-aware models with fewer parameters and faster training convergence.
13

Ceylan, Hakan. "Investigating the Extractive Summarization of Literary Novels." Thesis, University of North Texas, 2011. https://digital.library.unt.edu/ark:/67531/metadc103298/.

Abstract:
Due to the vast amount of information we are faced with, summarization has become a critical necessity of everyday human life. Given that a large fraction of the electronic documents available online and elsewhere consist of short texts such as Web pages, news articles, scientific reports, and others, the focus of natural language processing techniques to date has been on the automation of methods targeting short documents. We are witnessing, however, a change: an increasingly larger number of books become available in electronic format. This means that the need for language processing techniques able to handle very large documents such as books is becoming increasingly important. This thesis addresses the problem of summarization of novels, which are long and complex literary narratives. While there is a significant body of research that has been carried out on the task of automatic text summarization, most of this work has been concerned with the summarization of short documents, with a particular focus on news stories. However, novels are different in both length and genre, and consequently different summarization techniques are required. This thesis attempts to close this gap by analyzing a new domain for summarization, and by building unsupervised and supervised systems that effectively take into account the properties of long documents, and outperform the traditional extractive summarization systems typically addressing the news genre.
14

Wu, Jiewen. "WHISK: Web Hosted Information into Summarized Knowledge." DigitalCommons@CalPoly, 2016. https://digitalcommons.calpoly.edu/theses/1633.

Abstract:
Today's online content increases at an alarming rate that exceeds users' ability to consume it. Modern search techniques allow users to enter keyword queries to find content they wish to see. However, such techniques break down when users freely browse the internet without knowing exactly what they want. Users may have to invest an unnecessarily long time reading content to decide whether they are interested in it. Automatic text summarization helps relieve this problem by creating synopses that significantly reduce the text while preserving the key points. Steffen Lyngbaek created the SPORK summarization pipeline to address content overload in Reddit comment threads. Lyngbaek adapted the Opinosis graph model for extractive summarization and combined it with agglomerative hierarchical clustering and the Smith-Waterman algorithm to perform multi-document summarization on Reddit comments. This thesis presents WHISK as a pipeline for general multi-document text summarization based on SPORK. A generic data model in WHISK allows creating new drivers for different platforms to work with the pipeline. In addition to the existing Opinosis graph model adapted in SPORK, WHISK introduces two simplified graph models for the pipeline. The simplified models remove unnecessary restrictions inherited from the Opinosis graph's abstractive-summarization origins. Performance measurements and a study with Digital Democracy compare the two new graph models against the Opinosis graph model. Additionally, the study evaluates WHISK's ability to generate pull quotes from political discussions as summaries.
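WHISK inherits SPORK's use of the Smith-Waterman algorithm for comparing sentence pairs. Below is a compact local-alignment scorer over word tokens; the scoring constants are illustrative rather than those used in the thesis.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Score of the best local alignment between two word sequences."""
    a, b = a.split(), b.split()
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # local alignment: scores never drop below zero
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

print(smith_waterman("the cat sat on the mat", "a cat sat near the mat"))
```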
15

Mastronardo, Claudio. "Integrating Deep Contextualized Word Embeddings into Text Summarization Systems." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/18468/.

Abstract:
In this thesis, deep learning techniques are used to tackle one of the hardest problems in natural language processing: automatic summarization. Given a corpus of text, the goal is to generate a summary that distills and compresses the information of the entire source text. Early approaches tried to capture the meaning of text through human-written rules. After this symbolic, rule-based era, statistical approaches took over. In recent years deep learning has positively impacted every area of natural language processing, including automatic summarization. In this work, pointer-generator models [See et al., 2017] are used in combination with pre-trained deep contextualized word embeddings [Peters et al., 2018]. The approach is evaluated on the two largest summarization datasets available today: CNN/Daily Mail and Newsroom. The CNN/Daily Mail dataset was derived from the question-answering dataset published by DeepMind [Hermann et al., 2015] by concatenating the highlight sentences of each news article to form multi-sentence summaries. The Newsroom dataset [Grusky et al., 2018] is the first dataset explicitly built for automatic summarization, comprising one million article-summary pairs with different degrees of extractiveness/abstractiveness at different compression ratios. The approach is evaluated on the test sets using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric. It yields a substantial performance improvement on the Newsroom dataset, achieving state-of-the-art ROUGE-1 and competitive ROUGE-2 and ROUGE-L scores.
16

Kolla, Maheedhar, and University of Lethbridge Faculty of Arts and Science. "Automatic text summarization using lexical chains : algorithms and experiments." Thesis, Lethbridge, Alta. : University of Lethbridge, Faculty of Arts and Science, 2004, 2004. http://hdl.handle.net/10133/226.

Abstract:
Summarization is a complex task that requires understanding of the document content to determine the importance of the text. Lexical cohesion is a method to identify connected portions of the text based on the relations between the words in the text. Lexical cohesive relations can be represented using lexical chains. Lexical chains are sequences of semantically related words spread over the entire text, and they are used in a variety of Natural Language Processing (NLP) and Information Retrieval (IR) applications. In the current thesis, we propose a lexical chaining method that includes glossary relations in the chaining process. These relations enable us to identify topically related concepts, for instance dormitory and student, and thereby enhance the identification of cohesive ties in the text. We then present methods that use the lexical chains to generate summaries by extracting sentences from the document(s). Headlines are generated by filtering out the portions of the extracted sentences that do not contribute to the meaning of the sentence. Generated headlines can be used in real-world applications to skim through document collections in a digital library. Multi-document summarization is gaining demand with the explosive growth of online news sources. It requires identification of the several themes present in the collection to attain good compression and avoid redundancy. In this thesis, we propose methods to group portions of the texts of a document collection into meaningful clusters. Clustering enables us to extract the various themes of the document collection. Sentences from clusters can then be extracted to generate a summary for the multi-document collection. Clusters can also be used to generate summaries with respect to a given query. We designed a system to compute lexical chains for a given text and use them to extract the salient portions of the document. Some specific tasks considered are headline generation, multi-document summarization, and query-based summarization. Our experimental evaluation shows that efficient summaries can be extracted for the above tasks.
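A greedy toy version of lexical chaining with WordNet is sketched below: a word joins an existing chain when it shares a synset or a direct hypernym with a chain member. The glossary relations this thesis adds, which would link pairs like dormitory and student, are exactly what this plain-WordNet version lacks; it assumes the NLTK WordNet corpus has been downloaded.

```python
from nltk.corpus import wordnet as wn

def related(w1, w2):
    """True if two nouns share a synset or a direct hypernym link."""
    s1 = set(wn.synsets(w1, pos=wn.NOUN))
    s2 = set(wn.synsets(w2, pos=wn.NOUN))
    if s1 & s2:                                  # synonymy: same sense
        return True
    h1 = {h for s in s1 for h in s.hypernyms()}
    h2 = {h for s in s2 for h in s.hypernyms()}
    return bool((h1 & s2) | (h2 & s1) | (h1 & h2))

def build_chains(nouns):
    chains = []
    for word in nouns:
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain.append(word)               # extend the first related chain
                break
        else:
            chains.append([word])                # start a new chain
    return chains

print(build_chains(["car", "automobile", "wheel", "student", "dormitory"]))
```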
viii, 80 leaves : ill. ; 29 cm.
17

Biniam, Thomas Indrias, and Adam Morén. "Extractive Text Summarization of Norwegian News Articles Using BERT." Thesis, Linköpings universitet, Medie- och Informationsteknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-176598.

Abstract:
Extractive text summarization has over the years been an important research area in Natural Language Processing. Numerous methods have been proposed for extracting information from text documents. Recent works have shown great success on English summarization tasks by fine-tuning the language model BERT using large summarization datasets. However, less research has been done for low-resource languages. This work contributes by investigating how BERT can be used for Norwegian text summarization. Two models are developed by applying a modified BERT architecture, called BERTSum, to pre-trained Norwegian and multilingual BERT. The resulting models are able to predict key sentences from articles to generate bullet-point summaries. These models are evaluated with the automatic metric ROUGE, and in this evaluation the multilingual BERT model outperforms the Norwegian one. The multilingual model is further evaluated in a human evaluation by journalists, revealing that the generated summaries are not entirely satisfactory in some aspects. With some improvements, the model shows promise as a valuable tool for journalists to edit and rewrite generated summaries, saving time and workload.

The thesis work was carried out at the Department of Science and Technology (ITN) at the Faculty of Science and Engineering, Linköping University.
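The sketch below is not the BERTSum architecture evaluated in the thesis but a much simpler stand-in that illustrates BERT-based extractive scoring: each sentence is embedded with a pre-trained multilingual BERT and ranked by cosine similarity to the document centroid. The model name and the mean-pooling scheme are assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def rank_sentences(sentences, k=3):
    enc = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state          # batch x tokens x dim
    mask = enc["attention_mask"].unsqueeze(-1)
    emb = (hidden * mask).sum(1) / mask.sum(1)           # mean-pooled sentences
    centroid = emb.mean(0, keepdim=True)
    sims = torch.nn.functional.cosine_similarity(emb, centroid)
    top = sims.argsort(descending=True)[:k].tolist()
    return [sentences[i] for i in sorted(top)]           # keep document order
```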

18

Jonsson, Fredrik. "Evaluation of the Transformer Model for Abstractive Text Summarization." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-263325.

Abstract:
Being able to generate summaries automatically could speed up the spread and retention of information and potentially increase productivity in several fields. RNN-based encoder-decoder models with attention have been successful on a variety of language-related tasks such as automatic summarization, as well as in the field of machine translation. Lately, the Transformer model has been shown to outperform RNN-based models with attention in the related field of machine translation. This study compares the Transformer model to an LSTM-based encoder-decoder model with attention on the task of abstractive summarization. Evaluation is done both automatically, using ROUGE score, and with human evaluators estimating the grammar and readability of the generated summaries. The results show that the Transformer model produces better summaries both in terms of ROUGE score and when evaluated by human evaluators.
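The automatic part of the evaluation rests on ROUGE. A self-contained sketch of ROUGE-N recall, the clipped n-gram overlap with a reference summary, is given below; real toolkits add stemming, multiple references, and precision/F-scores.

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    def grams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    # clipped counts: a candidate n-gram counts at most as often as in the reference
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n("the model generates good summaries",
              "the model generates very good summaries"))
```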
19

Kipp, Darren. "Shallow semantics for topic-oriented multi-document automatic text summarization." Thesis, University of Ottawa (Canada), 2008. http://hdl.handle.net/10393/27772.

Abstract:
There are presently a number of NLP tools available which can provide semantic information about a sentence. Connexor Machinese Semantics is one of the most elaborate of such tools in terms of the information it provides. It has been hypothesized that semantic analysis of sentences is required in order to make significant improvements in automatic summarization. Elaborate semantic analysis is still not particularly feasible. In this thesis, I look at what shallow semantic features are available from an off-the-shelf semantic analysis tool that might improve the responsiveness of a summary. The aim of this work is to use the information made available as an intermediary approach to improving the responsiveness of summaries. While this approach is not likely to perform as well as full semantic analysis, it is considerably easier to achieve and could provide an important stepping stone in the direction of deeper semantic analysis. As a significant portion of this task, we develop mechanisms in various programming languages to view, process, and extract relevant information and features from the data.
20

Lyons, Seamus. "Extraction and summarization of units of information from web text." Thesis, University of East Anglia, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.493011.

21

Mast, Cynda Overton. "The Effects of Cognitive Styles on Summarization of Expository Text." Thesis, University of North Texas, 1988. https://digital.library.unt.edu/ark:/67531/metadc332362/.

Abstract:
The study investigated the relationship between three cognitive styles and summarization abilities. Both summarization products and processes were examined. Summarizing products were scored and a canonical correlation analysis was performed to determine their relationship with the three cognitive styles. Summarizing processes were examined by videotaping students as they provided think-aloud protocols. Their processes were recorded on composing style sheets and analyzed qualitatively. Subjects were sixth-grade students in self-contained classes in a suburban school district. Summarizing products were collected over a two-week period in the fall. Summarizing processes were collected over an eight-week period in the spring of the same school year. The results of the summarizing products analysis suggest that cognitive styles are related to summarization abilities. Two canonical correlations among the two variable sets were statistically significant at the .05 level of significance (.33 and .29). The results further suggest that students who are field independent, reflective, and flexible in their attentional style may be more adept at organizing their ideas and using written mechanics while summarizing. Students who are impulsive and constricted in attentional style may exhibit strength in expressing their ideas while summarizing. Results of the summarizing processes analysis suggest that students of one cognitive style combination may exhibit different behaviors while summarizing than those of other cognitive style combinations. Students who are field independent, reflective, and flexible in their attentional style seem to display more mature, interactive behaviors while summarizing than their peers of other cognitive style combinations.
22

Demirtas, Kezban. "Automatic Video Categorization And Summarization." Master's thesis, METU, 2009. http://etd.lib.metu.edu.tr/upload/3/12611113/index.pdf.

Abstract:
In this thesis, we perform automatic video categorization and summarization using the subtitles of videos. We propose two methods for video categorization. The first method performs unsupervised categorization by applying natural language processing techniques to video subtitles and uses the WordNet lexical database and WordNet domains. The method starts with text preprocessing. Then a keyword extraction algorithm and a word sense disambiguation method are applied. The WordNet domains that correspond to the correct senses of the keywords are extracted, and the video is assigned a category label based on the extracted domains. The second method has the same steps for extracting the WordNet domains of a video but performs categorization using a learning module. Experiments with documentary videos give promising results in discovering the correct categories of videos. Video summarization algorithms present condensed versions of a full-length video by identifying its most significant parts. We propose a video summarization method that uses the subtitles of videos and text summarization techniques. We identify significant sentences in the subtitles of a video using text summarization techniques, and then we compose a video summary by finding the video parts corresponding to these summary sentences.
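The overall subtitle-based idea can be illustrated with a toy pipeline: parse the subtitle file, score each caption with simple word frequencies in place of a full text summarization method, and return the time stamps of the top-scoring captions as the video summary segments. The SRT parsing and scoring below are simplifying assumptions, not the thesis's actual techniques.

```python
import re
from collections import Counter

# matches "HH:MM:SS,mmm --> HH:MM:SS,mmm" followed by the caption text
SRT_BLOCK = re.compile(
    r"(\d\d:\d\d:\d\d),\d+ --> (\d\d:\d\d:\d\d),\d+\n(.+?)(?:\n\n|$)", re.S)

def summarize_subtitles(srt_text, k=2):
    blocks = [(s, e, " ".join(t.split())) for s, e, t in SRT_BLOCK.findall(srt_text)]
    freq = Counter(w for _, _, t in blocks for w in t.lower().split())
    def score(text):
        words = text.lower().split()
        return sum(freq[w] for w in words) / len(words) if words else 0.0
    # return the k highest-scoring captions with their time ranges
    return sorted(blocks, key=lambda b: score(b[2]), reverse=True)[:k]

srt = """1
00:00:01,000 --> 00:00:04,000
The glacier has retreated every year.

2
00:00:05,000 --> 00:00:08,000
Scientists measure the glacier retreat each summer.
"""
print(summarize_subtitles(srt, k=1))
```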
APA, Harvard, Vancouver, ISO, and other styles
23

Lyngbaek, Steffen slyngbae. "SPORK: A SUMMARIZATION PIPELINE FOR ONLINE REPOSITORIES OF KNOWLEDGE." DigitalCommons@CalPoly, 2013. https://digitalcommons.calpoly.edu/theses/1036.

Full text
Abstract:
The web 2.0 era has ushered in an unprecedented amount of interactivity on the Internet, resulting in a flood of user-generated content. This content is often unstructured and comes in the form of blog posts and comment discussions. Users can no longer keep up with the amount of content available, which leads developers to rely on natural language techniques to help mitigate the problem. Although many natural language processing techniques have been employed for years, automatic text summarization in particular has recently gained traction. This research proposes a graph-based, extractive text summarization system called SPORK (Summarization Pipeline for Online Repositories of Knowledge). The goal of SPORK is to identify the important key topics presented in multi-document texts, such as online comment threads. While most other automatic summarization systems simply focus on finding the top sentences represented in the text, SPORK separates the text into clusters and identifies the different topics and opinions presented in the text. SPORK identifies 72% of the key topics present in any discussion and up to 80% of the key topics in a well-structured discussion.
APA, Harvard, Vancouver, ISO, and other styles
24

LA, QUATRA MORENO. "Deep Learning for Natural Language Understanding and Summarization." Doctoral thesis, Politecnico di Torino, 2022. http://hdl.handle.net/11583/2972201.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Geiss, Johanna. "Latent semantic sentence clustering for multi-document summarization." Thesis, University of Cambridge, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.609761.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Shen, Chao. "Text Analytics of Social Media: Sentiment Analysis, Event Detection and Summarization." FIU Digital Commons, 2014. http://digitalcommons.fiu.edu/etd/1739.

Full text
Abstract:
In the last decade, a large number of social media services have emerged and become widely used in people's daily lives as important tools for sharing and acquiring information. With the substantial amount of user-contributed text data on social media, it becomes necessary to develop methods and tools for analyzing this emerging type of data, in order to better utilize it to deliver meaningful information to users. Previous work on text analytics over the last several decades mainly focused on traditional types of text such as emails, news, and academic literature, and several issues critical to text data on social media have not been well explored: 1) how to detect sentiment in text on social media; 2) how to make use of social media's real-time nature; 3) how to address information overload for flexible information needs. In this dissertation, we focus on these three problems. First, to detect the sentiment of text on social media, we propose a non-negative matrix tri-factorization (tri-NMF) based dual active supervision method to minimize human labeling efforts for this new type of data. Second, to make use of social media's real-time nature, we propose approaches to detect events from text streams on social media. Third, to address information overload for flexible information needs, we propose two summarization frameworks: a dominating-set based summarization framework and a learning-to-rank based summarization framework. The dominating-set based framework can be applied to different types of summarization problems, while the learning-to-rank based framework utilizes existing training data to guide new summarization tasks. In addition, we integrate these techniques in an application study of event summarization for sports games as an example of how to better utilize social media data.
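The dominating-set idea can be illustrated with a short greedy sketch: build a sentence-similarity graph and pick sentences until every sentence is either selected or similar to a selected one. The TF-IDF similarity and the threshold are assumptions for illustration, not the dissertation's exact formulation.

```python
# Greedy approximation of a dominating set over a sentence-similarity graph.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def dominating_summary(sentences, threshold=0.3):
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] >= threshold:
                g.add_edge(i, j)
    uncovered, chosen = set(g.nodes), []
    while uncovered:
        # pick the sentence whose closed neighborhood covers the most
        # not-yet-covered sentences
        best = max(g.nodes, key=lambda n: len(({n} | set(g[n])) & uncovered))
        chosen.append(best)
        uncovered -= {best} | set(g[best])
    return [sentences[i] for i in sorted(chosen)]  # document order
```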
APA, Harvard, Vancouver, ISO, and other styles
27

Rigouste, Lois. "Evolution of a text summarization system in an automatic evaluation framework." Thesis, University of Ottawa (Canada), 2003. http://hdl.handle.net/10393/26535.

Full text
Abstract:
CALLISTO is a text summarizer that searches through a space of possible configurations for the best one. This differs from other systems in that it allows CALLISTO (1) to choose adequate components based on results obtained on the training data (and thus to choose a configuration better adapted to the problem) and (2) to summarize different texts in different ways. The purpose of this thesis is to find out how the initial space that CALLISTO explores can be modified to improve the overall quality of the summaries produced. The thesis reviews and evaluates the initial, somewhat arbitrary design choices made in the system, through a fully automated framework based on a content measure proposed by Lin and Hovy. We tried different modifications to CALLISTO, such as replacing the internal evaluation measure, testing other discretization processes, changing the learning algorithm, and adding new features to characterize the input text. We found that Naive Bayes outperformed the current learner, C5.0, by identifying one configuration that works satisfactorily for all texts.
APA, Harvard, Vancouver, ISO, and other styles
28

Kantzola, Evangelia. "Extractive Text Summarization of Greek News Articles Based on Sentence-Clusters." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-420291.

Full text
Abstract:
This thesis introduces an extractive summarization system for Greek news articles based on sentence clustering. The main purpose of the thesis is to evaluate the impact of three different types of text representation (Word2Vec embeddings, TF-IDF and LASER embeddings) on the summarization task. Using these techniques, we build three different versions of the initial summarizer. Moreover, we create a new corpus of gold-standard summaries against which to evaluate the system summaries. The new collection of reference summaries is merged with part of the MultiLing Pilot 2011 corpus to constitute our main dataset. We perform both automatic and human evaluation. Our automatic ROUGE results suggest that System A, which employs averaged Word2Vec vectors to create sentence embeddings, outperforms the other two systems by yielding higher ROUGE-L F-scores. Contrary to our initial hypotheses, System C, using LASER embeddings, fails to surpass even the Word2Vec method, sometimes showing a weak sentence representation. With regard to the scores obtained in the manual evaluation task, we observe that System A (averaged Word2Vec vectors) and System C (LASER embeddings) tend to produce more coherent and adequate summaries than System B (TF-IDF). Furthermore, the majority of system summaries are rated very high with respect to non-redundancy. Overall, System A, utilizing averaged Word2Vec embeddings, performs quite successfully according to both evaluations.
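A minimal sketch of the sentence-clustering pipeline behind a System-A-style summarizer, assuming gensim and scikit-learn (word vectors are trained on the input itself only for brevity; the thesis's exact embedding training and sentence-selection criteria are not reproduced):

```python
# Sentence embeddings as averaged Word2Vec vectors, clustered with k-means;
# one representative sentence per cluster forms the extractive summary.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

def cluster_summarize(sentences, n_clusters=3):
    tokenized = [s.lower().split() for s in sentences]
    w2v = Word2Vec(tokenized, vector_size=100, min_count=1, seed=0)
    embeddings = np.array([
        np.mean([w2v.wv[w] for w in toks], axis=0) for toks in tokenized
    ])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    summary_idx = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        summary_idx.append(members[np.argmin(dists)])  # closest to centroid
    return [sentences[i] for i in sorted(summary_idx)]
```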
APA, Harvard, Vancouver, ISO, and other styles
29

Grant, Harald. "Extractive Multi-document Summarization of News Articles." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-158275.

Full text
Abstract:
Publicly available data grows exponentially through web services and technological advancements. Multi-document summarization (MDS) can be used to comprehend such large data streams. This research investigates the area of multi-document summarization. Multiple systems for extractive multi-document summarization are implemented using modern techniques, in the form of the pre-trained BERT language model for word embeddings and sentence classification, combined with well-proven techniques: the TextRank ranking algorithm, the Waterfall architecture and anti-redundancy filtering. The systems are evaluated on the DUC-2002, 2006 and 2007 datasets using the ROUGE metric. The results show that a BM25 sentence representation implemented in the TextRank model, using the Waterfall architecture and an anti-redundancy technique, outperforms the other implementations and provides competitive results compared with other state-of-the-art systems. A cohesive model is derived from the leading system and tried in a user study using a real-world application: a real-time news detection application with users from the news domain. The study shows a clear preference for cohesive summaries in the case of extractive multi-document summarization, with the cohesive summary preferred in the majority of cases.
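A compact sketch of the TextRank component, using TF-IDF cosine similarity in place of the BM25 sentence representation and omitting the Waterfall architecture and anti-redundancy filtering described above:

```python
# TextRank-style extractive ranking over a weighted sentence graph.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_extract(sentences, k=3):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)         # pairwise sentence similarities
    graph = nx.from_numpy_array(sim)       # weighted graph over sentences
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # restore document order
```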
APA, Harvard, Vancouver, ISO, and other styles
30

González, Barba José Ángel. "Attention-based Approaches for Text Analytics in Social Media and Automatic Summarization." Doctoral thesis, Universitat Politècnica de València, 2021. http://hdl.handle.net/10251/172245.

Full text
Abstract:
Nowadays, society has access, and the possibility to contribute, to large amounts of the content present on the internet, such as social networks, online newspapers, forums, blogs, and multimedia content platforms. These platforms have had, during the last years, an overwhelming impact on the daily life of individuals and organizations, becoming the predominant means of sharing, discussing, and analyzing online content. It is therefore very interesting to work with these platforms, from different points of view, under the umbrella of Natural Language Processing. In this thesis, we focus on two broad areas inside this field, applied to the analysis of online content: text analytics in social media and automatic summarization. Neural networks are also a central topic of this thesis: all the experimentation has been performed using deep learning approaches, mainly based on attention mechanisms. Besides, we mostly work with the Spanish language, an underexplored language of great interest for the research projects we participated in. On the one hand, for text analytics in social media, we focus on affective analysis tasks, including sentiment analysis and emotion detection, along with the analysis of irony. In this regard, an approach based on Transformer encoders, which contextualizes pretrained Spanish word embeddings from Twitter, is presented to address sentiment analysis and irony detection tasks. We also propose the use of evaluation metrics as loss functions for training neural networks, in order to reduce the impact of class imbalance in multi-class and multi-label emotion detection tasks. Additionally, a specialization of BERT for both the Spanish language and the Twitter domain, which takes into account inter-sentence coherence in Twitter conversation flows, is presented. The performance of all these approaches has been tested on different corpora, from several reference evaluation benchmarks, showing very competitive results in all the tasks addressed. On the other hand, we focus on extractive summarization of news articles and TV talk shows. Regarding the summarization of news articles, a theoretical framework for extractive summarization, based on siamese hierarchical networks with attention mechanisms, is presented, together with two instantiations of this framework: Siamese Hierarchical Attention Networks and Siamese Hierarchical Transformer Encoders. These systems were evaluated on the CNN/DailyMail and the NewsRoom corpora, obtaining competitive results in comparison to other contemporary extractive approaches. Concerning the TV talk shows, we proposed a text summarization task for summarizing the transcribed interventions of the speakers, about a given topic, in the Spanish TV talk show "La Noche en 24 Horas". In addition, a corpus of news articles, collected from several Spanish online newspapers, is proposed in order to study the domain transferability of siamese hierarchical approaches between news articles and the interventions of debate participants. This approach shows better results than other extractive techniques, along with very promising domain transferability.
González Barba, JÁ. (2021). Attention-based Approaches for Text Analytics in Social Media and Automatic Summarization [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/172245
APA, Harvard, Vancouver, ISO, and other styles
31

Lindström, Marcus. "The Impact of Scaling Down a Language Model Used for Text Summarization." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-283117.

Full text
Abstract:
Machine learning based language models have achieved state-of-the-art results on a variety of tasks. However, their applicability in commercial settings is becoming increasingly difficult to justify due to their ever-increasing size. The trade-off between performance and training/inference times is the usual suspect, often leading to the use of less sophisticated solutions. In this thesis, I investigate how a trained Transformer-based language model specialized in generating sentence embeddings can be scaled down using a technique known as knowledge distillation. I evaluate both the original model and its distilled counterpart on the SentEval STS benchmarks, and also through human evaluation of extractive summaries generated by clustering their embeddings. My results show that a 7.5 times smaller model not only operates at over twice the speed but also achieves almost 98% of the original model's average result on the STS tasks. Furthermore, the human evaluation shows that, when subjectively assessed, summaries generated by the smaller model are rated significantly better.
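The core of such a distillation setup can be sketched as follows: a frozen teacher provides target sentence embeddings and a smaller student is trained to reproduce them. The models, data loader and MSE objective here are placeholder assumptions, not the thesis's exact configuration.

```python
# Minimal embedding-distillation loop in PyTorch.
import torch
import torch.nn as nn

def distill(teacher, student, dataloader, epochs=1, lr=1e-4):
    teacher.eval()                               # teacher stays frozen
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for batch in dataloader:                 # batch: model-ready inputs
            with torch.no_grad():
                target = teacher(batch)          # teacher sentence embeddings
            pred = student(batch)                # student sentence embeddings
            loss = mse(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```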
APA, Harvard, Vancouver, ISO, and other styles
32

Nishino, Masaaki. "Numerical Optimization Methods based on Discrete Structure for Text Summarization and Relational Learning." 京都大学 (Kyoto University), 2014. http://hdl.handle.net/2433/192213.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Falke, Tobias [Verfasser], Iryna [Akademischer Betreuer] Gurevych, and Ido [Akademischer Betreuer] Dagan. "Automatic Structured Text Summarization with Concept Maps / Tobias Falke ; Iryna Gurevych, Ido Dagan." Darmstadt : Universitäts- und Landesbibliothek Darmstadt, 2019. http://d-nb.info/1183911491/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Goyal, Pawan. "Analytic knowledge discovery techniques for ad-hoc information retrieval and automatic text summarization." Thesis, Ulster University, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.543897.

Full text
Abstract:
Information retrieval is broadly concerned with the problem of automated searching for information within some document repository to support various information requests by users. Traditional retrieval frameworks work on the simplistic assumptions of “word independence” and “bag-of-words”, giving rise to problems such as “term mismatch” and “context-independent document indexing”. Automatic text summarization systems, which use the same paradigm as information retrieval, also suffer from these problems. The concept of “semantic relevance” has also not been formulated in the existing literature. This thesis presents a detailed investigation of knowledge discovery models and proposes new approaches to address these issues. Traditional retrieval frameworks do not succeed in fully defining the document content because they do not process the concepts in the documents; only the words are processed. To address this issue, a document retrieval model has been proposed using concept hierarchies, learnt automatically from a corpus. A novel approach to give a meaningful representation to the concept nodes in a learnt hierarchy has been proposed using a fuzzy logic based soft least upper bound method. A novel approach of adapting the vector space model with dependency parse relations for information retrieval has also been developed. A user query for information retrieval (IR) applications may not contain the most appropriate terms (words) as actually intended by the user. This is usually referred to as the term mismatch problem and is a crucial research issue in IR. To address this issue, a theoretical framework for Query Representation (QR) has been developed through a comprehensive theoretical analysis of a parametric query vector. A lexical association function has been derived analytically using the relevance criteria. The proposed QR model expands the user query using this association function. A novel term association metric has been derived using the Bernoulli model of randomness. The derived metric has been used to develop a Bernoulli Query Expansion (BQE) model. The Bernoulli model of randomness has also been extended to the pseudo relevance feedback problem by proposing a Bernoulli Pseudo Relevance (BPR) model. In traditional retrieval frameworks, the context in which a term occurs is mostly overlooked when assigning its indexing weight. This results in context-independent document indexing. To address this issue, a novel Neighborhood Based Document Smoothing (NBDS) model has been proposed, which uses the lexical association between terms to provide a context-sensitive indexing weight to the document terms, i.e. the term weights are redistributed based on the lexical association with the context words. To address context-independent document indexing for the sentence extraction based text summarization task, a lexical association measure derived using the Bernoulli model of randomness has been used. A new approach using the lexical association between terms has been proposed to give a context-sensitive weight to the document terms, and these weights have been used for the sentence extraction task. Developed analytically, the proposed QR, BQE, BPR and NBDS models provide a proper mathematical framework for query expansion and document smoothing techniques, which have largely been heuristic in the existing literature.
Being developed in the generalized retrieval framework, also proposed in this thesis, these models are applicable to all retrieval frameworks. They have been empirically evaluated on the benchmark TREC datasets and have been shown to provide significantly better performance than the baseline retrieval frameworks, without adding significant computational or storage burden. The Bernoulli model applied to the sentence extraction task has also been shown to enhance the performance of baseline text summarization systems on the benchmark DUC datasets. The theoretical foundations along with the empirical results verify that the knowledge discovery models proposed in this thesis advance the state of the art in information retrieval and automatic text summarization.
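As a loose illustration of lexical-association-based query expansion (the thesis derives its association function analytically from a relevance criterion and a Bernoulli model of randomness; the count-based score below is only a stand-in):

```python
# Expand a query with the terms most associated with it in a corpus.
from collections import Counter

def expand_query(query_terms, documents, top_n=5):
    doc_sets = [set(d.lower().split()) for d in documents]
    df = Counter(w for s in doc_sets for w in s)   # document frequencies
    cooc = Counter()
    for s in doc_sets:
        if any(t in s for t in query_terms):       # doc mentions the query
            for w in s:
                if w not in query_terms:
                    cooc[w] += 1
    # association = co-occurrence normalized by document frequency
    assoc = {w: c / df[w] for w, c in cooc.items()}
    expansion = sorted(assoc, key=assoc.get, reverse=True)[:top_n]
    return list(query_terms) + expansion
```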
APA, Harvard, Vancouver, ISO, and other styles
35

Peyrard, Maxime [Verfasser], Iryna [Akademischer Betreuer] Gurevych, Johannes [Akademischer Betreuer] Fürnkranz, and Ani [Akademischer Betreuer] Nenkova. "Principled Approaches to Automatic Text Summarization / Maxime Peyrard ; Iryna Gurevych, Johannes Fürnkranz, Ani Nenkova." Darmstadt : Universitäts- und Landesbibliothek Darmstadt, 2019. http://d-nb.info/1198403241/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Peyrard, Maxime [Verfasser], Iryna [Akademischer Betreuer] Gurevych, Johannes [Akademischer Betreuer] Fürnkranz, and Ani [Akademischer Betreuer] Nenkova. "Principled Approaches to Automatic Text Summarization / Maxime Peyrard ; Iryna Gurevych, Johannes Fürnkranz, Ani Nenkova." Darmstadt : Universitäts- und Landesbibliothek Darmstadt, 2019. http://d-nb.info/1198403241/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Hobson, Stacy F. "Text summarization evaluation correlating human performance on an extrinsic task with automatic intrinsic metrics /." College Park, Md.: University of Maryland, 2007. http://hdl.handle.net/1903/7623.

Full text
Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2007.
Thesis research directed by: Dept. of Computer Science. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
APA, Harvard, Vancouver, ISO, and other styles
38

Duan, Yijun. "History-related Knowledge Extraction from Temporal Text Collections." Kyoto University, 2020. http://hdl.handle.net/2433/253410.

Full text
Abstract:
Kyoto University (京都大学)
0048
New-system doctorate by coursework
Doctor of Informatics (博士(情報学))
Degree No. 甲第22574号; report No. 情博第711号; call No. 新制||情||122 (University Library)
Department of Social Informatics, Graduate School of Informatics, Kyoto University
Examining committee: (chief examiner) Prof. Masatoshi Yoshikawa, Prof. Hisashi Kashima, Prof. Keishi Tajima, Program-Specific Assoc. Prof. Adam Wladyslaw Jatowt
Qualified under Article 4, Paragraph 1 of the Degree Regulations
APA, Harvard, Vancouver, ISO, and other styles
39

Fang, Yimai. "Proposition-based summarization with a coherence-driven incremental model." Thesis, University of Cambridge, 2019. https://www.repository.cam.ac.uk/handle/1810/287468.

Full text
Abstract:
Summarization models which operate on meaning representations of documents have been neglected in the past, although they are a very promising and interesting class of methods for summarization and text understanding. In this thesis, I present one such summarizer, which uses the proposition as its meaning representation. My summarizer is an implementation of Kintsch and van Dijk's model of comprehension, which uses a tree of propositions to represent the working memory. The input document is processed incrementally in iterations. In each iteration, new propositions are connected to the tree under the principle of local coherence, and then a forgetting mechanism is applied so that only a few important propositions are retained in the tree for the next iteration. A summary can be generated using the propositions which are frequently retained. Originally, this model was only played through by hand by its inventors using human-created propositions. In this work, I turned it into a fully automatic model using current NLP technologies. First, I create propositions by obtaining and then transforming a syntactic parse. Second, I have devised algorithms to numerically evaluate alternative ways of adding a new proposition, as well as to predict necessary changes in the tree. Third, I compared different methods of modelling local coherence, including coreference resolution, distributional similarity, and lexical chains. In the first group of experiments, my summarizer realizes summary propositions by sentence extraction. These experiments show that my summarizer outperforms several state-of-the-art summarizers. The second group of experiments concerns abstractive generation from propositions, which is a collaborative project. I have investigated the option of compressing extracted sentences, but generation from propositions has been shown to provide better information packaging.
APA, Harvard, Vancouver, ISO, and other styles
40

Meechan-Maddon, Ailsa. "The effect of noise in the training of convolutional neural networks for text summarisation." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-384607.

Full text
Abstract:
In this thesis, we work towards bridging the gap between two distinct areas: noisy text handling and text summarisation. The overall goal is to examine the effects of noise in the training of convolutional neural networks for text summarisation, with a view to understanding how to effectively create a noise-robust text-summarisation system. We look specifically at the problem of abstractive text summarisation of noisy data in the context of summarising error-containing documents from automatic speech recognition (ASR) output. We experiment with adding varying levels of noise (errors) to the 4-million-article Gigaword corpus and training an encoder-decoder CNN on it, with the aim of producing a noise-robust text summarisation system. A total of six text summarisation models are trained, each with a different level of noise. We discover that the models trained with a high level of noise are indeed able to aptly summarise noisy data into clean summaries, despite a tendency for all models to overfit to the level of noise on which they were trained. Directions are given for future steps towards an even more noise-robust and flexible text summarisation system.
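A toy version of the noising step, injecting character-level substitutions, deletions and swaps at a chosen rate (the thesis's actual error model for simulating ASR output is not reproduced here):

```python
# Randomly corrupt characters to simulate ASR-style noise.
import random

def add_noise(text, rate=0.1, seed=0):
    rng = random.Random(seed)
    chars, out, i = list(text), [], 0
    while i < len(chars):
        if chars[i].isalpha() and rng.random() < rate:
            op = rng.choice(["substitute", "delete", "swap"])
            if op == "substitute":
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
            elif op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]])
                i += 1                    # skip the swapped neighbour
            # "delete": append nothing
        else:
            out.append(chars[i])
        i += 1
    return "".join(out)
```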
APA, Harvard, Vancouver, ISO, and other styles
41

Kan'an, Tarek Ghaze. "Arabic News Text Classification and Summarization: A Case of the Electronic Library Institute SeerQ (ELISQ)." Diss., Virginia Tech, 2015. http://hdl.handle.net/10919/74272.

Full text
Abstract:
Arabic news articles in heterogeneous electronic collections are difficult for users to work with. Two problems are that they are not categorized in a way that would aid browsing, and that there are no summaries or detailed metadata records that would be easier to work with than full articles. To address the first problem, schema mapping techniques were adapted to construct a simple taxonomy for Arabic news stories that is compatible with the subject codes of the International Press Telecommunications Council. Automatic classification methods were then investigated to identify the most appropriate way of labeling each article with the proper taxonomy category. Experiments showed that the best features for classification resulted from a new tailored stemming approach (a new Arabic light stemmer called P-Stemmer). When coupled with binary classification using SVM, the newly developed approach proved superior to state-of-the-art techniques. To address the second problem, i.e., summarization, preliminary work was done with English corpora in the context of a new Problem Based Learning (PBL) course wherein students produced template summaries of big text collections. The techniques used in the course were extended to work with Arabic news. Due to the lack of high-quality tools for Named Entity Recognition (NER) and topic identification for Arabic, two new tools were constructed: RenA for Arabic NER, and ALDA for Arabic topic extraction (using the Latent Dirichlet Allocation algorithm). Controlled experiments with RenA and ALDA, involving Arabic speakers and a randomly selected corpus of 1000 Qatari news articles, showed that the tools produced very good results (i.e., names, organizations, locations, and topics). The categorization, NER, topic identification, and additional information extraction techniques were then combined to produce approximately 120,000 summaries for Qatari news articles, which are searchable, along with the articles, using LucidWorks Fusion, which builds upon Solr software. Evaluation of the summaries showed high ratings based on the 1000-article test corpus. Contributions of this research with Arabic news articles thus include a new test corpus, taxonomy, light stemmer, classification approach, NER tool, topic identification tool, and template-based summarizer, all shown through experimentation to be highly effective.
Ph. D.
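The flavour of a light stemmer can be conveyed with a toy prefix/suffix stripper; the affix lists and the minimum-length guard below are illustrative and do not reproduce P-Stemmer's tailored rules:

```python
# Toy Arabic light stemming by longest-match affix stripping.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ات", "ان", "ين", "ون", "ها", "ية", "ه", "ة"]

def light_stem(word):
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]          # strip one prefix at most
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]         # strip one suffix at most
            break
    return word
```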
APA, Harvard, Vancouver, ISO, and other styles
42

El, Aouad Sara. "Personalized, Aspect-based Summarization of Movie Reviews." Electronic Thesis or Diss., Sorbonne université, 2019. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2019SORUS019.pdf.

Full text
Abstract:
Online reviewing websites help users decide what to buy or where to go. These platforms allow users to express their opinions using numerical ratings as well as textual comments. The numerical ratings give a coarse idea of the service; the textual comments, on the other hand, give full details, which are tedious for users to read. In this dissertation, we develop novel methods and algorithms to generate personalized, aspect-based summaries of movie reviews for a given user. The first problem we tackle is extracting a set of words related to an aspect from movie reviews. Our evaluation shows that our method is able to extract even unpopular terms that represent an aspect, such as compound terms or abbreviations, as opposed to the methods from the related work. We then study the problem of annotating sentences with aspects, and propose a new method that annotates sentences based on a similarity between the aspect signature and the terms in the sentence. The third problem we tackle is the generation of personalized, aspect-based summaries. We propose an optimization algorithm to maximize the coverage of the aspects the user is interested in and the representativeness of the sentences in the summary, subject to length and similarity constraints. Finally, we perform three user studies showing that the approach we propose outperforms the state-of-the-art method for generating summaries.
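A greedy stand-in for the optimization the abstract describes: repeatedly add the sentence that covers the most not-yet-covered user aspects, under a word budget. The dissertation's exact objective, with its representativeness and similarity terms, is not reproduced; sentence_aspects, which maps each sentence index to a set of aspect labels, is a hypothetical input.

```python
# Greedy aspect-coverage maximization under a summary length budget.
def build_summary(sentences, sentence_aspects, user_aspects, budget=60):
    covered, chosen, length = set(), [], 0
    while True:
        best, best_gain = None, 0
        for i, s in enumerate(sentences):
            if i in chosen or length + len(s.split()) > budget:
                continue
            gain = len((sentence_aspects[i] & user_aspects) - covered)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:                  # no sentence adds coverage
            break
        chosen.append(best)
        covered |= sentence_aspects[best] & user_aspects
        length += len(sentences[best].split())
    return [sentences[i] for i in sorted(chosen)]
```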
APA, Harvard, Vancouver, ISO, and other styles
43

Škurla, Ján. "Sumarizace dokumentů na webu." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-236504.

Full text
Abstract:
The topic of this master's thesis is the summarization of documents on the web. First, it deals with the issues of acquiring text from the web using a wrapper; an overview of wrappers that served as inspiration for the implementation is given. The thesis also covers various methods for creating summaries (Luhn's, Edmundson's and KPC) from text data. The design of an application for text data extraction and summarization is also part of this work. The application is based on the Java platform and the Swing graphics library.
APA, Harvard, Vancouver, ISO, and other styles
44

Rennes, Evelina. "Keeping an Eye on the Context : An Eye Tracking Study of Cohesion Errors in Automatic Text Summarization." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-95527.

Full text
Abstract:
Automatic text summarization is a growing field due to the modern world's Internet-based society, but automatically creating perfect summaries is not easy, and cohesion errors are common. Using an eye-tracking camera, this thesis studies the nature of four different types of cohesion errors occurring in summaries. A total of 23 participants read and rated four different texts and marked the most difficult areas of each text. Statistical analysis of the data revealed that absent cohesion or context and broken anaphoric reference (pronouns) caused some disturbance in reading, but that the impact is restricted to the effort required to read rather than the comprehension of the text. Erroneous anaphoric reference (pronouns) was not detected by the participants, which poses a problem for automatic text summarizers, and other potentially disturbing factors were detected. Finally, the question is raised whether it is meaningful to keep absent cohesion or context as a separate error type.
APA, Harvard, Vancouver, ISO, and other styles
45

Monsen, Julius. "Building high-quality datasets for abstractive text summarization : A filtering‐based method applied on Swedish news articles." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-176352.

Full text
Abstract:
With an increasing amount of information on the internet, automatic text summarization could potentially make content more readily available to a larger variety of people. Training and evaluating text summarization models require datasets of sufficient size and quality. Today, most such datasets are in English, and for minor languages such as Swedish, it is not easy to obtain corresponding datasets with handwritten summaries. This thesis proposes methods for compiling high-quality datasets suitable for abstractive summarization from a large amount of noisy data through characterization and filtering. The data consists of Swedish news articles and their preambles, which are here used as summaries. Different filtering techniques are applied, yielding five different datasets. Furthermore, summarization models are implemented by warm-starting an encoder-decoder model with BERT checkpoints and fine-tuning it on the different datasets. The fine-tuned models are evaluated with ROUGE metrics and BERTScore. All models achieve significantly better results when evaluated on filtered test data than on unfiltered test data. Moreover, the model trained on the most heavily filtered, and hence smallest, dataset achieves the best results on the filtered test data. The trade-off between dataset size and quality, and other methodological implications of the data characterization, filtering and model implementation, are discussed, leading to suggestions for future research.
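The filtering step might look roughly like the following sketch, which keeps article/preamble pairs by simple length, compression and overlap heuristics; all thresholds are illustrative, not the thesis's values:

```python
# Filter noisy article/summary pairs for abstractive summarization.
def filter_pairs(pairs, min_len=25, max_ratio=0.5, max_overlap=0.9):
    kept = []
    for article, summary in pairs:
        a_toks, s_toks = article.split(), summary.split()
        if len(s_toks) < min_len:
            continue                      # summary too short to be useful
        if len(s_toks) / max(len(a_toks), 1) > max_ratio:
            continue                      # barely compresses the article
        overlap = len(set(s_toks) & set(a_toks)) / max(len(set(s_toks)), 1)
        if overlap > max_overlap:
            continue                      # near-extractive, low abstractiveness
        kept.append((article, summary))
    return kept
```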
APA, Harvard, Vancouver, ISO, and other styles
46

Folin, Veronika. "Abstractive Long Document Summarization: Studio e Sperimentazione di Modelli Generativi Retrieval-Augmented." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amslaurea.unibo.it/24283/.

Full text
Abstract:
This thesis presents the study and experimental evaluation of a retrieval-augmented generative model, based on Transformers, for the task of abstractive summarization of long legal judgments. Automatic text summarization has become a very important Natural Language Processing (NLP) task nowadays, given the enormous amount of data coming from the web and from databases. Moreover, it makes it possible to automate a process that is very costly for experts, especially in the legal domain, where documents are long and complicated and therefore difficult and expensive to summarize. State-of-the-art models for automatic text summarization are based on deep learning solutions, in particular on Transformers, which represent the most established architecture for NLP tasks. The model proposed in this thesis is a solution for Long Document Summarization, i.e., for generating summaries of long textual sequences. In particular, the architecture is based on the RAG (Retrieval-Augmented Generation) model, recently introduced by the Facebook AI research team for the Question Answering task. The objective is to modify the RAG architecture to make it suitable for the abstractive Long Document Summarization task. In detail, the aim is to exploit and test the non-parametric memory of the model in order to enrich the representation of the input text to be summarized. To this end, several configurations of the model were tested in different types of experiments, and the generated summaries were evaluated with several automatic metrics.
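A highly simplified retrieve-then-summarize sketch of the idea (the thesis works with the actual RAG architecture and its non-parametric memory; here TF-IDF retrieval and a default Hugging Face summarization pipeline are stand-ins, and memory_passages is a hypothetical external memory):

```python
# Augment the input with retrieved passages before abstractive generation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

def retrieval_augmented_summary(document, memory_passages, k=2):
    vec = TfidfVectorizer().fit(memory_passages + [document])
    sims = cosine_similarity(vec.transform([document]),
                             vec.transform(memory_passages))[0]
    top = sims.argsort()[::-1][:k]        # most similar memory passages
    context = " ".join(memory_passages[i] for i in top)
    summarizer = pipeline("summarization")  # default abstractive model
    out = summarizer(context + " " + document,
                     max_length=150, truncation=True)
    return out[0]["summary_text"]
```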
APA, Harvard, Vancouver, ISO, and other styles
47

Keneshloo, Yaser. "Addressing Challenges of Modern News Agencies via Predictive Modeling, Deep Learning, and Transfer Learning." Diss., Virginia Tech, 2019. http://hdl.handle.net/10919/91910.

Full text
Abstract:
Today's news agencies are moving from traditional journalism, where publishing just a few news articles per day was sufficient, to modern content generation mechanisms that create more than a thousand news pieces every day. With the growth of these modern news agencies comes the arduous task of properly handling the massive amount of data that is generated for each news article. Therefore, news agencies are constantly seeking solutions to facilitate and automate some of the tasks that have previously been done by humans. In this dissertation, we focus on some of these problems and provide solutions for two broad problems that help a news agency not only to have a wider view of reader behaviour around an article but also to provide automated tools that ease the job of editors in summarizing news articles. These two disjoint problems aim at improving the users' reading experience by helping the content generator to monitor and focus on poorly performing content while promoting the well-performing content. We first focus on the task of popularity prediction of news articles via a combination of regression, classification, and clustering models. We next focus on the problem of generating automated text summaries for a long news article using deep learning models. The first problem aims at helping the content developer understand how a news article performs over the long run, while the second provides automated tools for content developers to generate summaries for each news article.
Doctor of Philosophy
Nowadays, each person is exposed to an immense amount of information from social media, blog posts, and online news portals. Among these sources, news agencies are one of the main content providers for people around the world. Contemporary news agencies are moving from traditional journalism to modern techniques from different angles, either by building smart tools to track readers' reactions around a specific news article or by providing automated tools that help editors deliver higher-quality content to readers. These systems should not only scale well with the growth of readers but also be able to process ad-hoc requests, precisely because most of the policies and decisions in these agencies are made around the results of these analytical tools. As part of this new movement towards adopting new technologies for smart journalism, we have worked with The Washington Post news agency on various problems, building tools for predicting the popularity of a news article and an automated text summarization model. We develop a model that monitors each news article after its publication and predicts the number of views the article will receive within the next 24 hours. This model helps the content creator not only to promote a potentially viral article on the main page of the web portal or social media, but also gives editors intuition about potentially poorly performing articles so that they can edit their content for better exposure. On the other hand, current news agencies generate more than a thousand news articles per day, and writing three to four summary sentences for each of these news pieces will not only become infeasible in the near future but is also very expensive and time-consuming. Therefore, we also develop a separate model for automated text summarization which generates summary sentences for a news article. Our model generates summaries by selecting the most salient sentences in the news article and paraphrasing them into shorter sentences that can serve as summary sentences for the entire document.
APA, Harvard, Vancouver, ISO, and other styles
48

Venkatachalam, Ramiya. "Surfacing Personas from Enterprise Social Media to Enhance Engagement Visibility." The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1370882249.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Kumar, Trun. "Automaic Text Summarization." Thesis, 2014. http://ethesis.nitrkl.ac.in/5619/1/110CS0127.pdf.

Full text
Abstract:
Automatic summarization is the procedure of reducing the content of a document with a computer program so as to produce a summary that retains the most important sentences of the original text file. Summarizing a document is a difficult task even for human beings, and automating it presents several challenges, since the system can only extract the required information from the original document. As the problem of information overload has grown, and as the amount of available data has increased, so has the interest in automatic summarization, because it is exceedingly difficult for people to manually condense large documents. Automatic summarization systems may be classified into extractive and abstractive. An extractive method involves selecting important sentences from the document and concatenating them into a shorter form; the importance of the selected sentences is based on statistical and semantic features. Extractive methods work by selecting a subset of the existing words or sentences in the text file to produce the summary. Searching for important information in a huge document is a demanding job for the user, so automatically extracting the key information or a summary of the document saves the user time compared to reading the whole text and provides quick knowledge from the large text file. Extractive summarization commonly relies on sentence-extraction techniques that cover the set of sentences most important for the overall understanding of a given text file. With the frequency-based technique, the obtained summary is more coherent, whereas with k-means clustering, because sentences are extracted out of order, the summary might not make sense.
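A minimal frequency-based extractor of the kind described (the stopword list and scoring are illustrative):

```python
# Score sentences by content-word frequency; keep the top-k in order.
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "that", "it"}

def frequency_summary(sentences, k=3):
    words = [w for s in sentences for w in s.lower().split()
             if w.isalpha() and w not in STOPWORDS]
    freq = Counter(words)
    scores = {
        i: sum(freq[w] for w in s.lower().split() if w in freq)
        for i, s in enumerate(sentences)
    }
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]   # preserve document order
```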
APA, Harvard, Vancouver, ISO, and other styles
50

Kumar, T. "Automatic text summarization." Thesis, 2014. http://ethesis.nitrkl.ac.in/5617/1/E-65.pdf.

Full text
Abstract:
Automatic summarization is the procedure of reducing the content of a document with a computer program so as to produce a summary that retains the most important sentences of the original text file. Summarizing a document is a difficult task even for human beings, and automating it presents several challenges, since the system can only extract the required information from the original document. As the problem of information overload has grown, and as the amount of available data has increased, so has the interest in automatic summarization, because it is exceedingly difficult for people to manually condense large documents. Automatic summarization systems may be classified into extractive and abstractive. An extractive method involves selecting important sentences from the document and concatenating them into a shorter form; the importance of the selected sentences is based on statistical and semantic features. Extractive methods work by selecting a subset of the existing words or sentences in the text file to produce the summary. Searching for important information in a huge document is a demanding job for the user, so automatically extracting the key information or a summary of the document saves the user time compared to reading the whole text and provides quick knowledge from the large text file. Extractive summarization commonly relies on sentence-extraction techniques that cover the set of sentences most important for the overall understanding of a given text file. With the frequency-based technique, the obtained summary is more coherent, whereas with k-means clustering, because sentences are extracted out of order, the summary might not make sense.
APA, Harvard, Vancouver, ISO, and other styles