Dissertations / Theses on the topic 'Unstructured data mining'

Consult the top 22 dissertations / theses for your research on the topic 'Unstructured data mining.'


1

Bala, Saimir. "Mining Projects from Structured and Unstructured Data." Jens Gulden, Selmin Nurcan, Iris Reinhartz-Berger, Widet Guédria, Palash Bera, Sérgio Guerreiro, Michael Fellman, Matthias Weidlich, 2017. http://epub.wu.ac.at/7205/1/ProjecMining%2DCamera%2DReady.pdf.

Abstract:
Companies working on safety-critical projects must adhere to strict rules imposed by the domain, especially when human safety is involved. These projects need to comply with standard norms and regulations, so all process steps must be clearly documented in order to be verifiable for compliance at a later stage by an auditor. Nevertheless, documentation often comes in the form of manually written textual documents in different formats. Moreover, project members use diverse proprietary tools. This makes it difficult for auditors to understand how the actual project was conducted. My research addresses this project mining problem by exploiting logs from project-generated artifacts, which come from the software repositories used by the project team.
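
The abstract stays at the level of the problem; as a purely illustrative sketch of what "exploiting logs from project-generated artifacts" can mean in practice, the snippet below flattens a Git history into a case/activity/timestamp event log of the kind process mining tools consume. The pretty-format string is standard Git; the mapping of commit subjects to activities is an assumption made up for illustration.

```python
import csv
import subprocess

# Turn a Git history into a flat event log for process-mining tools.
# "%H|%an|%ad|%s" is a standard git pretty-format string.
def git_commits(repo_path):
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--date=iso-strict",
         "--pretty=format:%H|%an|%ad|%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        commit, author, date, subject = line.split("|", 3)
        yield commit, author, date, subject

def write_event_log(repo_path, csv_path):
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["case_id", "activity", "resource", "timestamp"])
        for commit, author, date, subject in git_commits(repo_path):
            # Crude assumption: the first word of the commit subject
            # ("fix", "add", "refactor", ...) stands in for the activity.
            activity = subject.split()[0].lower() if subject else "commit"
            writer.writerow([commit[:8], activity, author, date])

if __name__ == "__main__":
    write_event_log(".", "event_log.csv")
```
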
2

Bojduj, Brett N. "Extraction of Causal-Association Networks from Unstructured Text Data." DigitalCommons@CalPoly, 2009. https://digitalcommons.calpoly.edu/theses/138.

Abstract:
Causality is an expression of the interactions between variables in a system. Humans often explicitly express causal relations through natural language, so extracting these relations can provide insight into how a system functions. This thesis presents a system that uses a grammar parser to extract causes and effects from unstructured text through a simple, pre-defined grammar pattern. By filtering out non-causal sentences before the extraction process begins, the presented methodology is able to achieve a precision of 85.91% and a recall of 73.99%. The polarity of the extracted relations is then classified using a Fisher classifier. The result is a set of directed relations of causes and effects, with polarity labelled as either increasing or decreasing. These relations can then be used to create networks of causes and effects. Such a “Causal-Association Network” (CAN) can be used to aid decision-making in complex domains, such as economics or medicine, that rely upon dynamic interactions between many variables.
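
The thesis's grammar-parser pipeline is not reproduced here; the sketch below only mirrors its shape (pre-filter sentences, extract cause/effect pairs, attach polarity, assemble a directed network), with a single assumed regex pattern standing in for the grammar rules.

```python
import re
from collections import defaultdict

# One assumed pattern, "X causes/increases/decreases/reduces Y",
# stands in for the thesis's pre-defined grammar pattern.
PATTERN = re.compile(
    r"(?P<cause>[\w ]+?)\s+(?P<verb>causes|increases|decreases|reduces)\s+(?P<effect>[\w ]+)",
    re.IGNORECASE,
)
POLARITY = {"causes": "+", "increases": "+", "decreases": "-", "reduces": "-"}

def extract_relations(sentences):
    """Yield (cause, effect, polarity) triples from causal sentences."""
    for s in sentences:
        # Cheap pre-filter: sentences with no causal cue never match.
        m = PATTERN.search(s)
        if m:
            yield (m["cause"].strip().lower(),
                   m["effect"].strip().lower(),
                   POLARITY[m["verb"].lower()])

def build_network(relations):
    graph = defaultdict(list)          # cause -> [(effect, polarity)]
    for cause, effect, pol in relations:
        graph[cause].append((effect, pol))
    return graph

sentences = ["Higher interest rates reduce consumer spending.",
             "Smoking increases the risk of heart disease."]
print(build_network(extract_relations(sentences)))
```
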
3

King, Michael Allen. "Ensemble Learning Techniques for Structured and Unstructured Data." Diss., Virginia Tech, 2015. http://hdl.handle.net/10919/51667.

Abstract:
This research provides an integrated approach to applying innovative ensemble learning techniques that have the potential to increase the overall accuracy of classification models. Actual structured and unstructured data sets from industry are utilized during the research process, analysis and subsequent model evaluations. The first research section addresses the consumer demand forecasting and daily capacity management requirements of a nationally recognized alpine ski resort in the state of Utah, in the United States of America. A basic econometric model is developed, and its effectiveness is evaluated with three classic predictive models. These predictive models were subsequently used as input for four ensemble modeling techniques, and the ensemble learning techniques are shown to be effective. The second research section discusses the opportunities and challenges faced by a leading firm providing sponsored search marketing services. The goal of sponsored search marketing campaigns is to create advertising campaigns that better attract and motivate a target market to purchase. This research develops a method for classifying profitable campaigns and maximizing overall campaign portfolio profits. Four traditional classifiers are utilized, along with four ensemble learning techniques, to build classifier models that identify profitable pay-per-click campaigns. A MetaCost ensemble configuration, which is able to incorporate unequal classification costs, produced the highest campaign portfolio profit. The third research section addresses the management challenges of online consumer reviews encountered by service industries and shows how these textual reviews can be used for service improvements. A service improvement framework is introduced that integrates traditional text mining techniques and second-order feature derivation with ensemble learning techniques. The concept of GLOW and SMOKE words is introduced and is shown to be an objective text-analytic source of service defects or service accolades.
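
None of the industrial datasets or models above are public; as a hedged sketch of the general recipe (several base classifiers feeding an ensemble learner), the following uses scikit-learn stacking on synthetic data. Plain stacking is a stand-in here; the MetaCost cost-sensitive configuration the abstract mentions is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the (proprietary) resort and
# pay-per-click datasets described in the abstract.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=5)),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=100)),
]
ensemble = StackingClassifier(estimators=base_learners,
                              final_estimator=LogisticRegression())
ensemble.fit(X_tr, y_tr)
print("ensemble accuracy:", ensemble.score(X_te, y_te))
```
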
4

Al-Azzam, Omar Ghazi. "Mining for Significant Information from Unstructured and Structured Biological Data and Its Applications." Diss., North Dakota State University, 2012. https://hdl.handle.net/10365/26509.

Abstract:
Massive amounts of biological data are being accumulated in science. Searching for significant, meaningful information and patterns in different types of data is necessary for gaining knowledge from the large amounts of data available to users. However, data mining techniques do not normally deal with significance. Integrating data mining techniques with standard statistical procedures provides a way to mine statistically significant, interesting information from both structured and unstructured data. In this dissertation, different algorithms for mining significant biological information from both unstructured and structured data are proposed. A weighted-density-based approach is presented for mining item data from unstructured textual representations. Different algorithms in the area of radiation hybrid mapping are developed for mining significant information from structured binary data. The proposed algorithms have different applications in the ordering problem in radiation hybrid mapping, including identifying unreliable markers and building solid framework maps. The effectiveness of the proposed algorithms in improving map stability is demonstrated, where map stability is determined based on resampling analysis. The proposed algorithms deal effectively and efficiently with multidimensional data and also reduce computational cost dramatically. Evaluation shows that the proposed algorithms outperform comparative methods in terms of both accuracy and computational cost.
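
The dissertation's own algorithms are not reproduced here; the sketch below only illustrates the central idea of coupling a mining step with a standard statistical test, keeping a mined item pair only if a chi-square test finds its association significant. The toy transactions and the 0.05 threshold are assumptions.

```python
from itertools import combinations
from scipy.stats import chi2_contingency

# Toy transaction data standing in for mined biological item data.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"},
                {"a", "b", "c"}, {"a", "c"}, {"a", "b"}]

def significant_pairs(transactions, alpha=0.05):
    """Yield item pairs whose co-occurrence passes a chi-square test."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    for x, y in combinations(items, 2):
        both = sum(1 for t in transactions if x in t and y in t)
        only_x = sum(1 for t in transactions if x in t and y not in t)
        only_y = sum(1 for t in transactions if y in t and x not in t)
        neither = n - both - only_x - only_y
        # 2x2 contingency table of co-occurrence counts.
        _, p, _, _ = chi2_contingency([[both, only_x], [only_y, neither]])
        if p < alpha:
            yield (x, y, p)

for x, y, p in significant_pairs(transactions):
    print(f"{x},{y}: p={p:.3f}")
```
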
5

Yaakub, Mohd Ridzwan. "Integration of Opinion Mining into customer analysis model." Thesis, Queensland University of Technology, 2015. https://eprints.qut.edu.au/85084/1/Mohd%20Ridzwan_Yaakub_Thesis.pdf.

Abstract:
This research proposes a multi-dimensional model for opinion mining which integrates customers' characteristics and their opinions about products (or services). Customer opinions are valuable for companies seeking to deliver the right products or services to their customers. This research presents a comprehensive framework for evaluating opinion orientation based on a product's hierarchy of attributes. It also provides an alternative way to obtain opinion summaries for different groups of customers and different categories of products.
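
As a hedged illustration of summarising opinions over a product attribute hierarchy, the sketch below rolls polarity scores up a toy hierarchy; the tree, the scores, and the averaging rule are all assumptions, not the thesis's model.

```python
# A toy attribute hierarchy and per-attribute polarity scores in [-1, 1]
# mined from reviews (all made up for illustration).
hierarchy = {"camera": ["lens", "battery"], "lens": [], "battery": []}
scores = {"camera": [0.2], "lens": [0.8, 0.5], "battery": [-0.6]}

def rollup(node):
    """Average a node's own scores with its children's roll-ups."""
    values = list(scores.get(node, []))
    values += [rollup(child) for child in hierarchy.get(node, [])]
    return sum(values) / len(values) if values else 0.0

print(f"overall polarity for 'camera': {rollup('camera'):+.2f}")
```
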
6

Yip, Chi-lap (葉立志). "Discovering Patterns in Databases: The Cases for Language, Music, and Unstructured Data." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2000. http://hub.hku.hk/bib/B31242649.

7

Yip, Chi-lap. "Discovering patterns in databases the cases for language, music, and unstructured data /." Hong Kong : University of Hong Kong, 2000. http://sunzi.lib.hku.hk/hkuto/record.jsp?B2240112X.

8

Popescu, Ana-Maria. "Information Extraction from Unstructured Web Text." Thesis, University of Washington (UW restricted), 2007. http://hdl.handle.net/1773/6935.

9

Vikholm, Oskar. "Dealing with unstructured data : A study about information quality and measurement." Thesis, Uppsala universitet, Institutionen för informatik och media, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-255214.

Abstract:
Many organizations have realized that the growing amount of unstructured text may contain information that can be used for different purposes, such as decision making. By using so-called text mining tools, organizations can extract information from text documents. Within military and intelligence activities, for example, it is important to go through reports and look for entities such as names of people, events, and the relationships between them when criminal or other activities of interest are being investigated and mapped. This study explores how information quality can be measured and what challenges that involves. It builds on Wang and Strong's (1996) theory of how information quality can be measured. The theory is tested and discussed on the basis of empirical material consisting of interviews from two case organizations. The study identifies two important aspects to take into consideration when measuring information quality: context dependency and source criticism. Context dependency means that the context in which information quality is measured must be defined based on the consumer's needs. Source criticism means that it is important to consider the original source and how reliable it is. Further, data quality and information quality are often used interchangeably, which means that organizations need to decide what they really want to measure. One of the major challenges in developing software for entity extraction is that the system needs to understand the structure of natural language, which is very complicated.
10

Sequeira, José Francisco Rodrigues. "Automatic knowledge base construction from unstructured text." Master's thesis, Universidade de Aveiro, 2016. http://hdl.handle.net/10773/17910.

Abstract:
Taking into account the overwhelming number of biomedical publications being produced, the effort required for a user to efficiently explore those publications in order to establish relationships between a wide range of concepts is staggering. This dissertation presents GRACE, a web-based platform that provides an advanced graphical exploration interface allowing users to traverse the biomedical domain in order to find explicit and latent associations between annotated biomedical concepts belonging to a variety of semantic types (e.g., Genes, Proteins, Disorders, Procedures and Anatomy). The knowledge base is built from a collection of MEDLINE articles with English abstracts. The concept annotations are stored in an efficient data store that allows for complex queries and high-performance data delivery. Concept relationships are inferred through statistical analysis, applying association measures to annotated terms. These processes grant the graphical interface the ability to create, in real time, a data visualization in the form of a graph for the exploration of these biomedical concept relationships.
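
GRACE's actual association measures are not specified in the abstract; as one plausible concretization, the sketch below scores concept pairs by pointwise mutual information (PMI) over per-document annotations and keeps the scored pairs as graph edges. The toy annotations are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

# Toy per-document concept annotations standing in for MEDLINE data.
docs = [{"BRCA1", "breast cancer"},
        {"BRCA1", "breast cancer", "tamoxifen"},
        {"aspirin", "headache"},
        {"tamoxifen", "breast cancer"}]

n = len(docs)
single = Counter(c for d in docs for c in d)
pair = Counter(frozenset(p) for d in docs for p in combinations(sorted(d), 2))

def pmi(a, b):
    """Pointwise mutual information of two co-annotated concepts."""
    p_ab = pair[frozenset((a, b))] / n
    return math.log2(p_ab / ((single[a] / n) * (single[b] / n)))

# Scored pairs become the edges of the exploration graph.
edges = [(a, b, pmi(a, b)) for (a, b) in map(tuple, pair)]
for a, b, w in sorted(edges, key=lambda e: -e[2]):
    print(f"{a} -- {b}: PMI={w:.2f}")
```
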
11

Xiong, Hui. "Combining Subject Expert Experimental Data with Standard Data in Bayesian Mixture Modeling." The Ohio State University, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=osu1312214048.

12

Coetsee, Dirko. "Conditional random fields for noisy text normalisation." Thesis, Stellenbosch : Stellenbosch University, 2014. http://hdl.handle.net/10019.1/96064.

Abstract:
The increasing popularity of microblogging services such as Twitter means that more and more unstructured data is available for analysis. The informal language used in these media presents a problem for traditional text mining and natural language processing tools. We develop a pre-processor to normalise this noisy text so that useful information can be extracted with standard tools. A system consisting of a tokeniser, an out-of-vocabulary token identifier, a correction candidate generator, and an N-gram language model is proposed. We compare the performance of generative and discriminative probabilistic models for these different modules. The effect of normalising the training and testing data on the performance of a tweet sentiment classifier is also investigated. A linear-chain conditional random field, which is a discriminative model, is found to work better than its generative counterpart for the tokenisation module, achieving a 0.76% character error rate compared to 1.41% for the finite state automaton. For the candidate generation module, however, the generative weighted finite state transducer works better, getting the correct clean version of a word right 36% of the time on the first guess, while the discriminatively trained hidden alignment conditional random field only achieves 6%. The use of a normaliser as a pre-processing step does not significantly affect the performance of the sentiment classifier.
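
The character error rate quoted above (0.76% vs. 1.41%) is a standard metric; for reference, a minimal implementation is sketched below (Levenshtein distance normalised by reference length), not code from the thesis.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between strings a and b by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance over reference length."""
    return levenshtein(hypothesis, reference) / len(reference)

print(f"{cer('c u @ work', 'see you at work'):.2%}")
```
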
13

Andrade, Junior Valter Lacerda de. "Utilização de técnicas de dados não estruturados para desenvolvimento de modelos aplicados ao ciclo de crédito." Pontifícia Universidade Católica de São Paulo, 2014. https://tede2.pucsp.br/handle/handle/18150.

Abstract:
The need for expert assessment of data mining over textual fields and other unstructured information is increasingly present in the public and private sectors. Through probabilistic models and analytical studies, it is possible to broaden the understanding of a particular information source. In recent years, technological progress has caused exponential growth in the information produced and accessed in virtual media (web and private). It is estimated that by 2003 humanity had historically generated a total of 5 exabytes of content; today that volume can be produced in a few days. Given this growing demand, this project works with probabilistic models related to the financial market in order to check whether textual fields, or unstructured information, contained within the business environment can predict certain customer behaviors. It is assumed that in the corporate environment and on the web there is highly valuable information which, due to its complexity and lack of structure, is rarely considered in probabilistic studies. This material may represent a competitive and strategic advantage for business: by analyzing unstructured information, one can learn about behaviors and modes of user interaction in the environment in which they operate, yielding data such as psychographic profiles and degrees of satisfaction. The corpus of this study consists of the results of experiments conducted in the negotiation environment of a financial company in São Paulo. Statistical concepts with a semiotic bias were applied to the analysis. Among the findings of this study is a critical and thorough understanding of the processes of textual data assessment.
14

Carvalho, André Silva de. "Analytics como uma ferramenta para Consumer Insights." Escola Superior de Propaganda e Marketing, 2017. http://tede2.espm.br/handle/tede/267.

Abstract:
Being innovative in a more and more competitive market can be anything but trivial. There is a complex system of variables to be taken into account throughout an innovation process, and there will hardly ever be enough data to support research or a decision. It is always possible to turn to human inference, or cognitive bias, when enough data is not available or when time for decision-making is scarce. The Consumer Insight technique was used in this research with the aim of lowering cognitive bias, seeking to find out consumers' wishes and needs so that decision-making or innovation could be supported. This study proposes to mitigate the influence of cognitive bias by means of data analysis techniques, searching for patterns that can identify opportunities to support both decision-making and the search for innovation. To this end, unstructured data from 26,514 telephone conversations held at a large financial-market company between 01/12/2016 and 31/12/2016 were used. The analysis combined voice-to-text transcription with text mining and social network analysis. The results identified the main client demands from a sales perspective and cancellation requests, as well as the reason for the inefficiency of new-product offers, based on the elements of highest centrality identified in the word association networks. The findings imply that the combined use of analytical techniques on unstructured data can yield insights with less cognitive bias.
15

Silva, Gabriel Lucas Cantanhede da. "Aplicação de algoritmos genéticos em mineração de processos não estruturados." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/100/100131/tde-02072018-215453/.

Abstract:
Process mining is a new field of research that links data mining and business process management. Process mining follows the premise that there is an unknown process in a given context and that, by analyzing the traces of its behavior with the help of data mining, the process model can be discovered. However, realistic business processes are difficult to mine because of the excess of behavior recorded in the logs. These unstructured processes, despite being complex, hold great potential for improvement, yet current process mining approaches for that context provide little support for management. This master's research project applies evolutionary computational techniques to process mining, using genetic algorithms to automatically discover unstructured process models in order to support process management in organizations. A literature review was carried out to support the proposition of a new approach focused on the discovery of unstructured process models. The proposed approach introduces new formulas for calculating completeness and precision metrics, based on information about transitions between activities reorganized through a matrix structure created in this work. The approach also introduces genetic operators and evolutionary flow strategies not yet reported in the literature on genetic algorithms for process discovery. Analyses of the parameterization of the proposed approach, as well as the resulting process models, indicate that the approach is effective at mining better process models from samples of an unstructured log.
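
The thesis's encoding and operators are its own; the sketch below only illustrates the fitness idea the abstract describes, scoring candidate 0/1 transition matrices by completeness (observed transitions the model allows) and precision (allowed transitions actually observed) inside a minimal mutation-only genetic loop. The toy log and all parameters are assumptions.

```python
import random

# Candidate models are 0/1 transition matrices over four activities.
ACT = ["a", "b", "c", "d"]
log = [("a", "b"), ("b", "c"), ("a", "b"), ("b", "d")]
observed = {(ACT.index(x), ACT.index(y)) for x, y in log}

def fitness(model):
    allowed = {(i, j) for i in range(4) for j in range(4) if model[i][j]}
    if not allowed:
        return 0.0
    completeness = len(observed & allowed) / len(observed)
    precision = len(observed & allowed) / len(allowed)
    return 0.5 * completeness + 0.5 * precision

def mutate(model):
    child = [row[:] for row in model]
    i, j = random.randrange(4), random.randrange(4)
    child[i][j] ^= 1  # flip one transition on or off
    return child

random.seed(0)
population = [[[random.randint(0, 1) for _ in range(4)] for _ in range(4)]
              for _ in range(20)]
for _ in range(200):
    population.sort(key=fitness, reverse=True)   # elitist selection
    survivors = population[:10]
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(10)]
population.sort(key=fitness, reverse=True)
print("best fitness:", fitness(population[0]))
```
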
16

Valentin, Sarah. "Extraction et combinaison d’informations épidémiologiques à partir de sources informelles pour la veille des maladies infectieuses animales." Thesis, Montpellier, 2020. http://www.theses.fr/2020MONTS067.

Abstract:
Epidemic intelligence aims to detect, investigate and monitor potential health threats while relying on formal (e.g. official health authorities) and informal (e.g. media) information sources. Monitoring of unofficial sources, or so-called event-based surveillance (EBS), requires the development of systems designed to retrieve and process unstructured textual data published online. This manuscript focuses on the extraction and combination of epidemiological information from informal sources (i.e. online news), in the context of the international surveillance of animal infectious diseases. The first objective of this thesis is to propose and compare approaches to enhance the identification and extraction of relevant epidemiological information from the content of online news. The second objective is to study the use of epidemiological entities extracted from the news articles (i.e. diseases, hosts, locations and dates) in the context of event extraction and retrieval of related online news. This manuscript proposes new textual representation approaches based on selecting, expanding, and combining relevant epidemiological features. We show that adapting and extending text mining and classification methods improves the added value of online news sources for event-based surveillance. We stress the role of domain expert knowledge regarding the relevance and the interpretability of the methods proposed in this thesis. While our research is conducted in the context of animal disease surveillance, we discuss the generic aspects of our approaches with regard to unknown threats and One Health surveillance.
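
As a hedged sketch of one representation idea mentioned above, combining plain text features with explicit epidemiological entities, the snippet below concatenates TF-IDF features with binarised entity sets before classification; the toy snippets, entities, and labels are assumptions.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer

# Toy news snippets with entities extracted upstream (assumed).
texts = ["Avian influenza detected in poultry farm in Vietnam",
         "New cooking recipe uses chicken and rice",
         "African swine fever outbreak reported among wild boars"]
entities = [{"avian influenza", "poultry", "Vietnam"},
            set(),
            {"African swine fever", "wild boar"}]
labels = [1, 0, 1]  # 1 = epidemiologically relevant

# Bag-of-words features side by side with entity indicator features.
tfidf = TfidfVectorizer()
mlb = MultiLabelBinarizer()
X = hstack([tfidf.fit_transform(texts), mlb.fit_transform(entities)])

clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```
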
17

Thomas, Stephen. "Mining Unstructured Software Repositories Using IR Models." Thesis, Queen's University, 2012. http://hdl.handle.net/1974/7688.

Abstract:
Mining Software Repositories, the process of analyzing the data related to software development practices, is an emerging field which aims to aid development teams in their day-to-day tasks. However, data in many software repositories is currently unused because the data is unstructured, and therefore difficult to mine and analyze. Information Retrieval (IR) techniques, which were developed specifically to handle unstructured data, have recently been used by researchers to mine and analyze the unstructured data in software repositories, with some success. The main contribution of this thesis is the idea that the research and practice of using IR models to mine unstructured software repositories can be improved by going beyond the current state of affairs. First, we propose new applications of IR models to existing software engineering tasks. Specifically, we present a technique to prioritize test cases based on their IR similarity, giving highest priority to those test cases that are most dissimilar. In another new application of IR models, we empirically recover how developers use their mailing list while developing software. Next, we show how the use of advanced IR techniques can improve results. Using a framework for combining disparate IR models, we find that bug localization performance can be improved by 14–56% on average, compared to the best individual IR model. In addition, by using topic evolution models on the history of source code, we can uncover the evolution of source code concepts with an accuracy of 87–89%. Finally, we show the risks of current research, which uses IR models as black boxes without fully understanding their assumptions and parameters. We show that data duplication in source code has undesirable effects on IR models, and that by eliminating the duplication, their accuracy improves. Additionally, we find that in the bug localization task, an unwise choice of parameter values results in an accuracy of only 1%, whereas optimal parameters can achieve an accuracy of 55%. Through empirical case studies on real-world systems, we show that all of our proposed techniques and methodologies significantly improve the state of the art.
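
The test-prioritization technique above is described only at a high level; the sketch below is one assumed concretization: greedy farthest-first ordering over TF-IDF cosine similarities, so each next test is maximally dissimilar to those already picked.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy test-case descriptions standing in for a real test suite.
tests = ["login with valid password", "login with invalid password",
         "export report to pdf", "render dashboard chart",
         "login after password reset"]

sim = cosine_similarity(TfidfVectorizer().fit_transform(tests))

order = [0]                              # arbitrarily seed with test 0
remaining = set(range(1, len(tests)))
while remaining:
    # Pick the test least similar to anything already selected.
    nxt = min(remaining, key=lambda i: max(sim[i][j] for j in order))
    order.append(nxt)
    remaining.remove(nxt)

for rank, i in enumerate(order, 1):
    print(rank, tests[i])
```
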
18

Goeva, Aleksandrina. "Complexity penalized methods for structured and unstructured data." Thesis, 2017. https://hdl.handle.net/2144/27072.

Abstract:
A fundamental goal of statisticians is to make inferences from a sample about characteristics of the underlying population. This is an inverse problem, since we are trying to recover a feature of the input from observations of an output. Towards this end, we consider complexity penalized methods, because they balance goodness of fit and generalizability of the solution. The data from the underlying population may come in diverse formats, structured or unstructured, such as probability distributions, text tokens, or graph characteristics. Depending on the defining features of the problem, we can choose the appropriate complexity penalized approach and assess the quality of the estimate it produces. Favorable characteristics are strong theoretical guarantees of closeness to the true value, and interpretability. Our work fits within this framework and spans the areas of simulation optimization, text mining and network inference. The first problem we consider is model calibration under the assumption that, given a hypothesized input model, we can use stochastic simulation to obtain its corresponding output observations. We formulate it as a stochastic program by maximizing the entropy of the input distribution subject to moment matching. We then propose an iterative scheme via simulation to approximately solve it. We prove convergence of the proposed algorithm under appropriate conditions and demonstrate its performance via numerical studies. The second problem we consider is summarizing text documents through an inferred set of topics. We propose a frequentist reformulation of a Bayesian regularization scheme. Through our complexity-penalized perspective we lend further insight into the nature of the loss function and the regularization achieved through the priors in the Bayesian formulation. The third problem is concerned with the impact of sampling on the degree distribution of a network. Under many sampling designs, we have a linear inverse problem characterized by an ill-conditioned matrix. We investigate the theoretical properties of an approximate solution for the degree distribution, found by regularizing the solution of the ill-conditioned least squares objective. In particular, we study the rate at which the penalized solution tends to the true value as a function of network size and sampling rate.
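
For concreteness, the entropy-maximization formulation mentioned above can be written as the following program (a standard statement of maximum entropy under moment constraints; the notation is ours, not the dissertation's):

```latex
\begin{aligned}
\max_{p}\quad & -\sum_{i=1}^{n} p_i \log p_i \\
\text{s.t.}\quad & \sum_{i=1}^{n} p_i \, g_k(x_i) = m_k, \qquad k = 1, \dots, K, \\
& \sum_{i=1}^{n} p_i = 1, \qquad p_i \ge 0,
\end{aligned}
```

where the g_k are moment functions and the m_k are target moments. The solution has the familiar exponential-family form p_i proportional to exp(sum_k lambda_k g_k(x_i)); in the dissertation's setting the moments are only available through stochastic simulation of the output, which is what motivates the iterative scheme.
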
19

Felgueiras, Marco Filipe Madeira. "Multilabel classification of unstructured data using Crunchbase." Master's thesis, 2020. http://hdl.handle.net/10071/22188.

Abstract:
Our work compares different methods and models for multilabel text classification using information collected from Crunchbase, a large database that holds information on more than 600000 companies. Each company is labeled with one or more categories, from a set of 46, and the proposed models predict the categories based solely on the company's textual description. A number of natural language processing strategies have been tested for feature extraction, including stemming, lemmatization, and part-of-speech tagging. This is a highly unbalanced dataset, in which the frequency of each category ranges from 0.7% to 28%. The first experiment is a multiclass classification problem that tries to find the most probable category using a single model for all categories, with an overall score of 67% using SVM, Naive Bayes and Fuzzy Fingerprints. The second experiment makes use of multiple classifiers, one for each category, and tries to predict the complete set of categories for each company, achieving 73% precision and 47% recall. The resulting models may constitute an important asset for the automatic classification of texts, not only company descriptions but also other texts, such as web pages, blogs, news pages, etc.
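
As a hedged sketch of the second experiment's shape, one binary classifier per category over company descriptions, the snippet below uses a one-vs-rest linear SVM on toy data; the descriptions and category names are assumptions, not Crunchbase data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy company descriptions with their category sets (assumed).
descriptions = ["mobile payments platform for small merchants",
                "machine learning tools for medical imaging",
                "online marketplace for handmade goods"]
categories = [{"fintech", "mobile"}, {"ai", "health"}, {"ecommerce"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(categories)       # multilabel indicator matrix

# One-vs-rest trains one LinearSVC per category.
model = make_pipeline(TfidfVectorizer(),
                      OneVsRestClassifier(LinearSVC()))
model.fit(descriptions, Y)

pred = model.predict(["payments app for merchants"])
print(mlb.inverse_transform(pred))
```
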
20

Eickhoff, Matthias. "The Information Value of Unstructured Analyst Opinions." Doctoral thesis, 2017. http://hdl.handle.net/11858/00-1735-0000-0023-3EA0-5.

21

Wolowiec, Martin. "Using text-mining-assisted analysis to examine the applicability of unstructured data in the context of customer complaint management." Master's thesis, 2015. http://hdl.handle.net/10362/17534.

Abstract:
In quest of gaining a more holistic picture of customer experiences, many companies are starting to consider textual data due to the richer insights into customer experience touch points it can provide. Meanwhile, recent trends point towards an emerging integration of customer relationship management and customer experience management, and thereby the availability of additional sources of textual data. Using text-mining-assisted analysis, this study demonstrates the practicality of this emerging opportunity by means of perceived justice theory in the context of customer complaint management. The study shows that customers value interpersonal aspects most as part of the overall complaint handling process. The results link the individual factors in a sequence of ‘courtesy → interactional justice → satisfaction with complaint handling’, followed by behavioural outcomes. Academic and managerial implications are discussed.
22

Alic, Irina. "Decision Support Systems for Financial Market Surveillance." Doctoral thesis, 2016. http://hdl.handle.net/11858/00-1735-0000-002B-7D04-4.

Abstract:
Decision support systems in finance are of great interest not only to research but also to practice. In ensuring financial market surveillance, financial supervisory authorities are confronted, on the one hand, with a growing volume of information available online, such as financial blogs and news. On the other hand, rapidly emerging trends, such as the steadily growing amount of data available online and the development of data mining methods, pose challenges for research. Decision support systems in finance offer the possibility of providing relevant information to financial supervisory authorities and to compliance officers at financial institutions in a timely manner. This thesis presents IT artifacts that support decision making in financial market surveillance. In addition, it presents an explanatory design theory that captures the requirements of regulators and of compliance officers in financial institutions.