
Dissertations / Theses on the topic 'Data Mining Techniques'


Consult the top 50 dissertations / theses for your research on the topic 'Data Mining Techniques.'


1

Tong, Suk-man Ivy. "Techniques in data stream mining." Click to view the E-thesis via HKUTO, 2005. http://sunzi.lib.hku.hk/hkuto/record/B34737376.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Tong, Suk-man Ivy, and 湯淑敏. "Techniques in data stream mining." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2005. http://hub.hku.hk/bib/B34737376.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Burgess, Martin. "Transformation techniques in data mining." Thesis, University of East Anglia, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.410093.

Full text
Abstract:
Transforming data is essential within data mining as a precursor to many applications such as rule induction and Multivariate Adaptive Regression Splines. The problems arising from the use of categorical valued data in rule induction are reduced confidence (accuracy), support and coverage. We introduce a technique called the arcsin transformation, in which categorical valued data are replaced with numeric values. This technique has been used on a number of databases and has been shown to be highly effective. Multivariate Adaptive Regression Splines (MARS) is a regression tool which attempts to approximate complex relationships by a series of linear regressions on different intervals of the explanatory variable ranges. As with regression methods in general, we need to know what assumptions are made and how their violation may disrupt performance. The two key assumptions in most regression models, including MARS, are additivity of effects and homoscedasticity. If either of these assumptions is not satisfied in terms of the original observations y_i, a non-linear transformation may improve matters. We use the Box-Cox transformation, in which transforming the continuous dependent variable (with non-negative responses) in a linear regression setting may help the regression assumptions given previously to hold. The assumptions stated are discussed in detail using a variety of tests. The results show that on the seven databases examined, an improvement has been made on six, where the models produced were…
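To make the role of the Box-Cox family concrete, the following minimal sketch (in Python with numpy, not taken from the thesis) applies the transform to a skewed, strictly positive response and picks the exponent lambda by a simple grid search over the profile log-likelihood; the data and parameter grid are illustrative assumptions only.

```python
# Minimal illustration of the Box-Cox transformation family discussed above.
# Sketch for intuition only, not the author's implementation.
import numpy as np

def box_cox(y, lam):
    """Apply the Box-Cox transform to strictly positive responses y."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive responses")
    if abs(lam) < 1e-12:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def best_lambda(y, candidates=np.linspace(-2, 2, 81)):
    """Pick lambda by maximising the Box-Cox profile log-likelihood on a grid."""
    y = np.asarray(y, dtype=float)
    n, log_y_sum = len(y), np.log(y).sum()
    def loglik(lam):
        z = box_cox(y, lam)
        return -0.5 * n * np.log(z.var()) + (lam - 1.0) * log_y_sum
    return max(candidates, key=loglik)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.lognormal(mean=1.0, sigma=0.6, size=500)   # skewed, positive responses
    lam = best_lambda(y)
    z = box_cox(y, lam)
    def skew(v):
        return float(((v - v.mean()) ** 3).mean() / v.std() ** 3)
    print(f"chosen lambda: {lam:.2f}, skewness before: {skew(y):.2f}, after: {skew(z):.2f}")
```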
APA, Harvard, Vancouver, ISO, and other styles
4

Al-Hashemi, Idrees Yousef. "Applying data mining techniques over big data." Thesis, Boston University, 2013. https://hdl.handle.net/2144/21119.

Full text
Abstract:
Thesis (M.S.C.S.) PLEASE NOTE: Boston University Libraries did not receive an Authorization To Manage form for this thesis or dissertation. It is therefore not openly accessible, though it may be available by request. If you are the author or principal advisor of this work and would like to request open access for it, please contact us at open-help@bu.edu. Thank you.
The rapid development of information technology in recent decades means that data appear in a wide variety of formats: sensor data, tweets, photographs, raw data, and unstructured data. Statistics show that there were 800,000 petabytes of data stored in the world in 2000. Today's internet holds about 0.1 zettabytes of data (1 ZB is about 10^21 bytes), and this number will reach 35 ZB by 2020. With such an overwhelming flood of information, present data management systems are not able to scale to this huge amount of raw, unstructured data, known in today's parlance as Big Data. In the present study, we show the basic concepts and design of Big Data tools, algorithms, and techniques. We compare the classical data mining algorithms to the Big Data algorithms by using Hadoop/MapReduce as a core implementation of Big Data for scalable algorithms. We implemented the K-means algorithm and the Apriori algorithm with Hadoop/MapReduce on a 5-node Hadoop cluster. We explore NoSQL databases for semi-structured, massively large-scale data, using MongoDB as an example. Finally, we compare the performance of HDFS (Hadoop Distributed File System) and MongoDB data storage for these two algorithms.
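As a rough illustration of how K-means maps onto the MapReduce model mentioned in the abstract, the toy sketch below expresses one clustering iteration as explicit map and reduce functions in plain Python; it is a stand-in for intuition, not the Hadoop implementation evaluated in the thesis.

```python
# Toy sketch of one K-means iteration expressed as map/reduce steps,
# mirroring (in pure Python) the Hadoop/MapReduce formulation described above.
from collections import defaultdict
import math, random

def mapper(point, centroids):
    """Emit (nearest_centroid_index, (point, 1)) for a single data point."""
    dists = [math.dist(point, c) for c in centroids]
    return dists.index(min(dists)), (point, 1)

def reducer(key, values):
    """Average all points assigned to one centroid."""
    dim = len(values[0][0])
    total, count = [0.0] * dim, 0
    for point, n in values:
        count += n
        for i, x in enumerate(point):
            total[i] += x
    return key, tuple(t / count for t in total)

def kmeans_iteration(points, centroids):
    grouped = defaultdict(list)
    for p in points:                       # "map" phase
        k, v = mapper(p, centroids)
        grouped[k].append(v)
    new_centroids = list(centroids)
    for k, vals in grouped.items():        # "reduce" phase
        _, new_centroids[k] = reducer(k, vals)
    return new_centroids

if __name__ == "__main__":
    random.seed(0)
    pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)] + \
          [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(100)]
    cents = [(0.0, 0.0), (1.0, 1.0)]
    for _ in range(5):
        cents = kmeans_iteration(pts, cents)
    print("estimated centroids:", cents)
```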
APA, Harvard, Vancouver, ISO, and other styles
5

Zhou, Wubai. "Data Mining Techniques to Understand Textual Data." FIU Digital Commons, 2017. https://digitalcommons.fiu.edu/etd/3493.

Full text
Abstract:
More than ever, information delivery and storage online rely heavily on text. Billions of texts are produced every day in the form of documents, news, logs, search queries, ad keywords, tags, tweets, messenger conversations, social network posts, etc. Text understanding is a fundamental and essential task involving broad research topics, and contributes to many applications in areas such as text summarization, search engines, recommendation systems, online advertising, conversational bots, and so on. However, understanding text is never a trivial task for computers, especially for noisy and ambiguous text such as logs and search queries. This dissertation mainly focuses on textual understanding tasks drawn from two domains, disaster management and IT service management, both of which rely on textual data as their main information carrier. Improving situation awareness in disaster management and reducing the human effort involved in IT service management call for more intelligent and efficient solutions to understand the textual data in these two domains. From the perspective of data mining, four directions are identified: (1) intelligently generating a storyline summarizing the evolution of a hurricane from a relevant online corpus; (2) automatically recommending resolutions according to the textual symptom description in a ticket; (3) gradually adapting the resolution recommendation system to time-correlated features derived from text; and (4) efficiently learning distributed representations for short and noisy ticket symptom descriptions and resolutions. Provided with different types of textual data, the data mining techniques proposed in these four research directions successfully address our tasks of understanding and extracting valuable knowledge from textual data. My dissertation addresses the research topics outlined above. Concretely, I focus on designing and developing data mining methodologies to better understand textual information, including (1) a storyline generation method for efficient summarization of natural hurricanes based on a crawled online corpus; (2) a recommendation framework for automated ticket resolution in IT service management; (3) an adaptive recommendation system for time-varying, temporally correlated features derived from text; and (4) a deep neural ranking model that not only successfully recommends resolutions but also efficiently outputs distributed representations for ticket descriptions and resolutions.
APA, Harvard, Vancouver, ISO, and other styles
6

XIAO, XIN. "Data Mining Techniques for Complex User-Generated Data." Doctoral thesis, Politecnico di Torino, 2016. http://hdl.handle.net/11583/2644046.

Full text
Abstract:
Nowadays, the amount of collected information is continuously growing in a variety of different domains. Data mining techniques are powerful instruments for effectively analyzing these large data collections and extracting hidden and useful knowledge. A vast amount of User-Generated Data (UGD) is created every day, such as user behavior, user-generated content, user exploitation of available services and user mobility in different domains. Some common critical issues arise in the UGD analysis process, such as the large dataset cardinality and dimensionality, the variable data distribution and inherent sparseness, and the heterogeneous data needed to model the different facets of the targeted domain. Consequently, the extraction of useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterised by different critical issues, via the integration of different data mining techniques in a unified framework. Real datasets coming from three example domains characterized by the above critical issues are considered as reference cases, i.e., the health care, social network, and urban environment domains. Experimental results show the effectiveness of the proposed approaches in discovering useful knowledge from different domains.
APA, Harvard, Vancouver, ISO, and other styles
7

Yu, Congcong. "Parallelizing ensemble techniques for data mining." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2001. http://www.collectionscanada.ca/obj/s4/f2/dsk3/ftp04/MQ59416.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

CID, DANTE JOSE ALEXANDRE. "DATA MINING WITH ROUGH SETS TECHNIQUES." PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2000. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=7244@1.

Full text
Abstract:
COORDENAÇÃO DE APERFEIÇOAMENTO DO PESSOAL DE ENSINO SUPERIOR
This dissertation investigates the application of Rough Sets to the process of KDD (Knowledge Discovery in Databases). The main goal of the work was to evaluate the performance of Rough Sets techniques in solving the classification problem. Classification is a task of the data mining step in the KDD process that consists in discovering the decision rules that best represent a group of records in a database. The work had five major steps: study of the KDD process; study of Rough Sets techniques applied to data mining; evaluation of existing data mining tools; development of the Bramining project; and execution of some case studies to evaluate Bramining. The study of the KDD process included all its steps: transformation, cleaning, selection, data mining and post-processing. The results obtained served as a basis for the enhancement of Bramining. The study of Rough Sets techniques included research on the theory's concepts and its applicability in the KDD context. Rough Sets theory was introduced by Zdzislaw Pawlak in the early 1980s as a mathematical approach to the analysis of vague and uncertain data. This research made possible the implementation of the technique within the environment of the developed tool. The analysis of existing data mining tools included studying and testing software based on different techniques, enriching the background used in the evaluation of the research. The evolution of the Bramining project consisted in the enhancement of the KDD environment developed in previous works, including the addition of Rough Sets techniques. The case studies were performed simultaneously with Bramining and a commercial mining tool, for comparison purposes. The quality of the knowledge generated by Bramining was considered equivalent to the results of the commercial tool, both providing good decision rules in most of the cases. Nevertheless, Bramining proved to be better adapted to the complete KDD process, thanks to the many features available for preparing data before the data mining step. The results achieved through the developed application proved the suitability of Rough Sets concepts to the data classification task. Some weaknesses of the technique were identified, such as the need for a prior attribute-reduction mechanism and the difficulty of dealing with continuous-domain data. But as the technique was inserted into a more complete KDD environment such as Bramining, those weaknesses ceased to exist. The data preparation features available in the Bramining environment, particularly the reduction and attribute codification options, enable the user to have the database well adapted to the use of Rough Sets algorithms. Data mining is a very relevant issue in present days and many methods have been proposed for the different tasks involved in it. Compared to other established techniques, Rough Sets theory did not show significant advantages or disadvantages, but it has been of great value to show that there are alternative paths to knowledge discovery.
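For readers unfamiliar with the rough-set vocabulary used above, the sketch below computes indiscernibility classes and the lower and upper approximations of a decision class on a small hypothetical decision table in Python; it illustrates the general theory, not the Bramining implementation.

```python
# Minimal rough-set sketch: indiscernibility classes and lower/upper
# approximations of a decision class, on a small hypothetical decision table.
from collections import defaultdict

# Each row: (condition attribute values, decision)
table = [
    (("high", "yes"), "sick"),
    (("high", "yes"), "sick"),
    (("high", "no"),  "sick"),
    (("low",  "no"),  "healthy"),
    (("low",  "yes"), "healthy"),
    (("high", "no"),  "healthy"),   # conflicts with row 2: same conditions, other decision
]

def indiscernibility_classes(rows):
    """Group row indices that are indistinguishable on the condition attributes."""
    classes = defaultdict(set)
    for idx, (conditions, _) in enumerate(rows):
        classes[conditions].add(idx)
    return list(classes.values())

def approximations(rows, decision):
    """Lower and upper approximation of the set of rows with the given decision."""
    target = {i for i, (_, d) in enumerate(rows) if d == decision}
    lower, upper = set(), set()
    for block in indiscernibility_classes(rows):
        if block <= target:          # block lies entirely inside the concept
            lower |= block
        if block & target:           # block overlaps the concept
            upper |= block
    return lower, upper

low, up = approximations(table, "sick")
print("lower approximation of 'sick':", sorted(low))   # certainly sick
print("upper approximation of 'sick':", sorted(up))    # possibly sick
print("boundary region:", sorted(up - low))
```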
APA, Harvard, Vancouver, ISO, and other styles
9

Al-Bataineh, Hussien Suleiman. "Islanding Detection Using Data Mining Techniques." Thesis, North Dakota State University, 2015. https://hdl.handle.net/10365/27634.

Full text
Abstract:
Connection of distributed generators (DGs) poses new challenges for the operation and management of the distribution system. An important issue is that of islanding, where a part of the system becomes disconnected from the grid while remaining energized by the DG. This thesis explores the use of several data mining and machine learning techniques to detect islanding. Several cases of islanding and non-islanding are simulated with a standard test case, the IEEE 13-bus test distribution system. Different types of DGs are connected to the system and disturbances are introduced. Several classifiers are tested for their effectiveness in identifying islanded conditions under different scenarios. The simulation results show that the random forest classifier consistently outperforms the other methods for a diverse set of operating conditions, within an acceptable time after the onset of islanding. These results strengthen the case for machine-learning-based tools for quick and accurate detection of islanding in microgrids.
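A minimal sketch of the classification step described above, assuming scikit-learn is available: a random forest is trained on synthetic features that stand in for the simulated electrical measurements, with class 1 playing the role of an islanding event. It illustrates the general workflow rather than the thesis's actual IEEE 13-bus feature set.

```python
# Random forest separating islanding from non-islanding cases.
# Features are synthetic stand-ins for simulated electrical measurements.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Rows = simulated events, columns = measured quantities
# (e.g. voltage/frequency deviations); label 1 = islanding, 0 = non-islanding.
X, y = make_classification(n_samples=2000, n_features=12, n_informative=6,
                           weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test),
                            target_names=["non-islanding", "islanding"]))
```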
APA, Harvard, Vancouver, ISO, and other styles
10

JANAKIRAMAN, KRISHNAMOORTHY. "ENTITY IDENTIFICATION USING DATA MINING TECHNIQUES." University of Cincinnati / OhioLINK, 2001. http://rave.ohiolink.edu/etdc/view?acc_num=ucin989852516.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Aliotta, Marco Antonio. "Data mining techniques on volcano monitoring." Doctoral thesis, Università di Catania, 2013. http://hdl.handle.net/10761/1364.

Full text
Abstract:
The aim of this thesis is the study of data mining processes able to discover implicit information in huge amounts of data. In particular, the indexing of datasets is studied to improve the efficiency of search algorithms. All of the presented techniques are applied in the geophysical research field, where huge amounts of data hide implicit information related to volcanic processes and their evolution over time. Data mining techniques, reported in detail in the next chapters, are implemented with the aim of analysing recurrent patterns in heterogeneous data. This thesis is organized as follows. Chapter 1 introduces the problem of searching in a metric space, showing the key applications (from text retrieval to computational biology and so on) and the basic concepts (e.g. the metric distance function). The current solutions, together with a model for standardization, are presented in Chapter 2. A novel indexing structure, the K-Pole Tree, which uses a dynamic number of pivots to partition a metric space, is presented in Chapter 3, after a taxonomy of state-of-the-art indexing algorithms. The experimental effectiveness of the K-Pole Tree is compared to other efficient algorithms in Chapter 4, where proximity query results are shown. In Chapter 5 a basic review of pattern recognition techniques is reported; in particular, the DBSCAN algorithm and SVM (Support Vector Machines) are discussed. Finally, Chapter 6 shows some geophysical applications where data mining techniques are applied to volcano data analysis and surveillance purposes. In particular, an application for clustering infrasound signals and another for indexing a thermal image database are presented.
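The K-Pole Tree itself is the thesis's own contribution, so it is not reproduced here; the generic Python sketch below only illustrates the triangle-inequality pruning on which pivot-based metric indexes of this kind rely, using made-up two-dimensional points and a plain Euclidean distance.

```python
# Generic pivot-based filtering for range queries in a metric space.
# Illustrates the triangle-inequality pruning exploited by metric indexes;
# it is not the K-Pole Tree structure described in the thesis.
import math, random

def euclidean(a, b):
    return math.dist(a, b)

class PivotIndex:
    def __init__(self, objects, pivots, dist=euclidean):
        self.objects, self.pivots, self.dist = objects, pivots, dist
        # Precompute distances from every object to every pivot.
        self.table = [[dist(o, p) for p in pivots] for o in objects]

    def range_query(self, q, radius):
        dq = [self.dist(q, p) for p in self.pivots]
        hits, evaluated = [], 0
        for o, row in zip(self.objects, self.table):
            # Lower bound on d(q, o): max over pivots of |d(q, p) - d(o, p)|.
            if max(abs(a - b) for a, b in zip(dq, row)) > radius:
                continue                      # pruned without computing d(q, o)
            evaluated += 1
            if self.dist(q, o) <= radius:
                hits.append(o)
        return hits, evaluated

random.seed(1)
data = [(random.random(), random.random()) for _ in range(5000)]
index = PivotIndex(data, pivots=data[:8])
results, evaluated = index.range_query(q=(0.5, 0.5), radius=0.05)
print(f"{len(results)} results, only {evaluated} full distance evaluations out of {len(data)}")
```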
APA, Harvard, Vancouver, ISO, and other styles
12

Di, Silvestro Lorenzo Paolo. "Data Mining and Visual Analytics Techniques." Doctoral thesis, Università di Catania, 2014. http://hdl.handle.net/10761/1559.

Full text
Abstract:
With the beginning of the Information Age and the subsequent spread of the information overload phenomenon, it has become necessary to develop means to easily explore, analyze and summarize large quantities of data. To achieve these purposes, data mining techniques and information visualization methods have been used for decades. In recent years a new research field has been gaining importance: Visual Analytics, an outgrowth of the fields of scientific and information visualization that also includes technologies from many other fields, including knowledge management, statistical analysis, cognitive science and decision science. In this dissertation the combined effort of the mentioned research fields is analyzed, pointing out different ways to combine them following best practice across several application cases.
APA, Harvard, Vancouver, ISO, and other styles
13

Schubert, Matthias. "Advanced Data Mining Techniques for Compound Objects." Diss., lmu, 2004. http://nbn-resolving.de/urn:nbn:de:bvb:19-27981.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Dutta, Ila. "Data Mining Techniques to Identify Financial Restatements." Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/37342.

Full text
Abstract:
Data mining is a multi-disciplinary field of science and technology widely used in developing predictive models and data visualization in various domains. Although there are numerous data mining algorithms and techniques across multiple fields, it appears that there is no consensus on the suitability of a particular model, or on the ways to address data preprocessing issues. Moreover, the effectiveness of data mining techniques depends on the evolving nature of data. In this study, we focus on the suitability and robustness of various data mining models for analyzing real financial data to identify financial restatements. From a data mining perspective, it is quite interesting to study financial restatements for the following reasons: (i) the restatement data is highly imbalanced, which requires adequate attention in model building; (ii) there are many financial and non-financial attributes that may affect financial restatement predictive models, which requires careful implementation of data mining techniques to develop parsimonious models; and (iii) the class imbalance issue becomes more complex in a dataset that includes both intentional and unintentional restatement instances. Most of the previous studies focus on fraudulent (or intentional) restatements and the literature has largely ignored unintentional restatements. Intentional (i.e. fraudulent) restatement instances are rare and likely to have more distinct features compared to non-restatement cases. However, unintentional cases are comparatively more prevalent and likely to have fewer distinct features that separate them from non-restatement cases. A dataset containing unintentional restatement cases is likely to have more class overlapping issues that may impact the effectiveness of predictive models. In this study, we developed predictive models based on all restatement cases (both intentional and unintentional) using a real, comprehensive and novel dataset which includes 116 attributes and approximately 1,000 restatement and 19,517 non-restatement instances over the period 2009 to 2014. To the best of our knowledge, no other study has developed predictive models for financial restatements using post-financial-crisis events. In order to avoid redundant attributes, we use three feature selection techniques, correlation-based feature subset selection (CfsSubsetEval), information gain attribute evaluation (InfoGainEval) and stepwise forward selection (FwSelect), and generate three datasets with reduced attributes. Our restatement dataset is highly skewed and highly biased towards the non-restatement (majority) class. We applied various algorithms (e.g. random undersampling (RUS), cluster-based undersampling (CUS) (Sobhani et al., 2014), random oversampling (ROS), the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002), Adaptive Synthetic Sampling (ADASYN) (He et al., 2008), and Tomek links with SMOTE) to address class imbalance in the financial restatement dataset. We perform classification employing six different classifiers, decision tree (DT), artificial neural network (ANN), Naïve Bayes (NB), random forest (RF), Bayesian belief network (BBN) and support vector machine (SVM), using 10-fold cross validation, and test the efficiency of the various predictive models using minority class recall, minority class F-measure and G-mean.
We also experiment with different ensemble methods (bagging and boosting) on the base classifiers and employ other meta-learning algorithms (stacking and cost-sensitive learning) to improve model performance. While applying the cluster-based undersampling technique, we find that various classifiers (e.g. SVM, BBN) show a high success rate in terms of minority class recall. For example, the SVM classifier shows a minority recall value of 96%, which is quite encouraging. However, the ability of these classifiers to detect majority class instances is dismal. We find that some variations of synthetic oversampling such as 'Tomek Link + SMOTE' and 'ADASYN' show promising results in terms of both minority recall and G-mean. Using the InfoGainEval feature selection method, the RF classifier shows minority recall values of 92.6% for 'Tomek Link + SMOTE' and 88.9% for 'ADASYN', respectively. The corresponding G-mean values are 95.2% and 94.2% for these two oversampling techniques, which shows that the RF classifier is quite effective in predicting both minority and majority classes. We find further improvement in results for the RF classifier with a cost-sensitive learning algorithm using 'Tomek Link + SMOTE' oversampling. Subsequently, we develop some decision rules to detect restatement firms based on a subset of important attributes. To the best of our knowledge, only Kim et al. (2016) perform a data mining study using only pre-financial-crisis restatement data. Kim et al. (2016) employed a matching-sample-based undersampling technique and used logistic regression, SVM and BBN classifiers to develop financial restatement predictive models; the study's highest reported G-mean is 70%. Our results with clustering-based undersampling are similar to the performance measures reported by Kim et al. (2016). However, our synthetic oversampling based results show a better predictive ability. The RF classifier shows a very high degree of predictive capability for minority class instances (97.4%) and a very high G-mean value (95.3%) with cost-sensitive learning. Yet, we recognize that Kim et al. (2016) use a different restatement dataset (with pre-crisis restatement cases) and hence a direct comparison of results may not be fully justified. Our study makes contributions to the data mining literature by (i) presenting predictive models for financial restatements with a comprehensive dataset, (ii) focusing on various data mining techniques and presenting a comparative analysis, and (iii) addressing the class imbalance issue by identifying the most effective technique. To the best of our knowledge, we used the most comprehensive dataset to develop our predictive models for identifying financial restatements.
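A minimal sketch of the kind of pipeline described above, assuming scikit-learn and imbalanced-learn are installed: SMOTE oversampling followed by a random forest, evaluated with minority recall and the G-mean (the geometric mean of the two class recalls). The synthetic data merely mimics the roughly 5% minority rate of the restatement dataset.

```python
# SMOTE oversampling followed by a random forest, with minority recall and
# G-mean as the evaluation measures. Synthetic data stands in for the
# (proprietary) restatement dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from imblearn.over_sampling import SMOTE

# ~5% minority class, mimicking the restatement vs. non-restatement skew.
X, y = make_classification(n_samples=20000, n_features=30, n_informative=10,
                           weights=[0.95, 0.05], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=7)

X_res, y_res = SMOTE(random_state=7).fit_resample(X_tr, y_tr)   # oversample minority
clf = RandomForestClassifier(n_estimators=300, random_state=7).fit(X_res, y_res)
y_pred = clf.predict(X_te)

recall_minority = recall_score(y_te, y_pred, pos_label=1)
recall_majority = recall_score(y_te, y_pred, pos_label=0)
g_mean = np.sqrt(recall_minority * recall_majority)   # geometric mean of class recalls
print(f"minority recall={recall_minority:.3f}, majority recall={recall_majority:.3f}, "
      f"G-mean={g_mean:.3f}")
```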
APA, Harvard, Vancouver, ISO, and other styles
15

Hamby, Stephen Edward. "Data mining techniques for protein sequence analysis." Thesis, University of Nottingham, 2010. http://eprints.nottingham.ac.uk/11498/.

Full text
Abstract:
This thesis concerns two areas of bioinformatics related by their role in protein structure and function: protein structure prediction and post-translational modification of proteins. The dihedral angles Ψ and Φ are predicted using support vector regression. For the prediction of Ψ dihedral angles, the addition of structural information is examined, as is the normalisation of the Ψ and Φ dihedral angles. An application of the dihedral angles is also investigated: the relationship between dihedral angles and three-bond J couplings determined from NMR experiments is described by the Karplus equation, and we investigate the determination of the correct solution of the Karplus equation using predicted Φ dihedral angles. Glycosylation is an important post-translational modification of proteins involved in many different facets of biology. The work here investigates the prediction of N-linked and O-linked glycosylation sites using the random forest machine learning algorithm and pairwise patterns in the data. This methodology produces more accurate results when compared to state-of-the-art prediction methods. The black-box nature of random forest is addressed by using the Trepan algorithm to generate a decision tree with comprehensible rules that represents the decision-making process of the random forest. The prediction of our program GPP does not distinguish between glycans at a given glycosylation site, so we use farthest-first clustering with the idea of classifying each glycosylation site by the sugar linking the glycan to the protein. This thesis demonstrates the prediction of protein backbone torsion angles and improves the current state of the art for the prediction of glycosylation sites. It also investigates potential applications and the interpretation of these methods.
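As an illustration of support vector regression in this setting, the sketch below fits scikit-learn's SVR to random features standing in for sequence-derived inputs and a synthetic angle-like target; the feature construction is an assumption for demonstration and does not reflect the thesis's actual encoding.

```python
# Support vector regression sketch for an angle-like target.
# Features are random stand-ins for sequence-derived inputs
# (e.g. per-residue profile windows); scikit-learn assumed installed.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))          # hypothetical per-residue feature windows
w = rng.normal(size=40)
psi = 180.0 * np.tanh(X @ w / 5.0)       # synthetic target angles in (-180, 180) degrees

X_tr, X_te, y_tr, y_te = train_test_split(X, psi, test_size=0.25, random_state=0)
model = SVR(kernel="rbf", C=10.0, epsilon=5.0).fit(X_tr, y_tr)
pred = model.predict(X_te)
print(f"mean absolute error: {mean_absolute_error(y_te, pred):.1f} degrees")
```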
APA, Harvard, Vancouver, ISO, and other styles
16

MAHOTO, NAEEM AHMED. "Data mining techniques for complex application domains." Doctoral thesis, Politecnico di Torino, 2013. http://hdl.handle.net/11583/2506368.

Full text
Abstract:
The emergence of advanced communication techniques has increased the availability of large collections of data in electronic form in a number of application domains, including healthcare, e-business, and e-learning. Every day a large number of records are stored electronically. However, finding useful information in such large data collections is a challenging issue. Data mining technology aims at automatically extracting hidden knowledge from large data repositories by exploiting sophisticated algorithms. The hidden knowledge in the electronic data may be potentially utilized to facilitate the procedures, productivity, and reliability of several application domains. The PhD activity has been focused on novel and effective data mining approaches to tackle the complex data coming from two main application domains: healthcare data analysis and textual data analysis. The research activity, in the context of healthcare data, addressed the application of different data mining techniques to discover valuable knowledge from real exam-log data of patients. In particular, efforts have been devoted to the extraction of medical pathways, which can be exploited to analyze the actual treatments followed by patients. The derived knowledge not only provides useful information about the treatment procedures but may also play an important role in future predictions of potential patient risks associated with medical treatments. The research effort in textual data analysis is twofold. On the one hand, a novel approach to the discovery of succinct summaries of large document collections has been proposed. On the other hand, the suitability of an established descriptive data mining technique to support domain experts in making decisions has been investigated. Both research activities focus on adapting widely used exploratory data mining techniques to textual data analysis, which requires overcoming the intrinsic limitations of traditional algorithms in handling textual documents efficiently and effectively.
APA, Harvard, Vancouver, ISO, and other styles
17

ZANONI, MARCO. "Data mining techniques for design pattern detection." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2012. http://hdl.handle.net/10281/31515.

Full text
Abstract:
The main objective of design pattern detection is to gain better comprehension of a software system, and of the kind of problems addressed during the development of the system itself. Design patterns have informal specifications, leading to many implementation variants caused by the subjective interpretation of the pattern by developers. This thesis applies a supervised classification approach to make the detection more subjective, bringing to developers the patterns they want to find, ranked by a confidence value.
APA, Harvard, Vancouver, ISO, and other styles
18

Okafor, Anthony. "Entropy based techniques with applications in data mining." [Gainesville, Fla.] : University of Florida, 2005. http://purl.fcla.edu/fcla/etd/UFE0013113.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Tekieh, Mohammad Hossein. "Analysis of Healthcare Coverage Using Data Mining Techniques." Thèse, Université d'Ottawa / University of Ottawa, 2012. http://hdl.handle.net/10393/20547.

Full text
Abstract:
This study explores healthcare coverage disparity using a quantitative analysis of a large dataset from the United States. One of the objectives is to build supervised models, including decision trees and neural networks, to study the influential factors in healthcare coverage. We also discover groups of people with health coverage problems and inconsistencies by employing unsupervised modeling, including the K-Means clustering algorithm. Our modeling is based on the dataset retrieved from the Medical Expenditure Panel Survey, with 98,175 records in the original dataset. After pre-processing the data, including binning, cleaning, dealing with missing values, and balancing, the dataset contains 26,932 records and 23 variables. We build 50 classification models in IBM SPSS Modeler employing decision trees and neural networks. The accuracy of the models varies between 76% and 81%. The models can predict the healthcare coverage for a new sample based on its significant attributes. We demonstrate that the decision tree models provide higher accuracy than the models based on neural networks. Also, having extensively analyzed the results, we find the most influential factors in healthcare coverage to be: access to care, age, poverty level of family, and race/ethnicity.
APA, Harvard, Vancouver, ISO, and other styles
20

Almutairi, Abdulrazaq Z. "Improving intrusion detection systems using data mining techniques." Thesis, Loughborough University, 2016. https://dspace.lboro.ac.uk/2134/21313.

Full text
Abstract:
Recent surveys and studies have shown that cyber-attacks have caused a lot of damage to organisations, governments, and individuals around the world. Although developments are constantly occurring in the computer security field, cyber-attacks still cause damage as they are developed and evolved by hackers. This research looked at some industrial challenges in the intrusion detection area and identified two main challenges: the first is that signature-based intrusion detection systems such as SNORT lack the capability of detecting attacks with new signatures without human intervention; the other is related to multi-stage attack detection, where signature-based detection has been found to be inefficient. The novelty of this research lies in the methodologies developed to tackle these challenges. The first challenge was handled by developing a multi-layer classification methodology. The first layer is based on a decision tree, while the second layer is a hybrid module that uses two data mining techniques: neural networks and fuzzy logic. The second layer tries to detect new attacks in case the first one fails to do so. This system detects attacks with new signatures and then updates the SNORT signature holder automatically, without any human intervention. The obtained results show that a high detection rate is achieved for attacks with new signatures, although the false positive rate needs to be lowered. The second challenge was approached by evaluating IP information using fuzzy logic. This approach looks at the identity of participants in the traffic, rather than the sequence and contents of the traffic. The results show that this approach can help in predicting attacks at very early stages in some scenarios; however, combining it with an approach that looks at the sequence and contents of the traffic, such as event correlation, achieves better performance than either approach individually.
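A simplified sketch of the two-layer idea, assuming scikit-learn: a shallow decision tree screens traffic first and only low-confidence records are escalated to a neural network. The thesis's second layer also combines fuzzy logic, which is not reproduced here, and the synthetic data and 0.9 confidence threshold are illustrative choices.

```python
# Simplified two-layer detection cascade: a decision tree screens traffic,
# and records it is unsure about are passed to a neural network.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=6000, n_features=20, n_informative=8,
                           weights=[0.8, 0.2], random_state=3)   # label 1 = attack
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

layer1 = DecisionTreeClassifier(max_depth=4, random_state=3).fit(X_tr, y_tr)
layer2 = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                       random_state=3).fit(X_tr, y_tr)

proba1 = layer1.predict_proba(X_te)
confident = proba1.max(axis=1) >= 0.9          # layer 1 decides only when confident
y_pred = np.where(confident, proba1.argmax(axis=1), layer2.predict(X_te))
print(f"cascade accuracy: {accuracy_score(y_te, y_pred):.3f}, "
      f"{(~confident).mean():.1%} of records escalated to layer 2")
```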
APA, Harvard, Vancouver, ISO, and other styles
21

Sobolewska, Katarzyna-Ewa. "Web links utility assessment using data mining techniques." Thesis, Blekinge Tekniska Högskola, Avdelningen för programvarusystem, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-2936.

Full text
Abstract:
This thesis focuses on data mining solutions for the WWW, specifically how they can be used for hyperlink evaluation. We focus on the hyperlinks used in web site systems and on the problem of evaluating their utility. Since hyperlinks reflect relations to other web pages, one can expect that there is a way to verify whether users follow the desired navigation paths. The challenge is to use available techniques to discover usage behavior patterns and interpret them. We evaluated hyperlinks of selected pages from the www.bth.se web site. With the help of a web expert, the usefulness of data mining as an assessment basis was validated. The outcome of the research shows that data mining provides decision support for changes in a web site's navigational structure.
APA, Harvard, Vancouver, ISO, and other styles
22

Kanellopoulos, Yiannis. "Supporting software systems maintenance using data mining techniques." Thesis, University of Manchester, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.496254.

Full text
Abstract:
Data mining, with its ability to handle large amounts of data and uncover hidden patterns, has the potential to facilitate the comprehension and maintainability evaluation of a software system. Source code artefacts and measurement values can be used as input to data mining algorithms in order to provide insights into a system's structure or to create groups of artefacts with similar software measurements. This thesis investigates the applicability and suitability of data mining techniques for facilitating the comprehension and maintainability evaluation of a software system's source code.
APA, Harvard, Vancouver, ISO, and other styles
23

Espy, John. "Data mining techniques for constructing jury selection models." Thesis, California State University, Long Beach, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1527548.

Full text
Abstract:

Jury selection can determine a case before it even begins. The goal is to predict whether a juror rules for the plaintiff or the defense in the medical malpractice trials that are conducted, and which variables are significant in predicting this. The data for the analysis were obtained from mock trials that simulated actual trials, with possible arguments from the defense and the plaintiff and ample discussion time. These mock trials were supplemented by surveys that attempted to capture the characteristics and attitudes of the mock jurors and the case at hand. The data were modeled using logistic regression as well as decision tree and neural network techniques.

APA, Harvard, Vancouver, ISO, and other styles
24

Siddiqui, Muazzam Ahmed. "HIGH PERFORMANCE DATA MINING TECHNIQUES FOR INTRUSION DETECTION." Master's thesis, University of Central Florida, 2004. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/4435.

Full text
Abstract:
The rapid growth of computers has transformed the way in which information and data are stored. With this new paradigm of data access comes the threat of this information being exposed to unauthorized and unintended users. Many systems have been developed which scrutinize data for deviations from the normal behavior of a user or system, or search for a known signature within the data. These systems are termed Intrusion Detection Systems (IDS). They employ different techniques ranging from statistical methods to machine learning algorithms. Intrusion detection systems use audit data generated by operating systems, application software or network devices. These sources produce huge datasets with tens of millions of records in them. To analyze this data, data mining is used, a process for digging useful patterns out of a large bulk of information. A major obstacle in the process is that traditional data mining and learning algorithms are overwhelmed by the bulk volume and complexity of the available data, which makes them impractical for time-critical tasks like intrusion detection because of their large execution time. Our approach to this issue makes use of high-performance data mining techniques to expedite the process by exploiting the parallelism in existing data mining algorithms and the underlying hardware. We show how high-performance and parallel computing can be used to scale data mining algorithms to handle large datasets, allowing the data mining component to search a much larger set of patterns and models than traditional computational platforms and algorithms would allow. We develop parallel data mining algorithms by parallelizing existing machine learning techniques using cluster computing. These algorithms include parallel backpropagation and parallel fuzzy ARTMAP neural networks. We evaluate the performance of the developed models in terms of speedup over traditional algorithms, prediction rate and false alarm rate. Our results show that the traditional backpropagation and fuzzy ARTMAP algorithms can benefit from high-performance computing techniques, which makes them well suited for time-critical tasks like intrusion detection.
M.S.
School of Computer Science
Engineering and Computer Science
Computer Science
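The abstract above parallelizes learning algorithms across a compute cluster; the sketch below conveys the same data-parallel idea on a single machine, assuming only numpy and Python's multiprocessing, by splitting the records into partitions, computing per-partition gradients concurrently and averaging them. Plain logistic regression stands in for the backpropagation and fuzzy ARTMAP networks used in the thesis.

```python
# Data-parallel training sketch: partition the records, compute gradients on
# each partition concurrently, then average. A process pool stands in for a
# cluster, and logistic regression stands in for the neural networks above.
import numpy as np
from multiprocessing import Pool

def partition_gradient(args):
    """Gradient of the logistic loss on one data partition."""
    X, y, w = args
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def parallel_fit(X, y, n_workers=4, lr=0.5, epochs=50):
    w = np.zeros(X.shape[1])
    X_parts = np.array_split(X, n_workers)
    y_parts = np.array_split(y, n_workers)
    with Pool(n_workers) as pool:
        for _ in range(epochs):
            grads = pool.map(partition_gradient,
                             [(Xp, yp, w) for Xp, yp in zip(X_parts, y_parts)])
            w -= lr * np.mean(grads, axis=0)     # average the partition gradients
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20000, 10))
    true_w = rng.normal(size=10)
    y = (1.0 / (1.0 + np.exp(-X @ true_w)) > rng.random(20000)).astype(float)
    w = parallel_fit(X, y)
    acc = (((X @ w) > 0) == (y > 0.5)).mean()
    print(f"training accuracy: {acc:.3f}")
```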
APA, Harvard, Vancouver, ISO, and other styles
25

Floyd, Stuart. "Data Mining Techniques for Prognosis in Pancreatic Cancer." Digital WPI, 2007. https://digitalcommons.wpi.edu/etd-theses/671.

Full text
Abstract:
This thesis focuses on the use of data mining techniques to investigate the expected survival time of patients with pancreatic cancer. Clinical patient data have been useful in showing overall population trends in patient treatment and outcomes. Models built on patient level data also have the potential to yield insights into the best course of treatment and the long-term outlook for individual patients. Within the medical community, logistic regression has traditionally been chosen for building predictive models in terms of explanatory variables or features. Our research demonstrates that the use of machine learning algorithms for both feature selection and prediction can significantly increase the accuracy of models of patient survival. We have evaluated the use of Artificial Neural Networks, Bayesian Networks, and Support Vector Machines. We have demonstrated (p<0.05) that data mining techniques are capable of improved prognostic predictions of pancreatic cancer patient survival as compared with logistic regression alone.
APA, Harvard, Vancouver, ISO, and other styles
26

Kim, Hyunki. "Developing semantic digital libraries using data mining techniques." [Gainesville, Fla.] : University of Florida, 2005. http://purl.fcla.edu/fcla/etd/UFE0010105.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Wang, Qing. "Intelligent Data Mining Techniques for Automatic Service Management." FIU Digital Commons, 2018. https://digitalcommons.fiu.edu/etd/3883.

Full text
Abstract:
Today, as more and more industries enter the artificial intelligence era, business enterprises constantly explore innovative ways to expand their outreach and fulfill the high requirements of customers, with the purpose of gaining a competitive advantage in the marketplace. However, the success of a business relies heavily on its IT services. Value-creating activities of a business cannot be accomplished without solid and continuous delivery of IT services, especially in an increasingly intricate and specialized world. Driven by both the growing complexity of IT environments and rapidly changing business needs, service providers are urgently seeking intelligent data mining and machine learning techniques to build a cognitive "brain" for IT service management, capable of automatically understanding, reasoning and learning from operational data collected from human engineers and virtual engineers during IT service maintenance. The ultimate goal of IT service management optimization is to maximize the automation of IT routine procedures such as problem detection, determination, and resolution. However, fully automating the entire IT routine procedure without any human intervention is still a challenging task. In real IT systems, both step-wise resolution descriptions and scripted resolutions are often logged with their corresponding problematic incidents, and these logs typically contain abundant and valuable human domain knowledge. Hence, modeling, gathering and utilizing the domain knowledge from IT system maintenance logs plays an extremely crucial role in IT service management optimization. To optimize IT service management from the perspective of intelligent data mining techniques, three research directions are identified and considered greatly helpful for automatic service management: (1) efficiently extracting and organizing the domain knowledge from IT system maintenance logs; (2) collecting and updating the existing domain knowledge online by interactively recommending possible resolutions; and (3) automatically discovering the latent relations among scripted resolutions and intelligently suggesting proper scripted resolutions for IT problems. My dissertation addresses these challenges by designing and implementing a set of intelligent data-driven solutions, including (1) constructing a domain knowledge base for problem resolution inference; (2) recommending resolutions online in light of the explicit hierarchical resolution categories provided by domain experts; and (3) interactively recommending resolutions with the latent resolution relations learned through a collaborative filtering model.
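To make the collaborative-filtering idea mentioned at the end of the abstract concrete, the toy Python sketch below recommends resolutions for a new incident by cosine similarity to past incidents; the incident features and resolution names are entirely hypothetical and the dissertation's actual model is not reproduced.

```python
# Toy collaborative-filtering sketch for resolution recommendation: past
# incidents that look similar (cosine similarity) vote for the resolutions
# that worked for them. Purely illustrative data.
import numpy as np

# Rows = historical incidents, columns = hypothetical symptom features.
incident_features = np.array([
    [1, 0, 1, 0, 0],   # incident 0: disk-related symptoms
    [1, 1, 1, 0, 0],   # incident 1: disk-related symptoms
    [0, 0, 0, 1, 1],   # incident 2: network-related symptoms
    [0, 1, 0, 1, 1],   # incident 3: network-related symptoms
], dtype=float)
resolutions = ["clean_tmp_and_extend_volume", "clean_tmp_and_extend_volume",
               "restart_network_service", "restart_network_service"]

def recommend(new_incident, k=2):
    """Return the resolutions of the k most similar historical incidents."""
    v = np.asarray(new_incident, dtype=float)
    sims = incident_features @ v / (
        np.linalg.norm(incident_features, axis=1) * np.linalg.norm(v) + 1e-12)
    top = np.argsort(sims)[::-1][:k]
    return [(resolutions[i], round(float(sims[i]), 2)) for i in top]

print(recommend([1, 0, 1, 1, 0]))   # mostly disk symptoms -> disk resolution ranked first
```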
APA, Harvard, Vancouver, ISO, and other styles
28

JABEEN, SAIMA. "Document analysis by means of data mining techniques." Doctoral thesis, Politecnico di Torino, 2014. http://hdl.handle.net/11583/2537297.

Full text
Abstract:
The huge amount of textual data produced every day by scientists, journalists and Web users allows many different aspects of the information stored in published documents to be investigated. Data mining and information retrieval techniques are exploited to manage and extract information from huge amounts of unstructured textual data. Text mining, also known as text data mining, is the process of extracting high-quality information (focusing on relevance, novelty and interestingness) from text by identifying patterns. Text mining typically involves structuring the input text by means of parsing and other linguistic features, or sometimes by removing extraneous data, and then finding patterns in the structured data. The patterns are then evaluated and the output interpreted to accomplish the desired task. Recently, text mining has gained attention in several fields, such as security (analysis of Internet news), commercial applications (search and indexing), and academia (query answering). Beyond searching for the documents containing the words given in a user query, text mining may provide direct answers to the user through the semantic web, based on content (its meaning and context). It can also act as an intelligence analyst and can be used in some email spam filters for filtering out unwanted material. Text mining usually includes tasks such as clustering, categorization, sentiment analysis, entity recognition, entity relation modeling and document summarization. In particular, summarization approaches are suitable for identifying relevant sentences that describe the main concepts presented in a document dataset. Furthermore, the knowledge contained in the most informative sentences can be employed to improve the understanding of user and/or community interests. Different approaches have been proposed to extract summaries from unstructured text documents. Some of them are based on the statistical analysis of linguistic features by means of supervised machine learning or data mining methods, such as Hidden Markov models, neural networks and Naive Bayes methods. An appealing research field is the extraction of summaries tailored to the major user interests; in this context, extracting useful information according to domain knowledge related to the user interests is a challenging task. The main topics have been the study and design of novel data representations and data mining algorithms useful for managing and extracting knowledge from unstructured documents. This thesis describes an effort to investigate the application of data mining approaches firmly established for transactional data (e.g., frequent itemset mining) to textual documents. Frequent itemset mining is a widely used exploratory technique to discover hidden correlations that frequently occur in the source data. Although its application to transactional data is well established, the use of frequent itemsets in textual document summarization had not been investigated before. This work exploits frequent itemsets for the purpose of multi-document summarization, and a novel multi-document summarizer, namely ItemSum (Itemset-based Summarizer), is presented. It is based on an itemset-based model, i.e., a framework comprising the frequent itemsets extracted from the document collection.
Highly representative and non-redundant sentences are selected for the summary by considering both sentence coverage, with respect to a sentence relevance score based on tf-idf statistics, and a concise and highly informative itemset-based model. To evaluate ItemSum's performance, a suite of experiments on a collection of news articles has been performed. The results show that ItemSum significantly outperforms widely used previous summarizers in terms of precision, recall, and F-measure. We also validated our approach against a large number of approaches on the DUC'04 document collection. Performance comparisons, in terms of precision, recall, and F-measure, have been performed by means of the ROUGE toolkit. In most cases, ItemSum significantly outperforms the considered competitors. Furthermore, the impact of both the main algorithm parameters and the adopted model coverage strategy on the summarization performance is investigated as well. In some cases, the soundness and readability of the generated summaries are unsatisfactory, because the summaries do not cover all the semantically relevant data facets in an effective way. A step towards the generation of more accurate summaries has been made by semantics-based summarizers. Such approaches combine the use of general-purpose summarization strategies with ad-hoc linguistic analysis. The key idea is to also consider the semantics behind the document content to overcome the limitations of general-purpose strategies in differentiating between sentences based on their actual meaning and context. Most of the previously proposed approaches perform the semantics-based analysis as a preprocessing step that precedes the main summarization process; therefore, the generated summaries could not entirely reflect the actual meaning and context of the key document sentences. In contrast, we aim at tightly integrating the ontology-based document analysis into the summarization process, in order to take the semantic meaning of the document content into account during the sentence evaluation and selection processes. With this in mind, we propose a new multi-document summarizer, namely the Yago-based Summarizer, that integrates an established ontology-based entity recognition and disambiguation step. Named entity recognition based on the Yago ontology is used for the task of text summarization. The Named Entity Recognition (NER) task is concerned with marking occurrences of a specific object being mentioned; these mentions are then classified into a set of predefined categories. Standard categories include "person", "location", "geo-political organization", "facility", "organization", and "time". The use of NER in text summarization improves the summarization process by increasing the rank of informative sentences. To demonstrate the effectiveness of the proposed approach, we compared its performance on the DUC'04 benchmark document collections with that of a large number of state-of-the-art summarizers. Furthermore, we also performed a qualitative evaluation of the soundness and readability of the generated summaries and a comparison with the results that were produced by the most effective summarizers. A parallel effort has been devoted to integrating semantics-based models and the knowledge acquired from social networks into a document summarization model named SociONewSum.
The effort addresses the sentence-based generic multi-document summarization problem, which can be formulated as follows: given a collection of news articles ranging over the same topic, the goal is to extract a concise yet informative summary consisting of the most salient document sentences. An established ontological model has been used to improve summarization performance by integrating a textual entity recognition and disambiguation step. Furthermore, the analysis of user-generated content coming from Twitter has been exploited to discover current social trends and improve the appeal of the generated summaries. An experimental evaluation of SociONewSum's performance was conducted on real English-written news article collections and Twitter posts. The achieved results demonstrate the effectiveness of the proposed summarizer, in terms of different ROUGE scores, compared to state-of-the-art open-source summarizers as well as to a baseline version of the SociONewSum summarizer that does not perform any UGC analysis. Furthermore, the readability of the generated summaries has also been analyzed.
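As a rough illustration of the itemset-based idea behind ItemSum, the toy sketch below mines frequent word sets from a handful of sentences and ranks sentences by how many frequent itemsets they cover; the sentences, stop-word list and support threshold are illustrative assumptions, and the real system's tf-idf weighting and coverage optimisation are not reproduced.

```python
# Toy itemset-based summarization sketch: mine word sets that co-occur in many
# sentences, then pick the sentences covering the most frequent itemsets.
# Illustration only, not the ItemSum algorithm itself.
from itertools import combinations
from collections import Counter

sentences = [
    "the storm flooded the coastal city overnight",
    "rescue teams reached the flooded city by boat",
    "the coastal city declared a state of emergency",
    "officials said rescue operations will continue overnight",
]
stopwords = {"the", "a", "by", "of", "said", "will"}
bags = [set(w for w in s.split() if w not in stopwords) for s in sentences]

# Frequent itemsets of size 1 and 2 (minimum support = 2 sentences).
min_support = 2
counts = Counter()
for bag in bags:
    for size in (1, 2):
        counts.update(frozenset(c) for c in combinations(sorted(bag), size))
frequent = {iset for iset, n in counts.items() if n >= min_support}

# Score each sentence by how many frequent itemsets it covers; pick the top 2.
scores = [(sum(iset <= bag for iset in frequent), s) for bag, s in zip(bags, sentences)]
summary = [s for _, s in sorted(scores, reverse=True)[:2]]
print(summary)
```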
APA, Harvard, Vancouver, ISO, and other styles
29

Better, Marco L. "Data mining techniques for prediction and classification in discrete data applications." Connect to online resource, 2007. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3273688.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Storer, Jeremy J. "Computational Intelligence and Data Mining Techniques Using the Fire Data Set." Bowling Green State University / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1460129796.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Torre, Fabrizio. "3D data visualization techniques and applications for visual multidimensional data mining." Doctoral thesis, Universita degli studi di Salerno, 2014. http://hdl.handle.net/10556/1561.

Full text
Abstract:
2012 - 2013
Although modern technology provides new tools to measure the world around us, we are quickly generating massive amounts of high-dimensional, spatio-temporal data. In this work, I deal with two types of datasets: one in which the spatial characteristics are relatively dynamic and the data are sampled at different periods of time, and the other where many dimensions prevail, although the spatial characteristics are relatively static. The first dataset refers to a peculiar aspect of uncertainty arising from the contractual relationships that regulate a project's execution: dispute management. In recent years there has been a growth in the size and complexity of the projects managed by public or private organizations. This leads to an increased probability of project failure, frequently due to the difficulty of achieving objectives such as on-time delivery, cost containment, and expected quality. In particular, one of the most common causes of project failure is the very high degree of uncertainty that affects the expected performance of the project, especially when different stakeholders with divergent aims and goals are involved in the project...[edited by author]
XII n.s.
APA, Harvard, Vancouver, ISO, and other styles
32

Li, Qiao. "Data mining and statistical techniques applied to genetic epidemiology." Thesis, University of East Anglia, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.533716.

Full text
Abstract:
Genetic epidemiology is the study of the joint action of genes and environmental factors in determining the phenotypes of diseases. The twin study is a classic and important epidemiological tool, which can help to separate the underlying effects of genes and environment on phenotypes. Twin data have been widely examined using traditional methods in genetic epidemiological research. However, they provide a rich source of information related to many complex phenotypes that has the potential to be further explored and exploited. This thesis focuses on two major genetic epidemiological approaches, familial aggregation analysis and linkage analysis, using twin data from the TwinsUK Registry. Structural equation modelling (SEM) is a conventional method used in familial aggregation analysis, and is applied in this research to discover the underlying genetic and environmental influences on two complex phenotypes: coping strategies and osteoarthritis. However, SEM is a confirmatory method and relies on prior biomedical hypotheses. A new exploratory method named MDS-C, combining multidimensional scaling and clustering, is developed in this thesis. It does not rely on prior hypothetical models and is applied to uncover underlying genetic determinants of bone mineral density (BMD). The results suggest that the genetic influence on BMD is site-specific. Haseman-Elston (H-E) regression is a conventional linkage analysis approach that uses identity-by-descent (IBD) information between twins to detect quantitative trait loci (QTLs) which regulate a quantitative phenotype. However, it only considers the genetic effect from individual loci. Two new approaches, a pair-wise H-E regression (PWH-E) and a feature screening approach (FSA), are proposed in this research to detect QTLs while allowing for gene-gene interaction. Simulation studies demonstrate that PWH-E and FSA have greater power to detect QTLs with interactions. Application to real-world BMD data results in the identification of a set of potential QTLs, including 7 chromosomal loci consistent with previous genome-wide studies.
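For readers unfamiliar with Haseman-Elston regression, the sketch below runs the classic version on synthetic sib-pair data, assuming numpy: the squared phenotype difference of each pair is regressed on the proportion of alleles shared identical by descent, and a negative slope is taken as evidence of linkage. The simulation parameters are illustrative only.

```python
# Classic Haseman-Elston regression on synthetic sib-pair data: regress the
# squared phenotype difference of each pair on the proportion of alleles
# shared identical-by-descent (IBD) at a marker; a significantly negative
# slope suggests a QTL linked to that marker. Illustrative only.
import numpy as np

rng = np.random.default_rng(42)
n_pairs = 500
ibd = rng.choice([0.0, 0.5, 1.0], size=n_pairs, p=[0.25, 0.5, 0.25])  # IBD sharing

# Simulate a QTL effect: pairs sharing more alleles IBD have more similar traits.
qtl_effect = rng.normal(size=n_pairs) * np.sqrt(1.0 - ibd)
trait_diff_sq = (qtl_effect + rng.normal(scale=0.5, size=n_pairs)) ** 2

# Ordinary least squares of squared difference on IBD proportion.
X = np.column_stack([np.ones(n_pairs), ibd])
beta, *_ = np.linalg.lstsq(X, trait_diff_sq, rcond=None)
print(f"intercept={beta[0]:.3f}, slope={beta[1]:.3f}  (negative slope suggests linkage)")
```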
APA, Harvard, Vancouver, ISO, and other styles
33

Palmer, Nathan Patrick. "Data mining techniques for large-scale gene expression analysis." Thesis, Massachusetts Institute of Technology, 2011. http://hdl.handle.net/1721.1/68493.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 238-256).
Modern computational biology is awash in large-scale data mining problems. Several high-throughput technologies have been developed that enable us, with relative ease and little expense, to evaluate the coordinated expression levels of tens of thousands of genes, evaluate hundreds of thousands of single-nucleotide polymorphisms, and sequence individual genomes. The data produced by these assays have provided the research and commercial communities with the opportunity to derive improved clinical prognostic indicators, as well as develop an understanding, at the molecular level, of the systemic underpinnings of a variety of diseases. Aside from the statistical methods used to evaluate these assays, another, more subtle challenge is emerging. Despite the explosive growth in the amount of data being generated and submitted to the various publicly available data repositories, very little attention has been paid to managing the phenotypic characterization of their samples (i.e., managing class labels in a controlled fashion). If sense is to be made of the underlying assay data, the samples' descriptive metadata must first be standardized in a machine-readable format. In this thesis, we explore these issues, specifically within the context of curating and analyzing a large DNA microarray database. We address three main challenges. First, we acquire a large subset of a publicly available microarray repository and develop a principled method for extracting phenotype information from free-text sample labels, then use that information to generate an index of the sample's medically relevant annotation. The indexing method we develop, Concordia, incorporates pre-existing expert knowledge relating to the hierarchical relationships between medical terms, allowing queries of arbitrary specificity to be efficiently answered. Second, we describe a highly flexible approach to answering the question: "Given a previously unseen gene expression sample, how can we compute its similarity to all of the labeled samples in our database, and how can we utilize those similarity scores to predict the phenotype of the new sample?" Third, we describe a method for identifying phenotype-specific transcriptional profiles within the context of this database, and explore a method for measuring the relative strength of those signatures across the rest of the database, allowing us to identify molecular signatures that are shared across various tissues and diseases. These shared fingerprints may form a quantitative basis for optimal therapy selection and drug repositioning for a variety of diseases.
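The similarity-based phenotype prediction the abstract describes can be illustrated, in heavily simplified form, by scoring a new expression profile against labeled reference samples and letting the most similar samples vote. This is only a generic nearest-neighbour sketch with invented data, not the method developed in the thesis.

```python
# Illustrative sketch: cosine similarity against labeled samples, then a vote.
import numpy as np
from collections import Counter

def predict_phenotype(new_sample, reference, labels, k=5):
    """reference: (n_samples, n_genes) matrix; labels: phenotype string per row."""
    ref_norm = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    new_norm = new_sample / np.linalg.norm(new_sample)
    similarity = ref_norm @ new_norm              # cosine similarity to each sample
    top_k = np.argsort(similarity)[::-1][:k]      # indices of most similar samples
    return Counter(labels[i] for i in top_k).most_common(1)[0][0]

# Toy usage with random data standing in for expression profiles.
rng = np.random.default_rng(1)
reference = rng.random((100, 2000))
labels = ["tumor" if i < 50 else "normal" for i in range(100)]
print(predict_phenotype(rng.random(2000), reference, labels))
```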
by Nathan Patrick Palmer.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
34

Han, J., M. Kamber, and J. Pei. "Data mining: concepts and techniques." 2012. http://hdl.handle.net/10454/9053.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Veltman, Lisa M. "Incident Data Analysis Using Data Mining Techniques." 2008. http://hdl.handle.net/1969.1/ETD-TAMU-2008-08-35.

Full text
Abstract:
There are several databases collecting information on various types of incidents, and most analyses performed on these databases do not expand past basic trend analysis or counting occurrences. This research uses the more robust methods of data mining and text mining to analyze the Hazardous Substances Emergency Events Surveillance (HSEES) system data by identifying relationships among variables, predicting the occurrence of injuries, and assessing the value added by the text data. The benefits of performing a thorough analysis of past incidents include a better understanding of safety performance, of how to focus efforts to reduce incidents, and of how people are affected by these incidents. The results of this research showed that visually exploring the data via bar graphs did not yield any noticeable patterns. Clustering the data identified groupings of categories across the variable inputs, such as manufacturing events resulting from intentional acts like system startup and shutdown, performing maintenance, and improper dumping. Text mining the data allowed for clustering the events and further description of the data; however, these events were not noticeably distinct and drawing conclusions based on these clusters was limited. Including the text comments in the overall analysis of HSEES data greatly improved the predictive power of the models. Interpretation of the textual data's contribution was limited; however, the qualitative conclusions drawn were similar to those of the model without textual input. Although HSEES data is collected to describe the effects hazardous substance releases/threatened releases have on people, a fairly good predictive model was still obtained from the few variables identified as cause-related.
APA, Harvard, Vancouver, ISO, and other styles
36

Mantovani, Matteo. "Approximate Data Mining Techniques on Clinical Data." Doctoral thesis, 2020. http://hdl.handle.net/11562/1018039.

Full text
Abstract:
The past two decades have witnessed an explosion in the number of medical and healthcare datasets available to researchers and healthcare professionals. Substantial data collection efforts are required, and this prompts the development of appropriate data mining techniques and tools that can automatically extract relevant information from data and thereby provide insights into the various clinical behaviors or processes captured by the data. Since these tools should support the decision-making activities of medical experts, all the extracted information must be represented in a human-friendly way, that is, in a concise and easy-to-understand form. For this purpose, we propose a new framework that collects several new mining techniques and tools. These techniques mainly focus on two aspects: the temporal one and the predictive one. All of these techniques were then applied to clinical data, in particular ICU data from the MIMIC-III database. This showed the flexibility of the framework, which is able to retrieve different outcomes from the overall dataset. The first two techniques rely on the concept of Approximate Temporal Functional Dependencies (ATFDs). ATFDs have been proposed, with their suitable treatment of temporal information, as a methodological tool for mining clinical data. An example of the knowledge derivable through dependencies may be "within 15 days, patients with the same diagnosis and the same therapy usually receive the same daily amount of drug". However, current ATFD models do not analyze the temporal evolution of the data, such as "For most patients with the same diagnosis, the same drug is prescribed after the same symptom". To this end, we propose a new kind of ATFD called Approximate Pure Temporally Evolving Functional Dependencies (APEFDs). Another limitation of this kind of dependency is that it cannot deal with quantitative data when some tolerance is allowed for numerical values. In particular, this limitation arises in clinical data warehouses, where analysis and mining have to consider one or more measures related to quantitative data (such as lab test results and vital signs) with respect to multiple dimensional (alphanumeric) attributes (such as patient, hospital, physician, diagnosis) and some time dimensions (such as the day since hospitalization and the calendar date). In this scenario, we introduce a new kind of ATFD, named Multi-Approximate Temporal Functional Dependency (MATFD), which considers dependencies between dimensions and quantitative measures from temporal clinical data. These new dependencies may provide new knowledge such as "within 15 days, patients with the same diagnosis and the same therapy receive a daily amount of drug within a fixed range". The other techniques are based on pattern mining, which has also been proposed as a methodological tool for mining clinical data. However, many methods proposed so far focus on mining temporal rules that describe relationships between data sequences or instantaneous events, without considering the presence of more complex temporal patterns in the dataset. These patterns, such as trends of a particular vital sign, are often very relevant for clinicians. Moreover, it is of real interest to discover whether some kind of event, such as a drug administration, is capable of changing these trends, and how.

To this end, we propose a new kind of temporal pattern, called Trend-Event Patterns (TEPs), which focuses on events and their influence on trends that can be retrieved from measures such as vital signs. With TEPs we can express concepts such as "the administration of paracetamol to a patient with an increasing temperature leads to a decreasing trend in temperature after the administration occurs". We also analyze another pattern mining technique that includes prediction. This technique discovers a compact set of patterns that aims to describe the condition (or class) of interest. Our framework relies on a classification model that considers and combines various predictive pattern candidates and selects only those that are important for improving the overall class prediction performance. We show that our classification approach achieves a significant reduction in the number of extracted patterns, compared to state-of-the-art methods based on the minimum predictive pattern mining approach, while preserving the overall classification accuracy of the model. For each technique described above, we developed a tool to retrieve its kind of rule. All the results are obtained by pre-processing and mining clinical data and, as mentioned before, in particular ICU data from the MIMIC-III database.
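As an aside for readers new to approximate functional dependencies, a minimal sketch of the underlying idea (without any of the temporal or multi-measure machinery introduced in the thesis) is the classic g3 error: the fraction of rows that must be removed for X → Y to hold exactly. The column names and data below are invented.

```python
# Hedged sketch of an approximate functional dependency check (g3 error).
import pandas as pd

def afd_error(df, lhs, rhs):
    """Fraction of rows that must be deleted so that lhs -> rhs holds exactly."""
    kept = df.groupby(lhs)[rhs].agg(lambda s: s.value_counts().iloc[0]).sum()
    return 1.0 - kept / len(df)

data = pd.DataFrame({
    "diagnosis":  ["D1", "D1", "D1", "D2", "D2"],
    "therapy":    ["T1", "T1", "T1", "T2", "T2"],
    "daily_dose": [100, 100, 120, 50, 50],
})
err = afd_error(data, ["diagnosis", "therapy"], "daily_dose")
print(f"g3 error = {err:.2f}")   # 0.20: one of five rows violates the dependency
```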
APA, Harvard, Vancouver, ISO, and other styles
37

Huang, Jen-Chieh, and 黃仁傑. "Using Data Mining Techniques to Support Data Retrieval." Thesis, 2001. http://ndltd.ncl.edu.tw/handle/41468540528588892817.

Full text
Abstract:
Master's thesis
Chung Yuan Christian University
Institute of Information Management
89
The explosive growth of the Internet, and in particular the World Wide Web, in recent years has put huge amounts of information at the disposal of anyone with access to the Internet. A problem facing information retrieval on the web is how to help users read easily. Clustering methods group data with similar features into clusters without needing predefined cluster labels, and document clustering or classification can help users read more easily. In the past, clustering or classification required extracting keywords to describe each document; keyword extraction from documents yields a content-based clustering method. Data mining is a technique that can discover previously unknown knowledge from large amounts of data, and research that applies data mining techniques to discover knowledge on the web is known as web mining. In this research, a new method for clustering web pages is proposed. It uses web mining techniques to produce a user-based clustering. A major advantage of this approach is that relevancy information is objectively reflected by the usage logs; frequent simultaneous visits to two unrelated documents should indicate that they are in fact closely related. Web page clustering is performed by analyzing the Sogi website's log file, and the experimental results show that the clusters have high precision.
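A rough sketch of the usage-based idea described above (pages that co-occur in the same sessions are treated as related and then clustered) might look as follows; the sessions, page names, and the choice of hierarchical clustering are illustrative assumptions, not the thesis's exact algorithm.

```python
# Usage-based page clustering from co-visits in session logs (illustrative only).
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

sessions = [                      # each list is the set of pages one user visited
    ["home", "news", "sports"],
    ["home", "news"],
    ["forum", "chat"],
    ["forum", "chat", "sports"],
]
pages = sorted({p for s in sessions for p in s})
idx = {p: i for i, p in enumerate(pages)}
co = np.zeros((len(pages), len(pages)))
for s in sessions:
    for a, b in combinations(set(s), 2):
        co[idx[a], idx[b]] += 1
        co[idx[b], idx[a]] += 1

dist = 1.0 - co / co.max()        # frequently co-visited pages are "close"
np.fill_diagonal(dist, 0.0)
labels = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")
print(dict(zip(pages, labels)))
```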
APA, Harvard, Vancouver, ISO, and other styles
38

Narvaez, Vilema Miryan Estela, Felice Crupi, and Fabrizio Angiulli. "Data mining techniques for large and complex data." Thesis, 2017. http://hdl.handle.net/10955/1875.

Full text
Abstract:
Dottorato di Ricerca in Information and Communication Engineering For Pervasive Intelligent Environments, Ciclo XXIX
During these three years of research I dedicated myself to the study and design of data mining techniques for large quantities of data. Particular attention was devoted to training set condensing techniques for the nearest-neighbor classification rule and to techniques for node anomaly detection in networks. The first part of this thesis focuses on the design of strategies to reduce the size of the subset extracted by condensing techniques, and on their experimental evaluation. Training set condensing techniques aim to determine a subset of the original training set that allows all training set examples to be correctly classified; the subset extracted by these techniques is also known as a consistent subset. The result of the research was the development of various subset selection strategies, designed to determine during the training phase the most promising subset based on different methods of estimating test accuracy. Among them, the PACOPT strategy is based on the Pessimistic Error Estimate (PEE) to estimate generalization as a trade-off between training set accuracy and model complexity. The experimental phase used the FCNN condensation technique as a reference. Among the condensation methods based on the nearest neighbor decision rule (NN rule), FCNN (Fast Condensed NN) is one of the most advantageous, particularly in terms of time performance. We showed that the designed selection strategies are guaranteed to preserve the accuracy of a consistent subset, and that they significantly reduce the size of the model. Comparison with notable training-set reduction techniques for the NN rule witnesses the state-of-the-art performance of the strategies introduced here. The second part of the thesis is directed towards the design of analysis tools for network-structured data. Anomaly detection is an area that has received much attention in recent years. It has a wide variety of applications, including fraud detection and network intrusion detection. Techniques for anomaly detection in static graphs assume that the networks do not change and can represent only a single snapshot of the data. As real-world networks are constantly changing, there has been a shift in focus to dynamic graphs, which evolve over time. We present a technique for node anomaly detection in networks where arcs are annotated with their time of creation. The technique aims at singling out anomalies by taking simultaneously into account information concerning both the structure of the network and the order in which connections have been established. The latter information is obtained from timestamps associated with arcs. A set of temporal structures is induced by checking certain conditions on the order of arc appearance denoting different kinds of user behaviors. The distribution of these structures is computed for each node and used to detect anomalies. We point out that the approach investigated here is substantially different from techniques dealing with dynamic networks. Indeed, our aim is not to determine the points in time at which a certain portion of the network (typically a community or a subgraph) exhibited a significant change, as usually done by dynamic-graph anomaly detection techniques. Rather, our primary aim is to analyze each single node by taking simultaneously into account its temporal footprint.
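As background, training-set condensation can be illustrated with Hart's classic condensed nearest-neighbour procedure, an ancestor of FCNN rather than FCNN itself: misclassified examples are absorbed into the subset until it is training-set consistent. The dataset and parameters below are invented.

```python
# Simplified illustration of training-set condensation (Hart's CNN, not FCNN).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
subset = [0]                                   # start from an arbitrary point
changed = True
while changed:
    changed = False
    knn = KNeighborsClassifier(n_neighbors=1).fit(X[subset], y[subset])
    for i in range(len(X)):
        if i not in subset and knn.predict(X[i:i + 1])[0] != y[i]:
            subset.append(i)                   # absorb every misclassified example
            knn = KNeighborsClassifier(n_neighbors=1).fit(X[subset], y[subset])
            changed = True
print(f"consistent subset size: {len(subset)} of {len(X)}")
```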
Università della Calabria
APA, Harvard, Vancouver, ISO, and other styles
39

Guarascio, Massimo, Domenico Saccà, Giuseppe Manco, and Luigi Palopoli. "Data mining techniques for fraud detection." Thesis, 2014. http://hdl.handle.net/10955/419.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Mallick, Jnanaranjan, and Manmohan Tudu. "Study of various data mining techniques." Thesis, 2007. http://ethesis.nitrkl.ac.in/4254/1/%E2%80%9CStudy_of_various_data_mining_techniques.pdf.

Full text
Abstract:
The advent of computing technology has significantly influenced our lives, and two major impacts of this effect are Business Data Processing and Scientific Computing. During the initial years of the development of computer techniques for business, computer professionals were concerned with designing files to store data so that information could be efficiently retrieved. There were restrictions on the storage size for data and on the speed of accessing it. Needless to say, the activity was restricted to a very few, highly qualified professionals. Then came the era when the task was simplified by a DBMS [1]. The responsibility for intricate tasks, such as the declarative aspects of the program, was passed on to the database administrator, and the user could pose queries in simpler languages such as query languages.
APA, Harvard, Vancouver, ISO, and other styles
41

Chuang, Tse-sheng, and 莊澤生. "Discovering Issue Networks Using Data Mining Techniques." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/27215766204365017293.

Full text
Abstract:
Master's thesis
National Sun Yat-sen University
Department of Information Management
90
With the development of data mining techniques, the knowledge discovered by data mining now ranges from business applications to fraud detection. Too often, however, we see only the profit-making justification for investing in data mining while losing sight of the fact that it can help resolve issues of global or national importance. In this research, we propose an architecture for issue-oriented information construction and knowledge discovery related to political or public policy issues. In this architecture, we adopt issue networks as the description model and data mining as the core technique. The study is carried out and verified by constructing a prototype system and analyzing case data. There are three main topics in our research. The issue network information construction starts by retrieving text information on a specified issue from news reports; keywords retrieved from the news reports are converted into structured network nodes and presented in the form of issue networks. The second topic is the clustering of network actors: we adopt an issue-association clustering method to provide views of the clustering of issue participants based on the relations among issues. In the third topic, we use a link analysis method to compute the importance of actors and sub-issues. Our study concludes with a performance evaluation by domain experts: we conduct recall and precision evaluation for the first topic, and certainty, novelty, and utility evaluation for the others.
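The abstract does not spell out which link analysis method is used; as a stand-in, the sketch below scores the nodes of a small, invented issue network with PageRank, a common way to derive node importance from link structure.

```python
# Illustrative node-importance scoring of an issue network with PageRank.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([                 # invented actor -> sub-issue links
    ("legislator_A", "budget_cut"),
    ("legislator_B", "budget_cut"),
    ("ngo_X", "environmental_impact"),
    ("legislator_A", "environmental_impact"),
    ("media_Y", "legislator_A"),
])
importance = nx.pagerank(g, alpha=0.85)
for node, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{node:25s} {score:.3f}")
```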
APA, Harvard, Vancouver, ISO, and other styles
42

HUANG, JYUN-HAO, and 黃俊豪. "Design optimization process with data mining techniques." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/59498094123929998366.

Full text
Abstract:
Master's thesis
National Chung Hsing University
Department of Mechanical Engineering
97
This thesis incorporates data mining into the optimization process. The useful information embedded in the data can be extracted to enhance computational efficiency and produce better solutions. For unconstrained optimization problems, data mining is used to find the design space that might contain the global solution. For constrained optimization problems, it is used to find the possible feasible regions. Sequential quadratic programming (SQP) is then used to find the optimum solution in the identified areas, to see whether the chance of finding the global solution is increased. Test results show that for highly noisy multi-modal problems, data mining and SQP may not be able to find the global solution. Therefore, an evolutionary algorithm is incorporated into the optimization process in the second part of this thesis. The evolutionary algorithm searches the design space using multiple points simultaneously. If the design space can be reduced, the computational time spent will be reduced and the chance of finding the global solution will be increased. In order to save computational time for structural optimization problems, an artificial neural network is employed to approximate the results of structural analyses. This thesis uses an evolution strategy to search for the optimum solution in the design space found by data mining. For structural optimization problems, neural networks are used to replace exact finite element analyses, and SQP is used in the last step to search for the exact optimum solution. Several test problems show that the proposed approach not only finds better solutions but also spends less computational time.
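A hedged sketch of the overall workflow described above (sample the design space, mine the samples to pick a promising region, then refine with SQP) might look like the following; the objective function and every parameter are invented for illustration.

```python
# Sketch: cluster sampled designs, pick the most promising region, refine with SQP.
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import minimize

def objective(x):                      # a noisy multi-modal test function
    return np.sum(x**2) + 0.5 * np.sin(5 * x[0]) * np.sin(5 * x[1])

rng = np.random.default_rng(0)
samples = rng.uniform(-5, 5, size=(300, 2))
values = np.array([objective(x) for x in samples])

# "Mine" the samples: cluster them and keep the cluster with the best mean value.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(samples)
best_cluster = min(range(5), key=lambda c: values[km.labels_ == c].mean())
in_best = km.labels_ == best_cluster
start = samples[in_best][values[in_best].argmin()]

# Refine inside the promising region with SQP (SLSQP in SciPy).
result = minimize(objective, start, method="SLSQP", bounds=[(-5, 5), (-5, 5)])
print(result.x, result.fun)
```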
APA, Harvard, Vancouver, ISO, and other styles
43

Shiau, Shu-Min, and 蕭書民. "Applying Data Mining Techniques to Products Test." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/58276969863380870641.

Full text
Abstract:
Master's thesis
Shih Hsin University
Graduate Institute of Information Management (including in-service master's program)
99
With longer product development times, costs are usually higher, and if issues are found late in the development process, the cost of correcting them is even higher. Therefore, functional testing of the product during the development process is a necessary stage, and through testing we hope that problems can be detected as early as possible. With effective testing, issues can be found and corrected before product shipment. Our goal is to give customers products of the highest quality and stability, so product testing is an extremely important task in manufacturing. Users require network products that remain stable over long periods of use, so testing during product development is very important to meet customer demand. This study uses network product testing data for exploration and model building, analyzing association rules in the data and validating them. The results show that network product testing data can be used effectively to determine the items to include in a test plan. By classifying the information according to different conditions and factors, the relationship between products and test plans can be seen more clearly; focusing on individual records shows that different types of products have their own relationships with the test plan. Network product testing can also use data mining techniques to obtain accuracy while reducing labor requirements. In this way, labor resources can be used more efficiently, data mining can create more profit for the company, and customer satisfaction with the product can be enhanced.
APA, Harvard, Vancouver, ISO, and other styles
44

Chen, Po-Jung, and 陳柏融. "Selecting Test Items by Data Mining Techniques." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/31407843272135626158.

Full text
Abstract:
Master's thesis
Shih Hsin University
Graduate Institute of Information Management (including in-service master's program)
97
One of the most important goals of test design is to pick items with the most discrimination. In the past, most work assumed no dependence relations among test items, so that test papers were made by picking the items with the highest individual discrimination. In reality, however, test items may be related to other items, so the overall discrimination of a test paper cannot simply be added up. Hence, this study proposes a two-step method to design test papers by picking discriminative item combinations from the item bank. We first analyze archival tests to discover substitute items and to recognize discriminative test itemsets using data mining technology. Then, test items are recommended to complete the discriminative test paper. Finally, a real-life case is used to verify the proposed method. The test data are provided by the Chinese Enterprise Planning Association (CERP) in Taiwan. The experimental results show that the two-step method can complete the test design task efficiently. In addition, the newly composed test paper is highly discriminative, since its discrimination is very close to the maximum discrimination obtained under the assumption of item independence, as ideally generated by Item Response Theory.
APA, Harvard, Vancouver, ISO, and other styles
45

Cernaut, Oana-Maria. "Customer targeting models using data mining techniques." Master's thesis, 2019. http://hdl.handle.net/10773/30010.

Full text
Abstract:
In recent years, the segmentation process has undergone numerous changes along with the advances in data mining. Knowledge discovery can automate segmentation and provide better insights into customer trends and dynamics. The objective of this work is to improve the quality of marketing segmentation for company T. More specifically, the research question it aims to answer is whether data mining techniques deliver a better segmentation model than intuitive approaches. The segmentation steps comprise the identification of the necessary variables, the selection of the relevant ones to conduct the segmentation, and the use of artificial neural networks to predict future outcomes. To this end, the work makes use of web scraping (based on Google searches), K-means clustering, and artificial neural networks.
Master's in Marketing
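A minimal sketch of the pipeline described in the abstract (K-means to form customer segments, then an artificial neural network to assign future customers to a segment) could look like this; the customer features and all parameters are placeholders, not company T's data.

```python
# Sketch: K-means segmentation followed by an ANN that predicts segment membership.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical customer features, e.g. search volume, revenue, employee count.
customers = rng.random((500, 3)) * [1000, 5e6, 200]

X = StandardScaler().fit_transform(customers)
segments = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Train an ANN so that future customers can be assigned to a segment directly.
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=42)
ann.fit(X, segments)
print(ann.predict(X[:5]), segments[:5])
```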
APA, Harvard, Vancouver, ISO, and other styles
46

Lin, Yen-Tim, and 林彥廷. "Analysis of Spam Using Data Mining Techniques." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/02200975323671982267.

Full text
Abstract:
Master's thesis
Huafan University
Master's Program, Department of Information Management
96
With the development of new technology, the application of information technology has become one of the main habits of daily life, for example the use of e-mail. This has, however, been followed by the intrusion of spam mails, which harass users and waste network resources. Setting up an e-mail filtering system is therefore an important topic in network security. Previous studies of e-mail filtering mostly depend on keywords or on building black and white lists to identify spam, but they are limited by the problem of misclassifying e-mails, which may cause important mails to be missed. This study is based on the characteristics of spam: spam mails are described by 16 attributes using data mining technology, and a voting method is used to classify the mails. The data are mainly collected from the mailboxes of Huafan University. First, they are converted into a data format with the 16 characteristics. Thereafter, a decision tree, a back-propagation network (BPN), and a support vector machine (SVM) are trained, and their votes produce the required classification results. According to the simulation results, the e-mail filtering mechanism in this study provides a satisfactory outcome.
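A compact sketch of the voting scheme described above, with scikit-learn's multi-layer perceptron standing in for the back-propagation network and random placeholder data standing in for the 16 mail attributes, might look like this.

```python
# Majority-vote spam classifier: decision tree + MLP (as BPN stand-in) + SVM.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 16))                     # 16 attributes per e-mail
y = (X[:, 0] + X[:, 1] > 1).astype(int)        # 1 = spam, 0 = ham (toy rule)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("bpn", MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000)),
        ("svm", SVC()),
    ],
    voting="hard",                             # simple majority vote
)
voter.fit(X_tr, y_tr)
print("accuracy:", voter.score(X_te, y_te))
```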
APA, Harvard, Vancouver, ISO, and other styles
47

Chu, Yu-Hou, and 朱宇侯. "Using Data Mining Techniques for Vehicle Warranty." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/64698339906754441668.

Full text
Abstract:
Master's thesis
Chung Yuan Christian University
Institute of Information Management
93
The innovation of information technology has driven the development of information production and gathering. Many companies have set up information systems to collect daily operational data and have stored these data in databases in order to reinforce their competitive advantage. Therefore, how to smartly and automatically transform data into useful information and knowledge has become an important goal of data application, and data mining has gradually become important as a result. Research on data mining techniques has developed well in many fields, and the application of data mining techniques in the maintenance field has received much attention in recent years. This project adapts rough sets and association rules from data mining to develop a model called RSFP (Rough Set and Frequent Pattern list). The model sifts the attributes to leave only the most important ones; data mining is then applied to obtain the final association rules. In the project, we take as an example the data reported by a service station. We sift out the minor attributes by means of rough set theory to gather the most important attributes, and then use a frequent pattern list to mine association rules. We apply the combination of rough sets and the frequent pattern list to vehicle warranty fees and data. The result improves the efficiency of attribute sifting and mining, and also helps the service station find the rules it is interested in.
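Only the association-rule half of the RSFP idea is sketched below, using the mlxtend library rather than the thesis's own frequent-pattern-list structure; the warranty records and attribute names are invented.

```python
# Association-rule mining over one-hot encoded warranty/repair attributes.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

records = pd.DataFrame([
    {"engine_fault": True,  "high_mileage": True,  "warranty_claim": True},
    {"engine_fault": True,  "high_mileage": True,  "warranty_claim": True},
    {"engine_fault": False, "high_mileage": True,  "warranty_claim": False},
    {"engine_fault": True,  "high_mileage": False, "warranty_claim": True},
    {"engine_fault": False, "high_mileage": False, "warranty_claim": False},
])
itemsets = apriori(records, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```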
APA, Harvard, Vancouver, ISO, and other styles
48

Kao, Ming-Jui, and 高明瑞. "Importers Value Discovery Using Data Mining Techniques." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/85174480441730344807.

Full text
Abstract:
Master's thesis
Yuan Ze University
Department of Information Management
98
As the market for shipping by sea flourishes, liner container carriers are raising their freight volumes and enhancing service quality in order to meet customers' demands. However, it is very important to identify, from the large number of customers in the global market, those with a higher contribution margin. To determine these customers, this research combined the RFM model for measuring customer value with the K-means algorithm for clustering analysis. Since the Trans-Pacific trade of US imports from the Far East is more indicative, we adopted US market data for this research. The results showed that the proposed clustering model is efficient. The customers were divided into five categories: (A) Best, (B) Spender, (C) Frequent, (D) Uncertain, and (E) Loss/Negative. Among those five categories, the Best and Spender groups account for 75% of overall Trans-Pacific trade. Corresponding marketing strategies for these two groups are proposed in this research as well.
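A small sketch of the RFM-plus-K-means combination described above; the shipment records, column names, and number of clusters are placeholders rather than the thesis's actual data.

```python
# Build RFM features per importer from shipment records, then cluster with K-means.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

shipments = pd.DataFrame({
    "importer": ["A", "A", "B", "C", "C", "C", "D"],
    "days_since_shipment": [10, 40, 200, 5, 15, 30, 400],
    "teu": [20, 35, 5, 80, 60, 70, 2],            # freight volume per shipment
})

rfm = shipments.groupby("importer").agg(
    recency=("days_since_shipment", "min"),       # days since most recent shipment
    frequency=("teu", "count"),                   # number of shipments
    monetary=("teu", "sum"),                      # total volume as a value proxy
)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(rfm)
)
rfm["segment"] = labels
print(rfm)
```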
APA, Harvard, Vancouver, ISO, and other styles
49

謝祖仁. "Using Data Mining Techniques in Patient Diagnosis." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/93434524520345344958.

Full text
Abstract:
Master's thesis
National Changhua University of Education
Department of Information Management
100
The medical industry is a service industry. Besides offering acute patients the service of medical consultation, it largely serves chronic patients who regularly visit the hospital, most of whom are middle-aged or elderly. In addition to the main disease for which they seek treatment, these patients often have other conditions; for example, diabetic patients frequently also have high blood pressure and eye problems, and must be tracked and treated at other specialty clinics when they visit. If complete and convenient medical care can be offered to patients visiting the hospital, time can be saved and patients can receive integrated care. Further analysis of the diagnostic information from patient visits is expected to reveal potential association rules, which could even be used for disease prevention. The findings of the study show that hypertension accounts for a notable proportion of visits, patient numbers, and visit rates, revealing the importance of chronic disease care. The number of women suffering from multiple chronic diseases is generally higher than that of men. Analysis by age group reveals that the number of hypertension patients increases markedly between 40 and 64 years of age. The association rule analysis of chronic disease diagnoses shows that hypertension is highly related, with confidence above 50%, to other chronic diseases such as cerebrovascular disease, diabetes, heart disease, other metabolic and immune disorders, and diseases of the esophagus, stomach, and duodenum.
APA, Harvard, Vancouver, ISO, and other styles
50

Renda, Alessandro. "Algorithms and techniques for data stream mining." Doctoral thesis, 2021. http://hdl.handle.net/2158/1235915.

Full text
Abstract:
The abstraction of data streams encompasses a vast range of diverse applications that continuously generate data and therefore require dedicated algorithms and approaches for exploitation and mining. In this framework both unsupervised and supervised approaches are generally employed, depending on the task and on the availability of annotated data. This thesis proposes novel algorithms and techniques specifically tailored to the streaming setting and to knowledge discovery from social networks. In the first part of this work we propose a novel clustering algorithm for data streams. Our investigation stems from the discussion of general challenges posed by cluster analysis and of those purely related to the streaming setting. First, we propose SF-DBSCAN (streaming fuzzy DBSCAN), a preliminary solution conceived as an extension of the popular DBSCAN algorithm. SF-DBSCAN handles the arrival of new objects and continuously updates the clustering result by taking advantage of concepts from fuzzy set theory. However, it gives equal importance to every collected object and therefore is not suited to managing unbounded data streams or adapting to evolving settings. Then, we introduce TSF-DBSCAN, a novel "temporal" adaptation of streaming fuzzy DBSCAN: it overcomes the limits of the previous proposal and proves to be effective in handling evolving and potentially unbounded data streams, discovering clusters with fuzzy overlapping borders. In the second part of the thesis we explore a supervised learning application: the goal of our analysis is to discover the public opinion towards the vaccination topic in Italy, by exploiting the popular Twitter platform as a data source. First, we discuss the design and development of a system for stance detection from text. The deployment of the classification model for the online monitoring of public opinion, however, cannot ignore the fact that tweets can be seen as a particular form of temporal data stream. Then, we discuss the importance of leveraging user-related information, which enables the design of a set of techniques aimed at deepening and enhancing the analysis. Finally, we compare different learning schemes for addressing concept drift, i.e., a change in the underlying data distribution, in a dynamic environment affected by real-world context-related events. In this case study and throughout the thesis, the proposed algorithms and techniques are supported by in-depth experimental analysis.
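The sketch below is not TSF-DBSCAN; it is only a toy illustration of the streaming setting the thesis addresses, re-clustering a sliding window of a drifting stream with plain DBSCAN. All parameters and the synthetic stream are assumptions.

```python
# Window-based clustering of a drifting data stream (toy illustration only).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
window, max_window = [], 200

def stream():
    """Endless 2-D points whose distribution slowly drifts over time."""
    t = 0
    while True:
        center = np.array([np.sin(t / 500), np.cos(t / 500)])
        yield center + rng.normal(scale=0.1, size=2)
        t += 1

for i, point in zip(range(1000), stream()):
    window.append(point)
    if len(window) > max_window:
        window.pop(0)                          # forget the oldest observation
    if i % 200 == 199:                         # periodically re-cluster the window
        labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(np.array(window))
        print(f"t={i}: {len(set(labels) - {-1})} clusters in current window")
```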
APA, Harvard, Vancouver, ISO, and other styles