
Dissertations / Theses on the topic 'Data Dimensionality Reduction'


Consult the top 50 dissertations / theses for your research on the topic 'Data Dimensionality Reduction.'


1

Vamulapalli, Harika Rao. "On Dimensionality Reduction of Data." ScholarWorks@UNO, 2010. http://scholarworks.uno.edu/td/1211.

Full text
Abstract:
The random projection method is an important tool for the dimensionality reduction of data, and it can be made efficient with strong error guarantees. In this thesis, we focus on linear transforms of high-dimensional data to a low-dimensional space satisfying the Johnson-Lindenstrauss lemma. In addition, we prove some theoretical results relating to the projections that are of interest in practical applications. We show how the technique can be applied to synthetic data with a probabilistic guarantee on the pairwise distances. The connection between dimensionality reduction and compressed sensing is also discussed.
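As a rough illustration of the random-projection idea this abstract describes, the following minimal numpy sketch projects synthetic high-dimensional points through a Gaussian matrix and checks the distortion of one pairwise distance; the dimensions and data are arbitrary stand-ins, not the thesis's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 10_000, 400           # points, original dim, reduced dim

X = rng.standard_normal((n, d))      # synthetic high-dimensional data

# Gaussian random projection: entries scaled by 1/sqrt(k) so that squared
# pairwise distances are preserved in expectation (Johnson-Lindenstrauss).
R = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ R

# Check the distortion of one pairwise distance.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(f"relative distortion: {abs(proj - orig) / orig:.3f}")
```

With k on the order of log(n)/eps^2, the JL lemma guarantees that all pairwise distances are preserved within a factor of 1 plus or minus eps with high probability.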
2

Widemann, David P. "Dimensionality reduction for hyperspectral data." College Park, Md.: University of Maryland, 2008. http://hdl.handle.net/1903/8448.

Full text
Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2008.
Thesis research directed by: Dept. of Mathematics. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
3

Baldiwala, Aliakbar. "Dimensionality Reduction for Commercial Vehicle Fleet Monitoring." Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/38330.

Full text
Abstract:
A variety of new features have been added to present-day vehicles, such as pre-crash warning, vehicle-to-vehicle communication, semi-autonomous driving systems, telematics, and drive-by-wire, and they demand very high bandwidth from in-vehicle networks. The various electronic control units inside the vehicle transmit useful information via automotive multiplexing, which allows information to be shared among the intelligent modules of an automotive electronic system. Optimum functionality is achieved by transmitting this data in real time. The high-bandwidth and high-speed requirements can be met either by using multiple buses or by implementing higher bandwidth, but doing so increases the cost of the network and the complexity of the wiring in the vehicle. Another option is to implement a higher-layer protocol that reduces the amount of data transferred using data reduction (DR) techniques, thus reducing bandwidth usage; the implementation cost is minimal, as changes are required only in software, not in hardware. In our work, we present a new data reduction algorithm, termed the Comprehensive Data Reduction (CDR) algorithm, used to minimize the bus utilization of the CAN bus for a future vehicle. The reduction in bus load is achieved by compressing the parameters, so that more messages, including lower-priority messages, can be sent efficiently on the CAN bus. The work also presents a performance analysis of the proposed algorithm against the boundary of fifteen compression algorithm and compression area selection algorithms (existing data reduction algorithms). The results of the analysis show that the proposed CDR algorithm provides better data reduction than the earlier algorithms, with promising results in terms of reduction in bus utilization, compression efficiency, and percent peak load of the CAN bus. This reduction in bus utilization permits a larger number of network nodes (ECUs) in the existing system without increasing its overall cost. The proposed algorithm was developed for the automotive environment, but it can also be utilized in any application where extensive information is transmitted among various control units via a multiplexing bus.
4

DWIVEDI, SAURABH. "DIMENSIONALITY REDUCTION FOR DATA DRIVEN PROCESS MODELING." University of Cincinnati / OhioLINK, 2003. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1069770129.

Full text
5

XU, NUO. "AGGRESSIVE DIMENSIONALITY REDUCTION FOR DATA-DRIVEN MODELING." University of Cincinnati / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1178640357.

Full text
6

Law, Hiu Chung. "Clustering, dimensionality reduction, and side information." Diss., Michigan State University, 2006.

Find full text
Abstract:
Thesis (Ph. D.)--Michigan State University. Dept. of Computer Science & Engineering, 2006.
Title from PDF t.p. (viewed on June 19, 2009). Includes bibliographical references (p. 296-317). Also issued in print.
7

Ross, Ian. "Nonlinear dimensionality reduction methods in climate data analysis." Thesis, University of Bristol, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.492479.

Full text
Abstract:
Linear dimensionality reduction techniques, notably principal component analysis, are widely used in climate data analysis as a means to aid in the interpretation of datasets of high dimensionality. These linear methods may not be appropriate for the analysis of data arising from nonlinear processes occurring in the climate system. Numerous techniques for nonlinear dimensionality reduction have been developed recently that may provide a potentially useful tool for the identification of low-dimensional manifolds in climate data sets arising from nonlinear dynamics. In this thesis I apply three such techniques to the study of El Niño/Southern Oscillation variability in tropical Pacific sea surface temperatures and thermocline depth, comparing observational data with simulations from coupled atmosphere-ocean general circulation models from the CMIP3 multi-model ensemble.
8

Ray, Sujan. "Dimensionality Reduction in Healthcare Data Analysis on Cloud Platform." University of Cincinnati / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=ucin161375080072697.

Full text
9

Ha, Sook Shin. "Dimensionality Reduction, Feature Selection and Visualization of Biological Data." Diss., Virginia Tech, 2012. http://hdl.handle.net/10919/77169.

Full text
Abstract:
Due to the high dimensionality of most biological data, it is a difficult task to directly analyze, model and visualize the data to gain biological insight. Thus, dimensionality reduction becomes an imperative pre-processing step in analyzing and visualizing high-dimensional biological data. Two major approaches to dimensionality reduction in genomic analysis and biomarker identification studies are: Feature extraction, creating new features by combining existing ones based on a mapping technique; and feature selection, choosing an optimal subset of all features based on an objective function. In this dissertation, we show how our innovative reduction schemes effectively reduce the dimensionality of DNA gene expression data to extract biologically interpretable and relevant features which result in enhancing the biomarker identification process. To construct biologically interpretable features and facilitate Muscular Dystrophy (MD) subtypes classification, we extract molecular features from MD microarray data by constructing sub-networks using a novel integrative scheme which utilizes protein-protein interaction (PPI) network, functional gene sets information and mRNA profiling data. The workflow includes three major steps: First, by combining PPI network structure and gene-gene co-expression relationship into a new distance metric, we apply affinity propagation clustering (APC) to build gene sub-networks; secondly, we further incorporate functional gene sets knowledge to complement the physical interaction information; finally, based on the constructed sub-network and gene set features, we apply multi-class support vector machine (MSVM) for MD sub-type classification and highlight the biomarkers contributing to the sub-type prediction. The experimental results show that our scheme could construct sub-networks that are more relevant to MD than those constructed by the conventional approach. Furthermore, our integrative strategy substantially improved the prediction accuracy, especially for those ‘hard-to-classify' sub-types. Conventionally, pathway-based analysis assumes that genes in a pathway equally contribute to a biological function, thus assigning uniform weight to genes. However, this assumption has been proven incorrect and applying uniform weight in the pathway analysis may not be an adequate approach for tasks like molecular classification of diseases, as genes in a functional group may have different differential power. Hence, we propose to use different weights for the pathway analysis which resulted in the development of four weighting schemes. We applied them in two existing pathway analysis methods using both real and simulated gene expression data for pathways. Weighting changes pathway scoring and brings up some new significant pathways, leading to the detection of disease-related genes that are missed under uniform weight. To help us understand our MD expression data better and derive scientific insight from it, we have explored a suite of visualization tools. Particularly, for selected top performing MD sub-networks, we displayed the network view using Cytoscape; functional annotations using IPA and DAVID functional analysis tools; expression pattern using heat-map and parallel coordinates plot; and MD associated pathways using KEGG pathway diagrams. 
We also performed weighted MD pathway analysis, and identified overlapping sub-networks across different weight schemes and different MD subtypes using Venn Diagrams, which resulted in the identification of a new sub-network significantly associated with MD. All those graphically displayed data and information helped us understand our MD data and the MD subtypes better, resulting in the identification of several potentially MD associated biomarker pathways and genes.
Ph. D.
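The sub-network construction step described in the abstract above can be loosely illustrated with scikit-learn's affinity propagation on a precomputed similarity. The toy expression matrix, the stand-in PPI adjacency, and the 50/50 blending weights below are illustrative assumptions, not the dissertation's actual pipeline.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
n_genes, n_samples = 60, 25
expr = rng.standard_normal((n_genes, n_samples))   # stand-in expression data
ppi = rng.random((n_genes, n_genes)) < 0.05        # stand-in PPI adjacency
ppi = np.logical_or(ppi, ppi.T).astype(float)      # make it symmetric

# Blend co-expression with physical interaction into one similarity,
# loosely mirroring the combined distance metric the abstract mentions.
coexpr = np.corrcoef(expr)
similarity = 0.5 * coexpr + 0.5 * ppi

ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(similarity)
print("number of sub-networks:", len(set(labels)))
```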
10

Di, Ciaccio Lucio. "Feature selection and dimensionality reduction for supervised data analysis." Thesis, Massachusetts Institute of Technology, 2016. https://hdl.handle.net/1721.1/122827.

Full text
Abstract:
Thesis: S.M., Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, 2016.
Cataloged from the PDF version of the thesis. Includes bibliographical references (pages 103-106).
11

Ghodsi, Boushehri Ali. "Nonlinear Dimensionality Reduction with Side Information." Thesis, University of Waterloo, 2006. http://hdl.handle.net/10012/1020.

Full text
Abstract:
In this thesis, I look at three problems with important applications in data processing. Incorporating side information, provided by the user or derived from data, is a main theme of each of these problems.

This thesis makes a number of contributions. The first is a technique for combining different embedding objectives, which is then exploited to incorporate side information expressed in terms of transformation invariants known to hold in the data. It also introduces two different ways of incorporating transformation invariants in order to make new similarity measures. Two algorithms are proposed which learn metrics based on different types of side information. These learned metrics can then be used in subsequent embedding methods. Finally, it introduces a manifold learning algorithm that is useful when applied to sequential decision problems. In this case we are given action labels in addition to data points. Actions in the manifold learned by this algorithm have meaningful representations in that they are represented as simple transformations.
12

Gámez, López Antonio Juan. "Application of nonlinear dimensionality reduction to climate data for prediction." [S.l.] : [s.n.], 2006. http://opus.kobv.de/ubp/volltexte/2006/1095.

Full text
13

Kharal, Rosina. "Semidefinite Embedding for the Dimensionality Reduction of DNA Microarray Data." Thesis, University of Waterloo, 2006. http://hdl.handle.net/10012/2945.

Full text
Abstract:
Harnessing the power of DNA microarray technology requires the existence of analysis methods that accurately interpret microarray data. Current literature abounds with algorithms meant for the investigation of microarray data. However, there is a need for an efficient approach that combines different techniques of microarray data analysis and provides a viable solution to the dimensionality reduction of microarray data. Reducing the high dimensionality of microarray data is one approach in striving to better understand the information contained within the data. We propose a novel approach for dimensionality reduction of microarray data that effectively combines different techniques in the study of DNA microarrays. Our method, KAS (kernel alignment with semidefinite embedding), aids the visualization of microarray data in two dimensions and shows improvement over existing dimensionality reduction methods such as PCA, LLE and Isomap.
14

Hira, Zena Maria. "Dimensionality reduction methods for microarray cancer data using prior knowledge." Thesis, Imperial College London, 2016. http://hdl.handle.net/10044/1/33812.

Full text
Abstract:
Microarray studies are currently a very popular source of biological information. They allow the simultaneous measurement of hundreds of thousands of genes, drastically increasing the amount of data that can be gathered in a small amount of time and also decreasing the cost of producing such results. Large numbers of high-dimensional data sets are currently being generated, and there is an ongoing need to find ways to analyse them to obtain meaningful interpretations. Many microarray experiments are concerned with answering specific biological or medical questions regarding diseases and treatments. Cancer is one of the most popular research areas, and there is a plethora of data available requiring in-depth analysis. Although the analysis of microarray data has been thoroughly researched over the past ten years, new approaches still appear regularly and may lead to a better understanding of the available information. The size of modern data sets presents considerable difficulties to traditional methodologies based on hypothesis testing, and there is a move towards the use of machine learning in microarray data analysis. Two new methods of using prior genetic knowledge in machine learning algorithms have been developed, and their results are compared with existing methods. The prior knowledge consists of biological pathway data that can be found in on-line databases, and gene ontology terms. The first method, called 'a priori manifold learning', uses the prior knowledge when constructing a manifold for non-linear feature extraction. It was found to perform better than both linear principal components analysis (PCA) and the non-linear Isomap algorithm (without prior knowledge) in both classification accuracy and quality of the clusters. Both pathway and GO terms were used as prior knowledge, and the results showed that using GO terms can make the models over-fit the data. In the cases where the use of GO terms does not over-fit, the results are better than PCA, Isomap, and a priori manifold learning using pathways. The second method, called 'the feature selection over pathway segmentation algorithm', uses the pathway information to split a big dataset into smaller ones. Then, using AdaBoost, decision trees are constructed for each of the smaller sets, and the sets that achieve higher classification accuracy are identified. The individual genes in these subsets are assessed to determine their role in the classification process. Using data sets concerning chronic myeloid leukaemia (CML), two subsets based on pathways were found to be strongly associated with the response to treatment. Using a different data set of measurements on lower grade glioma (LGG) tumours, four informative gene sets were discovered. Further analysis based on the Gini importance measure identified a set of genes for each cancer type (CML, LGG) that could predict the response to treatment very accurately (> 90%). Moreover, a single gene that can accurately predict the response to CML treatment was identified.
15

Gámez, López Antonio Juan. "Application of nonlinear dimensionality reduction to climate data for prediction." Phd thesis, Universität Potsdam, 2006. http://opus.kobv.de/ubp/volltexte/2006/1095/.

Full text
Abstract:
This Thesis was devoted to the study of the coupled system composed by El Niño/Southern Oscillation and the Annual Cycle. More precisely, the work was focused on two main problems: 1. How to separate both oscillations into an affordable model for understanding the behaviour of the whole system. 2. How to model the system in order to achieve a better understanding of the interaction, as well as to predict future states of the system. We focused our efforts in the Sea Surface Temperature equations, considering that atmospheric effects were secondary to the ocean dynamics. The results found may be summarised as follows: 1. Linear methods are not suitable for characterising the dimensionality of the sea surface temperature in the tropical Pacific Ocean. Therefore they do not help to separate the oscillations by themselves. Instead, nonlinear methods of dimensionality reduction are proven to be better in defining a lower limit for the dimensionality of the system as well as in explaining the statistical results in a more physical way [1]. In particular, Isomap, a nonlinear modification of Multidimensional Scaling methods, provides a physically appealing method of decomposing the data, as it substitutes the euclidean distances in the manifold by an approximation of the geodesic distances. We expect that this method could be successfully applied to other oscillatory extended systems and, in particular, to meteorological systems. 2. A three dimensional dynamical system could be modeled, using a backfitting algorithm, for describing the dynamics of the sea surface temperature in the tropical Pacific Ocean. We observed that, although there were few data points available, we could predict future behaviours of the coupled ENSO-Annual Cycle system with an accuracy of less than six months, although the constructed system presented several drawbacks: few data points to input in the backfitting algorithm, untrained model, lack of forcing with external data and simplification using a close system. Anyway, ensemble prediction techniques showed that the prediction skills of the three dimensional time series were as good as those found in much more complex models. This suggests that the climatological system in the tropics is mainly explained by ocean dynamics, while the atmosphere plays a secondary role in the physics of the process. Relevant predictions for short lead times can be made using a low dimensional system, despite its simplicity. The analysis of the SST data suggests that nonlinear interaction between the oscillations is small, and that noise plays a secondary role in the fundamental dynamics of the oscillations [2]. A global view of the work shows a general procedure to face modeling of climatological systems. First, we should find a suitable method of either linear or nonlinear dimensionality reduction. Then, low dimensional time series could be extracted out of the method applied. Finally, a low dimensional model could be found using a backfitting algorithm in order to predict future states of the system.
The goal of this work is to predict the behaviour of the sea surface temperature in the tropical Pacific Ocean. Two important phenomena take place simultaneously in this region of the world: the annual cycle and El Niño. The annual cycle can be defined as an oscillation of physical variables (e.g. temperature, wind speed, sea level height) with a period of one year; that is, the behaviour of the ocean and the atmosphere is similar every twelve months (the summers of different years resemble each other more than the summer and winter of the same year). El Niño is an irregular oscillation: it alternately reaches high and low values, but not at fixed times like the annual cycle. Instead, El Niño may reach high values in one year and then take four, five or even seven years to reappear. Two phenomena taking place in the same region necessarily influence one another, yet very little is known about exactly how El Niño influences the annual cycle and vice versa. The goal of this work is, first, to focus on the sea surface temperature in order to analyse the whole system, and second, to reduce the set of temperature time series in the tropical Pacific Ocean to the smallest possible number, so as to simplify the system without losing essential information. This procedure resembles the analysis of a long oscillating spring moving gently around its resting position: although the spring is long, we can sketch the whole spring approximately if we know its highest points at a given moment, so only a few points are needed to characterise its state. The main problem in our case is to find the minimum number of points that suffices to describe both phenomena; this number turned out to be three. The next goal was to predict how the temperatures will evolve in time, given the current and past temperatures. It was observed that accurate predictions can be made up to six months ahead, while the temperature one year ahead is not predictable. An important result is that the predictions on short time scales are as good as those obtained by other authors with considerably more complicated methods. My conclusion is therefore that the coupled system of the annual cycle and El Niño can be predicted with simpler methods than those applied today.
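The contrast the thesis draws between linear reduction and Isomap's geodesic-distance embedding can be sketched with scikit-learn on a synthetic manifold; the S-curve data and parameter choices below are stand-ins, not the SST fields analysed in the thesis.

```python
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# A nonlinear synthetic manifold stands in for the climate data.
X, _ = make_s_curve(n_samples=1000, random_state=0)

# Isomap approximates geodesic distances along the manifold via a
# neighbourhood graph, then embeds them with classical MDS.
emb_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# Linear PCA, for contrast, can only capture directions of maximum variance.
emb_pca = PCA(n_components=2).fit_transform(X)
print(emb_iso.shape, emb_pca.shape)
```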
16

Carreira-Perpinan, Miguel Angel. "Continuous latent variable models for dimensionality reduction and sequential data reconstruction." Thesis, University of Sheffield, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.369991.

Full text
17

Sulecki, Nathan. "Characterizing Dimensionality Reduction Algorithm Performance in terms of Data Set Aspects." Ohio University Honors Tutorial College / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ouhonors1493397823307462.

Full text
18

Ingram, Stephen. "Practical considerations for Dimensionality Reduction : user guidance, costly distances, and document data." Thesis, University of British Columbia, 2013. http://hdl.handle.net/2429/45175.

Full text
Abstract:
In this thesis, we explore ways to make practical extensions to Dimensionality Reduction (DR) algorithms with the goal of addressing challenging, real-world cases. The first case we consider is how to provide guidance to users employing DR methods in their data analysis. We specifically target users who are not experts in the mathematical concepts behind DR algorithms. We first identify two levels of guidance: global and local. Global user guidance helps non-experts select and arrange a sequence of analysis algorithms. Local user guidance helps users select appropriate algorithm parameter choices and interpret algorithm output. We then present a software system, DimStiller, that incorporates both types of guidance, validating it on several use-cases. The second case we consider is that of using DR to analyze datasets consisting of documents. In order to modify DR algorithms to handle document datasets effectively, we first analyze the geometric structure of document datasets. Our analysis describes the ways document datasets differ from other kinds of datasets. We then leverage these geometric properties for speed and quality by incorporating ideas from text querying into DR and other algorithms for data analysis. We then present the Overview prototype, a proof-of-concept document analysis system. Overview synthesizes both the goals of designing systems for data analysts who are DR novices, and performing DR on document data. The third case we consider is that of costly distance functions, where the method used to derive the true proximity between two data points is computationally expensive. Using standard approaches to DR in this important use-case can result in either unnecessarily protracted runtimes or long periods of user monitoring. To address the case of costly distances, we develop an algorithm framework, Glint, which efficiently manages the number of distance function calculations for the Multidimensional Scaling class of DR algorithms. We then show that Glint implementations of Multidimensional Scaling algorithms achieve substantial speed improvements or remove the need for human monitoring.
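The costly-distance scenario that motivates Glint can be sketched, in a much simplified form, by computing each pairwise distance exactly once and handing the cached matrix to a standard MDS implementation; this is a generic illustration, not the Glint algorithm itself, and the data and metric below are arbitrary.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))

# When the distance function is expensive, compute each pairwise
# distance once, cache the matrix, and reuse it from then on.
D = squareform(pdist(X, metric="euclidean"))

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
emb = mds.fit_transform(D)
print(emb.shape)   # (200, 2)
```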
19

Barrett, Philip James. "Exploratory database visualisation : the application and assessment of data and dimensionality reduction." Thesis, Aston University, 1995. http://publications.aston.ac.uk/10634/.

Full text
Abstract:
This thesis describes the development of a complete data visualisation system for large tabular databases, such as those commonly found in a business environment. A state-of-the-art 'cyberspace cell' data visualisation technique was investigated and a powerful visualisation system using it was implemented. Although allowing databases to be explored and conclusions drawn, it had several drawbacks, the majority of which were due to the three-dimensional nature of the visualisation. A novel two-dimensional generic visualisation system, known as MADEN, was then developed and implemented, based upon a 2-D matrix of 'density plots'. MADEN allows an entire high-dimensional database to be visualised in one window, while permitting close analysis in 'enlargement' windows. Selections of records can be made and examined, and dependencies between fields can be investigated in detail. MADEN was used as a tool for investigating and assessing many data processing algorithms, firstly data-reducing (clustering) methods, then dimensionality-reducing techniques. These included a new 'directed' form of principal components analysis, several novel applications of artificial neural networks, and discriminant analysis techniques which illustrated how groups within a database can be separated. To illustrate the power of the system, MADEN was used to explore customer databases from two financial institutions, resulting in a number of discoveries which would be of interest to a marketing manager. Finally, the database of results from the 1992 UK Research Assessment Exercise was analysed. Using MADEN allowed both universities and disciplines to be graphically compared, and supplied some startling revelations, including empirical evidence of the 'Oxbridge factor'.
20

Landgraf, Andrew J. "Generalized Principal Component Analysis: Dimensionality Reduction through the Projection of Natural Parameters." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1437610558.

Full text
21

Lu, Tien-hsin. "SqueezeFit Linear Program: Fast and Robust Label-aware Dimensionality Reduction." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587156777565173.

Full text
22

Hu, Renjie. "Random neural networks for dimensionality reduction and regularized supervised learning." Diss., University of Iowa, 2019. https://ir.uiowa.edu/etd/6960.

Full text
Abstract:
This dissertation explores Random Neural Networks (RNNs) in several aspects, together with their applications. First, novel RNNs are proposed for dimensionality reduction and visualization: based on Extreme Learning Machines (ELMs) and Self-Organizing Maps (SOMs), a new method is created to identify the important variables and visualize the data. This technique reduces the curse of dimensionality, improves the interpretability of the visualization, and is tested on real nursing survey datasets. ELM-SOM+ is an autoencoder created to preserve the intrinsic quality of SOM while bringing continuity to the projection using two ELMs; this new methodology shows considerable improvement over SOM on real datasets. Second, as a supervised learning method, ELMs have been applied to a hierarchical multiscale method to bridge molecular dynamics to continua. The method is tested on simulation data and proven efficient for passing information from one scale to another. Lastly, the regularization of ELMs has been studied and a new regularization algorithm for ELMs is created using a modified Lanczos algorithm. The Lanczos ELM on average divides computational time by 20 and reduces the normalized MSE by 14% compared with regular ELMs.
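A minimal numpy sketch of the basic ELM idea referenced above: random, untrained hidden weights followed by an output layer solved in closed form by least squares. The toy regression task and layer sizes are assumptions, and this is not the dissertation's ELM-SOM+ or Lanczos variant.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)   # toy regression target

n_hidden = 100
W = rng.standard_normal((8, n_hidden))   # random input weights, never trained
b = rng.standard_normal(n_hidden)

H = np.tanh(X @ W + b)                   # random hidden-layer features

# The only trained parameters: output weights, via an explicit
# least-squares solution rather than iterative back-propagation.
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

y_hat = H @ beta
print("train MSE:", np.mean((y - y_hat) ** 2))
```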
23

Nsang, Augustine S. "An Empirical Study of Novel Approaches to Dimensionality Reduction and Applications." University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1312294067.

Full text
24

Cheriyadat, Anil Meerasa. "Limitations of principal component analysis for dimensionality-reduction for classification of hyperspectral data." Master's thesis, Mississippi State : Mississippi State University, 2003. http://library.msstate.edu/etd/show.asp?etd=etd-11072003-133109.

Full text
25

González, Valenzuela Ricardo Eugenio 1984. "Linear dimensionality reduction applied to SIFT and SURF feature descriptors." [s.n.], 2014. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275499.

Full text
Abstract:
Advisors: Hélio Pedrini, William Robson Schwartz
Master's dissertation (mestrado) - Universidade Estadual de Campinas, Instituto de Computação
Abstract: Robust local descriptors usually consist of high-dimensional feature vectors that describe distinctive characteristics of images. The high dimensionality of a feature vector incurs considerable costs in terms of computational time and storage requirements, which affects the performance of several tasks that employ feature vectors, such as matching, image retrieval and classification. To address these problems, it is possible to apply dimensionality reduction techniques by building a projection matrix that adequately explains the importance of the data in another basis. This dissertation aims at applying linear dimensionality reduction to SIFT and SURF descriptors. Its main objective is to demonstrate that, even at the risk of decreasing the accuracy of the feature vectors, dimensionality reduction can result in a satisfactory trade-off between computational time and storage. We perform the linear dimensionality reduction through Random Projections (RP), Independent Component Analysis (ICA), Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and Partial Least Squares (PLS) in order to create lower-dimensional feature vectors. This work evaluates such reduced feature vectors in a matching application, as well as their distinctiveness in an image retrieval application. The computational time and memory usage are then measured by comparing the original and the reduced feature vectors.
Master's degree in Computer Science (Mestre em Ciência da Computação)
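The storage/accuracy trade-off this dissertation measures can be sketched with scikit-learn's PCA on stand-in descriptor vectors; the 128-dimensional random data and the choice of 36 components below are illustrative assumptions, not the work's actual SIFT/SURF features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-ins for 128-dimensional SIFT descriptors.
descriptors = rng.random((5000, 128)).astype(np.float32)

# Project onto the top 36 principal components, trading some matching
# accuracy for smaller storage and faster descriptor comparisons.
pca = PCA(n_components=36).fit(descriptors)
reduced = pca.transform(descriptors).astype(np.float32)

print(descriptors.nbytes, "->", reduced.nbytes, "bytes")
print("variance retained:", pca.explained_variance_ratio_.sum())
```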
26

Venkataraman, Shilpa. "Exploiting Remotely Sensed Hyperspectral Data Via Spectral Band Grouping For Dimensionality Reduction And Multiclassifiers." MSSTATE, 2005. http://sun.library.msstate.edu/ETD-db/theses/available/etd-07052005-155324/.

Full text
Abstract:
To overcome the dimensionality curse of hyperspectral data, an investigation has been conducted into the use of grouping spectral bands, followed by feature-level fusion and classifier decision fusion, to develop an automated target recognition (ATR) system for data reduction and enhanced classification. The entire span of spectral bands in the hyperspectral data is subdivided into groups based on performance metrics. Feature extraction is done using both supervised and unsupervised methods. The effects of classifying the lower-dimensional data with parametric as well as non-parametric classifiers are studied. Further, multiclassifiers and decision-level fusion using Qualified Majority Voting are applied to the features extracted from each group. The effectiveness of the ATR system is tested using the hyperspectral signatures of a target class, Cogongrass (Imperata cylindrica), and a non-target class, Johnsongrass (Sorghum halepense). A comparison of target detection accuracies before and after decision fusion illustrates the influence of each group on the final decision and the benefits of using decision fusion with multiclassifiers. Hence, the ATR system designed can be used to detect a target class while significantly reducing the dimensionality of the data.
27

Varikuti, Deepthi [Verfasser]. "Evaluation and optimization of biologically meaningful dimensionality reduction approaches for MRI data / Deepthi Varikuti." Düsseldorf : Universitäts- und Landesbibliothek der Heinrich-Heine-Universität Düsseldorf, 2018. http://d-nb.info/1159767017/34.

Full text
28

Mrázek, Michal. "Data mining." Master's thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2019. http://www.nusl.cz/ntk/nusl-400441.

Full text
Abstract:
The aim of this master's thesis is the analysis of multidimensional data. Three dimensionality reduction algorithms are introduced. It is shown how to manipulate text documents using basic methods of natural language processing. The goal of the practical part of the thesis is to process real-world data from an internet forum. Posted messages are transformed to a numerical representation, then to a two-dimensional space, and visualized. Subsequently, the topics of the messages are discovered. In the last part, a few selected algorithms are compared.
29

Kliegr, Tomáš. "Clickstream Analysis." Master's thesis, Vysoká škola ekonomická v Praze, 2007. http://www.nusl.cz/ntk/nusl-2065.

Full text
Abstract:
The thesis introduces current research trends in clickstream analysis and proposes a new heuristic that could be used for dimensionality reduction of semantically enriched data in Web Usage Mining (WUM). Click fraud and conversion fraud are identified as key prospective application areas for WUM. The thesis documents a conversion-fraud vulnerability of Google Analytics and proposes a defense: new clickstream acquisition software that collects data in sufficient granularity and structure to allow data mining approaches to fraud detection. Three variants of K-means clustering algorithms and three association rule data mining systems are evaluated and compared on real-world web usage data.
30

Niskanen, M. (Matti). "A visual training based approach to surface inspection." Doctoral thesis, University of Oulu, 2003. http://urn.fi/urn:isbn:9514270673.

Full text
Abstract:
Training a visual inspection device is not straightforward; it suffers from the high variation in the material to be inspected. This variation causes major difficulties for a human, and this is directly reflected in classifier training. Many inspection devices utilize rule-based classifiers whose building and training rely mainly on human expertise. While designing such a classifier, a human tries to find the questions that would provide proper categorization. In training, an operator tunes the classifier parameters, aiming to achieve classification accuracy as good as possible. Such classifiers require a lot of time and expertise before they can be fully utilized. Supervised classifiers form another common category. These learn automatically from training material, but rely on labels that a human has set. However, these labels tend to be inconsistent and thus reduce the classification accuracy achieved. Furthermore, as class boundaries are learnt from training samples, they cannot in practice be adjusted later if needed. In this thesis, a visual training based method is presented. It avoids the problems related to traditional training methods by combining a classifier and a user interface. The method relies on unsupervised projection and provides an intuitive way to directly set and tune the class boundaries of high-dimensional data. As the method groups the data only by the similarities of its features, it is not affected by erroneous and inconsistent labelling of training samples. Furthermore, it does not require knowledge of the internal structure of the classifier or iterative parameter tuning, where a combination of parameter values leading to the desired class boundaries is sought. On the contrary, the class boundaries can be set directly by changing the classification parameters. The time needed to take such a classifier into use is small, and the class boundaries can be tuned even on-line, if needed. The proposed method is tested with various experiments in this thesis. Different projection methods are evaluated from the point of view of visual based training. The method is further evaluated using a self-organizing map (SOM) as the projection method and wood as the test material. Parameters such as accuracy, map size, and speed are measured and discussed, and overall the method is found to be an advantageous training and classification scheme.
31

Villa, Alberto. "Advanced spectral unmixing and classification methods for hyperspectral remote sensing data." Phd thesis, Université de Grenoble, 2011. http://tel.archives-ouvertes.fr/tel-00767250.

Full text
Abstract:
This thesis proposes new techniques for the classification and spectral unmixing of images obtained by hyperspectral remote sensing. The problems associated with the data (notably the very high dimensionality and the presence of mixed pixels) were considered, and innovative techniques were proposed to solve them. New advanced classification methods, based on the use of traditional dimensionality reduction methods and on the integration of spatial information, were developed. In addition, spectral unmixing methods were used jointly to improve the classification obtained with traditional methods, also giving the possibility of improving the spatial resolution of the classification maps through the use of sub-pixel information. The work followed a logical progression, with the following steps: 1. Basic observation: to improve the classification of hyperspectral imagery, the problems associated with the data must be considered: very high dimensionality and the presence of mixed pixels. 2. Can advanced classification methods be developed based on the use of traditional dimensionality reduction methods (ICA or others)? 3. How can the different types of contextual information typical of satellite images be used? 4. Can the information provided by spectral unmixing methods be used to propose new dimensionality reduction chains? 5. Can spectral unmixing methods be used jointly to improve the classification obtained with traditional methods? 6. Can the spatial resolution of the classification maps be improved through the use of sub-pixel information? The different proposed methods were tested on several real data sets, showing results comparable to or better than most of the methods presented in the literature.
32

Todorov, Hristo [Verfasser]. "Pattern analysis, dimensionality reduction and hypothesis testing in high-dimensional data from animal studies with small sample sizes / Hristo Todorov." Mainz : Universitätsbibliothek der Johannes Gutenberg-Universität Mainz, 2020. http://d-nb.info/1224895347/34.

Full text
33

Chao, Roger. "Data analysis for Systematic Literature Reviews." Thesis, Linnéuniversitetet, Institutionen för informatik (IK), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-105122.

Full text
Abstract:
Systematic Literature Reviews (SLR) are a powerful research tool to identify and select literature to answer a certain question. However, an approach to extract the inherent analytical data in Systematic Literature Reviews' multi-dimensional datasets has been lacking, and previous Systematic Literature Review tools do not provide such analytical insight. Therefore, this thesis aims to provide a useful approach comprising various algorithms and data treatment techniques to give the user analytical insight into their data that is not evident in the bare execution of a Systematic Literature Review. For this goal, a literature review has been conducted to find the most relevant techniques to extract data from multi-dimensional data sets, and the aforementioned approach has been tested on a survey regarding Self-Adaptive Systems (SAS) using a web application. As a result, we identify the most adequate techniques to incorporate into the approach this thesis provides.
34

Curti, Nico. "Implementazione e benchmarking dell'algoritmo QDANet PRO per l'analisi di big data genomici." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/12018/.

Full text
Abstract:
Given the recent advent of NGS technologies, capable of sequencing entire human genomes at reduced times and costs, the ability to extract information from the data plays a fundamental role in the development of research. The computational problems connected to such analyses currently fall under the topic of Big Data, with databases containing several types of experimental data of ever-increasing size. This thesis deals with the implementation and benchmarking of the QDANet PRO algorithm, developed by the Biophysics group of the University of Bologna: the method allows the processing of high-dimensional data in order to extract a low-dimensional signature of features with high classification performance, through an analysis pipeline that includes dimensionality reduction algorithms. The method can also be generalised to the analysis of non-biological data characterised by high volume and complexity, factors typical of Big Data. The QDANet PRO algorithm evaluates the performance of all possible pairs of features, estimating their discriminant power using a Naive Bayes quadratic classifier and then determining their ranking. Once a performance threshold has been selected, a network of the features is built, from which the connected components are determined. Each subgraph is analysed separately and reduced with methods based on network theory until the final signature is extracted. The method, previously tested with positive results on some datasets available to the research group, was compared with results obtained on omics databases available in the literature, which constitute a reference in the field, and with existing algorithms performing similar tasks. To reduce computational times, the algorithm was implemented in C++ on HPC, with the most critical parts parallelised using OpenMP libraries.
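The pairwise-scoring and connected-components stages described above can be sketched as follows. Note that scikit-learn's quadratic discriminant analysis stands in for the thesis's Naive Bayes quadratic classifier, and the toy data, threshold, and cross-validation settings are arbitrary assumptions, not QDANet PRO itself.

```python
import itertools

import networkx as nx
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 12))      # toy omics matrix
y = rng.integers(0, 2, size=80)        # binary labels

# Score every feature pair with a quadratic classifier.
scores = {}
for i, j in itertools.combinations(range(X.shape[1]), 2):
    acc = cross_val_score(QuadraticDiscriminantAnalysis(),
                          X[:, [i, j]], y, cv=3).mean()
    scores[(i, j)] = acc

# Keep pairs above a performance threshold, build a feature network,
# and take its connected components as candidate sub-signatures.
threshold = 0.55
g = nx.Graph()
g.add_edges_from(pair for pair, acc in scores.items() if acc >= threshold)
print([sorted(c) for c in nx.connected_components(g)])
```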
35

Cesarini, Ettore. "Stima streaming di sottospazi principali." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/17897/.

Full text
Abstract:
In the current technological landscape, where data dimensionality is growing exponentially, algorithms that enable efficient signal processing even on devices with limited computing power and memory have become necessary. In this work a real-world setting was considered, for which two algorithmic methods were sought that could guarantee an efficient estimate of the signal energy in a streaming fashion while reducing the dimensionality of the data. Drawing inspiration from the state of the art, two streaming principal-subspace estimation methods were chosen: HPCA and SEPS. In the simulations, synthetic datasets were generated specifically to obtain a suitable tuning of the parameters of each algorithm. The effectiveness of the parameter choices was then evaluated by applying the individual algorithms to the real setting. In the same setting, the two methods were then compared directly, evaluating their performance in terms of convergence dynamics and convergence value, computational costs, and robustness to non-stationary phenomena. The results favoured the HPCA algorithm in terms of performance and robustness, at the price of a higher computational complexity and memory footprint than the second algorithm. SEPS made the tuning phase a much more delicate step, owing to the high sensitivity of its learning rate; however, after suitable parametrisation it excelled in computational cost and memory footprint, proving highly flexible and simple to implement, though with a lower convergence speed. For the purposes of this work, both HPCA and SEPS proved a correct and satisfactory choice for the streaming estimation of principal subspaces, with the second preferred for the performance obtained in the specific setting analysed here.
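Neither HPCA nor SEPS is reproduced here, but the generic streaming principal-subspace idea they share can be sketched with Oja's rule in numpy, updating a unit-norm direction one sample at a time; the stream statistics and learning rate are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr = 20, 0.01
w = rng.standard_normal(d)
w /= np.linalg.norm(w)                    # initial unit-norm estimate

# Anisotropic stream: the first coordinate carries most of the energy,
# so the true top principal direction is the first axis.
scale = np.ones(d)
scale[0] = 5.0

for _ in range(20_000):
    x = scale * rng.standard_normal(d)    # one sample at a time
    y = w @ x
    w += lr * y * (x - y * w)             # Oja's rule update
    w /= np.linalg.norm(w)                # keep the estimate on the sphere

print("alignment with true top direction:", abs(w[0]))
```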
36

Gritsenko, Andrey. "Bringing interpretability and visualization with artificial neural networks." Diss., University of Iowa, 2017. https://ir.uiowa.edu/etd/5764.

Full text
Abstract:
Extreme Learning Machine (ELM) is a training algorithm for the Single-Layer Feed-forward Neural Network (SLFN). In theory, ELM differs from other training algorithms in the existence of an explicitly-given solution due to the immutability of the initialized weights. In practice, ELMs achieve performance similar to that of other state-of-the-art training techniques while taking much less time to train a model; experiments show that the speedup of training an ELM is up to five orders of magnitude compared to the standard error back-propagation algorithm. ELM is a recently discovered technique that has proved its efficiency in classic regression and classification tasks, including multi-class cases. In this thesis, extensions of ELMs to problems that are non-typical for Artificial Neural Networks (ANNs) are presented. The first extension, described in the third chapter, allows ELMs to produce probabilistic outputs for multi-class classification problems. The standard way of solving this type of problem is based on a 'majority vote' over the classifier's raw outputs. This approach can raise issues if the penalty for misclassification differs between classes; in such cases, having probability outputs would be more useful. In the scope of this extension, two methods are proposed, and an alternative way of interpreting probabilistic outputs is introduced. The ELM method proves useful for non-linear dimensionality reduction and visualization, based on repetitive re-training and re-evaluation of the model. The fourth chapter introduces adaptations of ELM-based visualization for classification and regression tasks. A set of experiments has been conducted to prove that these adaptations provide better visualization results, which can then be used to perform classification or regression on previously unseen samples. Shape registration of 3D models with non-isometric distortion is an open problem in 3D computer graphics and computational geometry. The fifth chapter discusses a novel approach for solving this problem by introducing a similarity metric for spectral descriptors. Practically, this approach has been implemented in two methods. The first one utilizes a Siamese neural network to embed the original spectral descriptors into a lower-dimensional metric space, for which the Euclidean distance provides a good measure of similarity. The second method uses Extreme Learning Machines to learn the similarity metric directly on the original spectral descriptors. Over a set of experiments, the consistency of the proposed approach for solving the deformable registration problem has been proven.
37

Gheyas, Iffat A. "Novel computationally intelligent machine learning algorithms for data mining and knowledge discovery." Thesis, University of Stirling, 2009. http://hdl.handle.net/1893/2152.

Full text
Abstract:
This thesis addresses three major issues in data mining: feature subset selection in large-dimensionality domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm, SAGA. SAGA combines the ability of Simulated Annealing to avoid being trapped in local minima with the very high convergence rate of the crossover operator of Genetic Algorithms, the strong local search ability of greedy algorithms, and the high computational efficiency of generalized regression neural networks (GRNN). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble. The proposed ensemble consists of a committee of Generalized Regression Neural Networks (GRNNs) trained on different subsets of features generated by SAGA, and the predictions of the base classifiers are combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features which make it stand out amongst ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) GRNN is used for both the base classifiers and the top-level combiner classifier. Because of GRNN, the proposed ensemble is a dynamic weighting scheme, in contrast to existing ensemble approaches that rely on simple voting and static weighting. The basic idea of the dynamic weighting procedure is to give a higher reliability weight to those scenarios that are similar to the new ones. The simulation results demonstrate the validity of the proposed ensemble model.
38

Hanna, Peter, and Erik Swartling. "Anomaly Detection in Time Series Data using Unsupervised Machine Learning Methods: A Clustering-Based Approach." Thesis, KTH, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-273630.

Full text
Abstract:
For many companies in the manufacturing industry, the attempt to find damages in their products is a vital process, especially during the production phase. Since applying different machine learning techniques can further aid the process of damage identification, it has become a popular choice among companies to make use of these methods to enhance the production process even further. For some industries, damage identification can be heavily linked with anomaly detection of different measurements. In this thesis, the aim is to construct unsupervised machine learning models to identify anomalies in unlabeled measurements of pumps, using high-frequency sampled current and voltage time series data. Each measurement can be split up into five different phases, namely the startup phase, three duty-point phases, and lastly the shutdown phase. The approach is based on clustering methods, where the main algorithms used are the density-based algorithms DBSCAN and LOF. Dimensionality reduction techniques, such as feature extraction and feature selection, are applied to the data, and after constructing the five models, one per phase, it can be seen that the models identify anomalies in the given data set.
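A minimal sketch of the clustering-based anomaly detection this thesis describes, using scikit-learn's DBSCAN and LOF on synthetic data (the pump measurements themselves are not reproduced here); eps, min_samples, and n_neighbors are illustrative choices, not the thesis's tuned values.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.5, size=(300, 4))     # stand-in feature vectors
outliers = rng.uniform(-4, 4, size=(10, 4))      # injected anomalies
X = np.vstack([normal, outliers])

# DBSCAN labels points in sparse regions as noise (-1).
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# LOF scores points by local density relative to their neighbours.
lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)  # -1 = anomaly

print("DBSCAN anomalies:", np.sum(db_labels == -1))
print("LOF anomalies:   ", np.sum(lof_labels == -1))
```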
APA, Harvard, Vancouver, ISO, and other styles
39

Chen, Beichen, and Amy Jinxin Chen. "PCA based dimensionality reduction of MRI images for training support vector machine to aid diagnosis of bipolar disorder." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-259621.

Full text
Abstract:
This study aims to investigate how dimensionality reduction of neuroimaging data prior to training support vector machines (SVMs) affects the classification accuracy of bipolar disorder. This study uses principal component analysis (PCA) for dimensionality reduction. An open source data set of 19 bipolar and 31 control structural magnetic resonance imaging (sMRI) samples was used, part of the UCLA Consortium for Neuropsychiatric Phenomics LA5c Study funded by the NIH Roadmap Initiative, which aims to foster breakthroughs in the development of novel treatments for neuropsychiatric disorders. The images underwent smoothing, feature extraction and PCA before they were used as input to train SVMs. 3-fold cross-validation was used to tune a number of hyperparameters for linear, radial and polynomial kernels. Experiments investigated the performance of SVM models trained on 1 to 29 principal components (PCs). Several PC sets reached 100% accuracy in the final evaluation, the minimal set being the first two principal components. The accumulated variance explained by the PCs used showed no correlation with model performance. The choice of kernel and hyperparameters is of utmost importance, as the performance obtained can vary greatly. The results support previous findings that SVMs can be useful in aiding the diagnosis of bipolar disorder, and that PCA as a dimensionality reduction method in combination with SVMs may be appropriate for the classification of neuroimaging data for illnesses not limited to bipolar disorder. Due to the limitation of a small sample size, the results call for future research using larger collaborative data sets to validate the accuracies obtained.
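The PCA-then-SVM pipeline with 3-fold cross-validated kernel and hyperparameter tuning can be sketched as follows; the random features stand in for the sMRI voxel data, and the parameter grid is an assumption, not the study's exact search space:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 1000))                 # 50 subjects, many voxel-level features
y = np.array([0] * 31 + [1] * 19)               # 31 controls, 19 bipolar

pipe = Pipeline([("pca", PCA()), ("svm", SVC())])
grid = {
    "pca__n_components": [2, 10, 29],
    "svm__kernel": ["linear", "rbf", "poly"],
    "svm__C": [0.1, 1, 10],
}
search = GridSearchCV(pipe, grid, cv=3).fit(X, y)
print(search.best_params_, search.best_score_)
```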
APA, Harvard, Vancouver, ISO, and other styles
40

Bahri, Maroua. "Improving IoT data stream analytics using summarization techniques." Electronic Thesis or Diss., Institut polytechnique de Paris, 2020. http://www.theses.fr/2020IPPAT017.

Full text
Abstract:
With the evolution of technology, the use of smart Internet-of-Things (IoT) devices, sensors, and social networks results in an overwhelming volume of IoT data streams, generated daily from several applications, that can be transformed into valuable information through machine learning tasks. In practice, multiple critical issues arise when extracting useful knowledge from these evolving data streams, chiefly that the stream needs to be handled and processed efficiently. In this context, this thesis aims to improve the performance (in terms of memory and time) of existing data mining algorithms on streams. We focus on the classification task in the streaming framework. The task is challenging on streams, principally because of the high, and increasing, data dimensionality, in addition to the potentially infinite amount of data; together, these two aspects make classification harder. The first part of the thesis surveys the current state of the art of classification and dimensionality reduction techniques as applied to the stream setting, providing an updated view of the most recent works in this vibrant area. In the second part, we detail our contributions to the field of classification in streams, developing novel approaches based on summarization techniques that aim to reduce the computational resources of existing classifiers with no, or minor, loss of classification accuracy. To address high-dimensional data streams and make classifiers efficient, we incorporate an internal preprocessing step that incrementally reduces the dimensionality of input instances before feeding them to the learning stage. We present several such approaches: Naive Bayes enhanced with sketches and the hashing trick, k-NN using compressed sensing and UMAP, and integrations of these into ensemble methods.
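One of the summarization ideas above, reducing each arriving instance with the hashing trick before an incremental Naive Bayes learner, can be sketched as follows; the stream contents and the 256-dimension hash width are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.naive_bayes import MultinomialNB

hasher = FeatureHasher(n_features=2 ** 8, alternate_sign=False)  # fixed-width output
clf = MultinomialNB()
rng = np.random.default_rng(3)

for batch_id in range(10):                       # simulate mini-batches of a stream
    batch = [{f"f{rng.integers(10_000)}": 1.0 for _ in range(20)} for _ in range(32)]
    y = rng.integers(0, 2, size=32)
    X_hashed = hasher.transform(batch)           # sparse 256-dim representation
    clf.partial_fit(X_hashed, y, classes=[0, 1]) # incremental, one pass per batch

print(clf.predict(hasher.transform([{"f42": 1.0}])))
```

Whatever the native dimensionality of the stream, the classifier only ever sees a constant-size vector, which is the memory argument made in the abstract.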
APA, Harvard, Vancouver, ISO, and other styles
41

Henriksson, William. "High dimensional data clustering; A comparative study on gene expressions : Experiment on clustering algorithms on RNA-sequence from tumors with evaluation on internal validation." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-17492.

Full text
Abstract:
In cancer research, class discovery is the first step in investigating a new data set for hidden groups of objects with similar attributes. However, data sets from gene expression experiments, whether RNA microarray or RNA-sequence, are high-dimensional, which makes it hard to perform cluster analysis and to obtain well-separated clusters. Well-separated clusters are desirable because they indicate that objects are most likely not placed in the wrong clusters. This report investigates experimentally whether K-Means and hierarchical clustering are suitable for clustering gene expressions in RNA-sequence data from various tumors, and whether applying dimensionality reduction methods helps create well-separated clusters. The results show that well-separated clusters are only achieved by using PCA for dimensionality reduction together with K-Means on correlation. The main contribution of this paper is the finding that K-Means or hierarchical clustering on the full natural dimensionality of RNA-sequence data yields an unsatisfactory average silhouette width, below 0.4.
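A minimal sketch of the experiment's winning configuration, PCA followed by K-Means with the average silhouette width as internal validation, might look like this; random data stands in for the RNA-sequence expression matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 20000))               # samples x genes (high-dimensional)

X_red = PCA(n_components=10).fit_transform(X)   # reduce before clustering
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_red)

print("average silhouette width:", silhouette_score(X_red, labels))
```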
APA, Harvard, Vancouver, ISO, and other styles
42

Paiva, José Gustavo de Souza. "Técnicas computacionais de apoio à classificação visual de imagens e outros dados." Universidade de São Paulo, 2012. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-02042013-084718/.

Full text
Abstract:
Automatic data classification in general, and image classification in particular, are computationally intensive tasks with variable results concerning precision, being considerably dependent on the classifier's configuration and the data representation used. Many of the factors that affect an adequate application of classification or categorization methods for images point to the need for more user interference in the process. To accomplish that, it is necessary to develop a larger set of supporting tools for the various stages of the classification setup, such as, but not limited to, feature extraction, parametrization of the classification algorithm and selection of adequate training instances. This doctoral thesis presents a Visual Image Classification methodology based on the user's insertion into the classification process through the use of visualization techniques. The idea is to allow the user to participate in all classification steps, adjusting several stages and consequently improving the results according to his or her needs. A study of several candidate visualization techniques is presented, with emphasis on similarity trees, and improvements of the tree construction algorithm, in both visual and time scalability, are shown. Additionally, a visual semi-supervised dimensionality reduction methodology was developed to support, through the use of visual tools, the creation of reduced spaces that improve the segregation found in the original feature space. The main contribution of this work is an incremental visual classification system incorporating all the steps of the proposed methodology and providing interactive visual tools that permit user-controlled classification of an incremental collection with evolving class configuration. It allows human knowledge to be used in the construction of classifiers that adapt to different user needs in different scenarios, producing satisfactory results for several data collections. The focus of this thesis is image data sets, with examples also given for the classification of textual collections.
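As a generic stand-in for the similarity trees discussed above (not Paiva's specific construction algorithm), a similarity tree over image feature vectors can be built from pairwise distances with hierarchical linkage:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
features = rng.normal(size=(60, 128))           # e.g. 60 images, 128-d descriptors

# Average-linkage dendrogram as a simple similarity tree over the collection.
tree = to_tree(linkage(pdist(features), method="average"))
print("tree root id:", tree.get_id(), "- leaves:", tree.get_count())
```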
APA, Harvard, Vancouver, ISO, and other styles
43

Marion, Damien. "Multidimensionality of the models and the data in the side-channel domain." Thesis, Paris, ENST, 2018. http://www.theses.fr/2018ENST0056/document.

Full text
Abstract:
Since the publication in 1999 of the seminal paper of Paul C. Kocher, Joshua Jaffe and Benjamin Jun, entitled "Differential Power Analysis", side-channel attacks have proved to be effective ways to attack cryptographic algorithms. Indeed, information extracted from side channels, such as execution time, power consumption or electromagnetic emanations, can be used to recover secret keys. In this context, we first treat the problem of dimensionality reduction: over the last twenty years, the complexity and the size of the data extracted from side channels have kept growing, and reducing these data shortens attack time and increases attack efficiency. The proposed dimensionality reduction handles complex leakage models of arbitrary dimension. Second, a software leakage assessment methodology is proposed; it is based on the analysis of all the data manipulated during the execution of the software under evaluation. The proposed methodology provides features that speed up and increase the efficiency of the analysis, especially in the case of white-box cryptography implementations.
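A common concrete instance of dimensionality reduction in side-channel analysis is projecting long traces onto a few principal components; the sketch below simulates traces with a tiny injected leakage and is an illustration only, not the thesis's multidimensional leakage model:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
leak = rng.integers(0, 256, size=2000).astype(float)   # hypothetical intermediate values
traces = rng.normal(size=(2000, 1000))                 # 2000 traces x 1000 time samples
traces[:, 123] += 0.05 * (leak - leak.mean())          # inject a small dependency

scores = PCA(n_components=10).fit_transform(traces)    # 1000 -> 10 dimensions

# Rank components by absolute correlation with the known intermediate values.
corr = [abs(np.corrcoef(scores[:, i], leak)[0, 1]) for i in range(10)]
print("most leakage-correlated component:", int(np.argmax(corr)))
```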
APA, Harvard, Vancouver, ISO, and other styles
44

Brunet, Anne-Claire. "Développement d'outils statistiques pour l'analyse de données transcriptomiques par les réseaux de co-expression de gènes." Thesis, Toulouse 3, 2016. http://www.theses.fr/2016TOU30373/document.

Full text
Abstract:
Today, new biotechnologies offer the opportunity to collect a large variety and volume of biological data (genomic, proteomic, metagenomic...), thus opening up new avenues for research into biological processes. In this thesis, we are specifically interested in transcriptomic data, which characterize the activity or expression level of several thousands of genes in a given cell. The aim of this thesis was to propose statistical tools suited to analysing these high-dimensional data (n << p).
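The abstract above is truncated in the source record, but the thesis title concerns gene co-expression networks; as background, the usual starting point is to threshold a gene-gene correlation matrix and read off connected modules, sketched here on synthetic data:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(7)
expr = rng.normal(size=(100, 200))              # 100 samples x 200 genes

corr = np.corrcoef(expr.T)                      # gene-gene correlation matrix
adj = (np.abs(corr) > 0.3) & ~np.eye(200, dtype=bool)  # threshold, drop self-loops
n_modules, labels = connected_components(csr_matrix(adj), directed=False)
print("co-expression modules:", n_modules)
```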
APA, Harvard, Vancouver, ISO, and other styles
45

Li, Lei. "Fast Algorithms for Mining Co-evolving Time Series." Research Showcase @ CMU, 2011. http://repository.cmu.edu/dissertations/112.

Full text
Abstract:
Time series data arise in many applications, from motion capture and environmental monitoring to temperatures in data centers and physiological signals in health care. In this thesis, I focus on learning and mining large collections of co-evolving sequences, with the goal of developing fast algorithms for finding patterns, summarization, and anomalies. In particular, the thesis answers the following recurring challenges for time series: 1. Forecasting and imputation: how to forecast and recover missing values in time series data? 2. Pattern discovery and summarization: how to identify patterns in time sequences that facilitate further mining tasks such as compression, segmentation and anomaly detection? 3. Similarity and feature extraction: how to extract compact and meaningful features from multiple co-evolving sequences that enable better clustering and similarity queries? 4. Scale-up: how to handle large data sets on modern computing hardware? We develop models to mine time series with missing values, to extract compact representations from time sequences, to segment the sequences, and to forecast. For large-scale data, we propose algorithms for learning time series models, in particular Linear Dynamical Systems (LDS) and Hidden Markov Models (HMMs). We also develop a distributed algorithm for finding patterns in large web-click streams. The thesis also presents special models and algorithms that incorporate domain knowledge. For motion capture, we describe natural motion stitching and occlusion filling for human motion: we provide a metric for evaluating the naturalness of motion stitching, based on which we choose the best stitch, and thanks to domain knowledge (body structure and bone lengths) our algorithm recovers occlusions in mocap sequences with better accuracy and over longer missing periods. We also develop an algorithm for forecasting thermal conditions in a warehouse-sized data center; the forecast helps control and manage the data center in an energy-efficient way, which can save a significant percentage of the electric power consumed in data centers.
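Since Linear Dynamical Systems are central here, a minimal scalar Kalman filter sketch shows the forecasting-and-imputation machinery in its simplest form; the model parameters are assumed, not learned as in the thesis:

```python
import numpy as np

A, C, Q, R = 0.99, 1.0, 0.01, 0.5             # transition, emission, noise variances
rng = np.random.default_rng(8)

x, zs = 0.0, []
for _ in range(200):                          # simulate a latent AR(1) plus noise
    x = A * x + rng.normal(scale=Q ** 0.5)
    zs.append(C * x + rng.normal(scale=R ** 0.5))

mu, P = 0.0, 1.0                              # filtered mean and variance
for z in zs:                                  # standard Kalman predict/update recursion
    mu, P = A * mu, A * P * A + Q             # predict
    K = P * C / (C * P * C + R)               # Kalman gain
    mu, P = mu + K * (z - C * mu), (1 - K * C) * P

forecast = []
for _ in range(5):                            # iterate the transition to forecast ahead
    mu = A * mu
    forecast.append(mu)
print("5-step forecast:", forecast)
```

Skipping the update step for a missing observation and keeping the prediction is the basic imputation move in this family of models.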
APA, Harvard, Vancouver, ISO, and other styles
46

Lindgren, Mona, and Anders Sivertsson. "Visualizing the Body Language of a Musical Conductor using Gaussian Process Latent Variable Models : Creating a visualization tool for GP-LVM modelling of motion capture data and investigating an angle based model for dimensionality reduction." Thesis, KTH, Skolan för teknikvetenskap (SCI), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-195692.

Full text
Abstract:
In this bachelor's thesis we investigate and visualize a Gaussian process latent variable model (GP-LVM), used to model high-dimensional motion capture data of a musical conductor in a lower-dimensional space. This work expands upon the degree project of K. Karipidou, "Modelling the body language of a musical conductor using Gaussian Process Latent Variable Models", in which GP-LVMs are used to perform dimensionality reduction of motion capture data of a conductor conducting a string quartet while expressing four different underlying emotional interpretations (tender, angry, passionate and neutral). In Karipidou's work, a GP-LVM coupled with K-means and an HMM is used to classify unseen conducting motions into the aforementioned emotional interpretations. We develop a graphical user interface (GUI) for visualizing the lower-dimensional mapping produced by a GP-LVM side by side with the motion capture data. The GUI and the GP-LVM mapping are implemented in Matlab, while the open-source 3D creation suite Blender is used to render the motion capture data in greater detail for import into the GUI. Furthermore, we develop a new GP-LVM in the same manner as Karipidou, but based on the angles between the motion capture nodes, and compare its accuracy in classifying emotion to that of Karipidou's location-based model. The evaluation concludes that the GUI is a very useful tool for examining and evaluating a GP-LVM. However, our angle-based model does not improve the classification result compared to Karipidou's position-based one; thus, the use of Euler angles is deemed inappropriate for this application. Keywords: Gaussian process latent variable model, motion capture, visualization, body language, musical conductor, Euler angles.
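The angle-based representation investigated above can be illustrated by converting absolute 3-D joint positions into joint angles; this simplified bone-angle computation (not a full Euler decomposition, and with a hypothetical joint layout) conveys the idea:

```python
import numpy as np

def joint_angle(parent, joint, child):
    """Angle (radians) at `joint` between the bones joint->parent and joint->child."""
    u, v = parent - joint, child - joint
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

rng = np.random.default_rng(9)
frames = rng.normal(size=(100, 3, 3))         # 100 frames x (shoulder, elbow, wrist) x xyz
elbow_angles = np.array([joint_angle(f[0], f[1], f[2]) for f in frames])
print(elbow_angles[:5])
```

Unlike raw positions, such angles are invariant to where the conductor stands, which is the motivation for the representation even though it did not pay off here.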
APA, Harvard, Vancouver, ISO, and other styles
47

Hindawi, Mohammed. "Sélection de variables pour l’analyse des données semi-supervisées dans les systèmes d’Information décisionnels." Thesis, Lyon, INSA, 2013. http://www.theses.fr/2013ISAL0015/document.

Full text
Abstract:
Feature selection is an important task in data mining and machine learning and is well understood in both the supervised and unsupervised contexts. Semi-supervised feature selection, by contrast, is still under development and far from mature, even though machine learning on partially labeled data is now well established; feature selection has therefore gained special importance in the semi-supervised context, which better matches real-world applications where labeling is costly. In this thesis, we present a literature review of semi-supervised feature selection with regard to the supervised and unsupervised contexts. The goal is to show the importance of compromising between the structure of the unlabeled part of the data and the background information carried by its labeled part. In particular, we are interested in the so-called "small labeled-sample problem", where the labeled part of the data is far smaller than the unlabeled part. To deal with semi-supervised feature selection, we propose two groups of approaches. The first group is of the filter type, in which algorithms evaluate the relevance of features with a scoring function; in our case, this function is based on spectral graph theory and the integration of pairwise constraints that can be extracted from the data at hand. The second group of methods is of the embedded type, where feature selection becomes an internal function integrated into the learning process. To realize embedded feature selection, we propose feature-weighting algorithms that rely on constrained clustering, developed in two visions: (1) a global vision based on relaxed satisfaction of pairwise constraints, achieved by integrating the constraints into the objective function of the proposed clustering model; and (2) a local vision based on strict control of constraint violation. Both approaches evaluate the relevance of features through weights learned during the construction of the clustering model. Beyond the main task of feature selection, we also address redundancy elimination, proposing a novel algorithm that combines mutual information with a maximum-spanning-tree search: the tree is built over the relevant features in order to optimize their final number. Finally, all the approaches developed in this thesis are analyzed in terms of algorithmic complexity and validated on high-dimensional data against well-known methods from the literature.
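For the filter-type scoring described above, the classical Laplacian score is the unconstrained spectral-graph baseline; the sketch below computes it on synthetic data and omits the thesis's pairwise-constraint extensions (lower scores indicate more locality-preserving, i.e. more relevant, features):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(10)
X = rng.normal(size=(200, 30))

W = kneighbors_graph(X, n_neighbors=5, mode="connectivity").toarray()
W = np.maximum(W, W.T)                        # symmetrize the kNN affinity graph
D = np.diag(W.sum(axis=1))
L = D - W                                     # unnormalized graph Laplacian

scores = []
for r in range(X.shape[1]):
    f = X[:, r]
    f = f - (f @ D.sum(axis=1)) / D.sum()     # remove the D-weighted mean
    scores.append((f @ L @ f) / (f @ D @ f))  # Laplacian score of feature r
print("top 5 features:", np.argsort(scores)[:5])
```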
APA, Harvard, Vancouver, ISO, and other styles
48

Morvan, Anne. "Contributions to unsupervised learning from massive high-dimensional data streams : structuring, hashing and clustering." Thesis, Paris Sciences et Lettres (ComUE), 2018. http://www.theses.fr/2018PSLED033/document.

Full text
Abstract:
This thesis focuses on how to perform unsupervised machine learning efficiently, namely the fundamentally linked tasks of nearest neighbor search and clustering, under time and space constraints on high-dimensional datasets. First, a new theoretical framework reduces the space cost and increases the throughput of the data-independent Cross-polytope LSH for approximate nearest neighbor search with almost no loss of accuracy. Second, a novel streaming, data-dependent method is designed to learn compact binary codes from high-dimensional data points in only one pass. Besides some theoretical guarantees, the quality of the obtained embeddings is assessed on the approximate nearest neighbor search task. Finally, a space-efficient, parameter-free clustering algorithm is conceived, based on the recovery of an approximate Minimum Spanning Tree of the sketched data dissimilarity graph, on which suitable cuts are performed.
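The clustering contribution can be illustrated, minus its sketching and streaming layers, by building a minimum spanning tree of the pairwise-distance graph and cutting its heaviest edge; the data and the single-cut choice are illustrative assumptions:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(11)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()
keep = np.ones(len(mst.data), dtype=bool)
keep[np.argmax(mst.data)] = False             # cut the single heaviest MST edge
pruned = coo_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])), shape=mst.shape)

n_clusters, labels = connected_components(pruned, directed=False)
print("clusters found:", n_clusters)          # expected: 2
```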
APA, Harvard, Vancouver, ISO, and other styles
49

Duan, Haoyang. "Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease." Thèse, Université d'Ottawa / University of Ottawa, 2014. http://hdl.handle.net/10393/31113.

Full text
Abstract:
From a fresh data science perspective, this thesis discusses the prediction of coronary artery disease based on Single-Nucleotide Polymorphisms (SNPs) from the Ontario Heart Genomics Study (OHGS). First, the thesis explains the k-Nearest Neighbour (k-NN) and Random Forest learning algorithms, and includes a complete proof that k-NN is universally consistent in finite-dimensional normed vector spaces. Second, the thesis introduces two dimensionality reduction techniques: Random Projections and a new method termed Mass Transportation Distance (MTD) Feature Selection. The thesis then compares the performance of Random Projections with k-NN against MTD Feature Selection with Random Forest for predicting coronary artery disease. Results demonstrate that MTD Feature Selection with Random Forest is superior to Random Projections with k-NN: Random Forest obtains an accuracy of 0.6660 and an area under the ROC curve (AUC) of 0.8562 on the OHGS dataset when 3335 SNPs are selected by MTD Feature Selection for classification. This AUC is considerably better than the previous high score of 0.608 obtained by Davies et al. in 2010 on the same dataset.
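The two routes compared above can be reproduced in miniature on synthetic SNP-like data: Johnson-Lindenstrauss random projection followed by k-NN versus a Random Forest, both evaluated by ROC AUC. All data and parameters here are illustrative, and plain Random Forest stands in for the MTD-selected variant:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(12)
X = rng.integers(0, 3, size=(400, 5000)).astype(float)   # SNP-like 0/1/2 matrix
y = (X[:, :10].sum(axis=1) + rng.normal(size=400) > 10).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

proj = GaussianRandomProjection(n_components=100, random_state=0).fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=5).fit(proj.transform(X_tr), y_tr)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("RP + k-NN AUC:", roc_auc_score(y_te, knn.predict_proba(proj.transform(X_te))[:, 1]))
print("Random Forest AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```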
APA, Harvard, Vancouver, ISO, and other styles
50

Jr, Juscelino Izidoro de Oliveira. "SELEÇÃO DE VARIÁVEIS NA MINERAÇÃO DE DADOS AGRÍCOLAS:Uma abordagem baseada em análise de componentes principais." UNIVERSIDADE ESTADUAL DE PONTA GROSSA, 2012. http://tede2.uepg.br/jspui/handle/prefix/152.

Full text
Abstract:
Multivariate data analysis allows the researcher to examine the interaction of many attributes that can influence the behavior of a response variable. Such analysis uses models that can be induced from experimental data sets. An important issue in the induction of multivariate regressors and classifiers is the sample size, because it determines the reliability of the model for regression or classification of the response variable. This work approaches the sample size issue through the theory of Probably Approximately Correct (PAC) learning, which originates in machine learning problems concerning the induction of models. Given the importance of agricultural modelling, this work presents two procedures for variable selection. Variable Selection by Principal Component Analysis is an unsupervised procedure that allows the researcher to select the most relevant variables from agricultural data by considering the variation in the data. Variable Selection by Supervised Principal Component Analysis is a supervised procedure that performs the same process, but concentrates the selection on the variables with the greatest influence on the behavior of the response variable. Both procedures allow sample complexity information to be exploited in the variable selection process. The procedures were tested in five experiments, showing that the supervised procedure induced models with better scores, on average, than models induced on variables selected by the unsupervised procedure. The experiments also showed that the variables selected by both procedures exhibited reduced levels of multicollinearity.
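One common way to realize PCA-based variable selection, keeping the original variables with the largest absolute loadings on the leading components, is sketched below; it mirrors the general idea above but is not necessarily the thesis's exact procedure, and the data are synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(13)
X = rng.normal(size=(150, 25))                # e.g. agronomic measurements

pca = PCA(n_components=3).fit(StandardScaler().fit_transform(X))
# Importance of each variable = its maximum absolute loading across the top PCs.
importance = np.abs(pca.components_).max(axis=0)
selected = np.argsort(importance)[::-1][:8]   # keep the 8 highest-loading variables
print("selected variable indices:", sorted(selected.tolist()))
```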
APA, Harvard, Vancouver, ISO, and other styles
