Dissertations / Theses on the topic 'Data Dimensionality Reduction'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations / theses for your research on the topic 'Data Dimensionality Reduction.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Vamulapalli, Harika Rao. "On Dimensionality Reduction of Data." ScholarWorks@UNO, 2010. http://scholarworks.uno.edu/td/1211.
Full text
Widemann, David P. "Dimensionality reduction for hyperspectral data." College Park, Md.: University of Maryland, 2008. http://hdl.handle.net/1903/8448.
Full text
Thesis research directed by: Dept. of Mathematics. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
Baldiwala, Aliakbar. "Dimensionality Reduction for Commercial Vehicle Fleet Monitoring." Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/38330.
Full text
DWIVEDI, SAURABH. "DIMENSIONALITY REDUCTION FOR DATA DRIVEN PROCESS MODELING." University of Cincinnati / OhioLINK, 2003. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1069770129.
Full text
XU, NUO. "AGGRESSIVE DIMENSIONALITY REDUCTION FOR DATA-DRIVEN MODELING." University of Cincinnati / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1178640357.
Full text
Law, Hiu Chung. "Clustering, dimensionality reduction, and side information." Diss., Connect to online resource - MSU authorized users, 2006.
Find full text
Title from PDF t.p. (viewed on June 19, 2009). Includes bibliographical references (p. 296-317). Also issued in print.
Ross, Ian. "Nonlinear dimensionality reduction methods in climate data analysis." Thesis, University of Bristol, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.492479.
Full text
Ray, Sujan. "Dimensionality Reduction in Healthcare Data Analysis on Cloud Platform." University of Cincinnati / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=ucin161375080072697.
Full text
Ha, Sook Shin. "Dimensionality Reduction, Feature Selection and Visualization of Biological Data." Diss., Virginia Tech, 2012. http://hdl.handle.net/10919/77169.
Full text
Ph. D.
Di Ciaccio, Lucio. "Feature selection and dimensionality reduction for supervised data analysis." Thesis, Massachusetts Institute of Technology, 2016. https://hdl.handle.net/1721.1/122827.
Full text
Cataloged from PDF version of thesis. Includes bibliographical references (pages 103-106).
by Lucio Di Ciaccio.
S.M., Massachusetts Institute of Technology, Department of Aeronautics and Astronautics
Ghodsi, Boushehri Ali. "Nonlinear Dimensionality Reduction with Side Information." Thesis, University of Waterloo, 2006. http://hdl.handle.net/10012/1020.
Full text
This thesis makes a number of contributions. The first is a technique for combining different embedding objectives, which is then exploited to incorporate side information expressed in terms of transformation invariants known to hold in the data. It also introduces two different ways of incorporating transformation invariants in order to make new similarity measures. Two algorithms are proposed which learn metrics based on different types of side information. These learned metrics can then be used in subsequent embedding methods. Finally, it introduces a manifold learning algorithm that is useful when applied to sequential decision problems. In this case we are given action labels in addition to data points. Actions in the manifold learned by this algorithm have meaningful representations in that they are represented as simple transformations.
Gámez, López Antonio Juan. "Application of nonlinear dimensionality reduction to climate data for prediction." [S.l.] : [s.n.], 2006. http://opus.kobv.de/ubp/volltexte/2006/1095.
Full text
Kharal, Rosina. "Semidefinite Embedding for the Dimensionality Reduction of DNA Microarray Data." Thesis, University of Waterloo, 2006. http://hdl.handle.net/10012/2945.
Full text
Hira, Zena Maria. "Dimensionality reduction methods for microarray cancer data using prior knowledge." Thesis, Imperial College London, 2016. http://hdl.handle.net/10044/1/33812.
Full text
Gámez, López Antonio Juan. "Application of nonlinear dimensionality reduction to climate data for prediction." PhD thesis, Universität Potsdam, 2006. http://opus.kobv.de/ubp/volltexte/2006/1095/.
Full text
The aim of this work is to predict the behaviour of the sea temperature in the tropical Pacific Ocean. Two important phenomena take place simultaneously in this part of the world: the annual cycle and El Niño. The annual cycle can be defined as an oscillation of physical variables (e.g. temperature, wind speed, sea level height) with a period of one year; that is, the behaviour of the ocean and the atmosphere is similar every twelve months (summers of different years resemble each other more than the summer and winter of the same year). El Niño is an irregular oscillation: it alternates between high and low values, but not at fixed times like the annual cycle. Instead, El Niño may reach high values in one year and then take four, five, or even seven years to reappear. Two phenomena occurring in the same region necessarily influence each other; nevertheless, very little is known about exactly how El Niño affects the annual cycle, and vice versa. The goal of this work is, first, to focus on the sea temperature in order to analyse the whole system, and second, to reduce all temperature time series in the tropical Pacific Ocean to the smallest possible number, so as to simplify the system without losing essential information. This procedure resembles the analysis of a long oscillating spring that moves slightly around its rest position: although the spring is long, we can approximately sketch the whole spring if we know its highest points at a given instant, so only a few points are needed to characterise its state. The main problem in our case is to find the minimum number of points sufficient to describe both phenomena. This number was found to be three.
After this part, the goal was to predict how the temperatures will evolve in time given the current and past temperatures. It was observed that an accurate prediction can be made up to six months ahead or less, and that the temperature one year ahead is not predictable. An important result is that the short-term predictions are as good as those obtained by other authors with considerably more complicated methods. I therefore argue that the combined system of annual cycle and El Niño can be predicted with simpler methods than those applied today.
Carreira-Perpinan, Miguel Angel. "Continuous latent variable models for dimensionality reduction and sequential data reconstruction." Thesis, University of Sheffield, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.369991.
Full text
Sulecki, Nathan. "Characterizing Dimensionality Reduction Algorithm Performance in terms of Data Set Aspects." Ohio University Honors Tutorial College / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ouhonors1493397823307462.
Full text
Ingram, Stephen. "Practical considerations for Dimensionality Reduction : user guidance, costly distances, and document data." Thesis, University of British Columbia, 2013. http://hdl.handle.net/2429/45175.
Full text
Barrett, Philip James. "Exploratory database visualisation : the application and assessment of data and dimensionality reduction." Thesis, Aston University, 1995. http://publications.aston.ac.uk/10634/.
Full text
Landgraf, Andrew J. "Generalized Principal Component Analysis: Dimensionality Reduction through the Projection of Natural Parameters." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1437610558.
Full text
Lu, Tien-hsin. "SqueezeFit Linear Program: Fast and Robust Label-aware Dimensionality Reduction." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587156777565173.
Full text
Hu, Renjie. "Random neural networks for dimensionality reduction and regularized supervised learning." Diss., University of Iowa, 2019. https://ir.uiowa.edu/etd/6960.
Full text
Nsang, Augustine S. "An Empirical Study of Novel Approaches to Dimensionality Reduction and Applications." University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1312294067.
Full text
Cheriyadat, Anil Meerasa. "Limitations of principal component analysis for dimensionality-reduction for classification of hyperspectral data." Master's thesis, Mississippi State : Mississippi State University, 2003. http://library.msstate.edu/etd/show.asp?etd=etd-11072003-133109.
Full text
González, Valenzuela Ricardo Eugenio 1984. "Linear dimensionality reduction applied to SIFT and SURF feature descriptors." [s.n.], 2014. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275499.
Full text
Master's dissertation (Dissertação de mestrado) - Universidade Estadual de Campinas, Instituto de Computação
Abstract: Robust local descriptors usually consist of high-dimensional feature vectors that describe distinctive characteristics of images. The high dimensionality of a feature vector incurs considerable costs in terms of computational time and storage requirements, which affects the performance of several tasks that employ feature vectors, such as matching, image retrieval and classification. To address these problems, it is possible to apply dimensionality reduction techniques, building a projection matrix that adequately captures the importance of the data in another basis. This dissertation aims at applying linear dimensionality reduction to SIFT and SURF descriptors. Its main objective is to demonstrate that, even at the risk of decreasing the accuracy of the feature vectors, dimensionality reduction can result in a satisfactory trade-off between computational time and storage. We perform the linear dimensionality reduction through Random Projections (RP), Independent Component Analysis (ICA), Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and Partial Least Squares (PLS) in order to create lower-dimensional feature vectors. This work evaluates such reduced feature vectors in a matching application, as well as their distinctiveness in an image retrieval application. The computational time and memory usage are then measured by comparing the original and the reduced feature vectors.
Master's degree in Computer Science (Mestre em Ciência da Computação)
Venkataraman, Shilpa. "Exploiting Remotely Sensed Hyperspectral Data Via Spectral Band Grouping For Dimensionality Reduction And Multiclassifiers." MSSTATE, 2005. http://sun.library.msstate.edu/ETD-db/theses/available/etd-07052005-155324/.
Full text
Varikuti, Deepthi [Verfasser]. "Evaluation and optimization of biologically meaningful dimensionality reduction approaches for MRI data / Deepthi Varikuti." Düsseldorf : Universitäts- und Landesbibliothek der Heinrich-Heine-Universität Düsseldorf, 2018. http://d-nb.info/1159767017/34.
Full text
Mrázek, Michal. "Data mining." Master's thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2019. http://www.nusl.cz/ntk/nusl-400441.
Full text
Kliegr, Tomáš. "Clickstream Analysis." Master's thesis, Vysoká škola ekonomická v Praze, 2007. http://www.nusl.cz/ntk/nusl-2065.
Full text
Niskanen, M. (Matti). "A visual training based approach to surface inspection." Doctoral thesis, University of Oulu, 2003. http://urn.fi/urn:isbn:9514270673.
Full text
Villa, Alberto. "Advanced spectral unmixing and classification methods for hyperspectral remote sensing data." PhD thesis, Université de Grenoble, 2011. http://tel.archives-ouvertes.fr/tel-00767250.
Full text
Todorov, Hristo [Verfasser]. "Pattern analysis, dimensionality reduction and hypothesis testing in high-dimensional data from animal studies with small sample sizes / Hristo Todorov." Mainz : Universitätsbibliothek der Johannes Gutenberg-Universität Mainz, 2020. http://d-nb.info/1224895347/34.
Full text
Chao, Roger. "Data analysis for Systematic Literature Reviews." Thesis, Linnéuniversitetet, Institutionen för informatik (IK), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-105122.
Full text
Curti, Nico. "Implementazione e benchmarking dell'algoritmo QDANet PRO per l'analisi di big data genomici." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/12018/.
Full text
Cesarini, Ettore. "Stima streaming di sottospazi principali." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/17897/.
Full text
Gritsenko, Andrey. "Bringing interpretability and visualization with artificial neural networks." Diss., University of Iowa, 2017. https://ir.uiowa.edu/etd/5764.
Full text
Gheyas, Iffat A. "Novel computationally intelligent machine learning algorithms for data mining and knowledge discovery." Thesis, University of Stirling, 2009. http://hdl.handle.net/1893/2152.
Full text
Hanna, Peter, and Erik Swartling. "Anomaly Detection in Time Series Data using Unsupervised Machine Learning Methods: A Clustering-Based Approach." Thesis, KTH, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-273630.
Full text
For many companies in the manufacturing industry, fault detection in products is a fundamental task in the production process. Since various machine learning methods have proven to contain useful techniques for finding faults in products, these methods are a popular choice among companies that wish to further improve their production processes. For some industries, fault detection is strongly linked to anomaly detection in various measurements. The aim of this thesis is to construct unsupervised machine learning models to identify anomalies in time series data. More specifically, the data consists of high-frequency measurements of pumps via current and voltage readings. The measurements consist of five phases: the start-up phase, three load phases, and the shutdown phase. The machine learning methods are based on clustering techniques, and the methods used are the DBSCAN and LOF algorithms. In addition, various dimensionality reduction techniques were applied, and after constructing five models, one for each phase, it can be concluded that the models succeeded in identifying anomalies in the given data set.
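The unsupervised anomaly detection this entry describes can be illustrated with a much simpler stand-in for its DBSCAN/LOF pipeline: a k-nearest-neighbour distance score, where points far from their neighbours get high scores. The data below is synthetic (a tight cluster plus two injected outliers), not pump measurements.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic "measurements": a tight cluster plus two far-off outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])
X = np.vstack([normal, outliers])

def knn_distance_score(X, k=5):
    """Anomaly score = mean distance to the k nearest neighbours."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # a point is not its own neighbour
    nearest = np.sort(d, axis=1)[:, :k]
    return nearest.mean(axis=1)

scores = knn_distance_score(X)
# The two injected outliers (rows 200 and 201) get the largest scores.
print(np.argsort(scores)[-2:])
```

LOF refines this idea by comparing each point's neighbour distances to those of its neighbours, which makes the score density-relative rather than absolute.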
Chen, Beichen, and Amy Jinxin Chen. "PCA based dimensionality reduction of MRI images for training support vector machine to aid diagnosis of bipolar disorder." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-259621.
Full text
The aim of this study is to investigate how dimensionality reduction of neuroimaging data prior to training support vector machines (SVMs) affects classification accuracy for bipolar disorder. The study uses principal component analysis (PCA) for dimensionality reduction. A dataset of 19 bipolar and 31 healthy magnetic resonance imaging (MRI) scans was used, belonging to the open data source from the UCLA Consortium for Neuropsychiatric Phenomics LA5c study, funded by the NIH Roadmap Initiative with the aim of fostering breakthroughs in the development of new treatments for neuropsychiatric disorders. The images underwent blurring, feature extraction and PCA before being used as input to train the SVMs. With 3-fold cross-validation, a number of parameters were tuned for linear, radial and polynomial kernels. Experiments explored the performance of SVM models trained with 1 to 29 principal components (PCs). Several PC sets achieved 100% accuracy in the final evaluation, the smallest set being the first two PCs. The cumulative variance over the number of PCs used showed no correlation with model performance. The choice of kernel and hyperparameters is significant, as performance can vary greatly. The results support previous studies suggesting that SVMs can be useful as an aid in diagnosing bipolar disorder, and that PCA as a dimensionality reduction method combined with SVMs may be suitable for classifying neuroimaging data for bipolar and other disorders. Due to the limited number of data samples, the results call for future research with a larger dataset to validate the accuracies obtained.
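The PCA-then-classify pipeline this entry describes can be sketched as follows. The data is synthetic (random vectors with shifted class means, sized 31 healthy vs. 19 bipolar to mirror the study), and a nearest-centroid rule stands in for the SVM, so this shows only the structure of the approach, not its results.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic stand-ins for flattened MRI feature vectors: two classes with
# shifted means in a 100-dimensional space (real scans are far larger).
healthy = rng.normal(0.0, 1.0, size=(31, 100))
bipolar = rng.normal(0.8, 1.0, size=(19, 100))
X = np.vstack([healthy, bipolar])
y = np.array([0] * 31 + [1] * 19)

# PCA via SVD, keeping the first 2 principal components, the smallest
# set that reached 100% accuracy in the study.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
Z = (X - mean) @ Vt[:2].T

# Nearest-centroid classifier in the reduced space, a simple stand-in
# for the kernel SVM used in the thesis.
centroids = np.array([Z[y == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(np.linalg.norm(Z[:, None] - centroids[None], axis=-1), axis=1)
print((pred == y).mean())  # training accuracy on the toy data
```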
Bahri, Maroua. "Improving IoT data stream analytics using summarization techniques." Electronic Thesis or Diss., Institut polytechnique de Paris, 2020. http://www.theses.fr/2020IPPAT017.
Full text
With the evolution of technology, the use of smart Internet-of-Things (IoT) devices, sensors, and social networks results in an overwhelming volume of IoT data streams, generated daily from several applications, that can be transformed into valuable information through machine learning tasks. In practice, multiple critical issues arise when extracting useful knowledge from these evolving data streams, mainly that the stream needs to be efficiently handled and processed. In this context, this thesis aims to improve the performance (in terms of memory and time) of existing data mining algorithms on streams. We focus on the classification task in the streaming framework. The task is challenging on streams, principally due to the high -- and increasing -- data dimensionality, in addition to the potentially infinite amount of data. These two aspects make the classification task harder. The first part of the thesis surveys the current state-of-the-art classification and dimensionality reduction techniques as applied to the stream setting, providing an updated view of the most recent work in this vibrant area. In the second part, we detail our contributions to the field of classification in streams, developing novel approaches based on summarization techniques that aim to reduce the computational resources of existing classifiers with no -- or minor -- loss of classification accuracy. To address high-dimensional data streams and make classifiers efficient, we incorporate an internal preprocessing step that reduces the dimensionality of input data incrementally before feeding it to the learning stage. We present several approaches applied to several classification tasks: Naive Bayes, enhanced with sketches and the hashing trick; k-NN, using compressed sensing and UMAP; and we also integrate them into ensemble methods.
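The incremental preprocessing step this entry describes requires a reducer that never needs refitting. A Gaussian random projection, a close relative of the sketching and compressed-sensing techniques the abstract names, has exactly that property: the projection matrix is drawn once, and each arriving instance is reduced independently. A minimal sketch, with dimensions chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 10_000, 64  # original and reduced dimensionality (illustrative)

# Draw one fixed Gaussian projection matrix up front; every arriving
# instance can then be reduced independently, with no refitting --
# the property that makes random projection suitable for streams.
R = rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))

X = rng.normal(size=(100, d))   # stand-in for 100 stream instances
Z = X @ R.T                     # each instance reduced to 64 dimensions

# Johnson-Lindenstrauss: squared norms are preserved in expectation.
ratio = (Z ** 2).sum(axis=1) / (X ** 2).sum(axis=1)
print(Z.shape, round(ratio.mean(), 2))
```

A streaming classifier such as Naive Bayes or k-NN would then be trained on `Z` rows as they arrive, instead of the raw 10,000-dimensional input.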
Henriksson, William. "High dimensional data clustering; A comparative study on gene expressions : Experiment on clustering algorithms on RNA-sequence from tumors with evaluation on internal validation." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-17492.
Full textPaiva, José Gustavo de Souza. "Técnicas computacionais de apoio à classificação visual de imagens e outros dados." Universidade de São Paulo, 2012. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-02042013-084718/.
Full text
Automatic data classification in general, and image classification in particular, are computationally intensive tasks with variable results concerning precision, being considerably dependent on the classifier's configuration and data representation. Many of the factors that affect an adequate application of classification or categorization methods for images point to the need for more user interference in the process. To accomplish that, it is necessary to develop a larger set of supporting tools for the various stages of the classification setup, such as, but not limited to, feature extraction, parametrization of the classification algorithm and selection of adequate training instances. This doctoral Thesis presents a Visual Image Classification methodology based on the user's insertion in the classification process through the use of visualization techniques. The idea is to allow the user to participate in all classification steps, adjusting several stages and consequently improving the results according to his or her needs. A study on several candidate visualization techniques is presented, with emphasis on similarity trees, and improvements of the tree construction algorithm, both in visual and time scalability, are shown. Additionally, a visual semi-supervised dimensionality reduction methodology was developed to support, through the use of visual tools, the creation of reduced spaces that improve the segregation of the original feature space. The main contribution of this work is an incremental visual classification system incorporating all the steps of the proposed methodology, providing interactive and visual tools that permit user-controlled classification of an incremental collection with an evolving class configuration. It allows the use of human knowledge in the construction of classifiers that adapt to different user needs in different scenarios, producing satisfactory results for several data collections.
The focus of this Thesis is image data sets, with examples also in the classification of textual collections.
Marion, Damien. "Multidimensionality of the models and the data in the side-channel domain." Thesis, Paris, ENST, 2018. http://www.theses.fr/2018ENST0056/document.
Full text
Since the publication in 1999 of the seminal paper by Paul C. Kocher, Joshua Jaffe and Benjamin Jun, entitled "Differential Power Analysis", side-channel attacks have been proved to be efficient ways to attack cryptographic algorithms. Indeed, it has been revealed that information extracted from side-channels such as the execution time, the power consumption or the electromagnetic emanations can be used to recover secret keys. In this context, we propose, first, to treat the problem of dimensionality reduction. Indeed, over the last twenty years, the complexity and the size of the data extracted from the side-channels have not stopped growing, which is why reducing these data decreases the time and increases the efficiency of the attacks. The dimension reduction is proposed for complex leakage models and any dimension. Second, a software leakage assessment methodology is proposed; it is based on the analysis of all the data manipulated during the execution of the software. The proposed methodology provides features that speed up and increase the efficiency of the analysis, especially in the case of white-box cryptography.
Brunet, Anne-Claire. "Développement d'outils statistiques pour l'analyse de données transcriptomiques par les réseaux de co-expression de gènes." Thesis, Toulouse 3, 2016. http://www.theses.fr/2016TOU30373/document.
Full text
Today, new biotechnologies offer the opportunity to collect a large variety and volume of biological data (genomic, proteomic, metagenomic...), thus opening up new avenues for research into biological processes. In this thesis, we are specifically interested in transcriptomic data, indicative of the activity or expression level of several thousand genes in a given cell. The aim of this thesis was to propose proper statistical tools to analyse these high-dimensional data (n<
Li, Lei. "Fast Algorithms for Mining Co-evolving Time Series." Research Showcase @ CMU, 2011. http://repository.cmu.edu/dissertations/112.
Full textLindgren, Mona, and Anders Sivertsson. "Visualizing the Body Language of a Musical Conductor using Gaussian Process Latent Variable Models : Creating a visualization tool for GP-LVM modelling of motion capture data and investigating an angle based model for dimensionality reduction." Thesis, KTH, Skolan för teknikvetenskap (SCI), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-195692.
Full textHindawi, Mohammed. "Sélection de variables pour l’analyse des données semi-supervisées dans les systèmes d’Information décisionnels." Thesis, Lyon, INSA, 2013. http://www.theses.fr/2013ISAL0015/document.
Full text
Feature selection is an important task in data mining and machine learning processes. This task is well known in both supervised and unsupervised contexts. Semi-supervised feature selection, however, is still under development and far from being mature. In general, machine learning has been well developed to deal with partially-labeled data, and feature selection has thus gained special importance in the semi-supervised context. It is well adapted to real-world applications, where the labeling process is costly. In this thesis, we present a literature review on semi-supervised feature selection, with regard to the supervised and unsupervised contexts. The goal is to show the importance of compromising between the structure from the unlabeled part of the data and the background information from its labeled part. In particular, we are interested in the so-called "small labeled-sample problem", where the labeled part of the data is very small compared with the unlabeled part. In order to deal with the problem of semi-supervised feature selection, we propose two groups of approaches. The first group is of "filter" type, in which we propose algorithms that evaluate the relevance of features by a scoring function. In our case, this function is based on spectral graph theory and the integration of pairwise constraints which can be extracted from the data at hand. The second group of methods is of "embedded" type, where feature selection becomes an internal function integrated into the learning process. In order to realize embedded feature selection, we propose algorithms based on feature weighting. The proposed methods rely on constrained clustering. In this sense, we propose two visions: (1) a global vision, based on relaxed satisfaction of pairwise constraints, achieved by integrating the constraints into the objective function of the proposed clustering model; and (2) a local vision, based on strict control of constraint violation.
Both approaches evaluate the relevance of features by weights which are learned during the construction of the clustering model. In addition to the main task, which is feature selection, we are interested in redundancy elimination. In order to tackle this problem, we propose a novel algorithm combining mutual information with a maximum spanning tree-based algorithm. We construct this tree from the relevant features in order to optimize the number of selected features. Finally, all methods proposed in this thesis are analyzed and their complexities are studied. Furthermore, they are validated on high-dimensional data against other representative methods in the literature.
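The maximum-spanning-tree step this entry mentions can be sketched with Kruskal's algorithm run heaviest-edge-first. The pairwise "mutual information" weights below are toy numbers invented for illustration; the thesis would compute them from the data.

```python
# Hedged sketch: Kruskal's algorithm, taking heaviest edges first, yields
# a maximum spanning tree over features weighted by pairwise scores.
def maximum_spanning_tree(n, edges):
    """edges: list of (weight, u, v); returns the tree's edge list."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges, reverse=True):  # heaviest first
        ru, rv = find(u), find(v)
        if ru != rv:              # adding the edge creates no cycle
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# Toy pairwise scores between 4 features (hypothetical values).
edges = [(0.9, 0, 1), (0.2, 0, 2), (0.6, 1, 2), (0.5, 2, 3), (0.1, 1, 3)]
mst = maximum_spanning_tree(4, edges)
print(mst)  # [(0.9, 0, 1), (0.6, 1, 2), (0.5, 2, 3)]
```

Redundancy elimination would then prune this tree, keeping a feature only when its tree neighbours do not already carry the same information.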
Morvan, Anne. "Contributions to unsupervised learning from massive high-dimensional data streams : structuring, hashing and clustering." Thesis, Paris Sciences et Lettres (ComUE), 2018. http://www.theses.fr/2018PSLED033/document.
Full text
This thesis focuses on how to efficiently perform unsupervised machine learning, such as the fundamentally linked nearest neighbor search and clustering tasks, under time and space constraints for high-dimensional datasets. First, a new theoretical framework reduces the space cost and increases the data throughput of data-independent Cross-polytope LSH for approximate nearest neighbor search with almost no loss of accuracy. Second, a novel streaming data-dependent method is designed to learn compact binary codes from high-dimensional data points in only one pass. Besides some theoretical guarantees, the quality of the obtained embeddings is assessed on the approximate nearest neighbor search task. Finally, a space-efficient parameter-free clustering algorithm is conceived, based on the recovery of an approximate Minimum Spanning Tree of the sketched data dissimilarity graph, on which suitable cuts are performed.
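The compact binary codes this entry discusses can be illustrated with the classic sign-random-projection form of LSH: each bit is the sign of a projection onto a fixed random direction, so nearby points share most bits. This is a generic stand-in on synthetic data, not the data-dependent one-pass method of the thesis.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n_bits = 20, 16

# Fixed random hyperplane normals: each of the 16 bits is the sign of
# the projection onto one direction, so nearby points share most bits.
H = rng.normal(size=(dim, n_bits))

def code(X):
    return (X @ H > 0).astype(np.uint8)

X = rng.normal(size=(50, dim))
codes = code(X)

# A slightly perturbed copy of each point should flip almost no bits.
perturbed = code(X + 0.01 * rng.normal(size=X.shape))
hamming_near = (codes != perturbed).sum(axis=1).mean()
print(codes.shape, hamming_near)
```

Approximate nearest neighbor search then compares 16-bit codes by Hamming distance instead of comparing 20-dimensional real vectors.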
Duan, Haoyang. "Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease." Thèse, Université d'Ottawa / University of Ottawa, 2014. http://hdl.handle.net/10393/31113.
Full textJr, Juscelino Izidoro de Oliveira. "SELEÇÃO DE VARIÁVEIS NA MINERAÇÃO DE DADOS AGRÍCOLAS:Uma abordagem baseada em análise de componentes principais." UNIVERSIDADE ESTADUAL DE PONTA GROSSA, 2012. http://tede2.uepg.br/jspui/handle/prefix/152.
Full textCoordenação de Aperfeiçoamento de Pessoal de Nível Superior
Multivariate data analysis allows the researcher to examine the interaction among many attributes that can influence the behavior of a response variable. Such analysis uses models that can be induced from experimental data sets. An important issue in the induction of multivariate regressors and classifiers is the sample size, because this determines the reliability of the model for regression or classification of the response variable. This work approaches the sample size issue through Probably Approximately Correct (PAC) learning theory, which originates from machine learning problems of model induction. Given the importance of agricultural modeling, this work presents two procedures for variable selection. Variable Selection by Principal Component Analysis is an unsupervised procedure and allows the researcher to select the most relevant variables from agricultural data by considering the variation in the data. Variable Selection by Supervised Principal Component Analysis is a supervised procedure and allows the researcher to perform the same process, but focuses the selection on the variables with the greatest influence on the behavior of the response variable. Both procedures allow sample-complexity information to be exploited in the variable selection process. The procedures were tested in five experiments, showing that the supervised procedure induced models that produced better scores, on average, than models induced from variables selected by the unsupervised procedure. The experiments also showed that the variables selected by both procedures exhibited reduced indices of multicollinearity.
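The unsupervised variant described in this entry, selecting variables by how strongly they load on the leading principal components, can be sketched as follows. The data is synthetic, constructed so that columns 0 and 3 carry most of the variance; column indices, sizes, and the choice of two components are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic agronomic-style data: 8 candidate variables, two of which
# (columns 0 and 3) carry most of the variance by construction.
n = 200
X = rng.normal(size=(n, 8)) * 0.3
X[:, 0] += rng.normal(scale=3.0, size=n)
X[:, 3] += rng.normal(scale=2.0, size=n)

# PCA loadings: right singular vectors of the centered data matrix.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rank variables by their largest absolute loading on the first two
# principal components, and keep the top two.
importance = np.abs(Vt[:2]).max(axis=0)
selected = np.argsort(importance)[-2:]
print(sorted(selected.tolist()))  # the high-variance columns, [0, 3]
```

The supervised variant would instead rank variables on components computed only from features pre-screened by their association with the response variable.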