
Dissertations / Theses on the topic 'Ensemble learning methods'


Consult the top 50 dissertations / theses for your research on the topic 'Ensemble learning methods.'


1. Abbasian, Houman. "Inner Ensembles: Using Ensemble Methods in Learning Step." Thèse, Université d'Ottawa / University of Ottawa, 2014. http://hdl.handle.net/10393/31127.

Abstract:
A pivotal moment in machine learning research was the creation of an important new research area, known as Ensemble Learning. In this work, we argue that ensembles are a very general concept, and though they have been widely used, they can be applied in more situations than they have been to date. Rather than using them only to combine the output of an algorithm, we can apply them to decisions made inside the algorithm itself, during the learning step. We call this approach Inner Ensembles. The motivation to develop Inner Ensembles was the opportunity to produce models with advantages similar to those of regular ensembles, such as accuracy and stability, plus additional advantages such as comprehensibility, simplicity, rapid classification and a small memory footprint. The main contribution of this work is to demonstrate how broadly this idea can be applied, and to highlight its potential impact on all types of algorithms. To support our claim, we first provide a general guideline for applying Inner Ensembles to different algorithms. Then, using this framework, we apply them to two categories of learning methods: supervised and unsupervised. For the former we chose Bayesian networks, and for the latter K-Means clustering. Our results show that 1) the overall performance of Inner Ensembles is significantly better than the original methods, and 2) Inner Ensembles provide similar performance improvements as regular ensembles.
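The thesis does not publish code, but the core idea can be sketched for the unsupervised case: a K-Means variant in which each centroid update is an ensemble decision, averaged over several bootstrap resamples. Everything below (data shape, member count, the bootstrap-averaging rule) is an illustrative assumption, not the author's implementation.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

def inner_ensemble_kmeans(X, k=3, n_members=5, n_iter=20, seed=0):
    """K-Means in which each centroid update is an ensemble decision:
    the average of updates proposed by several bootstrap resamples."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        proposals = []
        for _ in range(n_members):
            sample = X[rng.choice(len(X), size=len(X), replace=True)]
            dists = ((sample[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
            labels = dists.argmin(axis=1)
            proposals.append(np.array([
                sample[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)]))
        centroids = np.mean(proposals, axis=0)   # the "inner ensemble" vote
    return centroids

centroids = inner_ensemble_kmeans(X)
print(centroids.shape)
```

Averaging the proposals is meaningful here because every member assigns points against the same current centroids, so cluster indices stay aligned across members.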
2. Velka, Elina. "Loss Given Default Estimation with Machine Learning Ensemble Methods." Thesis, KTH, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-279846.

Abstract:
This thesis evaluates the performance of three machine learning methods in predicting the Loss Given Default (LGD). LGD can be seen as the opposite of the recovery rate, i.e. the share of an outstanding loan that the lender would be unable to recover if the customer defaulted. The methods investigated are decision trees, random forest and boosted methods. All of the methods investigated performed well in predicting the cases where the loan is not recovered, LGD = 1 (100%), or the loan is totally recovered, LGD = 0 (0%). When the performance of the models was evaluated on a dataset where the observations with LGD = 1 were removed, a significant decrease in performance was observed. The random forest model built on an unbalanced training dataset showed better performance on the test dataset that included values LGD = 1, and the random forest model built on a balanced training dataset performed better on the test set where the observations of LGD = 1 were removed. The boosted models evaluated in this study showed less accurate predictions than the other methods used. Overall, the random forest models showed slightly better results than the decision tree models, although the computational time (the cost) was considerably longer when running the random forest models. Therefore, decision tree models would be suggested for prediction of the Loss Given Default.
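As a rough sketch of the model comparison described above, one can fit a single decision tree and a random forest on synthetic loan data whose LGD values are concentrated at 0 and 1. The real data is proprietary, so the features, sizes and distributions below are invented purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))                   # hypothetical loan features
mass = rng.choice([0.0, 1.0], size=1000)         # loans fully lost or fully recovered
lgd = np.where(rng.random(1000) < 0.4, mass, rng.random(1000))
X_tr, X_te, y_tr, y_te = train_test_split(X, lgd, random_state=0)

results = {}
for model in (DecisionTreeRegressor(max_depth=5, random_state=0),
              RandomForestRegressor(n_estimators=100, random_state=0)):
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    results[type(model).__name__] = mae
    print(type(model).__name__, round(mae, 3))
```

On real data, the comparison would also track training time, since the thesis weighs the forest's accuracy gain against its computational cost.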
3. Conesa Gago, Agustin. "Methods to combine predictions from ensemble learning in multivariate forecasting." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-103600.

Abstract:
Making predictions is nowadays of high importance for any company, whether small or large, since analyzing the available data can reveal new market opportunities and reduce risks and costs, among other benefits. Machine learning algorithms for time series can be used for predicting future values of interest. However, choosing the appropriate algorithm and tuning its metaparameters requires a great level of expertise. This creates an adoption barrier for small and medium enterprises which cannot afford to hire a machine learning expert for their IT team. For these reasons, this project studies different possibilities to make good predictions based on machine learning algorithms without requiring great theoretical knowledge from the users. Moreover, a software package that implements the prediction process has been developed. The software is an ensemble method that first predicts a value taking into account different algorithms at the same time, and then combines their results, considering also the previous performance of each algorithm, to obtain a final prediction of the value. In addition, the solution proposed and implemented in this project can also predict according to a concrete objective (e.g., optimize the prediction, or avoid exceeding the real value), because not every prediction problem is subject to the same constraints. We have experimented and validated the implementation with three different cases. In all of them, better performance was obtained than with any of the individual algorithms involved, with improvements of 45 to 95%.
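The combination step described above (predict with several algorithms, then weight by past performance) can be sketched as follows; the inverse-error weighting rule is one plausible choice, not necessarily the one implemented in the thesis software.

```python
import numpy as np

def combine(preds, past_errors, eps=1e-9):
    """Weight each algorithm's prediction inversely to its past error,
    so historically accurate forecasters dominate the final value."""
    w = 1.0 / (np.asarray(past_errors, dtype=float) + eps)
    w /= w.sum()
    return float(np.dot(w, preds))

# three hypothetical forecasters predicting the next value of a series
preds = [10.2, 9.8, 11.5]
past_errors = [0.5, 0.4, 2.0]   # e.g. each algorithm's recent MAE
print(combine(preds, past_errors))
```

The combined value always lies between the smallest and largest member prediction, which keeps the ensemble's behaviour easy to interpret.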
4. Kanneganti, Alekhya. "Using Ensemble Machine Learning Methods in Estimating Software Development Effort." Thesis, Blekinge Tekniska Högskola, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-20691.

Abstract:
Background: Software Development Effort Estimation is a process that focuses on estimating the effort required to develop a software project within a minimal budget. Estimating effort includes interpretation of the required manpower, resources, time and schedule. Project managers are responsible for estimating the required effort. A model that can predict software development effort efficiently comes in handy and acts as a decision support system for project managers, enhancing the precision of effort estimates. The context of this study is therefore to increase the efficiency of software development effort estimation. Objective: The main objective of this thesis is to identify an effective ensemble method and to build and implement it for estimating software development effort. Apart from this, parameter tuning is also implemented to improve the performance of the model. Finally, we compare the results of the developed model with those of existing models. Method: In this thesis, we adopted two research methods. Initially, a literature review was conducted to gain knowledge of the existing studies, machine learning techniques, datasets, and ensemble methods previously used in estimating software development effort. Then a controlled experiment was conducted in order to build an ensemble model and to evaluate its performance, determining whether the developed model performs better than the existing models. Results: After conducting the literature review and collecting evidence, we decided to build and implement a stacked generalization ensemble method in this thesis, with the help of individual machine learning techniques: Support Vector Regressor (SVR), K-Nearest Neighbors Regressor (KNN), Decision Tree Regressor (DTR), Linear Regressor (LR), Multi-Layer Perceptron Regressor (MLP), Random Forest Regressor (RFR), Gradient Boosting Regressor (GBR), AdaBoost Regressor (ABR), and XGBoost Regressor (XGB).
Likewise, we decided to implement Randomized Parameter Optimization and the SelectKBest function for feature selection. The COCOMO81, MAXWELL, ALBRECHT and DESHARNAIS datasets were used. Results of the experiment show that the developed ensemble model performs best for three out of four datasets. Conclusion: After evaluating and analyzing the results obtained, we conclude that the developed model works well with datasets that have continuous, numeric values. We can also conclude that the developed ensemble model outperforms the other existing models when implemented with the COCOMO81, MAXWELL and ALBRECHT datasets.
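A stacked generalization ensemble of the kind described can be sketched with scikit-learn's `StackingRegressor`. The base learners below are a subset of those listed in the abstract, and the synthetic regression data stands in for the effort datasets.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (StackingRegressor, RandomForestRegressor,
                              GradientBoostingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

# synthetic stand-in for an effort-estimation dataset
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingRegressor(
    estimators=[("svr", SVR()),
                ("knn", KNeighborsRegressor()),
                ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
                ("gbr", GradientBoostingRegressor(random_state=0))],
    final_estimator=LinearRegression())   # meta-learner over base predictions
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 3))
```

In the thesis the base learners' hyperparameters would additionally be tuned (e.g. with `RandomizedSearchCV`) before stacking.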
5. Bustos, Ricardo Gacitua. "OntoLancs: An evaluation framework for ontology learning by ensemble methods." Thesis, Lancaster University, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.533089.
6. Elahi, Haroon. "A Boosted-Window Ensemble." Thesis, Blekinge Tekniska Högskola, Institutionen för datalogi och datorsystemteknik, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-5658.

Abstract:
Context. The problem of obtaining predictions from stream data involves training on the labeled instances and suggesting class values for the unseen stream instances. The nature of data-stream environments makes this task complicated. The large number of instances, the possibility of changes in the data distribution, the presence of noise, and drifting concepts are just some of the factors that add complexity to the problem. Various supervised-learning algorithms have been designed by putting together efficient data-sampling, ensemble-learning, and incremental-learning methods. The performance of an algorithm depends on the chosen methods. This leaves an opportunity to design new supervised-learning algorithms by using different combinations of constituent methods. Objectives. This thesis proposes a fast and accurate supervised-learning algorithm for performing predictions on data streams. This algorithm, called Boosted-Window Ensemble (BWE), is based on the mixture-of-experts technique. BWE uses a sliding window, online boosting, and incremental learning for data-sampling, ensemble-learning, and maintaining a consistent state with the current stream data, respectively. In this regard, a sliding window method is introduced. This method uses partial updates for sliding the window on the data stream and is called Partially-Updating Sliding Window (PUSW). An investigation is carried out comparing two variants of sliding window and three different ensemble-learning methods in order to choose the superior methods. Methods. The thesis uses an experimental approach for evaluating the Boosted-Window Ensemble (BWE). CPU time and prediction accuracy are used as performance indicators, where CPU time is the execution time in seconds. The benchmark algorithms include: Accuracy-Updated Ensemble1 (AUE1), Accuracy-Updated Ensemble2 (AUE2), and Accuracy-Weighted Ensemble (AWE).
The experiments use nine synthetic and five real-world datasets for generating performance estimates. The Asymptotic Friedman test and the Wilcoxon Signed-Rank test are used for hypothesis testing. The Wilcoxon-Nemenyi-McDonald-Thompson test is used for post-hoc analysis. Results. The hypothesis testing suggests that: 1) for both the synthetic and real-world datasets, the Boosted-Window Ensemble (BWE) has significantly lower CPU-time values than two benchmark algorithms (Accuracy-Updated Ensemble1 (AUE1) and Accuracy-Weighted Ensemble (AWE)); 2) BWE returns similar prediction accuracy to AUE1 and AWE for the synthetic datasets; 3) BWE returns similar prediction accuracy to the three benchmark algorithms for the real-world datasets. Conclusions. Experimental results demonstrate that the proposed algorithm can be as accurate as the state-of-the-art benchmark algorithms while obtaining predictions from stream data. The results further show that the use of the Partially-Updating Sliding Window results in lower CPU time for BWE as compared with the chunk-based sliding window method used in AUE1, AUE2, and AWE.
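The partial-update idea behind PUSW (slide the window by admitting only a fraction of each incoming batch, instead of replacing a whole chunk) might be sketched as below. The window size, update fraction, and use of `SGDClassifier` as the incremental learner are illustrative assumptions, not the thesis configuration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

class PartiallyUpdatingSlidingWindow:
    """Fixed-size window over a stream; each step admits only a fraction
    of the incoming batch, dropping the same number of oldest instances."""
    def __init__(self, size=200, update_frac=0.25):
        self.size = size
        self.step = max(1, int(size * update_frac))
        self.X, self.y = None, None

    def add(self, Xb, yb):
        Xb, yb = Xb[:self.step], yb[:self.step]   # the partial update
        if self.X is None:
            self.X, self.y = Xb, yb
        else:
            self.X = np.vstack([self.X, Xb])[-self.size:]
            self.y = np.concatenate([self.y, yb])[-self.size:]

rng = np.random.default_rng(0)
window = PartiallyUpdatingSlidingWindow(size=200, update_frac=0.25)
clf = SGDClassifier(random_state=0)               # stand-in incremental learner
for t in range(20):                               # simulated stream batches
    Xb = rng.normal(size=(100, 4))
    yb = (Xb[:, 0] > 0).astype(int)               # a simple, stable concept
    window.add(Xb, yb)
    clf.partial_fit(window.X, window.y, classes=np.array([0, 1]))
print(round(clf.score(window.X, window.y), 2))
```

Because only a quarter of the window turns over per step, the learner keeps a longer memory of past instances than a chunk-replacement window of the same size would.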
7. King, Michael Allen. "Ensemble Learning Techniques for Structured and Unstructured Data." Diss., Virginia Tech, 2015. http://hdl.handle.net/10919/51667.

Abstract:
This research provides an integrated approach of applying innovative ensemble learning techniques that has the potential to increase the overall accuracy of classification models. Actual structured and unstructured data sets from industry are utilized during the research process, analysis and subsequent model evaluations. The first research section addresses the consumer demand forecasting and daily capacity management requirements of a nationally recognized alpine ski resort in the state of Utah, in the United States of America. A basic econometric model is developed, and the effectiveness of three classic predictive models is evaluated. These predictive models were subsequently used as input for four ensemble modeling techniques, which are shown to be effective. The second research section discusses the opportunities and challenges faced by a leading firm providing sponsored search marketing services. The goal of sponsored search marketing campaigns is to create advertising campaigns that better attract and motivate a target market to purchase. This research develops a method for classifying profitable campaigns and maximizing overall campaign portfolio profits. Four traditional classifiers are utilized, along with four ensemble learning techniques, to build classifier models that identify profitable pay-per-click campaigns. A MetaCost ensemble configuration, having the ability to integrate unequal classification costs, produced the highest campaign portfolio profit. The third research section addresses the management challenges of online consumer reviews encountered by service industries and shows how these textual reviews can be used for service improvements. A service improvement framework is introduced that integrates traditional text mining techniques and second-order feature derivation with ensemble learning techniques.
The concept of GLOW and SMOKE words is introduced and shown to be an objective text analytic source of service defects or service accolades.
Ph. D.
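MetaCost, mentioned above, wraps any classifier to make it cost-sensitive: a bagged ensemble estimates class probabilities, each training point is relabeled with the class of minimum expected cost, and an ordinary classifier is retrained on the relabeled data. A minimal sketch follows; the cost matrix and synthetic data are invented for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, weights=[0.85, 0.15], random_state=0)
C = np.array([[0.0, 1.0],    # C[true][pred]: false alarm is cheap,
              [5.0, 0.0]])   # missing the rare class is 5x worse

# 1) a bagged ensemble estimates class probabilities P(j | x)
bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=25, random_state=0).fit(X, y)
proba = bag.predict_proba(X)

# 2) relabel each point with the class of minimum expected cost:
#    R(pred=i | x) = sum_j P(j | x) * C[j][i]  ->  column i of proba @ C
y_relabel = np.argmin(proba @ C, axis=1)

# 3) retrain an ordinary classifier on the relabeled data
final = DecisionTreeClassifier(random_state=0).fit(X, y_relabel)
print(final.predict(X[:5]).shape)
```

The retrained classifier needs no cost-aware machinery at prediction time; the costs are baked into its training labels.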
8. Nguyen, Thanh Tien. "Ensemble Learning Techniques and Applications in Pattern Classification." Thesis, Griffith University, 2017. http://hdl.handle.net/10072/366342.

Abstract:
It is widely known that the best classifier for a given problem is often problem-dependent, and there is no single classification algorithm that is best for all classification tasks. A natural question that arises is: can we combine multiple classification algorithms to achieve higher classification accuracy than a single one? That is the idea behind the class of methods called ensemble methods. An ensemble method is defined as the combination of several classifiers with the aim of achieving a lower classification error rate than with a single classifier. Ensemble methods have been applied to various applications, ranging from computer-aided medical diagnosis, computer vision, and software engineering to information retrieval. In this study, we focus on heterogeneous ensemble methods, in which a fixed set of diverse learning algorithms is trained on the same training set to generate different classifiers, and the class prediction is then made based on the output of these classifiers (called Level-1 data or meta-data). Research on heterogeneous ensemble methods is mainly focused on two aspects: (i) proposing efficient methods for combining classifiers on meta-data to achieve high accuracy, and (ii) optimizing the ensemble by performing feature and classifier selection. Although various approaches related to heterogeneous ensemble methods have been proposed, some research gaps still exist. First, in ensemble learning, the meta-data of an observation reflects the agreement and disagreement among the different base classifiers.
Thesis (PhD Doctorate)
Doctor of Philosophy (PhD)
School of Information and Communication Technology
Science, Environment, Engineering and Technology
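The Level-1 data (meta-data) described above — base classifiers' out-of-fold predictions on the shared training set — can be generated with scikit-learn's `cross_val_predict`. The three base learners here are arbitrary stand-ins for the fixed set of diverse algorithms.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
learners = [LogisticRegression(max_iter=1000),
            GaussianNB(),
            DecisionTreeClassifier(random_state=0)]

# Level-1 data: each column holds one base classifier's
# out-of-fold posterior for class 1 on the shared training set
meta = np.column_stack([
    cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    for clf in learners])
print(meta.shape)

# a simple combiner trained on the meta-data
combiner = LogisticRegression().fit(meta, y)
print(round(combiner.score(meta, y), 3))
```

Using out-of-fold predictions (rather than in-sample ones) keeps the combiner from learning the base classifiers' overfitting artifacts.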
9. Shi, Zhe. "Semi-supervised Ensemble Learning Methods for Enhanced Prognostics and Health Management." University of Cincinnati / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1522420632837268.
10. Slawek, Janusz. "Inferring Gene Regulatory Networks from Expression Data using Ensemble Methods." VCU Scholars Compass, 2014. http://scholarscompass.vcu.edu/etd/3396.

Abstract:
High-throughput technologies for measuring gene expression have made inferring genome-wide Gene Regulatory Networks an active field of research. Reverse-engineering systems of transcriptional regulations has become an important challenge in molecular and computational biology. Because such systems model dependencies between genes, they are important for understanding cell behavior, and can potentially turn observed expression data into new biological knowledge and practical applications. In this dissertation we introduce a set of algorithms that infer networks of transcriptional regulations from a variety of expression profiles with superior accuracy compared to state-of-the-art techniques. The proposed methods make use of ensembles of trees, which have become popular in many scientific fields, including genetics and bioinformatics, although they were originally motivated by classification, regression, and feature selection theory. In this study we exploit their relative variable importance measure as an indication of the presence or absence of a regulatory interaction between genes. We further analyze their predictions on a set of universally recognized benchmark expression data sets, and achieve favorable results compared with state-of-the-art algorithms.
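The variable-importance idea can be sketched in a GENIE3-like loop: regress each gene on all the others with a tree ensemble and read candidate regulatory edges off the relative importances. The toy expression matrix and the planted regulation below are illustrative assumptions, not the dissertation's algorithm in full.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples, n_genes = 100, 6
expr = rng.normal(size=(n_samples, n_genes))
expr[:, 3] = 2.0 * expr[:, 0] + 0.1 * rng.normal(size=n_samples)  # gene 0 "regulates" gene 3

# For each target gene, regress it on all other genes; the relative
# variable importance of gene i becomes the weight of edge i -> target.
scores = np.zeros((n_genes, n_genes))
for target in range(n_genes):
    predictors = [g for g in range(n_genes) if g != target]
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(expr[:, predictors], expr[:, target])
    scores[predictors, target] = rf.feature_importances_

print(int(np.argmax(scores[:, 3])))   # strongest inferred regulator of gene 3
```

Thresholding or ranking the `scores` matrix then yields the inferred network's edge list.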
11. De Giorgi, Marcello. "Tree ensemble methods for Predictive Maintenance: a case study." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amslaurea.unibo.it/22282/.

Abstract:
The work described in this thesis builds models for the predictive maintenance of industrial machine tools. In particular, the models are trained using tree ensemble methods with two goals: predicting a machine failure early enough to allow maintenance teams to be organized, and predicting the need for early replacement of the tool used by the machine, in order to keep quality standards high. After describing the industrial context under examination, the thesis illustrates the processes followed to create and aggregate a dataset and to introduce information about machine events. Having analyzed the behavior of several variables during machining and distinguished valid from invalid machining cycles, tree ensemble methods are introduced together with the reasons for choosing this class of algorithms. Two candidates for the problem at hand are presented in detail: Random Forest and XGBoost. After describing how they work, the results obtained by the models are presented, proposing an expected-cost function as an alternative to the accuracy score for estimating their effectiveness. Finally, the results of the models trained with the two proposed algorithms are compared.
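The expected-cost criterion proposed as an alternative to accuracy can be sketched as follows; the cost values and predictions are invented for illustration.

```python
import numpy as np

def expected_cost(y_true, y_pred, cost):
    """Average cost per decision, given a cost matrix
    cost[true_class][predicted_class]."""
    return float(np.mean([cost[t][p] for t, p in zip(y_true, y_pred)]))

# hypothetical maintenance cost matrix: a missed failure (1 -> 0)
# costs far more than an unnecessary intervention (0 -> 1)
cost = np.array([[0.0, 10.0],    # true: no failure
                 [100.0, 0.0]])  # true: failure
y_true  = [0, 0, 0, 1, 1]
model_a = [0, 0, 0, 1, 0]   # one missed failure
model_b = [0, 1, 1, 1, 1]   # two false alarms
print(expected_cost(y_true, model_a, cost))  # 20.0
print(expected_cost(y_true, model_b, cost))  # 4.0
```

Although model A has the higher accuracy (4/5 vs 3/5), model B is the better choice under this cost structure, which is exactly why the thesis prefers expected cost over the accuracy score.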
12. Lund, William B. "Ensemble Methods for Historical Machine-Printed Document Recognition." BYU ScholarsArchive, 2014. https://scholarsarchive.byu.edu/etd/4024.

Abstract:
The usefulness of digitized documents is directly related to the quality of the extracted text. Optical Character Recognition (OCR) has reached a point where well-formatted and clean machine-printed documents are easily recognizable by current commercial OCR products; however, older or degraded machine-printed documents present problems to OCR engines, resulting in word error rates (WER) that severely limit either automated or manual use of the extracted text. Major archives of historical machine-printed documents are being assembled around the globe, requiring an accurate transcription of the text for the automated creation of descriptive metadata, full-text searching, and information extraction. Given document images to be transcribed, ensemble recognition methods with multiple sources of evidence from the original document image and information sources external to the document have been shown in this and related work to improve output. This research introduces new methods of evidence extraction, feature engineering, and evidence combination to correct errors from state-of-the-art OCR engines. This work also investigates the success and failure of ensemble methods in the OCR error correction task, as well as the conditions under which these ensemble recognition methods reduce the WER, improving the quality of the OCR transcription and showing that the average document word error rate can be reduced below that of a state-of-the-art commercial OCR system by between 7.4% and 28.6%, depending on the test corpus and methods. This research on OCR error correction contributes to the larger field of ensemble methods as follows. Four unique corpora for OCR error correction are introduced: the Eisenhower Communiqués, a collection of typewritten documents from 1944 to 1945; the Nineteenth Century Mormon Articles Newspaper Index from 1831 to 1900; and two synthetic corpora based on the Enron (2001) and Reuters (1997) datasets.
The Reverse Dijkstra Heuristic is introduced as a novel admissible heuristic for the A* exact alignment algorithm. The impact of the heuristic is a dramatic reduction in the number of nodes processed during text alignment as compared to the baseline method. From the aligned text, the method developed here creates a lattice of competing hypotheses for word tokens. In contrast to much of the work in this field, the word token lattice is created from a character alignment, preserving split and merged tokens within the hypothesis columns of the lattice. This alignment method more explicitly identifies competing word hypotheses which may otherwise have been split apart by a word alignment. Lastly, this research explores, in order of increasing contribution to word error rate reduction: voting among hypotheses, decision lists based on an in-domain training set, ensemble recognition methods with novel feature sets, multiple binarizations of the same document image, and training on synthetic document images.
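The simplest of the combination strategies listed — voting among aligned hypotheses — can be sketched on a toy word lattice. The engine outputs below are fabricated examples of typical OCR confusions, not drawn from the corpora above.

```python
from collections import Counter

def vote_lattice(columns):
    """Pick, for each column of competing word hypotheses, the majority token.
    Each column holds one hypothesis per recognition source (OCR engine)."""
    return [Counter(col).most_common(1)[0][0] for col in columns]

# three OCR engines' aligned outputs for a four-word line (hypothetical)
columns = [("the", "the", "tbe"),
           ("allied", "a11ied", "allied"),
           ("forces", "forces", "forccs"),
           ("advanced", "advanced", "advanced")]
print(" ".join(vote_lattice(columns)))  # the allied forces advanced
```

The dissertation's character-level alignment would additionally keep split and merged tokens inside a column, which plain word-level voting like this cannot represent.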
13. Darwiche, Aiman A. "Machine Learning Methods for Septic Shock Prediction." Diss., NSUWorks, 2018. https://nsuworks.nova.edu/gscis_etd/1051.

Abstract:
Sepsis is a life-threatening organ dysfunction caused by a dysregulated body response to infection. Sepsis is difficult to detect at an early stage, and when not detected early, it is difficult to treat and results in high mortality rates. Developing improved methods for identifying patients at high risk of suffering septic shock has been the focus of much research in recent years. Building on this body of literature, this dissertation develops an improved method for septic shock prediction. Using data from the MIMIC-III database, an ensemble classifier is trained to identify high-risk patients. A robust prediction model is built by obtaining a risk score from fitting the Cox Hazard model on multiple input features. The score is added to the list of features, and the Random Forest ensemble classifier is trained to produce the model. The proposed method, Cox Enhanced Random Forest (CERF), is evaluated by comparing its predictive accuracy with that of extant methods.
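The CERF construction (a survival-model risk score appended as an extra feature for a Random Forest) can be sketched as follows. To stay self-contained, a logistic model's decision function stands in for the Cox proportional hazards score used in the dissertation, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# imbalanced synthetic stand-in for the septic-shock cohort
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Stand-in risk score: in CERF this comes from a fitted Cox hazard model;
# here a logistic model's decision function plays that role.
risk_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
X_tr_plus = np.column_stack([X_tr, risk_model.decision_function(X_tr)])
X_te_plus = np.column_stack([X_te, risk_model.decision_function(X_te)])

# the ensemble classifier trained on features + risk score
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr_plus, y_tr)
print(round(rf.score(X_te_plus, y_te), 3))
```

The point of the construction is that the forest can exploit the score's global risk summary alongside the raw features it is derived from.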
14. Frery, Jordan. "Ensemble Learning for Extremely Imbalanced Data Flows." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSES034.

Abstract:
Machine learning is the study of designing algorithms that learn from training data to achieve a specific task. The resulting model is then used to predict over new (unseen) data points without any outside help. This data can take many forms, such as images (matrices of pixels), signals (sounds, ...), transactions (age, amount, merchant, ...), and logs (time, alerts, ...). Datasets may be defined to address a specific task such as object recognition, voice identification, anomaly detection, etc. In these tasks, the knowledge of the expected outputs encourages a supervised learning approach, where every observed data point is assigned a label that defines what the model predictions should be. For example, in object recognition, an image could be associated with the label "car", which indicates that the learning algorithm has to learn that a car is contained in this picture, somewhere. This is in contrast with unsupervised learning, where the task at hand does not have explicit labels. For example, one popular topic in unsupervised learning is to discover underlying structures contained in visual data (images), such as geometric forms of objects, lines, and depth, before learning a specific task. This kind of learning is obviously much harder, as there might be a potentially infinite number of concepts to grasp in the data. In this thesis, we focus on a specific scenario of the supervised learning setting: 1) the label of interest is under-represented (e.g. anomalies) and 2) the dataset increases with time as we receive data from real-life events (e.g. credit card transactions). In fact, these settings are very common in the industrial domain in which this thesis takes place.
15. Vandoni, Jennifer. "Ensemble Methods for Pedestrian Detection in Dense Crowds." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLS116/document.

Abstract:
Cette thèse s’intéresse à la détection des piétons dans des foules très denses depuis un système mono-camera, avec comme but d’obtenir des détections localisées de toutes les personnes. Ces détections peuvent être utilisées soit pour obtenir une estimation robuste de la densité, soit pour initialiser un algorithme de suivi. Les méthodologies classiques utilisées pour la détection de piétons s’adaptent mal au cas où seulement les têtes sont visibles, de part l’absence d’arrière-plan, l’homogénéité visuelle de la foule, la petite taille des objets et la présence d’occultations très fortes. En présence de problèmes difficiles tels que notre application, les approches à base d’apprentissage supervisé sont bien adaptées. Nous considérons un système à plusieurs classifieurs (Multiple Classifier System, MCS), composé de deux ensembles différents, le premier basé sur les classifieurs SVM (SVM- ensemble) et le deuxième basé sur les CNN (CNN-ensemble), combinés dans le cadre de la Théorie des Fonctions de Croyance (TFC). L’ensemble SVM est composé de plusieurs SVM exploitant les données issues d’un descripteur différent. La TFC nous permet de prendre en compte une valeur d’imprécision supposée correspondre soit à une imprécision dans la procédure de calibration, soit à une imprécision spatiale. Cependant, le manque de données labellisées pour le cas des foules très denses nuit à la génération d’ensembles de données d’entrainement et de validation robustes. Nous avons proposé un algorithme d’apprentissage actif de type Query-by- Committee (QBC) qui permet de sélectionner automatiquement de nouveaux échantillons d’apprentissage. Cet algorithme s’appuie sur des mesures évidentielles déduites des fonctions de croyance. Pour le second ensemble, pour exploiter les avancées de l’apprentissage profond, nous avons reformulé notre problème comme une tâche de segmentation en soft labels. 
Une architecture entièrement convolutionelle a été conçue pour détecter les petits objets grâce à des convolutions dilatées. Nous nous sommes appuyés sur la technique du dropout pour obtenir un ensemble CNN capable d’évaluer la fiabilité sur les prédictions du réseau lors de l’inférence. Les réalisations de cet ensemble sont ensuite combinées dans le cadre de la TFC. Pour conclure, nous montrons que la sortie du MCS peut être utile aussi pour le comptage de personnes. Nous avons proposé une méthodologie d’évaluation multi-échelle, très utile pour la communauté de modélisation car elle lie incertitude (probabilité d’erreur) et imprécision sur les valeurs de densité estimées
This study deals with pedestrian detection in high-density crowds from a mono-camera system. The detections can then be used both to obtain robust density estimation and to initialize a tracking algorithm. One of the most difficult challenges is that usual pedestrian detection methodologies do not scale well to high-density crowds, for reasons such as the absence of background, high visual homogeneity, the small size of the objects, and heavy occlusions. We cast the detection problem as a Multiple Classifier System (MCS), composed of two different ensembles of classifiers, the first one based on SVM (SVM-ensemble) and the second one based on CNN (CNN-ensemble), combined relying on the Belief Function Theory (BFT) to exploit their strengths for pixel-wise classification. The SVM-ensemble is composed of several SVM detectors based on different gradient, texture and orientation descriptors, able to tackle the problem from different perspectives. BFT allows us to take into account imprecision in addition to the uncertainty value provided by each classifier, which we consider to come from possible errors in the calibration procedure and from the heterogeneity of each pixel's neighborhood in the image space. However, the scarcity of labeled data for specific dense-crowd contexts makes it impossible to obtain robust training and validation sets. By exploiting belief functions directly derived from the classifiers' combination, we propose an evidential Query-by-Committee (QBC) active learning algorithm to automatically select the most informative training samples. On the other hand, we explore deep learning techniques by casting the problem as a segmentation task with soft labels, with a fully convolutional network designed to recover small objects thanks to a tailored use of dilated convolutions.
In order to obtain a pixel-wise measure of reliability about the network's predictions, we create a CNN-ensemble by means of dropout at inference time, and we combine the different obtained realizations in the context of BFT. Finally, we show that the output map given by the MCS can be employed to perform people counting. We propose an evaluation method that can be applied at every scale, providing also uncertainty bounds on the estimated density.
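The dropout-at-inference idea used to build the CNN-ensemble can be illustrated with a minimal numerical sketch. This is not the thesis's actual network: the tiny sigmoid "model", its weights, and the function names are purely illustrative, and the spread across stochastic passes stands in for the per-prediction reliability that would feed the BFT combination.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_with_dropout(x, w, drop_p=0.5):
    """One stochastic forward pass: randomly zero weights and rescale
    the survivors, as dropout does at training time."""
    mask = rng.random(w.shape) >= drop_p
    z = float((x * w * mask / (1.0 - drop_p)).sum())
    return 1.0 / (1.0 + np.exp(-z))          # sigmoid "head" score

def mc_dropout_ensemble(x, w, n_passes=100):
    """Keep dropout active at inference: each pass is one ensemble
    realization; the mean is the combined score and the spread is a
    per-prediction reliability measure."""
    scores = np.array([forward_with_dropout(x, w) for _ in range(n_passes)])
    return scores.mean(), scores.std()

x = np.array([0.2, -0.1, 0.4])               # toy input features
w = np.array([1.5, -2.0, 0.5])               # toy learned weights
mean_score, spread = mc_dropout_ensemble(x, w)
```

A low spread suggests the realizations agree and the prediction can be trusted; a high spread flags pixels where the ensemble is unreliable.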
APA, Harvard, Vancouver, ISO, and other styles
16

Michelen, Strofer Carlos Alejandro. "Machine Learning and Field Inversion approaches to Data-Driven Turbulence Modeling." Diss., Virginia Tech, 2021. http://hdl.handle.net/10919/103155.

Full text
Abstract:
There still is a practical need for improved closure models for the Reynolds-averaged Navier-Stokes (RANS) equations. This dissertation explores two different approaches for using experimental data to provide improved closure for the Reynolds stress tensor field. The first approach uses machine learning to learn a general closure model from data. A novel framework is developed to train deep neural networks using experimental velocity and pressure measurements. The sensitivity of the RANS equations to the Reynolds stress, required for gradient-based training, is obtained by means of both variational and ensemble methods. The second approach is to infer the Reynolds stress field for a flow of interest from limited velocity or pressure measurements of the same flow. Here, this field inversion is done using a Monte Carlo Bayesian procedure and the focus is on improving the inference by enforcing known physical constraints on the inferred Reynolds stress field. To this end, a method for enforcing boundary conditions on the inferred field is presented. The two data-driven approaches explored and improved upon here demonstrate the potential for improved practical RANS predictions.
Doctor of Philosophy
The Reynolds-averaged Navier-Stokes (RANS) equations are widely used to simulate fluid flows in engineering applications despite their known inaccuracy in many flows of practical interest. The uncertainty in the RANS equations is known to stem from the Reynolds stress tensor for which no universally applicable turbulence model exists. The computational cost of more accurate methods for fluid flow simulation, however, means RANS simulations will likely continue to be a major tool in engineering applications and there is still a need for improved RANS turbulence modeling. This dissertation explores two different approaches to use available experimental data to improve RANS predictions by improving the uncertain Reynolds stress tensor field. The first approach is using machine learning to learn a data-driven turbulence model from a set of training data. This model can then be applied to predict new flows in place of traditional turbulence models. To this end, this dissertation presents a novel framework for training deep neural networks using experimental measurements of velocity and pressure. When using velocity and pressure data, gradient-based training of the neural network requires the sensitivity of the RANS equations to the learned Reynolds stress. Two different methods, the continuous adjoint and ensemble approximation, are used to obtain the required sensitivity. The second approach explored in this dissertation is field inversion, whereby available data for a flow of interest is used to infer a Reynolds stress field that leads to improved RANS solutions for that same flow. Here, the field inversion is done via the ensemble Kalman inversion (EKI), a Monte Carlo Bayesian procedure, and the focus is on improving the inference by enforcing known physical constraints on the inferred Reynolds stress field. To this end, a method for enforcing boundary conditions on the inferred field is presented. 
While further development is needed, the two data-driven approaches explored and improved upon here demonstrate the potential for improved practical RANS predictions.
APA, Harvard, Vancouver, ISO, and other styles
17

Sirin, Volkan. "Machine Learning Methods For Opponent Modeling In Games Of Imperfect Information." Master's thesis, METU, 2012. http://etd.lib.metu.edu.tr/upload/12614630/index.pdf.

Full text
Abstract:
This thesis presents a machine learning approach to the problem of opponent modeling in games of imperfect information. The efficiency of various artificial intelligence techniques is investigated in this domain. A sequential game is called an imperfect information game if players do not have all the information about the current state of the game. A very popular example is Texas Hold'em Poker, which is used for the realization of the suggested methods in this thesis. Opponent modeling is the system that enables a player to predict the behaviour of its opponent. In this study, the opponent modeling problem is approached as a classification problem. An architecture with a different classifier for each phase of the game is suggested. Neural Networks, K-Nearest Neighbors (KNN) and Support Vector Machines are used as classifiers. For modeling a particular player, KNN is found to be the most successful, with a prediction accuracy of 88%. An ensemble learning system is proposed for modeling different playing styles, including previously unseen ones. Computational complexity and the parallelization of some calculations are also discussed.
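The nearest-neighbour prediction the abstract reports as most successful can be sketched minimally as follows. The game-state features and action labels below are toy placeholders, not the thesis's actual state representation.

```python
import numpy as np

def knn_predict(train_X, train_y, x, k=3):
    """Predict the opponent's next action as the majority label among
    the k most similar previously observed game states."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest_labels = train_y[np.argsort(dists)[:k]]
    return int(np.bincount(nearest_labels).argmax())

# toy game-state features and observed actions (0 = fold/check, 1 = bet/raise)
train_X = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
train_y = np.array([0, 1, 0, 1])
action = knn_predict(train_X, train_y, np.array([0.15, 0.15]))   # -> 0
```

In the architecture described above, one such model would be trained per phase of the game, with the phase determining which training set is queried.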
APA, Harvard, Vancouver, ISO, and other styles
18

Dutra, Calainho Felipe. "Evaluation of Calibration Methods to Adjust for Infrequent Values in Data for Machine Learning." Thesis, Högskolan Dalarna, Mikrodataanalys, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:du-28134.

Full text
Abstract:
The performance of supervised machine learning algorithms is highly dependent on the distribution of the target variable. Infrequent values are more difficult to predict, as there are fewer examples for the algorithm to learn patterns that contain those values. These infrequent values are a common problem with real data, being the object of interest in many fields such as medical research, finance and economics, just to mention a few. Problems regarding classification have been comprehensively studied. For regression, on the other hand, few contributions are available. In this work, two ensemble methods from classification are adapted to the regression case. Additionally, existing oversampling techniques, namely SmoteR, are tested. Therefore, the aim of this research is to examine the influence of oversampling and ensemble techniques over the accuracy of regression models when predicting infrequent values. To assess the performance of the proposed techniques, two data sets are used: one concerning house prices, while the other regards patients with Parkinson's Disease. The findings corroborate the usefulness of the techniques for reducing the prediction error of infrequent observations. In the best case, the proposed Random Distribution Sample Ensemble reduced the overall RMSE by 8.09% and the RMSE for infrequent values by 6.44% when compared with the best performing benchmark for the housing data set.
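A much-simplified version of the SmoteR idea mentioned above, interpolating rare cases in both feature and target space, might be sketched as follows. The rarity threshold and data are illustrative, and the full SmoteR algorithm additionally uses a relevance function and nearest-neighbour selection.

```python
import numpy as np

rng = np.random.default_rng(42)

def smoter_like(X, y, rare_mask, n_new=10):
    """Simplified SmoteR-style oversampling: create synthetic examples
    for infrequent target values by interpolating between pairs of rare
    cases, in both feature space and target space."""
    rare_idx = np.flatnonzero(rare_mask)
    X_new, y_new = [], []
    for _ in range(n_new):
        i, j = rng.choice(rare_idx, size=2)
        t = rng.random()
        X_new.append(X[i] + t * (X[j] - X[i]))
        y_new.append(y[i] + t * (y[j] - y[i]))
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

X = rng.normal(size=(100, 3))
y = np.concatenate([rng.normal(size=95), [3.0, 3.2, -3.1, 2.9, -3.3]])
rare = np.abs(y) > 2.5            # treat extreme targets as "infrequent"
X_aug, y_aug = smoter_like(X, y, rare)
```

The augmented set gives a regression learner more examples in the rare range, which is exactly where the abstract reports the largest error reductions.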
APA, Harvard, Vancouver, ISO, and other styles
19

NOTARO, MARCO. "HIERARCHICAL ENSEMBLE METHODS FOR ONTOLOGY-BASED PREDICTIONS IN COMPUTATIONAL BIOLOGY." Doctoral thesis, Università degli Studi di Milano, 2019. http://hdl.handle.net/2434/606185.

Full text
Abstract:
L'annotazione standardizzata di entità biologiche, quali geni e proteine, ha fortemente promosso l'organizzazione dei concetti biologici in vocabolari controllati, cioè ontologie che consentono di indicizzare in modo coerente le relazioni tra le diverse classi funzionali organizzate secondo una gerarchia predefinita. Esempi di ontologie biologiche in cui i termini funzionali sono strutturati secondo un grafo diretto aciclico (DAG) sono la Gene Ontology (GO) e la Human Phenotype Ontology (HPO). Tali tassonomie gerarchiche vengono utilizzate dalla comunità scientifica rispettivamente per sistematizzare le funzioni proteiche di tutti gli organismi viventi dagli Archea ai Metazoa e per categorizzare le anomalie fenotipiche associate a malattie umane. Tali bio-ontologie, offrendo uno spazio di classificazione ben definito, hanno favorito lo sviluppo di metodi di apprendimento per la predizione automatizzata della funzione delle proteine e delle associazioni gene-fenotipo patologico nell'uomo. L'obiettivo di tali metodologie consiste nell'“indirizzare” la ricerca “in-vitro” per favorire una riduzione delle spese ed un uso più efficace dei fondi destinati alla ricerca. Dal punto di vista dell'apprendimento automatico il problema della predizione della funzione delle proteine o delle associazioni gene-fenotipo patologico nell'uomo può essere modellato come un problema di classificazione multi-etichetta strutturato, in cui le predizioni associate ad ogni esempio (i.e., gene o proteina) sono sotto-grafi organizzati secondo una determinata struttura (albero o DAG). A causa della complessità del problema di classificazione, ad oggi l'approccio di predizione più comunemente utilizzato è quello “flat”, che consiste nell'addestrare un classificatore separatamente per ogni termine dell'ontologia senza considerare le relazioni gerarchiche esistenti tra le classi funzionali. 
L'utilizzo di questo approccio è giustificato non soltanto dal fatto di ridurre la complessità computazionale del problema di apprendimento, ma anche dalla natura “instabile” dei termini che compongono l'ontologia stessa. Infatti tali termini vengono aggiornati mensilmente mediante un processo curato da esperti che si basa sia sulla letteratura scientifica biomedica che su dati sperimentali ottenuti da esperimenti eseguiti “in-vitro” o “in-silico”. In questo contesto, in letteratura sono stati proposti due classi generali di classificatori. Da una parte, si collocano i metodi di apprendimento automatico che predicono le classi funzionali in modo “flat”, ossia senza esplorare la struttura intrinseca dello spazio delle annotazioni. Dall'altra parte, gli approcci gerarchici che, considerando esplicitamente le relazioni gerarchiche fra i termini funzionali dell'ontologia, garantiscono che le annotazioni predette rispettino la “true-path-rule”, la regola biologica che governa le ontologie. Nell'ambito dei metodi gerarchici, in letteratura sono stati proposti due diverse categorie di approcci. La prima si basa su metodi kernelizzati per predizioni con output strutturato, mentre la seconda su metodi di ensemble gerarchici. Entrambi questi metodi presentano alcuni svantaggi. I primi sono computazionalmente pesanti e non scalano bene se applicati ad ontologie biologiche. I secondi sono stati per la maggior parte concepiti per tassonomie strutturate ad albero, e quei pochi approcci specificatamente progettati per ontologie strutturate secondo un DAG, sono nella maggioranza dei casi incapaci di migliorare le performance di predizione dei metodi “flat”. Per superare queste limitazioni, nel presente lavoro di tesi si sono proposti dei nuovi metodi di ensemble gerarchici capaci di fornire predizioni consistenti con la struttura gerarchica dell'ontologia. 
Tali approcci, da un lato estendono precedenti metodi originariamente sviluppati per ontologie strutturate ad albero ad ontologie organizzate secondo un DAG e dall'altro migliorano significativamente le predizioni rispetto all'approccio “flat” indipendentemente dalla scelta del tipo di classificatore utilizzato. Nella loro forma più generale, gli approcci di ensemble gerarchici sono altamente modulari, nel senso che adottano una strategia di apprendimento a due passi. Nel primo passo, le classi funzionali dell'ontologia vengono apprese in modo indipendente l'una dall'altra, mentre nel secondo passo le predizioni “flat” vengono combinate opportunamente tenendo conto delle gerarchia fra le classi ontologiche. I principali contributi introdotti nella presente tesi sono sia metodologici che sperimentali. Da un punto di vista metodologico, sono stati proposti i seguenti nuovi metodi di ensemble gerarchici: a) HTD-DAG (Hierarchical Top-Down per tassonomie DAG strutturate); b) TPR-DAG (True-Path-Rule per DAG) con diverse varianti algoritmiche; c) ISO-TPR (True-Path-Rule con Regressione Isotonica), un nuovo algoritmo gerarchico che combina la True-Path-Rule con metodi di regressione isotonica. Per tutti i metodi di ensemble gerarchici è stato dimostrato in modo formale la coerenza delle predizioni, cioè è stato provato come gli approcci proposti sono in grado di fornire predizioni che rispettano le relazioni gerarchiche fra le classi. 
Da un punto di vista sperimentale, risultati a livello dell'intero genoma di organismi modello e dell'uomo ed a livello della totalità delle classi incluse nelle ontologie biologiche mostrano che gli approcci metodologici proposti: a) sono competitivi con gli algoritmi di predizione output strutturata allo stato dell'arte; b) sono in grado di migliorare i classificatori “flat”, a patto che le predizioni fornite dal classificatore non siano casuali; c) sono in grado di predire nuove associazioni tra geni umani e fenotipi patologici, un passo cruciale per la scoperta di nuovi geni associati a malattie genetiche umane e al cancro; d) scalano bene su dataset costituiti da decina di migliaia di esempi (i.e., proteine o geni) e su tassonomie costituite da migliaia di classi funzionali. Infine, i metodi proposti in questa tesi sono stati implementati in una libreria software scritta in linguaggio R, HEMDAG (Hierarchical Ensemble Methods per DAG), che è pubblica, liberamente scaricabile e disponibile per i sistemi operativi Linux, Windows e Macintosh.
The standardized annotation of biomedical-related objects, often organized in dedicated catalogues, strongly promoted the organization of biological concepts into controlled vocabularies, i.e. ontologies by which related terms of the underlying biological domain are structured according to a predefined hierarchy. Indeed, large ontologies have been developed by the scientific community to structure and organize the gene and protein taxonomy of all the living organisms from Archea to Metazoa, i.e. the Gene Ontology, or human-specific ontologies, such as the Human Phenotype Ontology, that provides a structured taxonomy of the abnormal human phenotypes associated with diseases. These ontologies, offering a coded and well-defined classification space for biological entities such as genes and proteins, favor the development of machine learning methods able to predict features of biological objects, like the association between a human gene and a disease, with the aim to drive wet-lab research, allowing a reduction of the costs and a more effective usage of the available research funds. Despite the soundness of the aforementioned objectives, the resulting multi-label classification problems raise such complex machine learning issues that until recently the most common approach by far was "flat" prediction, i.e. simply training a classifier for each term in the controlled vocabulary and ignoring the relationships between terms. This approach was justified not only by the need to reduce the computational complexity of the learning task, but also by the somewhat "unstable" nature of the terms composing the controlled vocabularies, because they were (and are) updated on a monthly basis in a process performed by expert curators and based on biomedical literature, and wet and in-silico experiments. In this context, two main general classes of classifiers have been proposed in literature.
On the one hand, "hierarchy-unaware" learning methods predict labels in a "flat" way without exploiting the inherent structure of the annotation space. On the other hand, "hierarchy-aware" learning methods can improve the accuracy and the precision of the predictions by considering the hierarchical relationships between ontology terms. Moreover, these methods can guarantee the consistency of the predicted labels according to the "true path rule", that is the biological and logical rule that governs the internal coherence of biological ontologies. To properly handle the hierarchical relationships linking the ontology terms, two main classes of structured output methods have been proposed in literature: the first one is based on kernelized methods for structured output spaces, the second on hierarchical ensemble methods for ontology-based predictions. However, both these approaches suffer from significant drawbacks. The kernel-based methods for structured output spaces are computationally intensive and do not scale well when applied to complex multi-label bio-ontologies. Most hierarchical ensemble methods have been conceived for tree-structured taxonomies, and the few ones specifically developed for prediction in DAG-structured output spaces are, in most cases, unable to improve prediction performance over flat methods. To overcome these limitations, in this thesis novel "ontology-aware" ensemble methods have been developed, able to handle DAG-structured ontologies, leveraging previous results obtained with "true-path-rule"-based hierarchical learning algorithms. These methods are highly modular in the sense that they adopt a "two-step" learning strategy: in the first step they learn separately each term of the ontology using flat methods, and in the second they properly combine the flat predictions according to the hierarchy of the classes. The main contributions of this thesis are both methodological and experimental.
From a methodological standpoint, novel hierarchical ensemble methods are proposed, including: a) HTD (Hierarchical Top-Down algorithm for DAG structured ontologies); b) TPR-DAG (True Path Rule ensemble for DAG) with several variants; c) ISO-TPR, a novel ensemble method that combines the True Path Rule approach with Isotonic Regression. For all these methods a formal proof of their consistency, i.e. the guarantee of providing predictions that “respect” the hierarchical relationships between classes, is provided. From an experimental standpoint, extensive genome and ontology-wide results show that the proposed methods: a) are competitive with state-of-the-art prediction algorithms; b) are able to improve flat machine learning classifiers, if the base learners can provide non random predictions; c) are able to predict new associations between genes and human abnormal phenotypes, a crucial step to discover novel genes associated with human diseases ranging from genetic disorders to cancer; d) scale nicely with large datasets and bio-ontologies. Finally HEMDAG, a novel R library implementing the proposed hierarchical ensemble methods has been developed and publicly delivered.
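The top-down consistency correction underlying the HTD algorithm can be sketched as follows. This is a simplified reading of the method, assuming Python 3.9+ for the standard-library `graphlib`; the class names and flat scores are hypothetical.

```python
from graphlib import TopologicalSorter

def htd_dag(flat_scores, parents):
    """Top-down hierarchical correction: visit classes in topological
    order and cap each child's flat score at the minimum score of its
    (already corrected) parents, so that the final predictions respect
    the true-path rule of the DAG-structured ontology."""
    fixed = dict(flat_scores)
    # TopologicalSorter takes {node: predecessors}; parents fit directly
    for node in TopologicalSorter(parents).static_order():
        for p in parents.get(node, ()):
            fixed[node] = min(fixed[node], fixed[p])
    return fixed

flat = {"root": 0.9, "a": 0.95, "b": 0.4, "leaf": 0.7}
dag  = {"root": [], "a": ["root"], "b": ["root"], "leaf": ["a", "b"]}
consistent = htd_dag(flat, dag)   # "a" capped at 0.9, "leaf" at 0.4
```

After the correction, no class scores higher than any of its ancestors, which is exactly the consistency property proved formally for the ensemble methods in the thesis.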
APA, Harvard, Vancouver, ISO, and other styles
20

Banfield, Robert E. "Learning on complex simulations." [Tampa, Fla.] : University of South Florida, 2007. http://purl.fcla.edu/usf/dc/et/SFE0002112.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Kankanala, Padmavathy. "Machine learning methods for the estimation of weather and animal-related power outages on overhead distribution feeders." Diss., Kansas State University, 2013. http://hdl.handle.net/2097/16914.

Full text
Abstract:
Doctor of Philosophy
Department of Electrical and Computer Engineering
Sanjoy Das and Anil Pahwa
Because a majority of day-to-day activities rely on electricity, it plays an important role in daily life. Without electricity, the flip of a switch would no longer produce instant light, televisions and refrigerators would be nonexistent, and hundreds of conveniences often taken for granted would be impossible. Electricity has become a basic necessity, and so any interruption in service due to disturbances in power lines causes great inconvenience to customers. Customers and utility commissions expect a high level of reliability. Power distribution systems are geographically dispersed, and their exposure to the environment makes them a highly vulnerable part of power systems with respect to failures and interruption of service to customers. Following the restructuring and increased competition in the electric utility industry, distribution system reliability has acquired greater significance. A better understanding of the causes and consequences of distribution interruptions is helpful in maintaining distribution systems, designing reliable systems, installing protection devices, and addressing environmental issues. Various events, such as equipment failure, animal activity, tree fall, wind, and lightning, can negatively affect power distribution systems. Weather is one of the primary causes affecting distribution system reliability. Unfortunately, as weather-related outages are highly random, predicting their occurrence is an arduous task. To study the impact of weather on overhead distribution systems, several models, such as linear and exponential regression models, a neural network model, and ensemble methods, are presented in this dissertation. The models were extended to study the impact of animal activity on outages in overhead distribution systems.
Outage, lightning, and weather data for four Kansas cities of various sizes from 2005 to 2011 were provided by Westar Energy, Topeka, and by the state climate office at Kansas State University weather services. The models developed are applied to estimate daily outages. Performance tests show that regression and neural network models estimate outages well overall but fail in the lower and upper ranges of observed values. The introduction of committee machines inspired by the "divide and conquer" principle overcomes this problem. Simulation results show that the mixture-of-experts model is the most effective, followed by the AdaBoost model, in estimating daily outages. Similar results on the performance of these models were found for animal-caused outages.
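A mixture-of-experts combination in the "divide & conquer" spirit described above might be sketched like this. The input features, expert weights, and gating parameters below are illustrative, not the dissertation's fitted models: each linear expert can specialize in part of the outage range, and a softmax gate blends their estimates per input.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mixture_of_experts(x, experts, gate_w):
    """Combine expert estimates with input-dependent softmax gating, so
    each expert handles the region of the input space it models best."""
    preds = np.array([w @ x for w in experts])   # one estimate per expert
    weights = softmax(gate_w @ x)                # gating network output
    return float(weights @ preds)

x = np.array([1.0, 0.3])                 # e.g. [wind speed, lightning index]
experts = [np.array([2.0, 0.5]), np.array([0.5, 4.0])]
gate_w = np.array([[1.0, -1.0], [-1.0, 1.0]])
estimate = mixture_of_experts(x, experts, gate_w)
```

Because the gate outputs a convex combination, the ensemble estimate always lies between the lowest and highest expert predictions, which helps cover the low and high outage ranges where single models fail.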
APA, Harvard, Vancouver, ISO, and other styles
22

Memari, Majid. "Predicting the Stock Market Using News Sentiment Analysis." OpenSIUC, 2018. https://opensiuc.lib.siu.edu/theses/2442.

Full text
Abstract:
ABSTRACT MAJID MEMARI, for the Master of Science degree in Computer Science, presented on November 3rd, 2017 at Southern Illinois University, Carbondale, IL. Title: PREDICTING THE STOCK MARKET USING NEWS SENTIMENT ANALYSIS Major Professor: Dr. Norman Carver Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. GDELT is the largest, most comprehensive, and highest-resolution open database ever created. It is a platform that monitors the world's news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages, every moment of every day, stretching all the way back to January 1st, 1979, and updated daily [1]. Stock market prediction is the act of trying to determine the future value of a company stock or other financial instrument traded on an exchange. The successful prediction of a stock's future price could yield significant profit. The efficient-market hypothesis suggests that stock prices reflect all currently available information, and any price changes that are not based on newly revealed information are thus inherently unpredictable [2]. On the other hand, other studies show that it is predictable. Stock market prediction has long been an attractive topic and is extensively studied by researchers in different fields, with numerous studies of the correlation between stock market fluctuations and different data sources derived from the historical data of major world stock indices or from external information from social media and news [6]. The main objective of this research is to investigate the accuracy of predicting the unseen prices of the Dow Jones Industrial Average using information derived from the GDELT database. The Dow Jones Industrial Average (DJIA) is a stock market index, and one of several indices created by Wall Street Journal editor and Dow Jones & Company co-founder Charles Dow.
This research is based on data sets of events from the GDELT database and daily prices of the DJIA from Yahoo Finance, all from March 2015 to October 2017. First, several classification machine learning models are applied to the generated data sets, and then several ensemble methods are applied as well. In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Afterwards, performance is evaluated for each model using the optimized parameters. Finally, experimental results show that using ensemble methods has a significant positive impact on the prediction accuracy. Keywords: Big Data, GDELT, Stock Market, Prediction, Dow Jones Index, Machine Learning, Ensemble Methods
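One of the simplest ensemble combinations covered by the definition above, majority voting over the base classifiers' up/down calls, can be sketched as follows. The base-model predictions are invented for illustration; the thesis evaluates several such combination schemes with optimized parameters.

```python
import numpy as np

def majority_vote(predictions):
    """Combine the up(1)/down(0) calls of several base classifiers:
    the ensemble predicts whichever class receives the most votes."""
    votes = np.asarray(predictions)              # shape: (n_models, n_days)
    return (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

# hypothetical daily up/down calls from three base models over five days
model_preds = [[1, 0, 1, 1, 0],
               [1, 1, 0, 1, 0],
               [0, 0, 1, 1, 1]]
combined = majority_vote(model_preds)            # -> [1, 0, 1, 1, 0]
```

Even this crude scheme can outperform each constituent model when their errors are not strongly correlated, which is the effect the abstract reports for its ensemble methods.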
APA, Harvard, Vancouver, ISO, and other styles
23

Jaber, Ghazal. "An approach for online learning in the presence of concept changes." Phd thesis, Université Paris Sud - Paris XI, 2013. http://tel.archives-ouvertes.fr/tel-00907486.

Full text
Abstract:
Learning from data streams is emerging as an important application area. When the environment changes, it is necessary to rely on on-line learning with the capability to adapt to changing conditions, a.k.a. concept drifts. Adapting to concept drifts entails forgetting some or all of the previously acquired knowledge when the concept changes, while accumulating knowledge about the supposedly stationary underlying concept. This tradeoff is called the stability-plasticity dilemma. Ensemble methods have been among the most successful approaches. However, the management of the ensemble, which ultimately controls how past data is forgotten, has not been thoroughly investigated so far. Our work shows the importance of the forgetting strategy by comparing several approaches. The results thus obtained lead us to propose a new ensemble method with an enhanced forgetting strategy to adapt to concept drifts. Experimental comparisons show that our method compares favorably with well-known state-of-the-art systems. The majority of previous works focused only on means to detect changes and to adapt to them. In our work, we go one step further by introducing a meta-learning mechanism that is able to detect relevant states of the environment, to recognize recurring contexts and to anticipate likely concept changes. Hence, the method we suggest deals both with the challenge of optimizing the stability-plasticity dilemma and with the anticipation and recognition of incoming concepts. This is accomplished through an ensemble method that controls an ensemble of incremental learners. The management of the ensemble of learners enables one to naturally adapt to the dynamics of the concept changes with very few parameters to set, while a learning mechanism managing the changes in the ensemble provides means for the anticipation of, and the quick adaptation to, the underlying modification of the context.
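A forgetting strategy of the kind discussed above, decaying and eventually dropping ensemble members that err on recent stream examples, might be sketched as follows. The member names, decay factor, and floor are illustrative, not the thesis's actual mechanism or parameters.

```python
def update_ensemble(weights, correct, beta=0.7, floor=0.05):
    """One step of a simple forgetting strategy for a drifting stream:
    decay the weight of every member that erred on the latest example;
    members whose weight falls below the floor are dropped (forgotten),
    freeing a slot for a fresh learner trained on recent data."""
    decayed = {m: (w if correct[m] else w * beta) for m, w in weights.items()}
    return {m: w for m, w in decayed.items() if w >= floor}

weights = {"old_model": 0.06, "mid_model": 0.6, "new_model": 1.0}
correct = {"old_model": False, "mid_model": True, "new_model": True}
weights = update_ensemble(weights, correct)  # old_model falls below the floor
```

The decay factor and floor together set the stability-plasticity tradeoff: aggressive decay adapts quickly to drift but forgets a stable concept faster.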
APA, Harvard, Vancouver, ISO, and other styles
24

Li, Yichao. "Algorithmic Methods for Multi-Omics Biomarker Discovery." Ohio University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1541609328071533.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Lundberg, Jacob. "Resource Efficient Representation of Machine Learning Models : investigating optimization options for decision trees in embedded systems." Thesis, Linköpings universitet, Statistik och maskininlärning, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-162013.

Full text
Abstract:
Combining embedded systems and machine learning models is an exciting prospect. However, to fully target any embedded system, with the most stringent resource requirements, the models have to be designed with care not to overwhelm it. Decision tree ensembles are targeted in this thesis. A benchmark model is created with LightGBM, a popular framework for gradient boosted decision trees. This model is first transformed and regularized with RuleFit, a LASSO regression framework. Then it is further optimized with quantization and weight sharing, techniques used when compressing neural networks. The entire process is combined into a novel framework, called ESRule. The data used comes from the domain of frequency measurements in cellular networks. There is a clear use case where embedded systems can use the produced resource-optimized models. Compared with LightGBM, ESRule uses 72× less internal memory on average, while simultaneously increasing predictive performance. The models use 4 kilobytes on average. The serialized variant of ESRule uses 104× less hard disk space than LightGBM. ESRule is also clearly faster at predicting a single sample.
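Weight sharing as applied to tree ensembles, clustering the leaf values into a small shared codebook, can be sketched with a 1-D k-means. This is a generic illustration of the compression technique, not the ESRule implementation; the leaf values are invented.

```python
import numpy as np

def quantize_leaves(leaf_values, n_bins=4, n_iter=20):
    """Weight sharing for tree ensembles: cluster leaf values into a
    small codebook (1-D k-means), so each leaf stores only a codebook
    index (log2(n_bins) bits) instead of a full float."""
    values = np.asarray(leaf_values, dtype=float)
    centers = np.quantile(values, np.linspace(0, 1, n_bins))  # init spread
    for _ in range(n_iter):
        codes = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(n_bins):
            if np.any(codes == k):
                centers[k] = values[codes == k].mean()
    return centers, codes

leaves = [0.11, 0.09, 0.52, 0.48, -0.30, -0.28, 0.10, 0.50]
codebook, codes = quantize_leaves(leaves, n_bins=3)
# prediction uses codebook[codes] in place of the original leaf values
```

With 3 shared values, each leaf costs 2 bits plus its share of the tiny codebook, at the price of a small reconstruction error in the predictions.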
APA, Harvard, Vancouver, ISO, and other styles
26

Börthas, Lovisa, and Sjölander Jessica Krange. "Machine Learning Based Prediction and Classification for Uplift Modeling." Thesis, KTH, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-266379.

Full text
Abstract:
The desire to model the true gain from targeting an individual for marketing purposes has led to the common use of uplift modeling. Uplift modeling requires the existence of a treatment group as well as a control group, and the objective hence becomes estimating the difference between the success probabilities in the two groups. Efficient methods for estimating the probabilities in uplift models are statistical machine learning methods. In this project the different uplift modeling approaches Subtraction of Two Models, Modeling Uplift Directly and the Class Variable Transformation are investigated. The statistical machine learning methods applied are Random Forests and Neural Networks, along with the standard method Logistic Regression. The data is collected from a well-established retail company, and the purpose of the project is thus to investigate which uplift modeling approach and statistical machine learning method yield the best performance given the data used in this project. The variable selection step was shown to be a crucial component in the modeling processes, as was the amount of control data in each data set. For the uplift to be successful, the method of choice should be either Modeling Uplift Directly using Random Forests, or the Class Variable Transformation using Logistic Regression. Neural network-based approaches are sensitive to uneven class distributions and are hence not able to obtain stable models given the data used in this project. Furthermore, the Subtraction of Two Models did not perform well due to the fact that each model tended to focus too much on modeling the class in both data sets separately instead of modeling the difference between the class probabilities. The conclusion is hence to use an approach that models the uplift directly, and also to use a large amount of control data in each data set.
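The Class Variable Transformation mentioned above admits a very compact formulation. The following is a minimal illustrative sketch, not the thesis's implementation, and it assumes equal-sized treatment and control groups: a transformed label Z is set to 1 when a treated individual responds or an untreated individual does not, and the uplift is then recovered from any classifier's estimate of P(Z=1|x).

```python
# Sketch of the Class Variable Transformation for uplift modeling.
# Assumption (not from the thesis text): treatment and control groups
# are equally sized, which the transformation requires.

def transform(y, t):
    """Map (binary outcome y, treatment flag t) to the transformed label Z.

    Z = 1 for treated responders and untreated non-responders,
    i.e. exactly when y == t for binary y and t."""
    return 1 if y == t else 0

def uplift_from_z_probability(p_z):
    """Recover the uplift estimate from a model's P(Z = 1 | x)."""
    return 2.0 * p_z - 1.0

# Worked example: a classifier that is 75% sure Z = 1 for some x
# implies an estimated uplift of 0.5 for that x.
print(uplift_from_z_probability(0.75))
```

Any standard classifier (e.g. the logistic regression used in the thesis) can then be trained on (x, Z) pairs instead of modeling treatment and control separately.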
APA, Harvard, Vancouver, ISO, and other styles
27

Fiterau, Madalina. "Discovering Compact and Informative Structures through Data Partitioning." Research Showcase @ CMU, 2015. http://repository.cmu.edu/dissertations/792.

Full text
Abstract:
In many practical scenarios, prediction for high-dimensional observations can be accurately performed using only a fraction of the existing features. However, the set of relevant predictive features, known as the sparsity pattern, varies across data. For instance, features that are informative for a subset of observations might be useless for the rest. In fact, in such cases, the dataset can be seen as an aggregation of samples belonging to several low-dimensional sub-models, potentially due to different generative processes. My thesis introduces several techniques for identifying sparse predictive structures and the areas of the feature space where these structures are effective. This information allows the training of models which perform better than those obtained through traditional feature selection. We formalize Informative Projection Recovery, the problem of extracting a set of low-dimensional projections of data which jointly form an accurate solution to a given learning task. Our solution to this problem is a regression-based algorithm that identifies informative projections by optimizing over a matrix of point-wise loss estimators. It generalizes to a number of machine learning problems, offering solutions to classification, clustering and regression tasks. Experiments show that our method can discover and leverage low-dimensional structure, yielding accurate and compact models. Our method is particularly useful in applications involving multivariate numeric data in which expert assessment of the results is of the essence. Additionally, we developed an active learning framework which works with the obtained compact models in finding unlabeled data deemed to be worth expert evaluation. For this purpose, we enhance standard active selection criteria using the information encapsulated by the trained model. The advantage of our approach is that the labeling effort is expended mainly on samples which benefit models from the hypothesis class we are considering. 
Additionally, the domain experts benefit from the availability of informative axis aligned projections at the time of labeling. Experiments show that this results in an improved learning rate over standard selection criteria, both for synthetic data and real-world data from the clinical domain, while the comprehensible view of the data supports the labeling process and helps preempt labeling errors.
APA, Harvard, Vancouver, ISO, and other styles
28

Al-Mter, Yusur. "Automatic Prediction of Human Age based on Heart Rate Variability Analysis using Feature-Based Methods." Thesis, Linköpings universitet, Statistik och maskininlärning, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166139.

Full text
Abstract:
Heart rate variability (HRV) is the time variation between adjacent heartbeats. This variation is regulated by the autonomic nervous system (ANS) and its two branches, the sympathetic and parasympathetic nervous systems. HRV is considered an essential clinical tool for estimating the imbalance between the two branches, and hence an indicator of age and cardiac-related events. This thesis focuses on ECG recordings during nocturnal rest to estimate the influence of HRV in predicting the age decade of healthy individuals. Time- and frequency-domain as well as non-linear methods are explored to extract the HRV features. Three feature-based methods (support vector machine (SVM), random forest, and extreme gradient boosting (XGBoost)) were employed, and the overall test accuracy achieved in capturing the actual class was relatively low (below 30%). The SVM classifier had the lowest performance, while random forest and XGBoost performed slightly better. Although the difference is negligible, random forest had the highest test accuracy, approximately 29%, using a subset of ten optimal HRV features. Furthermore, to validate the findings, the original dataset was shuffled and used as a test set, and the performance was compared to other related research outputs.
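Time-domain HRV features of the kind mentioned above are simple statistics over the RR intervals. A minimal sketch of two standard ones, SDNN and RMSSD (the thesis's exact feature set is not reproduced here, and the RR values below are made up):

```python
import math

def sdnn(rr_ms):
    """SDNN: standard deviation of RR intervals (time-domain HRV feature)."""
    mean = sum(rr_ms) / len(rr_ms)
    return math.sqrt(sum((r - mean) ** 2 for r in rr_ms) / len(rr_ms))

def rmssd(rr_ms):
    """RMSSD: root mean square of successive RR-interval differences."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

rr = [812, 790, 805, 830, 798, 815]   # illustrative RR intervals in milliseconds
print(round(sdnn(rr), 2), round(rmssd(rr), 2))
```

Features like these, together with frequency-domain and non-linear ones, would form the input vectors fed to the SVM, random forest and XGBoost classifiers.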
APA, Harvard, Vancouver, ISO, and other styles
29

Kueterman, Nathan. "Comparative Study of Classification Methods for the Mitigation of Class Imbalance Issues in Medical Imaging Applications." University of Dayton / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=dayton1591611376235015.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Pereira, Vinicius Gomes. "Using supervised machine learning and sentiment analysis techniques to predict homophobia in portuguese tweets." reponame:Repositório Institucional do FGV, 2018. http://hdl.handle.net/10438/24301.

Full text
Abstract:
This work studies the identification of homophobic tweets using a natural language processing and machine learning approach. The goal is to construct a predictive model that can detect, with reasonable accuracy, whether a tweet contains content offensive to LGBT individuals or not. The database used to train the predictive models was constructed by aggregating tweets from users who have interacted with politicians and/or political parties in Brazil. Tweets containing LGBT-related terms or references to openly LGBT individuals were collected and manually classified. A large part of this work lies in constructing features that accurately capture not only the text of the tweet but also specific characteristics of the users and their language choices. In particular, the use of swear words and strong vocabulary is a strong predictor of offensive tweets. Naturally, n-grams and term-weighting schemes were also considered as features of the model. A total of 12 sets of features were constructed. A broad range of machine learning techniques was employed in the classification task: naive Bayes, regularized logistic regressions, feedforward neural networks, extreme gradient boosting (XGBoost), random forest and support vector machines. After estimating and tuning each model, they were combined using voting and stacking. Voting using 10 models obtained the best result, with 89.42% accuracy.
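Hard voting, the combination scheme that performed best above, can be sketched in a few lines. This is a generic illustration with hypothetical labels, not the author's code:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-sample label predictions from several classifiers
    by taking the most frequent vote for each sample."""
    combined = []
    for votes in zip(*predictions):   # one tuple of votes per sample
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical outputs of three of the tuned classifiers on three tweets.
model_a = ["offensive", "neutral",   "neutral"]
model_b = ["offensive", "offensive", "neutral"]
model_c = ["neutral",   "offensive", "neutral"]
print(majority_vote([model_a, model_b, model_c]))
```

With an odd number of binary classifiers there are no ties; the thesis's 10-model voting would need a tie-breaking rule (here, `Counter` falls back to first-seen order).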
APA, Harvard, Vancouver, ISO, and other styles
31

Zhao, Xiaochuang. "Ensemble Learning Method on Machine Maintenance Data." Scholar Commons, 2015. http://scholarcommons.usf.edu/etd/6056.

Full text
Abstract:
In the industry, many companies are facing the explosion of big data. With this much information stored, companies want to make sense of the data and use it for better decision making, especially for future prediction. A lot of money can be saved and huge revenue can be generated with the power of big data. When building statistical learning models for prediction, companies in the industry aim to build models that are efficient and highly accurate. After the learning models have been deployed to production, new data will be generated. With the updated data, the models have to be updated as well. Due to this nature, the model that performs best today will not necessarily perform best tomorrow. Thus, it is very hard to decide which algorithm should be used to build the learning model. This paper introduces a new method that ensembles the information generated by two different classification algorithms as inputs for another learning model, to increase the final prediction power. The dataset used in this paper is NASA's Turbofan Engine Degradation data. There are 49 numeric features (X), and the response Y is binary, with 0 indicating the engine is working properly and 1 indicating engine failure. The model's purpose is to predict whether the engine is going to pass or fail. The dataset is divided into a training set and a testing set. First, the training set is used twice to build support vector machine (SVM) and neural network models. Second, the trained SVM and neural network models take X of the training set as input to predict Y1 and Y2. Then, Y1 and Y2 serve as inputs to build the penalized logistic regression model, which is the ensemble model here. Finally, the testing set follows the same steps to get the final prediction result. The model accuracy is calculated using overall classification accuracy. The result shows that the ensemble model has 92% accuracy. The prediction accuracies of the SVM, neural network and ensemble models are compared to prove that the ensemble model successfully captured the power of the two individual learning models.
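The ensemble scheme described above, feeding base-model predictions into a logistic-regression combiner, can be sketched as follows. This is a toy illustration, not the thesis's SVM/neural-network pipeline: the base predictions Y1, Y2 are hard-coded, the combiner is a tiny unpenalized logistic regression trained by stochastic gradient descent, and the truth labels are made up so that the first base model happens to be the reliable one.

```python
import math

def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Tiny logistic-regression combiner trained with stochastic gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                         # gradient of log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0

# Stacked inputs: (prediction of "SVM", prediction of "neural net") per sample.
base_preds = [(0, 0), (0, 1), (1, 0), (1, 1)]
truth      = [0,      0,      1,      1]     # the first base model is reliable here
w, b = train_logistic([list(x) for x in base_preds], truth)
print([predict(w, b, list(x)) for x in base_preds])
```

On this separable toy data the combiner learns to weight the first base model heavily, which mirrors the idea of letting the meta-model decide how much to trust each base learner.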
APA, Harvard, Vancouver, ISO, and other styles
32

Ezzeddine, Diala. "A contribution to topological learning and its application in Social Networks." Thesis, Lyon 2, 2014. http://www.theses.fr/2014LYO22011/document.

Full text
Abstract:
Supervised Learning is a popular field of Machine Learning that has made recent progress. In particular, many methods and procedures have been developed to solve the classification problem. Most classical methods in Supervised Learning use the density estimation of data to construct their classifiers. In this dissertation, we show that the topology of data can be a good alternative for constructing classifiers. We propose using topological graphs such as Gabriel Graphs (GG) and Relative Neighborhood Graphs (RNG), which capture the topology of data through its neighborhood structure. To apply this concept, we create a new method called Random Neighborhood Classification (RNC). In this method, we use topological graphs to construct classifiers and then apply Ensemble Methods (EM) to extract all relevant information from the data. EM, well known in Machine Learning, generates many classifiers from data and then aggregates them into one. Aggregate classifiers have been shown to be very efficient in many studies, because they leverage relevant and effective information from each generated classifier. We first compare RNC to other known classification methods using data from the UCI Irvine repository. We find that RNC performs very well compared to very efficient methods such as Random Forests and Support Vector Machines. Most of the time, it ranks among the top three methods in efficiency. This result encouraged us to study the efficiency of RNC on real data such as tweets. Twitter, a microblogging social network, is especially useful for mining opinion on current affairs and topics that span the range of human interest, including politics. Mining political opinion from Twitter poses peculiar challenges, such as the versatility of authors when expressing their political views, which motivate this study. We define a new attribute, called a couple, that is very helpful in studying opinion in tweets: a couple is an author together with a politician that author talks about. We propose a new procedure that focuses on identifying the opinion of a tweet using couples. We think that focusing on a couple's opinion, as expressed over several tweets, can overcome the problems of analysing each single tweet. This approach can help avoid the versatility, language ambiguity and many other artifacts that are easy for a human being to understand but hard for a machine to resolve automatically. We use classical Machine Learning techniques such as KNN and Random Forests (RF), as well as our method RNC. We proceed in two steps: first, we build a reference set of classified couples using Naive Bayes, and we also apply a second, alternative method, a sampling plan procedure, to compare and evaluate the results of the Naive Bayes method. Second, we evaluate the performance of this approach using proximity measures in order to apply RNC, RF and KNN. The experiments are based on real tweet data from the 2012 French presidential election. The results show that this approach works well and that RNC performs very well in classifying opinion in tweets. Topological Learning seems to be a very interesting field to study, in particular to address the classification problem. Many concepts for extracting information from topological graphs remain to be analysed, such as the ones described by Aupetit (2005). Our work shows that Topological Learning can be an effective way to address the classification problem.
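The Gabriel graph underlying RNC has a compact definition: two points p and q are connected iff no third point lies inside the circle having segment pq as its diameter, which holds iff d(p,q)² ≤ d(p,k)² + d(q,k)² for every other point k. A minimal brute-force sketch (illustrative only, not the thesis's implementation):

```python
def gabriel_edges(points):
    """Edges of the Gabriel graph of a 2-D point set (O(n^3) brute force).

    (i, j) is an edge iff for every other point k:
        d(i, j)^2 <= d(i, k)^2 + d(j, k)^2
    i.e. no k falls strictly inside the circle with diameter ij."""
    def d2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    n = len(points)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if all(d2(points[i], points[k]) + d2(points[j], points[k])
                   >= d2(points[i], points[j])
                   for k in range(n) if k not in (i, j)):
                edges.append((i, j))
    return edges

# The middle point sits inside the diametral circle of the two outer
# points, so the long edge (0, 1) is blocked.
pts = [(0, 0), (2, 0), (1, 0.2)]
print(gabriel_edges(pts))
```

Because the graph depends only on neighborhood relations, not on local density, it matches the thesis's motivation for density-independent classifiers.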
APA, Harvard, Vancouver, ISO, and other styles
33

Liu, Xuan. "An Ensemble Method for Large Scale Machine Learning with Hadoop MapReduce." Thèse, Université d'Ottawa / University of Ottawa, 2014. http://hdl.handle.net/10393/30702.

Full text
Abstract:
We propose a new ensemble algorithm: the meta-boosting algorithm. This algorithm enables the original AdaBoost algorithm to improve the decisions made by different weak learners using a meta-learning approach. Better accuracy is achieved since this algorithm reduces both bias and variance. However, higher accuracy also brings higher computational complexity, especially on big data. We therefore propose a parallelized meta-boosting algorithm, Parallelized-Meta-Learning (PML), using the MapReduce programming paradigm on Hadoop. Experimental results on the Amazon EC2 cloud computing infrastructure show that PML reduces the computational complexity enormously while retaining lower error rates than the results on a single computer. Since MapReduce has the inherent weakness that it cannot directly support iterations in an algorithm, our approach is a win-win method: it not only overcomes this weakness, but also secures good accuracy. A comparison between this approach and the contemporary algorithm AdaBoost.PL is also performed.
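The meta-boosting algorithm builds on AdaBoost; for reference, here is a minimal sketch of standard AdaBoost with one-dimensional threshold stumps (labels in {-1, +1}). This illustrates the base algorithm only, not the PML parallelization or the meta-learning layer:

```python
import math

def train_adaboost(xs, ys, rounds=10):
    """Minimal AdaBoost with threshold stumps on one feature; ys in {-1, +1}."""
    n = len(xs)
    w = [1.0 / n] * n                    # sample weights
    ensemble = []                        # list of (alpha, threshold, polarity)
    thresholds = sorted(set(xs))
    for _ in range(rounds):
        best = None
        for thr in thresholds:           # exhaustive stump search
            for pol in (1, -1):
                preds = [pol if x >= thr else -pol for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol, preds)
        err, thr, pol, preds = best
        err = max(err, 1e-10)            # avoid log(inf) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Reweight: boost the weight of misclassified samples, then normalize.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [-1, -1, -1, 1, 1, 1]
model = train_adaboost(xs, ys)
print([predict(model, x) for x in xs])
```

The per-round reweighting loop is exactly the iteration that plain MapReduce struggles to express, which motivates the PML design.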
APA, Harvard, Vancouver, ISO, and other styles
34

Farrash, Majed. "Machine learning ensemble method for discovering knowledge from big data." Thesis, University of East Anglia, 2016. https://ueaeprints.uea.ac.uk/59367/.

Full text
Abstract:
Big data, generated from various business, internet and social media activities, has become a major challenge for researchers in the fields of machine learning and data mining, who must develop new methods and techniques for analysing big data effectively and efficiently. Ensemble methods represent an attractive approach to the problem of mining large datasets because of their accuracy and their ability to exploit the divide-and-conquer mechanism in parallel computing environments. This research proposes a machine learning ensemble framework and implements it in a high-performance computing environment. The research begins by identifying and categorising the effects of partitioned data subset size on ensemble accuracy when dealing with very large training datasets. An algorithm is then developed to ascertain the patterns of the relationship between ensemble accuracy and the size of partitioned data subsets. The research concludes with the development of a selective modelling algorithm, an efficient alternative to static model selection methods for big datasets. The results show that maximising the size of partitioned data subsets does not necessarily improve the performance of an ensemble of classifiers dealing with large datasets. Identifying the patterns exhibited by the relationship between ensemble accuracy and partitioned data subset size facilitates the determination of the best subset size for partitioning huge training datasets. Finally, traditional model selection is inefficient in cases where large datasets are involved.
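The partitioning scheme studied above, training one classifier per data subset and combining them by vote, can be sketched as follows. One-feature nearest-centroid base learners are an illustrative stand-in for the thesis's classifiers, and the round-robin split is just one possible partitioning:

```python
from collections import Counter

def centroid_model(samples):
    """Train a one-feature nearest-centroid classifier on one data partition."""
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))

def partitioned_ensemble(data, n_parts):
    """Split the data into n_parts subsets, train one model per subset,
    and predict by majority vote over the subset models."""
    parts = [data[i::n_parts] for i in range(n_parts)]   # round-robin split
    models = [centroid_model(p) for p in parts]
    def predict(x):
        votes = [m(x) for m in models]
        return Counter(votes).most_common(1)[0][0]
    return predict

data = [(0.5, "a"), (1.0, "a"), (1.5, "a"), (8.0, "b"), (9.0, "b"), (10.0, "b")]
clf = partitioned_ensemble(data, n_parts=3)
print(clf(1.2), clf(9.5))
```

Varying `n_parts` here is the knob whose effect on ensemble accuracy the thesis investigates at scale.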
APA, Harvard, Vancouver, ISO, and other styles
35

Koco, Sokol. "Méthodes ensembliste pour des problèmes de classification multi-vues et multi-classes avec déséquilibres." Thesis, Aix-Marseille, 2013. http://www.theses.fr/2013AIXM4101/document.

Full text
Abstract:
Nowadays, in many fields, such as bioinformatics or multimedia, data may be described using different sets of features, also called views. For a given classification task, we distinguish two types of views: strong views, which are suited to the task, and weak views, suited to a (small) part of the task; in multi-class learning, a view can be strong with respect to some (few) classes and weak for the rest: these are imbalanced views. The works presented in this thesis fall within the supervised learning setting, and their aim is to address the problem of multi-view learning under strong, weak and imbalanced views, regrouped under the notion of uneven views. The first contribution of this thesis is a multi-view learning algorithm based on the same framework as AdaBoost.MM. The second part of this thesis proposes a unifying framework for supervised methods dealing with imbalanced classes (some classes being more represented than others). In the third part, we tackle the uneven-views problem by combining the imbalanced-classes framework with the between-views cooperation used to take advantage of the multiple views. In order to test the proposed methods on real-world data, we consider the task of phone call classification, which constitutes the subject of the ANR DECODA project. Each part of this thesis deals with different aspects of the problem.
APA, Harvard, Vancouver, ISO, and other styles
36

Ferreira, Ednaldo José. "Método baseado em rotação e projeção otimizadas para a construção de ensembles de modelos." Universidade de São Paulo, 2012. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-27062012-161603/.

Full text
Abstract:
The development of new techniques capable of inducing predictive models with low generalization errors has been a constant in machine learning and related areas. In this context, the composition of an ensemble of models should be highlighted due to its theoretical and empirical potential to minimize the generalization error. Several methods for building ensembles are found in the literature. Among them, the rotation-based (RB) method has become known for outperforming other traditional methods. The RB method applies principal component analysis (PCA) for feature extraction as a rotation strategy to provide diversity and accuracy among base models. However, this strategy does not ensure that the resulting direction is appropriate for the supervised learning technique (SLT). Moreover, the RB method is not suitable for rotation-invariant SLTs, and it has not been evaluated with stable ones, which makes RB inappropriate for, and/or restricted to, only some SLTs. This thesis proposes a new approach for feature extraction based on the concatenation of rotation and projection optimized for the SLT (called optimized roto-projection). The approach uses a metaheuristic to optimize the parameters of the roto-projection transformation, minimizing the error of the technique that directs the optimization. More emphatically, optimized roto-projection is proposed as a fundamental part of a new ensemble method, called the optimized roto-projection ensemble (ORPE). The results show that optimized roto-projection can reduce the dimensionality and the complexity of the data and of the model, and can increase the performance of the SLT subsequently applied. The ORPE outperformed, with statistical significance, RB and other methods using stable and unstable SLTs for classification and regression, on databases from public and private domains. The ORPE method was unrestricted and highly effective, holding the first position in every dominance ranking.
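The PCA rotation step at the heart of RB-style ensembles can be illustrated in two dimensions, where the leading principal component's angle has a closed form, θ = ½·atan2(2·c_xy, c_xx − c_yy). This sketch shows only that rotation, not the metaheuristic roto-projection optimization proposed in the thesis:

```python
import math

def pca_rotation_2d(points):
    """Return the angle of the leading principal component of 2-D data and a
    rotator that aligns that component with the first axis (the RB 'rotation')."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)   # leading-eigenvector angle
    def rotate(p):
        c, s = math.cos(-theta), math.sin(-theta)  # rotate by -theta
        x, y = p[0] - mx, p[1] - my
        return (c * x - s * y, s * x + c * y)
    return theta, rotate

# Points on the line y = x: the leading component sits at 45 degrees,
# and rotated points collapse onto the first axis.
pts = [(0, 0), (1, 1), (2, 2), (3, 3)]
theta, rotate = pca_rotation_2d(pts)
print(round(math.degrees(theta), 1))
```

In a rotation-based ensemble, each member would apply such a rotation (fitted on a resampled feature subset) before training its base model; the thesis's contribution is to optimize this transformation for the SLT instead of taking the PCA axes as given.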
APA, Harvard, Vancouver, ISO, and other styles
37

Hadjem, Medina. "Contribution à l'analyse et à la détection automatique d'anomalies ECG dans le cas de l'ischémie myocardique." Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCB011.

Full text
Abstract:
Recent advances in sensing and miniaturization of ultra-low power devices allow for more intelligent and wearable health monitoring sensor-based systems. The sensors are capable of collecting vital signs, such as heart rate, temperature, oxygen saturation, blood pressure, ECG, EMG, etc., and communicate wirelessly the collected data to a remote device and/or smartphone. Nowadays, these aforementioned advances have led a large research community to have interest in the design and development of new biomedical data analysis systems, particularly electrocardiogram (ECG) analysis systems. Aimed at contributing to this broad research area, we have mainly focused in this thesis on the automatic analysis and detection of coronary heart diseases, such as Ischemia and Myocardial Infarction (MI), that are well known to be the leading death causes worldwide. Toward this end, and because the ECG signals are deemed to be very noisy and not stationary, our challenge was first to extract the relevant parameters without losing their main features. This particular issue has been widely addressed in the literature and does not represent the main purpose of this thesis. However, as it is a prerequisite, it required us to understand the state of the art proposed methods and select the most suitable one for our work. Based on the ECG parameters extracted, particularly the ST segment and the T wave parameters, we have contributed with two different approaches to analyze the ECG records: (1) the first analysis is performed in the time series level, in order to detect abnormal elevations of the ST segment and the T wave, known to be an accurate predictor of ischemia or MI; (2) the second analysis is performed at the ECG beat level to automatically classify the ST segment and T wave anomalies within different categories. This latter approach is the most commonly used in the literature. 
However, lacking a performance comparison standard in the state of the art existing works, we have carried out our own comparison of the actual classification methods by taking into account diverse ST and T anomaly classes, several performance evaluation parameters, as well as several ECG signal leads. To obtain more realistic performances, we have also performed the same study in the presence of other frequent cardiac anomalies, such as arrhythmia. Based on this substantial comparative study, we have proposed a new classification approach of seven ST-T anomaly classes, by using a hybrid of the boosting and the random under sampling methods, our goal was ultimately to reach the best tradeoff between true-positives and false-positives
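The hybrid of boosting and random undersampling described in this abstract can be sketched in a few lines. This is a hedged illustration only, not the author's implementation: the toy one-dimensional data, the decision-stump base learner, and all function names are assumptions made for the example.

```python
import math
import random

def undersample(X, y, seed=0):
    """Randomly drop majority-class samples until both classes are balanced."""
    rng = random.Random(seed)
    pos = [(x, l) for x, l in zip(X, y) if l == 1]
    neg = [(x, l) for x, l in zip(X, y) if l == -1]
    major, minor = (neg, pos) if len(neg) > len(pos) else (pos, neg)
    data = rng.sample(major, len(minor)) + minor
    rng.shuffle(data)
    return [x for x, _ in data], [l for _, l in data]

def stump(x, thr, sign):
    """Decision stump: predict `sign` when the feature reaches the threshold."""
    return sign if x >= thr else -sign

def adaboost(X, y, rounds=10):
    """Plain AdaBoost over 1-D stumps; returns (alpha, thr, sign) triples."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        err, thr, sign = min(
            (sum(wi for wi, xi, yi in zip(w, X, y) if stump(xi, t, s) != yi), t, s)
            for t in sorted(set(X)) for s in (1, -1))
        if err >= 0.5:
            break
        err = max(err, 1e-10)  # avoid log(0) on perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, thr, sign))
        # Re-weight: misclassified samples gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * yi * stump(xi, thr, sign))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    score = sum(a * stump(x, t, s) for a, t, s in model)
    return 1 if score >= 0 else -1

# Imbalanced toy data: 8 "normal" beats (-1) vs 2 "anomalous" beats (+1).
X = [0.05, 0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.45, 0.8, 0.9]
y = [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1]
Xb, yb = undersample(X, y)
model = adaboost(Xb, yb)
```

Undersampling rebalances the classes before boosting so that the boosted ensemble is not dominated by the majority class, which is the intent of the hybrid named in the abstract.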
APA, Harvard, Vancouver, ISO, and other styles
38

Silva, Bernardes Juliana. "Evolution et apprentissage automatique pour l'annotation fonctionnelle et la classification des homologies lointains en protéines." Phd thesis, Université Pierre et Marie Curie - Paris VI, 2012. http://tel.archives-ouvertes.fr/tel-00684155.

Full text
Abstract:
The detection of remote homologs is essential for the functional and structural classification of protein sequences and for improving the annotation of highly divergent genomes. For sequence classification, we present the "ILP-SVM homology" method, which combines inductive logic programming (ILP) with propositional models. It proposes a novel logical representation of the physico-chemical properties of residues and of the conserved positions within a sequence alignment. ILP then finds the most frequent rules and uses them in a learning phase based on decision-tree or support-vector-machine models. The method performs at least as well as the other methods found in the literature. We then propose the CASH method for annotating highly divergent genomes. CASH was applied to Plasmodium falciparum but remains applicable to any species. CASH uses information from genomes both close to and distant from P. falciparum. Each known domain is thus represented by a set of evolutionary models, and their outputs are combined by a meta-classifier that assigns a confidence score to each prediction. Based on this score and on domain co-occurrence properties, CASH finds the most probable architecture for each sequence by applying a multi-objective optimization approach. CASH is able to annotate 70% of the protein domains of P. falciparum, compared with an average of 58% for its competitors. New protein domains could be characterized within proteins of unknown function or within already annotated proteins.
APA, Harvard, Vancouver, ISO, and other styles
39

Faußer, Stefan Artur [Verfasser]. "Large state spaces and large data: Utilizing neural network ensembles in reinforcement learning and kernel methods for clustering / Stefan Artur Faußer." Ulm : Universität Ulm. Fakultät für Ingenieurwissenschaften und Informatik, 2015. http://d-nb.info/1074196201/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Hronský, Patrik. "Bioinformatický nástroj pro predikci rozpustnosti proteinů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255363.

Full text
Abstract:
This master's thesis addresses the solubility of recombinant proteins and its prediction. It describes protein synthesis, as well as the process by which recombinant proteins are created. Recombinant protein synthesis is of great importance, for example, to the pharmaceutical industry. This synthesis is not a simple task and does not always produce viable proteins. Protein solubility is an important factor determining the viability of the resulting proteins. It is of course favourable for companies that take part in recombinant protein synthesis to focus their effort and resources on proteins that will be viable in the end. In this regard, bioinformatics is of great help, as it is capable, with the help of machine learning, of predicting the solubility of proteins, for example based on their sequences. This thesis introduces the reader to the basic principles of machine learning and presents several machine learning methods used in the field of protein solubility prediction. It defines a dataset, which is later used to test selected predictors as well as to train the ensemble predictor that is the main focus of this thesis. It also examines several specific protein solubility predictors, explains the basic principles upon which they are built, and reports the results of their testing. In the end, it presents the ensemble predictor of protein solubility.
APA, Harvard, Vancouver, ISO, and other styles
41

Paradeda, Raul Benites. "Utilizando Pesos est?ticos e din?micos em sistemas multi-classificadores com diferentes n?veis de diversidade." Universidade Federal do Rio Grande do Norte, 2007. http://repositorio.ufrn.br:8080/jspui/handle/123456789/17963.

Full text
Abstract:
Although some individual techniques of supervised Machine Learning (ML), also known as classifiers or classification algorithms, provide solutions that are usually considered efficient, experimental results obtained on large pattern sets, or on sets with a significant amount of irrelevant or incomplete data, show a decrease in the precision of these techniques. In other words, such techniques cannot perform pattern recognition efficiently on complex problems. Aiming at better performance and effectiveness of these ML techniques, the idea arose of making several types of ML algorithms work together, giving rise to the term Multi-Classifier System (MCS). An MCS has different ML algorithms, called base classifiers, as its components, and combines the results obtained by these algorithms to reach the final result. For an MCS to perform better than its base classifiers, the results obtained by each base classifier must exhibit a certain diversity, that is, a difference between the results obtained by each classifier composing the system. It makes no sense to have an MCS whose base classifiers give identical answers to the same patterns. Although MCSs deliver better results than individually executed systems, there is a continuing effort to improve the results obtained by this type of system. Aiming at this improvement and at greater consistency in the results, as well as greater diversity among the classifiers of an MCS, methodologies characterized by the use of weights, or confidence values, have recently been investigated. These weights can describe the importance a given classifier assigned when associating each pattern with a given class. The weights are also used, together with the classifier outputs, during the recognition (use) phase of an MCS. There are different ways of computing these weights, and they can be divided into two categories: static weights and dynamic weights. The first category is characterized by values that do not change during the classification process, unlike the second category, whose values are modified as classification proceeds. This work analyzes whether the use of weights, both static and dynamic, can increase the performance of MCSs in comparison with the individually executed systems. Furthermore, the diversity obtained by the MCSs is analyzed, in order to verify whether there is any relation between the use of weights in MCSs and different levels of diversity.
APA, Harvard, Vancouver, ISO, and other styles
42

ILARDI, DAVIDE. "Data-driven solutions to enhance planning, operation and design tools in Industry 4.0 context." Doctoral thesis, Università degli studi di Genova, 2023. https://hdl.handle.net/11567/1104513.

Full text
Abstract:
This thesis proposes three different data-driven solutions to be combined with state-of-the-art solvers and tools, primarily to enhance their computational performance. The problem of efficiently designing the open-sea floating platforms on which wind turbines are mounted is tackled, as well as the tuning of a data-driven engine-monitoring tool for maritime transportation. Finally, the activities of SAT and ASP solvers are thoroughly studied and a deep learning architecture is proposed to enhance the heuristics-based solving approach adopted by such software. The covered domains differ, and the same is true for their respective targets. Nonetheless, the proposed Artificial Intelligence and Machine Learning algorithms are shared, as is the overall picture: promote Industrial AI and meet the constraints imposed by the Industry 4.0 vision. A lesser presence of humans in the loop, a data-driven approach to discover causalities otherwise ignored, special attention to the environmental impact of industrial emissions, and a real, efficient exploitation of the Big Data available today are just a subset of the latter. Hence, from a broader perspective, the experiments carried out within this thesis are driven towards the aforementioned targets, and the resulting outcomes are satisfactory enough to potentially convince the research community and industrialists that these are not just "visions" but can actually be put into practice. However, this is still an introduction to the topic, and the developed models are at what can be called a "pilot" stage. Nonetheless, the results are promising, and they pave the way towards further improvements and the consolidation of the dictates of Industry 4.0.
APA, Harvard, Vancouver, ISO, and other styles
43

Santis, Rodrigo Barbosa de. "Previsão de falta de materiais no contexto de gestão inteligente de inventário: uma aplicação de aprendizado desbalanceado." Universidade Federal de Juiz de Fora (UFJF), 2018. https://repositorio.ufjf.br/jspui/handle/ufjf/6861.

Full text
Abstract:
CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Material backorder (or stockout) is a common supply chain problem, impacting the inventory system's service level and effectiveness. Identifying the materials with the highest risk of shortage before the event occurs can present a great opportunity to improve a company's overall performance. However, this sort of problem is highly complex, due to the class imbalance between missing and non-missing items in inventory, which can reach ratios of 1 to 100. In this work, machine learning classifiers are investigated in order to fill this gap in the literature. Specific metrics such as the areas under the Receiver Operating Characteristic and precision-recall curves, sampling techniques, and ensemble learning are employed for this particular task. The proposed model was tested in two real case studies, in which it was verified that adopting the tool may contribute to improving the service level in the supply chain.
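The evaluation described above relies on the area under the ROC curve, which has a compact rank-based formulation (the Mann-Whitney U statistic) that is meaningful even under heavy class imbalance. A minimal sketch, with illustrative scores and labels that are not from the thesis:

```python
def roc_auc(scores, labels):
    """AUC as the probability that a randomly chosen positive scores above a
    randomly chosen negative (Mann-Whitney U); ties count as half a win.
    O(n^2) pairwise version, fine for a sketch."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A classifier that ranks every true backorder above every non-backorder
# gets an AUC of 1.0 regardless of how rare backorders are.
print(roc_auc([0.9, 0.8, 0.3, 0.2, 0.1], [1, 1, 0, 0, 0]))  # → 1.0
```

Unlike plain accuracy, this measure is unchanged by duplicating majority-class examples, which is why it suits the 1-to-100 imbalance mentioned in the abstract.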
APA, Harvard, Vancouver, ISO, and other styles
44

"Optimizing Performance Measures in Classification Using Ensemble Learning Methods." Master's thesis, 2017. http://hdl.handle.net/2286/R.I.44123.

Full text
Abstract:
Ensemble learning methods like bagging, boosting, adaptive boosting, and stacking have traditionally shown promising results in improving predictive accuracy in classification. These techniques have recently been widely used in various domains and applications owing to improvements in computational efficiency and advances in distributed computing. However, with the advent of a wide variety of applications of machine learning techniques to class imbalance problems, further focus is needed to evaluate, improve and optimize other performance measures such as sensitivity (true positive rate) and specificity (true negative rate) in classification. This thesis demonstrates a novel approach to evaluating and optimizing these performance measures (specifically sensitivity and specificity) using ensemble learning methods for classification, which can be especially useful on class-imbalanced datasets. In this thesis, ensemble learning methods (specifically bagging and boosting) are used to optimize sensitivity and specificity on a UC Irvine (UCI) 130-hospital diabetes dataset to predict whether a patient will be readmitted to the hospital based on various feature vectors. From the experiments conducted, it can be empirically concluded that, by using ensemble learning methods, accuracy improves by some margin, while both sensitivity and specificity are optimized significantly and consistently over different cross-validation approaches. The implementation and evaluation have been done on a subset of the large UCI 130-hospital diabetes dataset. The performance measures of the ensemble learners are compared to base machine learning classification algorithms such as Naive Bayes, Logistic Regression, k Nearest Neighbor, Decision Trees and Support Vector Machines.
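Sensitivity and specificity, the two measures this thesis optimizes, are simple confusion-matrix ratios, and a majority vote over base classifiers is one way an ensemble can raise both at once. A toy sketch (the three base-classifier outputs are invented for illustration, not taken from the thesis):

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

def majority_vote(predictions):
    """Combine per-model 0/1 predictions (one list per model) by majority."""
    return [1 if sum(votes) * 2 > len(votes) else 0
            for votes in zip(*predictions)]

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
# Three hypothetical base classifiers (e.g. trees trained on bootstrap samples),
# each making different mistakes.
m1 = [1, 1, 0, 0, 0, 1, 0, 0]
m2 = [1, 0, 1, 0, 0, 0, 0, 1]
m3 = [1, 1, 1, 0, 1, 0, 0, 0]
ensemble = majority_vote([m1, m2, m3])
```

Because the base classifiers err on different examples, the vote corrects each individual mistake here and reaches perfect sensitivity and specificity, even though no single base classifier does.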
Dissertation/Thesis
Masters Thesis Computer Science 2017
APA, Harvard, Vancouver, ISO, and other styles
45

Gao, Zi-yuan, and 高子元. "Learning with Multiple Labels and Ensemble Methods for Tweets Polarity Classification System." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/b3rp9c.

Full text
Abstract:
Master's thesis, National Sun Yat-sen University, Department of Computer Science and Engineering, ROC academic year 106 (2017-2018).
In this paper, we focus on Twitter sentiment analysis, a task in the SemEval-2018 workshop: given a tweet, classify it into one of seven ordinal classes. The method described in this paper builds on our previous work in the SemEval-2018 competition. We implement a system that learns with multiple labels. There are five sub-models in the system, namely a three-class model, a negative-class model, a neutral-class model, a positive-class model, and a seven-class model. Different labels are used in different sub-models to learn the polarity representation of tweets. In the competition, we obtained a Pearson correlation coefficient of 0.638 on the test data (ranked 21st of 36). To improve the system's performance, we change how the data is used, add class weights, add lexicon features, and train our own word vectors. With these methods, we raise the Pearson correlation coefficient by 0.137 on the test data. We also construct a lexicon model, which combines a recurrent neural network with a lexicon score. The Pearson correlation coefficient of the lexicon model on the development set is about 0.1 higher than that of traditional sentiment analysis. In the system that learns with multiple labels, we experiment with four ensemble methods: weighted average, majority decision, voting, and stacking ensembles. Finally, we retrain the DeepMoji model with transfer learning and take a weighted average with the polarity classification results of the system. The resulting Pearson correlation coefficient is 0.806 on the test data, which would have ranked 4th in the competition.
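A weighted average over the sub-models' class-probability vectors, one of the four ensemble methods mentioned, can be sketched as below. The probability vectors and weights are invented for illustration; the seven indices stand for the seven ordinal polarity classes (-3 to +3):

```python
def weighted_average(prob_lists, weights):
    """Combine per-model class-probability vectors by a weighted mean."""
    total = sum(weights)
    return [sum(w * p[i] for w, p in zip(weights, prob_lists)) / total
            for i in range(len(prob_lists[0]))]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

# Hypothetical outputs of three sub-models over the seven ordinal classes,
# indexed 0..6 for labels -3..+3.
m1 = [0.05, 0.05, 0.10, 0.20, 0.40, 0.15, 0.05]
m2 = [0.02, 0.03, 0.05, 0.30, 0.35, 0.20, 0.05]
m3 = [0.10, 0.10, 0.10, 0.25, 0.25, 0.10, 0.10]
combined = weighted_average([m1, m2, m3], weights=[0.5, 0.3, 0.2])
print(argmax(combined) - 3)  # → 1 (index 4 maps back to ordinal class +1)
```

Majority voting and stacking replace the averaging step with, respectively, a vote over hard predictions and a second-level learner trained on the sub-models' outputs.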
APA, Harvard, Vancouver, ISO, and other styles
46

Balasubramanyam, Rashmi. "Supervised Classification of Missense Mutations as Pathogenic or Tolerated using Ensemble Learning Methods." Thesis, 2017. http://etd.iisc.ac.in/handle/2005/3804.

Full text
Abstract:
Missense mutations account for more than 50% of the mutations known to be involved in human inherited diseases. Missense classification is a challenging task that involves sequencing of the genome, identifying the variations, and assessing their deleteriousness. This is a very laborious, time and cost intensive task to be carried out in the laboratory. Advancements in bioinformatics have led to several large-scale next-generation genome sequencing projects, and subsequently the identification of genome variations. Several studies have combined this data with information on established deleterious and neutral variants to develop machine learning based classifiers. There are significant issues with the missense classifiers due to which missense classification is still an open area of research. These issues can be classified under two broad categories: (a) Dataset overlap issue - where the performance estimates reported by the state-of-the-art classifiers are overly optimistic as they have often been evaluated on datasets that have significant overlaps with their training datasets. Also, there is no comparative analysis of these tools using a common benchmark dataset that contains no overlap with the training datasets, therefore making it impossible to identify the best classifier among them. Also, such a common benchmark dataset is not available. (b) Inadequate capture of vital biological information of the protein and mutations - such as conservation of long-range amino acid dependencies, changes in certain physico-chemical properties of the wild-type and mutant amino acids, due to the mutation. It is also not clear how to extract and use this information. Also, some classifiers use structural information that is not available for all proteins. In this study, we compiled a new dataset, containing around 2 - 15% overlap with the popularly used training datasets, with 18,036 mutations in 5,642 proteins. 
We reviewed and evaluated 15 state-of-the-art missense classifiers - SIFT, PANTHER, PROVEAN, PhD-SNP, Mutation Assessor, FATHMM, SNPs&GO, SNPs&GO3D, nsSNPAnalyzer, PolyPhen-2, SNAP, MutPred, PON-P2, CONDEL and MetaSNP - using six metrics: accuracy, sensitivity, specificity, precision, NPV and MCC. When evaluated on our dataset, we observe huge performance drops from what has been claimed. The average drop in performance for these classifiers is around 15% in accuracy, 17% in sensitivity, 14% in specificity, 7% in NPV, 24% in precision and 30% in MCC. With this we show that the performance of these tools is not consistent across datasets, and thus not reliable for practical use in a clinical setting. As we observed that the performance of the existing classifiers is poor in general, we tried to develop a new classifier that is robust, performs consistently across datasets, and does better than the state-of-the-art classifiers. We developed a novel method of capturing the conservation of long-range amino acid dependencies by boosting the conservation frequencies of substrings of amino acids of various lengths around the mutation position using the AdaBoost learning algorithm. This score alone performed equivalently to the sequence-conservation-based tools in classifying missense mutations. Popularly used sequence conservation properties were then combined with these boosted long-range dependency conservation scores using the AdaBoost algorithm. This reduced the class bias and improved the overall accuracy of the classifier. We trained a third classifier by incorporating changes in 21 important physico-chemical properties due to the mutation. In this case, we observed that the overall performance further improved and the class bias was further reduced. The performance of our final classifier is comparable with the state-of-the-art classifiers.
We did not find any significant improvement, but the class-specific accuracies and precisions are marginally better by around 1-2% than those of the existing classifiers. In order to understand our classifier better, we dissected our benchmark dataset into: (a) seen and unseen proteins, and (b) pure and mixed proteins, and analysed the performance in detail. Finally we concluded that our classifier performs consistently across each of these categories of seen, unseen, pure and mixed protein.
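The method above boosts conservation frequencies of substrings of various lengths around the mutation position; the windowing step alone can be sketched as follows. The sequence, position, and window lengths are illustrative, and the clipping behavior at sequence ends is an assumption made for the example:

```python
def windows_around(seq, pos, lengths=(3, 5, 7)):
    """Extract substrings of several lengths centered on a mutation position,
    shifting the window inward when it would run past either end."""
    out = []
    for length in lengths:
        start = max(0, min(pos - length // 2, len(seq) - length))
        out.append(seq[start:start + length])
    return out

# Windows around position 4 ('Y') of a toy protein sequence.
print(windows_around("MKTAYIAKQR", 4))  # → ['AYI', 'TAYIA', 'KTAYIAK']
```

Conservation frequencies of such windows, computed over aligned homologous sequences, would then serve as the weak features combined by AdaBoost, per the abstract.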
APA, Harvard, Vancouver, ISO, and other styles
47

Balasubramanyam, Rashmi. "Supervised Classification of Missense Mutations as Pathogenic or Tolerated using Ensemble Learning Methods." Thesis, 2017. http://etd.iisc.ernet.in/2005/3804.

Full text
Abstract:
Missense mutations account for more than 50% of the mutations known to be involved in human inherited diseases. Missense classification is a challenging task that involves sequencing of the genome, identifying the variations, and assessing their deleteriousness. This is a very laborious, time and cost intensive task to be carried out in the laboratory. Advancements in bioinformatics have led to several large-scale next-generation genome sequencing projects, and subsequently the identification of genome variations. Several studies have combined this data with information on established deleterious and neutral variants to develop machine learning based classifiers. There are significant issues with the missense classifiers due to which missense classification is still an open area of research. These issues can be classified under two broad categories: (a) Dataset overlap issue - where the performance estimates reported by the state-of-the-art classifiers are overly optimistic as they have often been evaluated on datasets that have significant overlaps with their training datasets. Also, there is no comparative analysis of these tools using a common benchmark dataset that contains no overlap with the training datasets, therefore making it impossible to identify the best classifier among them. Also, such a common benchmark dataset is not available. (b) Inadequate capture of vital biological information of the protein and mutations - such as conservation of long-range amino acid dependencies, changes in certain physico-chemical properties of the wild-type and mutant amino acids, due to the mutation. It is also not clear how to extract and use this information. Also, some classifiers use structural information that is not available for all proteins. In this study, we compiled a new dataset, containing around 2 - 15% overlap with the popularly used training datasets, with 18,036 mutations in 5,642 proteins. 
We reviewed and evaluated 15 state-of-the-art missense classifiers - SIFT, PANTHER, PROVEAN, PhD-SNP, Mutation Assessor, FATHMM, SNPs&GO, SNPs&GO3D, nsSNPAnalyzer, PolyPhen-2, SNAP, MutPred, PON-P2, CONDEL and MetaSNP - using six metrics: accuracy, sensitivity, specificity, precision, NPV and MCC. When evaluated on our dataset, we observe huge performance drops from what has been claimed. The average drop in performance for these classifiers is around 15% in accuracy, 17% in sensitivity, 14% in specificity, 7% in NPV, 24% in precision and 30% in MCC. With this we show that the performance of these tools is not consistent across datasets, and thus not reliable for practical use in a clinical setting. As we observed that the performance of the existing classifiers is poor in general, we tried to develop a new classifier that is robust, performs consistently across datasets, and does better than the state-of-the-art classifiers. We developed a novel method of capturing the conservation of long-range amino acid dependencies by boosting the conservation frequencies of substrings of amino acids of various lengths around the mutation position using the AdaBoost learning algorithm. This score alone performed equivalently to the sequence-conservation-based tools in classifying missense mutations. Popularly used sequence conservation properties were then combined with these boosted long-range dependency conservation scores using the AdaBoost algorithm. This reduced the class bias and improved the overall accuracy of the classifier. We trained a third classifier by incorporating changes in 21 important physico-chemical properties due to the mutation. In this case, we observed that the overall performance further improved and the class bias was further reduced. The performance of our final classifier is comparable with the state-of-the-art classifiers.
We did not find any significant improvement, but the class-specific accuracies and precisions are marginally better by around 1-2% than those of the existing classifiers. In order to understand our classifier better, we dissected our benchmark dataset into: (a) seen and unseen proteins, and (b) pure and mixed proteins, and analysed the performance in detail. Finally we concluded that our classifier performs consistently across each of these categories of seen, unseen, pure and mixed protein.
APA, Harvard, Vancouver, ISO, and other styles
48

Seca, Marta Sofia Lopes. "Explorations of the semantic learning machine neuroevolution algorithm: dynamic training data use and ensemble construction methods." Master's thesis, 2020. http://hdl.handle.net/10362/99078.

Full text
Abstract:
Dissertation presented as the partial requirement for obtaining a Master’s degree in Data Science and Advanced Analytics
As the world’s technology evolves, the power to implement new and more efficient algorithms increases, but so does the complexity of the problems at hand. Neuroevolution algorithms fit in this context in the sense that they are able to evolve Artificial Neural Networks (ANNs). The recently proposed Neuroevolution algorithm called Semantic Learning Machine (SLM) has the advantage of searching over unimodal error landscapes in any Supervised Learning task where the error is measured as a distance to the known targets. The absence of local optima in the search space results in more efficient learning when compared to other neuroevolution algorithms. This work studies how different approaches to dynamically using the training data affect the generalization of the SLM algorithm. Results show that these methods can be useful in offering different alternatives to achieve superior generalization. These approaches are evaluated experimentally on fifteen real-world binary classification data sets. Across these fifteen data sets, the SLM is able to outperform the Multilayer Perceptron (MLP) in 13 of the 15 considered problems with statistical significance, after parameter tuning was applied to both algorithms. Furthermore, this work also considers how different ensemble construction methods, such as a simple averaging approach, Bagging and Boosting, affect the resulting generalization of the SLM and MLP algorithms. Results suggest that the stochastic nature of the SLM offers enough diversity to the base learners that a simple averaging method can be competitive with more complex techniques like Bagging and Boosting.
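The simple averaging construction the abstract finds competitive can be sketched directly: train several runs of a stochastic base learner and average their predicted probabilities. MLPs with different random seeds stand in here for SLM runs, since SLM has no standard scikit-learn implementation (an assumption, not the authors' setup):

```python
# Sketch: simple-averaging ensemble of stochastic base learners. Each run
# with a different seed yields a different model; averaging their class
# probabilities forms the ensemble prediction.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

n_models = 5
probs = np.zeros(len(X_te))
for seed in range(n_models):
    m = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500,
                      random_state=seed).fit(X_tr, y_tr)
    probs += m.predict_proba(X_te)[:, 1]

# Average the probabilities and threshold at 0.5 for the final vote.
pred = (probs / n_models >= 0.5).astype(int)
print(round((pred == y_te).mean(), 3))
```

Bagging and Boosting additionally resample or reweight the training data per member; the abstract's point is that the learner's own stochasticity can supply comparable diversity without that machinery.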
APA, Harvard, Vancouver, ISO, and other styles
49

Reichenbach, Jonas. "Credit scoring with advanced analytics: applying machine learning methods for credit risk assessment at the Frankfurter sparkasse." Master's thesis, 2018. http://hdl.handle.net/10362/49557.

Full text
Abstract:
Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management
The need to control and manage credit risk obliges financial institutions to constantly reconsider their credit scoring methods. In recent years, machine learning has shown improvement over the common traditional methods for credit scoring. Even small improvements in prediction quality are of great interest to financial institutions. In this thesis, classification methods are applied to the credit data of the Frankfurter Sparkasse to score their credits. Since recent research has shown that ensemble methods deliver outstanding prediction quality for credit scoring, the focus of the model investigation and application is set on such methods. Additionally, the typically imbalanced class distribution of credit scoring datasets leads us to consider sampling techniques, which compensate for the imbalance in the training dataset. We evaluate and compare different types of models and techniques according to defined metrics. Besides delivering high prediction quality, the model’s outcome should be interpretable as default probabilities. Hence, calibration techniques are considered to improve the interpretation of the model’s scores. We find ensemble methods to deliver better results than the best single model. Specifically, the Random Forest delivers the best performance on the given data set. When compared to the traditional credit scoring methods of the Frankfurter Sparkasse, the Random Forest shows significant improvement in predicting a borrower’s default within a 12-month period. Logistic Regression is used as a benchmark to validate the performance of the model.
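The described pipeline can be sketched in miniature: a Random Forest trained on an imbalanced binary problem, with its scores calibrated so they can be read as default probabilities. The data here is synthetic and the parameter choices are illustrative assumptions, not the Frankfurter Sparkasse setup:

```python
# Sketch: imbalanced credit-scoring-style classification with a Random
# Forest, followed by probability calibration (Platt / sigmoid scaling)
# so the output scores are interpretable as default probabilities.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split

# Skewed class weights mimic the rarity of defaults in credit data.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            random_state=0)
# Cross-validated sigmoid calibration maps raw scores to probabilities.
model = CalibratedClassifierCV(rf, method="sigmoid", cv=3).fit(X_tr, y_tr)
default_prob = model.predict_proba(X_te)[:, 1]
print(round(default_prob.mean(), 3))
```

Here `class_weight="balanced"` stands in for the sampling techniques the abstract mentions; random under- or over-sampling of the training set would serve the same purpose.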
APA, Harvard, Vancouver, ISO, and other styles
50

Haque, Mohammad Nazmul. "Genetic algorithm-based ensemble methods for large-scale biological data classification." Thesis, 2017. http://hdl.handle.net/1959.13/1335393.

Full text
Abstract:
Research Doctorate - Doctor of Philosophy (PhD)
We study the search for the best ensemble combinations from a wide variety of heterogeneous base classifiers. The number of possible ways to create an ensemble grows exponentially with the size of the base classifier pool, so exhaustive search over this space is infeasible. Hence, we employed a genetic algorithm to find the best ensemble combinations from a pool of heterogeneous base classifiers. The classification decisions of the base classifiers are combined using the popular majority vote approach. We used random sub-sampling to balance the class distributions in the class-imbalanced datasets. The empirical results on benchmark and real-world datasets show that this approach outperformed the base classifiers and other state-of-the-art ensemble methods. Afterwards, we evaluated the ensemble combination search in a weighted voting approach using the differential evolution (DE) algorithm, to determine whether employing weights could increase the generalisation performance of the ensembles. The weights optimised by DE also outperformed both the base classifiers and other ensembles on benchmark and real-world biological datasets. Finally, we extended the majority voting-based ensemble combination search to a multi-objective setting. The search space spans all possible ensemble combinations created from 29 heterogeneous base classifiers, together with the selection of a feature subset from six feature selection methods in a wrapper approach. NSGA-II, a popular multi-objective genetic algorithm, optimises two objectives - the maximisation of training MCC scores and the maximisation of diversity among base classifiers - to simultaneously find the best feature set and ensemble combination.
We analyse the Pareto front of solutions obtained by NSGA-II for their generalisation performance. Datasets taken from the UCI machine learning repository and the NIPS 2003 feature selection challenge are used to investigate the performance of the proposed method. The experimental outcomes suggest that the proposed NSGA-II-based multi-objective search finds better feature sets and ensemble combinations, producing better generalisation performance compared to other ensemble-of-classifiers methods.
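The single-objective version of this search can be sketched as a toy genetic algorithm over bitmasks, where each bit selects one base classifier and fitness is the validation accuracy of the majority vote. The pool of four classifiers, the GA settings and the fitness choice are all illustrative simplifications of the thesis's setup:

```python
# Sketch: genetic search for a majority-vote ensemble combination.
# Individuals are bitmasks over the base-classifier pool.
import random
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

random.seed(0)
X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

pool = [DecisionTreeClassifier(random_state=0), GaussianNB(),
        LogisticRegression(max_iter=1000), KNeighborsClassifier()]
# Precompute each base classifier's validation predictions once.
preds = np.array([m.fit(X_tr, y_tr).predict(X_val) for m in pool])

def fitness(mask):
    """Validation accuracy of the majority vote over selected members."""
    if not any(mask):
        return 0.0
    votes = preds[np.array(mask, dtype=bool)].mean(axis=0)
    return ((votes >= 0.5).astype(int) == y_val).mean()

# Tiny GA: elitist selection, one-point crossover, bit-flip mutation.
popn = [[random.randint(0, 1) for _ in pool] for _ in range(8)]
for _ in range(10):
    popn.sort(key=fitness, reverse=True)
    parents = popn[:4]
    children = []
    for _ in range(4):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, len(pool))
        child = a[:cut] + b[cut:]
        if random.random() < 0.3:      # mutation: flip one selection bit
            i = random.randrange(len(pool))
            child[i] ^= 1
        children.append(child)
    popn = parents + children

best = max(popn, key=fitness)
print(best, round(fitness(best), 3))
```

The multi-objective extension replaces this scalar fitness with two objectives (training MCC and member diversity) and the elitist loop with NSGA-II's non-dominated sorting; the bitmask encoding would also grow to cover the feature-selection choice.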
APA, Harvard, Vancouver, ISO, and other styles
