Dissertations / Theses: 'Random Forests'

1

Gómez, Silvio Normey. "Random forests estocástico." Pontifícia Universidade Católica do Rio Grande do Sul, 2012. http://hdl.handle.net/10923/1598.

Abstract:

In data mining, experiments are often carried out using ensemble classifiers. These experiments rest on empirical comparisons that frequently pay too little attention to the randomness inherent in the methods. We experimented with Random Forests to evaluate how the algorithm performs under this randomness. The results show that Random Forests are considerably more sensitive to randomness than other methods in the literature, such as Bagging and Boosting. The main purpose of this work is to reduce that sensitivity. To this end, we implemented an extension of the method, named Stochastic Random Forests, and specified a strategy that improves performance and stability by combining results. Finally, a study presents the improvements achieved.

2

Abdulsalam, Hanady. "Streaming Random Forests." Thesis, Kingston, Ont. : [s.n.], 2008. http://hdl.handle.net/1974/1321.

3

Linusson, Henrik. "Multi-Output Random Forests." Thesis, Högskolan i Borås, Institutionen Handels- och IT-högskolan, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-17167.

Abstract:

The Random Forests ensemble predictor has proven to be well-suited for solving a multitude of different prediction problems. In this thesis, we propose an extension to the Random Forest framework that allows Random Forests to be constructed for multi-output decision problems with arbitrary combinations of classification and regression responses, with the goal of increasing predictive performance for such multi-output problems. We show that our method for combining decision tasks within the same decision tree reduces prediction error for most tasks compared to single-output decision trees based on the same node impurity metrics, and provide a comparison of different methods for combining such metrics.
Program: Master's programme in informatics (Magisterutbildning i informatik)
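
As a rough illustration of what combining node impurity metrics across a classification and a regression response can look like, here is a minimal Python sketch. It assumes one particular combination (Gini impurity plus variance scaled by the root-node variance); the thesis compares several combination methods, and this is not claimed to be any of them. All function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def variance(values):
    """Population variance of a list of numeric responses."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def combined_impurity(class_labels, reg_values, root_variance):
    """Assumed combination: Gini plus variance scaled by the root-node
    variance, so that both terms are on a comparable scale."""
    scaled = variance(reg_values) / root_variance if root_variance > 0 else 0.0
    return gini(class_labels) + scaled

def split_score(rows, feature, threshold, root_variance):
    """Size-weighted combined impurity of the two children of a split.
    Each row is (features, class_label, regression_value)."""
    left = [r for r in rows if r[0][feature] <= threshold]
    right = [r for r in rows if r[0][feature] > threshold]
    score = 0.0
    for child in (left, right):
        if child:
            labels = [r[1] for r in child]
            values = [r[2] for r in child]
            weight = len(child) / len(rows)
            score += weight * combined_impurity(labels, values, root_variance)
    return score
```

A lower `split_score` means the candidate split separates both responses better at once; a tree builder would pick the feature and threshold that minimise it.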

4

Gómez, Silvio Normey. "Random forests estocástico." Pontifícia Universidade Católica do Rio Grande do Sul, 2012. http://tede2.pucrs.br/tede2/handle/tede/5226.

Abstract:

In data mining, experiments are often carried out using ensemble classifiers. These experiments rest on empirical comparisons that frequently pay too little attention to the randomness inherent in the methods. We experimented with Random Forests to evaluate how the algorithm performs under this randomness. The results show that Random Forests are considerably more sensitive to randomness than other methods in the literature, such as Bagging and Boosting. The main purpose of this work is to reduce that sensitivity. To this end, we implemented an extension of the method, named Stochastic Random Forests, and specified a strategy that improves performance and stability by combining results. Finally, a study presents the improvements achieved.

5

Lapajne, Mikael Hellborg, and Daniel Slat. "Random Forests for CUDA GPUs." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-2953.

Abstract:

Context. Machine learning is a complex and resource-consuming process that requires a lot of computing power. With the constant growth of information, the need for efficient, high-performance algorithms is increasing. Today's commodity graphics cards are parallel multiprocessors with high computing capacity at an attractive price and are usually pre-installed in new PCs. They provide an additional resource for machine learning applications. The Random Forest learning algorithm, which has been shown to be competitive within machine learning, has good potential for performance gains through parallelization. Objectives. In this study we implement and review a revised Random Forest algorithm for GPU execution using CUDA. Methods. Previous work in the area was reviewed by studying articles from several sources, including Compendex, Inspec, IEEE Xplore, ACM Digital Library and Springer Link. Additional information on GPU architecture and implementation-specific details was obtained mainly from documentation available from Nvidia and the Nvidia developer forums. The implemented algorithm was benchmarked against two state-of-the-art CPU implementations of the Random Forest algorithm, with respect to both training and classification time and classification accuracy. Results. Benchmark measurements for the three algorithms are reported for two publicly available data sets. Conclusion. We conclude that our implementation is able to outperform its competitors under the right conditions, but only for certain data sets, depending on their size. Moreover, we conclude that there is potential for further improvement of the algorithm, both in performance and in adaptation to a wider range of real-world applications.

6

Diyar, Jamal. "Post-Pruning of Random Forests." Thesis, Blekinge Tekniska Högskola, Institutionen för datalogi och datorsystemteknik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-15904.

Abstract:

Context. In machine learning, ensemble methods continue to receive increased attention. Since machine learning approaches that generate a single classifier or predictor have shown limited capabilities in some contexts, ensemble methods are used to yield better predictive performance. One of the most interesting and effective ensemble algorithms introduced in recent years is Random Forests. A common approach to ensuring that Random Forests achieve high predictive accuracy is to use a large number of trees. Increasing the number of trees, however, results in a more complex model, which may be more difficult to interpret or analyse, and in higher computational power and memory requirements. Objectives. This thesis explores automatic simplification of Random Forest models via post-pruning as a means to reduce the size of the model and increase interpretability while retaining or increasing predictive accuracy. The aim of the thesis is twofold. First, it compares and empirically evaluates a set of state-of-the-art post-pruning techniques on the simplification task. Second, it investigates the trade-off between predictive accuracy and model interpretability. Methods. The primary research method used to conduct this study and to address the research questions is experimentation. All post-pruning techniques are implemented in Python. The Random Forest models are trained, evaluated, and validated on five selected datasets with varying characteristics. Results. There is no significant difference in predictive performance between the compared techniques, and none of the studied post-pruning techniques outperforms the others on all included datasets. The experimental results also show that model interpretability is inversely related to model accuracy, at least for the studied settings: a positive change in model interpretability is accompanied by a negative change in model accuracy. Conclusions. It is possible to reduce the size of a complex Random Forest model while retaining or improving the predictive accuracy. Moreover, the suitability of a particular post-pruning technique depends on the application area and the amount of training data available. Significantly simplified models may be less accurate than the original model but tend to be perceived as more comprehensible.

7

Xiong, Kuangnan. "Roughened Random Forests for Binary Classification." Thesis, State University of New York at Albany, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3624962.

Abstract:

Binary classification plays an important role in many decision-making processes. Random forests can build a strong ensemble classifier by combining weaker classification trees that are de-correlated. The strength and correlation among individual classification trees are the key factors that contribute to the ensemble performance of random forests. We propose roughened random forests, a new set of tools which show further improvement over random forests in binary classification. Roughened random forests modify the original dataset for each classification tree and further reduce the correlation among individual classification trees. This data modification process is composed of artificially imposing missing data that are missing completely at random and subsequent missing data imputation.

Through this dissertation we aim to answer a few important questions in building roughened random forests: (1) What is the ideal rate of missing data to impose on the original dataset? (2) Should we impose missing data on both the training and testing datasets, or only on the training dataset? (3) What are the best missing data imputation methods to use in roughened random forests? (4) Do roughened random forests share the same ideal number of covariates selected at each tree node as the original random forests? (5) Can roughened random forests be used in medium- to high- dimensional datasets?
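
The data modification step described above (impose missing values completely at random, then impute) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: mean imputation is a stand-in, since the dissertation itself asks which imputation method works best, and the function names and the `rate` parameter are hypothetical.

```python
import random

def impose_mcar(rows, rate, rng):
    """Copy of rows in which each cell is independently replaced by None
    with probability `rate` (missing completely at random)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]

def mean_impute(rows):
    """Fill None cells with the mean of the observed values in the column.
    A column with no observed values is filled with 0.0."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def roughened_copies(rows, n_trees, rate, seed=0):
    """One independently roughened copy of the dataset per tree."""
    rng = random.Random(seed)
    return [mean_impute(impose_mcar(rows, rate, rng)) for _ in range(n_trees)]
```

Each tree of the forest would then be grown on its own roughened copy, which is what is meant to push the trees further apart than bootstrap sampling alone.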

8

Strobl, Carolin, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. "Conditional Variable Importance for Random Forests." BioMed Central Ltd, 2008. http://dx.doi.org/10.1186/1471-2105-9-307.

Abstract:

Background: Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables. Results: We identify two mechanisms responsible for this finding: (i) a preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure. Conclusion: The resulting conditional variable importance reflects the true impact of each predictor variable more reliably than the original marginal approach. (authors' abstract)
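
The difference between the unconditional (marginal) and a conditional permutation scheme can be illustrated with a small Python sketch. It is not the authors' implementation: their scheme builds the conditioning grid from the split points of the fitted trees, whereas here the grid is simply passed in as a hypothetical `groups` argument, and `predict` stands for any fitted model.

```python
import random

def mse(predict, X, y):
    """Mean squared error of a prediction function on rows X, targets y."""
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permuted_column(X, j, rng, groups=None):
    """Copy of X with column j shuffled. If `groups` is given (one label
    per row), values are shuffled only among rows sharing a label."""
    if groups is None:
        groups = [0] * len(X)          # one group: marginal permutation
    X_perm = [list(row) for row in X]
    by_group = {}
    for i, g in enumerate(groups):
        by_group.setdefault(g, []).append(i)
    for idx in by_group.values():
        vals = [X[i][j] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            X_perm[i][j] = v
    return X_perm

def permutation_importance(predict, X, y, j, n_repeats=20, seed=0, groups=None):
    """Average increase in MSE after permuting column j."""
    rng = random.Random(seed)
    base = mse(predict, X, y)
    increases = [mse(predict, permuted_column(X, j, rng, groups), y) - base
                 for _ in range(n_repeats)]
    return sum(increases) / n_repeats
```

With `groups=None` the column is shuffled across all rows, which breaks its association with correlated predictors as well as with the response. Shuffling within groups defined by a correlated predictor keeps that association intact, so only the variable's own contribution is measured.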

9

Sorice, Domenico. "Random forests in time series analysis." Master's Degree Thesis, Università Ca' Foscari Venezia, 2020. http://hdl.handle.net/10579/17482.

Abstract:

Machine learning algorithms are becoming more relevant in many fields, from neuroscience to biostatistics, owing to their adaptability and their ability to learn from data. In recent years, these techniques have become popular in economics and have found applications in policymaking, financial forecasting, and portfolio optimization. The aim of this dissertation is twofold. First, I review the Classification and Regression Tree (CART) and Random Forest methods proposed by [Breiman, 1984] and [Breiman, 2001], and study the effectiveness of those algorithms in time series analysis. Random Forest is an ensemble machine learning algorithm based on CART, and I use a variety of applications to test the performance of both algorithms. Second, I implement an application on financial data: I use the Random Forest algorithm to estimate a factor model based on macroeconomic variables, with the aim of verifying whether the Random Forest can capture part of the non-linear relationship between the factors considered and the index return.

10

Hapfelmeier, Alexander. "Analysis of missing data with random forests." Diss., LMU, 2012. http://nbn-resolving.de/urn:nbn:de:bvb:19-150588.

11

Wonkye, Yaa Tawiah. "Innovations of random forests for longitudinal data." Bowling Green State University / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1563054152739397.

12

Auret, Lidia. "Process monitoring and fault diagnosis using random forests." Thesis, Stellenbosch : University of Stellenbosch, 2010. http://hdl.handle.net/10019.1/5360.

Abstract:

Thesis (PhD (Process Engineering))--University of Stellenbosch, 2010.
Dissertation presented for the Degree of DOCTOR OF PHILOSOPHY (Extractive Metallurgical Engineering) in the Department of Process Engineering at the University of Stellenbosch
Fault diagnosis is an important component of process monitoring, relevant in the greater context of developing safer, cleaner and more cost-efficient processes. Data-driven unsupervised (or feature extractive) approaches to fault diagnosis exploit the many measurements available on modern plants. Certain current unsupervised approaches are hampered by their linearity assumptions, motivating the investigation of nonlinear methods. The diversity of data structures also motivates the investigation of novel feature extraction methodologies in process monitoring. Random forests are recently proposed statistical inference tools, deriving their predictive accuracy from the nonlinear nature of their constituent decision tree members and the power of ensembles. Random forest committees provide more than just predictions; model information on data proximities can be exploited to provide random forest features. Variable importance measures show which variables are closely associated with a chosen response variable, while partial dependencies indicate the relation of important variables to said response variable. The purpose of this study was therefore to investigate the feasibility of a new unsupervised method based on random forests as a potentially viable contender in the process monitoring statistical tool family. The hypothesis investigated was that unsupervised process monitoring and fault diagnosis can be improved by using features extracted from data with random forests, with further interpretation of fault conditions aided by random forest tools. The experimental results presented in this work support this hypothesis. An initial study was performed to assess the quality of random forest features. Random forest features were shown to be generally difficult to interpret in terms of the geometry present in the original variable space.
Random forest mapping and demapping models were shown to be very accurate on training data, and to extrapolate weakly to unseen data that do not fall within regions populated by training data. Random forest feature extraction was applied to unsupervised fault diagnosis for steady-state process data, and compared to linear and nonlinear methods. Random forest results were comparable to existing techniques, with the majority of random forest detections due to variable reconstruction errors. Further investigation revealed that the residual detection success of random forests originates from the constrained responses and poor generalization artifacts of decision trees. Random forest variable importance measures and partial dependencies were incorporated in a visualization tool to allow for the interpretation of fault conditions. A dynamic change point detection application with random forests proved more successful than an existing principal component analysis-based approach, with the success of the random forest method again residing in reconstruction errors. The addition of random forest fault diagnosis and change point detection algorithms to a suite of abnormal event detection techniques is recommended. The distance-to-model diagnostic based on random forest mapping and demapping proved successful in this work, and the theoretical understanding gained supports the application of this method to further data sets.
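
The data proximities mentioned in the abstract are, in the usual random forest sense, the fraction of trees in which two samples end up in the same terminal node. A minimal Python sketch of that definition follows. The input layout (`leaf_ids[t][i]` as the leaf reached by sample `i` in tree `t`) is an assumption for illustration, and how the thesis turns proximities into features is not shown here.

```python
def proximity_matrix(leaf_ids):
    """Pairwise proximities from per-tree leaf assignments.

    leaf_ids[t][i] is the identifier of the leaf that tree t assigns to
    sample i. The proximity of samples i and j is the fraction of trees
    in which both fall into the same leaf."""
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    counts = [[0] * n for _ in range(n)]
    for leaves in leaf_ids:
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]
```

The resulting matrix is symmetric with ones on the diagonal, and one minus the proximity can serve as a dissimilarity for scaling or clustering.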

13

Fawagreh, Khaled. "On pruning and feature engineering in Random Forests." Thesis, Robert Gordon University, 2016. http://hdl.handle.net/10059/2113.

Abstract:

Random Forest (RF) is an ensemble classification technique that was developed by Leo Breiman over a decade ago. Compared with other ensemble techniques, it has proved its accuracy and superiority. Many researchers, however, believe that there is still room for optimizing RF further by improving its accuracy. This explains why there have been many extensions of RF, each employing a variety of techniques and strategies to improve certain aspects of RF. The main focus of this dissertation is to develop new extensions of RF using optimization techniques that, to the best of our knowledge, have never been used before to optimize RF. These techniques are clustering, the local outlier factor, diversified weighted subspaces, and replicator dynamics. Applying these techniques to RF produced four extensions, which we have termed CLUB-DRF, LOFB-DRF, DSB-RF, and RDB-DR respectively. Experimental studies on 15 real datasets showed favorable results, demonstrating the potential of the proposed methods. Performance-wise, CLUB-DRF is ranked first in terms of accuracy and classification speed, making it ideal for real-time applications and for machines and devices with limited memory and processing power.

14

Merrill, Andrew C. "Investigations of Variable Importance Measures Within Random Forests." DigitalCommons@USU, 2009. https://digitalcommons.usu.edu/etd/7078.

Abstract:

Random Forests (RF) (Breiman 2001; Breiman and Cutler 2004) is a completely nonparametric statistical learning procedure that may be used for regression analysis and classification. A feature of RF that is drawing a lot of attention is the novel algorithm used to evaluate the relative importance of the predictor/explanatory variables. Other machine learning algorithms for regression and classification, such as support vector machines and artificial neural networks (Hastie et al. 2009), exhibit high predictive accuracy but provide little insight into the predictive power of individual variables. In contrast, the permutation algorithm of RF has already established a track record for identification of important predictors (Huang et al. 2005; Cutler et al. 2007; Archer and Kimes 2008). Recently, however, some authors (Nicodemus and Shugart 2007; Strobl et al. 2007, 2008) have shown that categorical variables with many categories (Strobl et al. 2007) or high collinearity among predictors (Strobl et al. 2008) receive unduly large variable importance under the standard RF permutation algorithm. This work creates simulations from multiple linear regression models with small numbers of variables to understand the issues raised by Strobl et al. (2008) regarding shortcomings of the original RF variable importance algorithm and the alternatives implemented in conditional forests (Strobl et al. 2008). In addition, this paper looks at the dependence of RF variable importance values on user-defined parameters.

15

Quach, Anna. "Extensions and Improvements to Random Forests for Classification." DigitalCommons@USU, 2017. https://digitalcommons.usu.edu/etd/6755.

Abstract:

The motivation of my dissertation is to address two weaknesses of Random Forests: first, the failure to detect genetic interactions between two single nucleotide polymorphisms (SNPs) in higher dimensions when the interacting SNPs both have weak main effects, and second, the difficulty of interpretation in comparison to parametric methods such as logistic regression, linear discriminant analysis, and linear regression. We focus on detecting pairwise SNP interactions in genome case-control studies. We determine the best parameter settings to optimize the detection of SNP interactions and improve the efficiency of Random Forests, and we present an efficient filtering method. The filtering method is compared to leading methods and is shown to be computationally faster with good detection power. Random Forests allows us to identify clusters, outliers, and important features for subgroups of observations through the visualization of the proximities. We improve the interpretation of Random Forests through the proximities. The resulting new proximities are asymmetric, and their visualization requires an asymmetric model for interpretation. We propose a new visualization technique for asymmetric data and compare it to existing approaches.

16

Parfionovas, Andrejus. "Enhancement of Random Forests Using Trees with Oblique Splits." DigitalCommons@USU, 2013. http://digitalcommons.usu.edu/etd/1508.

Abstract:

This work presents an enhancement to the classification tree algorithm that forms the basis for Random Forests. Unlike classical tree-based methods, which focus on one variable at a time to separate the observations, the new algorithm searches for the best split in two-dimensional space using a linear combination of variables. Besides classification, the method can be used to determine variable interactions and perform feature extraction. Theoretical investigations and numerical simulations were used to analyze the properties and performance of the new approach. Comparison with other popular classification methods was performed using simulated and real data examples. The algorithm was implemented as an extension package for the statistical computing environment R and is available for free download under the GNU General Public License.
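
To make the idea of a split on a linear combination of two variables concrete, here is a small Python sketch. It assumes a brute-force search over a grid of directions with Gini impurity as the criterion; the abstract does not say how the R package performs its search, so the function name, the `n_angles` parameter and the search strategy are all illustrative.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_oblique_split(points, labels, n_angles=36):
    """Search splits of the form cos(a)*x1 + sin(a)*x2 <= t.

    Tries n_angles directions in [0, pi) and, for each, every threshold
    midway between consecutive distinct projected values.
    Returns (weighted_gini, angle, threshold)."""
    n = len(points)
    best = (float("inf"), None, None)
    for k in range(n_angles):
        a = math.pi * k / n_angles
        proj = [math.cos(a) * x1 + math.sin(a) * x2 for x1, x2 in points]
        order = sorted(range(n), key=lambda i: proj[i])
        for pos in range(1, n):
            lo, hi = proj[order[pos - 1]], proj[order[pos]]
            if lo == hi:
                continue  # no threshold separates tied projections
            left = [labels[i] for i in order[:pos]]
            right = [labels[i] for i in order[pos:]]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[0]:
                best = (score, a, (lo + hi) / 2)
    return best
```

Angle 0 reproduces an ordinary axis-aligned split on the first variable, so the classical single-variable split is a special case of this search.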

17

Tang, Ying. "Real-time automatic face tracking using adaptive random forests." Thesis, McGill University, 2010. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=95172.

Abstract:

Tracking is treated as a pixel-based binary classification problem in this thesis. An ensemble strong classifier obtained as a weighted combination of several random forests (weak classifiers), is trained on pixel feature vectors. The strong classifier is then used to classify the pixels belonging to the face or the background in the next frame. The classification margins are used to create a confidence map, whose peak indicates the new location of the face. The peak is located by Camshift which adjusts the size of the tracked face. The random forests in the ensemble are updated using AdaBoost by training new random forests to replace certain older ones to adapt to the changes between two frames. Tracking accuracy is monitored by a variable called the classification score. If the score detects a tracking anomaly, the system will stop tracking and restart by re-initializing using a Viola-Jones face detector. The tracker is tested on several sequences and proved to provide robust performance in different scenarios and illumination. The tracker can deal with complex changes of the face, a short period of occlusion, and the loss of tracking.

18

Michaelson, Jacob. "Applications and extensions of Random Forests in genetic and environmental studies." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-64099.

Abstract:

Transcriptional regulation refers to the molecular systems that control the concentration of mRNA species within the cell. Variation in these controlling systems is not only responsible for many diseases, but also contributes to the vast phenotypic diversity in the biological world. There are powerful experimental approaches to probe these regulatory systems, and the focus of my doctoral research has been to develop and apply effective computational methods that exploit these rich data sets more completely. First, I present a method for mapping genetic regulators of gene expression (expression quantitative trait loci, or eQTL) using Random Forests. This approach allows for flexible modeling and feature selection, and results in eQTL that are more biologically supportable than those mapped with competing methods. Next, I present a method that finds interactions between genes that in turn regulate the expression of other genes. This is accomplished by finding recurring decision motifs in the forest structure that represent dependencies between genetic loci. Third, I present a method to use distributional differences in eQTL data to establish the regulatory roles of genes relative to other disease-associated genes. Using this method, we found that genes that are master regulators of other disease genes are more likely to be consistently associated with the disease in genetic association studies. Finally, I present a novel application of Random Forests to determine the mode of regulation of toxin-perturbed genes, using time-resolved gene expression. The results demonstrate a novel approach to supervised weighted clustering of gene expression data.

19

Sandsveden, Daniel. "Evaluation of Random Forests for Detection and Localization of Cattle Eyes." Thesis, Linköpings universitet, Datorseende, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-121540.

Abstract:

At a time when cattle herds grow continually larger, the need for automatic methods to detect diseases is ever increasing. One possible method to discover diseases is to use thermal images and automatic head and eye detectors. In this thesis, an eye detector and a head detector are implemented using the Random Forests classifier. During the implementation the classifier is evaluated using three different descriptors: Histogram of Oriented Gradients, Local Binary Patterns, and a descriptor based on pixel differences. An alternative classifier, the Support Vector Machine, is also evaluated for comparison against Random Forests. The thesis results show that Histogram of Oriented Gradients performs well as a description of cattle heads, while Local Binary Patterns performs well as a description of cattle eyes. The provided descriptor performs almost equally well in both cases. The results also show that Random Forests performs approximately as well as the Support Vector Machine when the Support Vector Machine is paired with Local Binary Patterns for both heads and eyes. Finally, the thesis results indicate that it is easier to detect and locate cattle heads than cattle eyes. For eyes, combining a head detector and an eye detector is shown to give a better result than using an eye detector alone. In this combination, heads are first detected in images, and the eye detector is then applied to the areas classified as heads.

20

Reiter, Richard M. "Prediction of recurrence in thin melanoma using trees and random forests /." Electronic version (PDF), 2005. http://dl.uncw.edu/etd/2005/reiterr/richardreiter.html.

21

Hansson, Kim, and Erik Hörlin. "Active learning via Transduction in Regression Forests." Thesis, Blekinge Tekniska Högskola, Institutionen för kreativa teknologier, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-10935.

Abstract:

Context. The amount of training data required to build accurate models is a common problem in machine learning. Active learning is a technique that tries to reduce the amount of required training data by making active choices of which training data holds the greatest value. Objectives. This thesis aims to design, implement and evaluate the Random Forests algorithm combined with active learning, suitable for predictive tasks with real-valued outcomes where the amount of training data is small. Machine learning algorithms traditionally require large amounts of training data to create a general model, and training data is in many cases sparse and expensive or difficult to create. Methods. The research methods used for this thesis are implementation and scientific experiment. An approach to active learning was implemented based on previous work for classification-type problems. The approach uses the Mahalanobis distance to perform active learning via transduction. Evaluation was done using several data sets, where the decrease in prediction error was measured over several iterations. The results of the evaluation were then analyzed using nonparametric statistical testing. Results. The statistical analysis of the evaluation results failed to detect a difference between our approach and a non-active-learning approach, even though the proposed algorithm showed irregular performance. The evaluation of our tree-based traversal method and the evaluation of the Mahalanobis distance for transduction both showed that these methods performed better than Euclidean distance and complete graph traversal. Conclusions. We conclude that the proposed solution did not decrease the amount of required training data to a significant degree. However, the approach has potential, and future work could lead to a working active learning solution. Further work is needed on key areas of the implementation, such as the choice of instances for active learning through transduction uncertainty as well as the choice of method for going from transduction model to induction model.
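
To make the contrast between the Mahalanobis and Euclidean distances mentioned in this abstract concrete, the sketch below computes the Mahalanobis distance for two-dimensional points, using the closed-form inverse of a 2x2 covariance matrix. The restriction to two dimensions and the helper names (`mean_and_cov_2d`, `mahalanobis_2d`) are assumptions chosen to keep the example short; the thesis does not specify its implementation at this level.

```python
import math


def mean_and_cov_2d(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance between points a and b under covariance cov."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = a[0] - b[0], a[1] - b[1]
    # The inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det.
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)
```

With an identity covariance, `(1.0, 0.0, 1.0)`, the result coincides with the Euclidean distance; a larger variance along one axis shrinks distances measured along that axis.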

22

Hapfelmeier, Alexander [Verfasser], and Kurt [Akademischer Betreuer] Ulm. "Analysis of missing data with random forests / Alexander Hapfelmeier. Betreuer: Kurt Ulm." München : Universitätsbibliothek der Ludwig-Maximilians-Universität, 2012. http://d-nb.info/102904032X/34.

23

Matheson, David. "An empirical study of practical, theoretical and online variants of random forests." Thesis, University of British Columbia, 2014. http://hdl.handle.net/2429/46586.

Abstract:

Random forests are ensembles of randomized decision trees where diversity is created by injecting randomness into the fitting of each tree. The combination of their accuracy and their simplicity has resulted in their adoption in many applications. Different variants have been developed with different goals in mind: improving predictive accuracy, extending the range of application to online and structure domains, and introducing simplifications for theoretical amenability. While there are many subtle differences among the variants, the core difference is the method of selecting candidate split points. In our work, we examine eight different strategies for selecting candidate split points and study their effect on predictive accuracy, individual strength, diversity, computation time and model complexity. We also examine the effect of different parameter settings and several other design choices including bagging, subsampling data points at each node, taking linear combinations of features, splitting data points into structure and estimation streams and using a fixed frontier for online variants. Our empirical study finds several trends, some of which are in contrast to commonly held beliefs, that have value to practitioners and theoreticians. For variants used by practitioners the most important discoveries include: bagging almost never improves predictive accuracy, selecting candidate split points at all midpoints can achieve lower error than selecting them uniformly at random, and subsampling data points at each node decreases training time without affecting predictive accuracy. We also show that the gap between variants with proofs of consistency and those used in practice can be accounted for by the requirement to split data points into structure and estimation streams. 
Our work with online forests demonstrates the potential improvement that is possible by selecting candidate split points at data points, constraining memory with a fixed frontier and training with multiple passes through the data.
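
Two of the candidate split point strategies contrasted in this abstract, all midpoints between observed values versus thresholds drawn uniformly at random, can be sketched for a single feature as follows. The function names and the use of the feature's observed range for the uniform draw are assumptions for illustration and are not taken from the thesis.

```python
import random


def midpoint_candidates(values):
    """All midpoints between consecutive distinct sorted feature values."""
    distinct = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]


def uniform_candidates(values, k, rng):
    """k thresholds drawn uniformly at random from the observed value range."""
    lo, hi = min(values), max(values)
    return [rng.uniform(lo, hi) for _ in range(k)]
```

The midpoint strategy yields at most one candidate per gap between distinct values, so every candidate induces a different partition of the data, whereas uniformly drawn thresholds may fall into the same gap more than once.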

24

Adriansson, Nils, and Ingrid Mattsson. "Forecasting GDP Growth, or How Can Random Forests Improve Predictions in Economics?" Thesis, Uppsala universitet, Statistiska institutionen, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-243028.

Abstract:

GDP is used to measure the economic state of a country, and accurate forecasts of it are therefore important. Using the Economic Tendency Survey, we investigate forecasting quarterly GDP growth with the data mining technique Random Forest. Comparisons are made with a benchmark AR(1) model and an ad hoc linear model built on the most important variables suggested by the Random Forest. Evaluation by forecasting shows that the Random Forest makes the most accurate forecasts, supporting the theory that there are benefits to using Random Forests on economic time series.
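
For readers unfamiliar with the AR(1) benchmark named in this abstract, a minimal sketch follows: it fits y_t = c + phi * y_(t-1) by ordinary least squares and produces a one-step-ahead forecast. The function names and the least-squares estimator are assumptions for illustration; the thesis may have estimated its benchmark differently.

```python
def fit_ar1(series):
    """Least-squares estimates (c, phi) for y_t = c + phi * y_(t-1)."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    x = series[:-1]  # lagged values
    y = series[1:]   # current values
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("lagged series is constant")
    phi = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    c = my - phi * mx
    return c, phi


def forecast_next(series, c, phi):
    """One-step-ahead forecast from the last observed value."""
    return c + phi * series[-1]
```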

25

Mohammed, D. Y. "Overlapped speech and music segmentation using singular spectrum analysis and random forests." Thesis, University of Salford, 2017. http://usir.salford.ac.uk/43773/.

Abstract:

Recent years have seen ever-increasing volumes of digital media archives and an enormous amount of user-contributed content. As demand for indexing and searching these resources has increased, and new technologies such as multimedia content management systems, enhanced digital broadcasting, and semantic web have emerged, audio information mining and automated metadata generation have received much attention. Manual indexing and metadata tagging are time-consuming and subject to the biases of individual workers. An automated architecture able to extract information from audio signals, generate content-related text descriptors or metadata, and enable further information mining and searching would be a tangible and valuable solution. In the field of audio classification, audio signals may be broadly divided into speech or music. Most studies, however, neglect the fact that real audio soundtracks may contain speech, music, or a combination of the two. This is considered the major hurdle to achieving high performance in automatic audio classification, since overlapping can contaminate relevant characteristics and features, causing incorrect classification or information loss. This research undertakes an extensive review of the state of the art by outlining the well-established audio features and machine learning techniques that have been applied in a broad range of audio segmentation and recognition areas. Audio classification systems and the suggested solutions for the mixed soundtracks problem are presented.
The suggested solutions can be listed as follows: developing augmented and modified features for recognising audio classes even in the presence of overlaps between them; robust segmentation of a given overlapped soundtrack stream based on an innovative method of audio decomposition using Singular Spectrum Analysis (SSA), which has been studied extensively and has received increasing attention in the past two decades as a time series decomposition method with many applications; adoption and development of driven classification methods; and finally a technique for continuous time series tasks. In this study, SSA has been investigated and found to be an efficient way to discriminate speech/music in mixed soundtracks by two different methods, each of which has been developed and validated in this research. The first method serves to mitigate the overlapping ratio between speech and music in the mixed soundtracks by generating two new soundtracks with a lower level of overlapping. Next, the feature space is calculated for the output audio streams, and these are classified using random forests into either speech or music. One of the distinct characteristics of this method is the separation of the key speech/music features, which leads to improved classification performance. Nevertheless, this method encountered a few obstacles, including excessively long processing time and increased storage requirements (each frame is represented by two outputs), all of which lead to a greater computational load than before. Meanwhile, the second method employs the SSA technique to decompose a given audio signal into a series of Principal Components (PCs), where each PC corresponds to a particular pattern of oscillation. Then, the transformed well-established feature is measured for each PC in order to classify it into either speech or music based on the baseline classification system using an RF machine learning technique.
The classification performance on real-world soundtracks is effectively improved, which is demonstrated by comparing speech/music recognition using conventional classification methods and the proposed SSA method. The second proposed and developed method can detect pure speech, pure music, and mixed audio with a much lower complexity level.

26

Samarakoon, Prasad. "Random Regression Forests for Fully Automatic Multi-Organ Localization in CT Images." Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAM039/document.

Abstract:

Locating an organ in a medical image by bounding that particular organ with respect to an entity such as a bounding box or sphere is termed organ localization. Multi-organ localization takes place when multiple organs are localized simultaneously. Organ localization is one of the most crucial steps and is involved in all phases of patient treatment, from the diagnosis phase to the final follow-up phase. The use of the supervised machine learning technique called random forests has shown very encouraging results in many sub-disciplines of medical image analysis. Similarly, Random Regression Forests (RRF), a specialization of random forests for regression, have produced state-of-the-art results for fully automatic multi-organ localization. Although RRF have produced state-of-the-art results in multi-organ localization, the relative novelty of the method in this field still raises numerous questions about how to optimize its parameters for consistent and efficient usage. The first objective of this thesis is to acquire a thorough knowledge of the inner workings of RRF. Building on that knowledge, we proposed a consistent and automatic parametrization of RRF. Then, we empirically proved the spatial independence hypothesis used by RRF. Finally, we proposed a novel RRF specialization called Light Random Regression Forests for multi-organ localization, which improves memory footprint and computational efficiency.

27

Stum, Alexander Knell. "Random Forests Applied as a Soil Spatial Predictive Model in Arid Utah." DigitalCommons@USU, 2010. https://digitalcommons.usu.edu/etd/736.

Abstract:

Initial soil surveys are incomplete for large tracts of public land in the western USA. Digital soil mapping offers a quantitative approach as an alternative to traditional soil mapping. I sought to predict soil classes across an arid to semiarid watershed of western Utah by applying random forests (RF) and using environmental covariates derived from Landsat 7 Enhanced Thematic Mapper Plus (ETM+) and digital elevation models (DEM). Random forests are similar to classification and regression trees (CART). However, RF is doubly random. Many (e.g., 500) weak trees are grown (trained) independently because each tree is trained with a new randomly selected bootstrap sample, and a random subset of variables is used to split each node. To train and validate the RF trees, 561 soil descriptions were made in the field. An additional 111 points were added by case-based reasoning using aerial photo interpretation. As RF makes classification decisions from the mode of many independently grown trees, model uncertainty can be derived. The overall out-of-bag (OOB) error was lower without weighting of classes; weighting increased the overall OOB error and the resulting output did not reflect soil-landscape relationships observed in the field. The final RF model had an OOB error of 55.2% and predicted soils on landforms consistent with soil-landscape relationships. The OOB error for individual classes typically decreased with increasing class size. In addition to the final classification, I determined the second and third most likely classification, model confidence, and the hypothetical extent of individual classes. Pixels that had a high possibility of belonging to multiple soil classes were aggregated using a minimum confidence value based on limiting soil features, which is an effective and objective method of determining membership in soil map unit associations and complexes mapped at the 1:24,000 scale.
Variables derived from both DEM and Landsat 7 ETM+ sources were important for predicting soil classes based on Gini and standard measures of variable importance and OOB errors from groves grown with exclusively DEM- or Landsat-derived data. Random forests was a powerful predictor of soil classes and produced outputs that facilitated further understanding of soil-landscape relationships.
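
The two sources of randomness described in this abstract (a bootstrap sample per tree and a random subset of variables per node), together with the ranking of classes by tree votes, can be sketched as below. This is a simplified illustration, not the software used in the thesis: the function names, the `mtry` parameter name, and the use of vote fractions as a confidence measure are assumptions.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(n_features, mtry, rng):
    """Pick mtry distinct candidate variables for splitting one node."""
    return sorted(rng.sample(range(n_features), mtry))


def rank_classes(tree_votes):
    """Return (class, vote fraction) pairs, most voted class first."""
    counts = Counter(tree_votes)
    total = len(tree_votes)
    return [(label, k / total) for label, k in counts.most_common()]
```

The first entry returned by `rank_classes` is the mode of the votes, that is, the forest's prediction; the second and third entries correspond to the next most likely classes, and the vote fraction of the first entry can serve as a simple confidence value. The out-of-bag rows of each tree are the ones on which that tree's error can be estimated without a separate test set.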

28

rdlyons@indiana.edu. "Markov Chain Intersections and the Loop-Erased Walk." ESI preprints, 2001. ftp://ftp.esi.ac.at/pub/Preprints/esi1058.ps.

29

Hudec, Vladimír. "Klasifikační metody pro data z mikročipů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2011. http://www.nusl.cz/ntk/nusl-236982.

Abstract:

This thesis discusses data obtained from gene chips and methods for their analysis. It reviews several analysis methods and focuses on the Random Forests method. The data set used for the experiments is described. The methods are implemented in the R language environment, then tested, and the results are presented and compared. The results obtained with Random Forests are compared with those of other experiments on the same data set.

30

Li, Ke. "Customer Relationship Management: from Conversion to Churn to Winback." Diss., Temple University Libraries, 2013. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/221333.

Abstract:

Business Administration/Marketing
Ph.D.
With the grant of a big CRM dataset from a large media company, this dissertation examines four categories of factors that could affect three stages of customer relationship management: customer acquisition, retention, and winback of lost customers. Specifically, with the aid of the machine learning method of random forests and text mining techniques, this study identifies which factors play bigger roles than others during each of the three stages of CRM. The factors considered are customer heterogeneity (e.g., usage of self-care service channels, duration of service, responsiveness to marketing actions), the firm's marketing initiatives (e.g., the volume of marketing communications, the depth of promotion, the communication channels used, and the marketing penetration in different geographical areas), customer self-reported deactivation reasons, and call center notes in text form. Furthermore, the dissertation examines how the effects of these factors evolve across the three stages in shaping customers' decisions on whether to convert to a paid customer, to churn, or to reactivate their service with the company. The findings help managers better allocate their resources in the processes of acquiring, retaining and winning back customers.
Temple University--Theses

31

Hjerpe, Adam. "Computing Random Forests Variable Importance Measures (VIM) on Mixed Numerical and Categorical Data." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-185496.

Abstract:

The Random Forest model is commonly used as a predictor function and has proven useful in a variety of applications. Its popularity stems from the combination of high prediction accuracy, the ability to model high-dimensional complex data, and applicability under predictor correlations. This report investigates the random forest variable importance measure (VIM) as a means to find a ranking of important variables. The robustness of the VIM under imputation of categorical noise and its capability to differentiate informative predictors from non-informative variables are investigated. The selection of variables may improve the robustness of the predictor, improve the prediction accuracy, reduce computational time, and serve as an exploratory data analysis tool. In addition, the partial dependency plot obtained from the random forest model is examined as a means to find underlying relations in a non-linear simulation study.
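
The idea behind a permutation-based variable importance measure, one common form of VIM, can be sketched as follows: shuffle one predictor column, re-evaluate the model, and record how much the error increases. This simplified version works with any prediction function on a fixed data set, whereas a random forest typically computes the measure per tree on out-of-bag samples; the function names and the mean squared error criterion are assumptions for illustration.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, column, rng, repeats=10):
    """Average increase in MSE after shuffling one predictor column.

    predict: function mapping one row (a list of values) to a prediction.
    rows:    list of rows, each a list of predictor values.
    """
    baseline = mean_squared_error(targets, [predict(r) for r in rows])
    increases = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild the rows with only the chosen column permuted.
        permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
        error = mean_squared_error(targets, [predict(r) for r in permuted])
        increases.append(error - baseline)
    return sum(increases) / repeats
```

A predictor that the model ignores receives an importance of exactly zero under this scheme, while shuffling an informative predictor raises the error.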

32

Persson, Karl. "Predicting movie ratings : A comparative study on random forests and support vector machines." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-11119.

Abstract:

The aim of this work is to evaluate the prediction performance of random forests in comparison to support vector machines for predicting the numerical user rating of a movie using pre-release attributes such as its cast, directors, budget and movie genres. To answer this question, an experiment was conducted on predicting the overall user rating of 3376 Hollywood movies, using data from the well-established movie database IMDb. The prediction performance of the two algorithms was assessed and compared over three commonly used performance and error metrics, and evaluated by means of significance testing to investigate whether any significant differences could be identified. The results indicate some differences between the two algorithms, with consistently better performance from random forests than from support vector machines over all of the performance metrics, as well as significantly better results for two out of three metrics. Although the results indicate a slight difference, both algorithms show great similarities in their prediction performance, making it hard to draw any general conclusions on which algorithm yields the most accurate movie predictions.

33

Pauly, Olivier [Verfasser], Nassir [Akademischer Betreuer] Navab, and Nicholas [Akademischer Betreuer] Ayache. "Random Forests for Medical Applications / Olivier Pauly. Gutachter: Nicholas Ayache. Betreuer: Nassir Navab." München : Universitätsbibliothek der TU München, 2012. http://d-nb.info/1030099510/34.

34

Kimes, Ryan Vincent. "Quantifying the Effects of Correlated Covariates on Variable Importance Estimates from Random Forests." VCU Scholars Compass, 2006. http://scholarscompass.vcu.edu/etd/1433.

Abstract:

Recent advances in computing technology have led to the development of algorithmic modeling techniques. These methods can be used to analyze data that are difficult to analyze using traditional statistical models. This study examined the effectiveness of variable importance estimates from the random forest algorithm in identifying the true predictor among a large number of candidate predictors. A simulation study was conducted using twenty different levels of association among the independent variables and seven different levels of association between the true predictor and the response. We conclude that the random forest method is an effective classification tool when the goals of a study are to produce an accurate classifier and to provide insight regarding the discriminative ability of individual predictor variables. These goals are common in gene expression analysis; therefore, we apply the random forest method to estimate variable importance on a microarray data set.

35

Varatharajah, Thujeepan, and Victor Eriksson. "A comparative study on artificial neural networks and random forests for stock market prediction." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186452.

Abstract:

This study investigates the predictive performance of two different machine learning (ML) models on the stock market and compares the results. The chosen models are based on artificial neural networks (ANN) and random forests (RF). The models are trained on two separate data sets, and the predictions are made for the next day's closing price. The input vectors of the models consist of 6 different financial indicators, which are based on the closing prices of the past 5, 10 and 20 days. The performance evaluation is done by analyzing and comparing values such as the root mean squared error (RMSE) and mean absolute percentage error (MAPE) for the test period. Specific behavior in subsets of the test period is also analyzed to evaluate the consistency of the models. The results showed that the ANN model performed better than the RF model: throughout the test period its predictions had lower errors relative to the actual prices, and it thus made more accurate predictions overall.
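
The two error metrics named in this abstract can be written out as a short sketch. The definitions below are the standard ones, with MAPE expressed in percent; the function names are assumptions, and the thesis may have scaled or averaged its values differently.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero, which is not a concern
    for strictly positive closing prices.
    """
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n
```

RMSE is expressed in the same units as the prices and penalizes large misses more heavily, whereas MAPE is scale-free and therefore easier to compare across stocks with different price levels.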

36

Petersson, Andreas. "Data mining file sharing metadata : A comparison between Random Forests Classificiation and Bayesian Networks." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-11180.

Abstract:

In this comparative study based on experimentation it is demonstrated that the two evaluated machine learning techniques, Bayesian networks and random forests, have similar predictive power in the domain of classifying torrents on BitTorrent file sharing networks. This work was performed in two steps. First, a literature analysis was performed to gain insight into how the two techniques work and what types of attacks exist against BitTorrent file sharing networks. After the literature analysis, an experiment was performed to evaluate the accuracy of the two techniques. The results show no significant advantage of using one algorithm over the other when only considering accuracy. However, ease of use lies in Random forests’ favour because the technique requires little pre-processing of the data and still generates accurate results with few false positives.

37

Petersson, Andreas. "Data mining file sharing metadata : A comparison between Random Forests Classification and Bayesian Networks." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-11285.

Abstract:

In this comparative study based on experimentation it is demonstrated that the two evaluated machine learning techniques, Bayesian networks and random forests, have similar predictive power in the domain of classifying torrents on BitTorrent file sharing networks. This work was performed in two steps. First, a literature analysis was performed to gain insight into how the two techniques work and what types of attacks exist against BitTorrent file sharing networks. After the literature analysis, an experiment was performed to evaluate the accuracy of the two techniques. The results show no significant advantage of using one algorithm over the other when only considering accuracy. However, ease of use lies in Random forests’ favour because the technique requires little pre-processing of the data and still generates accurate results with few false positives.

38

Kanavati, Fahdi. "Efficient extraction of semantic information from medical images in large datasets using random forests." Thesis, Imperial College London, 2017. http://hdl.handle.net/10044/1/58017.

Abstract:

Large datasets of unlabelled medical images are increasingly becoming available; however, only a small subset tends to be manually semantically labelled, as labelling is a tedious and extremely time-consuming task for large datasets. This thesis aims to tackle the problem of efficiently extracting semantic information, in the form of image segmentations and organ localisations, from large datasets of unlabelled medical images. To do so, we investigate the suitability of supervoxels and random classification forests for the task. The first contribution of this thesis is a novel method for efficiently estimating coarse correspondences between pairs of images that can handle difficult cases that exhibit large variations in fields of view. The proposed method adapts the random forest framework, which is a supervised learning algorithm, to work in an unsupervised manner by automatically generating labels for training via the use of supervoxels. The second contribution of this thesis is a method that extends our first contribution so that it can be applied efficiently to a large dataset of images. The proposed method is efficient and can be used to obtain correspondences between a large number of object-like supervoxels that are representative of organ structures in the images. The method is evaluated for the applications of organ-based image retrieval and weakly-supervised image segmentation using extremely minimal user input. While the method does not match the segmentation accuracy of current fully-supervised state-of-the-art methods for all organs in an abdominal CT dataset, it does provide a promising way of efficiently extracting and parsing a large dataset of medical images for the purpose of further processing.

39

Pasquale, Daniel L. (Daniel Louis). "Characterizing drag and velocity within model mangrove forests of ordered and random tree arrangement." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/111525.

Abstract:

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Civil and Environmental Engineering, 2017.
Cataloged from PDF version of thesis.
Includes bibliographical references (page 50).
Changes in velocity and drag force on model mangrove trees within 13 different simulated mangrove forest segments in a flume were investigated. The simulated forests were composed of 1/12 scale model Rhizophora mangrove trees placed at three densities: low (3.42 trees/m²), medium (6.34 trees/m²), and high (9.27 trees/m²). For the low tree density cases, one forest with ordered tree placement and six forests with random tree placement were studied. For the medium and high tree density cases, one ordered tree arrangement and two random tree arrangements were studied. Spatial arrangements of the forests were described using the mean distance to nearest neighbor (NN) for all trees in a particular forest. The forest arrangements were also described using the spatial aggregation index developed by Clark and Evans. [9] For forests of ordered tree arrangement, depth-averaged velocity was found to decrease from the leading edge to the trailing edge of the forest segment at each density, and the reduction in velocity moving through the forest was greater for denser forests. Vertical profiles of velocity show that a region of high velocity developed above the root zone when moving from the leading edge to the trailing edge of the forest. This effect was more pronounced in the forests with random tree arrangement and low mean NN distance. For all spatial arrangements, the drag force acting on an individual tree decreased from the leading edge to the trailing edge of the forest. Larger decreases in drag force occurred within denser forests. Mangrove tree drag coefficient values were found to be similar or slightly higher for trees within forests of random arrangement compared to trees within forests of ordered arrangement, but further study examining a greater number of random tree arrangements is needed. This study describes changes in the vulnerability of a mangrove forest that could occur if mangrove trees were removed from the forest by natural or human causes.
by Daniel L. Pasquale.
M. Eng.
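
The two spatial descriptors mentioned in the abstract above, the mean nearest-neighbor distance and the Clark and Evans aggregation index, can be sketched as follows. This is an illustrative sketch only, not the thesis's procedure: the index is taken as the observed mean NN distance divided by the value expected for a random pattern of the same density, 1 / (2 * sqrt(density)). No edge correction is applied, and the function names and grid layout are assumptions. Values near 1 suggest a random arrangement, values below 1 clustering, and values above 1 regular spacing.

```python
import math


def mean_nn_distance(points):
    """Average distance from each point to its closest other point."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
    return total / len(points)


def expected_nn_distance(density):
    """Expected mean NN distance for a random pattern (points per unit area)."""
    return 1.0 / (2.0 * math.sqrt(density))


def clark_evans_index(points, area):
    """Observed mean NN distance over the random-pattern expectation."""
    density = len(points) / area
    return mean_nn_distance(points) / expected_nn_distance(density)


# Hypothetical ordered layout: a 3 x 3 grid with 1 m spacing in a 3 m x 3 m plot.
grid = [(x + 0.5, y + 0.5) for x in range(3) for y in range(3)]
print(clark_evans_index(grid, area=9.0))

# Random-pattern NN distance at the low density quoted in the abstract.
print(expected_nn_distance(3.42))
```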

40

Julock, Gregory Alan. "The Effectiveness of a Random Forests Model in Detecting Network-Based Buffer Overflow Attacks." NSUWorks, 2013. http://nsuworks.nova.edu/gscis_etd/190.

Abstract:

Buffer overflows are a common type of network intrusion attack that continues to plague the networked community. Unfortunately, this type of attack is not well detected by current data mining algorithms. This research investigated the use of Random Forests, an ensemble technique that creates multiple decision trees and then combines their votes to select the most popular class. The research investigated Random Forests' effectiveness in detecting buffer overflows compared to other data mining methods such as CART and Naïve Bayes. Random Forests was used for variable reduction, cost-sensitive classification was applied, and each method's detection performance was compared and reported along with the receiver operating characteristics. The experiment showed that Random Forests outperformed CART and Naïve Bayes in classification performance. Using a technique to identify the most important variables for buffer overflow detection, Random Forests was also able to improve upon its buffer overflow classification performance.
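
In a random forest classifier, each tree casts a vote for a class and the forest returns the most popular class. The voting step alone can be sketched as below; the labels and the per-tree predictions are hypothetical, and training the trees is outside the scope of the sketch.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees.

    Ties are resolved in favour of the label that appears first in the input.
    """
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for one network connection.
votes = ["overflow", "normal", "overflow", "overflow", "normal"]
print(majority_vote(votes))
```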

41

Herlitz, Mattias. "Analyzing the Tobii Real-world-mapping tool and improving its workflow using Random Forests." Thesis, KTH, Matematisk statistik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-228474.

Abstract:

The Tobii Pro Glasses 2 are used to record gaze data that is used for market research or scientific experiments. To make extraction of relevant statistics more efficient, the gaze points in the recorded video are mapped to a static snapshot with areas of interest (AOIs). The most important statistics revolve around fixations. A fixation is when a person is keeping his or her vision still for a short period of time. The method most used today is to manually map the gaze points. However, a faster method is automated mapping using the Real World Mapping (RWM) tool. In order to examine the reliability of RWM, the fixations from different recordings and projects were analyzed using Decision Trees. Further, a Random Forest (RF) model was constructed in order to predict if a gaze point was correctly or incorrectly mapped. It was shown that fixation classification on data from RWM performed significantly worse than the same fixation classification run on manually mapped data. It was shown that RWM works better when head movement is low and AOIs are set appropriately. This can guide researchers in setting up experiments, although major improvements to RWM are needed. The RF classifier showed promising results on several test sets for mapped gaze points. It also showed promising results for gaze points that were not mapped and were close in time to being mapped. In conclusion, the RF should replace current methods of estimating the quality of RWM gaze points. Gaze points that are classified as badly mapped can be manually remapped. If RWM fails to map large segments of gaze points to a snapshot, visually classifying these to be remapped is the preferred method.
Tobii Pro Glasses 2 are used to record gaze data in market research and scientific experiments. The gaze points are mapped from the recorded video to an image with areas of interest (AOIs). Most of the important metrics concern fixations, which occur when a person looks at the same place for a short period. The method mainly used today is to map gaze points manually, but a faster way is automatic mapping with the Real World Mapping (RWM) tool. The reliability of RWM was examined by analysing fixations from several recordings using decision trees. A method for classifying gaze points as correctly or incorrectly mapped was built using Random Forests (RF). The results show that RWM is not particularly good at mapping fixations, neither at finding them nor at mapping them to the correct AOI. RWM turned out to work better when movement is limited and the AOIs are appropriately designed, which can serve as guidance for anyone conducting an experiment. RWM should nevertheless be improved. The RF classification gave good results on several test sets where the gaze points had been mapped to an image by RWM, and on gaze points that were not mapped by RWM but were close in time to mapped gaze points. Gaze points far from mapped gaze points had poor test results. The conclusion was that relevant gaze points should be classified with RF so that incorrectly mapped gaze points can be remapped. If RWM fails to map large segments of gaze points, visual classification should be used.

42

Brokamp, Richard C. "Land Use Random Forests for Estimation of Exposure to Elemental Components of Particulate Matter." University of Cincinnati / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1463130851.

43

Williams, Paige T. "Mapping Smallholder Forest Plantations in Andhra Pradesh, India using Multitemporal Harmonized Landsat Sentinel-2 S10 Data." Thesis, Virginia Tech, 2020. http://hdl.handle.net/10919/104234.

Abstract:

The objective of this study was to develop a method by which smallholder forest plantations can be mapped accurately in Andhra Pradesh, India using multitemporal (intra- and inter-annual) visible and near-infrared (VNIR) bands from the Sentinel-2 MultiSpectral Instruments (MSIs). Dependency on and scarcity of wood products have driven the deforestation and degradation of natural forests in Southeast Asia. At the same time, forest plantations have been established both within and outside of forests, with the latter (as contiguous blocks) being the focus of this study. The ecosystem services provided by natural forests are different from those of plantations. As such, being able to separate natural forests from plantations is important. Unfortunately, there are constraints to accurately mapping planted forests in Andhra Pradesh (and other similar landscapes in South and Southeast Asia) using remotely sensed data due to the plantations' small size (average 2 hectares), short rotation ages (often 4-7 years for timber species), and spectral similarities to croplands and natural forests. The East and West Godavari districts of Andhra Pradesh were selected as the area for a case study. Cloud-free Harmonized Landsat Sentinel-2 (HLS) S10 data was acquired over six dates, from different seasons, as follows: December 28, 2015; November 22, 2016; November 2, 2017; December 22, 2017; March 1, 2018; and June 15, 2018. Cloud-free satellite data are not available during the monsoon season (July to September) in this coastal region. In situ data on forest plantations, provided by collaborators, was supplemented with additional training data representing other land cover subclasses in the region: agriculture, water, aquaculture, mangrove, palm, forest plantation, ground, natural forest, shrub/scrub, sand, and urban, with a total sample size of 2,230. These high-quality samples were then aggregated into three land use classes: non-forest, natural forest, and forest plantations. 
Image classification used random forests within the Julia DecisionTree package on a thirty-band stack that was composed of the VNIR bands and NDVI images for all dates. The median classification accuracy from the 5-fold cross validation was 94.3%. Our results, predicated on high quality training data, demonstrate that (mostly smallholder) forest plantations can be separated from natural forests even using only the Sentinel-2 VNIR bands when multitemporal data (across both years and seasons) are used.
The objective of this study was to develop a method by which smallholder forest plantations can be mapped accurately in Andhra Pradesh, India using multitemporal (intra- and inter-annual) visible (red, green, blue) and near-infrared (VNIR) bands from the European Space Agency satellite Sentinel-2. Dependency on and scarcity of wood products have driven the deforestation and degradation of natural forests in Southeast Asia. At the same time, forest plantations have been established both within and outside of forests, with the latter (as contiguous blocks) being the focus of this study. The ecosystem services provided by natural forests are different from those of plantations. As such, being able to separate natural forests from plantations is important. Unfortunately, there are constraints to accurately mapping planted forests in Andhra Pradesh (and other similar landscapes in South and Southeast Asia) using remotely sensed data due to the plantations' small size (average 2 hectares), short rotation ages (often 4-7 years for timber species), and spectral (reflectance from satellite imagery) similarities to croplands and natural forests. The East and West Godavari districts of Andhra Pradesh were selected as the area for a case study. Cloud-free Harmonized Landsat Sentinel-2 (HLS) S10 images were acquired over six dates, from different seasons, as follows: December 28, 2015; November 22, 2016; November 2, 2017; December 22, 2017; March 1, 2018; and June 15, 2018. Cloud-free satellite data are not available during the monsoon season (July to September) in this coastal region. In situ data on forest plantations, provided by collaborators, was supplemented with additional training data points (X and Y locations with land cover class) representing other land cover subclasses in the region: agriculture, water, aquaculture, mangrove, palm, forest plantation, ground, natural forest, shrub/scrub, sand, and urban, with a total of 2,230 training points. 
These high-quality samples were then aggregated into three land use classes: non-forest, natural forest, and forest plantations. Image classification used random forests within the Julia DecisionTree package on a thirty-band stack that was composed of the VNIR bands and NDVI (calculation related to greenness, i.e. higher value = more vegetation) images for all dates. The median classification accuracy from the 5-fold cross validation was 94.3%. Our results, predicated on high quality training data, demonstrate that (mostly smallholder) forest plantations can be separated from natural forests even using only the Sentinel-2 VNIR bands when multitemporal data (across both years and seasons) are used.
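
The thirty-band stack described above can be illustrated with a small sketch. It assumes, as an inference from the abstract rather than a stated fact, that each of the six dates contributes four VNIR bands (blue, green, red, near-infrared) plus one NDVI layer, giving 6 x 5 = 30 features per pixel. NDVI is computed in its usual form, (NIR - red) / (NIR + red). The study itself used Julia; Python is used here only for illustration, and all names and reflectance values are hypothetical.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index; returns 0.0 when both bands are zero."""
    denominator = nir + red
    if denominator == 0:
        return 0.0
    return (nir - red) / denominator


def pixel_features(observations):
    """Flatten per-date reflectances into one feature vector for a single pixel.

    Each observation is a dict with 'blue', 'green', 'red' and 'nir' keys.
    """
    features = []
    for obs in observations:
        features.extend(
            [obs["blue"], obs["green"], obs["red"], obs["nir"],
             ndvi(obs["nir"], obs["red"])]
        )
    return features


# Hypothetical reflectances for one pixel, repeated over six acquisition dates.
one_date = {"blue": 0.04, "green": 0.07, "red": 0.10, "nir": 0.50}
stack = pixel_features([one_date] * 6)
print(len(stack))
```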

44

ARAÚJO, Gilderlanio Santana de. "Uso de random forests e redes biológicas na associação de poliformismos à doença de Alzheimer." Universidade Federal de Pernambuco, 2013. https://repositorio.ufpe.br/handle/123456789/18012.

Abstract:

Made available in DSpace on 2016-10-18T19:17:10Z (GMT). Previous issue date: 2013-03-07
FACEPE
The development of low-cost genotyping techniques (SNP arrays) and the annotation of thousands of single nucleotide polymorphisms (SNPs) in public databases have led to an increasing number of Genome-Wide Association Studies (GWAS). In these studies, a large number of SNPs (hundreds of thousands) are evaluated with univariate statistical methods in order to find SNPs associated with a particular phenotype. Univariate tests are unable to capture high-order relationships among SNPs, which are common in complex genetic diseases, and are affected by the high correlation between SNPs in the same genomic region. Machine learning methods, such as the Random Forest (RF), have been applied to GWAS data to predict disease risk and capture the set of SNPs associated with the disease. Although RF is a method with recognized performance on high-dimensional data and the capacity to capture non-linear relationships, the use of all SNPs present in GWAS data is computationally intractable. In this study we propose the use of biological networks for the initial selection of candidate SNPs to be used by RF. Starting from an initial set of genes already related to the disease in the literature, we use tools for constructing gene-gene interaction networks to find novel genes that might be associated with the disease. It is therefore possible to extract a small number of SNPs, which makes the RF method feasible. The experiments conducted in this study focus on investigating which polymorphisms may influence susceptibility to Alzheimer’s disease (AD) and mild cognitive impairment (MCI). The final result of the analyses is the delineation of a methodology for using RF to analyse GWAS data, as well as the characterization of potential risk factors for AD.
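
The pre-selection idea described above (seed genes, then interacting genes, then the SNPs that map to them) can be sketched as follows. This is a loose illustration under stated assumptions: it expands the seed set by direct neighbours only, and the gene names, SNP identifiers, and function names are all hypothetical placeholders, not data or tooling from the study.

```python
def expand_with_neighbours(seed_genes, interactions):
    """Return the seed genes plus every gene that directly interacts with one."""
    seeds = set(seed_genes)
    selected = set(seeds)
    for gene_a, gene_b in interactions:
        if gene_a in seeds:
            selected.add(gene_b)
        if gene_b in seeds:
            selected.add(gene_a)
    return selected


def candidate_snps(snp_to_gene, genes):
    """Keep only the SNPs annotated to one of the selected genes."""
    return sorted(snp for snp, gene in snp_to_gene.items() if gene in genes)


# Hypothetical placeholder data.
seeds = {"GENE_A"}
network = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C"), ("GENE_D", "GENE_A")]
annotation = {"snp1": "GENE_A", "snp2": "GENE_C", "snp3": "GENE_D", "snp4": "GENE_E"}

genes = expand_with_neighbours(seeds, network)
print(candidate_snps(annotation, genes))
```

Only the reduced SNP list would then be passed to the random forest, which is what keeps the computation tractable.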

45

Strobl, Carolin, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. "Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution." Department of Statistics and Mathematics, WU Vienna University of Economics and Business, 2006. http://epub.wu.ac.at/1274/1/document.pdf.

Abstract:

Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale level or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types. Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale level or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analysing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research. (author's abstract)
Series: Research Report Series / Department of Statistics and Mathematics
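
The abstract above contrasts bootstrap sampling with replacement and subsampling without replacement. The difference between the two resampling schemes can be sketched as below. The sample size, the 0.632 subsample fraction, and the function names are illustrative assumptions for this sketch, not values taken from the abstract.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; duplicates are expected."""
    return [rng.randrange(n) for _ in range(n)]


def subsample_indices(n, fraction, rng):
    """Draw a fraction of the row indices without replacement; no duplicates."""
    return rng.sample(range(n), round(fraction * n))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
boot = bootstrap_indices(1000, rng)
sub = subsample_indices(1000, 0.632, rng)

print(len(boot), len(set(boot)))  # sample size vs. number of distinct rows
print(len(sub), len(set(sub)))
```

With replacement, some observations appear several times in a tree's training sample; without replacement, every selected observation appears exactly once.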

46

Geremia, Ezequiel. "Spatial random forests for brain lesions segmentation in MRIs and model-based tumor cell extrapolation." Phd thesis, Université Nice Sophia Antipolis, 2013. http://tel.archives-ouvertes.fr/tel-00838795.

Abstract:

The large size of the datasets produced by medical imaging protocols contributes to the success of supervised discriminative methods for semantic labelling of images. Our study makes use of a general and efficient emerging framework, discriminative random forests, for the detection of brain lesions in multi-modal magnetic resonance images (MRIs). The contribution is three-fold. First, we focus on segmentation of brain lesions, which is an essential task for diagnosis, prognosis and therapy planning. A context-aware random forest is designed for the automatic multi-class segmentation of MS lesions, low grade and high grade gliomas in MR images. It uses multi-channel MRIs, prior knowledge on tissue classes, symmetrical and long-range spatial context to discriminate lesions from background. Then, we investigate the promising perspective of estimating the brain tumor cell density from MRIs. A generative-discriminative framework is presented to learn the latent and clinically unavailable tumor cell density from model-based estimations associated with synthetic MRIs. The generative model is a validated and publicly available biophysiological tumor growth simulator. The discriminative model builds on multi-variate regression random forests to estimate the voxel-wise distribution of tumor cell density from input MRIs. Finally, we present the "Spatially Adaptive Random Forests", which merge the benefits of multi-scale and random forest methods, and apply them to the previously cited classification and regression settings. Quantitative evaluation of the proposed methods is carried out on publicly available labeled datasets and demonstrates state-of-the-art performance.

47

Bylund, Rebecca, and Höök Malin J-son. "Går det prediktera demens? : En jämförande studie mellan Logistisk regression, Elastic Net och Random Forests." Thesis, Umeå universitet, Statistik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-149728.

Abstract:

This study builds on an earlier result by Boraxbekk et al. (2015), who used data from the Betula project to show that certain episodic memory tests, together with age and education level, are significantly associated with the development of dementia. The aim of this study is to compare the classification methods Random Forests, Elastic Net and Logistic Regression with respect to their performance in classifying dementia. In addition to the binary case (dementia: yes/no), the study examines predictive performance for the development of dementia within the time spans 1-10 years and 11-22 years, in order to investigate whether early diagnosis of dementia would be possible. Performance is also evaluated for the setting in which individuals who died during the follow-up period of up to 22 years form a separate class. The results show that none of the classification methods performs well enough to enable prediction of dementia on the given data, and that the differences between the methods' results are very small. No major difference in performance can be shown when the time aspect of the development is excluded either. Nor can any improvement in the prediction of dementia be seen when individuals who died within the time frame of the study are controlled for.

48

Al, Maathidi M. M. "Optimal feature selection and machine learning for high-level audio classification : a random forests approach." Thesis, University of Salford, 2017. http://usir.salford.ac.uk/44338/.

Abstract:

Content-related information, metadata, and semantics can be extracted from the soundtracks of multimedia files. Speech recognition, music information retrieval and environmental sound detection techniques have developed into a fairly mature technology, enabling a final text mining process to obtain semantics for the audio scene. An efficient speech, music and environmental sound classification system, which correctly identifies these three types of audio signals and feeds them into dedicated recognisers, is a critical pre-processing stage for such a content analysis system. The performance and computational efficiency of such a system are predominantly dependent on the selected features. This thesis presents a detailed study to identify suitable classification features and associate a suitable machine learning technique with the intended classification task. In particular, a systematic feature selection procedure is developed that employs the random forests classifier to rank the features according to their importance and reduces the dimensionality of the feature space accordingly. This new technique avoids the trial-and-error approach used by many researchers. The implemented feature selection produces results related to individual classification tasks, unlike the commonly used approaches based on statistical distance criteria, which do not consider the intended classification task; this makes it more suitable for supervised learning with specific purposes. A final collective decision-making stage is employed to combine the patterns of multiple class detectors into one, producing a single classification result for each input frame. The performance of the proposed feature selection technique has been compared with the techniques proposed by the MPEG-7 standard to extract the reduced feature space. The results show a significant improvement in the resulting classification accuracy; at the same time, the feature space is simplified and the computational overhead reduced. The proposed feature selection and machine learning technique enables the use of only 30 of the 47 features without degrading the classification accuracy, and the classification accuracy dropped by only 1.7% when just 10 features were used. The validation also shows good performance, and the last stage of collective decision making was able to improve the classification result even after selecting only a small number of classification features. The work represents a successful attempt to determine audio feature importance and classify audio content into speech, music and environmental sound using a selected feature subset. The results show a high degree of accuracy when random forests are used for both feature importance ranking and audio content classification.
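
The ranking-then-truncating step described above can be sketched as follows. This is a minimal illustration that assumes the importance scores have already been obtained from a trained random forest; the feature names, scores, and function name are hypothetical and are not taken from the thesis.

```python
def select_top_features(importances, k):
    """Return the names of the k features with the highest importance scores."""
    if not 0 < k <= len(importances):
        raise ValueError("k must be between 1 and the number of features")
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores for five audio features.
scores = {
    "zero_crossing_rate": 0.21,
    "spectral_centroid": 0.34,
    "spectral_flux": 0.09,
    "short_time_energy": 0.27,
    "spectral_rolloff": 0.05,
}

print(select_top_features(scores, 3))
```

In the setting of the abstract, the same step would be applied with 47 candidate features and k set to 30 or 10, with the classifier then retrained on the reduced feature set.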

49

Aichele, Figueroa Diego Andrés. "Detección de anomalías en componentes mecánicos en base a Deep Learning y Random Cut Forests." Tesis, Universidad de Chile, 2019. http://repositorio.uchile.cl/handle/2250/170571.

Abstract:

Thesis submitted for the degree of Ingeniero Civil Mecánico (Mechanical Civil Engineer)
In maintenance, monitoring a piece of equipment can be very useful, since it reveals any anomaly in the equipment's internal operation so that a defect can be corrected before a more serious failure occurs. In data mining, anomaly detection is the task of identifying anomalous elements, that is, elements that differ from what is common within a dataset. Anomaly detection has applications in many domains; for example, banks currently use it to detect fraudulent purchases and possible scams through a user's behaviour pattern. Such uses involve large amounts of data, which makes the development of anomaly detection within probabilistic machine learning essential. A variety of algorithms have been developed to find anomalies; one of the best known is the Isolation Forest, which belongs to the decision tree family. Several works derived from the Isolation Forest algorithm propose improvements to it, such as the Robust Random Cut Forest, which improves the accuracy of anomaly detection and also makes it possible to study data dynamically and search for anomalies in real time. On the other hand, it has the disadvantage that the more attributes the datasets contain, the more computation time it needs to detect an anomaly. For this reason an attribute reduction method, also known as dimensionality reduction, is used, and its effect on effectiveness and efficiency is studied relative to the algorithm run without reducing the dimension of the data. This thesis analyses the Robust Random Cut Forest algorithm in order to propose a possible improvement to it. To test the algorithm, an experiment is carried out on steel bars, recording their vibrations when excited by white noise. These data are processed in three different scenarios: without dimensionality reduction, with principal component analysis, and with an autoencoder. The first scenario (no dimensionality reduction) serves as a baseline for seeing how scenarios two and three vary in anomaly detection, in terms of effectiveness and efficiency. The study is then carried out under the three scenarios to detect anomalous points. The results show that reducing the dimensions improves both computation time (efficiency) and accuracy (effectiveness) in finding an anomaly; the best results are obtained with principal component analysis.

50

Goodwin, Christopher C. H. "The Influence of Cost-sharing Programs on Southern Non-industrial Private Forests." Thesis, Virginia Tech, 2001. http://hdl.handle.net/10919/30895.

Full text

Abstract:

This study was undertaken in response to concerns that decreasing levels of funding for government tree-planting cost-share programs will result in significant reductions in non-industrial private tree-planting efforts in the South. The purpose of this study is to quantify how the funding of various cost-share programs and market signals interact and affect the level of private tree planting. The results indicate that the ACP, CRP, and Soil Bank programs have been more influential than the FIP, FRM, FSP, SIP, and state-run subsidy programs. Reductions in CRP funding will result in less tree planting, while it is not clear that funding reductions in FIP, or in other programs targeted toward reforestation after harvest, will have a negative impact on tree-planting levels.
Master of Science
