
Dissertations / Theses on the topic 'Random forest'



Consult the top 50 dissertations / theses for your research on the topic 'Random forest.'




1

Linusson, Henrik, Robin Rudenwall, and Andreas Olausson. "Random forest och glesa datarespresentationer." Thesis, Högskolan i Borås, Institutionen Handels- och IT-högskolan, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-16672.

Abstract:
In silico experimentation is the process of using computational and statistical models to predict medicinal properties in chemicals; as a means of reducing lab work and increasing success rate, this process has become an important part of modern drug development. There are various ways of representing molecules; the problem that motivated this paper derives from collecting substructures of the chemical into what is known as fractional representations. Assembling large sets of molecules represented in this way will result in sparse data, where a large portion of the set is null values. This consumes an excessive amount of computer memory, which limits the size of data sets that can be used when constructing predictive models. In this study, we suggest a set of criteria for evaluation of random forest implementations to be used for in silico predictive modeling on sparse data sets, with regard to computer memory usage, model construction time and predictive accuracy. A novel random forest system was implemented to meet the suggested criteria, and experiments were made to compare our implementation to existing machine learning algorithms to establish our implementation's correctness. Experimental results show that our random forest implementation can create accurate prediction models on sparse datasets, with lower memory usage overhead than implementations using a common matrix representation, and in less time than the existing random forest implementations it was evaluated against. We highlight design choices made to accommodate sparse data structures and data sets in the random forest ensemble technique, and therein present potential improvements to feature selection in sparse data sets.
Program: Systemarkitekturutbildningen
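The memory argument in the abstract above can be illustrated with a small sketch. This is not the authors' implementation; the helper names and the toy data are hypothetical, and the layout (one dictionary of non-zero cells per row) is just one common way to avoid storing null values.

```python
def to_sparse_rows(dense_rows):
    """Keep only the non-zero cells of each row as {column_index: value}."""
    return [
        {j: v for j, v in enumerate(row) if v != 0}
        for row in dense_rows
    ]


def stored_cells(sparse_rows):
    """Number of values a sparse layout actually has to hold."""
    return sum(len(row) for row in sparse_rows)


def feature_value(sparse_row, column):
    """Look up a feature; absent entries are implicit zeros."""
    return sparse_row.get(column, 0)


# Hypothetical toy data: 3 molecules x 8 substructure features, mostly zero.
dense = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 2, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
sparse = to_sparse_rows(dense)
dense_cells = sum(len(row) for row in dense)
```

A tree learner built on such rows only needs `feature_value` when it evaluates a split, so the zeros never have to be materialised.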
2

Karlsson, Isak. "Order in the random forest." Doctoral thesis, Stockholms universitet, Institutionen för data- och systemvetenskap, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-142052.

Abstract:
In many domains, repeated measurements are systematically collected to obtain the characteristics of objects or situations that evolve over time or other logical orderings. Although the classification of such data series shares many similarities with traditional multidimensional classification, inducing accurate machine learning models using traditional algorithms is typically infeasible since the order of the values must be considered. In this thesis, the challenges related to inducing predictive models from data series using a class of algorithms known as random forests are studied for the purpose of efficiently and effectively classifying (i) univariate, (ii) multivariate and (iii) heterogeneous data series either directly in their sequential form or indirectly as transformed to sparse and high-dimensional representations. In the thesis, methods are developed to address the challenges of (a) handling sparse and high-dimensional data, (b) data series classification and (c) early time series classification using random forests. The proposed algorithms are empirically evaluated in large-scale experiments and practically evaluated in the context of detecting adverse drug events. In the first part of the thesis, it is demonstrated that minor modifications to the random forest algorithm and the use of a random projection technique can improve the effectiveness of random forests when faced with discrete data series projected to sparse and high-dimensional representations. In the second part of the thesis, an algorithm for inducing random forests directly from univariate, multivariate and heterogeneous data series using phase-independent patterns is introduced and shown to be highly effective in terms of both computational and predictive performance. Then, leveraging the notion of phase-independent patterns, the random forest is extended to allow for early classification of time series and is shown to perform favorably when compared to alternatives.
The conclusions of the thesis not only reaffirm the empirical effectiveness of random forests for traditional multidimensional data but also indicate that the random forest framework can, with success, be extended to sequential data representations.
3

Siegel, Kathryn I. (Kathryn Iris). "Incremental random forest classifiers in spark." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/106105.

Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
Cataloged from PDF version of thesis.
Includes bibliographical references (page 53).
The random forest is a machine learning algorithm that has gained popularity due to its resistance to noise, good performance, and training efficiency. Random forests are typically constructed using a static dataset; to accommodate new data, random forests are usually regrown. This thesis presents two main strategies for updating random forests incrementally, rather than entirely rebuilding the forests. I implement these two strategies (incrementally growing existing trees and replacing old trees) in Spark Machine Learning (ML), a commonly used library for running ML algorithms in Spark. My implementation draws from existing methods in the online learning literature, but includes several novel refinements. I evaluate the two implementations, as well as a variety of hybrid strategies, by recording their error rates and training times on four different datasets. My benchmarks show that the optimal strategy for incremental growth depends on the batch size and the presence of concept drift in a data workload. I find that workloads with large batches should be classified using a strategy that favors tree regrowth, while workloads with small batches should be classified using a strategy that favors incremental growth of existing trees. Overall, the system demonstrates significant efficiency gains when compared to the standard method of regrowing the random forest.
by Kathryn I. Siegel.
M. Eng.
4

Cheng, Chuan. "Random forest training on reconfigurable hardware." Thesis, Imperial College London, 2015. http://hdl.handle.net/10044/1/28122.

Abstract:
Random Forest (RF) is one of the most widely used supervised learning methods available. An RF is an ensemble of decision tree classifiers with several injected sources of randomness. It demonstrates a set of improvements over single decision and regression trees and is comparable or superior to major classification tools such as the support vector machine (SVM) and adaptive boosting (AdaBoost) with respect to accuracy, interpretability, robustness and processing speed. RF can generally be divided into a training process and a prediction process. Recently, with the emergence of large-scale data mining applications, the RF training process implemented in software on a single computer can no longer induce a complex RF model within a reasonable amount of time. Alternative solutions involving computer clusters and GPUs usually come with disadvantages with respect to performance/power ratio and are not feasible for portable/embedded applications. In this work, a set of FPGA-based implementations of the RF training process is proposed. FPGA devices allow construction of efficient custom hardware architectures and feature lower power consumption than typical GPPs or GPUs, and are therefore suitable for portable/embedded applications. The proposed hardware training architectures take advantage of different types of inherent parallelism in the RF training algorithm and distribute the workload to a set of parallel workers. Combining the parallel processing techniques with custom hardware designs featuring low latency, the architectures are able to accelerate the training process without loss in accuracy.
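The description of an RF as an ensemble of trees with injected randomness can be sketched in a few lines. This is a deliberately minimal illustration and not the thesis's FPGA design: each "tree" is a one-split stump, the split threshold is simply the feature mean, and all function names and the toy data are assumptions made for the example.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature):
    """One-split tree on a single feature: threshold at the feature's mean."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    fallback = majority(labels)
    return (feature, threshold,
            majority(left) if left else fallback,
            majority(right) if right else fallback)


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Two sources of randomness: bootstrap rows and a random feature per tree."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feature = rng.randrange(n_features)          # random feature choice
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Aggregate the individual trees by majority vote."""
    return majority(predict_stump(s, row) for s in forest)


# Hypothetical toy data: two well-separated classes in two features.
rows = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.2], [1.0, 0.9], [0.9, 1.0], [0.8, 0.8]]
labels = ["a", "a", "a", "b", "b", "b"]
forest = fit_forest(rows, labels, n_trees=25, seed=0)
```

Each iteration of the loop in `fit_forest` is independent of the others, which is the kind of inherent parallelism the abstract refers to when it mentions distributing the workload to parallel workers.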
5

Nelson, Marc. "Evaluating Multitemporal Sentinel-2 data for Forest Mapping using Random Forest." Thesis, Stockholms universitet, Institutionen för naturgeografi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-146657.

Abstract:
The mapping of land cover using remotely sensed data is most effective when a robust classification method is employed. Random forest is a modern machine learning algorithm that has recently gained interest in the field of remote sensing due to its non-parametric nature, which may be better suited to handle complex, high-dimensional data than conventional techniques. In this study, the random forest method is applied to remote sensing data from the European Space Agency’s new Sentinel-2 satellite program, which was launched in 2015 yet remains relatively untested in scientific literature using non-simulated data. In a study site of boreo-nemoral forest in Ekerö municipality, Sweden, a classification is performed for six forest classes based on CadasterENV Sweden, a multi-purpose land cover mapping and change monitoring program. The performance of Sentinel-2’s Multi-Spectral Imager is investigated in the context of time series to capture phenological conditions, optimal band combinations, as well as the influence of sample size and ancillary inputs. Using two images from spring and summer of 2016, an overall map accuracy of 86.0% was achieved. The red edge, short wave infrared, and visible red bands were confirmed to be of high value. Important factors contributing to the result include the timing of image acquisition, use of a feature reduction approach to decrease the correlation between spectral channels, and the addition of ancillary data that combines topographic and edaphic information. The results suggest that random forest is an effective classification technique that is particularly well suited to high-dimensional remote sensing data.
6

Lak, Kameran Majeed Mohammed. "Retina-inspired random forest for semantic image labelling." Master's Degree Thesis, Università Ca' Foscari Venezia, 2015. http://hdl.handle.net/10579/5970.

Abstract:
One of the most challenging problems in the computer vision community is semantic image labeling, which requires assigning a semantic class to each pixel in an image. In the literature, this problem has been effectively addressed with Random Forest, i.e., a popular classification algorithm that delivers a prediction by averaging the outcome of an ensemble of random decision trees. In this thesis we propose a novel algorithm based on the Random Forest framework. Our main contribution is the introduction of a new family of decision functions (aka split functions), which build up the decision trees of the random forest. Our decision functions resemble the way the human retina works, by mimicking an increase in the receptive field sizes towards the periphery of the retina. This results in a better visual acuity in the proximity of the center of view (aka fovea), which gradually degrades as we move away from the center. The solution we propose improves the quality of the semantic image labelling, while preserving the low computational cost of the classical Random Forest approaches in both the training and inference phases. We conducted quantitative experiments on two standard datasets, namely eTRIMS Image Database and MSRCv2 Database, and the results we obtained are extremely encouraging.
7

Linusson, Henrik. "Multi-Output Random Forests." Thesis, Högskolan i Borås, Institutionen Handels- och IT-högskolan, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-17167.

Abstract:
The Random Forests ensemble predictor has proven to be well-suited for solving a multitude of different prediction problems. In this thesis, we propose an extension to the Random Forest framework that allows Random Forests to be constructed for multi-output decision problems with arbitrary combinations of classification and regression responses, with the goal of increasing predictive performance for such multi-output problems. We show that our method for combining decision tasks within the same decision tree reduces prediction error for most tasks compared to single-output decision trees based on the same node impurity metrics, and provide a comparison of different methods for combining such metrics.
Program: Magisterutbildning i informatik
8

Nygren, Rasmus. "Evaluation of hyperparameter optimization methods for Random Forest classifiers." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-301739.

Abstract:
In order to create a machine learning model, one is often tasked with selecting certain hyperparameters which configure the behavior of the model. The performance of the model can vary greatly depending on how these hyperparameters are selected, thus making it relevant to investigate the effects of hyperparameter optimization on the classification accuracy of a machine learning model. In this study, we train and evaluate a Random Forest classifier whose hyperparameters are set to default values and compare its classification accuracy to another classifier whose hyperparameters are obtained through the use of the hyperparameter optimization (HPO) methods Random Search, Bayesian Optimization and Particle Swarm Optimization. This is done on three different datasets, and each HPO method is evaluated based on the classification accuracy change it induces across the datasets. We found that every HPO method yielded a total classification accuracy increase of approximately 2-3% across all datasets compared to the accuracies obtained using the default hyperparameters. However, due to limitations of time, data and computational resources, no assertions can be made as to whether the observed positive effect is generalizable at a larger scale. Instead, we could conclude that the utility of HPO methods is dependent on the dataset at hand.
To create a machine learning model, one often needs to choose various hyperparameters that configure the model's properties. The performance of such a model depends strongly on the choice of these hyperparameters, which makes it relevant to investigate how hyperparameter optimization can affect the classification accuracy of a machine learning model. In this study, we train and evaluate a Random Forest classifier whose hyperparameters are set to particular default values and compare it with a classifier whose hyperparameters are determined by three different hyperparameter optimization (HPO) methods: Random Search, Bayesian Optimization and Particle Swarm Optimization. This is done on three different datasets, and each HPO method is evaluated based on the change in classification accuracy it brings about across these datasets. We found that every HPO method resulted in a total increase in classification accuracy of approximately 2-3% across all datasets compared with the accuracy the classifier obtained with the default hyperparameter values. Due to limitations of time and data, we could not establish whether the positive effect is generalizable at a larger scale. The conclusion that could be drawn instead was that the usefulness of hyperparameter optimization methods depends on the dataset to which they are applied.
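Of the three HPO methods named in this abstract, Random Search is the simplest to sketch. The version below is a minimal illustration under stated assumptions: the search space, the parameter names and `toy_score` are hypothetical, and `toy_score` merely stands in for the validation accuracy a real study would measure by training a classifier for each configuration.

```python
import random


def random_search(score, space, n_trials=50, seed=0):
    """Sample configurations uniformly from `space`; keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        s = score(config)
        if s > best_score:
            best_config, best_score = config, s
    return best_config, best_score


# Hypothetical search space; the names mirror common random forest settings.
space = {
    "n_trees": [10, 50, 100, 200],
    "max_depth": [2, 4, 8, 16],
    "min_samples_leaf": [1, 5, 10],
}


def toy_score(config):
    """Stand-in for validation accuracy; peaks at 100 trees, depth 8, leaf size 1."""
    return (1.0
            - abs(config["n_trees"] - 100) / 1000
            - abs(config["max_depth"] - 8) / 100
            - (config["min_samples_leaf"] - 1) / 100)


# Hypothetical "default" configuration to compare against.
default = {"n_trees": 10, "max_depth": 2, "min_samples_leaf": 1}
best, best_score = random_search(toy_score, space, n_trials=50, seed=0)
```

Comparing `toy_score(default)` with `best_score` mirrors, in miniature, the default-versus-optimized comparison the study describes.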
9

Lazic, Marko, and Felix Eder. "Using Random Forest model to predict image engagement rate." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-229932.

Abstract:
The purpose of this research is to investigate whether the Google Cloud Vision API combined with the Random Forest machine learning algorithm is advanced enough to build software that would evaluate how much an Instagram photo contributes to the image of a brand. The data set contains images scraped from the public Instagram feed filtered by #Nike, together with the metadata of each post. Each image was processed by the Google Cloud Vision API in order to obtain a set of descriptive labels for the content of the image. The data set was sent to the Random Forest algorithm in order to train the predictor. The results of the research show that the predictor can only guess the correct score in about 4% of cases. The results are not very accurate, which is mostly because of the limiting factors of the Google Cloud Vision API. The conclusion drawn is that it is not possible to create software that can accurately predict the engagement rate of an image with the technology that is publicly available today.
The purpose of this research is to investigate whether the Google Cloud Vision API combined with Random Forest machine learning algorithms is advanced enough to create software that can reliably evaluate how much an Instagram post can contribute to the image of a brand. The data set contains images retrieved from Instagram's public feed filtered by #Nike, together with the metadata of each post. Each image was processed by the Google Cloud Vision API to obtain a set of descriptive labels for the content of the image. The data set was sent to the Random Forest algorithm to train its model. The results of the study are not very accurate, which is mainly due to the limiting factors of the Google Cloud Vision API. The conclusion drawn is that it is not possible to reliably predict the quality of an image with the technology that is publicly available today.
10

Asritha, Kotha Sri Lakshmi Kamakshi. "Comparing Random forest and Kriging Methods for Surrogate Modeling." Thesis, Blekinge Tekniska Högskola, Fakulteten för datavetenskaper, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-20230.

Abstract:
The issue with conducting real experiments in design engineering is the cost of finding an optimal design that fulfills all design requirements and constraints. An alternative to real experiments used by engineers is computer-aided design modeling and computer-simulated experiments. These simulations are conducted to understand functional behavior and to predict possible failure modes in design concepts. However, these simulations may take minutes, hours or days to finish. To reduce the time consumption and the number of simulations required for design space exploration, surrogate modeling is used. The motive of surrogate modeling is to replace the original system by finding an approximation function of the simulations that is quickly computed. The process of surrogate model generation includes sample selection, model generation, and model evaluation. Using surrogate models in design engineering can help reduce design cycle times and cost by enabling rapid analysis of alternative designs. Selecting a suitable surrogate modeling method for a given function with specific requirements is possible by comparing different surrogate modeling methods. These methods can be compared using different application problems and evaluation metrics. In this thesis, we compare the random forest model and the kriging model based on prediction accuracy. The comparison is performed using mathematical test functions. This thesis conducted quantitative experiments to investigate the performance of the methods. The experimental analysis found that the kriging models have higher accuracy than random forests. Furthermore, the random forest models have shorter execution times than kriging for the studied mathematical test problems.
11

Williams, Alyssa. "Hybrid Recommender Systems via Spectral Learning and a Random Forest." Digital Commons @ East Tennessee State University, 2019. https://dc.etsu.edu/etd/3666.

Abstract:
We demonstrate spectral learning can be combined with a random forest classifier to produce a hybrid recommender system capable of incorporating meta information. Spectral learning is supervised learning in which data is in the form of one or more networks. Responses are predicted from features obtained from the eigenvector decomposition of matrix representations of the networks. Spectral learning is based on the highest weight eigenvectors of natural Markov chain representations. A random forest is an ensemble technique for supervised learning whose internal predictive model can be interpreted as a nearest neighbor network. A hybrid recommender can be constructed by first deriving a network model from a recommender's similarity matrix then applying spectral learning techniques to produce a new network model. The response learned by the new version of the recommender can be meta information. This leads to a system capable of incorporating meta data into recommendations.
12

Elfving, Jan, and Sebastian Kalucza. "Random Forest för överlevnadsanalys med konkurrerande utfall : Prediktion av demens." Thesis, Umeå universitet, Statistik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-184927.

Abstract:
Statistics as a field is constantly evolving. As the computing capacity of computers has steadily improved, more computationally intensive methods that were previously cumbersome to apply have become readily accessible. Random Forest is an example of such a method that has emerged from these conditions and has proven to work well on a range of statistical problems, prediction problems included. One such problem type is survival analysis. One way to make the survival model more realistic is to extend it to also take competing events into account. Competing events are events that compete with the main event being studied. By taking these competing events into account, more correct survival estimates can be made. In this study we aim to predict dementia with a Random Forest survival model that accounts for competing events (RF-SRC). The data on which the analysis is based come from the Betula study, a longitudinal study that aims to identify risk factors for dementia as well as early signs of dementia. The data contain a number of background variables and results from a number of memory tests that the participants were asked to perform. The main competing event in this case is that the studied participant dies. As a result of the dementia prediction we obtain an estimate of the relative importance of each explanatory variable. With the exception of the self-evident variable age at which an individual begins participating in the study, a prospective memory test ranks highest (prosp). Other important explanatory variables were two episodic memory tests (sptb, sptcrc), the gene variant apoE4 and a visuospatial memory test (block). In a comparison with traditional survival analysis in the form of Cox regression without and with competing events taken into account, we see that all continuous variables ranked highly in the RF-SRC model are significant in the Cox models.
However, the relative strength differs somewhat for the two categorical explanatory variables apoE4 and sex, which are generally valued more highly in the Cox models. That decision trees with a mix of categorical and continuous explanatory variables tend to underestimate categorical variables is supported by previous research. Regarding predictive ability, the RF-SRC model was compared with other relevant models using the C-index as the measure of comparison. The conclusion was that the RF-SRC model performed slightly worse than the traditional prediction model for survival analysis (Cox regression) on these data. Somewhat surprisingly, the RF-SRC model also performed slightly worse than a simpler Random Forest model that does not account for competing events, although this difference was small and may be due to chance.
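The C-index used as the comparison measure in this abstract can be sketched as follows. This is a plain version of Harrell's concordance index for right-censored data; it is an assumption that the thesis used a comparable definition, and the sketch does not treat competing events separately. The variable names and toy values are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs that the risk scores rank correctly."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had the event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1        # earlier event, higher predicted risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5      # tied predictions count as half
    return concordant / comparable


# Hypothetical toy data: follow-up times, event indicators (0 = censored), risks.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
good_risk = [4, 3, 2, 1]   # higher risk for earlier events
bad_risk = [1, 2, 3, 4]    # exactly the wrong ordering
c_good = concordance_index(times, events, good_risk)
c_bad = concordance_index(times, events, bad_risk)
```

A value of 0.5 corresponds to random ranking, which is why small differences between models, like those reported above, are hard to separate from chance.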
13

Adriansson, Nils, and Ingrid Mattsson. "Forecasting GDP Growth, or How Can Random Forests Improve Predictions in Economics?" Thesis, Uppsala universitet, Statistiska institutionen, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-243028.

Abstract:
GDP is used to measure the economic state of a country, and accurate forecasts of it are therefore important. Using the Economic Tendency Survey, we investigate forecasting quarterly GDP growth with the data mining technique Random Forest. Comparisons are made with a benchmark AR(1) model and an ad hoc linear model built on the most important variables suggested by the Random Forest. Evaluation by forecasting shows that the Random Forest makes the most accurate forecast, supporting the theory that there are benefits to using Random Forests on economic time series.
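The AR(1) benchmark mentioned in this abstract is a one-lag linear model, y[t] = c + phi * y[t-1]. A minimal sketch of fitting it by ordinary least squares and producing a one-step forecast is given below; the function names and the toy series are assumptions for illustration, not the authors' code or data.

```python
def fit_ar1(series):
    """Ordinary least squares fit of y[t] = c + phi * y[t-1]."""
    x = series[:-1]          # lagged values
    y = series[1:]           # current values
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi


def forecast_ar1(c, phi, last_value):
    """One-step-ahead forecast from the most recent observation."""
    return c + phi * last_value


# Hypothetical toy series generated exactly by y[t] = 1 + 0.5 * y[t-1].
growth = [0.0, 1.0, 1.5, 1.75, 1.875]
c, phi = fit_ar1(growth)
next_value = forecast_ar1(c, phi, growth[-1])
```

Any richer model, a random forest included, has to beat this kind of forecast out of sample to justify its extra complexity.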
14

Wonkye, Yaa Tawiah. "Innovations of random forests for longitudinal data." Bowling Green State University / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1563054152739397.

15

SILVA, J. P. M. "PROGNOSE DA PRODUÇÃO FLORESTAL UTILIZANDO SISTEMA NEURO-FUZZY E RANDOM FOREST." Universidade Federal do Espírito Santo, 2018. http://repositorio.ufes.br/handle/10/7680.

Abstract:
The objective of this study was to evaluate the use of the Random Forest (RF) and Neuro-Fuzzy System (SNF) techniques in forest production prognosis. The data came from continuous forest inventories conducted in eucalyptus clone stands located in southern Bahia. Data processing was carried out in the Matlab R2016a software. The data were split into 70% for training and 30% for validation. The algorithms used to generate rules in the SNF were Subtractive Clustering (SC) and Fuzzy-C-Means (FCM). Training was done with the hybrid algorithm (gradient descent and least squares) with the number of epochs ranging from 1 to 20. The membership functions associated with the input variables were Gaussian, and a linear function was used for the output. Several RFs were trained, varying the number of trees from 50 to 850 and the number of observations per leaf from 5 to 35. The forest production of clonal eucalyptus stands can be modeled with SNF and RF. The SC and FCM algorithms provide accurate estimates in the projection of basal area and volume. RF showed inferior statistics compared with SNF for forest production prognosis. Both techniques are good alternatives for selecting variables used in forest production modeling. Keywords: Artificial intelligence, ensemble learning, forest mensuration.
16

Kindbom, Hannes. "LSTM vs Random Forest for Binary Classification of Insurance Related Text." Thesis, KTH, Matematisk statistik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-252748.

Abstract:
The field of natural language processing has received increased attention lately, but less focus is put on comparing models that differ in complexity. This thesis compares Random Forest to LSTM for the task of classifying a message as question or non-question. The comparison was done by training and optimizing the models on historic chat data from the Swedish insurance company Hedvig. Different types of word embedding were also tested, such as Word2vec and Bag of Words. The results demonstrated that LSTM achieved slightly higher scores than Random Forest in terms of F1 and accuracy. The models’ performance was not significantly improved after optimization, and it was also dependent on which corpus the models were trained on. An investigation of how a chatbot would affect Hedvig’s adoption rate was also conducted, mainly by reviewing previous studies about chatbots’ effects on user experience. The potential effects on the innovation’s five attributes (relative advantage, compatibility, complexity, trialability and observability) were analyzed to answer the problem statement. The results showed that the adoption rate of Hedvig could be positively affected by improving the first two attributes. The effects a chatbot would have on complexity, trialability and observability were, however, suggested to be negligible, if not negative.
The scientific field of natural language processing has received increased attention recently, but less focus is directed at comparing models that differ in complexity. This bachelor's thesis compares Random Forest with LSTM by examining how well the models can be used to classify a message as question or non-question. The comparison was made by training and optimizing the models on historical chat data from the Swedish insurance company Hedvig. Different types of word embedding, such as Word2vec and Bag of Words, were also tested. The results showed that LSTM achieved slightly higher F1 and accuracy than Random Forest. The models' performance did not improve significantly after optimization, and the result also depended on which corpus the models were trained on. An investigation of how a chatbot would affect Hedvig's adoption rate was also carried out, mainly by reviewing previous studies on chatbots' effect on user experience. The potential effects on an innovation's five attributes (relative advantage, compatibility, complexity, trialability and observability) were analyzed in order to answer the research question. The results showed that Hedvig's adoption rate can be positively affected by improving the first two attributes. The effects a chatbot would have on complexity, trialability and observability were, however, considered negligible, if not negative.
17

Verica, Weverton Rodrigo. "Mapeamento semiautomático por meio de padrão espectro-temporal de áreas agrícolas e alvos permanentes com evi/modis no Paraná." Universidade Estadual do Oeste do Paraná, 2018. http://tede.unioeste.br/handle/tede/3916.

Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES
Knowledge of the location and extent of areas used for agriculture or for native or planted forests is relevant for public managers to base their decisions on reliable data. In addition, part of the ICMS revenue from the Municipal Participation Fund (FPM) depends on agricultural production data, the number of rural properties and the environmental factor. The objective of this research was to design an objective and semiautomatic methodology to map agricultural areas and permanent targets, and later to identify areas of soybean, first- and second-crop corn, winter crops, semi-perennial agriculture, forests and other permanent targets in the state of Paraná for the harvest years 2013/14 to 2016/17, using time series of EVI/Modis vegetation indices. The proposed methodology follows the steps of the Knowledge Discovery in Databases (KDD) process, in which the classification task was performed by the Random Forest algorithm. For the validation of the mappings, samples extracted from Landsat-8 images were used, yielding overall accuracy greater than 84.37% and a kappa index ranging from 0.63 to 0.98; the mappings are hence considered to have good or excellent spatial accuracy. The municipal areas of soybean, first-crop corn, second-crop corn and winter crops that were mapped were compared with the official statistics, giving linear correlation coefficients between 0.61 and 0.9, which indicates moderate or strong correlation with the official data. In this way, the proposed semiautomatic methodology was successful in the mapping as well as in automating the computation of the metrics, generating a script in the R software to facilitate future mappings with low processing time.
O conhecimento da localização e da quantidade de áreas destinadas a agricultura ou a florestas nativas ou plantadas é relevante para que os gestores públicos tomem suas decisões pautadas em dados fidedignos com a realidade. Além disto, parte das receitas de ICMS advindas do Fundo de Participação aos Municípios (FPM) depende de dados de produção agropecuária, número de propriedades rurais e fator ambiental. Diante disso, esta dissertação teve como objetivo elaborar uma metodologia objetiva e semiautomática para mapear áreas agrícolas e alvos permanente e posteriormente identificar áreas de soja, milho 1ª e 2ª safras, culturas de inverno, agricultura semi-perene, florestas e demais alvos permanentes no estado do Paraná para os anos-safra (2013/14 a 2016/17), utilizando séries temporais de índices de vegetação EVI/Modis. A metodologia proposta segue os passos do Processo de descoberta de conhecimento em base de dados – KDD, sendo que para isso foram elaboradas métricas extraídas do perfil espectro temporal de cada pixel e foi empregada a tarefa de classificação, realizada pelo algoritmo Random Forest. Para a validação dos mapeamentos utilizaram-se amostras extraídas de imagens Landsat-8, obtendo-se os índices de exatidão global maior que 84,37% e um índice kappa variando entre 0,63 e 0,98, sendo, portanto, considerados mapeamentos com boa ou excelente acurácia espacial. Os dados municipais da área de soja, milho 1ª safra, milho 2ª safra e culturas de inverno mapeada foram confrontados com as estatísticas oficiais obtendo-se coeficientes de correlação linear entre 0,61 a 0,9, indicando moderada ou forte correlação com os dados oficiais. Desse modo, a metodologia semiautomática proposta obteve êxito na realização do mapeamento, bem como a automatização do processo de elaboração das métricas, gerando, com isso um script no software R de maneira a facilitar mapeamentos futuros com baixo tempo de processamento.
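The two validation figures quoted in this abstract, overall accuracy and the kappa index, are both derived from a confusion matrix of validation samples. The sketch below shows the standard calculation in Python rather than the R used in the thesis. The function name and the two-class counts are hypothetical and are not taken from the thesis.

```python
def overall_accuracy_and_kappa(matrix):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    matrix[i][j] counts validation samples of reference class i mapped as class j.
    """
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(len(matrix))
    ) / (total * total)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical two-class validation counts (illustrative only).
example = [[45, 5],
           [10, 40]]
accuracy, kappa = overall_accuracy_and_kappa(example)
print(f"overall accuracy = {accuracy:.2f}, kappa = {kappa:.2f}")
```

Kappa discounts the agreement expected by chance, which is why it can be noticeably lower than the overall accuracy for the same map.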
18

Sjöqvist, Hugo. "Classifying Forest Cover type with cartographic variables via the Support Vector Machine, Naive Bayes and Random Forest classifiers." Thesis, Örebro universitet, Handelshögskolan vid Örebro Universitet, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-58384.

19

Strobl, Carolin, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. "Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution." Department of Statistics and Mathematics, WU Vienna University of Economics and Business, 2006. http://epub.wu.ac.at/1274/1/document.pdf.

Abstract:
Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale level or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types. Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale level or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analysing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research. (author's abstract)
Series: Research Report Series / Department of Statistics and Mathematics
20

Strobl, Carolin, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. "Bias in random forest variable importance measures: Illustrations, sources and a solution." BioMed Central Ltd, 2007. http://dx.doi.org/10.1186/1471-2105-8-25.

Abstract:
Background: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. Results: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. Conclusion: We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. 
The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research. (authors' abstract)
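The selection bias described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' method or their R code. It assumes a simplified multiway split (one child node per category) and a binary response, and it only shows that a predictor that is pure noise gains more apparent Gini reduction the more categories it has. All function names are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def multiway_gini_gain(feature, labels):
    """Impurity reduction when each category of `feature` forms its own child node."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return gini(labels) - weighted


def mean_gain_of_noise_predictor(n_categories, n_samples=100, repeats=200, seed=0):
    """Average Gini gain of a predictor that is unrelated to the response by construction."""
    rng = random.Random(seed)
    gains = []
    for _ in range(repeats):
        labels = [rng.randint(0, 1) for _ in range(n_samples)]
        feature = [rng.randrange(n_categories) for _ in range(n_samples)]
        gains.append(multiway_gini_gain(feature, labels))
    return sum(gains) / repeats


# Neither predictor carries any signal, yet the average gain grows with the category count.
for k in (2, 4, 10, 20):
    print(k, round(mean_gain_of_noise_predictor(k), 4))
```

Because a tree picks the split with the largest gain, an uninformative variable with many categories can be preferred over an informative one with few, which is the kind of artificial preference the abstract warns about.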
21

Alkazaz, Ayham, and Kharouki Marwa Saado. "Evaluation of Adaptive random forest algorithm for classification of evolving data stream." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-283114.

Abstract:
In the era of big data, online machine learning algorithms have gained increasing traction in both academia and industry. In many scenarios, decisions and predictions have to be made in near real time as data arrive from continuously evolving data streams. Offline learning algorithms fall short in several ways when handling such problems. Apart from the cost and difficulty of storing these data streams in storage clusters, and the computational burden of retraining the models each time new data are observed in order to keep them up to date, these methods have no built-in mechanisms for handling seasonality and non-stationary data streams. In such streams, the data distribution may change over time, a phenomenon known as concept drift. Adaptive random forests are well studied and effective for online learning on non-stationary data streams. By combining bagging with drift detection mechanisms, adaptive random forests aim to improve the accuracy and performance of traditional random forests for online learning. In this study, we analyze the predictive classification accuracy of adaptive random forests on different data streams and concept drifts. The data streams used to evaluate accuracy are SEA and Agrawal. Each data stream is tested in three concept drift configurations: gradual, sudden, and recurring. The benchmark results show that adaptive random forests achieve better accuracy on SEA than on Agrawal, which could be explained by the dimensionality and structure of the input attributes. Adaptive random forests showed no clear difference in accuracy between gradual and sudden concept drifts. However, recurring concept drifts yielded lower accuracy in the benchmarks than both their sudden and gradual counterparts. This could be a result of the higher frequency of concept drifts within the same time period (number of observed samples).
22

Tramontin, Davide <1992>. "Random forest implementation for classification analysis: default predictions applied to Italian companies." Master's Degree Thesis, Università Ca' Foscari Venezia, 2020. http://hdl.handle.net/10579/17720.

Abstract:
The growing importance of big data and the increasing complexity of the business environment have led to wider use of machine learning algorithms, given their ability to deal efficiently with entangled situations. This study contributes to the framework for applying random forests and other machine learning algorithms. Specifically, the topic of research is company failure and probability of default. The major impact that a firm's default has on businesses, markets, and societies underlines the importance of developing models that predict the probability of default. This research addresses the topic with two purposes: to create an accurate binary model that classifies companies as Defaulted or Non-Defaulted, and to identify the most important predictors in order to understand the links between the financial ratios considered and the companies' status. Random forests were chosen because of their ability to deal with large data sets and with numerous and diverse predictors. Building on a literature review of decision trees, random forests, company failure, and models that predict the probability of default, the analysis is constructed through several experiments that allow the model to be tuned appropriately and lead to the final model, which provides the highest accuracy. Through its cross-sectional analysis, this research confirms the strong stability and consistent performance of random forests. The final model performs well and identifies the coverage of fixed assets, gross profit, net working capital, cost of debt, debt-to-equity ratio, leverage, solvency ratio, and return on assets as the most important default predictors. Finally, the results and methods applied have been jointly used to extend the purpose of this research.
To allow further development of this study and of research on random forests and machine learning, R code that reproduces the computations is provided. Importantly, the designed function is applicable to any data set, so that other topics can be analyzed as well, and it provides a visual representation of the results through a Shiny app, which makes the results easier to interpret.
23

Auret, Lidia. "Process monitoring and fault diagnosis using random forests." Thesis, Stellenbosch : University of Stellenbosch, 2010. http://hdl.handle.net/10019.1/5360.

Abstract:
Thesis (PhD (Process Engineering))--University of Stellenbosch, 2010.
Dissertation presented for the Degree of DOCTOR OF PHILOSOPHY (Extractive Metallurgical Engineering) in the Department of Process Engineering at the University of Stellenbosch
Fault diagnosis is an important component of process monitoring, relevant in the greater context of developing safer, cleaner and more cost-efficient processes. Data-driven unsupervised (or feature extractive) approaches to fault diagnosis exploit the many measurements available on modern plants. Certain current unsupervised approaches are hampered by their linearity assumptions, motivating the investigation of nonlinear methods. The diversity of data structures also motivates the investigation of novel feature extraction methodologies in process monitoring. Random forests are recently proposed statistical inference tools, deriving their predictive accuracy from the nonlinear nature of their constituent decision tree members and the power of ensembles. Random forest committees provide more than just predictions; model information on data proximities can be exploited to provide random forest features. Variable importance measures show which variables are closely associated with a chosen response variable, while partial dependencies indicate the relation of important variables to said response variable. The purpose of this study was therefore to investigate the feasibility of a new unsupervised method based on random forests as a potentially viable contender in the process monitoring statistical tool family. The hypothesis investigated was that unsupervised process monitoring and fault diagnosis can be improved by using features extracted from data with random forests, with further interpretation of fault conditions aided by random forest tools. The experimental results presented in this work support this hypothesis. An initial study was performed to assess the quality of random forest features. Random forest features were shown to be generally difficult to interpret in terms of the geometry present in the original variable space.
Random forest mapping and demapping models were shown to be very accurate on training data, and to extrapolate weakly to unseen data that do not fall within regions populated by training data. Random forest feature extraction was applied to unsupervised fault diagnosis for steady-state process data, and compared to linear and nonlinear methods. Random forest results were comparable to existing techniques, with the majority of random forest detections due to variable reconstruction errors. Further investigation revealed that the residual detection success of random forests originates from the constrained responses and poor generalization artifacts of decision trees. Random forest variable importance measures and partial dependencies were incorporated in a visualization tool to allow for the interpretation of fault conditions. A dynamic change point detection application with random forests proved more successful than an existing principal component analysis-based approach, with the success of the random forest method again residing in reconstruction errors. The addition of random forest fault diagnosis and change point detection algorithms to a suite of abnormal event detection techniques is recommended. The distance-to-model diagnostic based on random forest mapping and demapping proved successful in this work, and the theoretical understanding gained supports the application of this method to further data sets.
24

Almer, Oscar Erik Gabriel. "Automated application-specific optimisation of interconnects in multi-core systems." Thesis, University of Edinburgh, 2012. http://hdl.handle.net/1842/7622.

Abstract:
In embedded computer systems there are often tasks, implemented as stand-alone devices, that are both application-specific and compute-intensive. A recurring problem in this area is to design these application-specific embedded systems as close to the power and efficiency envelope as possible. Work has been done on optimizing single-core systems and memory organisation, but current methods for achieving system design goals are proving limited as system capabilities and system size increase in the multi- and many-core era. To address this problem, this thesis investigates machine learning approaches to managing the design space presented in the interconnect design of embedded multi-core systems. The design space is large due to the system scale and level of interconnectivity, and it also features interdependent parameters, which further complicates analysis. The results presented in this thesis demonstrate that machine learning approaches, particularly wkNN and random forest, work well in handling the complexity of the design space. The benefits of this approach lie in automation, which saves time and effort in the system design phase as well as energy and execution time in the finished system.
25

Rörbrink, Malin. "Improving detection of promising unrefined protein docking complexes." Thesis, Linköpings universitet, Bioinformatik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-133633.

Abstract:
Understanding protein-protein interactions (PPIs) is important for understanding cellular processes. X-ray crystallography and mutagenesis, which are expensive in both time and resources, are the most reliable methods for detecting PPIs. Computational approaches could therefore reduce the resources and time spent on detecting PPIs. In this master's thesis, a method called cProQPred was created for scoring how realistic coarse PPI models are. cProQPred uses the machine learning method Random Forest, trained on previously calculated features from the programs ProQDock and InterPred. By combining some of ProQDock's features with the InterPred score, cProQPred achieved higher performance than both ProQDock and InterPred. This work also tried to predict the quality of a PPI model after refinement and the chance that a coarse PPI model would succeed at refinement. The results showed that the predicted quality of a coarse PPI model was also a relatively good prediction of the quality the model would reach after refinement. Predicting the chance that a coarse PPI model would succeed at refinement was, however, unsuccessful.
26

Valbi, Eleonora. "Analysis and forecasting of the structure of marine phytoplankton assemblages using innovative molecular techniques of NGS (Next Generation Sequencing) and Machine Learning." Doctoral thesis, Urbino, 2020. http://hdl.handle.net/11576/2673494.

27

Elghazel, Wiem. "Wireless sensor networks for Industrial health assessment based on a random forest approach." Thesis, Besançon, 2015. http://www.theses.fr/2015BESA2055/document.

Abstract:
Effective predictive maintenance depends on the reliability of the monitoring data. In some cases, the monitoring of industrial systems cannot be ensured with individual or wired sensors. Wireless sensor networks (WSNs) are then an alternative. Given the nature of wireless communication, data loss becomes highly probable. Therefore, we study certain aspects of WSN reliability. We propose a distributed algorithm for network resiliency and data survival while optimizing energy consumption. This fault-tolerant algorithm reduces the risk of data loss and ensures the continuity of data transfer. We also simulated different network topologies in order to evaluate their impact on data completeness at the sink node. Thereafter, we propose an approach to evaluate the system's state of health using the random forests algorithm. The approach has two phases, one offline and one online. In the offline phase, the random forest algorithm selects the parameters that hold the most information about the system's health state. These parameters are used to construct the decision trees that make up the forest. By injecting randomness into the training set, the algorithm gives the trees different starting points. In the online phase, the algorithm evaluates the current health state by using the sensor data to traverse the constructed trees. Each tree provides a decision, and the final class is the result of a majority vote over all trees. When sensors start to break down, the data describing a health indicator become incomplete or unavailable. Because the trees have different starting points, the absence of some data will not necessarily interrupt the prediction process.
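The online phase described above, in which each tree provides a decision and the final class is a majority vote, can be sketched in a few lines of Python. The abstract does not say how trees are treated when a health indicator is missing. Letting trees that need a failed sensor abstain is an assumption made here for illustration, and the sensor names, thresholds and hand-written "trees" are all hypothetical.

```python
from collections import Counter


def forest_predict(trees, reading):
    """Majority vote over the trees that can still be evaluated.

    `trees` is a list of (required_sensors, decide) pairs; `reading` maps
    sensor name -> value and may lack entries for failed sensors.
    """
    votes = []
    for required, decide in trees:
        # Assumption: a tree abstains when any sensor it needs is missing.
        if all(name in reading for name in required):
            votes.append(decide(reading))
    if not votes:
        return None  # no tree can be evaluated with the remaining sensors
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for trained trees; names and thresholds are illustrative only.
trees = [
    (("temperature",),
     lambda r: "degraded" if r["temperature"] > 80 else "healthy"),
    (("vibration",),
     lambda r: "degraded" if r["vibration"] > 0.5 else "healthy"),
    (("temperature", "pressure"),
     lambda r: "degraded" if r["temperature"] > 70 and r["pressure"] < 2 else "healthy"),
]

full = {"temperature": 85, "vibration": 0.2, "pressure": 1.5}
partial = {"temperature": 85, "pressure": 1.5}  # the vibration sensor has failed
print(forest_predict(trees, full), forest_predict(trees, partial))
```

Because the trees rely on different subsets of indicators, losing one sensor removes only some of the votes and the forest can still return a class.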
28

Dyer, Ross. "Predicting residential demand: applying random forest to predict housing demand in Cape Town." Master's thesis, University of Cape Town, 2018. http://hdl.handle.net/11427/29602.

Abstract:
The literature shows that Random Forest is a suitable technique to predict a target variable for a household with completely unseen characteristics. The models produced in this paper show that the characteristics of a household can be used to predict the Type of Dwelling, the Tenure and the Number of Bedrooms to varying degrees of accuracy. While none of the sets of models produced indicate a high degree of predictive accuracy relative to hurdle rates, the paper does demonstrate the value that the Random Forest technique offers in moving closer to an understanding of the complex nature of housing demand. A key finding is that the Census variables available for the models are not discriminatory enough to enable the high degree of accuracy expected from a predictive model.
29

Mussumeci, Elisa. "A machine learning approach to dengue forecasting: comparing LSTM, Random Forest and Lasso." Repositório Institucional do FGV, 2018. http://hdl.handle.net/10438/24093.

Abstract:
We used the Infodengue database of incidence and weather time series to train predictive models for the weekly number of dengue cases in 790 cities in Brazil. To overcome a limitation in the length of the time series available for training, we proposed using the time series of epidemiologically similar cities as predictors of the incidence in each city. Because machine learning-based forecasting models have been used in recent years with reasonable success, in this work we compare the forecasting performance of three machine learning models, Random Forest, Lasso and a long short-term memory (LSTM) neural network, for all cities monitored by the Infodengue Project.
30

Williams, Paige T. "Mapping Smallholder Forest Plantations in Andhra Pradesh, India using Multitemporal Harmonized Landsat Sentinel-2 S10 Data." Thesis, Virginia Tech, 2020. http://hdl.handle.net/10919/104234.

Abstract:
The objective of this study was to develop a method by which smallholder forest plantations can be mapped accurately in Andhra Pradesh, India using multitemporal (intra- and inter-annual) visible and near-infrared (VNIR) bands from the Sentinel-2 MultiSpectral Instruments (MSIs). Dependency on and scarcity of wood products have driven the deforestation and degradation of natural forests in Southeast Asia. At the same time, forest plantations have been established both within and outside of forests, with the latter (as contiguous blocks) being the focus of this study. The ecosystem services provided by natural forests are different from those of plantations. As such, being able to separate natural forests from plantations is important. Unfortunately, there are constraints to accurately mapping planted forests in Andhra Pradesh (and other similar landscapes in South and Southeast Asia) using remotely sensed data due to the plantations' small size (average 2 hectares), short rotation ages (often 4-7 years for timber species), and spectral similarities to croplands and natural forests. The East and West Godavari districts of Andhra Pradesh were selected as the area for a case study. Cloud-free Harmonized Landsat Sentinel-2 (HLS) S10 data was acquired over six dates, from different seasons, as follows: December 28, 2015; November 22, 2016; November 2, 2017; December 22, 2017; March 1, 2018; and June 15, 2018. Cloud-free satellite data are not available during the monsoon season (July to September) in this coastal region. In situ data on forest plantations, provided by collaborators, was supplemented with additional training data representing other land cover subclasses in the region: agriculture, water, aquaculture, mangrove, palm, forest plantation, ground, natural forest, shrub/scrub, sand, and urban, with a total sample size of 2,230. These high-quality samples were then aggregated into three land use classes: non-forest, natural forest, and forest plantations. 
Image classification used random forests within the Julia DecisionTree package on a thirty-band stack comprising the VNIR bands and NDVI images for all dates. The median classification accuracy from the 5-fold cross-validation was 94.3%. Our results, predicated on high-quality training data, demonstrate that (mostly smallholder) forest plantations can be separated from natural forests using only the Sentinel-2 VNIR bands when multitemporal data (across both years and seasons) are used.
The objective of this study was to develop a method by which smallholder forest plantations can be mapped accurately in Andhra Pradesh, India using multitemporal (intra- and inter-annual) visible (red, green, blue) and near-infrared (VNIR) bands from the European Space Agency satellite Sentinel-2. Dependency on and scarcity of wood products have driven the deforestation and degradation of natural forests in Southeast Asia. At the same time, forest plantations have been established both within and outside of forests, with the latter (as contiguous blocks) being the focus of this study. The ecosystem services provided by natural forests are different from those of plantations. As such, being able to separate natural forests from plantations is important. Unfortunately, there are constraints to accurately mapping planted forests in Andhra Pradesh (and other similar landscapes in South and Southeast Asia) using remotely sensed data due to the plantations' small size (average 2 hectares), short rotation ages (often 4-7 years for timber species), and spectral (reflectance from satellite imagery) similarities to croplands and natural forests. The East and West Godavari districts of Andhra Pradesh were selected as the area for a case study. Cloud-free Harmonized Landsat Sentinel-2 (HLS) S10 images were acquired over six dates, from different seasons, as follows: December 28, 2015; November 22, 2016; November 2, 2017; December 22, 2017; March 1, 2018; and June 15, 2018. Cloud-free satellite data are not available during the monsoon season (July to September) in this coastal region. In situ data on forest plantations, provided by collaborators, was supplemented with additional training data points (X and Y locations with land cover class) representing other land cover subclasses in the region: agriculture, water, aquaculture, mangrove, palm, forest plantation, ground, natural forest, shrub/scrub, sand, and urban, with a total of 2,230 training points. 
These high-quality samples were then aggregated into three land use classes: non-forest, natural forest, and forest plantations. Image classification used random forests within the Julia DecisionTree package on a thirty-band stack comprising the VNIR bands and NDVI images (NDVI is a greenness index; higher values indicate more vegetation) for all dates. The median classification accuracy from the 5-fold cross-validation was 94.3%. Our results, predicated on high-quality training data, demonstrate that (mostly smallholder) forest plantations can be separated from natural forests using only the Sentinel-2 VNIR bands when multitemporal data (across both years and seasons) are used.
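The thirty-band stack mentioned in this abstract is consistent with six acquisition dates times five layers per date (four VNIR bands plus NDVI). The following is a minimal Python sketch of that feature layout, not the thesis code: the band names, the zero-denominator convention, and the reflectance values are assumptions made for illustration only.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    total = nir + red
    if total == 0:
        return 0.0  # convention assumed here to avoid division by zero
    return (nir - red) / total


def pixel_features(dates):
    """Flatten per-date reflectances into one feature vector.

    `dates` is a list of dicts with the (assumed) keys
    'blue', 'green', 'red' and 'nir'.
    """
    features = []
    for obs in dates:
        features.extend([obs["blue"], obs["green"], obs["red"], obs["nir"]])
        features.append(ndvi(obs["nir"], obs["red"]))
    return features


# Made-up reflectances for one pixel, repeated for six dates.
six_dates = [{"blue": 0.04, "green": 0.06, "red": 0.05, "nir": 0.35}] * 6
vector = pixel_features(six_dates)
print(len(vector))  # 6 dates x (4 bands + NDVI) = 30 features
```

Each pixel's 30-element vector would then be one row of the training matrix passed to the random forest classifier.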
31

Abd, El Meguid Mostafa. "Unconstrained facial expression recognition in still images and video sequences using Random Forest classifiers." Thesis, McGill University, 2012. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=107692.

Abstract:
The aim of this project is to construct and implement a comprehensive facial expression detection and classification framework through the use of a proprietary face detector (PittPatt) and a novel classifier consisting of a set of Random Forests paired with either support vector machine or k-nearest neighbour labellers. The system should perform at real-time rates under unconstrained image conditions, with no intermediate human intervention. The still-image Binghamton University 3D Facial Expression database was used for training purposes, while a number of other expression-labelled video databases were used for testing. Quantitative evidence for qualitative and intuitive facial expression recognition constitutes the main theoretical contribution to the field.
32

Arnroth, Lukas, and Dennis Jonni Fiddler. "Supervised Learning Techniques : A comparison of the Random Forest and the Support Vector Machine." Thesis, Uppsala universitet, Statistiska institutionen, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-274768.

Abstract:
This thesis examines the performance of the support vector machine and the random forest models in the context of binary classification. The two techniques are compared and the outstanding one is used to construct a final parsimonious model. The data set consists of 33 observations and 89 biomarkers as features with no known dependent variable. The dependent variable is generated through k-means clustering, with a predefined final solution of two clusters. The training of the algorithms is performed using five-fold cross-validation repeated twenty times. The outcome of the training process reveals that the best performing versions of the models are a linear support vector machine and a random forest with six randomly selected features at each split. The final results of the comparison on the test set of these optimally tuned algorithms show that the random forest outperforms the linear kernel support vector machine. The former classifies all observations in the test set correctly whilst the latter classifies all but one correctly. Hence, a parsimonious random forest model using the top five features is constructed, which, to conclude, performs equally well on the test set compared to the original random forest model using all features.
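The resampling scheme described in this abstract (five-fold cross-validation repeated twenty times) can be sketched in plain Python as below. This is an illustrative sketch only: the function name, the seed, and the training-set size of 25 are placeholders, since the abstract does not state how the 33 observations were split.

```python
import random


def repeated_kfold(n_samples, k=5, repeats=20, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold CV repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle for every repetition
        folds = [order[i::k] for i in range(k)]  # k folds of near-equal size
        for i, val_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, val_idx


splits = list(repeated_kfold(n_samples=25))
print(len(splits))  # 5 folds x 20 repeats = 100 train/validation splits
```

Averaging a model's validation score over all 100 splits gives a more stable estimate than a single five-fold run, which matters with so few observations.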
33

Oliveira, Matheus Felipe. "Mapeamento digital de solos da quadrícula de Ribeirão Preto - SP pelo método Random Forest." Jaboticabal, 2015. http://hdl.handle.net/11449/154733.

Abstract:
Advisor: José Eduardo Corá
Committee: Célia Regina Paes Bueno
Committee: Waldir de Carvalho Junior
Committee: Antonio Sérgio Ferraudo
Abstract: This study aimed to develop a model that captures soil-landscape relationships in order to predict soil classes for the IBGE topographic sheets of Ribeirão Preto, Serrana, Cravinhos and Bonfim Paulista, which together constitute the Ribeirão Preto grid. To this end, we used information from a conventional semi-detailed soil map at 1:100,000 scale, a Digital Elevation Model (DEM) with a spatial resolution of 30 meters, and a geological map at 1:50,000 scale. Lithology was obtained from the geological map, and geomorphometric variables were derived from the DEM through geoprocessing techniques. All this information was combined in a matrix, from which three samples, stratified by class area, were drawn to provide training and test data for fitting Random Forest models and evaluating their accuracy. Different settings were tested, with models applied to classes at the second and third categorical levels. With a sample covering only 0.43% of the total area, the model for the second categorical level had an overall accuracy of 62.5%, and the digital soil map retained 70.63% of the classes of the original map; both values are higher than those for the third categorical level, which had an overall accuracy of 57.1% and a persistence of 44.24%. The most important variables for understanding the soil-landscape relationships were Lithology, Elevation, Slope, and Distance from the drainage network. The study showed that the method can contribute to the creation of soil maps quickly and at lower cost, can potentially be applied in areas with no pre-existing soil information, and can assist the work of soil scientists.
Master's degree
34

Oliveira, Matheus Felipe [UNESP]. "Mapeamento digital de solos da quadrícula de Ribeirão Preto - SP pelo método Random Forest." Universidade Estadual Paulista (UNESP), 2015. http://hdl.handle.net/11449/154733.

Abstract:
This study aimed to develop a model that captures soil-landscape relationships in order to predict soil classes for the IBGE topographic sheets of Ribeirão Preto, Serrana, Cravinhos and Bonfim Paulista, which together constitute the Ribeirão Preto grid. To this end, we used information from a conventional semi-detailed soil map at 1:100,000 scale, a Digital Elevation Model (DEM) with a spatial resolution of 30 meters, and a geological map at 1:50,000 scale. Lithology was obtained from the geological map, and geomorphometric variables were derived from the DEM through geoprocessing techniques. All this information was combined in a matrix, from which three samples, stratified by class area, were drawn to provide training and test data for fitting Random Forest models and evaluating their accuracy. Different settings were tested, with models applied to classes at the second and third categorical levels. With a sample covering only 0.43% of the total area, the model for the second categorical level had an overall accuracy of 62.5%, and the digital soil map retained 70.63% of the classes of the original map; both values are higher than those for the third categorical level, which had an overall accuracy of 57.1% and a persistence of 44.24%. The most important variables for understanding the soil-landscape relationships were Lithology, Elevation, Slope, and Distance from the drainage network. The study showed that the method can contribute to the creation of soil maps quickly and at lower cost, can potentially be applied in areas with no pre-existing soil information, and can assist the work of soil scientists.
35

Lento, Gabriel Carneiro. "Random forest em dados desbalanceados: uma aplicação na modelagem de churn em seguro saúde." Repositório Institucional do FGV, 2017. http://hdl.handle.net/10438/18256.

Abstract:
In this work we study churn in health insurance, that is, predicting which clients will cancel the product or service within a preset time frame. Traditionally, the probability that a client will cancel the service is modeled using logistic regression. Recently, modern machine learning techniques have become popular in churn modeling, having been applied in telecommunications, banking, and car insurance, among other areas. One of the big challenges in this problem is that only a fraction of all customers cancel the service, meaning that we have to deal with highly imbalanced classes. Under-sampling and over-sampling techniques have been used to overcome this issue. We use random forests, which are ensembles of decision trees, where each tree is fitted to a subsample of the data constructed using either under-sampling or over-sampling. We compare the distinct specifications of random forests using various metrics that are robust to imbalanced classes, both in-sample and out-of-sample. We observe that random forests using imbalanced random samples with fewer observations than the original series show better overall performance. Random forests also perform better than classical logistic regression, which is often used by health insurance companies to model churn.
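The two resampling ideas named in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis procedure: the function name `resample`, the fully balanced target ratio, and the toy data are all hypothetical, and the thesis itself also evaluates imbalanced subsamples.

```python
import random


def resample(rows, labels, minority, mode, seed=0):
    """Balance a binary data set by random under- or over-sampling."""
    rng = random.Random(seed)
    minor = [i for i, y in enumerate(labels) if y == minority]
    major = [i for i, y in enumerate(labels) if y != minority]
    if mode == "under":
        # Keep every minority row; draw an equal number of majority rows.
        major = rng.sample(major, len(minor))
    elif mode == "over":
        # Keep every majority row; draw minority rows with replacement.
        minor = [rng.choice(minor) for _ in major]
    else:
        raise ValueError("mode must be 'under' or 'over'")
    chosen = minor + major
    rng.shuffle(chosen)
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


# Toy data: 5 churners (label 1) among 100 customers.
labels = [1] * 5 + [0] * 95
rows = list(range(100))
_, under_y = resample(rows, labels, minority=1, mode="under")
_, over_y = resample(rows, labels, minority=1, mode="over")
print(len(under_y), sum(under_y))  # 10 5
print(len(over_y), sum(over_y))    # 190 95
```

In a forest, each tree would receive its own resample (for example, by varying `seed`), so the ensemble as a whole still sees most of the majority class.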
36

Wålinder, Andreas. "Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis." Thesis, Linnéuniversitetet, Institutionen för matematik (MA), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-35126.

Abstract:
Model selection is an important part of classification. In this thesis we study two classification models, logistic regression and random forest. They are compared and evaluated based on prediction accuracy and metadata analysis. The models were trained on 25 diverse datasets. We calculated the prediction accuracy of both models using RapidMiner. We also collected metadata for the datasets concerning the number of observations, the number of predictor variables, and the number of classes in the response variable. The performance of logistic regression and that of random forest are significantly correlated, with a correlation of 0.60 and confidence interval [0.29, 0.79]. The models appear to perform similarly across the datasets, with performance influenced more by the choice of dataset than by model selection. Random forest, with an average prediction accuracy of 81.66%, performed better on these datasets than logistic regression, with an average prediction accuracy of 73.07%. The difference is, however, not statistically significant, with a p-value of 0.088 for Student's t-test. Multiple linear regression analysis reveals that none of the analysed metadata have a significant linear relationship with logistic regression performance; the regression of logistic regression performance on metadata has a p-value of 0.66. We get similar results for random forest: the regression of random forest performance on metadata has a p-value of 0.89, so none of the analysed metadata have a significant linear relationship with random forest performance. We conclude that the prediction accuracies of logistic regression and random forest are correlated. Random forest performed slightly better on the studied datasets, but the difference is not statistically significant. The studied metadata do not appear to have a significant effect on the prediction accuracy of either model.
37

Oshiro, Thais Mayumi. "Uma abordagem para a construção de uma única árvore a partir de uma Random Forest para classificação de bases de expressão gênica." Universidade de São Paulo, 2013. http://www.teses.usp.br/teses/disponiveis/95/95131/tde-15102013-183234/.

Abstract:
Random Forest is a computationally efficient technique that can operate quickly over large datasets. It has been used in many recent research projects and real-world applications in several fields, including bioinformatics, since Random Forest can handle datasets with many attributes and few examples. However, it is difficult for human experts to understand. The research reported here aims to create a symbolic model, i.e., a single tree built from a Random Forest, for the classification of gene expression datasets. We thus hope to increase human experts' understanding of the process that classifies real-world examples while trying to maintain good performance. Initial results obtained with the proposed algorithm are promising: in some cases it performs better than another widely used algorithm (J48) and slightly worse than a Random Forest. Furthermore, the induced tree is, in general, smaller than the tree built by the J48 algorithm.
38

Halmann, Marju. "Email Mining Classifier : The empirical study on combining the topic modelling with Random Forest classification." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-14710.

Abstract:
Filtering and automatically replying to emails is of interest to many, but it is hard because of the complexity of language and the dependence on background information that is not present in the email itself. This paper investigates whether Latent Dirichlet Allocation (LDA) combined with a Random Forest classifier can be used for the more general email classification task and how it compares to other existing email classifiers. The comparison is based on a literature study and on empirical experimentation using two real-life datasets. First, a literature study is performed to gain insight into the accuracy of other available email classifiers. Second, the proposed model's accuracy is explored through experimentation. The literature study shows that the accuracy of more general email classifiers differs greatly across user sets. The proposed model's accuracy is within the reported accuracy range, but in its lower part, which indicates that the proposed model performs poorly compared to other classifiers. On average, the classifier's performance improves by 15 percentage points with additional information. This indicates that LDA combined with a Random Forest classifier is promising; however, future studies are needed to explore the model and ways to further increase its accuracy.
39

Lindroth, Leonard. "Parallelization of Online Random Forest." Thesis, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-21098.

Abstract:
Context. Random Forests (RFs) is a very popular machine learning algorithm for mining large-scale data. RFs is mainly known as an algorithm that operates in offline mode. However, in recent years implementations of online random forests (ORFs) have been introduced. With multicore processors, a successful implementation of parallelism may increase the performance of an algorithm relative to its sequential implementation. Objectives. In this paper we develop and investigate the performance of a parallel implementation of ORFs and compare the experimental results with its sequential counterpart. Methods. By using profiling tools on ORFs we located its bottlenecks, and with this knowledge we used the implementation/experiment methodology to develop parallel online random forests (PORFs). Evaluation is done by comparing the performance of ORFs and PORFs. Results. Experiments on common machine learning data sets show that PORFs achieve classification performance equal to that of our execution of ORFs. However, there is a difference in classification on some data sets when compared to results from another study. Furthermore, PORFs did not achieve any speedup compared to ORFs. In fact, with the added overhead from pthreads, PORFs take longer to finish than ORFs. Conclusions. We conclude that our parallelization of ORFs achieves the same classification performance as sequential ORFs. However, no speedup was achieved with our chosen approach to parallelism. Possible ways to achieve a speedup are presented and suggested as future work.
40

Pan, Pin-Zhong, and 潘品忠. "Human Action Recognition using Random Forest." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/88721662507029523371.

41

Chen, Shi-zhong, and 陳時仲. "Evaluating the Effectiveness of Random Forest Model." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/46358970356692465998.

Abstract:
Master's
National Chiao Tung University
Institute of Statistics
103
Random Forest is a popular machine learning algorithm. It is a decision tree model consisting of multiple trees. First, we generate a specified number of trees (e.g., 100); then we predict the final result by averaging the trees' results (for a continuous response) or by majority vote over them (for a categorical response). The R software package "randomForest" makes random forests very easy to use: once we choose the number of decision trees (ntree) and the number of variables considered for node branching (mtry), we can analyze the data with this model. On the real data (Chapter 3), its results are better than those of some statistical models. Moreover, the model can identify important variables. It is therefore a very complete and convenient model.
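The aggregation step described in this abstract (averaging for a continuous response, majority vote for a categorical one) can be sketched in a few lines of Python. The function name `aggregate` and the tie-breaking rule are assumptions made for this illustration; they are not taken from the thesis or from the R package.

```python
from collections import Counter
from statistics import mean


def aggregate(tree_outputs, task):
    """Combine the per-tree predictions for a single observation."""
    if task == "regression":
        return mean(tree_outputs)  # average for a continuous response
    if task == "classification":
        # Majority vote; on a tie, the label encountered first wins.
        return Counter(tree_outputs).most_common(1)[0][0]
    raise ValueError("task must be 'regression' or 'classification'")


print(aggregate([2.0, 3.0, 4.0], "regression"))                # 3.0
print(aggregate(["a", "b", "a", "a", "b"], "classification"))  # a
```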
42

Antonella, Mensi. "Advanced random forest approaches for outlier detection." Doctoral thesis, 2022. http://hdl.handle.net/11562/1067504.

Abstract:
Outlier Detection (OD) is a Pattern Recognition task which consists of finding those patterns in a set of data which are likely to have been generated by a different mechanism than the one underlying the rest of the data. The importance of OD is visible in everyday life. Indeed, fast and accurate detection of outliers is crucial: for example, in the electrocardiogram of a patient, an abnormality in the heart rhythm can cause severe health problems. Due to the high number of fields in which OD is needed, several approaches have been designed. Among them, Random Forest-based techniques have raised great interest in the research community: a Random Forest (RF) is an ensemble of Decision Trees where each tree is diverse and independent. They are characterized by a high degree of flexibility, robustness, and high generalization capabilities. Even though RFs were originally designed for classification and regression, their success has led in recent years to an increased development of RF-based approaches for other learning tasks, including OD. The forerunner of several RF methods for OD is Isolation Forest (iForest), a technique whose main principle is isolation, i.e. the separation of each object from the rest of the data. Since outliers are different from the rest of the data and thus easier to separate, we can easily identify them as those objects isolated after few splits in the tree. iForests have been employed in a great variety of application fields, showing excellent performance. This thesis fits into the above scenario: even if some extensions of basic RF-based approaches for OD have been proposed, their potential has not been fully exploited and there is ample room for improvement. In this thesis, we introduce some advanced RF-based techniques for OD, investigating both methodological issues and alternative uses of these flexible approaches. In detail, we moved along four research directions.
The starting point of the first one is the absence of RF methods for OD able to work with non-vectorial data: here we propose ProxIForest, an approach which works with all types of data for which a distance measure can be defined, thus including non-vectorial data as well. Indeed, for the latter, many powerful distances have been proposed. The second direction focuses on how to measure the outlierness degree of an object in an RF, i.e. the anomaly score, since most extensions of iForest concern only the tree building procedure. In detail, we propose two novel classes of methods: the first class exploits the information contained within a tree. The second one focuses on the ensemble aspect of RFs: the aggregation of the anomaly scores extracted from each tree is crucial to correctly identify outliers. For the third research direction, we took a different perspective, exploiting the fact that each tree in a forest is a space partitioner encoding relations, i.e. distances, between objects. Whereas this aspect has been widely researched in the clustering field, it has never been investigated for OD: we extract from an iForest a distance measure and input it to an outlier detector. As the last research direction, we designed a new variant of iForest to characterize multiple sclerosis given a brain connectivity network: we cast the problem as an OD task, by making an analogy between disconnected brain regions, the hallmark of the disease, and outliers. All proposals have been thoroughly empirically validated on either classical or ad hoc datasets: we performed several analyses, including comparisons to state-of-the-art approaches and statistical tests. This thesis proves the suitability of RF-based approaches for OD from different perspectives: not only can they be successfully used for the task, but they can also be used to extract distances or features.
Further, by contributing to this field, this thesis proves that there are still many aspects requiring further investigation.
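The isolation principle described in this abstract (outliers are separated from the rest of the data after few random splits) can be illustrated with a short Python sketch. This is a simplified illustration, not the thesis's method or the full iForest algorithm: it omits subsampling and the normalized anomaly score, and the function names, depth limit, and toy data are assumptions.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=20):
    """Number of random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))  # random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth  # simplification: stop if this feature cannot split
    cut = rng.uniform(lo, hi)  # random split value
    # Keep only the rows on the same side of the cut as `point`.
    same_side = [row for row in data if (row[dim] < cut) == (point[dim] < cut)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, seed=0):
    """Average isolation depth over `n_trees` random partitions."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Toy data: a tight 5 x 5 grid of inliers plus one distant point.
inliers = [(x * 0.1, y * 0.1) for x in range(5) for y in range(5)]
outlier = (5.0, 5.0)
data = inliers + [outlier]

print(mean_path_length(outlier, data))      # short average path
print(mean_path_length(inliers[12], data))  # longer average path
```

A shorter average path corresponds to a higher anomaly score; how the per-tree path lengths are turned into a score and aggregated is exactly the kind of design choice the thesis investigates.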
43

Dehury, Jitendra Pratap. "Random Forest-Based Intrusion Detection System (IDS)." Thesis, 2018. http://ethesis.nitrkl.ac.in/9737/1/2018_MT_216CS2154_JPDehury_Random.pdf.

Abstract:
Intrusion detection systems (IDSs) play an important role in network security because relying on existing security technology alone is unrealistic. Most IDSs are unable to detect intrusions because they are rule-based. In this thesis, the random forest algorithm is used for outlier detection of network patterns. There are three techniques for intrusion detection: misuse detection, anomaly detection, and hybrid detection. In this thesis, the AWID-cls-R data set is used for classification. The aim is to reduce the false positive rate and improve the performance of intrusion detection systems, which will help to prevent and monitor different types of attack.
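The false positive rate named as the target metric is conventionally the share of normal traffic that is wrongly flagged as an attack. A minimal sketch of that definition follows; the function name and the counts in the usage note are hypothetical and are not taken from the thesis.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """FPR = FP / (FP + TN): fraction of normal records flagged as intrusions."""
    normal_total = false_positives + true_negatives
    if normal_total == 0:
        raise ValueError("no normal records to evaluate")
    return false_positives / normal_total
```

For example, with 5 normal records flagged and 95 normal records passed, `false_positive_rate(5, 95)` gives a rate of 5 in 100.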
44

Brence, John R. "Analysis of robust measures for random forest regression." 2004. http://wwwlib.umi.com/dissertations/fullcit/3131453.

45

Lin, Pa-Hsun, and 林伯勳. "Fire and Smoke Detection Using Random Forest Algorithm." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/56813956821976269774.

Abstract:
Master's thesis
National Chi Nan University
Department of Computer Science and Information Engineering
101
As computing capabilities have advanced, sophisticated image processing and understanding methods have been developed, and the functions of intelligent video surveillance systems have been greatly extended. In this thesis, we develop a video-based fire and smoke detection system based on the random forest algorithm. We use the distinct color and image variation properties of fire and smoke to select candidate regions. Then, the texture and motion-pattern features of the candidate regions are analyzed to identify any fire or smoke region. We propose to extract both the texture and the motion-pattern features of fire and smoke with the local binary pattern (LBP) method. The random forest method is augmented to use the LBP features for fire and smoke detection, in order to reduce false positives and improve the detection rate.
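As an illustration of the LBP features mentioned above, the following sketch computes the basic 3x3 LBP code and a 256-bin histogram for a grayscale region. It shows only the textbook texture operator, assuming a region stored as a list of rows; how the thesis applies LBP to motion patterns or feeds the features to the forest is not specified in the abstract.

```python
# Neighbor offsets in clockwise order, starting at the top-left pixel.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                    (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """Basic 8-bit LBP code of an interior pixel: one bit per neighbor,
    set when the neighbor is at least as bright as the center."""
    center = image[row][col]
    code = 0
    for bit, (d_row, d_col) in enumerate(NEIGHBOR_OFFSETS):
        if image[row + d_row][col + d_col] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels; a vector like
    this could serve as a texture feature for a classifier."""
    histogram = [0] * 256
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            histogram[lbp_code(image, row, col)] += 1
    return histogram
```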
46

Chien, Chia-Chih, and 簡嘉志. "License plate recognition using the random forest algorithm." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/21636450711897378754.

Abstract:
Master's thesis
National Chi Nan University
Department of Computer Science and Information Engineering
101
In this thesis, we study the car license plate recognition (LPR) problem which consists of a license plate localization sub-problem and a license character recognition sub-problem. We develop a heuristic method to detect license plate candidates by using mathematical morphology operations to filter edge detection results. Character recognition is accomplished by using the random forest algorithm which is trained with a huge number of synthesized character images. Since the random forest algorithm is very efficient, we use an exhaustive search strategy to detect characters with a search window. The search window is swept over the candidate license plate area to recognize every character. Therefore, we do not need to segment the license plate characters and the recognition error induced by incorrect character segmentation can be avoided. For comparison, we also implement a license plate recognition method which uses the support vector machine. Experimental results show that random forest LPR outperformed the implemented support vector machine algorithm.
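The segmentation-free search described above can be sketched as a window swept across the candidate plate area, with a classifier applied at every position. This is a minimal sketch under assumptions: the plate is represented as a sequence of pixel columns, and `classify` is a hypothetical stand-in for the trained random forest that returns a label and a confidence.

```python
def sweep_characters(columns, window_width, classify, stride=1, min_confidence=0.5):
    """Slide a fixed-width window over the plate and keep confident detections.

    columns: the candidate plate area as a sequence of pixel columns.
    classify: hypothetical callable mapping a window to (label, confidence).
    Returns a list of (left_position, label, confidence) tuples.
    """
    detections = []
    for left in range(0, len(columns) - window_width + 1, stride):
        window = columns[left:left + window_width]
        label, confidence = classify(window)
        if confidence >= min_confidence:
            detections.append((left, label, confidence))
    return detections
```

Because every position is tested, no explicit character segmentation step is needed; the cost is one classifier call per window, which is why a fast classifier matters here.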
47

Liu, Meng-Hsin, and 劉孟鑫. "3D fingertip detection based on random decision forest." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/78861501850356031498.

Abstract:
Master's thesis
Chung Yuan Christian University
Graduate Institute of Computer Science and Information Engineering
103
Hand gestures are one of the most intuitive ways to interact with machines. However, traditional 2D hand gesture recognition is very sensitive to occlusions and changes in viewpoint. The 3D localization of fingertips and the palm can help hand gesture recognition under different viewpoints. In this study, we propose a new fingertip detection algorithm using a two-stage random decision forest (RDF). In the first stage, the local depth difference pattern (LDDP) and the 3D geodesic shortest path (GSP) are adopted for training a finger pixel classifier. Two spatial and temporal features are then added to the RDF to further distinguish fingertip pixels from finger pixels in the second stage. Finally, we use K-means clustering to re-identify fingertip candidates and limit the number of candidates to five. Our experimental results demonstrate that the proposed fingertip detection method is effective for complex gestures.
48

Wu, Feng-Jen, and 吳豐仁. "Optimal Operation Strategy of Chillers Using Random Forest." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/pe54fw.

Abstract:
Master's thesis
National Taipei University of Technology
Department of Energy and Refrigerating Air-Conditioning Engineering
106
According to the Energy Bureau of the Ministry of Economic Affairs, air-conditioning accounts for more than 40% of a building's total energy consumption, and the chiller plant accounts for about 50 to 60% of the energy consumption of the air-conditioning system. Reducing unnecessary chiller plant energy consumption and using energy effectively has therefore become an important and urgent research topic. Operating personnel of central air-conditioning systems have long determined the start-up combination of chillers from previous operating experience. All chillers need to be turned on under the large daytime load in summer, but at night and in other seasons the load is low, and it is up to the operator to judge the start-up combination and whether it achieves the best operating efficiency, with no reliable data to interpret and analyze. In this study, R was used with the Random Forests package to model the actual operating data of the chillers in the central air-conditioning system of a building in the northern region. After the model was established and its performance evaluated, the wet-bulb temperature range and approach temperature were set in order to analyze the optimal start-up combination and evaluate the subsequent operation strategy of the chillers. The wet-bulb temperature range was calculated for two load ranges, 200~400 RT and 2000~2200 RT. The energy-saving rate ranges from 3.40% (2000~2200 RT) to 20.62% (200~400 RT). The results demonstrate the importance of chiller start-up operation strategies. If this technology is widely adopted, it can provide operating personnel with practical operation strategies and can also reduce domestic energy use.
49

HUNG, CHENG-WEI, and 洪政緯. "Forecasting New Products Selling Level by Random Forest." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/4rxq85.

Abstract:
Master's thesis
National Chiao Tung University
Department of Industrial Engineering and Management
107
A common problem in the clothing industry is that products must be manufactured in advance and shipped to shops for sale. The underwear industry does not produce all of a product at once; instead, after a period of trial sales, the company subjectively decides whether to continue producing the product. A wrong reorder decision leads to high inventory and damages the company's overall operating interests. This study first sets out the purpose and motivation of the research, then explores decision tree and random forest models to establish an objective classification model that helps the case company forecast whether a new product will be a hot seller after the one-month trial sale period. A case study verified the feasibility of the model, which was ultimately adopted by the case company.
50

Joshi, Ajjen Das. "A random forest approach to segmenting and classifying gestures." Thesis, 2014. https://hdl.handle.net/2144/15405.

Abstract:
This thesis investigates a gesture segmentation and recognition scheme that employs a random forest classification model. A complete gesture recognition system should localize and classify each gesture from a given gesture vocabulary, within a continuous video stream. Thus, the system must determine the start and end points of each gesture in time, as well as accurately recognize the class label of each gesture. We propose a unified approach that performs the tasks of temporal segmentation and classification simultaneously. Our method trains a random forest classification model to recognize gestures from a given vocabulary, as presented in a training dataset of video plus 3D body joint locations, as well as out-of-vocabulary (non-gesture) instances. Given an input video stream, our trained model is applied to candidate gestures using sliding windows at multiple temporal scales. The class label with the highest classifier confidence is selected, and its corresponding scale is used to determine the segmentation boundaries in time. We evaluated our formulation in segmenting and recognizing gestures from two different benchmark datasets: the NATOPS dataset of 9,600 gesture instances from a vocabulary of 24 aircraft handling signals, and the CHALEARN dataset of 7,754 gesture instances from a vocabulary of 20 Italian communication gestures. The performance of our method compares favorably with state-of-the-art methods that employ Hidden Markov Models or Hidden Conditional Random Fields on the NATOPS dataset. We conclude with a discussion of the advantages of using our model.
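The selection rule in the abstract (score windows at several temporal scales, keep the most confident label, and let the winning scale fix the boundaries) can be sketched as below. The names are illustrative assumptions, and `score_window` is a hypothetical stand-in for the trained random forest applied to the frames in a window.

```python
def best_window_at(start, scales, num_frames, score_window):
    """Evaluate windows of several lengths beginning at `start` and return
    (start, end, label, confidence) for the most confident one, or None if
    no window fits inside the stream.

    scales: candidate window lengths in frames.
    score_window: hypothetical callable (start, end) -> (label, confidence).
    """
    best = None
    for length in scales:
        end = start + length
        if end > num_frames:
            continue  # this window would run past the end of the video
        label, confidence = score_window(start, end)
        if best is None or confidence > best[3]:
            best = (start, end, label, confidence)
    return best
```

The end frame of the winning window doubles as the segmentation boundary, which is how classification and temporal segmentation are obtained in a single pass.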