
Dissertations / Theses on the topic 'Random forest classification'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Random forest classification.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses across a wide variety of disciplines and organise your bibliography correctly.

1

Linusson, Henrik, Robin Rudenwall, and Andreas Olausson. "Random forest och glesa datarepresentationer." Thesis, Högskolan i Borås, Institutionen Handels- och IT-högskolan, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-16672.

Full text
Abstract:
In silico experimentation is the process of using computational and statistical models to predict medicinal properties in chemicals; as a means of reducing lab work and increasing success rates, this process has become an important part of modern drug development. There are various ways of representing molecules; the problem that motivated this paper derives from collecting substructures of the chemical into what are known as fractional representations. Assembling large sets of molecules represented in this way results in sparse data, where a large portion of the set is null values. This consumes an excessive amount of computer memory, which limits the size of the data sets that can be used when constructing predictive models. In this study, we suggest a set of criteria for evaluating random forest implementations to be used for in silico predictive modeling on sparse data sets, with regard to computer memory usage, model construction time and predictive accuracy. A novel random forest system was implemented to meet the suggested criteria, and experiments were made to compare our implementation to existing machine learning algorithms to establish our implementation's correctness. Experimental results show that our random forest implementation can create accurate prediction models on sparse data sets, with lower memory overhead than implementations using a common matrix representation, and in less time than the existing random forest implementations it was evaluated against. We highlight design choices made to accommodate sparse data structures and data sets in the random forest ensemble technique, and therein present potential improvements to feature selection in sparse data sets.
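The idea of training a random forest directly on a sparse representation can be sketched with off-the-shelf tools (this is an illustrative sketch, not the thesis' custom implementation): scikit-learn's `RandomForestClassifier` accepts SciPy CSR matrices, so sparse substructure fingerprints never need to be densified. The data and labels below are synthetic stand-ins.

```python
# Sketch: random forest on a sparse "fractional representation" matrix.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# 500 hypothetical "molecules" x 2000 substructure features, ~1% non-zero
X = sparse_random(500, 2000, density=0.01, format="csr", random_state=rng)
y = rng.randint(0, 2, size=500)  # invented active/inactive labels

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)          # trains without converting to a dense matrix
preds = clf.predict(X[:10])
```

Keeping the matrix in CSR form is what avoids the memory blow-up the abstract describes; a dense 500 x 2000 float array would store every null value explicitly.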
Program: Systemarkitekturutbildningen
APA, Harvard, Vancouver, ISO, and other styles
2

Nelson, Marc. "Evaluating Multitemporal Sentinel-2 data for Forest Mapping using Random Forest." Thesis, Stockholms universitet, Institutionen för naturgeografi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-146657.

Full text
Abstract:
The mapping of land cover using remotely sensed data is most effective when a robust classification method is employed. Random forest is a modern machine learning algorithm that has recently gained interest in the field of remote sensing due to its non-parametric nature, which may be better suited to handle complex, high-dimensional data than conventional techniques. In this study, the random forest method is applied to remote sensing data from the European Space Agency's new Sentinel-2 satellite program, which was launched in 2015 yet remains relatively untested in the scientific literature using non-simulated data. In a study site of boreo-nemoral forest in Ekerö municipality, Sweden, a classification is performed for six forest classes based on CadasterENV Sweden, a multi-purpose land cover mapping and change monitoring program. The performance of Sentinel-2's Multi-Spectral Imager is investigated in the context of time series to capture phenological conditions, optimal band combinations, as well as the influence of sample size and ancillary inputs. Using two images from spring and summer of 2016, an overall map accuracy of 86.0% was achieved. The red edge, short wave infrared, and visible red bands were confirmed to be of high value. Important factors contributing to the result include the timing of image acquisition, the use of a feature reduction approach to decrease the correlation between spectral channels, and the addition of ancillary data that combines topographic and edaphic information. The results suggest that random forest is an effective classification technique that is particularly well suited to high-dimensional remote sensing data.
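The band-importance analysis described above can be sketched with a random forest's impurity-based importances. Everything here is an assumption for illustration: the band names, the synthetic per-pixel reflectances, and the six-class labels stand in for the study's real multitemporal Sentinel-2 features.

```python
# Sketch: ranking multitemporal spectral features by RF importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(42)
bands = ["B04_red_spring", "B05_rededge_spring", "B11_swir_spring",
         "B04_red_summer", "B05_rededge_summer", "B11_swir_summer"]
X = rng.rand(300, len(bands))        # fake per-pixel reflectances
y = rng.randint(0, 6, size=300)      # six forest classes, as in the study

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# highest-importance features first
ranked = sorted(zip(clf.feature_importances_, bands), reverse=True)
```

On the real data, the red edge, SWIR, and visible red features would be expected near the top of `ranked`, matching the abstract's findings.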
3

Kindbom, Hannes. "LSTM vs Random Forest for Binary Classification of Insurance Related Text." Thesis, KTH, Matematisk statistik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-252748.

Full text
Abstract:
The field of natural language processing has received increased attention lately, but less focus is put on comparing models that differ in complexity. This thesis compares Random Forest to LSTM for the task of classifying a message as a question or non-question. The comparison was done by training and optimizing the models on historic chat data from the Swedish insurance company Hedvig. Different types of word embedding were also tested, such as Word2vec and Bag of Words. The results demonstrated that LSTM achieved slightly higher scores than Random Forest in terms of F1 and accuracy. The models' performance was not significantly improved by optimization, and it also depended on which corpus the models were trained on. An investigation of how a chatbot would affect Hedvig's adoption rate was also conducted, mainly by reviewing previous studies about chatbots' effects on user experience. The potential effects on the innovation's five attributes (relative advantage, compatibility, complexity, trialability and observability) were analyzed to answer the problem statement. The results showed that the adoption rate of Hedvig could be positively affected by improving the first two attributes. The effects a chatbot would have on complexity, trialability and observability were, however, suggested to be negligible, if not negative.
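The simpler side of the comparison (Bag of Words features feeding a Random Forest) can be sketched in a few lines; the LSTM side is omitted here. The toy messages and labels are invented, not Hedvig's chat data.

```python
# Sketch: Bag-of-Words + Random Forest for question vs non-question.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

messages = ["what does my insurance cover",
            "how do I file a claim",
            "thanks for the quick reply",
            "my address has changed"]
labels = [1, 1, 0, 0]  # 1 = question, 0 = non-question (invented)

vec = CountVectorizer()
X = vec.fit_transform(messages)        # sparse term-count matrix
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, labels)
pred = clf.predict(vec.transform(["when does my coverage start"]))
```

With so little data the prediction is not meaningful; the point is only the shape of the pipeline the thesis compares against an LSTM.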
4

Alkazaz, Ayham, and Kharouki Marwa Saado. "Evaluation of Adaptive random forest algorithm for classification of evolving data stream." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-283114.

Full text
Abstract:
In the era of big data, online machine learning algorithms have gained more and more traction from both academia and industry. In multiple scenarios, decisions and predictions have to be made in near real-time as data is observed from continuously evolving data streams. Offline learning algorithms fall short in different ways when it comes to handling such problems. Apart from the costs and difficulties of storing these data streams in storage clusters, and the computational difficulties associated with retraining the models each time new data is observed in order to keep the model up to date, these methods also lack built-in mechanisms to handle seasonality and non-stationary data streams. In such streams, the data distribution might change over time in what is called concept drift. Adaptive random forests are well studied and effective for online learning and non-stationary data streams. By using bagging and drift detection mechanisms, adaptive random forests aim to improve the accuracy and performance of traditional random forests for online learning. In this study, we analyze the predictive classification accuracy of adaptive random forests when used in conjunction with different data streams and concept drifts. The data streams used to evaluate the accuracy are SEA and Agrawal. Each data stream is tested in three different concept drift configurations: gradual, sudden, and recurring. The results obtained from the performed benchmarks show that adaptive random forests have better accuracy handling SEA than Agrawal, which may be explained by the dimensionality and structure of the input attributes. Adaptive random forests showed no clear difference in accuracy between gradual and sudden concept drifts. However, recurring concept drifts had lower accuracy in the benchmarks than both the sudden and the gradual counterparts. This could be a result of the higher frequency of concept drifts within the same time period (number of observed samples).
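The drift-handling idea can be illustrated without the full adaptive-random-forest machinery (which uses per-tree detectors such as ADWIN): the bare-bones sketch below simply flags drift when windowed accuracy drops below a threshold, which would trigger retraining on recent data. The class name, window size, and simulated stream are all assumptions for illustration.

```python
# Sketch: a minimal drift monitor, NOT the ARF algorithm itself.
from collections import deque

class DriftMonitor:
    """Flags concept drift when recent accuracy falls below a threshold."""
    def __init__(self, window=100, threshold=0.7):
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def update(self, correct):
        self.recent.append(1 if correct else 0)
        full = len(self.recent) == self.recent.maxlen
        return full and sum(self.recent) / len(self.recent) < self.threshold

monitor = DriftMonitor(window=50, threshold=0.7)
drift_points = []
for t in range(200):
    # simulate a sudden concept drift: accuracy collapses after t = 100
    correct = t < 100 or t % 3 == 0
    if monitor.update(correct):
        drift_points.append(t)
```

In an adaptive random forest, a signal like this starts a background tree that replaces the drifting one once it performs better.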
5

Linusson, Henrik. "Multi-Output Random Forests." Thesis, Högskolan i Borås, Institutionen Handels- och IT-högskolan, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-17167.

Full text
Abstract:
The Random Forests ensemble predictor has proven to be well-suited for solving a multitude of different prediction problems. In this thesis, we propose an extension to the Random Forest framework that allows Random Forests to be constructed for multi-output decision problems with arbitrary combinations of classification and regression responses, with the goal of increasing predictive performance for such multi-output problems. We show that our method for combining decision tasks within the same decision tree reduces prediction error for most tasks compared to single-output decision trees based on the same node impurity metrics, and provide a comparison of different methods for combining such metrics.
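For context, a sketch of what stock tooling already offers: scikit-learn's random forest handles multi-output targets natively when all outputs are of the same kind (here, two classification responses). The thesis' contribution, mixing classification and regression responses inside one tree via combined impurity metrics, goes beyond this API; the data below is synthetic.

```python
# Sketch: multi-output classification with a single random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
# two classification responses predicted jointly by the same forest
y = np.column_stack([rng.randint(0, 2, 200), rng.randint(0, 3, 200)])

clf = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)
preds = clf.predict(X[:5])   # one column per output task
```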
Program: Magisterutbildning i informatik
6

Röhss, Josefine. "A Statistical Framework for Classification of Tumor Type from microRNA Data." Thesis, KTH, Matematisk statistik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-191990.

Full text
Abstract:
Hepatocellular carcinoma (HCC) is a type of liver cancer with low survival rate, not least due to the difficulty of diagnosing it in an early stage. The objective of this thesis is to build a random forest classification method based on microRNA (and messenger RNA) expression profiles from patients with HCC. The main purpose is to be able to distinguish between tumor samples and normal samples by measuring the miRNA expression. If successful, this method can be used to detect HCC at an earlier stage and to design new therapeutics. The microRNAs and messenger RNAs which have a significant difference in expression between tumor samples and normal samples are selected for building random forest classification models. These models are then tested on paired samples of tumor and surrounding normal tissue from patients with HCC. The results show that the classification models built for classifying tumor and normal samples have high prediction accuracy and hence show high potential for using microRNA and messenger RNA expression levels for diagnosis of HCC.
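A pipeline of this shape (select differentially expressed features, then classify tumor vs normal with a random forest) can be sketched on synthetic data; the effect sizes, sample sizes, and significance cutoff below are assumptions, not the thesis' values.

```python
# Sketch: t-test feature selection + random forest on expression data.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(1)
n_feats = 100
tumor = rng.normal(0, 1, (30, n_feats))    # synthetic expression profiles
normal = rng.normal(0, 1, (30, n_feats))
tumor[:, :5] += 2.0                        # five truly differential "miRNAs"

_, pvals = ttest_ind(tumor, normal, axis=0)
selected = np.where(pvals < 0.01)[0]       # differentially expressed features

X = np.vstack([tumor, normal])[:, selected]
y = np.array([1] * 30 + [0] * 30)          # 1 = tumor, 0 = normal
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
acc = clf.score(X, y)                      # training accuracy only
```

A real study would evaluate on held-out paired samples, as the thesis does, rather than on the training set.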
7

Ringqvist, Sanna. "Classification of terrain using superpixel segmentation and supervised learning." Thesis, Linköpings universitet, Datorseende, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-112511.

Full text
Abstract:
The usage of 3D-modeling is expanding rapidly. Modeling from aerial imagery has become very popular due to its increasing number of both civilian and military applications, like urban planning, navigation and target acquisition. This master thesis project was carried out at Vricon Systems at SAAB. The Vricon system produces high resolution geospatial 3D data based on aerial imagery from manned aircraft, unmanned aerial vehicles (UAVs) and satellites. The aim of this work was to investigate to what degree superpixel segmentation and supervised learning can be applied to a terrain classification problem using imagery and digital surface models (DSMs). The aim was also to investigate how the height information from the digital surface model may contribute compared to the information from the grayscale values. The goal was to identify buildings, trees and ground. Another task was to evaluate existing methods and compare results. The approach for solving the stated goal was divided into several parts. The first part was to segment the image using superpixel segmentation; after that, features were extracted. Then the classifiers were created and trained, and finally the classifiers were evaluated. The classification method that obtained the best results in this thesis had approximately 90% correctly labeled superpixels. The result was equal, if not better, compared to other solutions available on the market.
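The segment-features-classify loop can be sketched in a deliberately simplified form: here square grid cells stand in for real superpixels, and per-cell mean intensity plus DSM height statistics feed a random forest over the classes building / tree / ground. The images, labels, and cell size are all synthetic assumptions.

```python
# Sketch: per-segment features (intensity + height) -> random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
gray = rng.rand(64, 64)          # synthetic grayscale orthophoto
dsm = rng.rand(64, 64) * 30      # synthetic surface heights in metres

def cell_features(img, height, size=8):
    """Mean intensity, mean height, height spread per grid cell."""
    feats = []
    for r in range(0, img.shape[0], size):
        for c in range(0, img.shape[1], size):
            block_i = img[r:r + size, c:c + size]
            block_h = height[r:r + size, c:c + size]
            feats.append([block_i.mean(), block_h.mean(), block_h.std()])
    return np.array(feats)

X = cell_features(gray, dsm)               # 64 cells x 3 features
y = rng.randint(0, 3, size=len(X))         # fake building/tree/ground labels
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
```

The thesis uses proper superpixel segmentation instead of a grid; the point here is only how height features sit alongside intensity features in the feature vector.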
8

Wålinder, Andreas. "Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis." Thesis, Linnéuniversitetet, Institutionen för matematik (MA), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-35126.

Full text
Abstract:
Model selection is an important part of classification. In this thesis we study two classification models, logistic regression and random forest. They are compared and evaluated based on prediction accuracy and metadata analysis. The models were trained on 25 diverse datasets. We calculated the prediction accuracy of both models using RapidMiner. We also collected metadata for the datasets concerning the number of observations, number of predictor variables and number of classes in the response variable.

The performance of logistic regression and random forest is correlated, with a significant correlation of 0.60 and confidence interval [0.29, 0.79]. The models appear to perform similarly across the datasets, with performance influenced more by the choice of dataset than by model selection.

Random forest, with an average prediction accuracy of 81.66%, performed better on these datasets than logistic regression, with an average prediction accuracy of 73.07%. The difference is, however, not statistically significant, with a p-value of 0.088 for Student's t-test.

Multiple linear regression analysis reveals that none of the analysed metadata have a significant linear relationship with logistic regression performance: the regression of logistic regression performance on metadata has a p-value of 0.66. We get similar results with random forest performance; the regression of random forest performance on metadata has a p-value of 0.89, and none of the analysed metadata have a significant linear relationship with random forest performance.

We conclude that the prediction accuracies of logistic regression and random forest are correlated. Random forest performed slightly better on the studied datasets, but the difference is not statistically significant. The studied metadata do not appear to have a significant effect on the prediction accuracy of either model.
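The statistical comparison protocol (correlation between the two models' accuracies across datasets, plus a paired test on the difference) can be sketched as follows. The accuracy vectors are invented; the thesis' 25 real datasets are not reproduced here.

```python
# Sketch: paired comparison of two classifiers across many datasets.
import numpy as np
from scipy.stats import pearsonr, ttest_rel

rng = np.random.RandomState(7)
base = rng.uniform(0.6, 0.95, 25)                       # "dataset difficulty"
acc_logreg = np.clip(base + rng.normal(0.00, 0.05, 25), 0, 1)
acc_rf = np.clip(base + rng.normal(0.03, 0.05, 25), 0, 1)

r, r_pval = pearsonr(acc_logreg, acc_rf)   # are the models correlated?
t, t_pval = ttest_rel(acc_rf, acc_logreg)  # is the mean difference significant?
```

A paired test is the right choice here because both models are evaluated on the same datasets, so the per-dataset difficulty cancels out of the difference.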
9

Pettersson, Anders. "High-Dimensional Classification Models with Applications to Email Targeting." Thesis, KTH, Matematisk statistik, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-168203.

Full text
Abstract:
Email communication is valuable for any modern company, since it offers an easy means of spreading important information or advertising new products, features, offers and much more. Being able to identify which customers would be interested in certain information would make it possible to significantly improve a company's email communication and thus avoid customers starting to ignore messages, which creates unnecessary badwill. This thesis focuses on targeting customers by applying statistical learning methods to historical data provided by the music streaming company Spotify. An important aspect was the high dimensionality of the data, creating certain demands on the applied methods. A binary classification model was created, where the target was whether a customer would open the email or not. Two approaches were used for targeting the customers: logistic regression, both with and without regularization, and a random forest classifier, chosen for their ability to handle the high dimensionality of the data. The prediction accuracy of the suggested models was then evaluated on both a training set and a test set using statistical validation methods, such as cross-validation, ROC curves and lift charts. The models were studied under both large-sample and high-dimensional scenarios. The high-dimensional scenario represents when the number of observations, N, is of the same order as the number of features, p, and the large-sample scenario represents when N ≫ p. Lasso-based variable selection was performed for both these scenarios, to study the informative value of the features. This study demonstrates that it is possible to greatly improve the opening rate of emails by targeting users, even in the high-dimensional scenario. The results show that increasing the amount of training data over a thousandfold will only improve the performance marginally. Rather, efficient customer targeting can be achieved by using a few highly informative variables selected by the Lasso regularization.
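Lasso-based variable selection in a high-dimensional (p > N is not required, p of the same order as N suffices) logistic regression can be sketched as follows. The data is synthetic, with only three truly informative features; the regularization strength C is an arbitrary choice for illustration, not Spotify's setting.

```python
# Sketch: L1-regularized logistic regression zeroes out most coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(3)
n, p = 200, 500                       # high-dimensional: p of the order of n
X = rng.randn(n, p)
# only the first three features drive the open/no-open outcome
y = (X[:, 0] + X[:, 1] - X[:, 2] + 0.3 * rng.randn(n) > 0).astype(int)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))   # surviving features
```

The L1 penalty drives most of the 500 coefficients exactly to zero, which is what lets a few highly informative variables do the targeting, as the abstract concludes.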
10

Halmann, Marju. "Email Mining Classifier : The empirical study on combining the topic modelling with Random Forest classification." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-14710.

Full text
Abstract:
Filtering out and replying automatically to emails is of interest to many but is hard due to the complexity of language and to dependencies on background information that is not present in the email itself. This paper investigates whether Latent Dirichlet Allocation (LDA) combined with a Random Forest classifier can be used for the more general email classification task, and how it compares to other existing email classifiers. The comparison is based on a literature study and on empirical experimentation using two real-life datasets. Firstly, a literature study is performed to gain insight into the accuracy of other available email classifiers. Secondly, the proposed model's accuracy is explored experimentally. The literature study shows that the accuracy of more general email classifiers differs greatly across user sets. The proposed model's accuracy is within the reported accuracy range, although in the lower part, indicating that it performs poorly compared to other classifiers. On average, the classifier's performance improves 15 percentage points with additional information. This indicates that Latent Dirichlet Allocation (LDA) combined with a Random Forest classifier is promising; however, future studies are needed to explore the model and ways to further increase its accuracy.
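The proposed combination can be sketched on toy documents: LDA reduces each email to a vector of topic proportions, which then becomes the feature vector for a random forest. The documents, labels, and topic count are invented for illustration.

```python
# Sketch: LDA topic proportions as features for a random forest.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier

docs = ["meeting agenda for monday", "lunch menu for monday",
        "agenda and minutes attached", "menu special today",
        "please review the agenda", "today's lunch special"]
labels = [1, 0, 1, 0, 1, 0]   # 1 = work email, 0 = other (invented)

X_counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
X_topics = lda.fit_transform(X_counts)   # per-document topic proportions
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_topics, labels)
```

The dimensionality reduction is the point: the forest sees 2 topic features instead of a full vocabulary, trading raw detail for a compact thematic representation.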
11

Verica, Weverton Rodrigo. "Mapeamento semiautomático por meio de padrão espectro-temporal de áreas agrícolas e alvos permanentes com evi/modis no Paraná." Universidade Estadual do Oeste do Paraná, 2018. http://tede.unioeste.br/handle/tede/3916.

Full text
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES
Knowledge of the location and quantity of areas devoted to agriculture or to native or planted forests is relevant for public managers to make decisions based on reliable data. In addition, part of the ICMS revenues from the Municipal Participation Fund (FPM) depends on agricultural production data, the number of rural properties and an environmental factor. The objective of this research was to design an objective, semiautomatic methodology to map agricultural areas and permanent targets, and subsequently to identify areas of soybean, corn (1st and 2nd crops), winter crops, semi-perennial agriculture, forests and other permanent targets in the state of Paraná for the harvest years 2013/14 to 2016/17, using temporal series of EVI/MODIS vegetation indexes. The proposed methodology follows the steps of the Knowledge Discovery in Databases (KDD) process: metrics were extracted from the spectro-temporal profile of each pixel, and the classification task was performed by the Random Forest algorithm. For the validation of the mappings, samples extracted from Landsat-8 images were used, obtaining global accuracy indices greater than 84.37% and a kappa index ranging from 0.63 to 0.98; the mappings are therefore considered to have good or excellent spatial accuracy. The municipal data for the mapped areas of soybean, corn 1st crop, corn 2nd crop and winter crops were compared with official statistics, obtaining linear correlation coefficients between 0.61 and 0.9, indicating moderate or strong correlation with the official data. The proposed semiautomatic methodology thus succeeded in the mapping, as well as in automating the metric-extraction process, generating a script in the R software that facilitates future mappings with low processing time.
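The per-pixel metric extraction can be sketched as follows (in Python rather than the dissertation's R, and with invented metric names and synthetic EVI curves): each pixel's vegetation-index time series is summarized into a few phenological features, which a random forest then classifies.

```python
# Sketch: spectro-temporal metrics from EVI series -> random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
t = np.arange(23)                                   # ~16-day MODIS composites
crop = 0.2 + 0.5 * np.exp(-((t - 11) ** 2) / 8.0)   # single green-up peak
forest = np.full(23, 0.6)                           # stable, high EVI

series = np.array([crop + rng.normal(0, 0.02, 23) for _ in range(50)] +
                  [forest + rng.normal(0, 0.02, 23) for _ in range(50)])
y = np.array([0] * 50 + [1] * 50)                   # 0 = annual crop, 1 = forest

def metrics(s):
    # illustrative metrics: peak, minimum, amplitude, timing of the peak
    return [s.max(), s.min(), s.max() - s.min(), int(s.argmax())]

X = np.array([metrics(s) for s in series])
clf = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)
acc = clf.score(X, y)
```

The amplitude metric alone separates annual crops (large seasonal swing) from forest (nearly flat EVI), which is why such summaries work well as classifier inputs.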
12

Andersson, Ricky. "Classification of Video Traffic : An Evaluation of Video Traffic Classification using Random Forests and Gradient Boosted Trees." Thesis, Karlstads universitet, Fakulteten för hälsa, natur- och teknikvetenskap (from 2013), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-55302.

Full text
Abstract:
Traffic classification is important for Internet providers and other organizations to solve critical network management problems. The most common methods for traffic classification are Deep Packet Inspection (DPI) and port-based classification. These methods are becoming obsolete as more and more traffic is encrypted and applications are starting to use dynamic ports or the ports of other popular applications. An alternative method for traffic classification uses Machine Learning (ML). This ML method uses statistical features of network traffic flows, which solves the fundamental problems of DPI and port-based classification for encrypted flows. The data used in this study is divided into video and non-video traffic flows, and the goal of the study is to create a model that can classify video flows accurately in real-time. Previous studies found tree-based algorithms to work well in classifying network traffic. In this study, random forests and gradient boosted trees are examined and compared, as they are two of the best performing tree-based classification models. Random forest was found to work best, as its classification speed was significantly faster than that of gradient boosted trees. Over 93% correctly classified flows were achieved while keeping the random forest model small enough to maintain fast classification speeds.
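The accuracy-versus-speed comparison can be sketched with scikit-learn's two tree ensembles; the flow statistics and the labeling rule below are synthetic stand-ins, not the study's dataset or its timing setup.

```python
# Sketch: compare RF and GBT on accuracy and prediction latency.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 10)                      # per-flow statistics (toy)
y = (X[:, 0] + X[:, 1] > 1).astype(int)     # 1 = video, 0 = non-video (toy rule)

results = {}
for name, model in [("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                    ("gbt", GradientBoostingClassifier(n_estimators=50, random_state=0))]:
    model.fit(X, y)
    start = time.perf_counter()
    model.predict(X)
    results[name] = (model.score(X, y), time.perf_counter() - start)
```

For real-time use it is the second element of each tuple, the prediction latency, that decides between two models of similar accuracy, which is the trade-off the thesis reports.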
HITS, 4707
13

Lou, Yuxiang, and Filip Matz. "Optimizing Product Assortments with Unknown Historical Transaction Data Using Nonparametric Choice Modeling and Random Forest Classification." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-261636.

Full text
Abstract:
Assortment optimization is a crucial problem for many firms that must decide which products to stock in their stores in order to maximize revenues. Optimizing assortments usually entails fitting choice models to historical data; to a large extent, this becomes a problem of understanding consumer behavior. In this paper, a two-step method is proposed for optimizing assortments for stores with no known sales data. First, training data was generated by optimizing assortments, using a nonparametric choice model, on similar stores for which data is available. Then this training data is used to develop a series of random forest models which, given parameters for a store, can generate an optimal assortment. The features used in the random forest models were chosen based on consumer behavior theory and consisted of geographical and financial features as well as features regarding the composition of the stores. The data used in this report was provided by a major Swedish print distributor with over 1000 stores and 2500 products. The results presented in this paper show that this method outperforms the baseline in all cases studied. Furthermore, it was determined that geographic features are the essential type of features for the models to determine the optimal assortments for stores.
APA, Harvard, Vancouver, ISO, and other styles
14

Williams, Paige T. "Mapping Smallholder Forest Plantations in Andhra Pradesh, India using Multitemporal Harmonized Landsat Sentinel-2 S10 Data." Thesis, Virginia Tech, 2020. http://hdl.handle.net/10919/104234.

Full text
Abstract:
The objective of this study was to develop a method by which smallholder forest plantations can be mapped accurately in Andhra Pradesh, India using multitemporal (intra- and inter-annual) visible and near-infrared (VNIR) bands from the Sentinel-2 MultiSpectral Instruments (MSIs). Dependency on and scarcity of wood products have driven the deforestation and degradation of natural forests in Southeast Asia. At the same time, forest plantations have been established both within and outside of forests, with the latter (as contiguous blocks) being the focus of this study. The ecosystem services provided by natural forests are different from those of plantations. As such, being able to separate natural forests from plantations is important. Unfortunately, there are constraints to accurately mapping planted forests in Andhra Pradesh (and other similar landscapes in South and Southeast Asia) using remotely sensed data due to the plantations' small size (average 2 hectares), short rotation ages (often 4-7 years for timber species), and spectral similarities to croplands and natural forests. The East and West Godavari districts of Andhra Pradesh were selected as the area for a case study. Cloud-free Harmonized Landsat Sentinel-2 (HLS) S10 data was acquired over six dates, from different seasons, as follows: December 28, 2015; November 22, 2016; November 2, 2017; December 22, 2017; March 1, 2018; and June 15, 2018. Cloud-free satellite data are not available during the monsoon season (July to September) in this coastal region. In situ data on forest plantations, provided by collaborators, was supplemented with additional training data representing other land cover subclasses in the region: agriculture, water, aquaculture, mangrove, palm, forest plantation, ground, natural forest, shrub/scrub, sand, and urban, with a total sample size of 2,230. These high-quality samples were then aggregated into three land use classes: non-forest, natural forest, and forest plantations. 
Image classification used random forests within the Julia DecisionTree package on a thirty-band stack that comprised the VNIR bands and NDVI images for all dates. The median classification accuracy from the 5-fold cross-validation was 94.3%. Our results, predicated on high-quality training data, demonstrate that (mostly smallholder) forest plantations can be separated from natural forests even using only the Sentinel-2 VNIR bands when multitemporal data (across both years and seasons) are used.
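As a rough sketch of the band-stack idea, in Python with scikit-learn rather than the study's Julia DecisionTree package, and with entirely synthetic reflectances: NDVI is computed from the red and near-infrared bands, stacked with the VNIR bands, and a random forest is scored with 5-fold cross-validation.

```python
# Illustrative sketch, not the study's pipeline: NDVI appended to toy
# VNIR bands, random forest evaluated by 5-fold cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_pixels = 600
# Invented reflectances for blue, green, red, NIR in [0.01, 0.6]
bands = rng.uniform(0.01, 0.6, size=(n_pixels, 4))
red, nir = bands[:, 2], bands[:, 3]
ndvi = (nir - red) / (nir + red)      # standard NDVI definition
X = np.column_stack([bands, ndvi])    # per-pixel feature stack
# Invented labels: "vegetated" where NDVI is high
y = (ndvi > 0.2).astype(int)

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("median CV accuracy:", np.median(scores))
```

The real stack used six dates of four VNIR bands plus six NDVI images (thirty bands); the sketch keeps one date to stay short.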
The objective of this study was to develop a method by which smallholder forest plantations can be mapped accurately in Andhra Pradesh, India using multitemporal (intra- and inter-annual) visible (red, green, blue) and near-infrared (VNIR) bands from the European Space Agency satellite Sentinel-2. Dependency on and scarcity of wood products have driven the deforestation and degradation of natural forests in Southeast Asia. At the same time, forest plantations have been established both within and outside of forests, with the latter (as contiguous blocks) being the focus of this study. The ecosystem services provided by natural forests are different from those of plantations. As such, being able to separate natural forests from plantations is important. Unfortunately, there are constraints to accurately mapping planted forests in Andhra Pradesh (and other similar landscapes in South and Southeast Asia) using remotely sensed data due to the plantations' small size (average 2 hectares), short rotation ages (often 4-7 years for timber species), and spectral (reflectance from satellite imagery) similarities to croplands and natural forests. The East and West Godavari districts of Andhra Pradesh were selected as the area for a case study. Cloud-free Harmonized Landsat Sentinel-2 (HLS) S10 images were acquired over six dates, from different seasons, as follows: December 28, 2015; November 22, 2016; November 2, 2017; December 22, 2017; March 1, 2018; and June 15, 2018. Cloud-free satellite data are not available during the monsoon season (July to September) in this coastal region. In situ data on forest plantations, provided by collaborators, was supplemented with additional training data points (X and Y locations with land cover class) representing other land cover subclasses in the region: agriculture, water, aquaculture, mangrove, palm, forest plantation, ground, natural forest, shrub/scrub, sand, and urban, with a total of 2,230 training points. 
These high-quality samples were then aggregated into three land use classes: non-forest, natural forest, and forest plantations. Image classification used random forests within the Julia DecisionTree package on a thirty-band stack that was comprised of the VNIR bands and NDVI (calculation related to greenness, i.e. higher value = more vegetation) images for all dates. The median classification accuracy from the 5-fold cross validation was 94.3%. Our results, predicated on high quality training data, demonstrate that (mostly smallholder) forest plantations can be separated from natural forests even using only the Sentinel 2 VNIR bands when multitemporal data (across both years and seasons) are used.
APA, Harvard, Vancouver, ISO, and other styles
15

Sakouvogui, Kekoura. "Comparative Classification of Prostate Cancer Data using the Support Vector Machine, Random Forest, Dualks and k-Nearest Neighbours." Thesis, North Dakota State University, 2015. https://hdl.handle.net/10365/27698.

Full text
Abstract:
This paper compares four classification tools, Support Vector Machine (SVM), Random Forest (RF), DualKS, and k-Nearest Neighbors (kNN), that are based on different statistical learning theories. The dataset used is a microarray gene expression set of 596 male patients with prostate cancer. After treatment, the patients were classified into one phenotype group with three levels: PSA (Prostate-Specific Antigen), Systematic, and NED (No Evidence of Disease). The purpose of this research is to determine the performance rate of each classifier by selecting the optimal kernels and parameters that give the best prediction rate of the phenotype. The paper begins with a discussion of previous implementations of the tools and their mathematical theories. The results showed that three classifiers achieved comparable, above-average performance, while DualKS did not. We also observed that SVM outperformed the kNN, RF, and DualKS classifiers.
APA, Harvard, Vancouver, ISO, and other styles
16

Maginnity, Joseph D. "Comparing the Uses and Classification Accuracy of Logistic and Random Forest Models on an Adolescent Tobacco Use Dataset." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1586997693789325.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Shockey, Melissa Dawn. "Incorporating Climate Sensitivity for Southern Pine Species into the Forest Vegetation Simulator." Thesis, Virginia Tech, 2013. http://hdl.handle.net/10919/22031.

Full text
Abstract:
Growing concerns over the possible effects of greenhouse-gas-related global warming on North American forests have led to increasing calls to address climate change effects on forest vegetation in management and planning applications. The objectives of this project are to model contemporary conditions of soils and climate associated with the presence or absence and abundance of five southern pine species: shortleaf pine (Pinus echinata Mill.), slash pine (P. elliottii Engelm.), longleaf pine (P. palustris Mill.), pond pine (P. serotina Michx.), and loblolly pine (P. taeda L.). Classification- and regression-based Random Forest models were developed for presence-absence and abundance data, respectively. Model diagnostics such as receiver operating characteristic (ROC) curves and variable importance plots were examined to assess model performance. Presence-absence classification models had out-of-bag error rates ranging from 6.32% to 16.06%, and areas under ROC curves ranging from 0.92 to 0.98. Regression models explained between 13.76% and 43.31% of variation in abundance values. Using the models based on contemporary data, predictions were made for the years 2030, 2060, and 2090 using four different greenhouse gas emissions scenarios and three different general circulation models. Maps of future climate scenarios showed a range of potential changes in the geographic extent of the conditions consistent with current presence observations. Results of this work will be incorporated into eastern U.S. variants of the Forest Vegetation Simulator (FVS) model, similar to work that has been done for FVS variants in the West.
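The two diagnostics the abstract names for the presence-absence models, out-of-bag error and the area under the ROC curve, can be illustrated with scikit-learn. The predictors and labels below are synthetic stand-ins, not the project's soil and climate data.

```python
# Sketch: random forest out-of-bag error and held-out ROC AUC for a
# toy presence/absence problem with invented "soil/climate" predictors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(800, 5))                       # toy predictors
y = (X[:, 0] - X[:, 3] + rng.normal(0, 0.5, 800) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(oob_score=True, random_state=0).fit(X_tr, y_tr)
oob_error = 1 - rf.oob_score_                       # out-of-bag error rate
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
print(f"OOB error: {oob_error:.3f}, ROC AUC: {auc:.3f}")
```

The out-of-bag estimate comes free with bagging: each tree is scored on the training samples it never saw, so no separate validation split is needed for it.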
Master of Science
APA, Harvard, Vancouver, ISO, and other styles
18

Arnroth, Lukas, and Dennis Jonni Fiddler. "Supervised Learning Techniques : A comparison of the Random Forest and the Support Vector Machine." Thesis, Uppsala universitet, Statistiska institutionen, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-274768.

Full text
Abstract:
This thesis examines the performance of the support vector machine and the random forest models in the context of binary classification. The two techniques are compared and the outstanding one is used to construct a final parsimonious model. The data set consists of 33 observations and 89 biomarkers as features with no known dependent variable. The dependent variable is generated through k-means clustering, with a predefined final solution of two clusters. The training of the algorithms is performed using five-fold cross-validation repeated twenty times. The outcome of the training process reveals that the best performing versions of the models are a linear support vector machine and a random forest with six randomly selected features at each split. The final results of the comparison on the test set of these optimally tuned algorithms show that the random forest outperforms the linear kernel support vector machine. The former classifies all observations in the test set correctly whilst the latter classifies all but one correctly. Hence, a parsimonious random forest model using the top five features is constructed, which, to conclude, performs equally well on the test set compared to the original random forest model using all features.
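A minimal sketch of the protocol described above, with synthetic data standing in for the 33 x 89 biomarker matrix: labels are generated by two-cluster k-means, and a linear SVM and a random forest are then compared under five-fold cross-validation repeated twenty times.

```python
# Hedged sketch of the thesis's protocol on invented data:
# k-means labels, then repeated stratified 5-fold CV for SVM vs. RF.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Two synthetic groups mimicking 33 observations x 89 biomarkers
X = np.vstack([rng.normal(0.0, 1.0, (16, 89)),
               rng.normal(1.5, 1.0, (17, 89))])
y = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
for clf in (SVC(kernel="linear"), RandomForestClassifier(random_state=0)):
    scores = cross_val_score(clf, X, y, cv=cv)
    print(type(clf).__name__, round(scores.mean(), 3))
```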
APA, Harvard, Vancouver, ISO, and other styles
19

Daines, Kyle. "Fall Risk Classification for People with Lower Extremity Amputations Using Machine Learning and Smartphone Sensor Features from a 6-Minute Walk Test." Thesis, Université d'Ottawa / University of Ottawa, 2020. http://hdl.handle.net/10393/40948.

Full text
Abstract:
Falls are a leading cause of injury and accidental-injury death worldwide. Fall-prevention techniques exist, but fall-risk identification can be difficult. While clinical assessment tools are the standard for identifying fall risk, wearable sensors and machine learning could improve outcomes with automated and efficient techniques. Machine learning research has focused on older adults; since people with lower limb amputations have greater falling and injury risk than the elderly, research is needed to evaluate these approaches with the amputee population. In this thesis, random forest and fully connected feedforward artificial neural network (ANN) machine learning models were developed and optimized for fall-risk identification in amputee populations, using smartphone sensor data (phone at the posterior pelvis) from 89 people with various levels of lower-limb amputation who completed a 6-minute walk test (6MWT). The best model was a random forest with 500 trees, using turn data and a feature set selected using correlation-based feature selection (81.3% accuracy, 57.2% sensitivity, 94.9% specificity, 0.59 Matthews correlation coefficient, 0.83 F1 score). After extensive ANN optimization with the 50 best-ranked features from an Extra Trees Classifier, the best ANN model achieved 69.7% accuracy, 53.1% sensitivity, 78.9% specificity, 0.33 Matthews correlation coefficient, and 0.62 F1 score. Features from a single smartphone during a 6MWT can be used with random forest machine learning for fall-risk classification in lower limb amputees. Model performance was as effective as or better than the Timed Up and Go and Four Square Step Test. This model could be used clinically to identify fall-risk individuals during a 6MWT, thereby identifying people who were not originally considered for fall screening.
Since model specificity was very high, the risk of accidentally misclassifying people who are not at fall risk is quite low, and few people would incorrectly be entered into fall mitigation programs based on the test outcomes.
APA, Harvard, Vancouver, ISO, and other styles
20

Fürderer, Niklas. "A Study of an Iterative User-Specific Human Activity Classification Approach." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-253802.

Full text
Abstract:
Applications for sensor-based human activity recognition use the latest algorithms for the detection and classification of human everyday activities, both for online and offline use cases. The insights generated by those algorithms can in a next step be used within a broad range of applications such as safety, fitness tracking, localization, personalized health advice, and improved child and elderly care. In order for an algorithm to perform well, a significant amount of annotated data from a specific target audience is required. However, a satisfying data collection process is cost- and labor-intensive. It may also be unfeasible for specific target groups, as aging affects motion patterns and behaviors. One main challenge in this application area lies in the ability to identify relevant changes over time while being able to reuse previously annotated user data. The accurate detection of those user-specific patterns and movement behaviors therefore requires individual and adaptive classification models for human activities. The goal of this degree work is to compare the performance of several supervised classifiers when trained and tested with the iterative user-specific human activity classification approach described in this report. A qualitative and quantitative data collection process was applied. The tree-based classification algorithms Decision Tree, Random Forest, and XGBoost were tested on custom datasets divided into three groups. The datasets contained labeled motion data of 21 volunteers from wrist-worn sensors. Computed across all datasets, the average performance measured in recall increased by 5.2% (using a simulated leave-one-subject-out cross-validation) for algorithms trained via the described approach compared to a random non-iterative approach.
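The leave-one-subject-out evaluation mentioned above can be expressed with scikit-learn's LeaveOneGroupOut splitter, which holds out all samples of one subject per fold. The subjects, features, and activity labels below are made up for illustration.

```python
# Sketch: leave-one-subject-out cross-validation, one held-out
# volunteer per fold (21 folds for 21 invented subjects).
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n_subjects, per_subject = 21, 30
groups = np.repeat(np.arange(n_subjects), per_subject)  # subject id per sample
X = rng.normal(size=(n_subjects * per_subject, 6))      # toy sensor features
y = (X[:, 0] > 0).astype(int)                           # toy activity label

scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=LeaveOneGroupOut(), groups=groups)
print(len(scores), "held-out subjects, mean accuracy:",
      round(scores.mean(), 3))
```

Grouping by subject prevents the optimistic bias that plain k-fold would introduce by letting a model see data from the same person in both training and test sets.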
APA, Harvard, Vancouver, ISO, and other styles
21

Jabali, Aghyad, and Husein Abdelkadir Mohammedbrhan. "Tyre sound classification with machine learning." Thesis, Högskolan i Gävle, Datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:hig:diva-36209.

Full text
Abstract:
Having enough data about the usage of tyre types on the road can lead to a better understanding of the consequences of studded tyres for the environment. This paper is focused on training and testing a machine learning model which can be further integrated into a larger system for automating the data collection process. Different machine learning algorithms, namely CNN, SVM, and Random Forest, were compared in this experiment. The method used in this paper is empirical. First, sound data for studded and non-studded tyres was collected from three different locations in the city of Gävle, Sweden. A total of 760 Mel spectrograms from both classes was generated to train and test a well-known CNN model (AlexNet) in MATLAB. Sound features for both classes were extracted using JAudio to train and test models that use SVM and Random Forest classifiers in Weka. Unnecessary features were removed one by one from the list of features to improve the performance of the classifiers. The results show that the CNN achieved an accuracy of 84%; SVM had the best performance both with and without removing some audio features (94% and 92%, respectively), while Random Forest reached 89% accuracy. The test data comprises 51% studded-class and 49% non-studded-class samples, and the SVM model achieved more than 94%. Therefore, it can be considered an acceptable result that can be used in practice.
APA, Harvard, Vancouver, ISO, and other styles
22

Li, Sichu. "Application of Machine Learning Techniques for Real-time Classification of Sensor Array Data." ScholarWorks@UNO, 2009. http://scholarworks.uno.edu/td/913.

Full text
Abstract:
There is a significant need to identify approaches for classifying chemical sensor array data with high success rates that would enhance sensor detection capabilities. The present study attempts to fill this need by investigating six machine learning methods to classify a dataset collected using a chemical sensor array: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Classification and Regression Trees (CART), Random Forest (RF), Naïve Bayes Classifier (NB), and Principal Component Regression (PCR). A total of 10 predictors that are associated with the response from 10 sensor channels are used to train and test the classifiers. A training dataset of 4 classes containing 136 samples is used to build the classifiers, and a dataset of 4 classes with 56 samples is used for testing. The results generated with the six different methods are compared and discussed. The RF, CART, and KNN are found to have success rates greater than 90%, and to outperform the other methods.
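A toy version of such a comparison can be set up with scikit-learn. The 10-channel "sensor responses" and four classes below are synthetic; the dataset sizes and success rates in the abstract are the study's own and are not reproduced here.

```python
# Sketch: compare KNN, CART (decision tree), and RF on invented
# 4-class, 10-channel sensor-array data, reporting test success rates.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
centers = 4.0 * np.eye(4, 10)      # one well-separated centroid per class
X = np.vstack([c + rng.normal(0, 0.5, (48, 10)) for c in centers])
y = np.repeat(np.arange(4), 48)    # class label per sample
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for clf in (KNeighborsClassifier(), DecisionTreeClassifier(random_state=0),
            RandomForestClassifier(random_state=0)):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, round(clf.score(X_te, y_te), 3))
```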
APA, Harvard, Vancouver, ISO, and other styles
23

Vraštiak, Pavel. "Hledání anomálií v DNS provozu." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-236506.

Full text
Abstract:
This master's thesis was written in collaboration with the NIC.CZ company. It describes the basic principles of the DNS system and the properties of DNS traffic. Its goal is the implementation of a DNS anomaly classifier and its evaluation in practice.
APA, Harvard, Vancouver, ISO, and other styles
24

Ankaräng, Marcus, and Jakob Kristiansson. "Comparison of Logistic Regression and an Explained Random Forest in the Domain of Creditworthiness Assessment." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-301907.

Full text
Abstract:
As the use of AI in society develops, the requirement of explainable algorithms has increased. A challenge with many modern machine learning algorithms is that, due to their often complex structures, they lack the ability to produce human-interpretable explanations. Research within explainable AI has resulted in methods that can be applied on top of non-interpretable models to motivate their decision bases. The aim of this thesis is to compare an unexplained machine learning model used in combination with an explanatory method, and a model that is explainable through its inherent structure. Random forest was the unexplained model in question and the explanatory method was SHAP. The explainable model was logistic regression, which is explanatory through its feature weights. The comparison was conducted within the area of creditworthiness and was based on predictive performance and explainability. Furthermore, the thesis uses these models to investigate what characterizes loan applicants who are likely to default. The comparison showed that neither model performed significantly better than the other in terms of predictive performance. Characteristics of bad loan applicants differed between the two algorithms. Three important aspects were the applicant's age, where they lived, and whether they had a residential phone. Regarding explainability, several advantages of SHAP were observed. With SHAP, explanations can be produced on both a local and a global level. SHAP also offers a way to take advantage of the high performance of many modern machine learning algorithms while fulfilling today's increased requirement of transparency.
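The contrast the abstract draws can be sketched without the SHAP library itself: logistic regression is explainable directly through its fitted weights, while a random forest needs an auxiliary attribution method. As a dependency-free stand-in for SHAP, the sketch uses the forest's impurity-based feature importances; the "creditworthiness" data is synthetic.

```python
# Hedged sketch: inherent explanations (LR coefficients) vs. an
# auxiliary importance measure for a random forest (impurity-based
# importances here, standing in for SHAP). Data are invented.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4))
# Toy "default" label driven mainly by feature 0
y = (X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.3, 500) > 0).astype(int)

lr = LogisticRegression().fit(X, y)
rf = RandomForestClassifier(random_state=0).fit(X, y)
print("LR weights:      ", np.round(lr.coef_[0], 2))
print("RF importances:  ", np.round(rf.feature_importances_, 2))
```

Both views should agree that feature 0 dominates; the difference is that the coefficients come for free from the model's structure, while the forest's attribution is bolted on afterwards.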
APA, Harvard, Vancouver, ISO, and other styles
25

Bouaziz, Ameni. "Méthodes d’apprentissage interactif pour la classification des messages courts." Thesis, Université Côte d'Azur (ComUE), 2017. http://www.theses.fr/2017AZUR4039/document.

Full text
Abstract:
Automatic short text classification is increasingly used nowadays in various applications like sentiment analysis or spam detection. Short texts like tweets or SMS are more challenging than traditional texts: their classification is more difficult owing to their shortness, sparsity, and lack of contextual information. We present two new approaches to improve short text classification. Our first approach is the "Semantic Forest". The first step of this approach proposes a new enrichment method that uses an external source of enrichment built in advance. The idea is to transform a short text from a few words into a larger text containing more information, in order to improve its quality before building the classification model. Contrary to the methods proposed in the literature, the second step of our approach does not use a traditional learning algorithm but proposes a new one based on the semantic links among words in the Random Forest classifier. Our second contribution is "IGLM" (Interactive Generic Learning Method). It is a new interactive approach that recursively updates the classification model by considering new data arriving over time and by leveraging user intervention to correct misclassified data. An abstraction method is then combined with the update mechanism to improve short text quality. The experiments performed on these two methods show their efficiency and how they outperform traditional algorithms in short text classification. Finally, the last part of the thesis presents a complete and argued comparative study of the two proposed methods, taking into account various criteria such as accuracy and speed.
APA, Harvard, Vancouver, ISO, and other styles
26

Yu, Jie. "Classification of Genotype and Age of Eyes Using RPE Cell Size and Shape." Digital Archive @ GSU, 2012. http://digitalarchive.gsu.edu/math_theses/118.

Full text
Abstract:
The retinal pigment epithelium (RPE) is a principal site of pathogenesis in age-related macular degeneration (AMD). AMD is a main source of vision loss, and even blindness, in the elderly, and there is currently no effective treatment. Our aim is to describe the relationship between the morphology of RPE cells and the age and genotype of the eyes. We use principal component analysis (PCA) or the functional principal component (FPCA) method, support vector machines (SVM), and random forest (RF) methods to analyze the morphological data of RPE cells in mouse eyes to classify their age and genotype. Our analyses show that among all morphometric measures of RPE cells, cell shape measurements (eccentricity and solidity) are good for classification, but the combination of cell shape and size (perimeter) provides the best classification.
APA, Harvard, Vancouver, ISO, and other styles
27

Hesping, Malena. "Remote sensing-based land cover classification and change detection using Sentinel-2 data and Random Forest : A case study of Rusinga Island, Kenya." Thesis, Linköpings universitet, Tema Miljöförändring, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166749.

Full text
Abstract:
Healthy forests and soils are crucial for the very existence of mankind, as they provide food, clean water and air, shade, and protection against floods and storms. With their photosynthetic carbon storage ability, they mitigate climate change and fertilise and stabilise soils. Unfortunately, deforestation and the loss of fertile soils are the bleak reality and among the world's most pressing challenges. Over the past decades Kenya has faced severe deforestation, but efforts are being undertaken to reverse deforestation, revegetate degraded land, and combat erosion. Satellite remote sensing technology is becoming increasingly useful for vegetation monitoring as data quality improves and costs decrease. This thesis explores the potential of free open-access Sentinel-2 data for vegetation monitoring through Random Forest land cover classification and post-classification change detection on Rusinga Island, Kenya. Different single-date and multi-temporal predictor datasets, differentiating between five and four classes respectively, were examined to develop the most suitable model. The classification achieved acceptable results when assessed on an independent test dataset (overall accuracy of 90.06% with five classes and 96.89% with four classes), which should however be confirmed on the ground and could potentially be improved with better reference data. In this study, change detection could only be analysed over a time frame of two years, which is too short to produce meaningful results. Nevertheless, the method was proven conceptually and could be applied in the future to monitor land cover changes on Rusinga Island.
APA, Harvard, Vancouver, ISO, and other styles
28

Örnbratt, Filip, Jonathan Isaksson, and Mario Willing. "A comparative study of social bot classification techniques." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-16994.

Full text
Abstract:
With social media rising in popularity over recent years, new so-called social bots are infiltrating platforms, spamming and manipulating people all over the world. Many different methods have been presented to solve this problem, with varying success. This study aims to compare some of these methods, on a dataset of Twitter account metadata, to provide helpful information to companies deciding how to solve this problem. Two machine learning algorithms and a human survey are compared on their ability to classify accounts. The algorithms used are the supervised algorithm random forest and the unsupervised algorithm k-means. Two ways of running these algorithms are also evaluated: the machine-learning-as-a-service platform BigML and the Python library scikit-learn. Additionally, the metadata features most valuable to the supervised algorithm and to the human survey are compared. Results show that supervised machine learning is the superior technique for social bot identification, with an accuracy of almost 99%. To conclude, the choice depends on the expertise of the company and on whether a relevant training dataset is available, but in most cases supervised machine learning is recommended.
APA, Harvard, Vancouver, ISO, and other styles
29

Xia, Junshi. "Multiple classifier systems for the classification of hyperspectral data." Thesis, Grenoble, 2014. http://www.theses.fr/2014GRENT047/document.

Full text
Abstract:
In this thesis, we propose several new techniques for the classification of hyperspectral remote sensing images based on multiple classifier systems (MCS). Our proposed framework introduces significant innovations with regard to previous approaches in the same field, many of which are based mainly on a single algorithm. First, we propose to use Rotation Forests with several linear feature extraction techniques and compare them with traditional ensemble approaches such as Bagging, Boosting, Random Subspace and Random Forest. Second, the integration of support vector machines (SVM) with the rotation subspace framework for context classification is investigated. SVM and rotation subspace are two powerful tools for high-dimensional data classification, so combining them can further improve classification performance. Third, we extend the work on Rotation Forests by incorporating a local feature extraction technique and spatial contextual information with a Markov random field (MRF) to design robust spatial-spectral methods. Finally, we present a new general framework, the random subspace (RS) ensemble, to train series of effective classifiers, including decision trees and extreme learning machines (ELM), with extended multi-attribute profiles (EMAPs) for classifying hyperspectral data. Six RS ensemble methods are constructed from these base learners: random subspace with decision trees (RSDT), Random Forest (RF), Rotation Forest (RoF), Rotation Random Forest (RoRF), RS with ELM (RSELM) and rotation subspace with ELM (RoELM). The effectiveness of the proposed techniques is illustrated by comparison with state-of-the-art methods on real hyperspectral data sets from different contexts.
APA, Harvard, Vancouver, ISO, and other styles
30

Braff, Pamela Hope. "Not All Biomass is Created Equal: An Assessment of Social and Biophysical Factors Constraining Wood Availability in Virginia." Thesis, Virginia Tech, 2014. http://hdl.handle.net/10919/63997.

Full text
Abstract:
Most estimates of wood supply do not reflect the true availability of wood resources. The availability of wood resources ultimately depends on collective wood harvesting decisions across the landscape. Both social and biophysical constraints impact harvesting decisions and thus the availability of wood resources. While most constraints do not completely inhibit harvesting, they may significantly reduce the probability of harvest. Realistic assessments of wood availability and distribution are needed for effective forest management and planning. This study focuses on predicting the probability of harvest at forested FIA plot locations in Virginia. Classification and regression tree, conditional inference tree, random forest, balanced random forest, conditional random forest, and logistic regression models were built to predict harvest as a function of social and biophysical availability constraints. All of the models were evaluated and compared to identify important variables constraining harvest, predict future harvests, and estimate the available wood supply. Variables related to population and resource quality appear to be the best predictors of future harvest. The balanced random forest and logistic regression models are recommended for predicting future harvests: the balanced random forest model is the best predictor, while the logistic regression model can be most easily shared and replicated. Both models were applied to predict harvest at recently measured FIA plots. Based on the probability of harvest, we estimate that between 2012 and 2017, 10-21 percent of the total wood volume on timberland will be available for harvesting.
Master of Science
APA, Harvard, Vancouver, ISO, and other styles
31

Axén, Maja, and Jennifer Karlberg. "Binary Classification for Predicting Customer Churn." Thesis, Umeå universitet, Institutionen för matematik och matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-171892.

Full text
Abstract:
Predicting when a customer is about to turn to a competitor can be difficult, yet extremely valuable from a business perspective. The moment a customer stops being considered a customer is known as churn, a widely researched topic in several industries when dealing with subscription services. However, in industries with non-subscription services and products, defining churn can be a daunting task, and the existing literature does not fully cover this field. This thesis can therefore be seen as a contribution to current research, especially for cases without a set definition of churn. A definition of churn, adjusted to DIAKRIT’s business, is created. DIAKRIT is a company working in the real estate industry, which faces many challenges, such as strong seasonality. The prediction was approached as a supervised learning problem, where three different machine learning methods were used: Logistic Regression, Random Forest and Support Vector Machine. The variables used in the predictions are predominantly activity data. With a relatively high accuracy and AUC score, Random Forest was concluded to be the most reliable model. It is however clear that the model cannot separate the classes perfectly. The Random Forest model also produces a relatively high precision, so it can be concluded that even though the model is not flawless, the customers it predicts to churn are very likely to churn.
APA, Harvard, Vancouver, ISO, and other styles
32

Dekrét, Lukáš. "Techniky klasifikace proteinů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2020. http://www.nusl.cz/ntk/nusl-417215.

Full text
Abstract:
The main goal of classifying proteins into families is to understand the structural, functional and evolutionary relationships between individual proteins, which are not easily deducible from the available data. Since the structure and function of proteins are closely related, function determination is based mainly on structural properties, which can be obtained relatively easily with current resources. Protein classification is also used in the development of specialised medicines, in the diagnosis of clinical diseases and in personalised healthcare, which attracts considerable investment. I created a new hierarchical tool for protein classification that achieves better results than some existing solutions. The implementation of the tool was preceded by a study of the properties of proteins, an examination of existing classification approaches, the creation of an extensive data set, experiments, and the selection of the final classifiers for the hierarchical tool.
APA, Harvard, Vancouver, ISO, and other styles
33

Säfström, Stella. "Predicting the Unobserved : A statistical analysis of missing data techniques for binary classification." Thesis, Uppsala universitet, Statistiska institutionen, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-388581.

Full text
Abstract:
The aim of the thesis is to investigate how the classification performance of random forest and logistic regression differs, given an imbalanced data set with MCAR missing data. Performance is measured in terms of accuracy and sensitivity. Two analyses are performed: one with a simulated data set and one application using data from the Swedish population registries. The simulation study is designed to have the same class imbalance of 1:5. The missing values are handled using three different techniques: complete case analysis, predictive mean matching and mean imputation. The thesis concludes that logistic regression and random forest are on average equally accurate, with some instances of random forest outperforming logistic regression. Logistic regression consistently outperforms random forest with regard to sensitivity. This implies that logistic regression may be the best option for studies where the goal is to accurately predict outcomes in the minority class. None of the missing data techniques stood out in terms of performance.
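The experimental setup this abstract describes can be sketched with scikit-learn: a 1:5 imbalanced binary data set with MCAR missingness, mean imputation, then accuracy and sensitivity for logistic regression versus random forest. The synthetic data and every parameter choice below are illustrative assumptions, not taken from the thesis.

```python
# Hedged sketch, not the thesis code: synthetic 1:5 imbalanced data with
# MCAR missingness, mean imputation, and a comparison of logistic
# regression vs. random forest on accuracy and sensitivity (recall).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, weights=[5 / 6],
                           random_state=0)  # roughly 1:5 class imbalance

# MCAR: every cell has the same 10% chance of being missing,
# independent of both observed and unobserved values.
mask = rng.random(X.shape) < 0.10
X_miss = X.copy()
X_miss[mask] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X_miss, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Mean imputation fitted on the training split only.
imp = SimpleImputer(strategy="mean").fit(X_tr)
X_tr, X_te = imp.transform(X_tr), imp.transform(X_te)

results = {}
for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = (accuracy_score(y_te, pred),
                     recall_score(y_te, pred))  # recall = sensitivity

for name, (acc, sens) in results.items():
    print(f"{name}: accuracy={acc:.3f} sensitivity={sens:.3f}")
```

Only the mean-imputation arm is shown; predictive mean matching has no direct scikit-learn equivalent.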
APA, Harvard, Vancouver, ISO, and other styles
34

Amlathe, Prakhar. "Standard Machine Learning Techniques in Audio Beehive Monitoring: Classification of Audio Samples with Logistic Regression, K-Nearest Neighbor, Random Forest and Support Vector Machine." DigitalCommons@USU, 2018. https://digitalcommons.usu.edu/etd/7050.

Full text
Abstract:
Honeybees are one of the most important pollinating species in agriculture: three out of every four crops rely on the honeybee as their sole pollinator. Since 2006 there has been a drastic decrease in the bee population, which is attributed to Colony Collapse Disorder (CCD). Bee colonies fail or die without showing any traditional health symptoms that could otherwise alert beekeepers in advance. An electronic beehive monitoring system has various sensors embedded in it to extract video, audio and temperature data that can provide critical information on colony behavior and health without invasive beehive inspections. Previously, significant patterns and information have been extracted by processing the video and image data, but no work has been done using audio data. This research takes the first step towards the use of audio data in the Electronic Beehive Monitoring System (BeePi) by enabling a path towards the automatic classification of audio samples into different classes and categories. The experimental results give initial support to the claim that monitoring bee buzzing signals from the hive is feasible, that it can be a good indicator of hive health, and that it can help to differentiate normal honeybee behavior from deviations.
APA, Harvard, Vancouver, ISO, and other styles
35

Benacchio, Véronique. "Etude par imagerie in situ des processus biophysiques en milieu fluvial : éléments méthodologiques et applications." Thesis, Lyon, 2017. http://www.theses.fr/2017LYSE2056/document.

Full text
Abstract:
Remote sensing is increasingly used in river sciences, mainly through satellite and airborne imagery. Ground imagery constitutes a complementary tool which presents numerous advantages for the study of rivers: for example, it is easy to set up, costs are limited, and it allows an oblique viewing angle. It also makes it possible to trigger acquisitions at very high frequency, ranging, for instance, from a few seconds to a few hours. The possibility to monitor events at the instant they occur makes ground imagery extremely advantageous compared to aerial or satellite imagery (whose highest acquisition frequency corresponds to a few days). Such frequencies produce huge datasets, which require automated analyses; this is one of the challenges addressed in this thesis. Processing and analysis of data acquired at five study sites located in France and Québec, Canada, facilitated the evaluation of ground imagery's potential, as well as its limitations, with respect to the study of fluvial systems. The identification of optimal conditions to set up the cameras and acquire images is the first step of a global approach, presented as a chain of optional modules, each to be taken into account according to the objectives of the study. The extraction of radiometric information and the subsequent statistical analysis of the signal were tested in several situations. In particular, random forests were applied as a supervised object-oriented classification method. The datasets were principally exploited using high-frequency time series analyses, which demonstrated the strengths and weaknesses of this approach, as well as some potential applications. Ground imagery is a powerful tool to monitor fluvial systems, as it facilitates the characterisation of the various temporal scales of fluvial biophysical processes. However, it is necessary to optimise the quality of the data produced; in particular, the acquisition angle should be minimised and the variability of luminosity conditions between shots limited in order to acquire fully exploitable datasets.
APA, Harvard, Vancouver, ISO, and other styles
36

Ankaräng, Fredrik, and Fabian Waldner. "Evaluating Random Forest and a Long Short-Term Memory in Classifying a Given Sentence as a Question or Non-Question." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-262209.

Full text
Abstract:
Natural language processing and text classification are topics of much discussion among machine learning researchers. Contributions in the form of new methods and models are presented on a yearly basis. However, less focus is aimed at comparing models, especially comparing less complex models to state-of-the-art ones. This paper compares a Random Forest with a Long Short-Term Memory (LSTM) neural network on the task of classifying sentences as questions or non-questions, without considering punctuation. The models were trained and optimized on chat data from a Swedish insurance company, as well as user comments on articles from a newspaper. The results showed that the LSTM model performed better than the Random Forest. However, the difference was small, and Random Forest could therefore still be a preferable alternative in some use cases due to its simplicity and its ability to handle noisy data. The models' performances were not dramatically improved by hyperparameter optimization. A literature study was also conducted, aimed at exploring how customer service can be automated using a chatbot and which features and functionality should be prioritized by management during such an implementation. The findings of the study showed that a data-driven design should be used, where features are derived from the specific needs and customers of the organization. However, three features were general enough to be presented: the personality of the bot, its trustworthiness, and the stage of the value chain at which the chatbot is implemented.
APA, Harvard, Vancouver, ISO, and other styles
37

Victors, Mason Lemoyne. "A Classification Tool for Predictive Data Analysis in Healthcare." BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/5639.

Full text
Abstract:
Hidden Markov Models (HMMs) have seen widespread use in a variety of applications ranging from speech recognition to gene prediction. While developed over forty years ago, they remain a standard tool for sequential data analysis. More recently, Latent Dirichlet Allocation (LDA) was developed and soon gained widespread popularity as a powerful topic analysis tool for text corpora. We thoroughly develop LDA and a generalization of HMMs and demonstrate the conjunctive use of both methods in predictive data analysis for health care problems. While these two tools (LDA and HMM) have been used in conjunction previously, we use LDA in a new way to reduce the dimensionality involved in the training of HMMs. With both LDA and our extension of HMM, we train classifiers to predict development of Chronic Kidney Disease (CKD) in the near future.
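The dimensionality-reduction role LDA plays in the abstract above can be sketched with scikit-learn. The synthetic count matrix, its "patients × medical codes" interpretation, and the topic count below are illustrative assumptions, not the thesis's data or pipeline.

```python
# Hedged sketch: LDA compresses a sparse bag-of-codes count matrix
# (e.g. medical event counts per patient) into a short topic-mixture
# vector that could feed a downstream sequence model such as an HMM.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(2)
# 100 "patients" x 500 "medical codes": sparse random count data.
counts = rng.poisson(0.1, size=(100, 500))

lda = LatentDirichletAllocation(n_components=5, random_state=0)
topics = lda.fit_transform(counts)  # shape (100, 5); each row sums to 1

print(topics.shape)
```

Each row of `topics` is a 5-dimensional topic mixture that could stand in for the original 500-dimensional count vector, which is the kind of dimensionality reduction the abstract describes ahead of HMM training.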
APA, Harvard, Vancouver, ISO, and other styles
38

Lood, Olof. "Prediktering av grundvattennivå i område utan grundvattenrör : Modellering i ArcGIS Pro och undersökning av olika miljövariablers betydelse." Thesis, Uppsala universitet, Institutionen för geovetenskaper, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-448020.

Full text
Abstract:
The Swedish authority Geological Survey of Sweden (SGU) has a national responsibility to oversee groundwater levels. A national network of measurement stations has been established to facilitate this, but the density of measurement stations varies considerably. Since it will never be feasible to cover the entire country with measurement stations, groundwater levels need to be computed in areas that are not in the near vicinity of a station. For that reason, it is of interest to investigate the correlation between groundwater levels and selected geographical information, so-called environmental variables. In the future, SGU may use machine learning to compute groundwater levels. The focus of this master's thesis is to study the importance of the environmental variables and the model uncertainties in order to determine whether this is a feasible option for implementation on a national basis. The study uses data from seven areas of the groundwater network of SGU, where the measuring stations are in clusters. The pilot study uses a supervised machine learning method, which in this case means that the median groundwater levels and the environmental variables train the models. By evaluating the models' statistical output, performance can gradually be improved. The algorithm used, Random Forest, builds classification and regression trees that learn to make decisions from the input data through a network of nodes, branches and leaves. The models are set up with the prediction tool Forest-based Classification and Regression in ArcGIS Pro. Because the areas are geographically spread out, eight unique models are set up. The results show that it is possible to predict groundwater levels with this method, but the importance of the environmental variables varies between the areas used in this study, probably due to geographical and topographical differences. 
Most often, the elevation above mean sea level and the slope direction are the most important variables. Planar and height distance differences to low- and high-permeability soils have medium-high importance, while distance differences to medium-permeability soils are less important. Planar and height distance differences matter more for lakes and large watercourses than for small watercourses and ditches. The models' r2-values are somewhat low in theory but within reasonable limits for a hydrological model. The standard errors of the estimates are also in most cases within reasonable limits. Uncertainty is displayed as a 90% confidence interval. The uncertainties increase with distance to the measuring stations and are greatest at high altitude, probably because there are too few observations, especially at high altitude. The uncertainties are smaller close to the stations and in valleys.
SGU's groundwater network
APA, Harvard, Vancouver, ISO, and other styles
39

Ekman, Björn. "Machine Learning for Beam Based Mobility Optimization in NR." Thesis, Linköpings universitet, Kommunikationssystem, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-136489.

Full text
Abstract:
One option for enabling mobility between 5G nodes is to use a set of area-fixed reference beams in the downlink direction from each node. To save power these reference beams should be turned on only on demand, i.e. only if a mobile needs them. A User Equipment (UE) moving out of a beam's coverage will require a switch from one beam to another, preferably without having to turn on all possible beams to find out which one is best. This thesis investigates how to transform the beam selection problem into a format suitable for machine learning and how good such solutions are compared to baseline models. The baseline models considered were beam overlap and average Reference Signal Received Power (RSRP), both building beam-to-beam maps. The emphasis of the thesis was on handovers between nodes and on finding the beam with the highest RSRP. Beam-hit-rate and RSRP-difference (selected minus best) were the key performance indicators and were compared for different numbers of activated beams. The problem was modeled both as a Multiple Output Regression (MOR) problem and as a Multi-Class Classification (MCC) problem. Both formulations can be solved with the random forest model, which was the learning model of choice during this work. An Ericsson simulator was used to simulate and collect data from a seven-site scenario with 40 UEs. The primary features available were the current serving beam index and its RSRP. Additional features, like position and distance, were suggested, though many ended up being limited either by the simulated scenario or by the cost of acquiring the feature in a real-world scenario. Using primary features only, the learned models' performance was equal to or worse than that of the baseline models. Adding distance improved performance considerably, beating the baseline models but still leaving room for further improvement.
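The MCC formulation with a beam-hit-rate metric described above can be sketched as follows. The synthetic beam geometry, the two-feature input, and every number below are illustrative assumptions, not the thesis's simulator data.

```python
# Hedged sketch of the MCC formulation: predict the best target beam from
# the serving beam index and its RSRP, then score a top-k "beam-hit-rate"
# -- the fraction of samples whose true best beam is among the k beams
# the model would activate.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, n_beams = 3000, 8
serving = rng.integers(0, n_beams, n)
rsrp = rng.normal(-90, 5, n)  # dBm; pure noise in this toy setup
# Synthetic ground truth: the best beam tends to neighbour the serving one.
best = (serving + rng.choice([-1, 0, 1], n, p=[0.25, 0.5, 0.25])) % n_beams

X = np.column_stack([serving, rsrp])
X_tr, X_te, y_tr, y_te = train_test_split(X, best, random_state=0)

clf = RandomForestClassifier(n_estimators=200, max_depth=8,
                             random_state=0).fit(X_tr, y_tr)

def beam_hit_rate(clf, X, y, k):
    """Hit if the true best beam is among the k highest-probability beams."""
    proba = clf.predict_proba(X)
    topk = np.argsort(proba, axis=1)[:, -k:]
    return np.mean([y[i] in clf.classes_[topk[i]] for i in range(len(y))])

for k in (1, 2, 3):
    print(f"k={k}: hit-rate={beam_hit_rate(clf, X_te, y_te, k):.2f}")
```

Activating more candidate beams (larger k) can only raise the hit-rate, which mirrors the thesis's trade-off between hit-rate and the number of beams turned on.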
APA, Harvard, Vancouver, ISO, and other styles
40

Wirgen, Isak, and Douglas Rube. "Supervised fraud detection of mobile money transactions on different distributions of imbalanced data : A comparative study of the classification methods logistic regression, random forest, and support vector machine." Thesis, Uppsala universitet, Statistiska institutionen, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-446108.

Full text
Abstract:
The purpose of this paper is to compare the performance of the classification methods logistic regression, random forest, and support vector machine in detecting mobile money transaction fraud. Their performance is evaluated on different distributions of imbalanced data in a supervised framework, using a variety of metrics to capture full model performance. The results show that random forest attained the highest overall performance, followed by logistic regression. Support vector machine attained the worst overall performance and produced no useful classification of fraudulent transactions. In conclusion, the study suggests that better results could be achieved through measures such as improvements to the classification algorithms and better feature selection, among others.
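As a hedged sketch of the comparison described (on synthetic data, not the mobile money set), the three classifiers can be fitted side by side and scored with an imbalance-aware metric such as F1 rather than plain accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic stand-in for the fraud data: ~3% positive class
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.97], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "support vector machine": LinearSVC(),
}
# Accuracy would be ~97% for a model that never flags fraud,
# so F1 on the minority class is the more honest score here
scores = {name: f1_score(yte, m.fit(Xtr, ytr).predict(Xte))
          for name, m in models.items()}
```

This is one reason the abstract's conclusion depends on evaluating "a variety of metrics" rather than a single one.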
APA, Harvard, Vancouver, ISO, and other styles
41

Li, Mao Li. "Spatial-temporal classification enhancement via 3-D iterative filtering for multi-temporal Very-High-Resolution satellite images." The Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu1514939565470669.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Stříteský, Radek. "Sémantické rozpoznávání komentářů na webu." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2017. http://www.nusl.cz/ntk/nusl-317212.

Full text
Abstract:
The main goal of this paper is the identification of comments on internet websites. The theoretical part focuses on artificial intelligence, with particular attention to classifiers. The practical part deals with the creation of a training database, which is built using feature generators; a generated feature might be, for example, the title of the HTML element where the comment is located. The training database then serves as input to the classifiers. The paper concludes by testing the classifiers in the RapidMiner program.
APA, Harvard, Vancouver, ISO, and other styles
43

Consuegra, Rengifo Nathan Adolfo. "Detection and Classification of Anomalies in Road Traffic using Spark Streaming." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-238733.

Full text
Abstract:
Road traffic control has existed for a long time to guarantee the safety of vehicles and pedestrians. However, anomalies such as accidents or natural disasters cannot be avoided, so it is important to be prepared as soon as possible to prevent a higher number of human losses. Nevertheless, no existing system detects and classifies anomalies in road traffic in real time with sufficient accuracy. To address this, the following study proposes training a machine learning model for the detection and classification of anomalies on the highways of Stockholm. Due to the lack of a labeled dataset, the first phase of the work is to detect the different kinds of outliers that can be found and manually label them based on the results of a data exploration study. Datasets containing information about accidents and weather are also included to further expand the number of anomalies. All experiments use real-world datasets coming either from the sensors located on the highways of Stockholm or from official accident and weather reports. Three models (Decision Trees, Random Forest, and Logistic Regression) are then trained to detect and classify the outliers. The design of an Apache Spark Streaming application that uses the best-performing model is also provided. The outcomes indicate that Logistic Regression performs better than the rest but still suffers from the imbalanced nature of the dataset. In the future, this project can not only contribute to research on similar topics but also be used to monitor the highways of Stockholm.
APA, Harvard, Vancouver, ISO, and other styles
44

dos, Santos Toledo Busarello Mariana. "Machine Learning Applied to Reach Classification in a Northern Sweden Catchment." Thesis, Umeå universitet, Institutionen för ekologi, miljö och geovetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-184140.

Full text
Abstract:
An accurate fine-resolution classification of river systems positively impacts the assessment and monitoring of water courses, as stressed by the European Commission's Water Framework Directive. Being able to attribute classes using remotely obtained data makes it possible to classify reaches extensively without field work, and some methods also allow identifying which features best describe each of the process domains. In this work, data from two Swedish sub-catchments above the highest coastline were used to train a random forest classifier, a machine learning algorithm. The resulting model provided class predictions and analyses of the most important features. Each study area was first examined separately, then in combination; in the combined case, the analysis was run with and without lakes in the data to verify how their presence would affect the predictions. The results showed that the accuracy of the estimator was reliable; however, due to data complexity and imbalance, rapids were harder to classify accurately, with an overprediction of the slow-flowing class. Combining the datasets and including lakes lessened the shortcomings of the data imbalance. Using the feature importance and permutation importance methods, the three most important features identified were the channel slope, the median of the roughness in the 100-m buffer, and the standard deviation of the planform curvature in the 100-m buffer. This finding was supported by previous studies, but other variables expected to contribute strongly, such as lithology and valley confinement, were not relevant, which most likely relates to the coarseness of the available data. The most frequent errors were also mapped, revealing some overlap between error hotspots and areas restored in 2010.
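The two importance measures mentioned in this abstract can be sketched as follows; the feature names echo the thesis, but the data are synthetic stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
slope = rng.normal(size=500)        # channel slope
roughness = rng.normal(size=500)    # median roughness, 100-m buffer
curvature = rng.normal(size=500)    # sd of planform curvature, 100-m buffer
# Toy label: "rapid" vs "slow-flowing", driven mostly by slope
y = (slope + 0.3 * roughness > 0).astype(int)

X = np.column_stack([slope, roughness, curvature])
clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

impurity_imp = clf.feature_importances_   # Gini/impurity-based importance
perm = permutation_importance(clf, X, y, n_repeats=10, random_state=1)
```

Permutation importance shuffles one feature at a time and measures the resulting score drop, which avoids the impurity measure's known bias toward high-cardinality features.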
APA, Harvard, Vancouver, ISO, and other styles
45

Andrade, Priscilla Valessa de Castro. "O que h? por tr?s das diferen?as individuais? Perfis comportamentais e fisiol?gicos em Betta splendens." PROGRAMA DE P?S-GRADUA??O EM PSICOBIOLOGIA, 2017. https://repositorio.ufrn.br/jspui/handle/123456789/23842.

Full text
Abstract:
Individuals show different strategies for coping with varied external stimuli as the environment changes, and these different responses comprise the phenotypes that compose a population. Such differences can be explained by endogenous changes, such as hormonal secretion; for instance, hormones modulate reproductive behaviors and cognitive processes. In order to characterize individual differences in a population, the present study aimed to test the relationship between behavioral and hormonal profiles in a group of male fighting fish, Betta splendens. A group of 86 males was observed for bubble nest construction, agonistic displays in conspecific contests, and performance in a spatial learning protocol. After that, cortisol and testosterone plasma levels were measured. An innovative and elegant statistical procedure was applied to the data set in order to separate the animals into groups related to their nest-building behavior (k-means test) and then to show which behavioral and physiological parameters best explain the groups' profiles (random forest and classification tree). Our results point to three distinct profiles: nest builders (nests of 30.74 ± 9.84 cm³), intermediates (nests of 13.57 ± 4.23 cm³), and non-builders (nests of 2.17 ± 2.25 cm³). These groups presented marked differences in agonistic and learning behavior, as well as in hormone levels. Cortisol was the main predictor identified by the random forest for separating individuals into the different groups: nest builders and intermediates showed lower levels of cortisol, while non-builders presented the highest basal cortisol values. The second most important predictor was learning performance, which separated the intermediates from the nest builders (the faster learners), followed by basal testosterone levels and agonistic behavior displays.
While testosterone levels were not significant in explaining behavioral differences, they appear to be related to the nest-building profile. Our findings show that different profiles invest differently in reproduction and that cortisol negatively influences nesting behavior and learning. In summary, our data suggest that different profiles in a population are determined by both hormonal and behavioral responses, and these differences confer flexibility on the population, allowing the presence of animals that invest the most in reproduction while others show defense and aggression as their dominant expressed traits.
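A hedged sketch of the two-step procedure this abstract describes (k-means grouping followed by a random forest to rank predictors); all numbers are synthetic stand-ins for the fish data, with cortisol made informative by construction:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
# Nest volumes roughly matching the three reported profiles (cm^3)
nest_volume = np.concatenate([rng.normal(30, 5, 30),
                              rng.normal(13, 3, 30),
                              rng.normal(2, 1, 26)])    # 86 "males"
groups = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(
    nest_volume.reshape(-1, 1))

# Cortisol tracks the groups; testosterone is mostly noise,
# mirroring (not reproducing) the reported finding
cortisol = groups * 2.0 + rng.normal(0, 0.5, 86)
testosterone = rng.normal(0, 1, 86)
X = np.column_stack([cortisol, testosterone])
rf = RandomForestClassifier(n_estimators=200, random_state=2).fit(X, groups)
# rf.feature_importances_ then ranks cortisol above testosterone
```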
APA, Harvard, Vancouver, ISO, and other styles
46

Mordini, Michael B. "GULF OF MAINE LAND COVER AND LAND USE CHANGE ANALYSIS UTILIZING RANDOM FOREST CLASSIFICATION: TO BE USED IN HYDROLOGICAL AND ECOLOGICAL MODELING OF TERRESTRIAL CARBON EXPORT TO THE GULF OF MAINE VIA RIVERINE SYSTEMS." Miami University / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=miami1375801345.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

He, Juan Xia. "Assessing and Improving Methods for the Effective Use of Landsat Imagery for Classification and Change Detection in Remote Canadian Regions." Thesis, Université d'Ottawa / University of Ottawa, 2016. http://hdl.handle.net/10393/34221.

Full text
Abstract:
Canadian remote areas are characterized by a minimal human footprint, restricted accessibility, ubiquitous lichen/snow cover (e.g. Arctic) or continuous forest with water bodies (e.g. Sub-Arctic). Effective mapping of earth surface cover and land cover changes using free medium-resolution Landsat images in remote environments is a challenge due to the presence of spectrally mixed pixels, restricted field sampling and ground truthing, and the often relatively homogenous cover in some areas. This thesis investigates how remote sensing methods can be applied to improve the capability of Landsat images for mapping earth surface features and land cover changes in Canadian remote areas. The investigation is conducted from the following four perspectives: 1) determining the continuity of Landsat-8 images for mapping surficial materials, 2) selecting classification algorithms that best address challenges involving mixed pixels, 3) applying advanced image fusion algorithms to improve Landsat spatial resolution while maintaining spectral fidelity and reducing the effects of mixed pixels on image classification and change detection, and 4) examining different change detection techniques, including post-classification comparisons and threshold-based methods employing PCA (Principal Component Analysis)-fused multi-temporal Landsat images to detect changes in Canadian remote areas. Three typical landscapes in Canadian remote areas are chosen in this research. The first is located in the Canadian Arctic and is characterized by ubiquitous lichen and snow cover. The second is located in the Canadian sub-Arctic and is characterized by well-defined land features such as highlands, ponds, and wetlands. The last is located in a forested highlands region with minimal built-environment features.
The thesis research demonstrates that the newly available Landsat-8 images can be a major data source for mapping Canadian geological information in Arctic areas when Landsat-7 is decommissioned. In addition, advanced classification techniques such as the support vector machine (SVM) can generate satisfactory classification results in the context of mixed training data and minimal field sampling and ground truthing. This thesis research provides a systematic investigation of how geostatistical image fusion can be used to improve the performance of Landsat images in identifying surface features. Finally, SVM-based post-classification comparison of multi-temporal images and threshold-based analysis of PCA-fused bi-temporal Landsat images are shown to be effective in detecting different aspects of vegetation change in a remote forested region in Ontario. This research provides a comprehensive methodology for employing free Landsat images for image classification and change detection in Canadian remote regions.
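One of the threshold-based, PCA-fused change-detection variants mentioned can be sketched on a synthetic bi-temporal pair; the images, the change patch, and the threshold rule below are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
t1 = rng.normal(100, 10, (64, 64))        # band at time 1
t2 = t1.copy()
t2[20:30, 20:30] += 60                    # simulated change patch at time 2

# Stack pixels as (t1, t2) pairs; the second principal component is
# roughly the change axis, orthogonal to the no-change t1 = t2 line
stack = np.column_stack([t1.ravel(), t2.ravel()])
pc2 = PCA(n_components=2).fit_transform(stack)[:, 1]
threshold = np.abs(pc2).mean() + 2 * np.abs(pc2).std()
change_map = (np.abs(pc2) > threshold).reshape(64, 64)
```

Pixels whose second-component magnitude exceeds the threshold are flagged as changed; the unchanged majority defines the no-change axis.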
APA, Harvard, Vancouver, ISO, and other styles
48

Trahan, Patrick. "Classification of Carpiodes Using Fourier Descriptors: A Content Based Image Retrieval Approach." ScholarWorks@UNO, 2009. http://scholarworks.uno.edu/td/1085.

Full text
Abstract:
Taxonomic classification has always been important to the study of any biological system. At the current rate of classification, many biological species will go unclassified and be lost forever. The current state of computer technology makes image storage and retrieval possible on a global level; as a result, computer-aided taxonomy is now feasible. Content-based image retrieval techniques utilize visual features of the image for classification. By utilizing image content and computer technology, the gap between taxonomic classification and species destruction is shrinking. This content-based study utilizes the Fourier descriptors of fifteen known landmark features on three Carpiodes species: C. carpio, C. velifer, and C. cyprinus. Classification analysis involves both unsupervised and supervised machine learning algorithms. Fourier descriptors of the fifteen known landmarks provide strong classification power on image data. Feature reduction analysis indicates that feature reduction is possible, which is useful for increasing the generalization power of classification.
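Fourier descriptors of a closed outline, the shape representation used in this study, reduce to an FFT of the boundary treated as a complex signal. The toy outline below is illustrative, not the Carpiodes landmark data:

```python
import numpy as np

# Toy closed outline: an ellipse sampled at 64 boundary points
theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
x, y = 3 * np.cos(theta), np.sin(theta)

z = x + 1j * y                 # boundary points as one complex signal
coeffs = np.fft.fft(z)
# Standard normalisation: drop the DC term (removes translation) and
# divide by the first harmonic's magnitude (removes scale)
desc = np.abs(coeffs[1:]) / np.abs(coeffs[1])
```

Truncating `desc` to its first few entries is one way to perform the feature reduction the abstract reports as feasible.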
APA, Harvard, Vancouver, ISO, and other styles
49

Woods, Tonya M. "Extracting meaningful statistics for the characterization and classification of biological, medical, and financial data." Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/53857.

Full text
Abstract:
This thesis is focused on extracting meaningful statistics for the characterization and classification of biological, medical, and financial data and contains four chapters. The first chapter contains theoretical background on scaling and wavelets, which supports the work in chapters two and three. In the second chapter, we outline a methodology for representing sequences of DNA nucleotides as numeric matrices in order to analytically investigate important structural characteristics of DNA. This methodology involves assigning unit vectors to nucleotides, placing the vectors into columns of a matrix, and accumulating across the rows of this matrix. Transcribing the DNA in this way allows us to compute the 2-D wavelet transformation and assess regularity characteristics of the sequence via the slope of the wavelet spectra. In addition to computing a global slope measure for a sequence, we can apply our methodology to overlapping sections of nucleotides to obtain an evolutionary slope. In the third chapter, we describe various ways wavelet-based scaling may be used for cancer diagnostics. There were nearly half a million new cases of ovarian, breast, and lung cancer in the United States last year. Breast and lung cancer have the highest prevalence, while ovarian cancer has the lowest survival rate of the three. Early detection is critical for all of these diseases, but substantial obstacles to early detection exist in each case. In this work, we use wavelet-based scaling on metabolic data and radiography images in order to produce meaningful features to be used in classifying cases and controls. Computer-aided detection (CAD) algorithms for detecting lung and breast cancer often focus on select features in an image and make a priori assumptions about the nature of a nodule or a mass.
In contrast, our approach to analyzing breast and lung images captures information contained in the background tissue of images as well as information about specific features, and makes no such a priori assumptions. In the fourth chapter, we investigate the value of social media data in building commercial default and activity credit models. We use random forest modeling, which has been shown in many instances to achieve better predictive accuracy than logistic regression in modeling credit data. This result is of interest, as some entities are beginning to build credit scores based on this type of publicly available online data alone. Our work has shown that the addition of social media data does not provide any improvement in model accuracy over the bureau-only models. However, the social media data on its own does have some limited predictive power.
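The chapter-two encoding described above can be sketched directly; the particular nucleotide-to-unit-vector assignment below is an assumption for illustration, not necessarily the thesis's own:

```python
import numpy as np

# One possible nucleotide-to-unit-vector assignment (an assumption here)
CODE = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
        "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def dna_walk(seq):
    """Unit-vector columns for each base, accumulated along the sequence."""
    cols = np.array([CODE[b] for b in seq]).T   # 4 x n indicator matrix
    return np.cumsum(cols, axis=1)              # running count of each base

walk = dna_walk("ACGTAC")
# Each row is now a cumulative count signal; the thesis feeds such a matrix
# to a 2-D wavelet transform and reads regularity off the spectral slope
```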
APA, Harvard, Vancouver, ISO, and other styles
50

Xiong, Kuangnan. "Roughened Random Forests for Binary Classification." Thesis, State University of New York at Albany, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3624962.

Full text
Abstract:

Binary classification plays an important role in many decision-making processes. Random forests can build a strong ensemble classifier by combining weaker classification trees that are de-correlated. The strength and correlation among individual classification trees are the key factors that contribute to the ensemble performance of random forests. We propose roughened random forests, a new set of tools which show further improvement over random forests in binary classification. Roughened random forests modify the original dataset for each classification tree and further reduce the correlation among individual classification trees. This data modification process is composed of artificially imposing missing data that are missing completely at random and subsequent missing data imputation.

Through this dissertation we aim to answer a few important questions in building roughened random forests: (1) What is the ideal rate of missing data to impose on the original dataset? (2) Should we impose missing data on both the training and testing datasets, or only on the training dataset? (3) What are the best missing data imputation methods to use in roughened random forests? (4) Do roughened random forests share the same ideal number of covariates selected at each tree node as the original random forests? (5) Can roughened random forests be used in medium- to high-dimensional datasets?
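A minimal sketch of one possible reading of the roughening step (MCAR deletion plus mean imputation, applied per tree); the helper names and the imputation choice are assumptions, not the dissertation's exact algorithm:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

def roughened_forest(X, y, n_trees=25, miss_rate=0.2, seed=0):
    """Each tree sees its own MCAR-deleted, mean-imputed copy of the data."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        Xr = X.astype(float).copy()
        Xr[rng.random(Xr.shape) < miss_rate] = np.nan   # roughening step
        Xr = SimpleImputer(strategy="mean").fit_transform(Xr)
        boot = rng.integers(0, len(X), len(X))          # usual bootstrap
        trees.append(DecisionTreeClassifier(random_state=0)
                     .fit(Xr[boot], y[boot]))
    return trees

def forest_predict(trees, X):
    """Majority vote over the roughened trees (binary 0/1 labels)."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)
```

The deletion-plus-imputation step perturbs each tree's training matrix differently, which is the extra decorrelation mechanism the abstract describes.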

APA, Harvard, Vancouver, ISO, and other styles