
Dissertations / Theses on the topic 'Decision Tree and Random Forest Classifier'


Consult the top 39 dissertations / theses for your research on the topic 'Decision Tree and Random Forest Classifier.'


You can also download the full text of each academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses across a wide variety of disciplines and organise your bibliography correctly.

1

Федоров, Д. П. "Comparison of classifiers based on the decision tree." Thesis, ХНУРЕ, 2021. https://openarchive.nure.ua/handle/document/16430.

Full text
Abstract:
The main purpose of this work is to compare classifiers. Random Forest and XGBoost are two popular machine learning algorithms. In this paper, we examine how they work, compare their features, and evaluate the accuracy of the results they produce.
APA, Harvard, Vancouver, ISO, and other styles
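As a hedged illustration of the comparison this entry describes, the sketch below fits a random forest and a gradient-boosted ensemble on synthetic data and reports test accuracy. scikit-learn's GradientBoostingClassifier stands in for XGBoost (which requires a separate package), and the dataset and parameters are invented for illustration, not taken from the thesis.

```python
# Minimal sketch of a Random Forest vs. boosted-trees comparison (assumptions:
# synthetic data; GradientBoostingClassifier as a stand-in for XGBoost).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data in place of the thesis's dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for name, model in [
    ("random_forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("gradient_boosting", GradientBoostingClassifier(random_state=0)),
]:
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

print(scores)
```

On real data the ranking can go either way; the point of such a comparison is to hold the data split and metric fixed while only the ensemble method varies.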
2

Holloway, Jacinta. "Extending decision tree methods for the analysis of remotely sensed images." Thesis, Queensland University of Technology, 2021. https://eprints.qut.edu.au/207763/1/Jacinta_Holloway_Thesis.pdf.

Full text
Abstract:
One UN Sustainable Development Goal focuses on monitoring the presence, growth, and loss of forests. The cost of tracking progress towards this goal is often prohibitive. Satellite images provide an opportunity to use free data for environmental monitoring. However, these images have missing data due to cloud cover, particularly in the tropics. In this thesis I introduce fast and accurate new statistical methods to fill these data gaps. I create spatial and stochastic extensions of decision tree machine learning methods for interpolating missing data. I illustrate these methods with case studies monitoring forest cover in Australia and South America.
APA, Harvard, Vancouver, ISO, and other styles
3

Булах, В. А., Л. О. Кіріченко, and Т. А. Радівілова. "Classification of Multifractal Time Series by Decision Tree Methods." Thesis, КНУ, 2018. http://openarchive.nure.ua/handle/document/5840.

Full text
Abstract:
The article considers the task of classifying model fractal time series by machine learning methods. To classify the series, it is proposed to use meta-algorithms based on decision trees. Binomial stochastic cascade processes are used to model the fractal time series. Classification of the time series is carried out by ensembles of decision tree models. The analysis indicates that the best results are obtained by the bagging and random forest methods that use regression trees.
APA, Harvard, Vancouver, ISO, and other styles
4

Assareh, Amin. "OPTIMIZING DECISION TREE ENSEMBLES FOR GENE-GENE INTERACTION DETECTION." Kent State University / OhioLINK, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=kent1353971575.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Doubleday, Kevin. "Generation of Individualized Treatment Decision Tree Algorithm with Application to Randomized Control Trials and Electronic Medical Record Data." Thesis, The University of Arizona, 2016. http://hdl.handle.net/10150/613559.

Full text
Abstract:
With new treatments and novel technology available, personalized medicine has become a key topic in the new era of healthcare. Traditional statistical methods for personalized medicine and subgroup identification primarily focus on single-treatment or two-arm randomized control trials (RCTs). With restricted inclusion and exclusion criteria, data from RCTs may not reflect real-world treatment effectiveness. However, electronic medical records (EMR) offer an alternative venue. In this paper, we propose a general framework to identify an individualized treatment rule (ITR), which connects subgroup identification methods and ITRs. It is applicable to both RCT and EMR data. Given the large scale of EMR datasets, we develop a recursive partitioning algorithm to solve the problem (ITR-Tree). A variable importance measure is also developed for personalized medicine using random forest. We demonstrate our method through simulations, and apply ITR-Tree to datasets from diabetes studies using both RCT and EMR data. A software package is available at https://github.com/jinjinzhou/ITR.Tree.
APA, Harvard, Vancouver, ISO, and other styles
6

Wright, Lindsey. "Classifying textual fast food restaurant reviews quantitatively using text mining and supervised machine learning algorithms." Digital Commons @ East Tennessee State University, 2018. https://dc.etsu.edu/honors/451.

Full text
Abstract:
Companies continually seek to improve their business model through feedback and customer satisfaction surveys. Social media provides additional opportunities for this advanced exploration into the mind of the customer. By extracting customer feedback from social media platforms, companies may increase the sample size of their feedback and remove bias often found in questionnaires, resulting in better informed decision making. However, simply using personnel to analyze the thousands of relevant social media posts is financially expensive and time consuming. Thus, our study aims to establish a method to extract business intelligence from social media content by structuring opinionated textual data using text mining and classifying these reviews by the degree of customer satisfaction. By quantifying textual reviews, companies may perform statistical analysis to extract insight from the data as well as effectively address concerns. Specifically, we analyzed a subset of 56,000 Yelp reviews of fast food restaurants and attempted to predict a quantitative value reflecting the overall opinion of each review. We compare the use of two different predictive modeling techniques, bagged Decision Trees and Random Forest Classifiers. In order to simplify the problem, we train our model to accurately classify strongly negative and strongly positive reviews (1-star and 5-star reviews). In addition, we identify the drivers behind strongly positive or negative reviews, allowing businesses to understand their strengths and weaknesses. This method provides companies an efficient and cost-effective way to process and understand customer satisfaction as it is discussed on social media.
APA, Harvard, Vancouver, ISO, and other styles
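In the spirit of the review-classification pipeline summarized above, here is a minimal hedged sketch: a TF-IDF representation of a tiny invented corpus, classified with bagged decision trees and a random forest. The six toy reviews and star labels are fabricated for illustration; the actual study used 56,000 Yelp reviews.

```python
# Sketch of classifying strongly positive (5) vs. strongly negative (1) reviews
# with bagged trees and a random forest (assumption: invented toy corpus).
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

reviews = [
    "great food and fast friendly service",
    "loved the burger, amazing staff",
    "terrible wait and cold fries",
    "awful experience, rude cashier",
    "amazing fries and friendly staff",
    "cold food, slow and terrible service",
]
stars = [5, 5, 1, 1, 5, 1]  # 5 = strongly positive, 1 = strongly negative

vec = TfidfVectorizer()
X = vec.fit_transform(reviews)  # sparse TF-IDF features

bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                           random_state=0).fit(X, stars)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, stars)

new = vec.transform(["friendly staff and great food"])
print(bagged.predict(new), forest.predict(new))
```

With a corpus this small the predictions are not meaningful; the sketch only shows how the two ensembles share a common vectorizer and training interface.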
7

Lundström, Love, and Oscar Öhman. "Machine Learning in credit risk : Evaluation of supervised machine learning models predicting credit risk in the financial sector." Thesis, Umeå universitet, Institutionen för matematik och matematisk statistik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-164101.

Full text
Abstract:
When banks lend money to another party, they face the risk that the borrower will not fulfil its obligation towards the bank. This risk is called credit risk, and it is the largest risk a bank faces. According to the Basel accords, banks need to hold a certain amount of capital to protect themselves against future financial crises. This amount is calculated for each loan via its risk-weighted assets (RWA). The main parameters in RWA are the probability of default and the loss given default. Banks are today allowed to use their own internal models to calculate these parameters. Since holding capital that earns no interest is a great cost, banks seek tools to better predict the probability of default and thereby lower the capital requirement. Machine learning with supervised algorithms such as logistic regression, neural networks, decision trees and random forests can be used to assess credit risk. By training algorithms on historical data with known outcomes, the probability of default (PD) can be determined with a higher degree of certainty than with traditional models, leading to a lower capital requirement. On the data set used in this article, logistic regression seems to be the algorithm with the highest accuracy in classifying customers into the right category. However, it classifies many people as false positives, meaning the model predicts that a customer will honour its obligation when in fact the customer defaults, which comes at a great cost for the banks. By implementing a cost function to minimize this error, we found that the neural network has the lowest false positive rate and is therefore the model best suited for this specific classification task.
APA, Harvard, Vancouver, ISO, and other styles
8

Rosales, Martínez Octavio. "Caracterización de especies en plasma frío mediante análisis de espectroscopia de emisión óptica por técnicas de Machine Learning." Tesis de maestría, Universidad Autónoma del Estado de México, 2020. http://hdl.handle.net/20.500.11799/109734.

Full text
Abstract:
Optical emission spectroscopy is a technique that allows the identification of chemical elements using the electromagnetic spectrum emitted by a plasma. According to the literature, it has diverse applications, for example: identifying stellar objects, determining the end point of plasma processes in semiconductor manufacturing or, specifically in this work, processing spectra to determine the elements present in the degradation of recalcitrant compounds. In this document, spectra of elements such as He, Ar, N, O and Hg, at energy levels one and two, are identified automatically using Machine Learning (ML) techniques. First, the element lines reported by NIST (National Institute of Standards and Technology) are downloaded, then preprocessed and unified for the following processes: a) building a generator of 84 synthetic spectra, implemented in Python with the ipywidgets module of Jupyter Notebook, with options to choose an element and energy level, vary the temperature and the full width at half maximum, and normalize the spectrum; and b) extracting the lines of the elements He, Ar, N, O and Hg in the range from 200 nm to 890 nm. Subsequently, oversampling is applied and a hyperparameter search is performed for the algorithms Decision Tree, Bagging, Random Forest and Extremely Randomized Trees, following the design-of-experiments principles of randomization, replication, blocking and stratification.
APA, Harvard, Vancouver, ISO, and other styles
9

Yan, Ping. "Anomaly Detection in Categorical Data with Interpretable Machine Learning : A random forest approach to classify imbalanced data." Thesis, Linköpings universitet, Statistik och maskininlärning, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-158185.

Full text
Abstract:
Metadata refers to "data about data", which contains information needed to understand the process of data collection. In this thesis, we investigate if metadata features can be used to detect broken data and how a tree-based interpretable machine learning algorithm can be used for an effective classification. The goal of this thesis is two-fold. Firstly, we apply a classification schema using metadata features for detecting broken data. Secondly, we generate the feature importance rate to understand the model's logic and reveal the key factors that lead to broken data. The given task from the Swedish automotive company Veoneer is a typical problem of learning from an extremely imbalanced data set, with 97 percent of the data belonging to the healthy class and only 3 percent to the broken class. Furthermore, the whole data set contains only categorical variables in nominal scales, which brings challenges to the learning algorithm. Handling the imbalance problem for continuous data is relatively well-studied, but for categorical data the solution is not straightforward. In this thesis, we propose a combination of tree-based supervised learning and hyper-parameter tuning to identify the broken data in a large data set. Our method is composed of three phases: data cleaning, which eliminates ambiguous and redundant instances; supervised learning with a random forest; and lastly a random search for hyper-parameter optimization of the random forest model. Our results show empirically that the tree-based ensemble method together with a random search for hyper-parameter optimization improved random forest performance in terms of the area under the ROC curve. The model delivered an acceptable classification result and showed that metadata features are capable of detecting broken data and providing an interpretable result by identifying the key features for the classification model.
APA, Harvard, Vancouver, ISO, and other styles
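The pipeline this entry describes (one-hot encoding of nominal features, a random forest, and a random search over hyper-parameters scored by ROC AUC) can be sketched as follows. All data here is synthetic and the parameter grid is invented; only the overall shape of the approach follows the abstract.

```python
# Hedged sketch: random forest + random hyper-parameter search on imbalanced
# nominal data (assumptions: synthetic features, ~3% minority class, toy grid).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X_cat = rng.choice(["a", "b", "c"], size=(400, 5))  # nominal-scale features
y = (rng.random(400) < 0.03).astype(int)            # ~3% "broken" class
y[:5] = 1  # guarantee enough minority samples for 3-fold CV

X = OneHotEncoder().fit_transform(X_cat)  # one-hot encoding of nominal scales

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    {"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    n_iter=4, cv=3, scoring="roc_auc", random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the labels here are random noise, the AUC hovers near 0.5; on real metadata the same scaffolding would surface whichever hyper-parameters actually separate broken from healthy records.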
10

Stříteský, Radek. "Sémantické rozpoznávání komentářů na webu." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2017. http://www.nusl.cz/ntk/nusl-317212.

Full text
Abstract:
The main goal of this paper is the identification of comments on internet websites. The theoretical part focuses on artificial intelligence; classifiers in particular are described there. The practical part deals with the creation of a training database, which is built using feature generators. A generated feature might be, for example, the title of the HTML element in which the comment appears. The training database serves as the input to the classifiers. The result of this paper is the testing of the classifiers in the RapidMiner program.
APA, Harvard, Vancouver, ISO, and other styles
11

Revend, War. "Predicting House Prices on the Countryside using Boosted Decision Trees." Thesis, KTH, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-279849.

Full text
Abstract:
This thesis intends to evaluate the feasibility of supervised learning models for predicting house prices in the countryside of South Sweden. It is essential for mortgage lenders to have accurate housing valuation algorithms, and the current model offered by Booli is not accurate enough when evaluating residence prices in the countryside. Different types of boosted decision trees were implemented to address this issue, and their performance was compared to traditional machine learning methods. These supervised learning models were implemented in order to find the best model with regard to relevant evaluation metrics such as root-mean-squared error (RMSE) and mean absolute percentage error (MAPE). The implemented models were ridge regression, lasso regression, random forest, AdaBoost, gradient boosting, CatBoost, XGBoost, and LightGBM. All these models were benchmarked against Booli's current housing valuation algorithm, which is based on a k-NN model. The results of this thesis indicated that the LightGBM model is the optimal one, as it had the best overall performance with respect to the chosen evaluation metrics. Compared to the benchmark, the LightGBM model performed better overall, with an RMSE of 0.330 versus 0.358 for the Booli model, indicating that boosted decision trees have the potential to improve the predictive accuracy of residence prices in the countryside.
APA, Harvard, Vancouver, ISO, and other styles
12

Velka, Elina. "Loss Given Default Estimation with Machine Learning Ensemble Methods." Thesis, KTH, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-279846.

Full text
Abstract:
This thesis evaluates the performance of three machine learning methods in the prediction of the Loss Given Default (LGD). LGD can be seen as the opposite of the recovery rate, i.e. the ratio of an outstanding loan that the loan issuer would not be able to recover in case the customer defaults. The methods investigated are decision trees, random forest and boosted methods. All of the investigated methods performed well in predicting the cases where the loan is not recovered, LGD = 1 (100%), or the loan is totally recovered, LGD = 0 (0%). When the performance of the models was evaluated on a dataset where the observations with LGD = 1 were removed, a significant decrease in performance was observed. The random forest model built on an unbalanced training dataset showed better performance on the test dataset that included values LGD = 1, while the random forest model built on a balanced training dataset performed better on the test set where the observations with LGD = 1 were removed. The boosted models evaluated in this study produced less accurate predictions than the other methods. Overall, the random forest models showed slightly better results than the decision tree models, although the computational time (the cost) was considerably longer when running the random forest models. Therefore, decision tree models would be suggested for prediction of the Loss Given Default.
APA, Harvard, Vancouver, ISO, and other styles
13

Varatharajah, Thujeepan, and Eriksson Victor. "A comparative study on artificial neural networks and random forests for stock market prediction." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186452.

Full text
Abstract:
This study investigates the predictive performance of two different machine learning (ML) models on the stock market and compares the results. The chosen models are based on artificial neural networks (ANN) and random forests (RF). The models are trained on two separate data sets and the predictions are made on the next-day closing price. The input vectors of the models consist of 6 different financial indicators based on the closing prices of the past 5, 10 and 20 days. The performance evaluation is done by analyzing and comparing values such as the root mean squared error (RMSE) and mean absolute percentage error (MAPE) for the test period. Specific behavior in subsets of the test period is also analyzed to evaluate the consistency of the models. The results showed that the ANN model performed better than the RF model, as it had lower errors relative to the actual prices throughout the test period and thus made more accurate predictions overall.
APA, Harvard, Vancouver, ISO, and other styles
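RMSE and MAPE, the two regression metrics recurring in the entries above, are simple enough to write out directly. The sketch below is a reference implementation in NumPy with invented example values.

```python
# The two evaluation metrics used above, written out in NumPy for reference.
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error: sqrt of the mean squared residual."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error; y_true must be non-zero."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

actual = [100.0, 200.0, 400.0]     # invented ground-truth prices
predicted = [110.0, 190.0, 420.0]  # invented model outputs
print(round(rmse(actual, predicted), 3), round(mape(actual, predicted), 3))
# prints: 14.142 6.667
```

Note the asymmetry of MAPE: an absolute error of 10 on a true value of 100 weighs twice as much as the same error on a true value of 200, which is why price-prediction studies often report both metrics.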
14

Yang, Kaolee. "A Statistical Analysis of Medical Data for Breast Cancer and Chronic Kidney Disease." Bowling Green State University / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1587052897029939.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Fredriksson, Tomas, and Rickard Svensson. "Analysis of machine learning for human motion pattern recognition on embedded devices." Thesis, KTH, Mekatronik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-246087.

Full text
Abstract:
With an increased number of connected devices and the recent surge of artificial intelligence, the two technologies need more attention to fully bloom as a useful tool for creating new and exciting products. As machine learning traditionally is implemented on computers and online servers, this thesis explores the possibility of extending machine learning to an embedded environment. This evaluation of existing machine learning in embedded systems with limited processing capabilities has been carried out in the specific context of an application involving classification of basic human movements. Previous research and implementations indicate that it is possible with some limitations; this thesis aims to answer which hardware limitation affects classification the most and what classification accuracy the system can reach on an embedded device. The tests included human motion data from an existing dataset and covered four different machine learning algorithms on three devices. The Support Vector Machine (SVM) was found to perform best compared to CART, Random Forest and AdaBoost. It reached a classification accuracy of 84.69% across the six included motions, with a classification time of 16.88 ms per classification on a Cortex M4 processor. This is the same classification accuracy as the one obtained on the host computer with more computational capabilities. Other hardware and machine learning algorithm combinations showed a slight decrease in classification accuracy and an increase in classification time. The conclusion is that memory on the embedded device affects which algorithms can be run and the complexity of the data that can be extracted in the form of features, while processing speed mostly affects classification time. Additionally, the performance of the machine learning system is connected to the type of data to be observed, which means that the performance of different setups differs depending on the use case.
APA, Harvard, Vancouver, ISO, and other styles
16

Fürderer, Niklas. "A Study of an Iterative User-Specific Human Activity Classification Approach." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-253802.

Full text
Abstract:
Applications for sensor-based human activity recognition use the latest algorithms for the detection and classification of everyday human activities, for both online and offline use cases. The insights generated by those algorithms can in a next step be used within a wide range of applications such as safety, fitness tracking, localization, personalized health advice and improved child and elderly care. In order for an algorithm to perform well, a significant amount of annotated data from the specific target audience is required. However, a satisfying data collection process is cost- and labor-intensive. It may also be unfeasible for specific target groups, as aging affects motion patterns and behaviors. One main challenge in this application area lies in the ability to identify relevant changes over time while being able to reuse previously annotated user data. The accurate detection of those user-specific patterns and movement behaviors therefore requires individual and adaptive classification models for human activities. The goal of this degree work is to compare the performance of several supervised classifiers when trained and tested with the new iterative user-specific human activity classification approach described in this report. A qualitative and quantitative data collection process was applied. The tree-based classification algorithms Decision Tree, Random Forest and XGBoost were tested on custom datasets divided into three groups. The datasets contained labeled motion data from wrist-worn sensors for 21 volunteers. Computed across all datasets, the average performance measured in recall increased by 5.2% (using a simulated leave-one-subject-out cross-evaluation) for algorithms trained via the described approach compared to a random non-iterative approach.
APA, Harvard, Vancouver, ISO, and other styles
17

Granström, Daria, and Johan Abrahamsson. "Loan Default Prediction using Supervised Machine Learning Algorithms." Thesis, KTH, Matematisk statistik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-252312.

Full text
Abstract:
It is essential for a bank to estimate the credit risk it carries and the magnitude of its exposure in case of non-performing customers. This kind of risk has been estimated by statistical methods for decades, and with recent developments in machine learning there has been interest in investigating whether machine learning techniques can quantify the risk better. The aim of this thesis is to examine which method, from a chosen set of machine learning techniques, exhibits the best performance in default prediction with regard to chosen model evaluation parameters. The investigated techniques were Logistic Regression, Random Forest, Decision Tree, AdaBoost, XGBoost, Artificial Neural Network and Support Vector Machine. An oversampling technique called SMOTE was implemented in order to treat the class imbalance in the response variable. The results showed that XGBoost without SMOTE obtained the best result with respect to the chosen model evaluation metric.
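SMOTE, mentioned above, balances classes by interpolating synthetic minority samples between a minority point and one of its nearest minority neighbors. A simplified stand-in sketch on toy two-feature "default" points (not the reference implementation):

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points: pick a minority sample, pick one of
    its k nearest minority neighbors, and interpolate between the two."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of base among the other minority points
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

defaults = [(0.9, 0.8), (1.0, 1.1), (1.2, 0.9)]  # toy minority-class points
print(smote(defaults, n_new=4))
```

Because every synthetic point lies on a segment between two real minority points, the oversampled class stays inside the region the minority already occupies.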
APA, Harvard, Vancouver, ISO, and other styles
18

Choi, Bong-Jin. "Statistical Analysis, Modeling, and Algorithms for Pharmaceutical and Cancer Systems." Scholar Commons, 2014. https://scholarcommons.usf.edu/etd/5200.

Full text
Abstract:
The aim of the present study is to develop statistical algorithms and models associated with breast and lung cancer patients. In this study, we developed several statistical software tools, R packages, and models using our new statistical approach. We used the five-parameter logistic model for determining the optimal doses of pharmaceutical drugs, including dynamic initial points, an automatic process for outlier detection, and an algorithm implemented in a graphical user interface (GUI) program. The developed statistical procedure assists medical scientists by reducing the time needed to determine the optimal dose of new drugs, and can also easily identify which drugs need more experimentation. Secondly, we developed a new classification method that is very useful in the health sciences. We used a new decision tree algorithm and a random forest method to rank our variables and to build a final decision tree model. The decision tree can identify and communicate complex data systems to scientists with minimal knowledge of statistics. Thirdly, we developed statistical packages using the Johnson SB probability distribution, which is important for parametrically studying a variety of health, environmental, and engineering problems. Scientists have experienced difficulties in obtaining estimates of the four parameters of this probability distribution. The developed algorithm combines several statistical procedures, such as Newton-Raphson, bisection, least squares estimation, and regression, in an R package. This R package has functions that generate random numbers, calculate probabilities and inverse probabilities, and estimate the four parameters of the Johnson SB probability distribution. Researchers can use the developed R package to build their own statistical models or perform the desired statistical simulations.
The final aspect of the study involves building a statistical model for lung cancer survival time. In developing this model, we have taken into consideration the number of cigarettes the patient smoked per day, the duration of smoking, and the age at diagnosis of lung cancer; the response variable is the survival time, and the significant factors include an interaction. The probability density function of the survival times has been obtained and the survival function determined. The analysis has been performed on groups involving gender and smoking factors, and a comparison with the ordinary survival function is given.
APA, Harvard, Vancouver, ISO, and other styles
19

Drábek, Matěj. "Využití vybraných metod strojového učení pro modelování kreditního rizika." Master's thesis, Vysoká škola ekonomická v Praze, 2017. http://www.nusl.cz/ntk/nusl-360509.

Full text
Abstract:
This master's thesis is divided into three parts. In the first part I describe P2P lending, its characteristics, basic concepts and practical implications, and compare the P2P markets in the Czech Republic, the UK and the USA. The second part consists of the theoretical basics of the chosen machine learning methods: naive Bayes classifier, classification tree, random forest and logistic regression. I also describe methods to evaluate the quality of the classification models listed above. The third part is practical and shows the complete workflow of creating a classification model, from data preparation to model evaluation.
APA, Harvard, Vancouver, ISO, and other styles
20

Ekeberg, Lukas, and Alexander Fahnehjelm. "Maskininlärning som verktyg för att extrahera information om attribut kring bostadsannonser i syfte att maximera försäljningspris." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-240401.

Full text
Abstract:
The Swedish real estate market has been digitalized over the past decade, and current practice is to post real estate advertisements online. A question that has arisen is how a seller can optimize the public listing to maximize the selling premium. This paper analyzes three machine learning methods applied to this problem: Linear Regression, Decision Tree Regressor and Random Forest Regressor. The aim is to retrieve information about how certain attributes contribute to the premium value. The dataset used contains apartments sold between 2014 and 2018 in the Östermalm / Djurgården district in Stockholm, Sweden. The resulting models returned an R² value of approx. 0.26 and a mean absolute error of approx. 0.06. While the models were not accurate at predicting the premium, information could still be extracted from them. In conclusion, a high number of views and a publication made in April provide the best conditions for an advertisement to reach a high selling premium. The seller should try to keep the number of days since publication below 15.5 and avoid publishing on a Tuesday.
APA, Harvard, Vancouver, ISO, and other styles
21

Consuegra, Rengifo Nathan Adolfo. "Detection and Classification of Anomalies in Road Traffic using Spark Streaming." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-238733.

Full text
Abstract:
Road traffic control has existed for a long time to guarantee the safety of vehicles and pedestrians. However, anomalies such as accidents or natural disasters cannot be avoided, so it is important to be prepared as early as possible to prevent a higher number of human losses. Nevertheless, there is no system accurate enough to detect and classify road traffic anomalies in real time. To address this, the following study proposes training a machine learning model for detection and classification of anomalies on the highways of Stockholm. Due to the lack of a labeled dataset, the first phase of the work is to detect the different kinds of outliers that can be found and manually label them based on the results of a data exploration study. Datasets containing information on accidents and weather are also included to further expand the set of anomalies. All experiments use real-world datasets coming either from the sensors located on the highways of Stockholm or from official accident and weather reports. Three models (Decision Tree, Random Forest and Logistic Regression) are then trained to detect and classify the outliers. The design of an Apache Spark Streaming application that uses the best-performing model is also provided. The outcomes indicate that Logistic Regression is better than the rest but still suffers from the imbalanced nature of the dataset. In the future, this project can be used not only to contribute to future research on similar topics but also to monitor the highways of Stockholm.
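The first phase described above, detecting outliers so they can be manually labeled, can be approximated with a simple z-score rule over a sensor series. A hedged sketch (the speed values and threshold are made-up illustrations, not the study's highway data):

```python
from statistics import mean, stdev

def label_outliers(values, z_thresh=3.0):
    """Flag readings whose z-score exceeds the threshold; such points
    become candidate anomaly labels for the supervised models."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > z_thresh for v in values]

speeds = [82, 80, 79, 81, 83, 80, 12, 81]  # km/h; one sudden slowdown
flags = label_outliers(speeds, z_thresh=1.5)
print([v for v, f in zip(speeds, flags) if f])  # only the 12 km/h reading
```

A single extreme reading inflates the standard deviation, which is why the threshold here is low; a robust variant (median absolute deviation) would be less sensitive to that.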
APA, Harvard, Vancouver, ISO, and other styles
22

Haris, Daniel. "Optimalizace strojového učení pro predikci KPI." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2018. http://www.nusl.cz/ntk/nusl-385922.

Full text
Abstract:
This thesis aims to optimize the machine learning algorithms used for predicting KPI metrics for an organization. The organization uses machine learning to predict whether projects will meet the planned deadlines of the last phase of the development process. The work focuses on the analysis of prediction models and sets the goal of selecting new candidate models for the prediction system. We have implemented a system that automatically selects the best feature variables for learning. Trained models were evaluated by several performance metrics and the best candidates were chosen for prediction. The candidate models achieved higher accuracy, which means that the prediction system provides more reliable responses. We suggested further improvements that could increase the accuracy of the forecast.
APA, Harvard, Vancouver, ISO, and other styles
23

Lantz, Robin. "Time series monitoring and prediction of data deviations in a manufacturing industry." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-100181.

Full text
Abstract:
An automated manufacturing industry makes use of many interacting moving parts and sensors. Data from these sensors generate complex multidimensional data in the production environment that is difficult to interpret and to find patterns in. This project provides tools to gain a deeper understanding of the production data of Swedsafe, a company in the automated manufacturing business, and shows the potential of that multidimensional data. The project mainly consists of predicting deviations from predefined threshold values in Swedsafe's production data. Machine learning is a good method for finding relationships in complex datasets, and supervised machine learning classification is used to predict deviations from the threshold values. An investigation is conducted to identify the classifier that performs best on Swedsafe's production data. The sliding window technique is used for managing the time series data in this project. Apart from predicting deviations, the project also includes an implementation of live graphs to easily get an overview of the production data. A steady production with stable process values is important, so being able to monitor and predict events in the production environment can provide the same benefit for other manufacturing companies and is therefore suitable not only for Swedsafe. The best-performing machine learning classifier tested in this project was the Random Forest classifier. The Multilayer Perceptron did not perform well on Swedsafe's data, but further investigation of recurrent neural networks using LSTM neurons is recommended. During the project, a web-based application displaying the sensor data in live graphs was also developed.
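The sliding window technique mentioned above cuts the sensor series into overlapping segments and summarizes each with features a classifier can consume. A minimal sketch (the window width, step, and toy series are assumptions for illustration, not Swedsafe's configuration):

```python
from statistics import mean, stdev

def sliding_windows(series, width, step):
    """Cut a time series into overlapping windows and summarize each
    with simple features (mean, spread, range) for a classifier."""
    features = []
    for start in range(0, len(series) - width + 1, step):
        window = series[start:start + width]
        features.append((mean(window), stdev(window), max(window) - min(window)))
    return features

sensor = [4.0, 4.1, 4.0, 7.9, 8.1, 8.0, 4.1, 4.0]  # toy process value
print(sliding_windows(sensor, width=4, step=2))
```

Each feature tuple becomes one training row; the label for a window can be whether the next reading crosses the predefined threshold, which turns deviation prediction into supervised classification.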
APA, Harvard, Vancouver, ISO, and other styles
24

Konečný, Antonín. "Využití umělé inteligence v technické diagnostice." Master's thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2021. http://www.nusl.cz/ntk/nusl-443221.

Full text
Abstract:
The diploma thesis is focused on the use of artificial intelligence methods for evaluating the fault condition of machinery. The evaluated data come from a vibrodiagnostic model for simulation of static and dynamic unbalance. Machine learning methods are applied, specifically supervised learning. The thesis describes the Spyder software environment, its alternatives, and the Python programming language in which the scripts are written. It contains an overview with a description of the libraries (Scikit-learn, SciPy, Pandas, ...) and methods: K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Decision Trees (DT) and Random Forest classifiers (RF). The classification results are visualized in a confusion matrix for each method. The appendix includes the scripts written for feature engineering, hyperparameter tuning, evaluation of learning success, and classification with visualization of the results.
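The per-method confusion matrices mentioned above tabulate true classes against predicted classes. A small self-contained sketch (the fault-class names and predictions are hypothetical, not the thesis's vibrodiagnostic data):

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows: true class, columns: predicted class."""
    index = {lab: i for i, lab in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

labels = ["balanced", "static", "dynamic"]  # hypothetical unbalance classes
y_true = ["balanced", "static", "static", "dynamic", "dynamic", "balanced"]
y_pred = ["balanced", "static", "dynamic", "dynamic", "dynamic", "balanced"]
for row in confusion_matrix(y_true, y_pred, labels):
    print(row)
```

The diagonal holds the correct classifications, so comparing the off-diagonal mass across KNN, SVM, DT and RF shows which classifier confuses which fault states.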
APA, Harvard, Vancouver, ISO, and other styles
25

Masetti, Masha. "Product Clustering e Machine Learning per il miglioramento dell'accuratezza della previsione della domanda: il caso Comer Industries S.p.A." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.

Find full text
Abstract:
The long lead times of Comer Industries S.p.A.'s Chinese supply chain force the company to order materials six months in advance, at a point when customers are often not yet aware of the quantities of material they will need. In order to respond to customers while maintaining the high service level historically guaranteed by Comer Industries, it is essential to order material based on demand forecasts. However, the current forecasts are not sufficiently accurate. The goal of this research is to identify a possible method to increase the accuracy of demand forecasts. Could the use of artificial intelligence contribute positively to improving forecast accuracy? To answer the research question, the K-Means and hierarchical clustering algorithms were implemented in Visual Basic for Applications in order to divide products into clusters based on common components. The demand patterns were then analyzed. By implementing different machine learning algorithms on Google Colaboratory, the resulting accuracies were compared and an optimal forecasting algorithm was identified for each demand profile. Finally, with the resulting forecasts, K-Means yielded an accuracy improvement of about 54.62% over the initial accuracy and a 47% saving in safety stock holding costs, while hierarchical clustering yielded an accuracy improvement of 11.15% and a 45% saving on current costs. It was therefore concluded that product clustering could contribute positively to forecast accuracy. Moreover, machine learning proved to be an ideal tool for identifying optimal solutions both within the clustering algorithms and within the forecasting methods.
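The K-Means step described above groups products by common components. A plain-Python sketch of the algorithm on hypothetical bill-of-materials indicator vectors (the data and k are illustrative; the thesis's implementation is in Visual Basic for Applications):

```python
import math
import random

def kmeans(points, k, iters=20, seed=1):
    """Plain k-means: assign each point to the nearest centroid,
    then recompute centroids as cluster means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Hypothetical bill-of-materials indicator vectors (1 = component used).
products = [(1, 1, 0, 0), (1, 1, 1, 0), (0, 0, 1, 1), (0, 1, 1, 1)]
centroids, clusters = kmeans(products, k=2)
print(sorted(len(c) for c in clusters))
```

Products landing in the same cluster share components, so their demand can be aggregated and forecast together, which is the basis for the accuracy comparison in the thesis.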
APA, Harvard, Vancouver, ISO, and other styles
26

Thanjavur, Bhaaskar Kiran Vishal. "Automatic generation of hardware Tree Classifiers." Thesis, 2017. https://hdl.handle.net/2144/23688.

Full text
Abstract:
Machine Learning is growing in popularity and spreading across different fields for various applications. Due to this trend, machine learning algorithms use different hardware platforms and are being experimented to obtain high test accuracy and throughput. FPGAs are well-suited hardware platform for machine learning because of its re-programmability and lower power consumption. Programming using FPGAs for machine learning algorithms requires substantial engineering time and effort compared to software implementation. We propose a software assisted design flow to program FPGA for machine learning algorithms using our hardware library. The hardware library is highly parameterized and it accommodates Tree Classifiers. As of now, our library consists of the components required to implement decision trees and random forests. The whole automation is wrapped around using a python script which takes you from the first step of having a dataset and design choices to the last step of having a hardware descriptive code for the trained machine learning model.
APA, Harvard, Vancouver, ISO, and other styles
27

Mistry, Pritesh, Daniel Neagu, Paul R. Trundle, and J. D. Vessey. "Using random forest and decision tree models for a new vehicle prediction approach in computational toxicology." 2015. http://hdl.handle.net/10454/7545.

Full text
Abstract:
Drug vehicles are chemical carriers that provide beneficial aid to the drugs they bear. Taking advantage of their favourable properties can potentially allow the safer use of drugs that are considered highly toxic. A means of vehicle selection without experimental trial would therefore save the industry time and money. Although machine learning is increasingly used in predictive toxicology, to our knowledge there is no reported work on using machine learning techniques to model drug-vehicle relationships for vehicle selection to minimise toxicity. In this paper we demonstrate the use of data mining and machine learning techniques to process, extract and build models based on classifiers (decision trees and random forests) that allow us to predict which vehicle would be most suited to reduce a drug's toxicity. Using data acquired from the National Institutes of Health's (NIH) Developmental Therapeutics Program (DTP), we propose a methodology using an area under a curve (AUC) approach that allows us to distinguish which vehicle provides the best toxicity profile for a drug, and we build classification models based on this knowledge. Our results show that we can achieve prediction accuracies of 80% using random forest models, whilst the decision tree models produce accuracies in the 70% region. We consider our methodology widely applicable within the scientific domain and beyond for comprehensively building classification models for the comparison of functional relationships between two variables.
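The AUC comparison mentioned above reduces, for a binary split, to the Mann-Whitney statistic: the probability that a randomly chosen positive outranks a randomly chosen negative. A sketch with hypothetical scores (not the NIH DTP data):

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative
    (ties count 0.5): the Mann-Whitney formulation of ROC AUC."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model scores: higher should mean "favourable toxicity profile".
low_tox = [0.9, 0.8, 0.7, 0.55]
high_tox = [0.6, 0.4, 0.3]
print(auc(low_tox, high_tox))  # 1 misranked pair out of 12
```

A value of 1.0 means the scores separate the two toxicity classes perfectly, 0.5 means the ranking is no better than chance.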
APA, Harvard, Vancouver, ISO, and other styles
28

Liu, Chao-Lun, and 劉兆倫. "An analysis for stock price prediction by VPIN based on decision tree and random forest models." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/9hncn5.

Full text
Abstract:
Master's thesis<br>National Chiao Tung University<br>Graduate Institute of Finance<br>107<br>This paper examines stocks with high and low information disclosure assessment ratings, using three machine learning models, decision tree (Gini), decision tree (entropy) and random forest, to test the volume-synchronized probability of informed trading (VPIN). With technical and market indicators as control variables, a total of 27 features were added to the models for prediction, and the results were compared with those obtained after adding PIN or VPIN, respectively, as a 28th feature. We found that adding PIN or VPIN can significantly improve the Type I error; we therefore focus on precision. Over the full sample period, adding VPIN yields significant improvement for both high- and low-rated companies under the random forest model. In the financial crisis period, however, companies with high information disclosure ratings benefit from the decision tree (Gini) and decision tree (entropy) models, which improve both precision and recall. Overall, the research showed that both PIN and VPIN improved predictive outcomes, with VPIN slightly better than PIN.
APA, Harvard, Vancouver, ISO, and other styles
29

Huang, Yu-Ren, and 黃裕仁. "Using Random Forest、RIPPER and Decision Tree In Data Mining for predicting With In Vitro Fertilization Success Rate." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/sfqcwc.

Full text
Abstract:
Master's thesis<br>National Formosa University<br>Master Program in Industrial Engineering and Management, Department of Industrial Management<br>105<br>The purpose of this study was to investigate IVF (in vitro fertilization), commonly known as the test-tube baby procedure, an established medical technology for treating infertility. What every patient cares about most is success or failure, and existing technology still cannot guarantee success. Before the first course of treatment, the physician uses the patient's age, pathogenic cause, and follicle-stimulating hormone index to predict the success rate. The couple's treatment depends on sperm count and health, ovarian activity, endometrial receptivity to the embryo, and so on; based on these, patients are given opinions and treatments to promote the success rate. This research uses data mining techniques to predict IVF outcomes and to construct IVF models, looking for rules that distinguish success from failure. Random forest gave the best prediction accuracy, 72%; the decision tree C4.5 and RIPPER algorithms were also used to construct models and rules. The study provides physicians with a reference for predicting the IVF success rates of patients.
APA, Harvard, Vancouver, ISO, and other styles
30

YEH, TZU-WEI, and 葉子維. "Analysis of Consumer Behaviors of Bank and Prediction of users in Mobile Banking-Comparison of Decision Tree, Random Forest and Discriminant Analysis." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/ma57kv.

Full text
Abstract:
Master's thesis<br>National Taipei University<br>Department of Statistics<br>106<br>In recent years, with the rise of smartphones, Bank 3.0, fintech and mobile payments, more and more people can handle finance, consumption, transportation and other daily needs without using cash. The financial industry is therefore paying increasing attention to the development of mobile banking, and the purpose of this study is to determine what kind of customers mobile banking attracts. This study aims to predict potential users of mobile banking based on customers' credit card consumption characteristics. Mobile banking users are defined as customers who use mobile banking within three months after credit card spending. Customers of a local bank with more than 30 credit card expenses during February to July 2017 were taken as the research population, and a sample of 7,700 customers' transaction records was selected by simple random sampling. Customer consumer behavior is defined by the consumption characteristics of the RFM model. Random forest, decision tree, and linear discriminant analysis (LDA) were conducted to predict potential mobile banking users and to compare prediction performance. Study results show that random forest has the highest prediction accuracy, LDA has higher sensitivity, and decision tree has the lowest accuracy. Finally, market segmentation marketing is suggested based on the prediction results.
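The RFM model mentioned above summarizes each customer's credit card behavior as Recency, Frequency and Monetary value. A minimal sketch (the transactions are invented; the study's exact definitions may differ in detail):

```python
from datetime import date

def rfm(transactions, today):
    """Per customer: Recency (days since last purchase),
    Frequency (purchase count), Monetary (total spend)."""
    out = {}
    for cust, day, amount in transactions:
        last, f, m = out.get(cust, (None, 0, 0.0))
        if last is None or day > last:
            last = day
        out[cust] = (last, f + 1, m + amount)
    return {c: ((today - last).days, f, m) for c, (last, f, m) in out.items()}

tx = [("A", date(2017, 7, 2), 120.0),
      ("A", date(2017, 6, 20), 45.0),
      ("B", date(2017, 5, 1), 300.0)]
print(rfm(tx, today=date(2017, 7, 31)))
```

The three resulting features per customer are exactly the kind of input the study feeds to random forest, decision tree and LDA.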
APA, Harvard, Vancouver, ISO, and other styles
31

BARSACCHI, MARCO. "Fuzzy Methods for Machine Learning. A Big Data Perspective." Doctoral thesis, 2019. http://hdl.handle.net/2158/1150519.

Full text
Abstract:
More than fifty years after its introduction, fuzzy set theory is still thriving and continues to play a relevant role in a wide number of scientific applications. Nevertheless, while the enrichments that fuzzy logic and set theory can provide are manifold, the recognition of fuzzy sets and logic inside the machine learning community remains rather moderate. In this thesis, we present several approaches aimed at improving machine learning techniques using tools borrowed from fuzzy set theory and logic. In particular, we try to focus more on the machine learning perspective, thus inviting machine learning researchers to appreciate the modelling strengths of fuzzy set theory. We begin by presenting FDT-Boost, a boosting approach shaped according to the SAMME-AdaBoost scheme, which leverages fuzzy binary decision trees as base classifiers; then, we explore a distributed fuzzy random forest, DFRF, that leverages the Apache Spark framework to generate an efficient and effective classifier for big data. We also propose a novel approach for generating, out of big data, a set of fuzzy rule-based classifiers characterised by different optimal trade-offs between accuracy and interpretability. The approach, dubbed DPAES-FDT-GL, extends a state-of-the-art distributed multi-objective evolutionary learning scheme, implemented in the Apache Spark environment. Lastly, we focus on an application, showing how fuzzy systems could be employed to support medical decisions; we propose a novel pipeline to support tumour type classification and rule extraction based on somatic CNV data. The pipeline outputs an interpretable Fuzzy Rule-Based Classifier (FRBC). Much work remains to be done, and fuzzy set theory still has a big role to play in machine learning.
APA, Harvard, Vancouver, ISO, and other styles
32

Maggi, Piero. "Enhanced web analytics for health insurance." Master's thesis, 2020. http://hdl.handle.net/10362/101010.

Full text
Abstract:
Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics<br>Nowadays companies need to invest in and improve data solutions within most business workflows and processes in order to differentiate their offer and stay ahead of their competitors. It is becoming more and more important to take data-driven decisions to boost profitability and improve the overall customer experience. In this way, strategies are no longer defined on common beliefs and assumptions, but on contextualized and trustworthy insights. This report describes the work done during a 9-month internship in order to provide the business with a new and improved solution for enhancing web analytics tasks and supporting the improvement of the online user digital experience. User-level data related to website activity was extracted at the highest granularity level. Afterwards, the raw data were cleaned and stored in an analytical base table, with which an initial data exploration was made. After giving initial insights to the digital team, a predictive model was developed to predict the probability of users buying the insurance product online. Finally, based on the initial data exploration and the model's results, a set of recommendations was built and provided to the digital department for implementation, in order to make the website more engaging and dynamic.
APA, Harvard, Vancouver, ISO, and other styles
33

Mylnikova, Ekaterina. "Multiclass Classification of Motor Insurance Customers in Portugal." Master's thesis, 2021. http://hdl.handle.net/10362/127584.

Full text
Abstract:
The insurance market is highly competitive. To stay in line with other companies in today's world, it is not enough for a company to have the best price. The most important move now is to make a personalized offer to each client. Insurance companies have an enormous amount of data that can be used to understand their customers better. What do they want? What offer would attract new clients, and what offer would keep existing customers from leaving? The project aims to classify customers’ profiles based on their individual preferences in motor insurance.<br>Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
APA, Harvard, Vancouver, ISO, and other styles
34

Lumpe, Lars. "The Foundation of Pattern Structures and their Applications." 2021. https://tud.qucosa.de/id/qucosa%3A76163.

Full text
Abstract:
This thesis is divided into a theoretical part, aimed at developing statements around the newly introduced concept of pattern morphisms, and a practical part, where we present use cases of pattern structures. A first insight of our work clarifies the facts on projections of pattern structures: we discovered that a projection of a pattern structure does not always lead to a pattern structure again. A solution to this problem, and one of the most important points of this thesis, is the introduction of pattern morphisms in Chapter 4. Pattern morphisms make it possible to describe relationships between pattern structures, and thus enable a deeper understanding of pattern structures in general. They also provide the means to describe projections of pattern structures that lead to pattern structures again. In Chapter 5 and Chapter 6, we looked at the impact of morphisms between pattern structures on concept lattices and on their representations, and thus clarified the theoretical background of existing research in this field. The application part reveals that random forests can be described through pattern structures, which constitutes another central achievement of our work. In order to demonstrate the practical relevance of our findings, we included a use case where this finding is used to build an algorithm that solves a real-world classification problem for red wines. The prediction accuracy of the random forest is better, but its high interpretability makes our algorithm valuable. Another approach to the red wine classification problem is presented in Chapter 8, where, starting from an elementary pattern structure, we built a classification model that yielded good results.
APA, Harvard, Vancouver, ISO, and other styles
35

Silvestre, Martinho de Matos. "Three-stage ensemble model : reinforce predictive capacity without compromising interpretability." Master's thesis, 2019. http://hdl.handle.net/10362/71588.

Full text
Abstract:
Thesis proposal presented as partial requirement for obtaining the Master’s degree in Statistics and Information Management, with specialization in Risk Analysis and Management<br>Over the last decade, several banks have developed models to quantify credit risk. In addition to monitoring the credit portfolio, these models also help decide on the acceptance of new contracts, assess customer profitability, and define pricing strategy. The objective of this paper is to improve the approach to credit risk modeling, namely scoring models that predict default events. To this end, we propose the development of a three-stage ensemble model that combines the interpretability of the Scorecard with the predictive power of machine learning algorithms. The results show that the ROC index improves by 0.5%-0.7% and Accuracy by 0%-1%, taking the Scorecard as the baseline.
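The general idea behind this kind of ensemble can be sketched as follows: an interpretable logistic baseline (standing in for a Scorecard) is stacked with a tree ensemble, and both are compared by AUC. This is a minimal illustration assuming scikit-learn and synthetic data, not the thesis's actual three-stage model; all names are illustrative.

```python
# Hedged sketch: stacking an interpretable "scorecard"-style logistic baseline
# with a random forest, then comparing both by ROC AUC on held-out data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real credit portfolio.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Interpretable baseline.
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Stacked ensemble: base learners feed their predictions to a meta-learner.
stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=100, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
).fit(X_tr, y_tr)

auc_base = roc_auc_score(y_te, base.predict_proba(X_te)[:, 1])
auc_stack = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(f"baseline AUC {auc_base:.3f} | stacked AUC {auc_stack:.3f}")
```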
APA, Harvard, Vancouver, ISO, and other styles
36

Hellwig, Niels. "Spatial patterns of humus forms, soil organisms and soil biological activity at high mountain forest sites in the Italian Alps." Doctoral thesis, 2018. https://repositorium.ub.uni-osnabrueck.de/handle/urn:nbn:de:gbv:700-20181024676.

Full text
Abstract:
The objective of the thesis is the model-based analysis of spatial patterns of decomposition properties on the forested slopes of the montane level (ca. 1200-2200 m a.s.l.) in a study area in the Italian Alps (Val di Sole / Val di Rabbi, Autonomous Province of Trento). The analysis includes humus forms and enchytraeid assemblages as well as pH values, activities of extracellular enzymes and C/N ratios of the topsoil. The first aim is to develop, test and apply data-based techniques for spatial modelling of soil ecological parameters. This methodological approach is based on the concept of digital soil mapping. The second aim is to reveal the relationships between humus forms, soil organisms and soil microbiological parameters in the study area. The third aim is to analyze if the spatial patterns of indicators of decomposition differ between the landscape scale and the slope scale. At the landscape scale, sample data from six sites are used, covering three elevation levels at both north- and south-facing slopes. A knowledge-based approach that combines a decision tree analysis with the construction of fuzzy membership functions is introduced for spatial modelling. According to the sampling design, elevation and slope exposure are the explanatory variables. The investigations at the slope scale refer to one north-facing and one south-facing slope, with 30 sites occurring on each slope. These sites have been derived using conditioned Latin Hypercube Sampling, and thus reasonably represent the environmental conditions within the study area. Predictive maps have been produced in a purely data-based approach with random forests. At both scales, the models indicate a high variability of spatial decomposition patterns depending on the elevation and the slope exposure. In general, sites at high elevation on north-facing slopes almost exclusively exhibit the humus forms Moder and Mor. Sites on south-facing slopes and at low elevation exhibit also Mull and Amphimull. 
The predictions of those enchytraeid species characterized as Mull and Moder indicators match the occurrence of the corresponding humus forms well. Furthermore, with respect to the mineral topsoil, the predictive models show increasing pH values, an increasing leucine-aminopeptidase activity, an increasing ratio of alkaline to acid phosphomonoesterase activity, and a decreasing C/N ratio from north-facing to south-facing slopes and from high to low elevation. The predicted spatial patterns of indicators of decomposition are basically similar at both scales. However, the patterns are predicted in more detail at the slope scale because of the larger data basis and the higher spatial precision of the environmental covariates. These factors enable the observation of additional correlations between the spatial patterns of indicators of decomposition and environmental influences, for example slope angle and curvature. Both the corresponding results and broad model evaluations have shown that the applied methods are generally suitable for modelling spatial patterns of indicators of decomposition in a heterogeneous high mountain environment. The overall results suggest that the humus form can be used as an indicator of organic matter decomposition processes in the investigated high mountain area.
APA, Harvard, Vancouver, ISO, and other styles
37

Elmasry, Mohamed Hani Abdelhamid Mohamed Tawfik. "Machine learning approach for credit score analysis : a case study of predicting mortgage loan defaults." Master's thesis, 2019. http://hdl.handle.net/10362/62427.

Full text
Abstract:
Dissertation submitted in partial fulfilment of the requirements for the degree of Statistics and Information Management specialized in Risk Analysis and Management<br>To manage credit score analysis effectively, financial institutions have developed techniques and models designed mainly to improve the assessment of creditworthiness during the credit evaluation process. The foremost objective is to separate their clients – borrowers – into either the non-defaulter group, which is more likely to meet its financial obligations, or the defaulter group, which has a higher probability of failing to pay its debts. In this paper, we use machine learning models to predict mortgage defaults. This study employs several single-classifier machine learning methodologies, including Logistic Regression, Classification and Regression Trees, Random Forest, K-Nearest Neighbors, and Support Vector Machine. To further improve the predictive power, a meta-algorithm ensemble approach – stacking – is introduced to combine the outputs – probabilities – of the aforementioned methods. The sample for this study is based solely on the dataset publicly provided by Freddie Mac. With this approach, we achieve an improvement in predictive performance. We then compare the performance of each model, and of the meta-learner, by plotting the ROC Curve and computing the AUC. This study is an extension of various preceding studies that used different techniques to further enhance predictive power. Finally, our results are compared with work from different authors.
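The model-comparison step described above can be sketched as a loop that fits several single classifiers and ranks them by test-set AUC. This is a hedged illustration assuming scikit-learn, with synthetic data in place of the Freddie Mac dataset and a subset of the listed methods; it is not the study's actual pipeline.

```python
# Hedged sketch: ranking single classifiers by held-out ROC AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a mortgage default dataset (binary target).
X, y = make_classification(n_samples=800, n_features=12, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=2),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=2),
    "k-nearest neighbors": KNeighborsClassifier(),
}
# AUC uses predicted probabilities of the positive class.
aucs = {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, m in models.items()}
print(sorted(aucs.items(), key=lambda kv: -kv[1]))
```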
APA, Harvard, Vancouver, ISO, and other styles
38

Yehe, Nala. "Automatic Patent Classification." Thesis, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-49594.

Full text
Abstract:
Patents have great research value and are also beneficial to the industrial, commercial, legal, and policymaking communities. Effective analysis of patent literature can reveal important technical details and relationships, explain business trends, propose novel industrial solutions, and inform crucial investment decisions. Therefore, we should carefully analyze patent documents and make use of the value of patents. Generally, patent analysts need a certain degree of expertise in various research fields, including information retrieval, data processing, text mining, field-specific technology, and business intelligence. In practice, it is difficult to find and train such an analyst in a relatively short period of time so that he or she meets the requirements of multiple disciplines. Patent classification is also crucial in processing patent applications, because it enables people to manage and maintain patent texts better and more flexibly. In recent years, the number of patents worldwide has increased dramatically, which makes it very important to design an automatic patent classification system. Such a system can replace time-consuming manual classification, thus providing patent analysis managers with an effective way of managing patent texts. This paper designs a patent classification system based on data mining methods and machine learning techniques and uses the KNIME software to conduct a comparative analysis across different machine learning methods and different parts of a patent. The purpose of this thesis is to use text data processing methods and machine learning techniques to classify patents automatically. It mainly includes two parts: the first is data preprocessing and the second is the application of machine learning techniques. The research questions are: Which part of a patent performs best as input data for automatic classification? And which of the implemented machine learning algorithms performs best regarding the classification of IPC keywords? This thesis uses design science research as its method for researching and analyzing this topic. It uses the KNIME platform to apply the machine learning techniques, which include decision tree, XGBoost linear, XGBoost tree, SVM, and random forest. The implementation includes collecting data, preprocessing data, extracting feature words, and applying classification techniques. A patent document consists of many parts, such as the description, abstract, and claims. In this thesis, we feed these three groups of input data to our models separately and then compare the performance of the three parts. Based on the results obtained from these three experiments, we suggest using the description part in the classification system because it shows the best performance in English patent text classification. The abstract can serve as an auxiliary standard for classification. However, classification based on the claims part, as proposed by some scholars, did not achieve good performance in our research. Besides, the BoW and TF-IDF methods can be used together to extract feature words efficiently. In addition, we found that the SVM and XGBoost techniques performed best in our automatic patent classification system.
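The core pipeline described here, TF-IDF feature extraction followed by an SVM classifier, can be sketched in a few lines. This is a minimal illustration assuming scikit-learn rather than KNIME, with a tiny invented corpus standing in for real patent sections; the documents, labels, and query are all hypothetical.

```python
# Hedged sketch: TF-IDF features + linear SVM for patent-like text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented mini-corpus; in the thesis, the description, abstract, or claims
# section of each patent would be used as the input text.
docs = [
    "rotor blade assembly for wind turbine",
    "wind turbine generator control system",
    "pharmaceutical composition for treating inflammation",
    "anti-inflammatory compound and dosage form",
]
labels = ["mechanical", "mechanical", "chemical", "chemical"]

# TF-IDF vectorization feeds directly into a linear SVM.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)

pred = clf.predict(["turbine blade pitch control"])
print(pred)
```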
APA, Harvard, Vancouver, ISO, and other styles
39

(9380318), Min Namgung. "Performance Comparison of Public Bike Demand Predictions: The Impact of Weather and Air Pollution." Thesis, 2020.

Find full text
Abstract:
Many metropolitan cities motivate people to use public bike-sharing programs as alternative transportation for many reasons. Due to their popularity, multiple strands of research on optimizing public bike-sharing systems have been conducted at the city, neighborhood, station, or user level to predict public bike demand. Previously, research on public bike demand prediction primarily focused on discovering a relationship with weather as an external factor that possibly impacts bike usage, or on analyzing the bike user trend in a single aspect. This work hypothesizes two external factors that are likely to affect public bike demand: weather and air pollution. This study uses a public bike dataset, daily temperature and precipitation data, and air quality data to discover the trend of bike usage using multiple machine learning techniques such as Decision Tree, Naïve Bayes, and Random Forest. Each algorithm’s output is then evaluated with performance measures such as accuracy, precision, and sensitivity. As a result, Random Forest is an efficient classifier for bike demand prediction by weather and precipitation, and Decision Tree performs best for bike demand prediction by air pollutants. Also, the three-class labeling of daily bike demand has high specificity and makes it easy to trace the trend of the public bike system.
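The comparison described here, fitting Decision Tree, Naïve Bayes, and Random Forest on the same data and scoring each, can be sketched as below. This is a hedged illustration assuming scikit-learn and a synthetic multi-class dataset in place of the actual bike, weather, and air quality data.

```python
# Hedged sketch: comparing Decision Tree, Naive Bayes, and Random Forest
# by held-out accuracy, mirroring the paper's classifier comparison.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Three classes stand in for the paper's three daily-demand labels.
X, y = make_classification(n_samples=600, n_features=8, n_classes=3,
                           n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "naive bayes": GaussianNB(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
}
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
print(scores)
```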
APA, Harvard, Vancouver, ISO, and other styles
