Dissertations / Theses: 'K-fold validation'

1

Sood, Radhika. "Comparative Data Analytic Approach for Detection of Diabetes." University of Cincinnati / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1544100930937728.

2

Ording, Marcus. "Context-Sensitive Code Completion : Improving Predictions with Genetic Algorithms." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-205334.

Abstract:

Within the area of context-sensitive code completion there is a need for accurate predictive models in order to provide useful code completion predictions. The traditional method for optimizing the performance of code completion systems is to empirically evaluate the effect of each system parameter individually and fine-tune the parameters. This thesis presents a genetic algorithm that can optimize the system parameters with a degree of freedom equal to the number of parameters to optimize. The study evaluates the effect of the optimized parameters on the prediction quality of the studied code completion system. The previous evaluation of the reference code completion system is also extended to include model size and inference speed. The results of the study show that the genetic algorithm is able to improve the prediction quality of the studied code completion system. Compared with the reference system, the enhanced system is able to recognize 1 in 10 additional previously unseen code patterns. This increase in prediction quality does not significantly impact the system performance, as the inference speed remains less than 1 ms for both systems.

3

Piják, Marek. "Klasifikace emailové komunikace." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2018. http://www.nusl.cz/ntk/nusl-385889.

Abstract:

This diploma thesis is centered on creating a classifier that can recognize the email communication received daily by Topefekt s.r.o. and assign it to a classification class. The project implements some of the most commonly used classification methods, including machine learning. The thesis also includes an evaluation comparing all of the methods used.

4

Birba, Delwende Eliane. "A Comparative study of data splitting algorithms for machine learning model selection." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-287194.

Abstract:

Data splitting is commonly used in machine learning to split data into training, validation, and test sets. This approach allows us to find the model hyper-parameters and also to estimate the generalization performance. In this research, we conducted a comparative analysis of different data partitioning algorithms on both real and simulated data. Our main objective was to address the question of how the choice of data splitting algorithm can improve the estimation of the generalization performance. The data splitting algorithms used in this study were variants of k-fold, Kennard-Stone (KS), SPXY (sample set partitioning based on joint x-y distance), and the random sampling algorithm. Each algorithm divided the data into two subsets, training and validation. The training set was used to fit the model and the validation set to evaluate it. We then analyzed the different data splitting algorithms based on the generalization performances estimated from the validation set and the external test set. From the results, we noted that the important determinant of good generalization is the size of the dataset. For all the data splitting methods applied to small data sets, the gap between the performance estimated on the validation set and on the test set was significant. However, we noted that the gap narrowed when there was more data in training or validation. Too much or too little data in the training set can also lead to bad model performance, so it is important to have a reasonable balance between the training and validation set sizes. In our study, KS and SPXY were the splitting algorithms with the poorest model performance estimation. Indeed, these methods select the most representative samples to train the model, and poorly representative samples are left for model performance estimation.
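The selection behavior attributed to Kennard-Stone in this abstract can be illustrated with a minimal sketch. It assumes the textbook form of the algorithm (start from the two most distant points, then repeatedly add the point whose nearest already-selected neighbor is farthest away) and is not taken from the thesis; the function name `kennard_stone` and the toy points are hypothetical.

```python
import math

def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone rule.

    Illustrative sketch only; assumes n_select >= 2 and Euclidean distance.
    """
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    # Start with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(*pair),
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    while len(selected) < n_select and remaining:
        # Add the point whose nearest selected neighbor is farthest away.
        nxt = max(sorted(remaining),
                  key=lambda i: min(dist(i, s) for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

# Toy data: five points on a line, one of them far from the rest.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
train = kennard_stone(points, 3)
validation = sorted(set(range(len(points))) - set(train))
print(train, validation)
```

On this toy input the training set takes the extremes and the point that best covers the gap, while the validation set is left with two interior points, which is consistent with the abstract's remark that the less representative samples end up in the validation set.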

5

Martins, Natalie Henriques. "Modelos de agrupamento e classificação para os bairros da cidade do Rio de Janeiro sob a ótica da Inteligência Computacional: Lógica Fuzzy, Máquinas de Vetores Suporte e Algoritmos Genéticos." Universidade do Estado do Rio de Janeiro, 2015. http://www.bdtd.uerj.br/tde_busca/arquivo.php?codArquivo=9502.

Abstract:

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Since 2011, events of great significance for the city of Rio de Janeiro have taken place and will continue to take place, such as the United Nations Rio+20 conference and sporting events of worldwide importance (the FIFA World Cup, the Olympic Games and the Paralympic Games). These events make it possible to attract financial resources to the city, as well as to generate jobs, improve infrastructure and raise property values, for both land and buildings. When choosing a residential property in a given neighborhood, one evaluates not only the property itself but also the urban amenities available in the area. In this context, it was possible to define a qualitative linguistic interpretation inherent to the neighborhoods of the city of Rio de Janeiro by integrating three Computational Intelligence techniques for the evaluation of benefits: Fuzzy Logic, Support Vector Machines and Genetic Algorithms. The database was built from information obtained from the web and from government institutes, showing the cost of residential properties and the benefits and weaknesses of the city's neighborhoods. Fuzzy Logic was first implemented as an unsupervised clustering model through Ellipsoidal Rules by the Extension Principle using the Mahalanobis distance, inferentially configuring the groups of linguistic designation (Good, Regular and Bad) according to twelve urban characteristics. From this discrimination, it became feasible to use the Support Vector Machine integrated with Genetic Algorithms as a supervised method, in order to search for and select the smallest subset of the variables present in the clustering that best classifies the neighborhoods (Principle of Parsimony). The analysis of the error rates made it possible to choose the best classification model with a reduced variable space, resulting in a subset that contains information on: HDI, number of bus lines, educational institutions, average value per square meter, open-air spaces, entertainment venues and crimes. The model combining the three Computational Intelligence techniques ranked the neighborhoods of Rio de Janeiro with acceptable error rates, supporting decision-making in the purchase and sale of residential properties. With regard to public transport in the city in question, it was possible to see that the road network is still the priority.

6

Luo, Shan. "Advanced Statistical Methodologies in Determining the Observation Time to Discriminate Viruses Using FTIR." Digital Archive @ GSU, 2009. http://digitalarchive.gsu.edu/math_theses/86.

Abstract:

Fourier transform infrared (FTIR) spectroscopy, a method that uses electromagnetic radiation to detect specific cellular molecular structures, can be used to discriminate between different types of cells. The objective is to find the minimum time (chosen among 2, 4 and 6 hours) to record FTIR readings such that different viruses can be discriminated. A new method is adopted for the datasets. Briefly, inner differences are created as the control group, and the Wilcoxon signed-rank test is used as the first variable selection procedure in order to prepare for the next stage of discrimination. In the second stage we propose either the partial least squares (PLS) method or simply taking significant differences as the discriminator. Finally, k-fold cross-validation is used to estimate the shrinkage of the goodness measures, such as sensitivity, specificity and area under the ROC curve (AUC). There is no doubt in our mind that 6 hours is enough for discriminating mock from Hsv1 and Coxsackie viruses. Adeno virus is an exception.

7

Tandan, Isabelle, and Erika Goteman. "Bank Customer Churn Prediction : A comparison between classification and evaluation methods." Thesis, Uppsala universitet, Statistiska institutionen, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-411918.

Abstract:

This study aims to assess which supervised statistical learning method (random forest, logistic regression or K-nearest neighbor) is best at predicting bank customer churn. Additionally, the study evaluates which cross-validation approach (k-Fold cross-validation or leave-one-out cross-validation) yields the most reliable results. Predicting customer churn has increased in popularity since new technology, regulation and changed demand have led to an increase in competition for banks. Banks therefore have all the more reason to acknowledge the importance of maintaining their customer base. The findings of this study are that an unrestricted random forest model estimated using k-Fold is preferable in terms of performance measurements, computational efficiency and from a theoretical point of view. Although k-Fold cross-validation and leave-one-out cross-validation yield similar results, k-Fold cross-validation is preferable due to its computational advantages. For future research, methods that generate models with both good interpretability and high predictability would be beneficial, in order to combine the knowledge of which customers end their engagement with an understanding of why. Moreover, an interesting topic for future research would be to analyze at which dataset size leave-one-out cross-validation and k-Fold cross-validation yield the same results.
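The computational advantage mentioned in this abstract can be made concrete with a small sketch: leave-one-out cross-validation is the special case of k-fold with k equal to the number of observations, so it requires one model fit per observation. The helper `cv_splits` and the dataset size are hypothetical, and the folds are contiguous (no shuffling) purely to keep the example short.

```python
def cv_splits(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n items."""
    # Fold sizes differ by at most one when n is not divisible by k.
    sizes = [n // k + (1 if f < n % k else 0) for f in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

n = 1000  # hypothetical number of customers

# Each split corresponds to one model fit.
fits_10_fold = sum(1 for _ in cv_splits(n, 10))
fits_loocv = sum(1 for _ in cv_splits(n, n))  # leave-one-out: k == n
print(fits_10_fold, fits_loocv)
```

Under these assumptions 10-fold needs 10 fits and leave-one-out needs 1000, each on a training set of nearly the full size.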

8

Radeschnig, David. "Modelling Implied Volatility of American-Asian Options : A Simple Multivariate Regression Approach." Thesis, Mälardalens högskola, Akademin för utbildning, kultur och kommunikation, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-28951.

Abstract:

This report focuses on implied volatility for American-style Asian options, and on a least squares approximation method as a way of estimating its magnitude. Asian option prices are calculated or approximated based on Quasi-Monte Carlo simulations and least squares regression, where a known volatility is used as input. A regression tree then empirically builds a database of regression vectors for the implied volatility based on the simulated output of option prices. The mean squared errors between imputed and estimated volatilities are then compared using a five-fold cross-validation test as well as the non-parametric Kruskal-Wallis hypothesis test of equal distributions. The study results in a proposed semi-parametric model for estimating implied volatilities from options. The user must, however, be aware that this model may suffer from bias in estimation, and it should therefore be used with caution.

9

Bodin, Camilla. "Automatic Flight Maneuver Identification Using Machine Learning Methods." Thesis, Linköpings universitet, Reglerteknik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-165844.

Abstract:

This thesis proposes a general approach to solve the offline flight-maneuver identification problem using machine learning methods. The purpose of the study was to provide means for the aircraft professionals at the flight test and verification department of Saab Aeronautics to automate the procedure of analyzing flight test data. The suggested approach succeeded in generating binary classifiers and multiclass classifiers that identified six flight maneuvers of different complexity from real flight test data. The binary classifiers solved the problem of identifying one maneuver from flight test data at a time, while the multiclass classifiers solved the problem of identifying several maneuvers from flight test data simultaneously. To achieve these results, the difficulties that this time series classification problem entailed were simplified by using different strategies. One strategy was to develop a maneuver extraction algorithm that used handcrafted rules. Another strategy was to represent the time series data by statistical measures. There was also an issue of an imbalanced dataset, where one class far outweighed the others in number of samples. This was solved by using a modified oversampling method on the dataset that was used for training. Logistic Regression, Support Vector Machines with both linear and nonlinear kernels, and Artificial Neural Networks were explored, where the hyperparameters for each machine learning algorithm were chosen during model estimation by 4-fold cross-validation and by solving an optimization problem based on important performance metrics. A feature selection algorithm was also used during model estimation to evaluate how the performance changes depending on how many features are used. The machine learning models were then evaluated on test data consisting of 24 flight tests. The results given by the test data set showed that the simplifications made were reasonable, but the maneuver extraction algorithm could sometimes fail. Some maneuvers were easier to identify than others, and the linear machine learning models resulted in a poor fit to the more complex classes. In conclusion, both binary classifiers and multiclass classifiers could be used to solve the flight maneuver identification problem, and solving a hyperparameter optimization problem boosted the performance of the finalized models. Nonlinear classifiers performed the best on average across all explored maneuvers.

10

Po-Yang Yeh and 葉柏揚. "A Study on the Appropriateness of Repeating K-fold Cross Validation." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/6jc74q.

Abstract:

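Master's
National Cheng Kung University
Department of Industrial and Information Management
Academic year 105 (ROC calendar)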
K-fold cross validation is a popular approach for evaluating the performance of classification algorithms. The variance of the accuracy estimate resulting from this approach is generally relatively large for conservative inference. Several studies have therefore suggested repeatedly performing K-fold cross validation to reduce the variance. Most of them did not consider the correlation among the repetitions of K-fold cross validation, and hence the variance could be underestimated. The purpose of this thesis is to study the appropriateness of repeating K-fold cross validation. We first investigate whether the accuracy estimates obtained from the repetitions of K-fold cross validation can be assumed to be independent. The K-nearest neighbor algorithm with K = 1 is used to analyze the dependency relationships among the predictions of two repetitions of K-fold cross validation. Statistical methods are also proposed to test the strength of the dependency relationships. The experimental results on twenty data sets show that the predictions in two repetitions of K-fold cross validation are generally highly correlated, and that the correlation becomes higher as the number of folds increases. The results of a simulation study suggest that K-fold cross validation with a small number of repetitions and a large number of folds should be adopted.
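The dependency between repetitions described in this abstract can be explored with a rough sketch. It runs K-fold cross validation twice with different random partitions, using a 1-nearest-neighbor rule on synthetic one-dimensional data, and reports how often the two repetitions predict the same label for the same instance. The data, seeds, function names, and the simple agreement measure are all assumptions made for illustration; the thesis uses its own data sets and statistical tests.

```python
import random

def one_nn_predict(train, x):
    """Label of the training point closest to x (1-nearest neighbor)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def cv_predictions(data, k, seed):
    """One K-fold pass; returns the held-out prediction for every instance."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    preds = [None] * len(data)
    for f in range(k):
        test = set(idx[f::k])  # every k-th shuffled index forms one fold
        train = [data[i] for i in idx if i not in test]
        for i in test:
            preds[i] = one_nn_predict(train, data[i][0])
    return preds

# Synthetic two-class data: (feature, label) pairs, hypothetical.
rng = random.Random(42)
data = ([(rng.gauss(0.0, 1.0), 0) for _ in range(50)]
        + [(rng.gauss(1.5, 1.0), 1) for _ in range(50)])

rep_a = cv_predictions(data, k=10, seed=1)
rep_b = cv_predictions(data, k=10, seed=2)

# Fraction of instances that receive the same prediction in both repetitions.
agreement = sum(a == b for a, b in zip(rep_a, rep_b)) / len(data)
print(f"agreement between repetitions: {agreement:.2f}")
```

One plausible reading of the reported trend: with more folds, an instance's true nearest neighbor is more likely to be in the training portion in both repetitions, so the two predictions coincide more often.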

11

Jing-Tai Tsai and 蔡敬泰. "Dependency Analysis of the Accuracy Estimates Obtained from k-fold Cross Validation." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/ctp6zg.

12

Chang, Chih-Hsiang, and 張智翔. "Hollow Ball Screw Nut Preload Diagnosis by Support Vector Machine with K-Fold Cross Validation." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/k99q42.

Abstract:

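Master's
National Changhua University of Education
Department of Mechatronics Engineering
Academic year 106 (ROC calendar)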
The purpose of this thesis is to develop a diagnosis method for ball screw nuts with different preloads by analyzing signals from different operating conditions. This research focuses on how to diagnose the feed drive status of a machine tool based on a short warm-up time before manufacturing, since an industrial ball screw takes a long period of operation to reach failure mode. In the experiments, the ball nut preload is varied among 2%, 4% and 6% of the maximum dynamic load. The motor load current, linear encoder signal and motor revolution speed signal were acquired and used as inputs to a Support Vector Machine (SVM). A linear kernel function and a radial basis function kernel were used for the classification hyperplane. The k-fold cross validation is used to improve the parameters of the SVM classification. Experimental results show that it is possible to distinguish different ball nut preload statuses by feeding the motor current, linear scale and motor revolution speed signals into an SVM with k-fold cross validation. Experimental results also show that the early warning module for ball screw failure, developed using an SVM with the k-fold cross validation method, is successful and promising.

13

Jian-Kuen, Wu, and 吳建昆. "The impact of stratification on the performance of classification algorithms evaluated by k-fold cross validation." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/xkvvzs.

Abstract:

Master's
National Cheng Kung University
Institute of Information Management
Academic year 105 (ROC calendar)
K-fold cross validation is one of the accuracy estimation methods used in many types of experimental research. Stratification, however, is seldom performed in order to obtain more representative data in each partition. Stratification has the advantage of reducing the variance of estimators and thus better estimating the true accuracy. This research looks at stratification and imbalanced data sets from a different perspective. General data sets are used to develop a new algorithm from standard stratification on K-fold cross validation and to investigate the estimator in terms of bias and variance. Imbalanced data sets are used to discuss the performance of applying stratification in terms of recall, precision and other measures when a class value is rare. Many studies recommend their algorithm without an appropriate parametric method for statistical comparison. The purpose of this study is therefore to compare these stratified methods under the same conditions, with the decision tree and k-nearest neighbors algorithms, through a reasonable statistical comparison. The results demonstrate that the estimated performance under K-fold cross validation is similar whether or not stratification is applied, for single or multiple, general or imbalanced data sets. Furthermore, when time complexity is considered and a stable estimator is assumed, standard stratification can be used with K-fold cross validation. With advanced stratification, which takes into account the features shared between data points, the estimator is relatively more stable than with standard stratification.
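Standard stratification, as referred to in this abstract, can be sketched as follows: instances are grouped by class and dealt out to the folds in turn, so each fold keeps roughly the class proportions of the full data set. This is a generic illustration under that assumption, not the procedure evaluated in the thesis; the function `stratified_folds` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign instance indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)
        # Deal the members of this class to the folds in round-robin order.
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds

# Imbalanced toy data set: 10% rare class.
labels = ["rare"] * 10 + ["common"] * 90
folds = stratified_folds(labels, k=5)
rare_per_fold = [sum(labels[i] == "rare" for i in fold) for fold in folds]
print(rare_per_fold)
```

With purely random folds, a rare class can be missing from some folds entirely, which makes recall and precision on that fold undefined or unstable; dealing by class avoids this. Note that this simple round-robin always starts at the first fold, so fold sizes can drift slightly when class counts are not divisible by k.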

14

Chiao-Ying Lin and 林巧盈. "A study on the selection error rate of classification algorithms evaluated by k-fold cross validation." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/23699989925707105417.

Abstract:

Master's
National Cheng Kung University
Institute of Information Management
Academic year 102 (ROC calendar)
The performance of classification algorithms is generally evaluated by K-fold cross validation to find the one that has the highest accuracy. The model induced from all available data by the best classification algorithm, called the full sample model, is then used for prediction and interpretation. Since there are no extra data to evaluate the full sample model resulting from the best algorithm, its prediction accuracy can be lower than the accuracy of the full sample model induced by another classification algorithm, and this is called a selection error. This study designs an experiment to calculate and estimate the selection error rate, and attempts to propose a new model for reducing the selection error rate. The classification algorithms considered in this study are decision tree, naïve Bayesian classifier, logistic regression, and support vector machine. The experimental results on 30 data sets show that the actual and estimated selection error rates can be greatly different in several cases. The new model that has the median accuracy can reduce the selection error rate without sacrificing the prediction accuracy.

15

Ying-Yi Chen and 陳映伊. "A study for investigating classification accuracy and consistency between K-fold cross validation and complete-data model." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/00015703419684582128.

Abstract:

Master's
National Cheng Kung University
Institute of Information Management
Academic year 101 (ROC calendar)
In classification applications, analysts generally use K-fold cross validation to find the classifier that has the best performance. Then the classifier generates a learning model from all available data for prediction and interpretation. The K-fold cross validation randomly divides all available data into K folds, and every fold is in turn used for testing the model learned from the other K-1 folds. The average of the accuracies resulting from the K folds is an estimate of the prediction accuracy of the model learned from all available data. However, this procedure does not guarantee that the model induced from all available data by the best classifier evaluated by K-fold cross validation will have the highest prediction accuracy on new data with respect to the other classifiers. This study first designs an experiment to investigate whether the mean accuracy resulting from K-fold cross validation is a good estimate for the prediction accuracy of the model learned from all available data. An inconsistent rate is then introduced to measure the prediction consistency between the model learned from all available data and the K models induced from K-fold cross validation. When the inconsistent rate is small, using the model learned from all available data for prediction and interpretation will be appropriate. The experimental results on 30 data sets indicate that the average of the mean accuracy resulting from K-fold cross validation and the average of the prediction accuracy of the model induced from all available data on new data are generally not significantly different. However, since the probability of the difference between the mean accuracy resulting from K-fold cross validation and the prediction accuracy resulting from the model induced from all available data to be larger than one percent is approximately 0.60, the probability of choosing a classifier with a lower prediction accuracy on new data is generally larger than 0.3. The inconsistent rate shows that among the four classifiers adopted in this study, decision tree learning is the worst one to generate a model from all available data for prediction and interpretation.
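The K-fold procedure summarized in this abstract (random partition into K folds, each fold tested once against a model learned from the other K-1 folds, accuracies averaged) can be sketched in a few lines. The "classifier" here is a deliberately trivial majority-label baseline chosen to keep the example self-contained; it, the function names, and the toy labels are assumptions for illustration, not the classifiers studied in the thesis.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_label(labels):
    """Toy 'classifier': always predict the most frequent training label."""
    return max(sorted(set(labels)), key=labels.count)

def k_fold_accuracy(labels, k, seed=0):
    """Mean held-out accuracy over k folds for the majority-label baseline."""
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for test_fold in folds:
        test_set = set(test_fold)
        # Learn from the other K-1 folds only.
        train_labels = [y for i, y in enumerate(labels) if i not in test_set]
        prediction = majority_label(train_labels)
        # Test on the held-out fold.
        correct = sum(labels[i] == prediction for i in test_fold)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / k

labels = ["a"] * 70 + ["b"] * 30
estimate = k_fold_accuracy(labels, k=10)
print(round(estimate, 3))
```

The returned average is only an estimate of how a model trained on all available data would perform on new data, which is the gap the thesis investigates.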

16

Yi-Yin Huang and 黃宜音. "A study on the new models for improving the selection error rate among classification algorithms evaluated by k-fold cross validation." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/88651254533355280085.

Abstract:

Master's
National Cheng Kung University
Institute of Information Management
Academic year 103 (ROC calendar)
The performance of classification algorithms is generally evaluated by K-fold cross validation to find the one that has the highest accuracy. The model induced from all available data by the best classification algorithm, called the full sample model, is then used for prediction and interpretation. Since there are no extra data to evaluate the full sample model resulting from the best algorithm, its prediction accuracy can be lower than the accuracy of the full sample model induced by another classification algorithm, and this is called a selection error. The experimental results of some previous studies showed that the actual and the estimated selection error rates can be greatly different in several cases. This study repeatedly performs the experiment to stabilize the estimated selection error rates, and attempts to propose new models for reducing the selection error rate without sacrificing the prediction accuracy. The classification algorithms considered in this study are decision tree, naïve Bayesian classifier, logistic regression, and support vector machine. This study investigates the impact of the number of classification algorithms, the number of folds, and the characteristics of data sets on the selection error rate, and proposes three methods to generate new models for reducing the selection error rate. The experimental results on thirty data sets show that the selection error rate increases as the number of classification algorithms increases, while the number of folds does not affect the selection error rate. The new models proposed in this study can effectively reduce the selection error rate for interpreting learning results.
