
Dissertations / Theses on the topic 'Nearest Neighbor Classification'


Consult the top 50 dissertations / theses for your research on the topic 'Nearest Neighbor Classification.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Karo, Ciril. "Two new nearest neighbor classification rules." Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 1998. http://handle.dtic.mil/100.2/ADA354997.

Full text
Abstract:
Thesis (M.S. in Operations Research) Naval Postgraduate School, September 1998.
"September 1998." Thesis advisor(s): Samuel E. Buttrey. Includes bibliographical references (p. 69-71). Also available online.
APA, Harvard, Vancouver, ISO, and other styles
2

Moraski, Ashley M. "Classification via distance profile nearest neighbors." Digital WPI, 2006. https://digitalcommons.wpi.edu/etd-theses/703.

Full text
Abstract:
Most classification rules can be expressed in terms of a distance (or dissimilarity) from the point to be classified to each of the candidate classes. For example, linear discriminant analysis classifies points into the class for which the (sample) Mahalanobis distance is smallest. However, dependence among these point-to-group distance measures is generally ignored. The primary goal of this project is to investigate the properties of a general non-parametric classification rule which takes this dependence structure into account. A review of classification procedures and applications is presented. The distance profile nearest-neighbor classification rule is defined. Properties of the rule are then explored via application to both real and simulated data and comparisons to other classification rules are discussed.
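As a loose illustration of the kind of rule this abstract describes (not the thesis's own code), the sketch below computes a "distance profile", the vector of sample Mahalanobis distances from a point to each class, and applies the baseline rule of assigning the point to the class with the smallest distance. All data, names and class counts here are hypothetical.

```python
import numpy as np

def mahalanobis_profile(x, class_means, class_covs):
    """Return the vector of Mahalanobis distances from x to each candidate class."""
    profile = []
    for mu, cov in zip(class_means, class_covs):
        diff = x - mu
        profile.append(float(np.sqrt(diff @ np.linalg.inv(cov) @ diff)))
    return np.array(profile)

def classify_min_distance(x, class_means, class_covs):
    """Baseline rule: assign x to the class with the smallest distance."""
    return int(np.argmin(mahalanobis_profile(x, class_means, class_covs)))

# Hypothetical two-class example with estimated means and covariances.
rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 1.0, size=(50, 2))
X1 = rng.normal([3, 3], 1.0, size=(50, 2))
means = [X0.mean(axis=0), X1.mean(axis=0)]
covs = [np.cov(X0, rowvar=False), np.cov(X1, rowvar=False)]
print(classify_min_distance(np.array([2.5, 2.0]), means, covs))  # expected: class 1
```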
APA, Harvard, Vancouver, ISO, and other styles
3

Gupta, Nidhi. "Mutual k Nearest Neighbor based Classifier." University of Cincinnati / OhioLINK, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1289937369.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Burkholder, Joshua Jeremy. "Nearest neighbor classification using a density sensitive distance measurement [electronic resource]." Thesis, Monterey, California : Naval Postgraduate School, 2009. http://edocs.nps.edu/npspubs/scholarly/theses/2009/Sep/09Sep%5FBurkholder.pdf.

Full text
Abstract:
Thesis (M.S. in Modeling, Virtual Environments, And Simulations (MOVES))--Naval Postgraduate School, September 2009.
Thesis Advisor(s): Squire, Kevin. "September 2009." Description based on title screen as viewed on November 03, 2009. Author(s) subject terms: Classification, Supervised Learning, k-Nearest Neighbor Classification, Euclidean Distance, Mahalanobis Distance, Density Sensitive Distance, Parzen Windows, Manifold Parzen Windows, Kernel Density Estimation. Includes bibliographical references (p. 99-100). Also available in print.
APA, Harvard, Vancouver, ISO, and other styles
5

Ozsakabasi, Feray. "Classification Of Forest Areas By K Nearest Neighbor Method: Case Study, Antalya." Master's thesis, METU, 2008. http://etd.lib.metu.edu.tr/upload/12609548/index.pdf.

Full text
Abstract:
Among the various remote sensing methods that can be used to map forest areas, the K Nearest Neighbor (KNN) supervised classification method is becoming increasingly popular for creating forest inventories in some countries. In this study, the utility of the KNN algorithm is evaluated for forest/non-forest/water stratification. Antalya is selected as the study area. The data used are composed of Landsat TM and Landsat ETM satellite images, acquired in 1987 and 2002, respectively, SRTM 90 meters digital elevation model (DEM) and land use data from the year 2003. The accuracies of different modifications of the KNN algorithm are evaluated using Leave One Out, which is a special case of K-fold cross-validation, and traditional accuracy assessment using error matrices. The best parameters are found to be Euclidean distance metric, inverse distance weighting, and k equal to 14, while using bands 4, 3 and 2. With these parameters, the cross-validation error is 0.009174, and the overall accuracy is around 86%. The results are compared with those from the Maximum Likelihood algorithm. KNN results are found to be accurate enough for practical applicability of this method for mapping forest areas.
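The parameter choices reported above map directly onto a standard kNN setup. The sketch below is only an illustration of that configuration with synthetic band values standing in for the Landsat data: scikit-learn's KNeighborsClassifier with the Euclidean metric, inverse-distance weighting, and k = 14, evaluated with leave-one-out cross-validation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical training pixels: three spectral bands (standing in for TM bands 4, 3, 2)
# with labels 0 = non-forest, 1 = forest, 2 = water.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(m, 8.0, size=(60, 3)) for m in (40, 90, 140)])
y = np.repeat([0, 1, 2], 60)

# Euclidean metric, inverse-distance weighting, k = 14, as reported in the study.
knn = KNeighborsClassifier(n_neighbors=14, weights="distance", metric="euclidean")

# Leave-one-out cross-validation accuracy.
scores = cross_val_score(knn, X, y, cv=LeaveOneOut())
print("LOO accuracy:", scores.mean())
```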
APA, Harvard, Vancouver, ISO, and other styles
6

PORFIRIO, DAVID JONATHAN. "SINGLE-SEQUENCE PROTEIN SECONDARY STRUCTURE PREDICTION BY NEAREST-NEIGHBOR CLASSIFICATION OF PROTEIN WORDS." Thesis, The University of Arizona, 2016. http://hdl.handle.net/10150/613449.

Full text
Abstract:
Predicting protein secondary structure is the process by which, given a sequence of amino acids as input, the secondary structure class of each position in the sequence is predicted. Our approach is built on the extraction of protein words of a fixed length from protein sequences, followed by nearest-neighbor classification to predict the secondary structure class of the middle position in each word. We present a new formulation for learning a distance function on protein words based on position-dependent substitution scores on amino acids. These substitution scores are learned by solving a large linear programming problem on examples of words with known secondary structures. We evaluated this approach using a database of 5519 proteins with a total length of approximately 3,000,000 amino acids. A test scheme using words of length 23 achieved a uniform average over word position of 65.2%. The average accuracy for alpha-classified words in the test was 63.1%, for beta-classified words 56.6%, and for coil-classified words 71.6%.
APA, Harvard, Vancouver, ISO, and other styles
7

Ali, Khan Syed Irteza. "Classification using residual vector quantization." Diss., Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/50300.

Full text
Abstract:
Residual vector quantization (RVQ) is a 1-nearest neighbor (1-NN) type of technique. RVQ is a multi-stage implementation of regular vector quantization: an input is successively quantized to the nearest codevector in each stage codebook. In classification, nearest neighbor techniques are very attractive since they model the ideal Bayes class boundaries very accurately. However, nearest neighbor classification requires a large, representative dataset, and since a test input is assigned a class membership only after an exhaustive search of the entire training set, a reasonably large training set can make a nearest neighbor classifier prohibitively costly to implement. Although the k-d tree structure offers a far more efficient implementation of 1-NN search, the cost of storing the data points can become prohibitive, especially in higher dimensionality. RVQ offers a cost-effective implementation of 1-NN-based classification: because of the direct-sum structure of the RVQ codebook, the memory and computational cost of a 1-NN-based system is greatly reduced. Although the multi-stage implementation of the RVQ codebook compromises the accuracy of the class boundaries compared to an equivalent 1-NN system, the classification error has been empirically shown to be within 3% to 4% of the performance of an equivalent 1-NN-based classifier.
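A minimal sketch of the direct-sum idea described above, not taken from the dissertation: each stage quantizes the residual left by the previous stage, so an input is represented by one codevector index per stage instead of one entry in a single large codebook. Codebook sizes, dimensions, and data are hypothetical.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Encode x by successive nearest-codevector quantization of residuals."""
    residual = x.copy()
    indices = []
    for cb in codebooks:                      # one codebook per stage
        d = np.linalg.norm(cb - residual, axis=1)
        j = int(np.argmin(d))                 # nearest codevector in this stage
        indices.append(j)
        residual = residual - cb[j]           # pass the residual to the next stage
    return indices, residual

# Two hypothetical stages of 8 codevectors each in 4 dimensions: the direct-sum
# structure represents 8 x 8 = 64 effective codevectors while only 16 are stored
# and searched.
rng = np.random.default_rng(1)
codebooks = [rng.normal(size=(8, 4)), 0.2 * rng.normal(size=(8, 4))]
x = rng.normal(size=4)
idx, res = rvq_encode(x, codebooks)
print("stage indices:", idx, "residual norm:", np.linalg.norm(res))
```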
APA, Harvard, Vancouver, ISO, and other styles
8

Liu, Dongqing. "GENETIC ALGORITHMS FOR SAMPLE CLASSIFICATION OF MICROARRAY DATA." University of Akron / OhioLINK, 2005. http://rave.ohiolink.edu/etdc/view?acc_num=akron1125253420.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Rudin, Pierre. "Football result prediction using simple classification algorithms, a comparison between k-Nearest Neighbor and Linear Regression." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-187659.

Full text
Abstract:
Ever since humans started competing with each other, people have tried to accurately predict the outcome of such events. Football is no exception, and it is an especially interesting subject for a project like this given the ever growing amount of data gathered from matches these days. Previously, predictors had to make their predictions using their own knowledge and small amounts of data. This report uses this growing amount of data to find out whether it is possible to accurately predict the outcome of a football match using the k-Nearest Neighbor algorithm and Linear Regression. The algorithms are compared on how accurately they predict the winner of a match, how precisely they predict how many goals each team will score, and the accuracy of the predicted goal difference. The results are graphed and presented in tables. A discussion analyzes the results and draws the conclusion that both algorithms could be useful if used with a good model, and that Linear Regression outperforms k-NN.
APA, Harvard, Vancouver, ISO, and other styles
10

Blinn, Christine Elizabeth. "Increasing the Precision of Forest Area Estimates through Improved Sampling for Nearest Neighbor Satellite Image Classification." Diss., Virginia Tech, 2005. http://hdl.handle.net/10919/28694.

Full text
Abstract:
The impacts of training data sample size and sampling method on the accuracy of forest/nonforest classifications of three mosaicked Landsat ETM+ images with the nearest neighbor decision rule were explored. Large training data pools of single pixels were used in simulations to create samples with three sampling methods (random, stratified random, and systematic) and eight sample sizes (25, 50, 75, 100, 200, 300, 400, and 500). Two forest area estimation techniques were used to estimate the proportion of forest in each image and to calculate forest area precision estimates. Training data editing was explored to remove problem pixels from the training data pools. All possible band combinations of the six non-thermal ETM+ bands were evaluated for every sample draw. Comparisons were made between classification accuracies to determine if all six bands were needed. The utility of separability indices, minimum and average Euclidean distances, and cross-validation accuracies for the selection of band combinations, prediction of classification accuracies, and assessment of sample quality were determined. Larger training data sample sizes produced classifications with higher average accuracies and lower variability. All three sampling methods had similar performance. Training data editing improved the average classification accuracies by a minimum of 5.45%, 5.31%, and 3.47%, respectively, for the three images. Band combinations with fewer than all six bands almost always produced the maximum classification accuracy for a single sample draw. The number of bands and the combination of bands that maximized classification accuracy depended on the characteristics of the individual training data sample draw, the image, sample size, and, to a lesser extent, the sampling method. All three band selection measures were unable to select band combinations that produced higher accuracies on average than all six bands. Cross-validation accuracies with sample size 500 had high correlations with classification accuracies, and provided an indication of sample quality. Collection of a high quality training data sample is key to the performance of the nearest neighbor classifier. Larger samples are necessary to guarantee classifier performance and the utility of cross-validation accuracies. Further research is needed to identify the characteristics of "good" training data samples.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
11

Naram, Hari Prasad. "Classification of Dense Masses in Mammograms." OpenSIUC, 2018. https://opensiuc.lib.siu.edu/dissertations/1528.

Full text
Abstract:
This dissertation details techniques developed to aid in the classification of tumors, non-tumors, and dense masses in a mammogram. Characteristics such as texture in the mammographic image are used to identify regions of interest as part of the classification, and pattern recognition techniques, namely the nearest mean classifier and the support vector machine classifier, are used to classify the extracted features. The initial stages process the mammographic image to extract the relevant features, and in the final stage the features are classified using the pattern recognition techniques mentioned above. The goal of this research is to provide medical experts and researchers with an effective method to aid them in identifying tumors, non-tumors, and dense masses in a mammogram. First, the breast region is extracted from the entire mammogram by creating masks and using them to extract the region of interest pertaining to the tumor. A chain code is employed to extract the various regions, which could potentially be classified as tumors, non-tumors, or dense regions. Adaptive histogram equalization is employed to enhance the contrast of the image; applying it several times yields a saturated image containing only the bright spots of the mammogram, which appear as dense regions. These dense masses could be potential tumors requiring treatment. Texture characteristics of the mammographic image are used for feature extraction, and a total of thirteen Haralick features are used to classify the three classes with the nearest mean and support vector machine classifiers. The support vector machine classifier is used for the two-class problems, with a radial basis function (RBF) kernel and a search for the best possible (C, gamma) values. The results obtained in this research suggest that the best classification accuracy was achieved by the support vector machines for both tumor vs. non-tumor and tumor vs. dense masses: the maximum accuracies achieved are above 90% for tumor vs. non-tumor and 70.8% for the dense masses, using 11 features. Support vector machines performed better than the nearest mean classifier. Case studies were performed using two distinct datasets, each consisting of 24 patients' data in two views per patient, the cranio-caudal and medio-lateral oblique views, from which the regions of interest that could possibly be a tumor, non-tumor, or dense region (mass) were extracted.
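The RBF-kernel parameter search mentioned above is commonly implemented as a grid search over C and gamma. The sketch below is a generic illustration of that step with placeholder 13-dimensional Haralick-style feature vectors, not the dissertation's pipeline or data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical 13-dimensional texture feature vectors (Haralick-style),
# labels 0 = non-tumor, 1 = tumor.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (80, 13)), rng.normal(1.0, 1.0, (80, 13))])
y = np.repeat([0, 1], 80)

# RBF-kernel SVM; search over C and gamma as described in the abstract.
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1, 1]}
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print("best (C, gamma):", search.best_params_, "CV accuracy:", search.best_score_)
```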
APA, Harvard, Vancouver, ISO, and other styles
12

Jiao, Lianmeng. "Classification of uncertain data in the framework of belief functions : nearest-neighbor-based and rule-based approaches." Thesis, Compiègne, 2015. http://www.theses.fr/2015COMP2222/document.

Full text
Abstract:
In many classification problems, data are inherently uncertain. The available training data might be imprecise, incomplete, or even unreliable. Besides, partial expert knowledge characterizing the classification problem may also be available. These different types of uncertainty bring great challenges to classifier design. The theory of belief functions provides a well-founded and elegant framework to represent and combine a large variety of uncertain information. In this thesis, we use this theory to address uncertain data classification problems based on two popular approaches, i.e., the k-nearest neighbor rule (kNN) and rule-based classification systems. For the kNN rule, one concern is that imprecise training data in class-overlapping regions may greatly affect its performance. An evidential editing version of the kNN rule was developed based on the theory of belief functions in order to better model the imprecise information carried by samples in overlapping regions. Another consideration is that, sometimes, only an incomplete training data set is available, in which case the performance of the kNN rule degrades dramatically. Motivated by this problem, we designed an evidential fusion scheme for combining a group of pairwise kNN classifiers built on locally learned pairwise distance metrics. For rule-based classification systems, in order to improve their performance in complex applications, we extended the traditional fuzzy rule-based classification system in the framework of belief functions and developed a belief rule-based classification system to address uncertain information in complex classification problems. Further, considering that in some applications, apart from training data collected by sensors, partial expert knowledge can also be available, a hybrid belief rule-based classification system was developed to make use of these two types of information jointly for classification.
APA, Harvard, Vancouver, ISO, and other styles
13

Sammon, Ryan. "Data Collection, Analysis, and Classification for the Development of a Sailing Performance Evaluation System." Thèse, Université d'Ottawa / University of Ottawa, 2013. http://hdl.handle.net/10393/25481.

Full text
Abstract:
The work described in this thesis contributes to the development of a system to evaluate sailing performance. This work was motivated by the lack of tools available to evaluate sailing performance. The goal of the work presented is to detect and classify the turns of a sailing yacht. Data was collected using a BlackBerry PlayBook affixed to a J/24 sailing yacht. This data was manually annotated with three types of turn: tack, gybe, and mark rounding. This manually annotated data was used to train classification methods. Classification methods tested were multi-layer perceptrons (MLPs) of two sizes in various committees and nearest-neighbour search. Pre-processing algorithms tested were Kalman filtering, categorization using quantiles, and residual normalization. The best solution was found to be an averaged answer committee of small MLPs, with Kalman filtering and residual normalization performed on the input as pre-processing.
APA, Harvard, Vancouver, ISO, and other styles
14

Dastile, Xolani Collen. "Improved tree species discrimination at leaf level with hyperspectral data combining binary classifiers." Thesis, Rhodes University, 2011. http://hdl.handle.net/10962/d1002807.

Full text
Abstract:
The purpose of the present thesis is to show that hyperspectral data can be used for discrimination between different tree species. The data set used in this study contains the hyperspectral measurements of leaves of seven savannah tree species. The data is high-dimensional and shows large within-class variability combined with small between-class variability, which makes discrimination between the classes challenging. We employ two classification methods: 1-nearest neighbour and feed-forward neural networks. For both methods, direct 7-class prediction results in high misclassification rates. However, binary classification works better. We constructed binary classifiers for all possible binary classification problems and combined them with Error Correcting Output Codes. We show in particular that the use of 1-nearest neighbour binary classifiers results in no improvement compared to a direct 1-nearest neighbour 7-class predictor. In contrast to this negative result, the use of neural network binary classifiers improves accuracy by 10% compared to a direct neural network 7-class predictor, and error rates become acceptable. This can be further improved by choosing only suitable binary classifiers for combination.
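Error Correcting Output Codes can be combined with any binary base learner. The sketch below shows the generic construction with a 1-nearest-neighbour base classifier using scikit-learn; it is only an illustration with synthetic 7-class data standing in for the hyperspectral leaf spectra, not the thesis's implementation.

```python
import numpy as np
from sklearn.multiclass import OutputCodeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical 7-class, high-dimensional data standing in for leaf spectra.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(c, 2.0, size=(30, 50)) for c in range(7)])
y = np.repeat(np.arange(7), 30)

# Binary 1-NN classifiers combined with an error-correcting output code.
ecoc_1nn = OutputCodeClassifier(KNeighborsClassifier(n_neighbors=1),
                                code_size=2.0, random_state=0)
print("ECOC 1-NN accuracy:", cross_val_score(ecoc_1nn, X, y, cv=5).mean())

# Direct 7-class 1-NN baseline for comparison.
print("direct 1-NN accuracy:",
      cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5).mean())
```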
APA, Harvard, Vancouver, ISO, and other styles
15

Zhang, Xianjie, and Sebastian Bogic. "Datautvinning av klickdata : Kombination av klustring och klassifikation." Thesis, KTH, Hälsoinformatik och logistik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-230630.

Full text
Abstract:
Owners of websites and applications usually profit from users clicking on their links, which can be advertisements or items for sale, among other things. There are many data analysis studies that tell you whether a link will be clicked, but only a few that focus on what needs to be adjusted to get the link clicked. The problem that Flygresor.se has is that they are missing a tool for their customers, travel agencies, that analyses their tickets and then adjusts the attributes of those trips. The requested solution was an application which gives suggestions on how to change the tickets in a way that would make them more clicked and, in that way, generate more sales. A prototype was constructed which makes use of two different data mining methods: clustering with the algorithm DBSCAN and classification with the algorithm k-nearest neighbor. These algorithms were used together with an evaluation process, called DNNA, which analyzes the results from the algorithms and gives suggestions about changes that could be made to the attributes of the links. The combination of the algorithms and DNNA was tested and evaluated as the solution to the problem. The program was able to predict which attributes of the tickets needed to be adjusted to get the tickets more clicks. The recommended adjustments were reasonable, but this result could not be compared to similar tools since none had been published.
APA, Harvard, Vancouver, ISO, and other styles
16

Amlathe, Prakhar. "Standard Machine Learning Techniques in Audio Beehive Monitoring: Classification of Audio Samples with Logistic Regression, K-Nearest Neighbor, Random Forest and Support Vector Machine." DigitalCommons@USU, 2018. https://digitalcommons.usu.edu/etd/7050.

Full text
Abstract:
Honeybees are one of the most important pollinating species in agriculture. Three out of every four crops have the honeybee as their sole pollinator. Since 2006 there has been a drastic decrease in the bee population, which is attributed to Colony Collapse Disorder (CCD). Bee colonies fail or die without showing the traditional health symptoms that could otherwise alert beekeepers in advance to their situation. An electronic beehive monitoring system has various sensors embedded in it to extract video, audio and temperature data that can provide critical information on colony behavior and health without invasive beehive inspections. Previously, significant patterns and information have been extracted by processing the video/image data, but no work has been done using audio data. This research takes the first step towards the use of audio data in the Electronic Beehive Monitoring System (BeePi) by enabling the automatic classification of audio samples into different classes and categories. The experimental results give initial support to the claim that monitoring bee buzzing signals from the hive is feasible, can be a good indicator for estimating hive health, and can help differentiate normal behavior from deviations.
APA, Harvard, Vancouver, ISO, and other styles
17

Mestre, Ricardo Jorge Palheira. "Improvements on the KNN classifier." Master's thesis, Faculdade de Ciências e Tecnologia, 2013. http://hdl.handle.net/10362/10923.

Full text
Abstract:
Dissertation submitted for the degree of Master in Computer Engineering (Engenharia Informática)
Object classification is an important area within artificial intelligence, and its applications extend to various fields, within and beyond science. Among classifiers, the K-nearest neighbor (KNN) is one of the simplest and most accurate, especially in environments where the data distribution is unknown or apparently not parameterizable. This algorithm assigns to the element being classified the majority class among its K nearest neighbors. According to the original algorithm, this classification implies calculating the distances between the instance being classified and each of the training objects. If, on the one hand, having an extensive training set is important in order to obtain high accuracy, on the other hand it makes the classification of each object slower due to the lazy-learning nature of the algorithm. Indeed, the algorithm does not provide any means of storing information about previously calculated classifications, so the classification of two identical instances must be computed twice. In a way, it may be said that this classifier does not learn. This dissertation focuses on this lazy-learning fragility and proposes a solution that transforms KNN into an eager-learning classifier. In other words, the intention is that the algorithm learns effectively from the training set, thus avoiding redundant calculations. In the context of the proposed change to the algorithm, it is important to highlight the attributes that best characterize the objects according to their discriminating power. In this framework, the implementation of these transformations is studied on data of different types: continuous and/or categorical.
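One very simple way to mitigate the lazy-learning cost pointed out above is to memoize previously computed answers so that identical queries are never classified twice. The sketch below is only a minimal illustration of that idea, not the dissertation's actual eager-learning proposal; all names and data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

class CachingKNN:
    """kNN wrapper that memoizes predictions for previously seen query points."""

    def __init__(self, k=3):
        self.knn = KNeighborsClassifier(n_neighbors=k)
        self.cache = {}

    def fit(self, X, y):
        self.knn.fit(X, y)
        self.cache.clear()
        return self

    def predict_one(self, x):
        key = tuple(np.round(x, 8))            # hashable key for the query point
        if key not in self.cache:              # neighbour search runs once per distinct query
            self.cache[key] = int(self.knn.predict([x])[0])
        return self.cache[key]

# Hypothetical usage: the second identical query is answered from the cache.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 4)), rng.integers(0, 2, 200)
model = CachingKNN(k=3).fit(X, y)
q = X[0]
print(model.predict_one(q), model.predict_one(q))
```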
APA, Harvard, Vancouver, ISO, and other styles
18

Bhadoria, Divya. "Learning from spatially disjoint data." [Tampa, Fla.] : University of South Florida, 2004. http://purl.fcla.edu/fcla/etd/SFE0000344.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Gundam, Madhuri. "Automatic Classification of Fish in Underwater Video; Pattern Matching - Affine Invariance and Beyond." ScholarWorks@UNO, 2015. http://scholarworks.uno.edu/td/1976.

Full text
Abstract:
Underwater video is used by marine biologists to observe, identify, and quantify living marine resources. Video sequences are typically analyzed manually, which is a time-consuming and laborious process. Automating this process will significantly save time and cost. This work proposes a technique for automatic fish classification in underwater video. The steps involved are background subtraction, fish region tracking and classification using features. The background processing is used to separate moving objects from their surrounding environment. Tracking associates multiple views of the same fish in consecutive frames. This step is especially important since recognizing and classifying one or a few of the views as a species of interest may allow labeling the sequence as that particular species. Shape features are extracted using Fourier descriptors from each object and are presented to a nearest neighbor classifier for classification. Finally, the nearest neighbor classifier results are combined using a probabilistic-like framework to classify an entire sequence. The majority of the existing pattern matching techniques focus on affine invariance, mainly because rotation, scale, translation and shear are common image transformations. However, in some situations, other transformations may be modeled as a small deformation on top of an affine transformation. The proposed algorithm complements the existing Fourier transform-based pattern matching methods in such a situation. First, the spatial domain pattern is decomposed into non-overlapping concentric circular rings with centers at the middle of the pattern. The Fourier transforms of the rings are computed, and are then mapped to the polar domain. The algorithm assumes that the individual rings are rotated with respect to each other. The variable angles of rotation provide information about the directional features of the pattern. This angle of rotation is determined starting from the Fourier transform of the outermost ring and moving inwards to the innermost ring. Two different approaches, one using a dynamic programming algorithm and the other using a greedy algorithm, are used to determine the directional features of the pattern.
APA, Harvard, Vancouver, ISO, and other styles
20

Piro, Paolo. "Learning prototype-based classification rules in a boosting framework: application to real-world and medical image categorization." Phd thesis, Université de Nice Sophia-Antipolis, 2010. http://tel.archives-ouvertes.fr/tel-00590403.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Björk, Gabriella. "Evaluation of system design strategies and supervised classification methods for fruit recognition in harvesting robots." Thesis, KTH, Skolan för industriell teknik och management (ITM), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-217859.

Full text
Abstract:
This master thesis project was carried out by one student at the Royal Institute of Technology in collaboration with Cybercom Group. The aim was to evaluate and compare system design strategies for fruit recognition in harvesting robots and the performance of supervised machine learning classification methods when applied to this specific task. The thesis covers the basics of these systems, for which parameters, constraints, requirements, and design decisions have been investigated. This framework is used as a foundation for the implementation of both the sensing system and the processing and classification algorithms. A plastic tomato plant with fruit of varying maturity was used as a basis for training and testing, and a Kinect v2 for Windows, including sensors for high resolution color, depth, and IR data, was used for image acquisition. The obtained data were processed, and features of objects of interest extracted, using MATLAB and an SDK for Kinect provided by Microsoft. Multiple views of the plant were acquired by having the plant rotate on a platform controlled by a stepper motor and an Arduino Uno. The algorithms tested were binary classifiers, including Support Vector Machine, Decision Tree, and k-Nearest Neighbor. The models were trained and validated using five-fold cross-validation in MATLAB's Classification Learner application. Performance metrics such as precision, recall, and the F1-score were calculated for accuracy comparison. The statistical models k-NN and SVM achieved the best scores, and the method considered most promising for fruit recognition purposes was the SVM.
APA, Harvard, Vancouver, ISO, and other styles
22

Aygar, Alper. "Doppler Radar Data Processing And Classification." Master's thesis, METU, 2008. http://etd.lib.metu.edu.tr/upload/12609890/index.pdf.

Full text
Abstract:
In this thesis, improving the performance of the automatic recognition of Doppler radar targets is studied. The radar used in this study is a ground-surveillance Doppler radar. Target types are car, truck, bus, tank, helicopter, moving man and running man. The input of this thesis is the output of real Doppler radar signals which are normalized and preprocessed (TRP vectors: Target Recognition Pattern vectors) in the doctoral thesis by Erdogan (2002). TRP vectors are normalized and homogenized Doppler radar target signals with respect to target speed, target aspect angle and target range. Some target classes have repetitions in time in their TRPs. By the use of these repetitions, improvement of the target type classification performance is studied. K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) algorithms are used for Doppler radar target classification and the results are evaluated. Before classification, PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), NMF (Nonnegative Matrix Factorization) and ICA (Independent Component Analysis) are implemented and applied to the normalized Doppler radar signals for feature extraction and dimension reduction in an efficient way. These techniques transform the input vectors, which are the normalized Doppler radar signals, into another space. The effects of these feature extraction algorithms, and of the use of the repetitions in Doppler radar target signals, on the Doppler radar target classification performance are studied.
APA, Harvard, Vancouver, ISO, and other styles
23

Günther, Michael. "FREDDY." Association for Computing Machinery, 2018. https://tud.qucosa.de/id/qucosa%3A38451.

Full text
Abstract:
Word embeddings are useful in many tasks in Natural Language Processing and Information Retrieval, such as text mining and classification, sentiment analysis, sentence completion, or dictionary construction. Word2vec and its successor fastText, both well-known models to produce word embeddings, are powerful techniques to study the syntactic and semantic relations between words by representing them in a low-dimensional vector. By applying algebraic operations on these vectors, semantic relationships such as word analogies, gender inflections, or geographical relationships can be easily recovered. The aim of this work is to investigate how word embeddings could be utilized to augment and enrich queries in DBMSs, e.g. to compare text values according to their semantic relation or to group rows according to the similarity of their text values. For this purpose, we use pre-trained word embedding models of large text corpora such as Wikipedia. By exploiting this external knowledge during query processing we are able to apply inductive reasoning on text values. Thereby, we reduce the demand for explicit knowledge in database systems. In the context of the IMDB database schema, this allows, for example, querying movies that are semantically close to genres such as historical fiction or road movie without maintaining this information. Another example query is sketched in Listing 1, which returns the top-3 nearest neighbors (NN) of each movie in IMDB. Given the movie "Godfather" as input this results in "Scarface", "Goodfellas" and "Untouchables".
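On the embedding side, the top-3 nearest-neighbour query sketched above reduces to a cosine-similarity search over the vectors of all movie titles. The sketch below is only a minimal illustration with made-up vectors; a real system such as the one described would load pre-trained word2vec or fastText embeddings instead.

```python
import numpy as np

def top_k_neighbors(query, names, vectors, k=3):
    """Return the k entries whose embedding is closest to the query by cosine similarity."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = vectors[names.index(query)]
    q = q / np.linalg.norm(q)
    sims = v @ q
    order = np.argsort(-sims)
    return [names[i] for i in order if names[i] != query][:k]

# Hypothetical embeddings; in practice these would come from a pre-trained model.
names = ["Godfather", "Scarface", "Goodfellas", "Untouchables", "Toy Story"]
rng = np.random.default_rng(3)
vectors = rng.normal(size=(len(names), 50))
print(top_k_neighbors("Godfather", names, vectors, k=3))
```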
APA, Harvard, Vancouver, ISO, and other styles
24

Tagami, Yukihiro. "Practical Web-scale Recommender Systems." Kyoto University, 2018. http://hdl.handle.net/2433/235110.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Zapletal, Petr. "Klasifikační metody analýzy vrstvy nervových vláken na sítnici." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2010. http://www.nusl.cz/ntk/nusl-218575.

Full text
Abstract:
This thesis deals with classification of the retinal nerve fibre layer. Texture features from six texture analysis methods are used for classification. All methods calculate a feature vector from the input images, and this feature vector characterizes every cluster (class). Classification is realized by three supervised learning algorithms and one unsupervised learning algorithm. The first tested algorithm is Ho-Kashyap. The next is the Bayes classifier NDDF (Normal Density Discriminant Function). The third is the nearest neighbor algorithm k-NN, and the last tested classifier is the K-means algorithm, which belongs to clustering. To make the thesis more comprehensive, three methods for the selection of training patterns in the supervised learning algorithms are implemented, based on Repeated Random Subsampling Cross-Validation, K-Fold Cross-Validation and Leave-One-Out Cross-Validation. All algorithms are quantitatively compared in terms of classification error.
APA, Harvard, Vancouver, ISO, and other styles
26

Axillus, Viktor. "Comparing Julia and Python : An investigation of the performance on image processing with deep neural networks and classification." Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-19160.

Full text
Abstract:
Python is the most popular language when it comes to prototyping and developing machine learning algorithms. Python is an interpreted language, which causes it to have a significant performance loss compared to compiled languages. Julia is a newly developed language that tries to bridge the gap between high performance but cumbersome languages such as C++ and highly abstracted but typically slow languages such as Python. However, over the years the Python community has developed a lot of tools that address its performance problems. This raises the question of whether choosing one language over the other makes any significant performance difference. This thesis compares the performance, in terms of execution time, of the two languages in the machine learning domain; more specifically, image processing with GPU-accelerated deep neural networks and classification with k-nearest neighbor on the MNIST and EMNIST datasets. Python with Keras and TensorFlow is compared against Julia with Flux for GPU-accelerated neural networks. For classification, Python with Scikit-learn is compared against Julia with NearestNeighbors.jl. The results point in the direction that Julia has a performance edge with regard to GPU-accelerated deep neural networks, with Julia outperforming Python by roughly 1.25x to 1.5x. For classification with k-nearest neighbor the results were more varied, with Julia outperforming Python in 5 out of 8 different measurements. However, there exist some validity threats, and additional research that includes all the different frameworks available for the two languages is needed in order to provide a more conclusive and generalized answer.
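On the Python side of such a comparison, the k-nearest-neighbour measurement boils down to something like the sketch below. This is only an illustration: the built-in digits dataset stands in for MNIST/EMNIST, and actual timings depend on package versions, hardware and data size.

```python
import time
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small stand-in for MNIST/EMNIST: 8x8 digit images shipped with scikit-learn.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
t0 = time.perf_counter()
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
elapsed = time.perf_counter() - t0

print(f"accuracy = {accuracy:.3f}, fit + predict time = {elapsed:.3f} s")
```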
APA, Harvard, Vancouver, ISO, and other styles
27

Fu, Ruijun. "Empirical RF Propagation Modeling of Human Body Motions for Activity Classification." Digital WPI, 2012. https://digitalcommons.wpi.edu/etd-theses/1130.

Full text
Abstract:
"Many current and future medical devices are wearable, using the human body as a conduit for wireless communication, which implies that human body serves as a crucial part of the transmission medium in body area networks (BANs). Implantable medical devices such as Pacemaker and Cardiac Defibrillators are designed to provide patients with timely monitoring and treatment. Endoscopy capsules, pH Monitors and blood pressure sensors are used as clinical diagnostic tools to detect physiological abnormalities and replace traditional wired medical devices. Body-mounted sensors need to be investigated for use in providing a ubiquitous monitoring environment. In order to better design these medical devices, it is important to understand the propagation characteristics of channels for in-body and on- body wireless communication in BANs. The IEEE 802.15.6 Task Group 6 is officially working on the standardization of Body Area Network, including the channel modeling and communication protocol design. This thesis is focused on the propagation characteristics of human body movements. Specifically, standing, walking and jogging motions are measured, evaluated and analyzed using an empirical approach. Using a network analyzer, probabilistic models are derived for the communication links in the medical implant communication service band (MICS), the industrial scientific medical band (ISM) and the ultra- wideband (UWB) band. Statistical distributions of the received signal strength and second order statistics are presented to evaluate the link quality and outage performance for on-body to on- body communications at different antenna separations. The Normal distribution, Gamma distribution, Rayleigh distribution, Weibull distribution, Nakagami-m distribution, and Lognormal distribution are considered as potential models to describe the observed variation of received signal strength. Doppler spread in the frequency domain and coherence time in the time domain from temporal variations is analyzed to characterize the stability of the channels induced by human body movements. The shape of the Doppler spread spectrum is also investigated to describe the relationship of the power and frequency in the frequency domain. All these channel characteristics could be used in the design of communication protocols in BANs, as well as providing features to classify different human body activities. Realistic data extracted from built-in sensors in smart devices were used to assist in modeling and classification of human body movements along with the RF sensors. Variance, energy and frequency domain entropy of the data collected from accelerometer and orientation sensors are pre- processed as features to be used in machine learning algorithms. Activity classifiers with Backpropagation Network, Probabilistic Neural Network, k-Nearest Neighbor algorithm and Support Vector Machine are discussed and evaluated as means to discriminate human body motions. The detection accuracy can be improved with both RF and inertial sensors."
APA, Harvard, Vancouver, ISO, and other styles
28

Mody, Ravi. "Optimizing the distance function for nearest neighbors classification." Diss., [La Jolla] : University of California, San Diego, 2009. http://wwwlib.umi.com/cr/ucsd/fullcit?p1470299.

Full text
Abstract:
Thesis (M.S.)--University of California, San Diego, 2009.
Title from first page of PDF file (viewed December 2, 2009). Available via ProQuest Digital Dissertations. Includes bibliographical references (p. 48-49).
APA, Harvard, Vancouver, ISO, and other styles
29

Tandan, Isabelle, and Erika Goteman. "Bank Customer Churn Prediction : A comparison between classification and evaluation methods." Thesis, Uppsala universitet, Statistiska institutionen, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-411918.

Full text
Abstract:
This study aims to assess which supervised statistical learning method (random forest, logistic regression or K-nearest neighbor) is best at predicting bank customer churn. Additionally, the study evaluates which cross-validation approach, k-fold cross-validation or leave-one-out cross-validation, yields the most reliable results. Predicting customer churn has increased in popularity since new technology, regulation and changed demand have led to an increase in competition for banks. Thus, with greater reason, banks acknowledge the importance of maintaining their customer base. The findings of this study are that an unrestricted random forest model estimated using k-fold cross-validation is preferable in terms of performance measures, computational efficiency and from a theoretical point of view. Although k-fold cross-validation and leave-one-out cross-validation yield similar results, k-fold cross-validation is preferable due to computational advantages. For future research, methods that generate models with both good interpretability and high predictability would be beneficial, in order to combine knowledge of which customers end their engagement with an understanding of why. Moreover, interesting future research would be to analyze at which dataset size leave-one-out cross-validation and k-fold cross-validation yield the same results.
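The model and resampling comparison described above can be expressed compactly in scikit-learn. The sketch below uses a synthetic stand-in for the churn data and default hyperparameters; it is only meant to show the structure of such a comparison, not to reproduce the study's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Synthetic stand-in for a bank-churn dataset (label 1 = churned).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    kfold_acc = cross_val_score(
        model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0)).mean()
    loo_acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
    print(f"{name}: 10-fold = {kfold_acc:.3f}, LOO = {loo_acc:.3f}")
```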
APA, Harvard, Vancouver, ISO, and other styles
30

Zhong, Xiao. "A study of several statistical methods for classification with application to microbial source tracking." Link to electronic thesis, 2004. http://www.wpi.edu/Pubs/ETD/Available/etd-0430104-155106/.

Full text
Abstract:
Thesis (M.S.)--Worcester Polytechnic Institute.
Keywords: classification; k-nearest-neighbor (k-n-n); neural networks; linear discriminant analysis (LDA); support vector machines; microbial source tracking (MST); quadratic discriminant analysis (QDA); logistic regression. Includes bibliographical references (p. 59-61).
APA, Harvard, Vancouver, ISO, and other styles
31

Villa, Medina Joe Luis. "Reliability of classification and prediction in k-nearest neighbours." Doctoral thesis, Universitat Rovira i Virgili, 2013. http://hdl.handle.net/10803/127108.

Full text
Abstract:
This doctoral thesis develops the calculation of classification reliability and prediction reliability using the k-nearest neighbours (kNN) method and bootstrap-based resampling strategies. In addition, two new classification methods have been developed, Probabilistic Bootstrap k-Nearest Neighbours (PBkNN) and Bagged k-Nearest Neighbours (Bagged kNN), as well as a new prediction method, Direct Orthogonalization kNN (DOkNN). In all cases, the results obtained with the new methods have been comparable to or better than those obtained using classical classification and multivariate calibration methods.
APA, Harvard, Vancouver, ISO, and other styles
32

Hatko, Stan. "k-Nearest Neighbour Classification of Datasets with a Family of Distances." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/33361.

Full text
Abstract:
The k-nearest neighbour (k-NN) classifier is one of the oldest and most important supervised learning algorithms for classifying datasets. Traditionally the Euclidean norm is used as the distance for the k-NN classifier. In this thesis we investigate the use of alternative distances for the k-NN classifier. We start by introducing some background notions in statistical machine learning. We define the k-NN classifier and discuss Stone's theorem and the proof that k-NN is universally consistent on the normed space R^d. We then prove that k-NN is universally consistent if we take a sequence of random norms (that are independent of the sample and the query) from a family of norms that satisfies a particular boundedness condition. We extend this result by replacing norms with distances based on uniformly locally Lipschitz functions that satisfy certain conditions. We discuss the limitations of Stone's lemma and Stone's theorem, particularly with respect to quasinorms and adaptively choosing a distance for k-NN based on the labelled sample. We show the universal consistency of a two stage k-NN type classifier where we select the distance adaptively based on a split labelled sample and the query. We conclude by giving some examples of improvements of the accuracy of classifying various datasets using the above techniques.
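In practice, swapping the distance used by a k-NN classifier is a one-line change. The sketch below is a loose illustration in the spirit of choosing the distance adaptively from a split labelled sample, not the thesis's construction or its theoretical setting: it evaluates a small family of Minkowski norms on a validation split and keeps the best one, with all data synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
# First split: data used to select the distance; second split: final evaluation.
X_sel, X_eval, y_sel, y_eval = train_test_split(X, y, test_size=0.5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_sel, y_sel, test_size=0.3, random_state=1)

# A small family of Minkowski distances (p = 1, 2, 3, 4).
best_p, best_acc = None, -1.0
for p in [1, 2, 3, 4]:
    knn = KNeighborsClassifier(n_neighbors=5, p=p).fit(X_tr, y_tr)
    acc = knn.score(X_val, y_val)
    if acc > best_acc:
        best_p, best_acc = p, acc

final = KNeighborsClassifier(n_neighbors=5, p=best_p).fit(X_sel, y_sel)
print("selected p =", best_p, "evaluation accuracy =", final.score(X_eval, y_eval))
```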
APA, Harvard, Vancouver, ISO, and other styles
33

Luk, Andrew. "Some new results in nearest neighbour classification and lung sound analysis." Thesis, University of Glasgow, 1987. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.280756.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

He, Jingwen. "Large Margin Nearest Neighbors Classification With Privileged Information for Biometric Applications." Thesis, The University of Sydney, 2018. http://hdl.handle.net/2123/19663.

Full text
Abstract:
In this thesis, a novel metric learning algorithm is proposed to improve face verification and person re-identification in RGB images by learning from RGB and Depth (RGB-D) training images. We address this problem by formulating it as a Learning Using Privileged Information problem, in which the additional depth images associated with the RGB training images are not available during the testing process. Based on the large margin nearest neighbors (LMNN) classification framework, we propose an effective metric learning method that incorporates depth information to improve the learning of the decision function in the training process, and we formulate this distance metric learning method as large margin nearest neighbors classification with privileged information (LMNN+). Specifically, two distance metrics based on visual features as well as depth features are jointly learned by minimizing the triplet loss, in which the within-class difference is minimized while the between-class difference is maximized. The distances in the depth space intuitively tell us which samples are hard or easy to separate, and this additional knowledge is utilized to guide the training process in the visual space. In addition, we propose an efficient optimization method which can handle billions of constraints in the optimization problem of LMNN+. Comprehensive experiments on the EUROCOM data set, the CurtinFaces data set as well as the BIWI RGBD-ID data set demonstrate the effectiveness of our algorithm for face verification and person re-identification by leveraging privileged information.
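The triplet loss at the core of LMNN-style learning penalizes impostors that come within one margin unit of a target neighbour. The numpy sketch below shows that loss for a single linear transform L and a single triplet; it is only a simplified illustration and ignores the two-metric, privileged-depth formulation of LMNN+, with mu, margin and the data being hypothetical.

```python
import numpy as np

def lmnn_triplet_loss(L, x, target, impostor, mu=0.5, margin=1.0):
    """LMNN-style loss for one (anchor, target neighbour, impostor) triplet.

    Pull term: squared distance to the same-class target neighbour.
    Push term: hinge on the margin violated by the differently labelled impostor.
    """
    d_target = np.sum((L @ (x - target)) ** 2)
    d_impostor = np.sum((L @ (x - impostor)) ** 2)
    pull = d_target
    push = max(0.0, margin + d_target - d_impostor)   # hinge: impostor too close
    return (1 - mu) * pull + mu * push

# Hypothetical 2-D example with the identity transform.
L = np.eye(2)
x = np.array([0.0, 0.0])
same_class = np.array([0.5, 0.1])      # target neighbour (same label)
other_class = np.array([0.7, 0.2])     # impostor (different label)
print(lmnn_triplet_loss(L, x, same_class, other_class))
```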
APA, Harvard, Vancouver, ISO, and other styles
35

Berrett, Thomas Benjamin. "Modern k-nearest neighbour methods in entropy estimation, independence testing and classification." Thesis, University of Cambridge, 2017. https://www.repository.cam.ac.uk/handle/1810/267832.

Full text
Abstract:
Nearest neighbour methods are a classical approach in nonparametric statistics. The k-nearest neighbour classifier can be traced back to the seminal work of Fix and Hodges (1951) and they also enjoy popularity in many other problems including density estimation and regression. In this thesis we study their use in three different situations, providing new theoretical results on the performance of commonly-used nearest neighbour methods and proposing new procedures that are shown to outperform these existing methods in certain settings. The first problem we discuss is that of entropy estimation. Many statistical procedures, including goodness-of-fit tests and methods for independent component analysis, rely critically on the estimation of the entropy of a distribution. In this chapter, we seek entropy estimators that are efficient and achieve the local asymptotic minimax lower bound with respect to squared error loss. To this end, we study weighted averages of the estimators originally proposed by Kozachenko and Leonenko (1987), based on the k-nearest neighbour distances of a sample. A careful choice of weights enables us to obtain an efficient estimator in arbitrary dimensions, given sufficient smoothness, while the original unweighted estimator is typically only efficient in up to three dimensions. A related topic of study is the estimation of the mutual information between two random vectors, and its application to testing for independence. We propose tests for the two different situations of the marginal distributions being known or unknown and analyse their performance. Finally, we study the classical k-nearest neighbour classifier of Fix and Hodges (1951) and provide a new asymptotic expansion for its excess risk. We also show that, in certain situations, a new modification of the classifier that allows k to vary with the location of the test point can provide improvements. This has applications to the field of semi-supervised learning, where, in addition to labelled training data, we also have access to a large sample of unlabelled data.
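For reference, the unweighted Kozachenko-Leonenko estimator that the weighted estimators mentioned above build on can be written in a few lines. This is a standard textbook form, not code from the thesis; the constant is the log-volume of the d-dimensional unit ball and distances are to the k-th nearest neighbour of each sample point.

```python
import numpy as np
from scipy.special import gamma, digamma
from sklearn.neighbors import NearestNeighbors

def kozachenko_leonenko_entropy(X, k=3):
    """Unweighted Kozachenko-Leonenko entropy estimate (in nats) from k-NN distances."""
    n, d = X.shape
    # Distance to the k-th nearest neighbour of each point (excluding the point itself).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    eps = dist[:, k]
    log_volume_unit_ball = (d / 2) * np.log(np.pi) - np.log(gamma(d / 2 + 1))
    return (digamma(n) - digamma(k) + log_volume_unit_ball
            + d * np.mean(np.log(eps)))

# Sanity check: a standard normal in 2D has entropy (d/2) * log(2*pi*e) ≈ 2.838 nats.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
print(kozachenko_leonenko_entropy(X, k=3))
```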
APA, Harvard, Vancouver, ISO, and other styles
36

Veras, Ricardo da Costa. "Utilização de métodos de machine learning para identificação de instrumentos musicais de sopro pelo timbre." Repositório Institucional da UFABC, 2018.

Find full text
Abstract:
Advisor: Prof. Dr. Ricardo Suyama
Master's dissertation - Universidade Federal do ABC, Programa de Pós-Graduação em Engenharia da Informação, Santo André, 2018.
Pattern classification for signal processing is widely studied and used to interpret many kinds of information, including images, audio, geophysical data and electrical impulses. In this project we study machine learning techniques applied to the identification of musical instruments, aiming at an automatic timbre-recognition system. The techniques were applied to five woodwind instruments (clarinet, bassoon, flute, oboe and sax). The classifiers used were kNN (with k = 3) and SVM (in a non-linear configuration), trained and tested on audio features such as MFCC (Mel-Frequency Cepstral Coefficients), ZCR (Zero Crossing Rate) and entropy, among others. We deliberately studied instruments with similar timbres in order to examine how a classifier system behaves under these specific conditions. We also examined the behaviour of these techniques on recordings unseen during training, on excerpts containing interfering elements that could skew the results for each classifier model, and on mixtures in which several of the observed classes sound in the same audio. The results indicate that the selected features carry relevant information about the timbre of each instrument evaluated (as observed for the solo recordings), although the accuracy obtained for some of the instruments was lower than expected (as observed for the duets).
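As a rough illustration of the pipeline described (MFCC features feeding a kNN classifier with k = 3), here is a hedged sketch using librosa and scikit-learn. The synthetic tones and the two made-up timbre classes merely stand in for the thesis's woodwind recordings and its fuller feature set (ZCR, entropy, etc.).

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def mfcc_features(y, sr, n_mfcc=13):
    """Mean and standard deviation of the MFCCs of one signal -- a small stand-in feature set."""
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)          # shape (n_mfcc, frames)
    return np.concatenate([m.mean(axis=1), m.std(axis=1)])

# Synthetic harmonic tones stand in for instrument recordings; real use would load labelled audio.
sr = 22050
def tone(f0, harmonics):
    t = np.linspace(0, 1.0, sr, endpoint=False)
    return sum(a * np.sin(2 * np.pi * f0 * (i + 1) * t)
               for i, a in enumerate(harmonics)).astype(np.float32)

X, y = [], []
for label, harm in [("bright_timbre", [1.0, 0.8, 0.6]), ("dark_timbre", [1.0, 0.1, 0.05])]:
    for f0 in (220, 247, 262, 294, 330):                          # five pitches per class
        X.append(mfcc_features(tone(f0, harm), sr))
        y.append(label)

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)               # k = 3, as in the thesis
print(clf.predict([mfcc_features(tone(392, [1.0, 0.7, 0.5]), sr)]))
```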
APA, Harvard, Vancouver, ISO, and other styles
37

Fisher, Julia Marie. "Classification Analytics in Functional Neuroimaging: Calibrating Signal Detection Parameters." Thesis, The University of Arizona, 2015. http://hdl.handle.net/10150/594646.

Full text
Abstract:
Classification analyses are a promising way to localize signal, especially scattered signal, in functional magnetic resonance imaging data. However, there is not yet a consensus on the most effective analysis pathway. We explore the efficacy of k-Nearest Neighbors classifiers on simulated functional magnetic resonance imaging data. We utilize a novel construction of the classification data. Additionally, we vary the spatial distribution of signal, the design matrix of the linear model used to construct the classification data, and the feature set available to the classifier. Results indicate that the k-Nearest Neighbors classifier is not sufficient under the current paradigm to adequately classify neural data and localize signal. Further exploration of the data using k-means clustering indicates that this is likely due in part to the amount of noise present in each data point. Suggestions are made for further research.
APA, Harvard, Vancouver, ISO, and other styles
38

Stiernborg, Sebastian, and Sara Ervik. "Evaluation of Machine Learning Classification Methods : Support Vector Machines, Nearest Neighbour and Decision Tree." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-209119.

Full text
Abstract:
With more and more data available, interest in and use of machine learning is growing, and so does the need for classification. Classification is an important method within machine learning for data simplification and prediction. This report evaluates three classification methods for supervised learning: Support Vector Machines (SVM) with several kernels, Nearest Neighbour (k-NN) and Decision Tree (DT). The methods were evaluated based on accuracy, precision, recall and time. The experiments were conducted on artificial data created to represent a variety of distributions, limited to only 2 features and 3 classes. Different distributions of data were chosen to challenge each classification method. The results show that the measurements for accuracy and time vary considerably across the differently distributed datasets. SVM with the RBF kernel performed better in terms of accuracy than the other classification methods. k-NN generally scored slightly lower accuracy values than SVM with the RBF kernel, but performed better on the most challenging dataset. DT is the least time-consuming algorithm and was significantly faster than the other classification methods. The only method that could compete with DT on time was k-NN, which was faster than DT for the dataset with small spread and coinciding classes. Although a clear trend can be seen in the results, the area needs to be studied further to draw a comprehensive conclusion, due to the limitations of the artificially generated datasets in this study.
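A comparable experiment can be sketched with scikit-learn; the data generator, parameters and metrics below are generic stand-ins for the study's artificial distributions, not a reproduction of them.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 2 features and 3 classes, mirroring the constraints of the study (the distributions here are generic)
X, y = make_classification(n_samples=3000, n_features=2, n_informative=2, n_redundant=0,
                           n_classes=3, n_clusters_per_class=1, class_sep=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("SVM (RBF)", SVC(kernel="rbf")),
                  ("k-NN", KNeighborsClassifier()),
                  ("Decision tree", DecisionTreeClassifier(random_state=0))]:
    t0 = time.perf_counter()
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    elapsed = time.perf_counter() - t0
    print(f"{name:13s} acc={accuracy_score(y_te, pred):.3f} "
          f"prec={precision_score(y_te, pred, average='macro'):.3f} "
          f"rec={recall_score(y_te, pred, average='macro'):.3f} time={elapsed:.3f}s")
```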
APA, Harvard, Vancouver, ISO, and other styles
39

Borén, Mirjam. "Classification of discrete stress levels in users using eye tracker and K- Nearest Neighbour algorithm." Thesis, Umeå universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-176258.

Full text
Abstract:
Head-mounted displays (HMDs) used for virtual reality (VR) have come a long way, and eye tracking is now available in some HMDs. The eyes show physiological responses when healthy individuals are stressed, justifying eye tracking as a tool for estimating, at minimum, the very presence of stress. Stress can present itself in many shapes and may be caused by different factors such as work, social situations, cognitive load and many others. The stress test Group Stroop Color Word Test (GSCWT) can induce four different levels of stress in users: no stress, low stress, medium stress and high stress. In this thesis, GSCWT was implemented in virtual reality and users had their pupil dilation and blinking rate recorded. The data were then used to train and test a K-Nearest Neighbour (KNN) algorithm. The KNN algorithm could not accurately discriminate between the four stress classes, but it could predict the presence or absence of stress. VR has been used successfully as a tool for practicing different social and everyday life skills for individuals with Autism Spectrum Disorder (ASD). By correctly identifying the stress level of the user in VR, tools for practicing social skills for individuals with ASD could be more personalized and improved.
APA, Harvard, Vancouver, ISO, and other styles
40

Sakouvogui, Kekoura. "Comparative Classification of Prostate Cancer Data using the Support Vector Machine, Random Forest, Dualks and k-Nearest Neighbours." Thesis, North Dakota State University, 2015. https://hdl.handle.net/10365/27698.

Full text
Abstract:
This paper compares four classification tools, Support Vector Machine (SVM), Random Forest (RF), DualKS and k-Nearest Neighbors (kNN), which are based on different statistical learning theories. The dataset used is a microarray gene expression dataset of 596 male patients with prostate cancer. After treatment, the patients were classified into one phenotype with three levels: PSA (Prostate-Specific Antigen), Systematic and NED (No Evidence of Disease). The purpose of this research is to determine the performance rate of each classifier by selecting the optimal kernels and parameters that give the best prediction rate of the phenotype. The paper begins with a discussion of previous implementations of the tools and their mathematical theories. The results showed that three of the classifiers achieved comparable performance that was above average, while DualKS did not. We also observed that SVM outperformed the kNN, RF and DualKS classifiers.
APA, Harvard, Vancouver, ISO, and other styles
41

Prokopová, Ivona. "Detekce fibrilace síní v EKG." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2020. http://www.nusl.cz/ntk/nusl-413170.

Full text
Abstract:
Atrial fibrillation is one of the most common cardiac rhythm disorders, characterized by ever-increasing prevalence and incidence in the Czech Republic and abroad. The incidence of atrial fibrillation is reported at 2-4 % of the population, but due to the often asymptomatic course, the real prevalence is even higher. The aim of this work is to design an algorithm for automatic detection of atrial fibrillation in ECG records. In the practical part of this work, an algorithm for the detection of atrial fibrillation is proposed. For the detection itself, the k-nearest neighbor method, the support vector machine method and a multilayer neural network were used to classify ECG signals using features indicating the variability of RR intervals and the presence of the P wave in the ECG recordings. The best detection was achieved by a model using a multilayer neural network with two hidden layers. Results of the evaluation metrics: sensitivity 91.23 %, specificity 99.20 %, PPV 91.23 %, F-measure 91.23 % and accuracy 98.53 %.
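As an illustration of the RR-interval variability features mentioned above, here is a small sketch; the thesis's exact feature set, its P-wave descriptors and its classifier configurations are not reproduced, and the measures shown (RMSSD, coefficient of variation, pNN50) are common heart-rate-variability quantities assumed for the example. Such features would then feed the k-NN, SVM or neural-network classifiers described.

```python
import numpy as np

def rr_features(r_peak_times):
    """Simple RR-interval irregularity features of the kind used as atrial fibrillation markers.

    r_peak_times : array of R-peak times in seconds, e.g. from a QRS detector.
    Returns RMSSD, the coefficient of variation of the RR intervals, and pNN50.
    """
    rr = np.diff(np.asarray(r_peak_times, dtype=float))   # RR intervals in seconds
    drr = np.diff(rr)                                     # successive differences
    rmssd = np.sqrt(np.mean(drr ** 2))
    cv = np.std(rr) / np.mean(rr)
    pnn50 = np.mean(np.abs(drr) > 0.05)                   # fraction of successive diffs > 50 ms
    return np.array([rmssd, cv, pnn50])

# toy comparison: a regular rhythm vs an irregular one
regular = np.cumsum(np.full(60, 0.80))
irregular = np.cumsum(np.random.default_rng(2).uniform(0.45, 1.1, size=60))
print("regular  :", rr_features(regular))
print("irregular:", rr_features(irregular))
```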
APA, Harvard, Vancouver, ISO, and other styles
42

Jia, Wei. "Image analysis and representation for textile design classification." Thesis, University of Dundee, 2011. https://discovery.dundee.ac.uk/en/studentTheses/c667f279-d7a6-4670-b23e-c9dbe2784266.

Full text
Abstract:
A good image representation is vital for image comparison and classification; it may affect the classification accuracy and efficiency. The purpose of this thesis was to explore novel and appropriate image representations. Another aim was to investigate these representations for image classification. Finally, novel features were examined for improving image classification accuracy. The images of interest in this thesis were textile design images. The motivation for analysing textile design images is to help designers browse images, fuel their creativity, and improve their design efficiency. In recent years, the bag-of-words model has been shown to be a good basis for image representation, and there have been many attempts to go beyond this representation. Bag-of-words models have been used frequently in the classification of image data, due to good performance and simplicity. "Words" in images can have different definitions and are obtained through steps of feature detection, feature description, and codeword calculation. The model represents an image as an orderless collection of local features. However, discarding the spatial relationships of local features limits the power of this model. This thesis exploited novel image representations, the bag of shapes and region label graphs models, which were based on the bag-of-words model. In both models, an image was represented by a collection of segmented regions, and each region was described by shape descriptors. In the latter model, graphs were constructed to capture the spatial information between groups of segmented regions, and graph features were calculated based on graph theory. Novel elements include the use of MRFs to extract printed designs and woven patterns from textile images, the use of these extractions to form bag of shapes models, and the construction of region label graphs to capture the spatial information. The extraction of textile designs was formulated as a pixel labelling problem. Algorithms for MRF optimisation and re-estimation were described and evaluated. A method for quantitative evaluation was presented and used to compare the performance of MRFs optimised using alpha-expansion and iterated conditional modes (ICM), both with and without parameter re-estimation. The results were used in the formation of the bag of shapes and region label graphs models. The bag of shapes model was a collection of MRF-segmented regions, and the shape of each region was described with generic Fourier descriptors. Each image was represented as a bag of shapes. A simple yet competitive classification scheme based on nearest neighbour class-based matching was used. Classification performance was compared to that obtained when using bags of SIFT features. To capture the spatial information, region label graphs were constructed to obtain graph features. Regions with the same label were treated as a group, and each group was associated uniquely with a vertex in an undirected, weighted graph. Each region group was represented as a bag of shape descriptors. Edges in the graph denoted either the extent to which the groups' regions were spatially adjacent or the dissimilarity of their respective bags of shapes. A series of unweighted graphs was obtained by removing edges in order of weight. Finally, an image was represented using its shape descriptors along with features derived from the chromatic numbers or domination numbers of the unweighted graphs and their complements. Linear SVM classifiers were used for classification.
Experiments were conducted on data from Liberty Art Fabrics, which consisted of more than 10,000 complicated images, mainly of printed textile designs and woven patterns. The experimental data were classified into seven classes manually by assigning each image a text descriptor based on content or design type. The seven classes were floral, paisley, stripe, leaf, geometric, spot, and check. The results showed that reasonable and interesting regions were obtained from MRF segmentation, in which alpha-expansion with parameter re-estimation performed better than alpha-expansion without parameter re-estimation or ICM. This result was promising not only for textile CAD (Computer-Aided Design), for redesigning textile images, but also for image representation. It was also found that the bag of shapes model based on MRF segmentation can obtain classification accuracy comparable to bags of SIFT features in the framework of nearest neighbour class-based matching. Finally, the results indicated that incorporating the graph features extracted from the region label graphs can improve classification accuracy compared to both the bag of shapes and bag of SIFT models.
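The nearest neighbour class-based matching scheme mentioned above can be sketched as follows; here a single pooled descriptor per image with Euclidean distance stands in for the thesis's bags of region shape descriptors and their dissimilarity measure.

```python
import numpy as np

def class_based_nn_match(query, train_descs, train_labels):
    """Assign `query` to the class whose nearest training exemplar is closest.

    query        : 1-D feature vector (e.g. a pooled bag-of-shapes descriptor)
    train_descs  : (n, d) array of training descriptors
    train_labels : length-n array of class labels
    """
    train_descs = np.asarray(train_descs, dtype=float)
    train_labels = np.asarray(train_labels)
    dists = np.linalg.norm(train_descs - query, axis=1)
    best = {}
    for lab in np.unique(train_labels):
        best[lab] = dists[train_labels == lab].min()   # class-wise nearest-neighbour distance
    return min(best, key=best.get)

# toy usage with three classes in a 4-D descriptor space
rng = np.random.default_rng(3)
descs = np.vstack([rng.normal(c, 0.3, size=(10, 4)) for c in (0.0, 1.0, 2.0)])
labels = np.repeat(["floral", "paisley", "stripe"], 10)
print(class_based_nn_match(rng.normal(1.0, 0.3, size=4), descs, labels))
```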
APA, Harvard, Vancouver, ISO, and other styles
43

van, Woerden Irene. "A statistical investigation of the risk factors for tuberculosis." Thesis, University of Canterbury. Mathematics and Statistics, 2013. http://hdl.handle.net/10092/8662.

Full text
Abstract:
Tuberculosis (TB) is called a disease of poverty and is the main cause of death from infectious diseases among adults. In 1993 the World Health Organisation (WHO) declared TB to be a global emergency; however, there were still approximately 1.4 million deaths due to TB in 2011. This thesis contains a detailed study of the existing literature regarding the global risk factors for TB. The risk factors identified from the literature review that were also available in the NFHS-3 survey were then analysed to determine how well respondents at high risk of TB could be identified. We looked at the stigma and misconceptions people have regarding TB and included detailed reports from the existing literature on how a person's wealth, health, education, nutrition, and HIV status affect how likely the person is to have TB. The differences in the risk factor distributions for the TB and non-TB populations were examined, and classification trees, nearest neighbours, and logistic regression models were trialled to determine whether respondents at high risk of TB could be identified. Finally, gender-specific statistically likely directed acyclic graphs were created to visualise the most likely associations between the variables.
APA, Harvard, Vancouver, ISO, and other styles
44

Åkerblom, Thea, and Tobias Thor. "Fraud or Not?" Thesis, Uppsala universitet, Statistiska institutionen, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-388695.

Full text
Abstract:
This paper uses statistical learning to examine and compare three different statistical methods with the aim of predicting credit card fraud. The methods compared are Logistic Regression, K-Nearest Neighbour and Random Forest. They are applied and estimated on a data set consisting of nearly 300,000 credit card transactions to determine their performance, using classification of fraud as the outcome variable. The three models all have different properties and advantages. The K-NN model performed best in this paper but has some disadvantages, since it does not explain the data but rather predicts the outcome accurately. Random Forest explains the variables but performs less precisely. The Logistic Regression model seems to be unfit for this specific data set.
APA, Harvard, Vancouver, ISO, and other styles
45

Gan, Changquan. "Une approche de classification non supervisée basée sur la notion des K plus proches voisins." Compiègne, 1994. http://www.theses.fr/1994COMP765S.

Full text
Abstract:
Unsupervised classification aims to define, within a data set, classes that characterise the internal structure of the data. It is a very useful technique in many technological domains, such as the diagnosis of complex systems (to reveal operating modes) and computer vision (for image segmentation). Traditional unsupervised classification methods present several problems in practice, for example the need to fix the number of classes in advance, the lack of an appropriate strategy for tuning parameters, and the difficulty of validating the result obtained. In this thesis we attempt to address these problems by developing a new approach based on the notion of the K nearest neighbours. Combining mode detection with the search for a graph reflecting the proximity of the data, this approach first identifies class centres and then builds a class around each centre. It uses no prior knowledge about the data and has only one parameter. A strategy for tuning this parameter was established after a theoretical study and an experimental analysis; the idea is to seek stability of the classification result. Tests presented in this dissertation show good performance of the proposed approach: it is free of assumptions about the nature of the data, relatively robust, and easy to use.
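The abstract does not spell out the exact algorithm, so the sketch below shows only a generic kNN-based mode-seeking procedure in the same spirit: local density is estimated from the distance to the k-th neighbour, class centres are local density peaks, and remaining points are attached to their nearest denser neighbour, all governed by a single parameter k.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_mode_clustering(x, k=25):
    """Generic kNN mode-seeking clustering: local density peaks become class centres and
    every other point is attached to its nearest denser neighbour."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    dist, idx = cKDTree(x).query(x, k=k + 1)        # column 0 is the point itself
    density = 1.0 / (dist[:, -1] + 1e-12)           # inverse distance to the k-th neighbour
    parent = np.arange(n)
    for i in range(n):
        neigh = idx[i, 1:]                          # k nearest neighbours, closest first
        denser = neigh[density[neigh] > density[i]]
        if denser.size:
            parent[i] = denser[0]                   # follow the closest denser neighbour
    labels = np.empty(n, dtype=int)                 # climb each chain up to its mode
    for i in range(n):
        j = i
        while parent[j] != j:
            j = parent[j]
        labels[i] = j
    return labels

rng = np.random.default_rng(4)
pts = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2))])
print(np.unique(knn_mode_clustering(pts, k=25)))    # well-separated blobs give roughly one mode each
```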
APA, Harvard, Vancouver, ISO, and other styles
46

Alsouda, Yasser. "An IoT Solution for Urban Noise Identification in Smart Cities : Noise Measurement and Classification." Thesis, Linnéuniversitetet, Institutionen för fysik och elektroteknik (IFE), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-80858.

Full text
Abstract:
Noise is defined as any undesired sound. Urban noise and its effect on citizens are a significant environmental problem, and the increasing level of noise has become a critical problem in some cities. Fortunately, noise pollution can be mitigated by better planning of urban areas or controlled by administrative regulations. However, the execution of such actions requires well-established systems for noise monitoring. In this thesis, we present a solution for noise measurement and classification using a low-power and inexpensive IoT unit. To measure the noise level, we implement an algorithm for calculating the sound pressure level in dB, achieving a measurement error of less than 1 dB. Our machine learning-based method for noise classification uses Mel-frequency cepstral coefficients for audio feature extraction and four supervised classification algorithms (support vector machine, k-nearest neighbors, bootstrap aggregating, and random forest). We evaluate our approach experimentally with a dataset of about 3000 sound samples grouped into eight sound classes (such as car horn, jackhammer, or street music). We explore the parameter space of the four algorithms to estimate the optimal parameter values for the classification of sound samples in the dataset under study. We achieve noise classification accuracy in the range of 88%–94%.
APA, Harvard, Vancouver, ISO, and other styles
47

Makki, Sara. "An Efficient Classification Model for Analyzing Skewed Data to Detect Frauds in the Financial Sector." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE1339/document.

Full text
Abstract:
There are different types of risks in the financial domain, such as terrorist financing, money laundering, credit card fraud, insurance fraud and credit risk, that may result in catastrophic consequences for entities such as banks or insurance companies. These financial risks are usually detected using classification algorithms. In classification problems, the skewed distribution of classes, also known as class imbalance, is a very common challenge in financial fraud detection, where special data mining approaches are used along with the traditional classification algorithms to tackle this issue. The class imbalance problem occurs when one of the classes has many more instances than the other, and it is even more acute in a big data context. The datasets used to build and train the models contain an extremely small portion of the minority group, known as positives, in comparison to the majority class, known as negatives. In most cases it is more delicate and crucial to correctly classify the minority group than the other group, as in fraud detection, disease diagnosis, etc. In these examples, the fraud and the disease are the minority groups, and it is more delicate to detect a fraud record, because of its dangerous consequences, than a normal one. These class proportions make it very difficult for a machine learning classifier to learn the characteristics and patterns of the minority group: classifiers will be biased towards the majority group because of its many examples in the dataset and will learn to classify it much faster than the other group. After conducting a thorough study of the challenges faced in class imbalance cases, we found that we still cannot reach an acceptable sensitivity (i.e. good classification of the minority group) without a significant decrease in accuracy. This leads to another challenge, which is the choice of performance measures used to evaluate models. In these cases the choice is not straightforward; accuracy or sensitivity alone are misleading, so we use other measures, such as the precision-recall curve or the F1-score, to evaluate the trade-off between accuracy and sensitivity. Our objective is to build an imbalanced classification model that considers the extreme class imbalance and the false alarms, in a big data framework. We developed two approaches: a Cost-Sensitive Cosine Similarity K-Nearest Neighbor (CoSKNN) as a single classifier, and a K-modes Imbalance Classification Hybrid Approach (K-MICHA) as an ensemble learning methodology. In CoSKNN, our aim was to tackle the imbalance problem by using cosine similarity as a distance metric and by introducing a cost-sensitive score for the classification using the KNN algorithm. We conducted a comparative validation experiment in which we proved the effectiveness of CoSKNN in terms of accuracy and fraud detection. On the other hand, the aim of K-MICHA is to cluster similar data points in terms of the classifiers' outputs, and then to calculate the fraud probabilities in the obtained clusters in order to use them for detecting fraud in new transactions. This approach can be used for the detection of any type of financial fraud where labelled data are available. Finally, we applied K-MICHA to credit card, mobile payment and auto insurance fraud data sets. In all three case studies, we compare K-MICHA with stacking using voting, weighted voting, logistic regression and CART. We also compared with AdaBoost and random forest. 
We prove the efficiency of K-MICHA based on these experiments. We also applied K-MICHA in a big data framework using H2O and R, and were able to process and analyse larger data sets in very little time.
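The CoSKNN idea, cosine similarity combined with a cost-sensitive score in a kNN vote, can be sketched as below; the weighting scheme and decision rule are illustrative assumptions, not the thesis's exact score.

```python
import numpy as np

def cosknn_predict(X_train, y_train, X_query, k=5, fraud_cost=5.0):
    """Sketch of a cost-sensitive cosine-similarity kNN for imbalanced fraud data.

    Neighbours are ranked by cosine similarity; minority-class (fraud = 1) neighbours are
    up-weighted by `fraud_cost` so that a few fraud neighbours can outvote many normals.
    """
    Xn = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    preds = []
    for q in X_query:
        sims = Xn @ (q / np.linalg.norm(q))          # cosine similarity to every training point
        nn = np.argsort(sims)[-k:]                   # k most similar neighbours
        fraud_score = fraud_cost * sims[nn][y_train[nn] == 1].sum()
        normal_score = sims[nn][y_train[nn] == 0].sum()
        preds.append(int(fraud_score > normal_score))
    return np.array(preds)

# toy imbalanced data: 990 normal transactions, 10 fraudulent ones
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (990, 8)), rng.normal(2.5, 1, (10, 8))])
y = np.array([0] * 990 + [1] * 10)
queries = np.vstack([rng.normal(0, 1, (3, 8)), rng.normal(2.5, 1, (3, 8))])
print(cosknn_predict(X, y, queries, k=7))
```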
APA, Harvard, Vancouver, ISO, and other styles
48

Alzubaidi, Laith. "Deep learning for medical imaging applications." Thesis, Queensland University of Technology, 2022. https://eprints.qut.edu.au/227812/1/Laith_Alzubaidi_Thesis.pdf.

Full text
Abstract:
This thesis investigated novel deep learning techniques for advanced medical imaging applications. It addressed three major research issues in employing deep learning for medical imaging: network architecture, lack of training data, and generalisation. It proposed three new frameworks for CNN network architecture and three novel transfer learning methods. The proposed solutions were tested on four different medical imaging applications, demonstrating their effectiveness and generalisation. These solutions have already been employed by the scientific community, showing excellent performance in medical imaging applications and other domains.
APA, Harvard, Vancouver, ISO, and other styles
49

Bílý, Ondřej. "Moderní řečové příznaky používané při diagnóze chorob." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2011. http://www.nusl.cz/ntk/nusl-218971.

Full text
Abstract:
This work deals with the diagnosis of Parkinson's disease by analysing the speech signal. The beginning of the work describes speech signal production. This is followed by a description of speech signal analysis, its preparation and subsequent feature extraction. Next, Parkinson's disease and the changes it causes in the speech signal are described. The following part describes the features used for the diagnosis of Parkinson's disease (FCR, VSA, VOT, etc.). Another part of the work deals with feature selection and reduction using learning algorithms (SVM, ANN, k-NN) and their subsequent evaluation. The last part of the thesis describes a program for computing the features; it further describes the selection process and the final evaluation of all the results.
APA, Harvard, Vancouver, ISO, and other styles
50

Dyremark, Johanna, and Caroline Mayer. "Bedömning av elevuppsatser genom maskininlärning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-262041.

Full text
Abstract:
Today, a large amount of a teacher's workload consists of essay scoring, and there is large variability between the grades given by different teachers. This report aims to examine what accuracy can be achieved with an automated essay scoring system for Swedish. The following three machine learning models for classification are trained and tested with 5-fold cross-validation on essays from Swedish national tests: Linear Discriminant Analysis, K-Nearest Neighbour and Random Forest. Essays are classified based on 31 language- and structure-related attributes, such as token-based length measures, similarity to texts of different formality levels and grammar-related measures. The results show a maximal quadratic weighted kappa value of 0.4829 and a grading identical to the experts' assessment in 57.53% of all tests. These results were achieved by a model based on Linear Discriminant Analysis and showed higher inter-rater reliability with expert grading than a local teacher. Despite the ongoing digitalisation within the Swedish educational system, a number of obstacles remain before essay scoring can be fully automated, such as users' attitudes towards the technology, ethical issues and the current techniques' difficulties in understanding semantics. Nevertheless, a partial integration of automatic essay scoring has the potential to effectively identify essays suitable for double grading, which can increase the consistency of large-scale tests at a low cost.
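Quadratic weighted kappa, the agreement measure reported here, is straightforward to compute with scikit-learn; the grade vectors below are made up purely to show the call, with ordinal grades encoded as integers.

```python
from sklearn.metrics import cohen_kappa_score

# hypothetical expert grades vs model grades on an ordinal scale (e.g. F..A encoded as 0..5)
expert = [0, 1, 2, 2, 3, 4, 5, 3, 2, 1]
model  = [0, 1, 2, 3, 3, 4, 4, 3, 1, 1]

qwk = cohen_kappa_score(expert, model, weights="quadratic")   # larger disagreements are penalised more
exact_agreement = sum(e == m for e, m in zip(expert, model)) / len(expert)
print(f"quadratic weighted kappa: {qwk:.4f}, exact agreement: {exact_agreement:.2%}")
```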
APA, Harvard, Vancouver, ISO, and other styles
