
Dissertations / Theses on the topic 'Data classification and machine learning'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Data classification and machine learning.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Stenekap, Daniel. "Classification of Gear-shift data using machine learning." Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-53445.

Full text
Abstract:
Today, automatic transmissions are the industrial standard in heavy-duty vehicles. However, tolerances and component wear can cause factory-calibrated gearshifts to have deviations that negatively impact clutch durability and driver comfort. An adaptive shift process could solve this problem by recognizing when pre-calibrated values are outdated. The purpose of this thesis is to examine the classification of shift types using machine learning, toward the future goal of an adaptive gearshift process. Recent papers concerning machine learning on time series are reviewed. A data set is collected and validated using hand-engineered features and unsupervised learning. Four deep neural network (DNN) models are trained on raw and normalized shift data. Three of the models show good generalization and perform with accuracies above 90%. An adaptation of the fully convolutional network (FCN) used in [1] shows promise due to its relative size and ability to learn the raw data sets. An adaptation of the multivariate long short-term memory fully convolutional network (MLSTM-FCN) used in [2] is superior on normalized data sets. This thesis shows that DNN structures can be used to distinguish between time series of shift data. However, much effort remains, since a database of shift types is necessary for this work to continue.
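As a rough illustration of the FCN family of time-series classifiers the abstract refers to, the sketch below builds a three-block one-dimensional convolutional network with global average pooling. It is a minimal PyTorch sketch, not the thesis code; the channel count (10), sequence length (512), and four shift classes are invented placeholders.

```python
import torch
import torch.nn as nn

class FCN(nn.Module):
    """Minimal fully convolutional network for multivariate time-series
    classification: three Conv1d blocks followed by global average
    pooling and a linear classifier."""
    def __init__(self, in_channels: int, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 128, kernel_size=8, padding=4),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, padding=2),
            nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(),
        )
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):           # x: (batch, channels, time)
        h = self.features(x)
        h = h.mean(dim=-1)          # global average pooling over time
        return self.classifier(h)

# Example: batches of 10-channel shift signals, 512 samples long, 4 shift types.
model = FCN(in_channels=10, n_classes=4)
logits = model(torch.randn(8, 10, 512))
print(logits.shape)  # torch.Size([8, 4])
```

Global average pooling makes the network independent of the exact recording length, which is convenient when gearshift events vary in duration.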
APA, Harvard, Vancouver, ISO, and other styles
2

Fujino, Akinori. "Machine Learning with Heterogeneous Data for Classification Problems." 京都大学 (Kyoto University), 2009. http://hdl.handle.net/2433/123832.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Teatini, Alex. "Movement trajectory classification using supervised machine learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-265009.

Full text
Abstract:
Anything that moves can be tracked, and hence its trajectory analysed. The trajectory of a moving object can carry a lot of useful information, depending on what is sought. In this work, the aim is to exploit machine learning to classify finite trajectories based on their shape. In a clinical environment, a set of trajectory classes has been defined based on relevance to particular pathologies. Furthermore, several trajectories have been collected using a depth sensor from a number of subjects. The problem to address is to evaluate whether it is possible to classify these trajectories into predefined classes. A trajectory consists of a sequentially ordered list of coordinates, which would imply temporal processing. However, following the success of machine learning in classifying images, the idea of a visual approach surfaced. On this basis, the plots of the trajectories are transformed into images, making the problem similar to a written character recognition problem. The implemented methods for this classification task are the well-known Support Vector Machine (SVM) and the Convolutional Neural Network (CNN), the most appreciated deep approach to image recognition. We find that the best way of achieving substantial performance on this classification task is to use a mixture of the two aforementioned methods, namely a two-step classification made of a binary SVM, responsible for a first distinction, followed by a CNN for the final decision. We illustrate that this tree-based approach achieves the best classification accuracy under the imposed restrictions. In conclusion, a look into possible future developments based on the exploration of novel deep learning methods is given. This project was developed during an internship at the company ‘Qinematic’.
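The pivotal step described above is rasterizing a coordinate list into an image so that image classifiers apply. Below is a minimal sketch of one way to do this; the 28x28 grid is an arbitrary choice, not the thesis's actual resolution or preprocessing.

```python
import numpy as np

def trajectory_to_image(xy: np.ndarray, size: int = 28) -> np.ndarray:
    """Rasterize a finite 2-D trajectory (N x 2 array of coordinates)
    into a size x size binary image, preserving its shape but not
    its timing."""
    xy = xy - xy.min(axis=0)                    # shift to origin
    span = xy.max(axis=0)
    span[span == 0] = 1.0                       # avoid division by zero
    scaled = xy / span * (size - 1)             # fit into the grid
    img = np.zeros((size, size), dtype=np.uint8)
    for x, y in scaled.astype(int):
        img[size - 1 - y, x] = 1                # image rows grow downward
    return img

# Example: a noisy circular movement becomes a ring-shaped image.
t = np.linspace(0, 2 * np.pi, 200)
circle = np.stack([np.cos(t), np.sin(t)], axis=1) + 0.02 * np.random.randn(200, 2)
print(trajectory_to_image(circle).sum(), "active pixels")
```

Timing information is deliberately discarded; only the shape of the movement survives, which is exactly what shape-based classes require.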
APA, Harvard, Vancouver, ISO, and other styles
4

Milne, Linda. "Machine learning for automatic classification of remotely sensed data." University of New South Wales, Computer Science & Engineering, 2008. http://handle.unsw.edu.au/1959.4/41322.

Full text
Abstract:
As more and more remotely sensed data becomes available, it is becoming increasingly hard to analyse it with the traditional labour-intensive, manual methods. The commonly used techniques, which involve expert evaluation, are widely acknowledged as providing inconsistent results at best. We need more general techniques that can adapt to a given situation and that incorporate the strengths of the traditional methods, human operators and new technologies. The difficulty in interpreting remotely sensed data is that often only a small amount of data is available for classification. It can be noisy, incomplete or contain irrelevant information. Given that the training data may be limited, we demonstrate a variety of techniques for highlighting information in the available data and show how to select the most relevant information for a given classification task. We show that more consistent results between the training data and an entire image can be obtained, and how misclassification errors can be reduced. Specifically, a new technique for attribute selection in neural networks is demonstrated. Machine learning techniques, in particular, provide us with a means of automating classification using training data from a variety of data sources, including remotely sensed data and expert knowledge. A classification framework is presented in this thesis that can be used with any classifier and any available data. While this was developed in the context of vegetation mapping from remotely sensed data using machine learning classifiers, it is a general technique that can be applied to any domain; the emphasis of its applicability is on domains that have inadequate training data available.
APA, Harvard, Vancouver, ISO, and other styles
5

Li, Ling. "Data complexity in machine learning and novel classification algorithms." Diss., Pasadena, Calif.: Caltech, 2006. http://resolver.caltech.edu/CaltechETD:etd-04122006-114210.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Montiel, López Jacob. "Fast and slow machine learning." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLT014/document.

Full text
Abstract:
The Big Data era has revolutionized the way in which data is created and processed. In this context, multiple challenges arise given the massive amount of data that needs to be efficiently handled and processed in order to extract knowledge. This thesis explores the symbiosis of batch and stream learning, which are traditionally considered in the literature as antagonists. We focus on the problem of classification from evolving data streams. Batch learning is a well-established approach in machine learning based on a finite sequence: first data is collected, then predictive models are created, then the model is applied. On the other hand, stream learning considers data as infinite, rendering the learning problem as a continuous (never-ending) task. Furthermore, data streams can evolve over time, meaning that the relationship between features and the corresponding response (class in classification) can change. We propose a systematic framework to predict over-indebtedness, a real-world problem with significant implications in modern society. The two versions of the early warning mechanism (batch and stream) outperform the baseline performance of the solution implemented by the Groupe BPCE, the second largest banking institution in France. Additionally, we introduce a scalable model-based imputation method for missing data in classification. This method casts the imputation problem as a set of classification/regression tasks which are solved incrementally. We present a unified framework that serves as a common learning platform where batch and stream methods can positively interact. We show that batch methods can be efficiently trained on the stream setting under specific conditions. The proposed hybrid solution works under the positive interactions between batch and stream methods. We also propose an adaptation of the Extreme Gradient Boosting (XGBoost) algorithm for evolving data streams. The proposed adaptive method generates and updates the ensemble incrementally using mini-batches of data. Finally, we introduce scikit-multiflow, an open source framework in Python that fills the gap in Python for a development/research platform for learning from evolving data streams.
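To make the mini-batch ensemble idea concrete, here is a toy sketch (not the thesis's XGBoost adaptation, and not the scikit-multiflow API): each mini-batch trains a new weak member, the oldest member is evicted once the ensemble is full, and accuracy is measured prequentially (test, then train) on a synthetically drifting stream.

```python
from collections import deque
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class MiniBatchEnsemble:
    """Toy incremental ensemble: each mini-batch trains a new weak
    member; the oldest member is dropped once the ensemble is full,
    so the model tracks an evolving stream."""
    def __init__(self, max_members: int = 10):
        self.members = deque(maxlen=max_members)

    def partial_fit(self, X, y):
        tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
        self.members.append(tree)

    def predict(self, X):
        votes = np.mean([m.predict(X) for m in self.members], axis=0)
        return (votes >= 0.5).astype(int)

# Simulate a drifting binary stream in mini-batches of 200 samples.
rng = np.random.default_rng(0)
model = MiniBatchEnsemble()
for step in range(50):
    X = rng.normal(size=(200, 5))
    drift = step / 50.0                       # decision boundary moves over time
    y = (X[:, 0] + drift * X[:, 1] > 0).astype(int)
    if step > 0:
        acc = (model.predict(X) == y).mean()  # prequential: test, then train
        if step % 10 == 0:
            print(f"step {step}: accuracy {acc:.2f}")
    model.partial_fit(X, y)
```

Bounding the ensemble size is what lets the model forget outdated concepts as the stream evolves.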
APA, Harvard, Vancouver, ISO, and other styles
7

He, Jin. "Robust Mote-Scale Classification of Noisy Data via Machine Learning." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1440413201.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Rosquist, Christine. "Text Classification of Human Resources-related Data with Machine Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-302375.

Full text
Abstract:
Text classification has been an important application and research subject since the origin of digital documents. Today, as more and more data are stored in the form of electronic documents, the text classification approach is even more vital. There exist various studies that apply machine learning methods such as Naive Bayes and Convolutional Neural Networks (CNN) to text classification and sentiment analysis. However, most of these studies do not focus on cross-domain classification, i.e., machine learning models that have been trained on a dataset from one context being tested on a dataset from another context. This is useful when there is not enough training data for the specific domain where text data is to be classified. This thesis investigates how the machine learning methods Naive Bayes and CNN perform when they are trained in one context and then tested in another, slightly different context. The study uses data from employee reviews to train the models, and the models are then tested both on the employee-review data and on human resources-related data. Thus, the aim of the thesis is to gain insight into how to develop a system capable of performing accurate cross-domain classification, and to provide more insights for the text classification research area in general. A comparative analysis of the models Naive Bayes and CNN was done, and the results showed that both models performed quite similarly when classifying sentences using only the employee-review data to train and test the models. However, CNN performed slightly better on multiclass classification of the employee data, which indicates that CNN might be a better model in that context. From a cross-domain perspective, Naive Bayes turned out to be the better model, since it performed better on all of the metrics evaluated. However, both models can be used as guidance tools to classify human resources-related data quickly, even if Naive Bayes performs best in the cross-domain context. The results can possibly be improved with more research and need to be verified with more data. Suggestions on how to improve the results include enhancing the hyperparameter optimization, using another approach to handle the data imbalance, and adjusting the preprocessing methods used. It is also worth noting that statistical significance could not be confirmed in all of the test cases, meaning that no absolute conclusions can be drawn, but the results from this thesis work still provide an indication of how well the models perform.
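Cross-domain evaluation of the kind described is mechanically simple: fit on source-domain text, predict on target-domain text. A minimal sketch with scikit-learn; all sentences and labels below are invented stand-ins, not data from the thesis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Source domain: employee reviews (training); target domain: HR-related
# sentences (testing). All sentences and labels here are invented.
train_texts = ["great team and flexible hours", "management ignores feedback",
               "good benefits overall", "constant overtime and stress"]
train_labels = ["positive", "negative", "positive", "negative"]

target_texts = ["the onboarding process was well organised",
                "payroll errors were never resolved"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(train_texts, train_labels)
print(clf.predict(target_texts))  # trained on one domain, applied to another
```

In a real evaluation the target predictions would be compared against labels from the second domain, which is where the cross-domain performance gap becomes visible.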
APA, Harvard, Vancouver, ISO, and other styles
9

Pehrson, Jakob, and Sara Lindstrand. "Support Unit Classification through Supervised Machine Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-281537.

Full text
Abstract:
The purpose of this article is to evaluate the impact a supervised machine learning classification model can have on the process of internal customer support within a large digitized company. Chatbots are becoming a frequently used utility among digital services, though the true general impact is not always clear. The research is separated into the following two questions: (1) Which supervised machine learning algorithm among naïve Bayes, logistic regression, and neural networks can best predict the correct support a user needs, and with what accuracy? And (2) What is the effect on productivity and customer satisfaction of using machine learning to sort customer needs? The data was collected from the internal server database of a large digital company and was then used to train and test the three classification algorithms. Furthermore, a survey was conducted with questions focused on understanding how the current system affects the involved employees. A first finding indicates that neural networks are the best-suited model for the classification task; however, when the scope and complexity were limited, naïve Bayes and logistic regression performed sufficiently well. A second finding of the study is that the classification model potentially improves productivity, given that the baseline is met. However, it is difficult to draw conclusions on the exact effects on customer satisfaction, since there are many aspects to take into account. Nevertheless, there is good potential to achieve a positive net effect.
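Comparing the three algorithm families named in question (1) takes only a few lines with scikit-learn. The sketch below uses synthetic data as a stand-in for the company's vectorized support tickets; the sample counts, feature count, and five support-unit classes are invented.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for vectorized support tickets: 2000 samples,
# 50 features, 5 support-unit classes.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=20,
                           n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "neural network": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                                    random_state=0),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: accuracy {acc:.3f}")
```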
APA, Harvard, Vancouver, ISO, and other styles
10

Amil Marletti, Pablo. "Machine learning methods for the characterization and classification of complex data." Doctoral thesis, Universitat Politècnica de Catalunya, 2020. http://hdl.handle.net/10803/668842.

Full text
Abstract:
This thesis presents novel methods for the analysis and classification of medical images and, more generally, complex data. First, an unsupervised machine learning method is proposed to order anterior chamber OCT (Optical Coherence Tomography) images according to a patient's risk of developing angle-closure glaucoma. In a second study, two outlier-finding techniques are proposed to improve the results of the above-mentioned machine learning algorithm; we also show that they are applicable to a wide variety of data, including fraud detection in credit card transactions. In a third study, the topology of the vascular network of the retina is analyzed as a complex tree-like network, and we show that structural differences reveal the presence of glaucoma and diabetic retinopathy. In a fourth study, we use a model of a laser with optical injection that presents extreme events in its intensity time series to evaluate machine learning methods for forecasting such extreme events.
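Outlier finding of the kind reused here for credit card fraud is a standard unsupervised task. The sketch below illustrates it with scikit-learn's IsolationForest on toy data; the thesis's own two techniques are not reproduced, and the feature layout is invented.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy stand-in for transaction features: a dense cluster of normal
# points plus a few anomalies.
rng = np.random.default_rng(1)
normal = rng.normal(0, 1, size=(500, 4))
outliers = rng.uniform(-8, 8, size=(10, 4))
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.02, random_state=1).fit(X)
flags = detector.predict(X)          # -1 marks suspected outliers
print("flagged:", int((flags == -1).sum()), "points")
```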
APA, Harvard, Vancouver, ISO, and other styles
11

Dos Santos, Ludovic. "Representation learning for relational data." Thesis, Paris 6, 2017. http://www.theses.fr/2017PA066480/document.

Full text
Abstract:
The increasing use of social and sensor networks generates a large quantity of data that can be represented as complex graphs. There are many tasks one can imagine on such data, from information analysis to prediction and retrieval, where the relations between graph nodes should be informative. In this thesis, we propose different models for three tasks: graph node classification, relational time series forecasting, and collaborative filtering. All the proposed models use the representation learning framework in its deterministic or Gaussian variant. First, we propose two algorithms for the heterogeneous graph labeling task, one using deterministic representations and the other Gaussian representations. Contrary to other state-of-the-art models, our solution is able to learn edge weights while simultaneously learning the representations and the classifiers. Second, we propose an algorithm for relational time series forecasting where the observations are not only correlated inside each series, but also across the different series. We use Gaussian representations in this contribution. This was an opportunity to see in which ways using Gaussian representations instead of deterministic ones is profitable. Finally, we apply the Gaussian representation learning approach to the collaborative filtering task. This is preliminary work to see if the properties of Gaussian representations found on the two previous tasks also hold for the ranking one. The goal of this work was to then generalize the approach to more relational data, and not only bipartite graphs between users and items.
APA, Harvard, Vancouver, ISO, and other styles
12

Zhao, Xiaochuang. "Ensemble Learning Method on Machine Maintenance Data." Scholar Commons, 2015. http://scholarcommons.usf.edu/etd/6056.

Full text
Abstract:
In industry, a lot of companies are facing the explosion of big data. With this much information stored, companies want to make sense of the data and use it for better decision making, especially for future prediction. A lot of money can be saved and huge revenue can be generated with the power of big data. When building statistical learning models for prediction, companies in industry aim to build models with efficiency and high accuracy. After the learning models have been deployed to production, new data will be generated. With the updated data, the models have to be updated as well. Because of this, the model that performs best today will not necessarily perform the same tomorrow. Thus, it is very hard to decide which algorithm should be used to build the learning model. This paper introduces a new method that ensembles the information generated by two different classification statistical learning algorithms as inputs for another learning model, to increase the final prediction power. The dataset used in this paper is NASA's Turbofan Engine Degradation data. There are 49 numeric features (X), and the response Y is binary, with 0 indicating the engine is working properly and 1 indicating engine failure. The model's purpose is to predict whether the engine is going to pass or fail. The dataset is divided into a training set and a testing set. First, the training set is used to build support vector machine (SVM) and neural network models. Second, the trained SVM and neural network models take X of the training set as input to predict Y1 and Y2. Then, Y1 and Y2 are taken as inputs to build the penalized logistic regression model, which is the ensemble model here. Finally, the same steps are followed with the testing set to get the final prediction result. The model accuracy is calculated using overall classification accuracy. The result shows that the ensemble model has 92% accuracy. The prediction accuracies of the SVM, neural network and ensemble models are compared to show that the ensemble model successfully captured the power of the two individual learning models.
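The procedure described is a two-level stacking ensemble, and can be sketched directly with scikit-learn. The data below is synthetic (49 features mimic the turbofan set's shape; everything else is a placeholder), so this illustrates the wiring, not the paper's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the engine data: binary pass/fail with 49 features.
X, y = make_classification(n_samples=1500, n_features=49, n_informative=15,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: train the two base learners on the training set.
svm = SVC(probability=True, random_state=0).fit(X_tr, y_tr)
nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                   random_state=0).fit(X_tr, y_tr)

# Step 2: their predicted probabilities become the meta-features Y1, Y2.
def meta_features(X):
    return np.column_stack([svm.predict_proba(X)[:, 1],
                            nn.predict_proba(X)[:, 1]])

# Step 3: a penalized (L2) logistic regression is the ensemble model.
ensemble = LogisticRegression(penalty="l2", C=1.0).fit(meta_features(X_tr), y_tr)
print("ensemble accuracy:", ensemble.score(meta_features(X_te), y_te))
```

One caveat: fitting the meta-model on in-sample base predictions, as the abstract describes, risks leakage; a common refinement is to feed the meta-model out-of-fold predictions instead.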
APA, Harvard, Vancouver, ISO, and other styles
13

Huss, Jakob. "Cross Site Product Page Classification with Supervised Machine Learning." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189555.

Full text
Abstract:
This work outlines a possible technique for identifying web pages that contain product specifications. Using support vector machines, a product web page classifier was constructed and tested with various settings. The final result for this classifier was 0.958 in precision and 0.796 in recall for product pages. The scores imply that the method could be considered a valid technique in real-world web classification tasks if additional features and more data were made available.
APA, Harvard, Vancouver, ISO, and other styles
14

Stakovska, Meri. "Improving search results with machine learning : Classifying multi-source data with supervised machine learning to improve search results." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-75598.

Full text
Abstract:
Sony’s Support Application team wanted an experiment conducted by which they could determine whether Machine Learning was suitable for improving the quantity and quality of search results of the in-application search tool. By improving the quantity and quality of the results, the team wanted to improve the customer’s journey. A supervised machine learning model was created to classify articles into four categories: Wi-Fi & Connectivity, Apps & Settings, System & Performance, and Battery Power & Charging. The same model was used to create a service that categorized the search terms into one of the four categories. The classified articles and the classified search terms were used to complement the existing search tool. The baseline for the experiment was the result of the search tool without classification. The results of the experiment show that the number of articles did indeed increase, but, due mainly to the broadness of the categories, the search results were of low quality.
APA, Harvard, Vancouver, ISO, and other styles
15

Jiang, Fuhua. "SVM-Based Negative Data Mining to Binary Classification." Digital Archive @ GSU, 2006. http://digitalarchive.gsu.edu/cs_diss/8.

Full text
Abstract:
The properties of a training data set, such as size, distribution and the number of attributes, significantly contribute to the generalization error of a learning machine. A poorly distributed data set is prone to lead to a partially overfitted model. Two approaches proposed in this dissertation for binary classification enhance useful data information by mining negative data. First, an error-driven compensating hypothesis approach is based on Support Vector Machines (SVMs) with (1+k)-iteration learning, where the base learning hypothesis is iteratively compensated k times. This approach produces a new hypothesis on the new data set, in which each label is a transformation of the label from the negative data set, further producing the positive and negative child data subsets in subsequent iterations. This procedure refines the base hypothesis by the k child hypotheses created in k iterations. A prediction method is also proposed to trace the relationship between negative subsets and the testing data set by a vector similarity technique. Second, a statistical negative example learning approach based on theoretical analysis improves the performance of the base learning algorithm learner by creating one or two additional hypotheses, audit and booster, to mine the negative examples output from the learner. The learner employs a regular Support Vector Machine to classify main examples and recognize which examples are negative. The audit works on the negative training data created by the learner to predict whether an instance is negative. However, the boosting learner booster is applied when audit does not have enough accuracy to judge learner correctly. Booster works on training data subsets on which learner and audit do not agree. The classifier for testing is the combination of learner, audit and booster: for a specific instance, it returns the learner's result if audit acknowledges learner's result or learner agrees with audit's judgment, and otherwise returns the booster's result. The error of the classifier is decreased to O(e^2), compared to the error O(e) of a base learning algorithm.
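The combination rule in the last step can be written down compactly. The sketch below is a simplified reading of that rule, with stub classifiers standing in for the trained learner, audit and booster (the stubs and their predict API are assumptions, not the dissertation's code).

```python
def combined_predict(x, learner, audit, booster):
    """Simplified reading of the rule above: trust the base learner
    when the audit model agrees with it; otherwise defer to the booster,
    which was trained on the examples where learner and audit disagree.
    All three arguments are assumed to expose predict(x) -> label."""
    y_learner = learner.predict(x)
    y_audit = audit.predict(x)
    if y_learner == y_audit:
        return y_learner
    return booster.predict(x)

class _Stub:
    """Hypothetical classifier stub so the rule can be exercised standalone."""
    def __init__(self, label):
        self.label = label
    def predict(self, x):
        return self.label

print(combined_predict([0.1], _Stub(1), _Stub(1), _Stub(0)))  # agreement -> 1
print(combined_predict([0.1], _Stub(1), _Stub(0), _Stub(0)))  # disagreement -> booster -> 0
```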
APA, Harvard, Vancouver, ISO, and other styles
16

Langlet, Jonatan. "Towards Machine Learning Inference in the Data Plane." Thesis, Karlstads universitet, Institutionen för matematik och datavetenskap (from 2013), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-72875.

Full text
Abstract:
Recently, machine learning has been considered an important tool for various networking-related use cases such as intrusion detection, flow classification, etc. Traditionally, machine learning based classification algorithms run on dedicated machines that are outside of the fast path, e.g. on Deep Packet Inspection boxes, etc. This imposes additional latency in order to detect threats or classify the flows. With the recent advance of programmable data planes, implementing advanced functionality directly in the fast path is now a possibility. In this thesis, we propose to implement Artificial Neural Network inference together with flow metadata extraction directly in the data plane of P4 programmable switches, routers, or Network Interface Cards (NICs). We design a P4 pipeline and optimize the memory and computational operations for our data plane target, a programmable NIC with Micro-C external support. The results show that neural networks of a reasonable size (i.e. 3 hidden layers with 30 neurons each) can process flows totaling over a million packets per second, while the packet latency impact from extracting a total of 46 features is 1.85 µs.
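Programmable data-plane targets generally lack floating-point arithmetic, so neural network inference there is usually done in fixed point. The following sketch mimics an integer-only forward pass of a 46-input, three-hidden-layer (30 neurons each) network in Python; the scale factor, output class count, and random weights are placeholders, and the real implementation would live in P4/Micro-C.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0)

SCALE = 256

def quantize(w, scale=SCALE):
    """Store values as integers (w * scale); data-plane targets lack
    floating point, so inference must use integer math."""
    return np.round(w * scale).astype(np.int32)

rng = np.random.default_rng(0)
# 46 input features and 3 hidden layers of 30 neurons, as in the thesis;
# the random weights and the 4 output classes are placeholders.
shapes = [(46, 30), (30, 30), (30, 30), (30, 4)]
weights = [quantize(rng.normal(0, 0.1, s)) for s in shapes]

def forward(features_int):
    """Integer-only forward pass: multiply-accumulate, then rescale."""
    h = features_int
    for i, W in enumerate(weights):
        h = h @ W // SCALE                  # rescale after each layer
        if i < len(weights) - 1:
            h = relu(h)
    return int(np.argmax(h))                # predicted traffic class

flow_features = quantize(rng.random(46))    # stand-in for extracted metadata
print("class:", forward(flow_features))
```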
APA, Harvard, Vancouver, ISO, and other styles
17

McClintick, Kyle W. "Training Data Generation Framework For Machine-Learning Based Classifiers." Digital WPI, 2018. https://digitalcommons.wpi.edu/etd-theses/1276.

Full text
Abstract:
In this thesis, we propose a new framework for the generation of training data for machine learning techniques used for classification in communications applications. Machine learning-based signal classifiers do not generalize well when training data does not describe the underlying probability distribution of real signals. The simplest way to accomplish statistical similarity between training and testing data is to synthesize training data passed through a permutation of plausible forms of noise. To accomplish this, a framework is proposed that implements arbitrary channel conditions and baseband signals. A dataset generated using the framework is considered, and is shown to be appropriately sized by having 11% lower entropy than state-of-the-art datasets. Furthermore, unsupervised domain adaptation can allow for powerful generalized training via deep feature transforms on unlabeled evaluation-time signals. A novel Deep Reconstruction-Classification Network (DRCN) application is introduced, which attempts to maintain near-peak signal classification accuracy despite dataset bias, or perturbations on testing data unforeseen in training. Together, feature transforms and diverse training data generated from the proposed framework, teaching a range of plausible noise, can train a deep neural net to classify signals well in many real-world scenarios despite unforeseen perturbations.
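Synthesizing training signals under a sweep of channel conditions, as the framework prescribes, can be illustrated in a few lines of NumPy. QPSK and additive white Gaussian noise are just one plausible signal/channel pairing; the burst length, SNR grid, and sample counts are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def qpsk_symbols(n):
    """Random QPSK baseband symbols (one plausible signal family)."""
    bits = rng.integers(0, 4, size=n)
    return np.exp(1j * (np.pi / 4 + np.pi / 2 * bits))

def awgn(signal, snr_db):
    """Add white Gaussian noise at the requested SNR."""
    power = np.mean(np.abs(signal) ** 2)
    noise_power = power / 10 ** (snr_db / 10)
    noise = np.sqrt(noise_power / 2) * (rng.standard_normal(signal.shape)
                                        + 1j * rng.standard_normal(signal.shape))
    return signal + noise

# Generate training examples over a sweep of channel conditions, so the
# classifier sees the spread of noise it may meet at evaluation time.
dataset = [(awgn(qpsk_symbols(128), snr), snr) for snr in range(0, 21, 2)
           for _ in range(100)]
print(len(dataset), "labelled bursts")
```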
APA, Harvard, Vancouver, ISO, and other styles
18

Tang, Fung Michael (鄧峰). "Sequence classification and melody tracks selection." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2001. http://hub.hku.hk/bib/B29742973.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Tang, Fung Michael. "Sequence classification and melody tracks selection /." Hong Kong : University of Hong Kong, 2001. http://sunzi.lib.hku.hk/hkuto/record.jsp?B25017470.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Anne, Chaitanya. "Advanced Text Analytics and Machine Learning Approach for Document Classification." ScholarWorks@UNO, 2017. http://scholarworks.uno.edu/td/2292.

Full text
Abstract:
Text classification is used in information extraction and retrieval from a given text, and it has been considered an important step in managing the vast and expanding number of records available in digital form. This thesis addresses the problem of classifying patent documents into fifteen different categories or classes, where some classes overlap with others for practical reasons. For the development of the classification model using machine learning techniques, useful features have been extracted from the given documents. The features are used to classify patent documents as well as to generate useful tag-words. The overall objective of this work is to systematize NASA's patent management by developing a set of automated tools that can assist NASA in managing and marketing its portfolio of intellectual properties (IP), and enable easier discovery of relevant IP by users. We have identified an array of methods that can be applied, such as k-Nearest Neighbors (kNN), two variations of the Support Vector Machine (SVM) algorithm, and two tree-based classification algorithms: Random Forest and J48. The major research steps in this work consist of filtering techniques for variable selection, information gain and feature correlation analysis, and training and testing potential models using effective classifiers. Further, the obstacles associated with the imbalanced data were mitigated by adding synthetic data wherever appropriate, which resulted in a superior SVM-classifier-based model.
APA, Harvard, Vancouver, ISO, and other styles
21

Li, Sichu. "Application of Machine Learning Techniques for Real-time Classification of Sensor Array Data." ScholarWorks@UNO, 2009. http://scholarworks.uno.edu/td/913.

Full text
Abstract:
There is a significant need to identify approaches for classifying chemical sensor array data with high success rates that would enhance sensor detection capabilities. The present study attempts to fill this need by investigating six machine learning methods to classify a dataset collected using a chemical sensor array: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Classification and Regression Trees (CART), Random Forest (RF), Naïve Bayes Classifier (NB), and Principal Component Regression (PCR). A total of 10 predictors that are associated with the response from 10 sensor channels are used to train and test the classifiers. A training dataset of 4 classes containing 136 samples is used to build the classifiers, and a dataset of 4 classes with 56 samples is used for testing. The results generated with the six different methods are compared and discussed. The RF, CART, and KNN are found to have success rates greater than 90%, and to outperform the other methods.
APA, Harvard, Vancouver, ISO, and other styles
22

Atallah, Louis N. "Learning from sonar data for the classification of underwater seabeds." Thesis, University of Oxford, 2005. http://ora.ox.ac.uk/objects/uuid:11a17b77-6e17-409e-9a6e-d19c13b86709.

Full text
Abstract:
The increased use of sonar surveys for both industrial and leisure activities has motivated research into cost-effective, automated processes for seabed classification. Seabed classification is essential for many fields, including dredging, environmental studies, fisheries research, pipeline and cable route surveys, marine archaeology and automated underwater vehicles. The advancement in both sonar technology and sonar data storage has led to large quantities of sonar data being collected per survey. The challenge, however, is to derive relevant features that can summarise these large amounts of data and provide discrimination between the several seabed types present in each survey. The main aim of this work is to classify sidescan bathymetric datasets. However, in most sidescan bathymetric surveys, only a few ground-truthed areas (if any) are available. Since sidescan ‘ground-truthed’ areas were also provided for this work, they were used to test feature extraction, selection and classification algorithms. Backscattering amplitude, after using bathymetric data to correct for variations, did not provide enough discrimination between sediment classes in this work, which led to the investigation of other features. The variation of backscattering amplitude at different scales corresponds to variations in both micro-bathymetry and large-scale bathymetry. A method that can derive multiscale features from signals was needed, and the wavelet method proved to be an efficient way of doing so. Wavelets are used for feature extraction in 1-D sidescan bathymetry survey data, and both the feature selection and classification stages are automated. The method is tested on areas of known types, and in general the features show good correlation with sediment types in both kinds of survey. The main disadvantage of this method, however, is that signal features are calculated per swathe (or received signal), so sediment boundaries within the same swathe are not detected. To solve this problem, information present in consecutive pings of data can be used, leading to 2-D feature extraction. Several textural classification methods are investigated for the segmentation of sidescan sonar images, including 2-D wavelets and Gabor filters. Effects of filter orientation, filter scale and window size are observed in both cases, and validated on given sonar images. For sidescan bathymetric datasets, a novel method of classification using both sidescan images and depth maps is investigated. Backscattering amplitude and bathymetry images are both used for feature extraction. Features include amplitude-dependent features, textural features and bathymetric variation features. The method makes use of grab samples available in given areas of the survey for training the classifiers; alternatively, clustering techniques are used to group the data. The results of applying the method to sidescan bathymetric surveys correlate with the available grab samples as well as the user-classified areas. An automatic method for sidescan bathymetric classification offers a cost-effective approach to classifying large areas of seabed with a smaller number of grab samples. This work sheds light on areas of feature extraction, selection and classification of sonar data.
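Multiscale wavelet features of the kind described can be computed with the PyWavelets package: decompose each swathe and keep the energy per level, so one vector captures both micro-bathymetry and larger-scale relief. A minimal sketch; the wavelet family, level count and toy signals are arbitrary choices, not the thesis's settings.

```python
import numpy as np
import pywt

def wavelet_features(signal, wavelet="db4", level=4):
    """Summarize a sonar return at several scales: energy of the
    coefficients per decomposition level, one simple way to capture
    variation at multiple spatial scales in a single feature vector."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([np.sum(c ** 2) for c in coeffs])

# Toy swathes: a rough, noisy seabed return vs. a smooth one.
t = np.linspace(0, 1, 1024)
rough = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(1024)
smooth = np.sin(2 * np.pi * 5 * t)
print("rough :", wavelet_features(rough).round(1))
print("smooth:", wavelet_features(smooth).round(1))
```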
APA, Harvard, Vancouver, ISO, and other styles
23

Roth, Robin, and Martin Lundblad. "An Evaluation of Machine Learning Approaches for Hierarchical Malware Classification." Thesis, Blekinge Tekniska Högskola, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-18260.

Full text
Abstract:
With the ever-growing threat of new malware that keeps increasing in both number and complexity, the need for improvement in automatic detection and classification of malware is rising. The signature-based approaches used by several anti-virus companies struggle with the increasing amount of polymorphic malware, which changes minor aspects of its code to remain undetected. Malware classification using machine learning has been used to try to solve this issue in previous research. In the proposed work, different hierarchical machine learning approaches are implemented to conduct three experiments. The methods utilise a hierarchical structure in various ways to obtain better classification performance. A selection of hierarchical levels and machine learning models is used in the experiments to evaluate how the results are affected. A data set is created containing over 90000 different labelled malware samples. The proposed work also includes the creation of a labelling method that can be helpful for researchers in malware classification who need labels for a created data set. The feature vector used contains 500 n-gram features and 3521 Import Address Table features. The experiments test four machine learning models and three different numbers of hierarchical levels. Stratified 5-fold cross-validation is used to reduce bias and variance in the results. The results show that the classification approach achieves the highest hF-score, 0.858228, using Random Forest (RF) as the machine learning model with four hierarchical levels. To be able to compare the proposed work with other related work, pure-flat classification accuracy was also generated. The highest generated accuracy score was 0.8512816, which was not the highest compared to other related work.
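A common way to realize hierarchical classification of this sort is a local classifier per level: one model picks the coarse family, and a per-family model picks the fine class. The sketch below shows that pattern with Random Forests on synthetic data; the two-level hierarchy and all sizes are invented, and the thesis's n-gram/IAT features are not reproduced.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Invented two-level hierarchy: coarse family (0/1), fine class within it.
X, fine = make_classification(n_samples=2000, n_features=30, n_informative=12,
                              n_classes=4, random_state=0)
coarse = fine // 2                  # classes {0,1} -> family 0, {2,3} -> family 1
X_tr, X_te, f_tr, f_te, c_tr, c_te = train_test_split(
    X, fine, coarse, test_size=0.3, random_state=0)

top = RandomForestClassifier(random_state=0).fit(X_tr, c_tr)        # level 1
per_family = {
    fam: RandomForestClassifier(random_state=0).fit(X_tr[c_tr == fam],
                                                    f_tr[c_tr == fam])
    for fam in (0, 1)                                               # level 2
}

pred_fam = top.predict(X_te)
pred_fine = np.array([per_family[fam].predict(x.reshape(1, -1))[0]
                      for fam, x in zip(pred_fam, X_te)])
print("fine-level accuracy:", (pred_fine == f_te).mean())
```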
APA, Harvard, Vancouver, ISO, and other styles
24

Awodokun, Olugbenga. "Classification of Patterns in Streaming Data Using Clustering Signatures." University of Cincinnati / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1504880155623189.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Reynen, Andrew. "Supervised Machine Learning on a Network Scale: Application to Seismic Event Detection and Classification." Thesis, Université d'Ottawa / University of Ottawa, 2017. http://hdl.handle.net/10393/36867.

Full text
Abstract:
A new method using a machine learning technique is applied to event classification and detection at seismic networks. This method is applicable to a variety of network sizes and settings. The algorithm makes use of a small catalogue of known observations across the entire network. Two attributes, the polarization and frequency content, are used as input to regression. These attributes are extracted at predicted arrival times for P and S waves using only an approximate velocity model, as the attributes are calculated over large time spans. This method of waveform characterization is shown to be able to distinguish between blasts and earthquakes with 99 per cent accuracy using a network of 13 stations located in Southern California. The combination of machine learning with generalized waveform features is further applied to event detection in Oklahoma, United States. The event detection algorithm makes use of a pair of unique seismic phases to locate events, with a precision directly related to the sampling rate of the generalized waveform features. Over a week of data from 30 stations in Oklahoma is used to automatically detect 25 times more events than the catalogue of the local geological survey, with a false detection rate of less than 2 per cent. This method provides a highly confident way of detecting and locating events. Furthermore, a large number of seismic events can be automatically detected with low false alarm rates, allowing for a larger automatic event catalogue with a high degree of trust.
APA, Harvard, Vancouver, ISO, and other styles
26

Börthas, Lovisa, and Jessica Krange Sjölander. "Machine Learning Based Prediction and Classification for Uplift Modeling." Thesis, KTH, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-266379.

Full text
Abstract:
The desire to model the true gain from targeting an individual for marketing purposes has led to the common use of uplift modeling. Uplift modeling requires the existence of a treatment group as well as a control group, and the objective hence becomes estimating the difference between the success probabilities in the two groups. Efficient methods for estimating the probabilities in uplift models are statistical machine learning methods. In this project, the uplift modeling approaches Subtraction of Two Models, Modeling Uplift Directly and the Class Variable Transformation are investigated. The statistical machine learning methods applied are Random Forests and Neural Networks, along with the standard method Logistic Regression. The data is collected from a well-established retail company, and the purpose of the project is thus to investigate which uplift modeling approach and statistical machine learning method yield the best performance given the data used in this project. The variable selection step was shown to be a crucial component in the modeling processes, as was the amount of control data in each data set. For the uplift to be successful, the method of choice should be either Modeling Uplift Directly using Random Forests, or the Class Variable Transformation using Logistic Regression. Neural network-based approaches are sensitive to uneven class distributions and are hence not able to obtain stable models given the data used in this project. Furthermore, the Subtraction of Two Models did not perform well, because each model tended to focus too much on modeling the class in both data sets separately instead of modeling the difference between the class probabilities. The conclusion is hence to use an approach that models the uplift directly, and also to use a great amount of control data in each data set.
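The Class Variable Transformation mentioned here has a neat closed form: with a 50/50 randomized treatment assignment, define Z = 1 exactly when treatment and response match (treated responders and control non-responders); then uplift(x) = 2 * P(Z=1|x) - 1, so any probabilistic classifier estimates uplift. A sketch on synthetic data (the response model below is invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10000
X = rng.normal(size=(n, 3))
treated = rng.integers(0, 2, size=n)           # 50/50 randomized split
# Synthetic response: feature 0 drives the baseline, feature 1 the uplift.
p = 1 / (1 + np.exp(-(X[:, 0] + treated * X[:, 1])))
y = rng.random(n) < p

# Class Variable Transformation: Z = 1 iff treatment matches response,
# i.e. treated responders and control non-responders; with P(T) = 1/2,
# uplift(x) = 2 * P(Z = 1 | x) - 1.
z = (treated == y.astype(int)).astype(int)
model = LogisticRegression().fit(X, z)
uplift = 2 * model.predict_proba(X)[:, 1] - 1
print("mean predicted uplift:", round(float(uplift.mean()), 3))
```

The appeal of this transformation is that a single standard classifier replaces the two separate models of the subtraction approach, which is consistent with the thesis's finding that modeling the difference directly is more stable.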
APA, Harvard, Vancouver, ISO, and other styles
27

Sopova, Oleksandra. "Domain adaptation for classifying disaster-related Twitter data." Kansas State University, 2017. http://hdl.handle.net/2097/35388.

Full text
Abstract:
Master of Science, Department of Computing and Information Sciences, Doina Caragea
Machine learning is the subfield of artificial intelligence that gives computers the ability to learn without being explicitly programmed, as it was defined by Arthur Samuel, the American pioneer in the field of computer gaming and artificial intelligence who was born in Emporia, Kansas. Supervised machine learning is focused on building predictive models given labeled training data. Data may come from a variety of sources, for instance, social media networks. In our research, we use Twitter data, specifically user-generated tweets about disasters such as floods, hurricanes, terrorist attacks, etc., to build classifiers that could help disaster management teams identify useful information. A supervised classifier trained on data (training data) from a particular domain (i.e. disaster) is expected to give accurate predictions on unseen data (testing data) from the same domain, assuming that the training and test data have similar characteristics. Labeled data is not easily available for a current target disaster. However, labeled data from a prior source disaster is presumably available, and can be used to learn a supervised classifier for the target disaster. Unfortunately, the source disaster data and the target disaster data may not share the same characteristics, and the classifier learned from the source may not perform well on the target. Domain adaptation techniques, which use unlabeled target data in addition to labeled source data, can be used to address this problem. We study single-source and multi-source domain adaptation techniques using a Naïve Bayes classifier. Experimental results on Twitter datasets corresponding to six disasters show that domain adaptation techniques improve the overall performance as compared to basic supervised learning classifiers. Domain adaptation is crucial for many machine learning applications, as it enables the use of unlabeled data in domains where labeled data is not available.
APA, Harvard, Vancouver, ISO, and other styles
28

Svensson, Patrik. "Machine learning techniques for binary classification of microarray data with correlation-based gene selection." Thesis, Uppsala universitet, Statistiska institutionen, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-302402.

Full text
Abstract:
Microarray analysis has made it possible to predict clinical outcomes or diagnose patients with the help of biological data such as biomarkers or gene expressions. The data from microarrays are however characterized by high dimensionality and sparsity, so that traditional statistical methods are difficult to use, and machine learning algorithms are therefore applied for classification and prediction. In this thesis, five different machine learning algorithms were applied on four different microarray datasets from cancer studies and evaluated in terms of cross-validation performance and classification accuracy. A correlation-based gene selection method was also applied in order to reduce the number of genes, with the aim of improving the accuracy of the algorithms. The findings of the thesis imply that the algorithms elastic net and nearest shrunken centroid perform best on datasets with no gene selection, while support vector machine and random forest perform well on the reduced datasets with gene selection. However, no machine learning algorithm can be said to consistently outperform any of the others, and the nature of the dataset seems to be a more important influence on the performance of the algorithm. The correlation-based gene selection method did however improve the prediction accuracy of all the models by removing irrelevant genes.
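A rough sketch of a correlation-based gene filter ahead of an SVM, on synthetic microarray-like data; the data shapes, the cutoff of 50 genes and the injected signal are assumptions for illustration, not the thesis's setup.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 80, 2000                       # few samples, many genes
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)
X[:, :10] += y[:, None] * 1.5         # 10 informative genes

# Correlation-based filter: rank genes by |corr(gene, class)| and keep
# the strongest k before fitting the classifier.
# NOTE: for an unbiased estimate the filter should be refit inside each
# CV fold; it is applied once here only to keep the sketch short.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])
top = np.argsort(corr)[::-1][:50]

acc_full = cross_val_score(SVC(), X, y, cv=5).mean()
acc_sel = cross_val_score(SVC(), X[:, top], y, cv=5).mean()
print(f"all genes: {acc_full:.2f}, selected genes: {acc_sel:.2f}")
```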
APA, Harvard, Vancouver, ISO, and other styles
29

Alsaad, Amal. "Enhanced root extraction and document classification algorithm for Arabic text." Thesis, Brunel University, 2016. http://bura.brunel.ac.uk/handle/2438/13510.

Full text
Abstract:
Many text extraction and classification systems have been developed for English and other international languages, most of which are based on Roman letters. However, Arabic is a difficult language with special rules and morphology, and not many systems have been developed for Arabic text categorization. Arabic is one of the Semitic languages, with a morphology more complicated than that of English. Due to this complex morphology, pre-processing routines are needed to extract the roots of the words and then classify them according to groups of acts or meaning. In this thesis, a system has been developed and tested for text classification. The system is based on two stages: the first is to extract the roots from text, and the second is to classify the text according to predefined categories. The linguistic root extraction stage is composed of two main phases. The first phase handles removal of affixes, including prefixes, suffixes and infixes. Prefixes and suffixes are removed depending on the length of the word, while its morphological pattern is checked after each deduction to remove infixes. In the second phase, the root extraction algorithm is formulated to handle weak, defined, eliminated-long-vowel and two-letter geminated words, as there is a substantial amount of irregular Arabic words in texts. Once the roots are extracted, they are checked against a predefined list of 3800 triliteral and 900 quadliteral roots. A series of experiments has been conducted to improve and test the performance of the proposed algorithm. The obtained results revealed that the developed algorithm has better accuracy than the existing stemming algorithm. The second stage is the document classification stage, in which two non-parametric classifiers are tested, namely Artificial Neural Networks (ANN) and Support Vector Machine (SVM). The system is trained on 6 categories: culture, economy, international, local, religion and sports, using 80% of the available data. From each category, the 10 most frequent terms are selected as features. Testing of the classification algorithms is done on the remaining 20% of the documents. The results of ANN and SVM are compared to the standard method used for text classification, the term frequency-based method. Results show that ANN and SVM have better accuracy (80-90%) compared to the standard method (60-70%). The proposed method proves the ability to categorize Arabic text documents into the appropriate categories with a high precision rate.
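A small sketch of the feature selection step described above, picking the top frequent terms of each category; the placeholder corpus and the cutoff of 3 terms (10 in the thesis) are invented, and real input would be the extracted Arabic roots.

```python
from collections import Counter

# Toy corpus: category -> documents (Arabic roots replaced by placeholders).
corpus = {"sports": ["match goal team goal win", "team match league"],
          "economy": ["market trade bank market", "bank loan trade"]}

# As in the thesis: pick the top frequent terms of each category as its
# feature set (10 in the thesis, 3 here for brevity).
features = {cat: [w for w, _ in
                  Counter(" ".join(docs).split()).most_common(3)]
            for cat, docs in corpus.items()}
print(features)
```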
APA, Harvard, Vancouver, ISO, and other styles
30

Lan, Liang. "Data Mining Algorithms for Classification of Complex Biomedical Data." Diss., Temple University Libraries, 2012. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/214773.

Full text
Abstract:
Computer and Information Science
Ph.D.
In my dissertation, I will present my research, which contributes to solving the following three open problems from biomedical informatics: (1) multi-task approaches for microarray classification; (2) multi-label classification of gene and protein prediction from multi-source biological data; (3) spatial scan for movement data. In microarray classification, samples belong to several predefined categories (e.g., cancer vs. control tissues) and the goal is to build a predictor that classifies a new tissue sample based on its microarray measurements. When faced with small-sample high-dimensional microarray data, most machine learning algorithms would produce an overly complicated model that performs well on training data but poorly on new data. To reduce the risk of over-fitting, feature selection becomes an essential technique in microarray classification. However, standard feature selection algorithms are bound to underperform when the size of the microarray data is particularly small. The best remedy is to borrow strength from external microarray datasets. In this dissertation, I will present two new multi-task feature filter methods which can improve the classification performance by utilizing the external microarray data. The first method aggregates the feature selection results from multiple microarray classification tasks. The resulting multi-task feature selection can be shown to improve the quality of the selected features and lead to higher classification accuracy. The second method jointly selects a small gene set with maximal discriminative power and minimal redundancy across multiple classification tasks by solving an objective function with integer constraints. In the protein function prediction problem, gene functions are predicted from a predefined set of possible functions (e.g., the functions defined in the Gene Ontology). Gene function prediction is a complex classification problem characterized by the following aspects: (1) a single gene may have multiple functions; (2) the functions are organized in a hierarchy; (3) unbalanced training data for each function (many fewer positive than negative examples); (4) missing class labels; (5) availability of multiple biological data sources, such as microarray data, genome sequence and protein-protein interactions. As participants in the 2011 Critical Assessment of Function Annotation (CAFA) challenge, our team achieved the highest AUC accuracy among 45 groups. In the competition, we gained by focusing on the 5th aspect of the problem. Thus, in this dissertation, I will discuss several schemes to integrate the prediction scores from multiple data sources and show their results. Interestingly, the experimental results show that a simple averaging integration method is competitive with other state-of-the-art data integration methods. The original spatial scan algorithm is used for detection of spatial overdensities: the discovery of spatial subregions with significantly higher scores according to some density measure. This algorithm is widely used in identifying clusters of disease cases (e.g., identifying environmental risk factors for child leukemia). However, the original spatial scan algorithm only works on static spatial data. In this dissertation, I will propose one possible solution for spatial scan on movement data.
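The simple averaging integration mentioned above can be sketched in a few lines; the source names and scores below are invented placeholders.

```python
import numpy as np

# Prediction scores for the same genes/functions from three sources
# (e.g. microarray, sequence, protein-protein interactions).
scores = {
    "microarray": np.array([0.9, 0.2, 0.6]),
    "sequence":   np.array([0.7, 0.4, 0.5]),
    "ppi":        np.array([0.8, 0.1, 0.7]),
}

# The averaging integration found to be competitive: the combined
# score is simply the mean over the available sources.
combined = np.mean(list(scores.values()), axis=0)
print(combined)   # [0.8, 0.233..., 0.6]
```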
Temple University--Theses
APA, Harvard, Vancouver, ISO, and other styles
31

Bušo, Bohumír. "Porovnanie metód machine learningu pre analýzu kreditného rizika." Master's thesis, Vysoká škola ekonomická v Praze, 2015. http://www.nusl.cz/ntk/nusl-207120.

Full text
Abstract:
Recently, machine learning has increasingly been connected with the field known as "Big Data". In this field, a lot of data is usually available, and we need to extract useful information from it. Nowadays, as ever more data is generated through mobile phones, credit cards, etc., the need for high-performance methods is pressing. In this work, we describe six different methods that serve this purpose: logistic regression, neural networks and deep neural networks, bagging, boosting and stacking. The last three methods form a group called ensemble learning. We apply all six methods to real data, generously provided by a loan provider. These methods can help the provider distinguish between good and bad potential loan takers when the lending decision is being made. Lastly, the results of the individual methods are compared, and we briefly outline possible ways of interpreting them.
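A hedged sketch of how such a comparison might be set up with scikit-learn; the synthetic, imbalanced data stands in for the confidential loan data, and the model settings are defaults rather than the thesis's tuned configurations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the loan data: the minority class plays
# the role of defaulting borrowers.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8],
                           random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "neural network": MLPClassifier(max_iter=1000, random_state=0),
    "bagging": BaggingClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
    "stacking": StackingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("gb", GradientBoostingClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000)),
}
for name, m in models.items():
    auc = cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```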
APA, Harvard, Vancouver, ISO, and other styles
32

King, Michael Allen. "Ensemble Learning Techniques for Structured and Unstructured Data." Diss., Virginia Tech, 2015. http://hdl.handle.net/10919/51667.

Full text
Abstract:
This research provides an integrated approach applying innovative ensemble learning techniques that have the potential to increase the overall accuracy of classification models. Actual structured and unstructured data sets from industry are utilized during the research process, analysis and subsequent model evaluations. The first research section addresses the consumer demand forecasting and daily capacity management requirements of a nationally recognized alpine ski resort in the state of Utah, in the United States of America. A basic econometric model is developed, and three classic predictive models are evaluated for effectiveness. These predictive models were subsequently used as input for four ensemble modeling techniques, and the ensemble learning techniques are shown to be effective. The second research section discusses the opportunities and challenges faced by a leading firm providing sponsored search marketing services. The goal of sponsored search marketing campaigns is to create advertising campaigns that better attract and motivate a target market to purchase. This research develops a method for classifying profitable campaigns and maximizing overall campaign portfolio profits. Four traditional classifiers are utilized, along with four ensemble learning techniques, to build classifier models that identify profitable pay-per-click campaigns. A MetaCost ensemble configuration, with the ability to integrate unequal classification costs, produced the highest campaign portfolio profit. The third research section addresses the management challenges of online consumer reviews encountered by service industries and shows how these textual reviews can be used for service improvements. A service improvement framework is introduced that integrates traditional text mining techniques and second-order feature derivation with ensemble learning techniques. The concept of GLOW and SMOKE words is introduced and shown to be an objective text analytic source of service defects or service accolades.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
33

Johansson, Henrik. "Video Flow Classification : Feature Based Classification Using the Tree-based Approach." Thesis, Karlstads universitet, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-43012.

Full text
Abstract:
This dissertation describes a study which aims to classify video flows in Internet network traffic. In this study, classification is done based on the characteristics of the flow, including features such as payload sizes and inter-arrival times. The purpose is to offer an alternative to classifying flows based on the contents of their payload packets, a necessity given the increase of encrypted flows within Internet network traffic. Data with known class is fed to a machine learning classifier so that a model can be created. This model can then be used for classification of new unknown data. For this study, two different classifiers are used, namely decision trees and random forest. Several tests are completed to attain the best possible models. The results of this dissertation show that classification based on characteristics is possible, and the random forest classifier in particular achieves good accuracies. However, the accuracy of classification of encrypted flows could not be tested within this project.
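A minimal sketch of flow classification from statistical features with a random forest; the three features and their synthetic distributions are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
n = 1000
# Invented flow-level features in the spirit of the thesis: mean
# payload size, payload-size variance, mean inter-arrival time.
video = np.column_stack([rng.normal(1200, 100, n // 2),
                         rng.normal(50, 10, n // 2),
                         rng.normal(0.02, 0.005, n // 2)])
other = np.column_stack([rng.normal(400, 200, n // 2),
                         rng.normal(300, 80, n // 2),
                         rng.normal(0.2, 0.1, n // 2)])
X = np.vstack([video, other])
y = np.array([1] * (n // 2) + [0] * (n // 2))   # 1 = video flow

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```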
HITS, 4707
APA, Harvard, Vancouver, ISO, and other styles
34

Stephanos, Dembe. "Machine Learning Approaches to Dribble Hand-off Action Classification with SportVU NBA Player Coordinate Data." Digital Commons @ East Tennessee State University, 2021. https://dc.etsu.edu/etd/3908.

Full text
Abstract:
Recently, strategies of National Basketball Association teams have evolved with the skillsets of players and the emergence of advanced analytics. One of the most effective actions in dynamic offensive strategies in basketball is the dribble hand-off (DHO). This thesis proposes an architecture for a classification pipeline for detecting DHOs in an accurate and automated manner. This pipeline consists of a combination of player tracking data and event labels, a rule set to identify candidate actions, manually reviewing game recordings to label the candidates, and embedding player trajectories into hexbin cell paths before passing the completed training set to the classification models. This resulting training set is examined using the information gain from extracted and engineered features and the effectiveness of various machine learning algorithms. Finally, we provide a comprehensive accuracy evaluation of the classification models to compare various machine learning algorithms and highlight their subtle differences in this problem domain.
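A rough sketch of embedding a trajectory as a cell path; the thesis uses hexagonal bins, while square bins, the cell size and the sample coordinates below are simplifying assumptions.

```python
import numpy as np

def cell_path(traj, cell_size=5.0):
    """Embed a player trajectory as a sequence of cell identifiers.

    Square bins are used to keep the sketch short (the thesis uses
    hexbins), and consecutive repeats are collapsed so only cell
    transitions remain."""
    cells = np.floor(np.asarray(traj) / cell_size).astype(int)
    path, last = [], None
    for cx, cy in cells:
        cur = (cx, cy)
        if cur != last:
            path.append(cur)
            last = cur
    return path

traj = [(10.2, 4.1), (11.0, 4.5), (16.3, 5.0), (21.7, 9.8)]
print(cell_path(traj))   # [(2, 0), (3, 1), (4, 1)]
```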
APA, Harvard, Vancouver, ISO, and other styles
35

Ceccon, Stefano. "Extending Bayesian network models for mining and classification of glaucoma." Thesis, Brunel University, 2013. http://bura.brunel.ac.uk/handle/2438/8051.

Full text
Abstract:
Glaucoma is a degenerative disease that damages the nerve fiber layer in the retina of the eye. Its mechanisms are not fully known and there is no fully-effective strategy to prevent visual impairment and blindness. However, if treatment is carried out at an early stage, it is possible to slow glaucomatous progression and improve the quality of life of sufferers. Despite the great amount of heterogeneous data that has become available for monitoring glaucoma, the performance of tests for early diagnosis is still insufficient, due to the complexity of disease progression and the difficulties in obtaining sufficient measurements. This research aims to assess and extend Bayesian Network (BN) models to investigate the nature of the disease and its progression, as well as improve early diagnosis performance. The flexibility of BNs and their ability to integrate with clinician expertise make them a suitable tool to effectively exploit the available data. After presenting the problem, a series of BN models for cross-sectional data classification and integration are assessed; novel techniques are then proposed for classification and modelling of glaucoma progression. The results are validated against literature, direct expert knowledge and other Artificial Intelligence techniques, indicating that BNs and their proposed extensions improve glaucoma diagnosis performance and enable new insights into the disease process.
APA, Harvard, Vancouver, ISO, and other styles
36

Rado, Omesaad A. M. "Contributions to evaluation of machine learning models. Applicability domain of classification models." Thesis, University of Bradford, 2019. http://hdl.handle.net/10454/18447.

Full text
Abstract:
Artificial intelligence (AI) and machine learning (ML) present some application opportunities and challenges that can be framed as learning problems. The performance of machine learning models depends on algorithms and the data. Moreover, learning algorithms create a model of reality through learning and testing with data, and their performance shows the degree of agreement between their assumed model and reality. ML algorithms have been successfully used in numerous classification problems. With the growing popularity of ML models for many purposes in different domains, the validation of such predictive models is now required more formally. Traditionally, there are many studies related to model evaluation, robustness, reliability, and the quality of the data and the data-driven models. However, those studies do not yet consider the concept of the applicability domain (AD). The issue is that the AD is often not well defined, or not defined at all, in many fields. This work investigates the robustness of ML classification models from the applicability domain perspective. A standard definition of applicability domain regards the spaces in which the model provides results with specific reliability. The main aim of this study is to investigate the connection between the applicability domain approach and classification model performance. We examine the usefulness of assessing the AD for the classification model, i.e. the reliability, reuse and robustness of classifiers. The work is implemented using three approaches: firstly, assessing the applicability domain for the classification model; secondly, investigating the robustness of the classification model based on the applicability domain approach; thirdly, selecting an optimal model using Pareto optimality. The experiments in this work are illustrated by considering different machine learning algorithms for binary and multi-class classification on healthcare datasets from public benchmark data repositories. In the first approach, the decision tree algorithm (DT) is used for the classification of data in the classification stage. A feature selection method is applied to choose features for classification. The obtained classifiers are used in the third approach for selection of models using Pareto optimality. The second approach is implemented in three steps, namely building the classification model, generating synthetic data, and evaluating the obtained results. The results obtained from the study provide an understanding of how the proposed approach can help to define the model's robustness and the applicability domain, for providing reliable outputs. These approaches open opportunities for classification data and model management. The proposed algorithms are implemented through a set of experiments on the classification accuracy of instances which fall in the domain of the model. For the first approach, considering all the features, the highest accuracy obtained is 0.98, with a thresholds average of 0.34, for the Breast Cancer dataset. After applying the recursive feature elimination (RFE) method, the accuracy is 0.96 with a thresholds average of 0.27. For the robustness of the classification model based on the applicability domain approach, the minimum accuracy is 0.62 for the Indian Liver Patient data at r=0.10, and the maximum accuracy is 0.99 for the Thyroid dataset at r=0.10. For the selection of an optimal model using Pareto optimality, the optimally selected classifier gives an accuracy of 0.94 with a thresholds average of 0.35. This research investigates critical aspects of the applicability domain as related to the robustness of classification ML algorithms. However, the performance of machine learning techniques depends on the degree of reliable predictions of the model. In the literature, the robustness of an ML model can be defined as the ability of the model to provide a testing error close to the training error. Moreover, these properties describe the stability of the model performance when tested on new datasets. In conclusion, this thesis introduced the concept of applicability domain for classifiers and tested the use of this concept in case studies on health-related public benchmark datasets.
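A minimal sketch of a distance-based applicability domain check, assuming "inside the domain" means the nearest training point lies within a radius r; the dataset, the r values and the decision tree settings are illustrative and not the thesis's exact protocol.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X = (X - X.mean(0)) / X.std(0)                    # standardise features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Distance from each test point to its nearest training point.
d = np.min(np.linalg.norm(X_te[:, None] - X_tr[None], axis=2), axis=1)
for r in (3.0, 4.0, 5.0):
    inside = d < r
    acc = (clf.predict(X_te[inside]) == y_te[inside]).mean()
    print(f"r={r}: {inside.mean():.0%} of test points in domain, "
          f"accuracy inside = {acc:.2f}")
```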
Ministry of Higher Education in Libya
APA, Harvard, Vancouver, ISO, and other styles
37

Kaden, Marika. "Integration of Auxiliary Data Knowledge in Prototype Based Vector Quantization and Classification Models." Doctoral thesis, Universitätsbibliothek Leipzig, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-206413.

Full text
Abstract:
This thesis deals with the integration of auxiliary data knowledge into machine learning methods, especially prototype based classification models. The problem of classification is diverse, and evaluating the result using only the accuracy is not adequate in many applications. Therefore, the classification tasks are analyzed more deeply. Possibilities to extend prototype based methods to integrate extra knowledge about the data or the classification goal are presented to obtain problem-adequate models. One of the proposed extensions is Generalized Learning Vector Quantization for direct optimization of statistical measures besides the classification accuracy. Modifying the metric adaptation of Generalized Learning Vector Quantization for functional data, i.e. data with lateral dependencies in the features, is also considered.
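A bare-bones sketch of the Generalized Learning Vector Quantization update on toy data; the identity activation and the constants absorbed into the learning rate are simplifications of the usual GLVQ cost, and the data is invented.

```python
import numpy as np

def glvq_train(X, y, prototypes, proto_labels, lr=0.1, epochs=30):
    # Attract the closest correct prototype and repel the closest
    # wrong one, scaled by the derivative of the GLVQ cost
    # mu = (d+ - d-) / (d+ + d-); identity activation for brevity.
    W, c = prototypes.copy(), np.asarray(proto_labels)
    for _ in range(epochs):
        for x, t in zip(X, y):
            d = ((W - x) ** 2).sum(axis=1)              # squared distances
            jp = np.where(c == t, d, np.inf).argmin()   # best correct
            jm = np.where(c != t, d, np.inf).argmin()   # best wrong
            s = (d[jp] + d[jm]) ** 2
            W[jp] += lr * (d[jm] / s) * (x - W[jp])
            W[jm] -= lr * (d[jp] / s) * (x - W[jm])
    return W

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
W = glvq_train(X, y, X[[0, 99]].astype(float), [0, 1])
pred = np.array([((W - p) ** 2).sum(1).argmin() for p in X])  # index == label here
print("training accuracy:", (pred == y).mean())
```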
APA, Harvard, Vancouver, ISO, and other styles
38

Lundgren, Andreas. "Data-Driven Engine Fault Classification and Severity Estimation Using Residuals and Data." Thesis, Linköpings universitet, Fordonssystem, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-165736.

Full text
Abstract:
Recent technological advances in the automotive industry have made vehicular systems increasingly complex in terms of both hardware and software. As the complexity of the systems increases, so does the complexity of efficient monitoring of these systems. With increasing computational power, the field of diagnostics is becoming evermore focused on software solutions for detecting and classifying anomalies in the supervised systems. Model-based methods utilize knowledge about the physical system to devise nominal models of the system to detect deviations, while data-driven methods use historical data to come to conclusions about the present state of the system in question. This study proposes a combined model-based and data-driven diagnostic framework for fault classification, severity estimation and novelty detection. An algorithm is presented which uses a system model to generate a candidate set of residuals for the system. A subset of the residuals is then selected for each fault using L1-regularized logistic regression. The time series training data from the selected residuals is labelled with fault and severity. It is then compressed using a Gaussian parametric representation, and data from different fault modes are modelled using 1-class support vector machines. The classification of data is performed by utilizing the support vector machine description of the data in the residual space, and the fault severity is estimated as a convex optimization problem of minimizing the Kullback-Leibler divergence (KLD) between the new data and training data of different fault modes and severities. The algorithm is tested with data collected from a commercial Volvo car engine in an engine test cell, and the results are presented in this report. Initial tests indicate the potential of the KLD for fault severity estimation and that novelty detection performance is closely tied to the residual selection process.
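A hedged sketch of the two steps named above, residual selection with L1-regularized logistic regression followed by a one-class SVM fault description; the residual data, the affected residual indices and all hyperparameters are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)
n, m = 400, 12                     # samples, candidate residuals
R_nominal = rng.normal(0, 1, (n, m))
R_fault = rng.normal(0, 1, (n, m))
R_fault[:, [2, 7]] += 2.0          # only residuals 2 and 7 react

# Step 1: L1-regularized logistic regression selects the residuals
# that separate this fault from nominal behaviour.
X = np.vstack([R_nominal, R_fault])
y = np.array([0] * n + [1] * n)
sel = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
chosen = np.flatnonzero(sel.coef_[0] != 0)
print("selected residuals:", chosen)   # ideally [2 7]

# Step 2: a one-class SVM describes the fault mode in the reduced
# residual space; new data can then be matched against it.
ocsvm = OneClassSVM(nu=0.05).fit(R_fault[:, chosen])
new_batch = rng.normal(0, 1, (5, m))
new_batch[:, [2, 7]] += 2.0
print(ocsvm.predict(new_batch[:, chosen]))   # +1 = consistent with fault
```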
APA, Harvard, Vancouver, ISO, and other styles
39

Gonzalez, Munoz Mario, and Philip Hedström. "Predicting Customer Behavior in E-commerce using Machine Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-260269.

Full text
Abstract:
E-commerce has been a rapidly growing sector during the last years and is predicted to continue growing as fast during the next ones. This has opened up a lot of opportunities for companies trying to sell their products or services, but it also forces them to exploit these opportunities before their competitors in order not to fall behind. One interesting opportunity, which we have chosen to focus this thesis on, is the ability to use customer data, which has not been available with physical stores, to identify customer behaviour patterns and develop a better understanding of the customers. Hopefully this makes it possible to predict customer behaviour. We specifically focused on distinguishing possible-buyers from buyers, with the intent of identifying key factors that affect whether the customer performs a purchase or not. We did this using Binary Logistic Regression, a supervised machine learning algorithm that is trained to classify an input observation. We managed to create a model that predicted whether a customer was a possible-buyer or a buyer with an accuracy of 88%.
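A minimal sketch of binary logistic regression for purchase prediction; the session features and their generating model are invented stand-ins for real e-commerce data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(5)
n = 5000
# Invented session features: pages viewed, minutes on site,
# items in cart, returning-visitor flag.
X = np.column_stack([rng.poisson(5, n), rng.exponential(4, n),
                     rng.poisson(1, n), rng.integers(0, 2, n)])
logit = -3 + 0.15 * X[:, 0] + 0.1 * X[:, 1] + 0.8 * X[:, 2] + 0.5 * X[:, 3]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)  # 1 = purchase

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
print("coefficients:", model.coef_[0])   # which features drive purchases
```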
APA, Harvard, Vancouver, ISO, and other styles
40

dos, Santos Toledo Busarello Mariana. "Machine Learning Applied to Reach Classification in a Northern Sweden Catchment." Thesis, Umeå universitet, Institutionen för ekologi, miljö och geovetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-184140.

Full text
Abstract:
An accurate fine-resolution classification of river systems positively impacts the process of assessment and monitoring of water courses, as stressed by the European Commission's Water Framework Directive. Being able to attribute classes using remotely obtained data makes it possible to perform extensive classification of reaches without field work, with some methods also allowing identification of which features best describe each of the process domains. In this work, data from two Swedish sub-catchments above the highest coastline was used to train a Random Forest Classifier, a machine learning algorithm. The obtained model provided class predictions and analyses of the most important features. Each study area was studied separately, then combined. In the combined case, the analysis was made with and without lakes in the data, to verify how this would affect the predictions. The results showed that the accuracy of the estimator was reliable; however, due to data complexity and imbalance, rapids were harder to classify accurately, with an overprediction of the slow-flowing class. Combining the datasets and including lakes lessened the shortcomings of the data imbalance. Using the feature importance and permutation importance methods, the three most important features identified were the channel slope, the median of the roughness in the 100-m buffer, and the standard deviation of the planform curvature in the 100-m buffer. This finding was supported by previous studies, but other variables expected to contribute strongly, such as lithology and valley confinement, were not relevant, which most likely relates to the coarseness of the available data. The most frequent errors were also mapped, showing some overlap between error hotspots and areas previously restored in 2010.
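A short sketch of random forest classification with permutation importance, echoing the three top features named above; the synthetic data and the class-weight choice are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n = 600
# Invented reach descriptors echoing the thesis's top features:
# channel slope, roughness median, planform-curvature std.
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, n) > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(class_weight="balanced",  # soften imbalance
                            random_state=0).fit(X_tr, y_tr)
imp = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
for name, v in zip(["slope", "roughness", "curvature"],
                   imp.importances_mean):
    print(f"{name}: {v:.3f}")
```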
APA, Harvard, Vancouver, ISO, and other styles
41

Alvarado, Mantecon Jesus Gerardo. "Towards the Automatic Classification of Student Answers to Open-ended Questions." Thesis, Université d'Ottawa / University of Ottawa, 2019. http://hdl.handle.net/10393/39093.

Full text
Abstract:
One of the main research challenges nowadays in the context of Massive Open Online Courses (MOOCs) is the effective automation of the evaluation of text-based assessments. Text-based assessments, such as essay writing, have been proven to be better indicators of a higher level of understanding than machine-scored assessments (e.g. multiple choice questions). Nonetheless, due to the rapid growth of MOOCs, text-based evaluation has become a difficult task for human markers, creating the need for automated grading systems. In this thesis, we focus on the automated short answer grading task (ASAG), which automatically sorts natural language answers to open-ended questions into correct and incorrect classes. We propose an ensemble supervised machine learning approach that relies on two types of classifiers: a response-based classifier, which centers on feature extraction from available responses, and a reference-based classifier, which considers the relationships between responses, model answers and questions. For each classifier, we explored a set of features based on words and entities. For the response-based classifier, we tested and compared 5 features: traditional n-gram models, entity URIs (Uniform Resource Identifiers) and entity mentions, both extracted using a semantic annotation API, entity mention embeddings based on GloVe, and entity URI embeddings extracted from Wikipedia. For the reference-based classifier, we explored fourteen features: cosine similarity between sentence embeddings of student answers and model answers, the number of overlapping elements (words, entity URIs, entity mentions) between student answers and model answers or the question text, the Jaccard similarity coefficient between student answers and model answers or the question text (based on words, entity URIs or entity mentions), and a sentence embedding representation. We evaluated our classifiers on three datasets, two of which belong to the SemEval ASAG competition (Dzikovska et al., 2013). Our results show that, in general, reference-based features perform much better than response-based features in terms of accuracy and macro-averaged f1-score. Within the reference-based approach, we observe that the use of the S6 embedding representation, which considers the question text, the student answer and the model answer, generated the best performing models. Nonetheless, their combination with other similarity features helped build more accurate classifiers. As for response-based classifiers, models based on traditional n-gram features remained the best. Finally, we combined our best reference-based and response-based classifiers using an ensemble learning model. Our ensemble classifiers combining both approaches achieved the best results for one of the evaluation datasets, but underperformed on the remaining two. We also compared the best two classifiers with some of the main state-of-the-art results of the SemEval competition. Our final embedded meta-classifier outperformed the top-ranking result on the SemEval Beetle dataset, and our top classifier on SemEval SciEntBank, trained on reference-based features, obtained 2nd position. In conclusion, the reference-based approach, powered mainly by sentence-level embeddings and other similarity features, proved to generate the most efficient models on two out of three datasets, and the ensemble model was the best on the SemEval Beetle dataset.
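Two of the reference-based features named above, cosine similarity and the Jaccard coefficient, can be sketched directly; the embedding vectors and texts below are placeholders, and real embeddings would come from GloVe-based or sentence-level encoders.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between a student-answer embedding and a
    model-answer embedding (a core reference-based feature)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def jaccard(a, b):
    """Jaccard coefficient over word sets; the thesis also applies it
    to entity URIs and entity mentions."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

student_vec = np.array([0.2, 0.8, 0.1])
model_vec = np.array([0.3, 0.7, 0.0])
print(cosine(student_vec, model_vec))
print(jaccard("gravity pulls objects down",
              "gravity pulls masses together"))   # 2 shared / 6 total words
```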
APA, Harvard, Vancouver, ISO, and other styles
42

Nordström, Jesper. "Automated classification of bibliographic data using SVM and Naive Bayes." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-75167.

Full text
Abstract:
Classification of scientific bibliographic data is an important and increasingly time-consuming task in a “publish or perish” paradigm where the number of scientific publications is steadily growing. Apart from being a resource-intensive endeavor, manual classification has also been shown to be performed with a rather high degree of inconsistency. Since many bibliographic databases contain a large number of already classified records, supervised machine learning for automated classification might be a solution for handling the increasing volumes of published scientific articles. In this study, automated classification of bibliographic data based on two different machine learning methods, Naive Bayes and Support Vector Machine (SVM), was evaluated. The data used in the study were collected from the Swedish research database SwePub, and the features used for training the classifiers were based on the abstracts and titles in the bibliographic records. The accuracy achieved ranged between a lowest score of 0.54 and a highest score of 0.84. The classifiers based on Support Vector Machine consistently received higher scores than the classifiers based on Naive Bayes. Classification performed at the second level of the hierarchical classification system clearly resulted in lower scores than classification performed at the first level. Using abstracts as the basis for feature extraction yielded overall better results than using titles, although the differences were very small.
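A compact sketch of the Naive Bayes versus SVM comparison on text features; the toy abstracts and two-class labels stand in for the SwePub records and their hierarchical classes, and the TF-IDF weighting is a common choice rather than the thesis's exact feature pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy stand-ins for SwePub abstracts with top-level subject classes.
texts = ["gene expression in tumour cells", "bridge load capacity analysis",
         "protein folding simulation", "reinforced concrete beam testing",
         "cancer cell signalling pathways", "steel truss fatigue behaviour"]
labels = ["medicine", "engineering", "medicine",
          "engineering", "medicine", "engineering"]

for clf in (MultinomialNB(), LinearSVC()):
    pipe = make_pipeline(TfidfVectorizer(), clf)
    acc = cross_val_score(pipe, texts, labels, cv=3).mean()
    print(type(clf).__name__, f"{acc:.2f}")
```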
APA, Harvard, Vancouver, ISO, and other styles
43

Karunaratne, Thashmee M. "Learning predictive models from graph data using pattern mining." Doctoral thesis, Stockholms universitet, Institutionen för data- och systemvetenskap, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-100713.

Full text
Abstract:
Learning from graphs has become a popular research area due to the ubiquity of graph data representing web pages, molecules, social networks, protein interaction networks, etc. However, standard graph learning approaches are often challenged by the computational cost involved in the learning process, due to the richness of the representation. Attempts made to improve their efficiency are often associated with the risk of degrading the performance of the predictive models, creating tradeoffs between the efficiency and effectiveness of the learning. Such a situation is analogous to an optimization problem with two objectives, efficiency and effectiveness, where improving one objective without the other objective being worse off is a better solution, called a Pareto improvement. In this thesis, it is investigated how to improve the efficiency and effectiveness of learning from graph data using pattern mining methods. Two objectives are set, where one concerns how to improve the efficiency of pattern mining without reducing the predictive performance of the learning models, and the other concerns how to improve predictive performance without increasing the complexity of pattern mining. The employed research method mainly follows a design science approach, including the development and evaluation of artifacts. The contributions of this thesis include a data representation language that can be characterized as a form in between sequences and itemsets, where the graph information is embedded within items. Several studies, each of which looks for Pareto improvements in efficiency and effectiveness, are conducted using sets of small graphs. Summarizing the findings, some of the proposed methods, namely maximal frequent itemset mining and constraint-based itemset mining, result in a dramatically increased efficiency of learning without decreasing the predictive performance of the resulting models. It is also shown that additional background knowledge can be used to enhance the performance of the predictive models without increasing the complexity of the graphs.
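A hedged sketch of maximal frequent itemset mining over graph-derived items, assuming the mlxtend library's TransactionEncoder and fpmax; the transactions are invented placeholders for items with embedded graph information, and min_support is arbitrary.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpmax

# Toy transactions standing in for graph-derived items (each item
# could encode an edge or a small fragment of a molecule graph).
transactions = [["C-C", "C-O", "C-N"], ["C-C", "C-O"],
                ["C-C", "C-N"], ["C-C", "C-O", "C-N"]]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Maximal frequent itemsets: frequent sets with no frequent superset,
# which shrinks the feature space without losing coverage.
print(fpmax(df, min_support=0.5, use_colnames=True))
```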
APA, Harvard, Vancouver, ISO, and other styles
44

Moshfeghi, Mohammadshakib, Jyoti Prasad Bartaula, and Aliye Tuke Bedasso. "Emotion Recognition from EEG Signals using Machine Learning." Thesis, Blekinge Tekniska Högskola, Sektionen för ingenjörsvetenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-4147.

Full text
Abstract:
The beauty of affective computing is to make machines more empathic to the user. Machines with the capability of emotion recognition can actually look inside the user's head and act according to the observed mental state. In this thesis project, we investigate different feature sets for building an emotion recognition system from electroencephalographic signals. We used pictures from the International Affective Picture System to induce three emotional states: positive valence (pleasant), neutral and negative valence (unpleasant), and also to induce three sets of binary states: positive valence vs. not positive valence; negative valence vs. not negative valence; and neutral vs. not neutral. The experiment used a head cap with six electrodes at the front of the scalp to record data from the subjects. To solve the recognition task, we developed a system based on Support Vector Machines (SVM) and extracted the features, some taken from the literature and some proposed by ourselves, in order to rate the recognition of emotional states. With this system we were able to achieve an average recognition rate of up to 54% for the three emotional states and up to 74% for the binary states, based solely on EEG signals.
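A rough sketch of the overall pipeline, band-power features followed by an SVM; the random signals and labels yield only chance-level accuracy and merely illustrate the plumbing, and the alpha/beta band choices are common conventions rather than the thesis's exact features.

```python
import numpy as np
from scipy.signal import welch
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
fs, n_trials = 128, 120
eeg = rng.normal(size=(n_trials, 6, fs * 4))   # 6 channels, 4 s trials
y = rng.integers(0, 3, n_trials)               # 3 affective states (toy)

def band_power(trial, lo, hi):
    """Mean spectral power in [lo, hi) Hz per channel (Welch PSD)."""
    f, p = welch(trial, fs=fs, axis=-1)
    return p[:, (f >= lo) & (f < hi)].mean(axis=-1)

# Alpha and beta band powers as features, one value per channel.
X = np.array([np.concatenate([band_power(t, 8, 13),     # alpha
                              band_power(t, 13, 30)])   # beta
              for t in eeg])
print("CV accuracy:", cross_val_score(SVC(), X, y, cv=5).mean())
```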
APA, Harvard, Vancouver, ISO, and other styles
45

Miloš, Radovanović. "High-Dimensional Data Representations and Metrics for Machine Learning and Data Mining." Phd thesis, Univerzitet u Novom Sadu, Prirodno-matematički fakultet u Novom Sadu, 2011. https://www.cris.uns.ac.rs/record.jsf?recordId=77530&source=NDLTD&language=en.

Full text
Abstract:
In the current information age, massive amounts of data are gathered, at a rate prohibiting their effective structuring, analysis, and conversion into useful knowledge. This information overload is manifested both in large numbers of data objects recorded in data sets, and large numbers of attributes, also known as high dimensionality. This dissertation deals with problems originating from high dimensionality of data representation, referred to as the “curse of dimensionality,” in the context of machine learning, data mining, and information retrieval. The described research follows two angles: studying the behavior of (dis)similarity metrics with increasing dimensionality, and exploring feature-selection methods, primarily with regard to document representation schemes for text classification. The main results of the dissertation, relevant to the first research angle, include theoretical insights into the concentration behavior of cosine similarity, and a detailed analysis of the phenomenon of hubness, which refers to the tendency of some points in a data set to become hubs by being included in unexpectedly many k-nearest neighbor lists of other points. The mechanisms behind the phenomenon are studied in detail, both from a theoretical and empirical perspective, linking hubness with the (intrinsic) dimensionality of data, describing its interaction with the cluster structure of data and the information provided by class labels, and demonstrating the interplay of the phenomenon and well known algorithms for classification, semi-supervised learning, clustering, and outlier detection, with special consideration being given to time-series classification and information retrieval. Results pertaining to the second research angle include quantification of the interaction between various transformations of high-dimensional document representations, and feature selection, in the context of text classification.
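The hubness phenomenon can be observed with a few lines of code: the skewness of the k-occurrence distribution grows with dimensionality; the data sizes and the choice k=5 are arbitrary.

```python
import numpy as np
from scipy.stats import skew
from sklearn.neighbors import NearestNeighbors

def k_occurrence_skewness(X, k=5):
    """Hubness measure: skewness of N_k(x), the number of times each
    point appears in other points' k-nearest-neighbour lists."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                     # first column is self
    counts = np.bincount(idx[:, 1:].ravel(), minlength=len(X))
    return skew(counts)

rng = np.random.default_rng(7)
for d in (3, 20, 100):            # growing (intrinsic) dimensionality
    X = rng.normal(size=(2000, d))
    print(f"d={d}: N_k skewness = {k_occurrence_skewness(X):.2f}")
```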
APA, Harvard, Vancouver, ISO, and other styles
46

Taslimitehrani, Vahid. "Contrast Pattern Aided Regression and Classification." Wright State University / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=wright1459377694.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Kristensson, Jonathan. "Load Classification with Machine Learning : Classifying Loads in a Distribution Grid." Thesis, Uppsala universitet, Institutionen för teknikvetenskaper, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-395280.

Full text
Abstract:
This thesis explores the use of machine learning as a load classifier in a distribution grid, based on the daily consumption behaviour of roughly 1600 loads spread throughout the areas Bromma, Hässelby and Vällingby in Stockholm, Sweden. Two common unsupervised learning methods were used for this, K-means clustering and hierarchical agglomerative clustering (HAC), whose performance was analysed with different input data sets and parameters. K-means and HAC were unfortunately difficult to compare, and there were also some difficulties in finding a suitable number of clusters K with the used input data. This issue was resolved by evaluating the clustering outcome with the custom loss function MSE-tot, which compares the created clusters with the subsequent assignment of new data. The loss function MSE-tot indicates that K-means is more suitable than HAC in this particular clustering setup. To investigate how the obtained clusters could be used in practice, two K-means clustering models were also used to perform cluster-specific peak load predictions. These predictions were made using unitless load profiles created from the mean properties of each cluster and dimensioned using load-specific parameters. The developed models had a mean relative error of approximately 8-19 % per load, depending on the prediction method and which of the two clustering models was used. This result is quite promising, especially since deviations above 20 % were not uncommon in previous work. The models gave poor predictions for some clusters, however, which indicates that they may not be suitable for all kinds of load data in their current form. One suggestion for further improving the predictions is to add more explanatory variables, for example temperature dependence. The results of the developed models were also compared to the conventionally used Velander's formula, which makes predictions based on the loads' facility type and annual electricity consumption. Velander's formula generally performed worse than the developed methods, only reaching a mean relative error of 40-43 % per load. One likely reason for this is that the used database had poor facility label quality, which is essential for obtaining correct constants in Velander's formula.
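A minimal sketch of clustering unitless daily load profiles with K-means; the two synthetic consumption shapes and the choice K=2 are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
hours = np.arange(24)
# Invented daily consumption shapes: residential evening peak vs.
# office daytime peak, plus noise.
res = np.exp(-((hours - 19) ** 2) / 8) + rng.normal(0, 0.05, (300, 24))
off = np.exp(-((hours - 12) ** 2) / 18) + rng.normal(0, 0.05, (300, 24))
profiles = np.vstack([res, off])

# Normalise each load to a unitless shape so clustering reflects
# behaviour rather than magnitude, then cluster with K-means.
shapes = profiles / profiles.sum(axis=1, keepdims=True)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(shapes)
print("cluster sizes:", np.bincount(km.labels_))
print("peak hour per cluster:", km.cluster_centers_.argmax(axis=1))
```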
APA, Harvard, Vancouver, ISO, and other styles
48

Chandra, Nagasai. "Node Classification on Relational Graphs using Deep-RGCNs." DigitalCommons@CalPoly, 2021. https://digitalcommons.calpoly.edu/theses/2265.

Full text
Abstract:
Knowledge Graphs are fascinating concepts in machine learning as they can hold usefully structured information in the form of entities and their relations. Despite the valuable applications of such graphs, most knowledge bases remain incomplete. This missing information harms downstream applications such as information retrieval and opens a window for research in statistical relational learning tasks such as node classification and link prediction. This work proposes a deep learning framework based on existing relational convolutional (R-GCN) layers to learn on highly multi-relational data characteristic of realistic knowledge graphs for node property classification tasks. We propose a deep and improved variant, Deep-RGCNs, with dense and residual skip connections between layers. These skip connections are known to be very successful with popular deep CNN-architectures such as ResNet and DenseNet. In our experiments, we investigate and compare the performance of Deep-RGCN with different baselines on multi-relational graph benchmark datasets, AIFB and MUTAG, and show how the deep architecture boosts the performance in the task of node property classification. We also study the training performance of Deep-RGCNs (with N layers) and discuss the gradient vanishing and over-smoothing problems common to deeper GCN architectures.
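A bare-bones sketch of one relational GCN layer with a residual skip connection in PyTorch; the per-relation normalisation constants and basis decomposition of the full R-GCN are omitted, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

class ResidualRGCNLayer(nn.Module):
    """One relational GCN layer with a residual skip connection,
    sketching the Deep-RGCN idea: per-relation weights plus a
    self-loop transform, with the input added back to the output."""
    def __init__(self, dim, n_relations):
        super().__init__()
        self.rel = nn.ModuleList([nn.Linear(dim, dim, bias=False)
                                  for _ in range(n_relations)])
        self.self_loop = nn.Linear(dim, dim, bias=False)

    def forward(self, h, adjs):
        # adjs[r] is the (n x n) adjacency matrix of relation r.
        out = self.self_loop(h)
        for r, a in enumerate(adjs):
            out = out + a @ self.rel[r](h)
        return torch.relu(out) + h        # residual skip connection

n, dim = 5, 8
h = torch.randn(n, dim)
adjs = [torch.bernoulli(torch.full((n, n), 0.3)) for _ in range(2)]
layer = ResidualRGCNLayer(dim, n_relations=2)
print(layer(h, adjs).shape)    # torch.Size([5, 8])
```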
APA, Harvard, Vancouver, ISO, and other styles
49

Brown, Elliot Morgan. "The Application of Synthetic Signals for ECG Beat Classification." BYU ScholarsArchive, 2019. https://scholarsarchive.byu.edu/etd/8116.

Full text
Abstract:
A brief overview of electrocardiogram (ECG) properties and the characteristics of various cardiac conditions is given. Two different models are used to generate synthetic ECG signals. Domain knowledge is used to create synthetic examples of 16 different heart beat types with these models. Other techniques for synthesizing ECG signals are explored. Various machine learning models with different combinations of real and synthetic data are used to classify individual heart beats. The performance of the different methods and models is compared, and synthetic data is shown to be useful in beat classification.
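A toy sketch of a synthetic ECG beat built from Gaussian bumps; this is a simplification standing in for the dynamical models used in the thesis, with illustrative amplitudes, centres and widths.

```python
import numpy as np

def synthetic_beat(fs=250, rr=0.8):
    """Very simple synthetic ECG beat: P, Q, R, S and T waves modelled
    as Gaussian bumps over one RR interval."""
    t = np.arange(0, rr, 1 / fs)
    waves = [(0.15, 0.20, 0.025),    # (amplitude, centre s, width s): P
             (-0.1, 0.29, 0.010),    # Q
             (1.0, 0.30, 0.012),     # R
             (-0.2, 0.31, 0.010),    # S
             (0.3, 0.55, 0.040)]     # T
    return t, sum(a * np.exp(-((t - c) ** 2) / (2 * w ** 2))
                  for a, c, w in waves)

t, beat = synthetic_beat()
print(len(beat), float(beat.max()))   # samples per beat, R-peak height
```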
APA, Harvard, Vancouver, ISO, and other styles
50

Sohaib, Ahmad Tauseef, and Shahnawaz Qureshi. "An Empirical Study of Machine Learning Techniques for Classifying Emotional States from EEG Data." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-2932.

Full text
Abstract:
With the great advances in robot technology, smart human-robot interaction is considered one of the most sought-after goals by researchers today. If a robot can identify the emotions and intentions of a human interacting with it, that would make robots more useful. Electroencephalography (EEG) is considered an effective way of recording a human's emotions and motivations from the brain. Various machine learning techniques have been used successfully to classify EEG data accurately. K-Nearest Neighbor, Bayesian Network, Artificial Neural Networks and Support Vector Machine are among the suitable machine learning techniques for classifying EEG data. The aim of this thesis is to evaluate different machine learning techniques for classifying EEG data associated with specific affective/emotional states. Different methods based on different signal processing techniques are studied to find a suitable method for processing the EEG data. Various numbers of EEG data features are used to identify those which give the best results for different classification techniques. Different methods are designed to format the EEG dataset. The formatted datasets are then evaluated with various machine learning techniques to find out which technique can accurately classify EEG data according to the associated affective/emotional states. The research method includes conducting an experiment. The aim of the experiment was to identify the various emotional states in subjects as they looked at different pictures, and to record the EEG data. The obtained EEG data is processed, formatted and evaluated with various machine learning techniques to find out which technique can accurately classify the EEG data according to the associated affective/emotional states. The experiment confirms the choice of technique for improving the accuracy of the results. According to the results, Support Vector Machine is the best and Regression Tree the second best at classifying EEG data associated with specific affective/emotional states, with accuracies up to 70.00% and 60.00% respectively. SVM performs better than RT; however, RT is known for providing better accuracies on diverse EEG data.
APA, Harvard, Vancouver, ISO, and other styles