Dissertations / Theses on the topic 'Streaming Data Processing for Machine Learning'

Consult the top 50 dissertations / theses for your research on the topic 'Streaming Data Processing for Machine Learning.'

1

García-Martín, Eva. "Extraction and Energy Efficient Processing of Streaming Data." Licentiate thesis, Blekinge Tekniska Högskola, Institutionen för datalogi och datorsystemteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-15532.

Abstract:
The interest in machine learning algorithms is increasing, in parallel with the advancements in hardware and software required to mine large-scale datasets. Machine learning algorithms account for a significant amount of the energy consumed in data centers, which impacts global energy consumption. However, machine learning algorithms are optimized towards predictive performance and scalability. Algorithms with low energy consumption are necessary for embedded systems and other resource-constrained devices, and desirable for platforms that require many computations, such as data centers. Data stream mining investigates how to process potentially infinite streams of data without the need to store all the data. This ability is particularly useful for companies that are generating data at a high rate, such as social networks. This thesis investigates algorithms in the data stream mining domain from an energy efficiency perspective. The thesis comprises two parts. The first part explores how to extract and analyze data from Twitter, with a pilot study that investigates a correlation between hashtags and followers. The second and main part investigates how energy is consumed and optimized in an online learning algorithm suitable for data stream mining tasks, focusing on analyzing, understanding, and reformulating the Very Fast Decision Tree (VFDT) algorithm, the original Hoeffding tree algorithm, into an energy efficient version. It presents three key contributions. First, it shows how energy varies in the VFDT from a high-level view by tuning different parameters. Second, it presents a methodology to identify energy bottlenecks in machine learning algorithms, by portraying the functions of the VFDT that consume the largest amount of energy. Third, it introduces dynamic parameter adaptation for Hoeffding trees, a method to dynamically adapt the parameters of Hoeffding trees to reduce their energy consumption. The results show an average energy reduction of 23% on the VFDT algorithm.
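
The split decisions at the heart of the VFDT rest on the Hoeffding bound, and the parameters tuned in the thesis (such as the split confidence and tie threshold) control how that bound is applied. Below is a minimal sketch of the bound and the resulting split test; the parameter values (`delta`, `tau`, `value_range`) and the gain numbers are illustrative assumptions, not the thesis's tuned settings.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Error bound epsilon after n observations, with confidence 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, n, value_range=1.0, delta=1e-7, tau=0.05):
    """Split if the best attribute's gain beats the runner-up by more than
    epsilon, or if the two are so close that the tie threshold tau applies.
    Evaluating this test less often (a larger grace period) saves energy."""
    eps = hoeffding_bound(value_range, delta, n)
    return (g_best - g_second > eps) or (eps < tau)

# Example: after 200 instances, information gains of 0.30 vs. 0.24.
print(should_split(0.30, 0.24, n=200))
```
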
2

Kumar, Saurabh. "Real-Time Road Traffic Events Detection and Geo-Parsing." Thesis, Purdue University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10842958.

Abstract:
In the 21st century, there is an increasing number of vehicles on the road as well as a limited road infrastructure. These aspects culminate in daily challenges for the average commuter due to congestion and slow-moving traffic. In the United States alone, congestion costs the average driver $1200 every year in the form of fuel and time. Some positive steps, including (a) the introduction of push notification systems and (b) deploying more law enforcement troops, have been taken for better traffic management. However, these methods have limitations and require extensive planning. Another method to deal with traffic problems is to track congested areas in a city using social media, so that law enforcement resources can be re-routed to these areas on a real-time basis. Given the ever-increasing number of smartphone devices, social media can be used as a source of information to track traffic-related incidents. Social media sites allow users to share their opinions and information. Platforms like Twitter, Facebook, and Instagram are very popular among users. These platforms enable users to share whatever they want in the form of text and images; Facebook users alone generate millions of posts in a minute. On these platforms, abundant data, including news, trends, events, opinions, product reviews, etc., is generated on a daily basis. Worldwide, organizations use social media for marketing purposes. This data can also be used to analyze traffic-related events like congestion, construction work, and slow-moving traffic. The motivation behind this research is thus to use social media posts to extract information relevant to traffic, with effective and proactive traffic administration as the primary focus. I propose an intuitive two-step process for retrieving traffic-related information from Twitter users' posts on a real-time basis. It uses a text classifier to filter the posts, keeping only those that contain traffic information. This is followed by a Part-Of-Speech (POS) tagger to find the geolocation information. A prototype of the proposed system is implemented using a distributed microservices architecture.
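
The two-step idea can be pictured with scikit-learn for the traffic/non-traffic filter and NLTK's POS tagger to pull out candidate location tokens. A hedged sketch: the tiny training set and the "proper nouns are location candidates" rule are illustrative assumptions, not the thesis's actual pipeline or features.

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' models
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1: train a filter that keeps only traffic-related posts.
tweets = ["Heavy congestion on I-65 near Lafayette",
          "Great concert last night!",
          "Accident at Main Street, traffic crawling",
          "I love this new phone"]
labels = [1, 0, 1, 0]  # 1 = traffic-related
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(tweets, labels)

# Step 2: POS-tag the retained tweets and keep proper nouns (NNP)
# as candidate geolocation mentions.
for tweet in tweets:
    if clf.predict([tweet])[0] == 1:
        tags = nltk.pos_tag(nltk.word_tokenize(tweet))
        places = [tok for tok, tag in tags if tag == "NNP"]
        print(tweet, "->", places)
```
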
3

Åkerström, Emelie. "Real-time Outlier Detection using Unbounded Data Streaming and Machine Learning." Thesis, Luleå tekniska universitet, Datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-80044.

Abstract:
Accelerated advancements in technology, the Internet of Things, and cloud computing have spurred an emergence of unstructured data that is contributing to rapid growth in data volumes. No human can manage to keep up with monitoring and analyzing these unbounded data streams, and thus predictive and analytic tools are needed. By leveraging machine learning, this data can be converted into insights that enable data-driven decisions, which can drastically accelerate innovation, improve user experience, and drive operational efficiency. The purpose of this thesis is to design and implement a system for real-time outlier detection on unbounded data streams using machine learning. Traditionally, this is accomplished by using alarm thresholds on important system metrics. Yet, a static threshold cannot account for changes in trends and seasonality, changes in the system, or an increased system load. The intention is therefore to leverage machine learning to instead look for deviations in the behavior of the data caused not by natural changes but by malfunctions. The use case driving the thesis forward is real-time outlier detection in a Content Delivery Network (CDN). The input data includes HTTP error messages received by clients, and contextual information like region, cache domains, and error codes, to provide tailor-made predictions accounting for the trends in the data. The outlier detection system consists of a data collection pipeline leveraging the technique of stream processing, a MiniBatchKMeans clustering model that provides online clustering of incoming data according to their similar characteristics, and an LSTM AutoEncoder that accounts for the temporal nature of the data and detects outlier data points in the clusters. An important finding is that an outlier is defined as an abnormal amount of outlier data points all originating from the same cluster, not as a single outlier data point. The alerting system therefore implements an outlier percentage threshold. The experimental results show that an outlier is detected within one minute of a cache breakdown. This triggers an alert to the system owners, containing graphs of the clustered data to narrow down the search for the cause and enable preventive action against the incident. Further results show that within two minutes of fixing the cause, the system provides feedback that the actions taken were successful. Considering the real-time requirements of the CDN environment, it is concluded that the short detection delay is indeed real-time, proving that machine learning is able to detect outliers in unbounded data streams in a real-time manner. Further analysis shows that the system is more accurate during peak hours, when more data is in circulation, than during off-peak hours, despite the temporal LSTM layers; presumably an effect of the model needing to train on more data to better account for seasonality and trends. Future work necessary to put the outlier detection system in production thus includes more training to improve accuracy and correctness. Furthermore, one could consider implementing the functionality necessary for a production environment and possibly adding features that can automatically avert detected incidents and handle their causes.
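
The clustering half of the pipeline can be sketched with scikit-learn: MiniBatchKMeans is updated online with `partial_fit` as batches arrive, points far from their assigned centroid are flagged, and an alert fires when the outlier percentage within a cluster crosses a threshold. The distance cutoff and the 10% alert threshold are illustrative assumptions; the thesis additionally feeds the clusters into an LSTM autoencoder.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
model = MiniBatchKMeans(n_clusters=3, random_state=0)

def process_batch(batch, dist_cutoff=3.0, alert_pct=0.10):
    model.partial_fit(batch)                       # online cluster update
    labels = model.predict(batch)
    dists = np.linalg.norm(batch - model.cluster_centers_[labels], axis=1)
    outliers = dists > dist_cutoff                 # far-from-centroid points
    for c in range(model.n_clusters):
        mask = labels == c
        if mask.sum() and outliers[mask].mean() > alert_pct:
            print(f"ALERT: cluster {c}: {outliers[mask].mean():.0%} outliers")

for _ in range(10):                                # simulated stream of batches
    process_batch(rng.normal(size=(100, 4)))
```
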
4

Wang, Zheng. "Machine learning based mapping of data and streaming parallelism to multi-cores." Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5664.

Abstract:
Multi-core processors are now ubiquitous and are widely seen as the most viable means of delivering performance with increasing transistor densities. However, this potential can only be realised if the application programs are suitably parallel. Applications can either be written in parallel from scratch or converted from existing sequential programs. Regardless of how applications are parallelised, the code must be efficiently mapped onto the underlying platform to fully exploit the hardware’s potential. This thesis addresses the problem of finding the best mappings of data and streaming parallelism—two types of parallelism that exist in broad and important domains such as scientific, signal processing and media applications. Despite significant progress having been made over the past few decades, state-of-the-art mapping approaches still largely rely upon hand-crafted, architecture-specific heuristics. Developing a heuristic by hand, however, often requires months of development time. As multi-core designs become increasingly diverse and complex, manually tuning a heuristic for a wide range of architectures is no longer feasible. What are needed are innovative techniques that can automatically scale with advances in multi-core technologies. In this thesis two distinct areas of computer science, namely parallel compiler design and machine learning, are brought together to develop new compiler-based mapping techniques. Using machine learning, it is possible to automatically build high-quality mapping schemes, which adapt to evolving architectures, with little human involvement. First, two techniques are proposed to find the best mapping of data parallelism. The first technique predicts whether parallel execution of a data parallel candidate is profitable on the underlying architecture. On a typical multi-core platform, it achieves almost the same (and sometimes a better) level of performance when compared to the manually parallelised code developed by independent experts. For a profitable candidate, the second technique predicts how many threads should be used to execute the candidate across different program inputs. The second technique achieves, on average, over 96% of the maximum available performance on two different multi-core platforms. Next, a new approach is developed for partitioning stream applications. This approach predicts the ideal partitioning structure for a given stream application. Based on the prediction, a compiler can rapidly search the program space (without executing any code) to generate a good partition. It achieves, on average, a 1.90x speedup over the already tuned partitioning scheme of a state-of-the-art streaming compiler.
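
The thread-number predictor can be pictured as a supervised model trained offline on extracted program features, labelled with the empirically best thread count. A toy sketch on synthetic data, purely illustrative of the approach rather than the thesis's actual feature set or learner:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Features per parallel candidate: [iterations, mem_accesses, branch_ratio]
# (an assumed feature vector, not the thesis's real features).
X = rng.uniform(size=(500, 3))
# Synthetic "best thread count" labels: larger loops profit from more threads.
y = np.where(X[:, 0] > 0.6, 8, np.where(X[:, 0] > 0.3, 4, 1))

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
candidate = [[0.7, 0.2, 0.5]]  # features extracted from a new loop
print("predicted thread count:", model.predict(candidate)[0])
```
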
5

Alzubi, Omar A. "Designing machine learning ensembles : a game coalition approach." Thesis, Swansea University, 2013. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.678293.

6

Osama, Muhammad. "Machine learning for spatially varying data." Licentiate thesis, Uppsala universitet, Avdelningen för systemteknik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-429234.

Abstract:
Many physical quantities around us vary across space or space-time. An example of a spatial quantity is the temperature across Sweden on a given day, and an example of a spatio-temporal quantity is the count of corona virus cases across the globe. Spatial and spatio-temporal data enable opportunities to answer many important questions: for example, what will the weather be like tomorrow, or where is the risk of disease occurrence highest in the next few days? Answering questions such as these requires formulating and learning statistical models. One of the challenges with spatial and spatio-temporal data is that the size of the data can be extremely large, which makes learning a model computationally costly. There are several ways of overcoming this problem by means of matrix manipulations and approximations. In paper I, we propose a solution to this problem where the model is learned in a streaming fashion, i.e., as the data arrives point by point. This also allows for efficient updating of the learned model based on newly arriving data, which is very pertinent to spatio-temporal data. Another interesting problem in the spatial context is to study the causal effect that an exposure variable has on a response variable. For instance, policy makers might be interested in knowing whether increasing the number of police in a district has the desired effect of reducing crime there. The challenge here is that of spatial confounding. A spatial map of the number of police against the spatial map of the number of crimes in different districts might show a clear association between these two quantities. However, there might be a third unobserved confounding variable that makes both quantities small and large together. In paper II, we propose a solution for estimating causal effects in the presence of such a confounding variable. Another common type of spatial data is point or event data, i.e., the occurrence of events across space. The event could for example be a reported disease or crime, and one may be interested in predicting the counts of the event in a given region. A fundamental challenge here is to quantify the uncertainty in the predicted counts in a robust manner. In paper III, we propose a regularized criterion for learning a predictive model of counts of events across spatial regions. The regularization ensures tighter prediction intervals around the predicted counts, with valid coverage irrespective of the degree of model misspecification.
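
The point-by-point learning of paper I can be illustrated with recursive least squares, which refines a linear model for each new observation without revisiting old data. This simplified, forgetting-free formulation is an assumption for demonstration, not the paper's exact model.

```python
import numpy as np

class RecursiveLeastSquares:
    """Online linear regression: O(d^2) update per arriving data point."""
    def __init__(self, dim, reg=1.0):
        self.w = np.zeros(dim)
        self.P = np.eye(dim) / reg        # inverse of regularized Gram matrix

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        k = Px / (1.0 + x @ Px)           # gain vector
        self.w += k * (y - x @ self.w)    # correct the prediction error
        self.P -= np.outer(k, Px)         # rank-one update of the inverse

rls = RecursiveLeastSquares(dim=2)
for x, y in [([1.0, 0.1], 0.5), ([1.0, 0.4], 1.1), ([1.0, 0.9], 2.0)]:
    rls.update(x, y)                      # model refined as data streams in
print(rls.w)
```
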
7

Awodokun, Olugbenga. "Classification of Patterns in Streaming Data Using Clustering Signatures." University of Cincinnati / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1504880155623189.

8

Fothergill, John Simon. "The coaching-machine learning interface : indoor rowing." Thesis, University of Cambridge, 2014. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.648459.

9

de la Rúa Martínez, Javier. "Scalable Architecture for Automating Machine Learning Model Monitoring." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-280345.

Abstract:
In recent years, due to the advent of more sophisticated tools for exploratory data analysis, data management, Machine Learning (ML) model training and model serving in production, the concept of MLOps has gained popularity. As an effort to bring DevOps processes to the ML lifecycle, MLOps aims at more automation in the execution of diverse and repetitive tasks along the cycle and at smoother interoperability between the teams and tools involved. In this context, the main cloud providers have built their own ML platforms [4, 34, 61], offered as services in their cloud solutions. Moreover, multiple frameworks have emerged to solve concrete problems such as data testing, data labelling, distributed training or prediction interpretability, and new monitoring approaches have been proposed [32, 33, 65]. Among all the stages in the ML lifecycle, one of the most commonly overlooked, although highly relevant, is model monitoring. Recently, cloud providers have presented their own tools to use within their platforms [4, 61], while work is ongoing to integrate existing frameworks [72] into open-source model serving solutions [38]. Most of these frameworks are either built as an extension of an existing platform (i.e., they lack portability), follow a scheduled batch processing approach with a minimum period of hours, or present limitations for certain outlier and drift algorithms due to the architecture of the platform in which they are integrated. In this work, a scalable, automated, cloud-native architecture for ML model monitoring in a streaming approach is designed and evaluated. An experiment conducted on a 7-node cluster with 250,000 requests at different concurrency rates shows maximum latencies of 5.9, 29.92 and 30.86 seconds after request time for 75% of the distance-based outlier detection, windowed statistics and distribution-based data drift detection tasks, respectively, using windows of 15 seconds in length and a watermark delay of 6 seconds.
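
The windowed-statistics monitor can be sketched as event-time tumbling windows that are emitted only once the watermark (the maximum seen event time minus the allowed delay) passes the window end. The 15-second windows and 6-second watermark delay mirror the experiment's settings; the rest of this sketch is an illustrative assumption rather than the work's actual streaming framework.

```python
from collections import defaultdict
import statistics

WINDOW, DELAY = 15, 6          # seconds, as in the experiment
windows = defaultdict(list)    # window start time -> observed values
watermark = 0.0

def on_event(event_time, value):
    """Assign to a 15 s tumbling window; flush windows the watermark passed."""
    global watermark
    windows[int(event_time // WINDOW) * WINDOW].append(value)
    watermark = max(watermark, event_time - DELAY)
    for start in sorted(w for w in windows if w + WINDOW <= watermark):
        vals = windows.pop(start)   # window is complete: emit its statistics
        print(f"window [{start},{start + WINDOW}): "
              f"mean={statistics.mean(vals):.2f}, n={len(vals)}")

for t, v in [(1, 0.2), (7, 0.4), (14, 0.3), (22, 0.9), (37, 0.5)]:
    on_event(t, v)
```
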
10

Svantesson, David. "Implementing Streaming Parallel Decision Trees on Graphic Processing Units." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-230953.

Abstract:
Decision trees have long been a prevalent area within machine learning. With streaming data environments as well as large datasets becoming increasingly common, researchers have developed decision tree algorithms adapted to streaming data. One such algorithm is SPDT, which approaches the streaming data problem by making use of workers on a network combined with a dynamic histogram approximation of the data. There exist several implementations of decision trees on GPU, but they are uncommon in a streaming data setting. In this research, conducted at RISE SICS, the possibility of accelerating the SPDT algorithm on GPU is investigated. An implementation is successfully created using the CUDA platform. The implementation uses a set number of data samples per layer to better fit the GPU platform. Experiments were conducted to investigate the impact on both accuracy and speed. It is found that the GPU implementation performs as well as the CPU implementation in terms of accuracy, suggesting that using small subsets of the data in each layer is sufficient for making accurate split decisions. The GPU implementation is found to be up to 113 times faster than the reference Scala CPU implementation for one of the tested datasets, and 13 times faster on average over all the tested datasets. Weak parts of the implementation are identified, and further improvements are suggested to increase both accuracy and runtime performance.
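
SPDT's dynamic histogram keeps a bounded number of (value, count) bins per feature, merging the two closest bins when the budget is exceeded; split candidates are then read off the bin boundaries. A compact sketch of that merge rule (the bin budget here is an arbitrary illustrative choice):

```python
import bisect

class StreamHistogram:
    """Bounded streaming histogram in the style used by SPDT."""
    def __init__(self, max_bins=8):
        self.max_bins = max_bins
        self.bins = []                      # sorted list of [value, count]

    def add(self, x):
        bisect.insort(self.bins, [x, 1])
        if len(self.bins) > self.max_bins:  # merge the two closest bins
            gaps = [self.bins[i + 1][0] - self.bins[i][0]
                    for i in range(len(self.bins) - 1)]
            i = gaps.index(min(gaps))
            (v1, c1), (v2, c2) = self.bins[i], self.bins[i + 1]
            merged = [(v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2]
            self.bins[i:i + 2] = [merged]

h = StreamHistogram()
for x in [1.0, 1.1, 5.0, 5.2, 9.0, 1.05, 5.1, 8.9, 0.9, 9.1]:
    h.add(x)
print(h.bins)   # candidate split points lie between adjacent bins
```
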
11

Wang, Wei. "Automatic Chinese calligraphic font generation with machine learning technology." Thesis, University of Macau, 2018. http://umaclib3.umac.mo/record=b3950605.

12

Bardolet Pettersson, Susana. "Managing imbalanced training data by sequential segmentation in machine learning." Thesis, Linköpings universitet, Avdelningen för medicinsk teknik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-155091.

Abstract:
Imbalanced training data is a common problem in machine learning applications. This problem refers to datasets in which the foreground pixels are significantly fewer than the background pixels. By training a machine learning model with imbalanced data, the result is typically a model that classifies all pixels as the background class. A result that indicates no presence of a specific condition when it is actually present is particularly undesired in medical imaging applications. This project proposes a sequential system of two fully convolutional neural networks to tackle the problem. Semantic segmentation of lung nodules in thoracic computed tomography images has been performed to evaluate the performance of the system. The imbalanced data problem is present in the training dataset used in this project, where the average percentage of pixels belonging to the foreground class is 0.0038%. The sequential system achieved a sensitivity of 83.1%, representing an increase of 34% compared to the single system. The system only missed 16.83% of the nodules, but had a Dice score of 21.6% due to the detection of multiple false positives. With continued development, this method shows considerable potential as a solution to the imbalanced data problem.
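
The two metrics reported above have simple definitions over binary masks. A sketch on synthetic masks showing how the Dice score drops when false positives appear even while nodule-level sensitivity stays high:

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Dice = 2|A∩B| / (|A|+|B|); false positives inflate |A| and lower it."""
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

def sensitivity(pred, target, eps=1e-8):
    """Fraction of true foreground pixels that were detected."""
    inter = np.logical_and(pred, target).sum()
    return inter / (target.sum() + eps)

target = np.zeros((8, 8), dtype=bool)
target[2:4, 2:4] = True                  # the true nodule
pred = target.copy()
pred[6:8, 6:8] = True                    # full hit plus one false positive
print(f"dice={dice(pred, target):.2f}, "
      f"sensitivity={sensitivity(pred, target):.2f}")
```
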
13

Chen, Li. "Statistical Machine Learning for Multi-platform Biomedical Data Analysis." Diss., Virginia Tech, 2011. http://hdl.handle.net/10919/77188.

Abstract:
Recent advances in biotechnologies have enabled multiplatform and large-scale quantitative measurements of biomedical events. The need to analyze the produced vast amount of imaging and genomic data stimulates various novel applications of statistical machine learning methods in many areas of biomedical research. The main objective is to assist biomedical investigators to better interpret, analyze, and understand the biomedical questions based on the acquired data. Given the computational challenges imposed by these high-dimensional and complex data, machine learning research finds new opportunities and roles. In this dissertation, we propose to develop, test and apply novel statistical machine learning methods to analyze data mainly acquired by dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) and single nucleotide polymorphism (SNP) microarrays. The research work focuses on: (1) tissue-specific compartmental analysis for dynamic contrast-enhanced MR imaging of complex tumors; (2) computational analysis for detecting DNA SNP interactions in genome-wide association studies. DCE-MRI provides a noninvasive method for evaluating tumor vasculature patterns based on contrast accumulation and washout. Compartmental analysis is a widely used mathematical tool to model dynamic imaging data and can provide accurate pharmacokinetic parameter estimates. However, the partial volume effect (PVE) existing in imaging data would have a profound effect on the accuracy of pharmacokinetic studies. We therefore propose a convex analysis of mixtures (CAM) algorithm to explicitly eliminate PVE by expressing the kinetics in each pixel as a nonnegative combination of underlying compartments and subsequently identifying pure volume pixels at the corners of the clustered pixel time series scatter plot. The algorithm is supported by a series of newly proved theorems and additional noise filtering and normalization preprocessing. We demonstrate the principle and feasibility of the CAM approach together with compartmental modeling on realistic synthetic data, and compare the accuracy of parameter estimates obtained using CAM or other relevant techniques. Experimental results show a significant improvement in the accuracy of kinetic parameter estimation. We then apply the algorithm to real DCE-MRI data of breast cancer and observe improved pharmacokinetic parameter estimation that separates tumor tissue into sub-regions with differential tracer kinetics on a pixel-by-pixel basis and reveals biologically plausible tumor tissue heterogeneity patterns. This method combines the advantages of multivariate clustering, convex optimization and compartmental modeling approaches. Interactions among genetic loci are believed to play an important role in disease risk. Due to the huge dimension of SNP data (normally several millions in genome-wide association studies), the combinatorial search and statistical evaluation required to detect multi-locus interactions constitute a significantly challenging computational task. While many approaches have been proposed for detecting such interactions, their relative performance remains largely unclear, due to the fact that performance was evaluated on different data sources, using different performance measures, and under different experimental protocols.
Given the importance of detecting gene-gene interactions, a thorough evaluation of the performance and limitations of available methods, a theoretical analysis of the interaction effect and the genetic factors it depends on, and the development of more efficient methods are warranted. Therefore, we perform a computational analysis to detect interactions among SNPs. The contributions are four-fold: (1) we developed simulation tools for evaluating the performance of any technique designed to detect interactions among genetic variants in case-control studies; (2) we used these tools to compare the performance of five popular SNP detection methods; (3) we derived analytic relationships between power and the genetic factors, which not only support the experimental results but also give a quantitative linkage between interaction effect and these factors; and (4) based on the novel insights gained by comparative and theoretical analysis, we developed an efficient statistically-principled method, namely hybrid correlation-based association (HCA), to detect interacting SNPs. The HCA algorithm is based on three correlation-based statistics, which are designed to measure the strength of multi-locus interaction for three different interaction types, covering a large portion of possible interactions. Moreover, to maximize the detection power (sensitivity) while suppressing the false positive rate (or retaining moderate specificity), we also devised a strategy to hybridize these three statistics in a case-by-case way. A heuristic search strategy is also proposed to largely decrease the computational complexity, especially for high-order interaction detection. We have tested HCA in both a simulation study and a real disease study. HCA and the selected peer methods were compared on a large number of simulated datasets, each including multiple sets of interaction models. The assessment criteria included several power measures, family-wise type I error rate, and computational complexity. The experimental results of HCA on the simulation data indicate its promising performance in terms of a good balance between detection accuracy and computational complexity. Run on multiple real datasets, HCA also replicates plausible biomarkers reported in the previous literature.
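
The core CAM step of expressing each pixel's kinetics as a nonnegative combination of compartment time courses can be illustrated with nonnegative least squares. The two synthetic compartment curves below are assumptions for demonstration, not the dissertation's estimated kinetics.

```python
import numpy as np
from scipy.optimize import nnls

t = np.linspace(0, 5, 50)
# Two assumed compartment time-activity curves (fast and slow washout).
fast = np.exp(-1.5 * t) * (1 - np.exp(-5 * t))
slow = np.exp(-0.2 * t) * (1 - np.exp(-5 * t))
A = np.column_stack([fast, slow])

# A pixel that mixes 70% fast and 30% slow compartment, plus noise.
pixel = 0.7 * fast + 0.3 * slow \
        + np.random.default_rng(0).normal(0, 0.01, t.size)

coeffs, residual = nnls(A, pixel)    # nonnegative mixture weights
print("estimated compartment fractions:", coeffs.round(2))
```
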
14

Lu, Yang. "Advances in imbalanced data learning." HKBU Institutional Repository, 2019. https://repository.hkbu.edu.hk/etd_oa/657.

Abstract:
With the increasing availability of large amounts of data in a wide range of applications, in industry as well as academia, it becomes crucial to understand the nature of complex raw data in order to gain more value from data engineering. Although many problems have been successfully solved by mature machine learning techniques, learning from imbalanced data continues to be one of the challenges in the field of data engineering and machine learning, and has attracted growing attention in recent years due to its complexity. In this thesis, we focus on four aspects of imbalanced data learning and propose solutions to the key problems. The first aspect is ensemble methods for imbalanced data classification. Ensemble methods, e.g. bagging and boosting, have the advantage of handling imbalanced data when integrated with sampling methods. However, there are still problems in the integration. One problem is that undersampling and oversampling complement each other, and the sampling ratio is crucial to the classification performance. This thesis introduces a new method, HSBagging, which is based on bagging with hybrid sampling. Experiments show that HSBagging outperforms other state-of-the-art bagging methods on imbalanced data. Another problem concerns the integration of boosting and sampling for imbalanced data classification. The classifier weights of existing AdaBoost-based methods are inconsistent with the objective of class imbalance classification. In this thesis, we propose a novel boosting optimization framework, GOBoost. This framework can be applied to any boosting-based method for class imbalance classification by simply replacing the calculation of the classifier weights. Experiments show that the GOBoost-based methods significantly outperform the corresponding boosting-based methods. The second aspect is online learning for imbalanced data streams with concept drift. In the online learning scenario, if the data stream is imbalanced, it is difficult to detect concept drifts and adapt the online learner to them. The ensemble classifier weights are hard to adjust to achieve the balance between stability and adaptability. Besides, the classifier built on samples in a fixed-size chunk, which may be highly imbalanced, is unstable in the ensemble. In this thesis, we propose Adaptive Chunk-based Dynamic Weighted Majority (ACDWM) to dynamically weigh the individual classifiers according to their performance on the current data chunk. Meanwhile, the chunk size is adaptively selected by statistical hypothesis tests. Experiments on both synthetic and real datasets with concept drift show that ACDWM outperforms both the state-of-the-art chunk-based and online methods. In addition to imbalanced data classification, the third aspect is clustering on imbalanced data. This thesis studies the key problem of imbalanced data clustering called the uniform effect within the k-means-type framework, where the clustering results tend to be balanced. Thus, this thesis introduces a new method called Self-adaptive Multi-prototype-based Competitive Learning (SMCL) for imbalanced clusters. It uses multiple subclusters to represent each cluster, with an automatic adjustment of the number of subclusters. Then, the subclusters are merged into the final clusters based on a novel separation measure. Experimental results show the efficacy of SMCL for imbalanced clusters and its superiority over its competitors.
Rather than a specific algorithm for imbalanced data learning, the final aspect is a measure of class imbalance in a dataset for classification. Recent studies have shown that the imbalance ratio is not the only cause of the performance loss of a classifier in imbalanced data classification. To the best of our knowledge, there is no measurement of the extent of the influence of class imbalance on the classification performance of imbalanced data. Accordingly, this thesis proposes a data measure called the Bayes Imbalance Impact Index (BI³) to reflect the extent of influence purely by the factor of imbalance for the whole dataset. As a result, we can use BI³ to judge whether it is worth using imbalance recovery methods like sampling or cost-sensitive methods to recover the performance loss of a classifier. The experiments show that BI³ is highly consistent with the improvement of F1-score made by the imbalance recovery methods on both synthetic and real benchmark datasets. In summary, the contributions of this thesis are:
1. Two ensemble frameworks for imbalanced data classification, for sampling rate selection and boosting weight optimization, respectively.
2. A chunk-based online learning algorithm that dynamically adjusts the ensemble classifiers and selects the chunk size for imbalanced data streams with concept drift.
3. A multi-prototype competitive learning algorithm for clustering on imbalanced data.
4. A measure of imbalanced data that evaluates how the classification performance of a dataset is influenced by the factor of imbalance.
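
The hybrid-sampling idea behind HSBagging can be sketched with scikit-learn's `resample`: each bag undersamples the majority class and oversamples the minority class toward a common size before fitting one tree per balanced bag. The 50/50 per-bag ratio and the `hybrid_bag_fit` helper are illustrative assumptions, not the method's tuned sampling ratio.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def hybrid_bag_fit(X, y, n_estimators=10, seed=0):
    """Fit one tree per hybrid-sampled (balanced) bag; y uses 1 = minority."""
    rng = np.random.RandomState(seed)
    Xmin, Xmaj = X[y == 1], X[y == 0]
    target = (len(Xmin) + len(Xmaj)) // 2   # common per-class size
    trees = []
    for _ in range(n_estimators):
        maj = resample(Xmaj, n_samples=target, replace=False,
                       random_state=rng)                       # undersample
        mini = resample(Xmin, n_samples=target, replace=True,
                        random_state=rng)                      # oversample
        Xb = np.vstack([maj, mini])
        yb = np.r_[np.zeros(len(maj)), np.ones(len(mini))]
        trees.append(DecisionTreeClassifier().fit(Xb, yb))
    return trees

def predict(trees, X):
    """Majority vote over the bagged trees."""
    return (np.mean([t.predict(X) for t in trees], axis=0) > 0.5).astype(int)
```
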
15

Zhong, Yuqing. "Investigating Human Gut Microbiome in Obesity with Machine Learning Methods." Thesis, University of North Texas, 2017. https://digital.library.unt.edu/ark:/67531/metadc1011875/.

Abstract:
Obesity is a common disease among all ages that has threatened human health and has become a global concern. Gut microbiota can affect human metabolism and thus may modulate obesity. Certain mixes of gut microbiota can protect the host to stay healthy or predispose the host to obesity. Modern next-generation sequencing techniques allow access to huge amounts of genetic information underlying microbiota and thus provide new insights into the functionality of these micro-organisms and their interactions with the host. Multiple previous studies have demonstrated that the microbiome might contribute to obesity by increasing dietary energy harvest, promoting fat deposition and triggering systemic inflammation. However, these studies are based either on lab cultivation or on basic statistical analysis. In order to further explore how gut microbiota affect obesity, this thesis utilizes a series of machine learning methods to analyze a large amount of metagenomic data from the human gut microbiome. The publicly available Human Microbiome Project (HMP) metagenomic sequencing data, which contain microbiome data for healthy adults, including overweight and obese individuals, were used for this study. The HMP gut data were organized based on two different feature definitions: taxonomic information and metabolic reconstruction information. Several widely used classification algorithms, namely Naive Bayes, Random Forest, SVM and elastic net logistic regression, were applied to predict the healthy or obese status of the subjects based on cross-validation accuracy. Furthermore, the corresponding feature selection algorithms were used to identify signature features in each dataset that lead to the differences between healthy and obese samples. The results showed that these algorithms perform worse on taxonomic data than on metabolic pathway data, though many of the selected taxa are supported by the literature. Among all the combinations of algorithms and data, elastic net logistic regression had the best cross-validation performance and thus became the best model. In this model, several important features were found, some of them consistent with previous studies. Rerunning the classifiers using only the features selected by elastic net logistic regression further improved their performance. On the other hand, this study uncovered some new features that have not been supported by previous studies. These new features could also be potential targets for distinguishing obese and healthy subjects. The present thesis work compares the strengths and weaknesses of different machine learning techniques with different types of features originating from the same metagenomic data. The features selected by these models could provide a deep understanding of the metabolic mechanisms of micro-organisms. It is therefore worthwhile to comprehensively understand the differences in gut microbiota between healthy and obese subjects, and particularly how the gut microbiome affects obesity.
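
Elastic net logistic regression, the best-performing model above, is available directly in scikit-learn. A hedged sketch on synthetic stand-in data (in the thesis the feature matrix would be taxa or pathway abundances), including reading off the selected features from the nonzero coefficients:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 50))        # stand-in abundance features
y = (X[:, 3] + X[:, 17] + rng.normal(size=200) > 2.5).astype(int)

# Elastic net = mixed L1/L2 penalty; requires the 'saga' solver.
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=0.5, max_iter=5000)
model.fit(X, y)
selected = np.flatnonzero(model.coef_[0])   # features surviving the L1 part
print("selected feature indices:", selected)
```
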
16

Rahman, M. Mostafizur. "Machine learning based data pre-processing for the purpose of medical data mining and decision support." Thesis, University of Hull, 2014. http://hydra.hull.ac.uk/resources/hull:10103.

Abstract:
Building an accurate and reliable model for prediction in different application domains is one of the most significant challenges in knowledge discovery and data mining. Sometimes, improved data quality is itself the goal of the analysis, usually to improve processes in a production database and the design of decision support. As medicine moves forward, there is a need for sophisticated decision support systems that make use of data mining to support more orthodox knowledge engineering and Health Informatics practice. However, real-life medical data rarely comply with the requirements of various data mining tools. They are often inconsistent, noisy, contain redundant attributes, come in an unsuitable format, contain missing values and are imbalanced with regard to the outcome class label. Many real-life data sets are incomplete, with missing values. In medical data mining, the problem of missing values has become a challenging issue. In many clinical trials, the medical report pro-forma allows some attributes to be left blank, because they are inappropriate for some class of illness or because the person providing the information feels that it is not appropriate to record the values of some attributes. The research reported in this thesis has explored the use of machine learning techniques as missing value imputation methods. The thesis also proposes a new way of imputing missing values by supervised learning: a classifier is used to learn the data patterns from a complete data subset, and the model is later used to predict the missing values for the full dataset. The proposed machine learning based missing value imputation was applied to the thesis data and the results are compared with traditional mean/mode imputation. Experimental results show that all the machine learning methods explored outperformed the statistical method (mean/mode). The class imbalance problem has been found to hinder the performance of learning systems. In fact, most medical datasets are found to be highly imbalanced in their class labels. The solution to this problem is to reduce the gap between the minority class samples and the majority class samples. Over-sampling can be applied to increase the number of minority class samples to balance the data. The alternative to over-sampling is under-sampling, where the size of the majority class sample is reduced. The thesis proposes a cluster-based under-sampling technique to reduce the gap between the majority and minority samples. Different under-sampling and over-sampling techniques were explored as ways to balance the data. The experimental results show that, for the thesis data, the newly proposed modified cluster-based under-sampling technique performed better than other class balancing techniques. In further research it was found that the class imbalance problem not only affects classification performance but also has an adverse effect on feature selection. The thesis proposes a new framework for feature selection for class imbalanced datasets. The research found that, using the proposed framework, the classifier needs fewer attributes to show high accuracy, and more attributes are needed if the data is highly imbalanced. The research described in the thesis makes the following four novel contributions:
a) an improved data mining methodology for mining medical data;
b) a machine learning based missing value imputation method;
c) a cluster-based semi-supervised class balancing method;
d) a feature selection framework for class imbalance datasets.
The performance analysis and comparative study show that the use of the proposed missing value imputation, class balancing and feature selection framework can provide an effective approach to data preparation for building medical decision support.
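
The supervised imputation idea (contribution b) can be sketched in a few lines: fit a classifier on the rows where the attribute is present, then predict it for the rows where it is missing. The `impute_column` helper, the column names and the RandomForest learner are illustrative assumptions, not the thesis's exact setup.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def impute_column(df, target, features):
    """Learn `target` from complete rows, then fill in its missing values."""
    known = df[df[target].notna()]
    missing = df[df[target].isna()]
    if missing.empty:
        return df
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(known[features], known[target])
    df.loc[missing.index, target] = model.predict(missing[features])
    return df

df = pd.DataFrame({"age": [34, 51, 29, 62, 45],
                   "bp":  [1, 2, 1, 2, 1],
                   "smoker": ["no", "yes", np.nan, "yes", np.nan]})
print(impute_column(df, target="smoker", features=["age", "bp"]))
```
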
17

Skaik, Ruba. "Predicting Depression and Suicide Ideation in the Canadian Population Using Social Media Data." Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/42346.

Abstract:
The economic burden of mental illness costs Canada billions of dollars every year. Millions of people suffer from mental illness, and only a fraction receives adequate treatment. Identifying people with mental illness requires initiation from those in need, available medical services, and professional experts’ time. These resources might not be available all the time. The common practice is to rely on clinical data, which is generally collected after the illness is developed and reported. Moreover, such clinical data is incomplete and hard to obtain. An alternative data source is conducting surveys through phone calls, interviews, or mail, but this is costly and time-consuming. Social media analysis has brought advances in leveraging population data to understand mental health problems. Thus, analyzing social media posts can be an essential alternative for identifying mental disorders throughout the Canadian population. Big data research on social media may also support standard surveillance approaches and provide decision-makers with usable information. More precisely, social media analysis has shown promising results for public health assessment and monitoring. In this research, we explore the task of automatically analysing social media textual data using Natural Language Processing (NLP) and Machine Learning (ML) techniques to detect signs of mental health disorders that need attention, such as depression and suicide ideation. Considering the lack of comprehensive annotated data in this field, we propose a methodology for transfer learning to utilize the information hidden in a training sample and leverage it on a different dataset to choose the best-generalized model to be applied at the population level. We also present evidence that ML models designed to predict suicide ideation using Reddit data can utilize the knowledge they encode to make predictions on Twitter data, even though the two platforms differ in purpose, structure, and limitations. In our proposed models, we use feature engineering with supervised machine learning algorithms (such as SVM, LR, RF, XGBoost, and GBDT), and we compare their results with those of deep learning algorithms (such as LSTM, Bi-LSTM, and CNNs). We adopt the CNN model for depression classification, which obtained the highest F1-score on the test dataset (0.898) and a recall of 0.941. This model is later used to estimate the depression level of the population. For suicide ideation detection, we used the CNN model with pre-trained fastText word embeddings and linguistic features (LIWC). The model achieved an F1-score of 0.936 and a recall of 0.88 for predicting suicide ideation at the user level on the test set. To compare our models’ predictions with official statistics, we used the 2015-2016 population-based Canadian Community Health Survey (CCHS) on Mental Health and Well-being conducted by Statistics Canada. The data is used to estimate depression and suicidality in Canadian provinces and territories. For depression, respondents (n=53,050) from 8 provinces/territories filled in the Patient Health Questionnaire-9 (PHQ-9). Each survey respondent with a score ≥ 10 on the PHQ-9 was interpreted as having moderate to severe depression, because this score is frequently used as a screening cut-point. The weighted percentage of depression prevalence during 2015 for females and males between the ages of 15 and 75 was 11.5% and 8.1%, respectively (with 54.2% females and 45.8% males).
Our model was applied to a population-representative dataset containing 24,251 Twitter users who posted 1,735,200 tweets during 2015, with a Pearson correlation of 0.88 for both sex and age within the seven provinces and the NT territory included in the CCHS. A correlation of 0.95 was calculated for age and sex (separately), and our model estimated that 10% of the sample dataset showed evidence of depression (58.3% females and 41.7% males). For the second task, suicide ideation, Statistics Canada (2015) estimated the total number of people who reported serious suicidal thoughts at 3,396,700 persons, i.e., 9.514% of the total population, whereas our models estimated that 10.6% of the population sample were at risk of suicide ideation (59% females and 41% males). The Pearson correlation coefficients between actual suicide ideation within the last 12 months and the model’s predictions for each province, by age, by sex, and combined, were above 0.62, which indicates a reasonable correlation.
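
A skeletal version of the adopted CNN text classifier, in Keras; the vocabulary size, padded length and filter settings are illustrative assumptions, and the thesis additionally used pre-trained fastText embeddings and LIWC features for the suicide-ideation model.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.metrics import Recall

VOCAB, MAXLEN = 20000, 100   # assumed vocabulary size and padded post length

model = models.Sequential([
    layers.Embedding(VOCAB, 128),              # thesis used fastText vectors
    layers.Conv1D(128, 5, activation="relu"),  # n-gram feature detectors
    layers.GlobalMaxPooling1D(),               # strongest signal per filter
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),     # depressed vs. not
])
model.build(input_shape=(None, MAXLEN))
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[Recall()])              # recall was a key metric here
model.summary()
```
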
18

Bergdorf, Johan. "Machine learning and rule induction in invoice processing : Comparing machine learning methods in their ability to assign account codes in the bookkeeping process." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-235931.

Abstract:
Companies with more than 3 million SEK in revenue per year are required by Swedish law to enter invoices into their bookkeeping as soon as the invoices arrive after a purchase. One part of this bookkeeping process is to choose which accounts are to be credited for every received invoice. This is a time-consuming process that demands finding the right account codes for every invoice, depending on a number of factors. This thesis investigates how well machine learning can manage this process. Specifically, it investigates how well machine learning methods that produce unordered rule sets can classify invoice data for the prediction of account codes. These rule induction methods are compared to two other popular and well-tested machine learning methods that do not necessarily produce rules for interpretation and knowledge discovery, as well as to two naive classifiers for baseline comparison. The results show that the naive classifiers are strong, but that the machine learning methods perform better when it comes to accuracy and F2-score. The results also show that the rule induction method FURIA produces significantly fewer rules than MODLEM. The non-rule-induction method Random forest tends to perform best overall on the given performance metrics.
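
Evaluation with the F2-score, which weights recall twice as heavily as precision, is a one-liner in scikit-learn. A hedged sketch of the comparison loop, with synthetic stand-in data for invoice features and a most-frequent-class dummy standing in for the naive classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

# Stand-in for invoice features labelled with account codes.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for name, clf in [("naive", DummyClassifier(strategy="most_frequent")),
                  ("random forest", RandomForestClassifier(random_state=0))]:
    pred = clf.fit(Xtr, ytr).predict(Xte)
    f2 = fbeta_score(yte, pred, beta=2, average="macro")  # recall-weighted
    print(f"{name}: F2 = {f2:.3f}")
```
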
19

Stefanova, Zheni Svetoslavova. "Machine Learning Methods for Network Intrusion Detection and Intrusion Prevention Systems." Scholar Commons, 2018. https://scholarcommons.usf.edu/etd/7367.

Abstract:
Given the continuing advancement of networking applications and our increased dependence upon software-based systems, there is a pressing need to develop improved security techniques for defending modern information technology (IT) systems from malicious cyber-attacks. Indeed, anyone can be impacted by such activities, including individuals, corporations, and governments. Furthermore, the sustained expansion of the network user base and its associated set of applications is also introducing additional vulnerabilities which can lead to criminal breaches and loss of critical data. As a result, the broader cybersecurity problem area has emerged as a significant concern, with many solution strategies being proposed for both intrusion detection and prevention. In general, the cybersecurity dilemma can be treated as a conflict-resolution setup entailing a security system and a minimum of two decision agents with competing goals (e.g., the attacker and the defender). Namely, on the one hand, the defender is focused on guaranteeing that the system operates at or above an adequate (specified) level. Conversely, the attacker is focused on trying to interrupt or corrupt the system’s operation. In light of the above, this dissertation introduces novel methodologies to build appropriate strategies for system administrators (defenders). In particular, detailed mathematical models of security systems are developed to analyze overall performance and predict the likely behavior of the key decision makers influencing the protection structure. The initial objective here is to create a reliable intrusion detection mechanism to help identify malicious attacks at a very early stage, i.e., in order to minimize potentially critical consequences and damage to system privacy and stability. Furthermore, another key objective is to develop effective intrusion prevention (response) mechanisms. Along these lines, a machine learning based solution framework is developed, consisting of two modules. Specifically, the first module prepares the system for analysis and detects whether or not there is a cyber-attack. Meanwhile, the second module analyzes the type of the breach and formulates an adequate response. Namely, a decision agent is used in the latter module to investigate the environment and make appropriate decisions in the case of uncertainty. This agent starts by conducting its analysis in a completely unknown milieu but continually learns to adjust its decision making based upon the provided feedback. The overall system is designed to operate in an automated manner without any intervention from administrators or other cybersecurity personnel. Human input is essentially only required to modify some key model (system) parameters and settings. Overall, the framework developed in this dissertation provides a solid foundation from which to develop improved threat detection and protection mechanisms for static setups, with further extensibility for handling streaming data.
20

Gonzalez Munoz, Mario, and Philip Hedström. "Predicting Customer Behavior in E-commerce using Machine Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-260269.

Abstract:
E-commerce has been a rapidly growing sector during the last years, and is predicted to continue growing just as fast during the next ones. This has opened up a lot of opportunities for companies trying to sell their products or services, but it also forces them to exploit these opportunities before their competitors in order not to fall behind. One interesting opportunity we have chosen to focus this thesis on is the ability to use customer data, which has not been available with physical stores, to identify customer behaviour patterns and develop a better understanding of the customers, hopefully making it possible to predict customer behaviour. We specifically focused on distinguishing possible-buyers from buyers, with the intent of identifying key factors that affect whether a customer performs a purchase or not. We did this using Binary Logistic Regression, a supervised machine learning algorithm that is trained to classify an input observation between two classes. We managed to create a model that predicted whether a customer was a possible-buyer or a buyer with an accuracy of 88%.
21

Ahmed, Kachkach. "Analyzing user behavior and sentiment in music streaming services." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186527.

Full text
Abstract:
In recent years, streaming services (for music, podcasts, TV shows and movies) have been in the spotlight for disrupting traditional media consumption platforms. While the technical implications of streaming huge amounts of data are well researched, much remains to be done to analyze the wealth of data collected by these services and exploit it to its full potential in order to improve them. Using raw data about users’ interactions with the music streaming service Spotify, this thesis focuses on three main concepts: streaming context, user attention and the sequential analysis of user actions. We discuss the importance of each of these aspects and propose different statistical and machine learning techniques to model them. We show how these models can be used to improve streaming services by inferring user sentiment and improving recommender systems, characterizing user sessions, extracting behavioral patterns and providing useful business metrics.
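One way to make the sequential-analysis idea concrete is a first-order Markov model over session actions. The sketch below, with invented action names, estimates transition probabilities from logged sequences; the thesis itself does not prescribe this exact model.

```python
from collections import Counter, defaultdict

# Hypothetical session logs; real streaming interaction data is far richer.
sessions = [
    ["play", "skip", "play", "play", "pause"],
    ["play", "play", "skip", "skip", "quit"],
]

counts = defaultdict(Counter)
for actions in sessions:
    for current, nxt in zip(actions, actions[1:]):
        counts[current][nxt] += 1

# Row-normalise the counts into transition probabilities P(next | current).
transitions = {
    a: {b: n / sum(c.values()) for b, n in c.items()}
    for a, c in counts.items()
}
print(transitions["play"])   # e.g. {'skip': 0.4, 'play': 0.4, 'pause': 0.2}
```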
APA, Harvard, Vancouver, ISO, and other styles
22

Hoyt, Matthew Ray. "Automatic Tagging of Communication Data." Thesis, University of North Texas, 2012. https://digital.library.unt.edu/ark:/67531/metadc149611/.

Full text
Abstract:
Globally distributed software teams are widespread throughout industry. But finding reliable methods that can properly assess a team's activities is a real challenge. Methods such as surveys and manual coding of activities are too time consuming and are often unreliable. Recent advances in information retrieval and linguistics, however, suggest that automated and/or semi-automated text classification algorithms could be an effective way of finding differences in the communication patterns among individuals and groups. Communication among group members is frequent and generates a significant amount of data. Thus having a web-based tool that can automatically analyze the communication patterns among global software teams could lead to a better understanding of group performance. The goal of this thesis, therefore, is to compare automatic and semi-automatic measures of communication and evaluate their effectiveness in classifying different types of group activities that occur within a global software development project. In order to achieve this goal, we developed a web-based component that can be used to help clean and classify communication activities. The component was then used to compare different automated text classification techniques on various group activities to determine their effectiveness in correctly classifying data from a global software development team project.
APA, Harvard, Vancouver, ISO, and other styles
23

Moshfeghi, Mohammadshakib, Jyoti Prasad Bartaula, and Aliye Tuke Bedasso. "Emotion Recognition from EEG Signals using Machine Learning." Thesis, Blekinge Tekniska Högskola, Sektionen för ingenjörsvetenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-4147.

Full text
Abstract:
The beauty of affective computing is to make machines more empathetic to the user. Machines with the capability of emotion recognition can effectively look inside the user’s head and act according to the observed mental state. In this thesis project, we investigate different feature sets to build an emotion recognition system from electroencephalographic signals. We used pictures from the International Affective Picture System to elicit three emotional states: positive valence (pleasant), neutral and negative valence (unpleasant), and also to induce three sets of binary states: positive valence vs. not positive valence; negative valence vs. not negative valence; and neutral vs. not neutral. The experiment used a head cap with six electrodes at the front of the scalp to record data from subjects. To solve the recognition task we developed a system based on Support Vector Machines (SVM) and extracted features, some taken from the literature and some proposed by ourselves, in order to rate the recognition of emotional states. With this system we were able to achieve an average recognition rate of up to 54% for the three emotional states and up to 74% for the binary states, solely based on EEG signals.
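A minimal sketch of an SVM pipeline of this kind, assuming synthetic EEG epochs, a 128 Hz sampling rate and alpha/beta band power as features; the thesis's actual feature set is richer than this.

```python
import numpy as np
from scipy.signal import welch
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
fs = 128                                          # assumed sampling rate (Hz)
epochs = rng.standard_normal((30, 6, fs * 4))     # 30 trials, 6 frontal channels, 4 s
labels = np.repeat([0, 1, 2], 10)                 # unpleasant / neutral / pleasant

def band_power(epoch, lo, hi):
    """Mean power in the [lo, hi] Hz band across channels, via Welch's method."""
    f, pxx = welch(epoch, fs=fs, axis=-1)
    band = (f >= lo) & (f <= hi)
    return pxx[:, band].mean()

# One common feature choice (an assumption here): alpha and beta band power.
X = np.array([[band_power(e, 8, 13), band_power(e, 13, 30)] for e in epochs])
print(cross_val_score(SVC(kernel="rbf"), X, labels, cv=5).mean())
```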
APA, Harvard, Vancouver, ISO, and other styles
24

Hyberg, Martin. "Software Issue Time Estimation With Natural Language Processing and Machine Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-295202.

Full text
Abstract:
Time estimation for software issues is crucial to planning projects. Developers and experts have for many decades tried to estimate the time requirements for issues as accurately as possible. The methods used today are often time-consuming and complex. This thesis investigates whether the time estimation process can be done with natural language processing and machine learning. Most software issues have a free-text description of what is wrong or needs to be added. Three different word embeddings were used to represent this free-text description: bag-of-words with tf-idf weighting, word2Vec and fastText. The word embeddings were then fed into two types of machine learning approaches, classification and regression. The classification was binary and can be formulated as: will the issue take more than three hours? The goal of the regression problem was to predict an actual value for the time the issue would take to complete. The classification models' performance was measured with an F1-score, and the regression model was measured with an R2-score. The best F1-score for classification was 0.748, achieved with the word2Vec word embedding and an SVM classifier. The best score for the regression analysis was achieved with the bag-of-words embedding, which reached an R2-score of 0.380. Further evaluation of the results, and a comparison to actual estimates made by the company, show that humans perform only slightly better than the models on the binary classification defined above. The F1-score of the employees was 0.792, a difference of just 0.044 from the best F1-score achieved by the models. This thesis concludes that the models are not good enough to use in a professional setting. An F1-score of 0.748 could be useful in other settings, but the classification question in this problem is too broad to be used for a real project. The results for the regression are also too low to be of any practical use.
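A small sketch of one of the reported pipelines, word2Vec document vectors feeding an SVM classifier, built on invented issue descriptions; averaging word vectors into a document vector is one common choice and is an assumption here.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import SVC

# Hypothetical issue descriptions with a binary label: > 3 hours (1) or not (0).
docs = [("fix crash when saving large files", 1),
        ("update copyright year in footer", 0),
        ("rewrite authentication middleware", 1),
        ("bump dependency version", 0)]
tokenised = [d.split() for d, _ in docs]
y = [lbl for _, lbl in docs]

w2v = Word2Vec(tokenised, vector_size=50, min_count=1, seed=0)

def embed(tokens):
    """Average the word vectors of a description into one document vector."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.array([embed(t) for t in tokenised])
clf = SVC().fit(X, y)
print(clf.predict(X[:1]))
```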
APA, Harvard, Vancouver, ISO, and other styles
25

Alkathiri, Abdul Aziz. "Decentralized Large-Scale Natural Language Processing Using Gossip Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-281277.

Full text
Abstract:
The field of Natural Language Processing in machine learning has seen rising popularity and use in recent years. The nature of Natural Language Processing, which deals with natural human language and computers, has led to the research and development of many algorithms that produce word embeddings. One of the most widely used of these algorithms is Word2Vec. With the abundance of data generated by users and organizations and the complexity of machine learning and deep learning models, performing training using a single machine becomes unfeasible. The advancement in distributed machine learning offers a solution to this problem. Unfortunately, for reasons of data privacy and regulation, in some real-life scenarios the data must not leave its local machine. This limitation has led to the development of techniques and protocols that are massively parallel and data-private. The most popular of these protocols is federated learning. However, due to its centralized nature, it still poses some security and robustness risks. Consequently, this led to the development of massively parallel, data-private, decentralized approaches, such as gossip learning. In the gossip learning protocol, every once in a while each node in the network randomly chooses a peer for information exchange, which eliminates the need for a central node. This research intends to test the viability of gossip learning for large-scale, real-world applications. In particular, it focuses on the implementation and evaluation of a Natural Language Processing application using gossip learning. The results show that the application of Word2Vec in a gossip learning framework is viable and yields results comparable to its non-distributed, centralized counterpart for various scenarios, with an average loss in quality of 6.904%.
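The core of the gossip protocol can be sketched in a few lines: each node takes a local learning step and then averages its model with a randomly chosen peer, with no central coordinator. The linear model and toy target below are assumptions for illustration.

```python
import random
import numpy as np

random.seed(0)
np.random.seed(0)

class Node:
    """A peer holding a local model (here just a weight vector) and local data."""
    def __init__(self, dim):
        self.weights = np.zeros(dim)

    def local_update(self, x, y, lr=0.1):
        # One SGD step on a local example (linear model, squared loss).
        grad = (self.weights @ x - y) * x
        self.weights -= lr * grad

    def merge(self, other):
        # Core of the gossip protocol: average models with a random peer.
        avg = (self.weights + other.weights) / 2
        self.weights = other.weights = avg

nodes = [Node(dim=3) for _ in range(10)]
for round_ in range(100):
    for node in nodes:
        x = np.random.randn(3)
        node.local_update(x, y=x.sum())      # toy target: sum of the features
        node.merge(random.choice(nodes))     # occasionally picks itself; harmless
```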
APA, Harvard, Vancouver, ISO, and other styles
26

Smith, Sydney. "Approaches to Natural Language Processing." Scholarship @ Claremont, 2018. http://scholarship.claremont.edu/cmc_theses/1817.

Full text
Abstract:
This paper explores topic modeling through the example text of Alice in Wonderland. It explores both singular value decomposition and non-negative matrix factorization as methods for feature extraction. The paper goes on to explore methods for partially supervised implementation of topic modeling through introducing themes. A large portion of the paper also focuses on the implementation of these techniques in Python, as well as visualizations of the results, which use a combination of Python, HTML and JavaScript along with the D3 framework. The paper concludes by presenting a mixture of SVD, NMF and partially supervised NMF as a possible way to improve topic modeling.
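A minimal NMF topic-modeling sketch in the spirit of the paper, using scikit-learn on a few toy passages standing in for chunks of Alice in Wonderland.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

chunks = [
    "alice fell down the rabbit hole after the white rabbit",
    "the queen of hearts shouted off with her head",
    "the mad hatter poured tea at the tea party",
    "alice talked with the cheshire cat about which way to go",
]
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(chunks)

# Factorise the document-term matrix into document-topic (W) and
# topic-term (H = components_) weights.
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)
terms = tfidf.get_feature_names_out()
for k, row in enumerate(nmf.components_):
    top = [terms[i] for i in row.argsort()[-4:][::-1]]
    print(f"topic {k}:", top)
```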
APA, Harvard, Vancouver, ISO, and other styles
27

Belcin, Andrei. "Smart Cube Predictions for Online Analytic Query Processing in Data Warehouses." Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/41956.

Full text
Abstract:
A data warehouse (DW) is a transformation of many sources of transactional data integrated into a single collection that is non-volatile and time-variant and that can provide decision support to managerial roles within an organization. For this application, the database server needs to process multiple users’ queries by joining various datasets and loading the result in main memory to begin calculations. In current systems, this process is reactionary to users’ input and can be undesirably slow. Previous studies have shown that personalizing to a single user’s query patterns and loading the resulting smaller subset into main memory significantly shortened the query response time. The LPCDA framework developed in this research handles multiple users’ query demands, where the query patterns are subject to change (so-called concept drift) and noise. To this end, the LPCDA framework detects changes in user behaviour and dynamically adapts the personalized smart cube definition for the group of users. Numerous data marts (DMs), as components of the DW, are subject to intense aggregations to assist analytics at the request of automated systems and human users’ queries. Subsequently, there is a growing need to properly manage the supply of data into the main memory closest to the CPU that computes the query, in order to reduce the response time from the moment a query arrives at the DW server. As a result, this thesis proposes an end-to-end adaptive learning ensemble for resource allocation of cuboids within a DM to achieve a relevant and timely constructed smart cube before it is needed, adopting the just-in-time inventory management strategy applied in other real-world scenarios. The algorithms comprising the ensemble involve predictive methodologies from Bayesian statistics, data mining, and machine learning that reflect changes in the data-generating process, using a number of change detection algorithms. Therefore, given different operational constraints and data-specific considerations, the ensemble can, to an effective degree, determine the cuboids in the lattice of a DM to pre-construct into a smart cube ahead of users submitting their queries, thereby benefiting from a quicker response than static schema views or no action at all.
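As a loose illustration of the just-in-time idea (not the thesis's actual ensemble), the sketch below scores cuboids by an exponentially weighted count of recent demand and greedily materializes the highest-scoring ones under a memory budget; all names and numbers are invented.

```python
from collections import defaultdict

decay = 0.9
scores = defaultdict(float)   # stand-in for the ensemble's demand predictions

def observe_query(cuboids_used):
    """Age old demand and credit the cuboids the latest query needed."""
    for c in scores:
        scores[c] *= decay
    for c in cuboids_used:
        scores[c] += 1.0

def plan_smart_cube(sizes, budget):
    """Greedily pick the highest-scoring cuboids that fit the memory budget."""
    plan, used = [], 0
    for c in sorted(scores, key=scores.get, reverse=True):
        if used + sizes[c] <= budget:
            plan.append(c)
            used += sizes[c]
    return plan

observe_query(["region_month", "product_day"])
observe_query(["region_month"])
print(plan_smart_cube({"region_month": 40, "product_day": 80}, budget=100))
```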
APA, Harvard, Vancouver, ISO, and other styles
28

Gallego, Jutglà Esteve. "New signal processing and machine learning methods for EEG data analysis of patients with Alzheimer's disease." Doctoral thesis, Universitat de Vic - Universitat Central de Catalunya, 2015. http://hdl.handle.net/10803/290853.

Full text
Abstract:
Neurodegenerative diseases are a group of disorders that affect the brain. These diseases involve changes in the brain that lead to progressive loss of brain structure or function, including the death of neurons. Alzheimer's disease (AD) is one of the most well-known neurodegenerative diseases. Nowadays there is no cure for this disease. However, some medications may delay the symptoms if they are used during the first stages of the disease; otherwise they have no effect. Early diagnosis is therefore a key factor. This PhD thesis addresses different aspects of neuroscience in order to develop new methods for the early diagnosis of AD. Different aspects have been investigated, such as signal preprocessing, feature extraction, feature selection and classification.
APA, Harvard, Vancouver, ISO, and other styles
29

Zhou, Li. "Parallel Processing Systems for Data and Computation Efficiency with Applications to Graph Computing and Machine Learning." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu156349344248694.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Salvi, Giampiero. "Mining Speech Sounds : Machine Learning Methods for Automatic Speech Recognition and Analysis." Doctoral thesis, Stockholm : KTH School of Computer Science and Comunication, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4111.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Gentek, Anna. "Activity Recognition Using Supervised Machine Learning and GPS Sensors." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-295600.

Full text
Abstract:
Human Activity Recognition has become a popular research topic among data scientists. Over the years, multiple studies regarding humans and their daily motion habits have been conducted for many different purposes. This is not surprising given all the opportunities and applications that can be built on the results of these algorithms. In this project we implement a system that can effectively collect sensor data from mobile devices, process it and, by using supervised machine learning, successfully predict the class of a performed activity. The project was executed based on datasets and features extracted from GPS sensors. The system was trained using various machine learning algorithms through Python's scikit-learn to find the best-suited method and guarantee accurate predictions. Finally, we applied a majority-vote rule to secure the best possible accuracy of the activity classification process. As a result we were able to identify various activities, including walking, cycling and driving, as well as the public transport modes bus and metro, with over 90% accuracy.
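The majority-vote step is simple to sketch: per-window classifier outputs over a trip segment are collapsed into a single label. The labels below are hypothetical outputs.

```python
from collections import Counter

def majority_vote(window_predictions):
    """Collapse per-window classifier outputs into one label per trip segment."""
    return Counter(window_predictions).most_common(1)[0][0]

# Hypothetical per-window outputs from a trained classifier over one segment.
print(majority_vote(["bus", "drive", "bus", "bus", "metro"]))  # -> "bus"
```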
APA, Harvard, Vancouver, ISO, and other styles
32

Holmes, Michael P. "Multi-tree Monte Carlo methods for fast, scalable machine learning." Diss., Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/33865.

Full text
Abstract:
As modern applications of machine learning and data mining are forced to deal with ever more massive quantities of data, practitioners quickly run into difficulty with the scalability of even the most basic and fundamental methods. We propose to provide scalability through a marriage between classical, empirical-style Monte Carlo approximation and deterministic multi-tree techniques. This union entails a critical compromise: losing determinism in order to gain speed. In the face of large-scale data, such a compromise is arguably often not only the right but the only choice. We refer to this new approximation methodology as Multi-Tree Monte Carlo. In particular, we have developed the following fast approximation methods: 1. Fast training for kernel conditional density estimation, showing speedups as high as 10⁵ on up to 1 million points. 2. Fast training for general kernel estimators (kernel density estimation, kernel regression, etc.), showing speedups as high as 10⁶ on tens of millions of points. 3. Fast singular value decomposition, showing speedups as high as 10⁵ on matrices containing billions of entries. The level of acceleration we have shown represents improvement over the prior state of the art by several orders of magnitude. Such improvement entails a qualitative shift, a commoditization, that opens doors to new applications and methods that were previously invisible, outside the realm of practicality. Further, we show how these particular approximation methods can be unified in a Multi-Tree Monte Carlo meta-algorithm which lends itself as scaffolding to the further development of new fast approximation methods. Thus, our contribution includes not just the particular algorithms we have derived but also the Multi-Tree Monte Carlo methodological framework, which we hope will lead to many more fast algorithms that can provide the kind of scalability we have shown here to other important methods from machine learning and related fields.
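The multi-tree machinery is beyond a short sketch, but the empirical Monte Carlo half of the marriage can be illustrated directly: approximate a kernel density sum over a large dataset by averaging the kernel over a small random sample instead.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(1_000_000)    # large dataset
query, h = 0.5, 0.2                      # query point and kernel bandwidth

def gaussian_kernel(u):
    return np.exp(-0.5 * u * u) / np.sqrt(2 * np.pi)

# Exact kernel density estimate at the query: a sum over all N points.
exact = gaussian_kernel((query - data) / h).mean() / h

# Monte Carlo estimate: average the kernel over a small random sample.
sample = rng.choice(data, size=2_000, replace=False)
approx = gaussian_kernel((query - sample) / h).mean() / h
print(exact, approx)   # close, at a fraction of the cost
```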
APA, Harvard, Vancouver, ISO, and other styles
33

Cavina, Eugenio. "GEAR: una piattaforma Big Data per l'elaborazione di stream di dati attraverso Machine Learning e Business Rules." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/20416/.

Full text
Abstract:
In the digital landscape of recent years, the use of electronic payments has grown continuously. The consequent adoption of regulations for consumer protection also implies the use of Online Fraud Detection (OFD) systems capable of handling large amounts of data. This thesis starts from (typically proprietary) OFD systems to create a solution that is more general than that specific domain and built on open-source technologies. OFD systems in fact exhibit characteristics that can be generalized and reused in application domains other than electronic payments, and these were extracted to create a more general solution. The resulting system is based on a set of abstract, conceptual steps to be executed in real time on a data stream. The thesis thus defines the GEAR model (Gather, Enrich, Assess, React), the architecture of the data transformation pipeline and, finally, a technology stack of Big Data tools. A possible implementation of the pipeline components, decoupled and reusable, is also illustrated, together with performance evaluations on a use case. The platform implements the model through the components' transparent use of machine learning algorithms and business rules (via a distributed Business Rules Management System), effectively enabling their use for stream processing.
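A minimal sketch of the Gather-Enrich-Assess-React flow over a stream of payment events; every function body and field name below is an assumption, since the thesis's components run on a full Big Data stack rather than plain Python.

```python
def gather(event):                 # normalise the raw incoming record
    return {"amount": float(event["amount"]), "country": event["country"]}

def enrich(tx):                    # join in context, e.g. per-country statistics
    tx["avg_amount"] = {"SE": 35.0, "BR": 20.0}.get(tx["country"], 25.0)
    return tx

def assess(tx):                    # ML score and/or business-rule verdict
    rule_hit = tx["amount"] > 10 * tx["avg_amount"]      # a business rule
    return "fraud" if rule_hit else "ok"

def react(tx, verdict):            # act on the assessment in real time
    if verdict == "fraud":
        print("blocking transaction:", tx)

for event in [{"amount": "500", "country": "SE"}, {"amount": "12", "country": "BR"}]:
    tx = enrich(gather(event))
    react(tx, assess(tx))
```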
APA, Harvard, Vancouver, ISO, and other styles
34

Lattouf, Mouzeina. "Assessment of Predictive Models for Improving Default Settings in Streaming Services." Thesis, KTH, Skolan för kemi, bioteknologi och hälsa (CBH), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-284482.

Full text
Abstract:
Streaming services provide different settings where customers can choose sound and video quality based on personal preference. The majority of users never make an active choice; instead, they get a default quality setting which is chosen automatically for them based on some parameters, like internet connection quality. This thesis explores personalising the default audio setting, intending to improve the user experience. It achieves this by leveraging machine learning trained on the fraction of users that have made active choices in changing the quality setting. The assumption that similarity among users who make an active choice can be leveraged to impact the user experience was the idea behind this thesis work. The work set out to study which type of data, from the categories demographic, product and consumption, is most predictive of a user's taste in sound quality. A case study was conducted to achieve the goals of this thesis. Five predictive model prototypes were trained, evaluated, compared and analysed using two different algorithms, XGBoost and Logistic Regression, and targeting two regions, Sweden and Brazil. Feature importance analysis was conducted using SHapley Additive exPlanations (SHAP), a unified framework for interpreting predictions with a game-theoretic approach, and by measuring coefficient weights, in order to determine the most predictive features. Besides exploring feature impact, the thesis also answers how reasonable it is to generalise these models to non-selecting users, by performing hypothesis testing. The project also covered bias analysis between users with and without active quality settings and how that affects the models. The models with XGBoost had higher performance. The results showed that demographic and product data had a higher impact on model predictions in both regions; however, the most predictive features differed across regions, and differences in feature importance were also observed between platforms. The results of hypothesis testing did not give a valid reason to consider the models to work for non-selecting users; the method is, however, negatively affected by other factors, such as small changes in big datasets that impact statistical significance. Data bias was found in some features, indicating a correlation but not the causation behind the patterns. The results of this thesis additionally show how machine learning can improve the user experience with regard to default sound quality settings, by leveraging similarity models over users who have changed the sound quality to what suits them best.
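A small sketch of the SHAP-based feature importance analysis with XGBoost, on synthetic user features; the feature names and target below are invented.

```python
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
# Hypothetical user features: [age, device_tier, weekly_hours], scaled to [0, 1].
X = rng.random((200, 3))
y = (X[:, 1] + 0.3 * rng.random(200) > 0.8).astype(int)   # chose high quality?

model = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

# TreeExplainer yields per-feature Shapley values for every prediction;
# averaging their magnitudes gives a global importance ranking.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(np.abs(shap_values).mean(axis=0))
```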
APA, Harvard, Vancouver, ISO, and other styles
35

The, Matthew. "Statistical and machine learning methods to analyze large-scale mass spectrometry data." Licentiate thesis, KTH, Genteknologi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-185149.

Full text
Abstract:
As in many other fields, biology is faced with enormous amounts of data that contain valuable information yet to be extracted. The field of proteomics, the study of proteins, has the luxury of having large repositories containing data from tandem mass-spectrometry experiments, readily accessible for everyone who is interested. At the same time, there is still a lot to discover about proteins as the main actors in cell processes and cell signaling. In this thesis, we explore several methods to extract more information from the available data using methods from statistics and machine learning. In particular, we introduce MaRaCluster, a new method for clustering mass spectra on large-scale datasets. This method uses statistical methods to assess similarity between mass spectra, followed by the conservative complete-linkage clustering algorithm. The combination of these two resulted in up to 40% more peptide identifications on its consensus spectra compared to the state-of-the-art method. Second, we attempt to clarify and promote protein-level false discovery rates (FDRs). Frequently, studies fail to report protein-level FDRs even though the proteins are actually the entities of interest. We provide a framework in which to discuss protein-level FDRs in a systematic manner, to open up the discussion and take away potential hesitance. We also benchmarked some scalable protein inference methods and included the best one in the Percolator package. Furthermore, we added functionality to the Percolator package to accommodate the analysis of studies in which many runs are aggregated. This reduced the run time for a recent study regarding a draft human proteome from almost a full day to just 10 minutes on a commodity computer, resulting in a list of proteins together with their corresponding protein-level FDRs.
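Complete-linkage clustering itself is readily sketched with SciPy: clusters merge by their farthest members, so a cluster only forms when every pair inside it is similar, which is what makes the linkage conservative. The pairwise distances below are random stand-ins for spectrum dissimilarities.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
# Hypothetical pairwise distances between 6 spectra (e.g. 1 - similarity score).
d = rng.random((6, 6))
d = (d + d.T) / 2
np.fill_diagonal(d, 0)

Z = linkage(squareform(d), method="complete")      # complete (farthest) linkage
print(fcluster(Z, t=0.6, criterion="distance"))    # cluster label per spectrum
```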
APA, Harvard, Vancouver, ISO, and other styles
36

Hu, Ji. "A virtual machine architecture for IT-security laboratories." Phd thesis, [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=980935652.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Giovanelli, Joseph. "AutoML: A new methodology to automate data pre-processing pipelines." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/20422/.

Full text
Abstract:
It is well known that we are living in the Big Data Era. Indeed, the exponential growth of Internet of Things, Web of Things and Pervasive Computing systems has greatly increased the amount of stored data. Thanks to the availability of data, the figure of the Data Scientist has become one of the most sought-after, because he is capable of transforming data, performing analysis on it, and applying Machine Learning techniques to improve the business decisions of companies. Yet, Data Scientists do not scale. It is almost impossible to balance their number against the effort required to analyze the ever-growing sizes of available data. Furthermore, today more and more non-experts use Machine Learning tools to perform data analysis without having the required knowledge. To this end, tools that help them throughout the Machine Learning process have been developed and are typically referred to as AutoML tools. However, even with such tools, raw data (i.e., data that has not been pre-processed) is rarely ready to be consumed, and generally performs poorly when consumed in raw form. A pre-processing phase (i.e., the application of a set of transformations), which improves the quality of the data and makes it suitable for algorithms, is usually required. Most AutoML tools do not consider this preliminary part, even though it has been shown to improve the final performance. Moreover, the few works that actually support pre-processing only apply a fixed series of transformations, decided a priori, without considering the nature of the data, the algorithm used, or the fact that the order of the transformations can affect the final result. In this thesis we propose a new methodology that provides a series of pre-processing transformations tailored to the specific case at hand. Our approach analyzes the nature of the data, the algorithm we intend to use, and the impact that the order of transformations could have.
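A brute-force sketch of the underlying idea, evaluating every ordering of a few pre-processing steps for a given dataset and algorithm; the real methodology is smarter than exhaustive search, and the step set here is an arbitrary choice.

```python
from itertools import permutations
from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
steps = {"scale": StandardScaler(),
         "select": SelectKBest(k=5),
         "pca": PCA(n_components=5)}

# Score each ordering of the pre-processing steps for this data + algorithm.
for order in permutations(steps):
    pipe = Pipeline([(name, steps[name]) for name in order]
                    + [("clf", LogisticRegression(max_iter=2000))])
    score = cross_val_score(pipe, X, y, cv=3).mean()
    print(" -> ".join(order), round(score, 3))
```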
APA, Harvard, Vancouver, ISO, and other styles
38

Dail, Mathias. "Clustering unstructured life sciences experiments with unsupervised machine learning : Natural language processing for unstructured life sciences texts." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-265549.

Full text
Abstract:
The purpose of this master’s thesis is to analyse different types of document representations in the context of improving, in an unsupervised manner, the searchability of unstructured textual life sciences experiments by clustering similar experiments together. The challenge is to produce, analyse and compare different representations of the life sciences data by using traditional and advanced unsupervised machine learning models. The text data analysed in this work is noisy and very heterogeneous, as it comes from a real-world Electronic Lab Notebook. Clustering unstructured and unlabeled text experiments is challenging. It requires the creation of representations based only on the relevant information existing in an experiment. This work studies statistical and generative techniques, word embeddings and some of the most recent deep learning models in Natural Language Processing to create the various representations of the studied data. It explores the possibility of combining multiple techniques and using external life-sciences knowledge bases to create richer representations before applying clustering algorithms. Different types of analysis are performed, including an assessment by experts, to evaluate and compare the scientific relevance of the clusters of experiments created by the different data representations. The results show that traditional statistical techniques can still produce good baselines. Modern deep learning techniques have been shown to model the studied data well and create rich representations. Combining multiple techniques with external knowledge (biomedical and life-science-related ontologies) has been shown to produce the best results in grouping similar relevant experiments together. The techniques studied model different and complementary aspects of a text; combining them is therefore a key to significantly improving the clustering of unstructured data.
APA, Harvard, Vancouver, ISO, and other styles
39

Qi, Muxi. "A Comprehensive Comparative Performance Evaluation of Signal Processing Features in Detecting Alcohol Consumption from Gait Data." Digital WPI, 2016. https://digitalcommons.wpi.edu/etd-theses/275.

Full text
Abstract:
Excessive alcohol consumption is the third leading lifestyle-related cause of death in the United States. Alcohol intoxication has a significant effect on how the human body operates, and is especially harmful to the human brain and heart. To help individuals monitor their alcohol intoxication, several methods have been proposed to detect alcohol consumption levels, including direct Blood Alcohol Concentration (BAC) measurement by breathalyzers and various wearable sensor devices. More recently, Arnold et al proposed a machine-learning-based method of passively inferring intoxication levels from gait data by classifying smartphone accelerometer readings. Their work utilized 11 smartphone accelerometer features in the time and frequency domains, achieving a classification accuracy of 57%. This thesis extends the work of Arnold et al by extracting and comparing the efficacy of a more comprehensive list of 27 signal processing features in the time, frequency, wavelet, statistical and information theory domains, evaluating how much using them improves the accuracy of supervised BAC classification of accelerometer gait data. Correlation-based Feature Selection (CFS) is used to identify and rank the features most correlated with alcohol-induced gait changes. 22 of the 27 features investigated showed statistically significant correlations with BAC levels. The most correlated features were then used to classify labeled samples of intoxicated gait data in order to test their detection accuracy. Statistical features had the best classification accuracy at 83.89%, followed by time domain and frequency domain features with accuracies of 83.22% and 82.21%, respectively. Classification using all 22 statistically significant signal processing features yielded an accuracy of 84.9% for the Random Forest classifier.
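A sketch of what extracting one feature per domain from a gait window might look like; the five features below are an illustrative subset, not the thesis's exact 27.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def gait_features(window, fs=50):
    """A few candidate features, roughly one per domain (illustrative subset)."""
    mag = np.linalg.norm(window, axis=1)              # accelerometer magnitude
    spectrum = np.abs(np.fft.rfft(mag - mag.mean()))
    freqs = np.fft.rfftfreq(len(mag), d=1 / fs)
    hist, _ = np.histogram(mag, bins=16, density=True)
    p = hist[hist > 0] / hist[hist > 0].sum()
    return {
        "rms": float(np.sqrt(np.mean(mag ** 2))),          # time domain
        "dominant_freq": float(freqs[spectrum.argmax()]),  # frequency domain
        "skewness": float(skew(mag)),                      # statistical
        "kurtosis": float(kurtosis(mag)),                  # statistical
        "entropy": float(-(p * np.log2(p)).sum()),         # information theory
    }

window = np.random.default_rng(0).standard_normal((200, 3))  # 4 s at 50 Hz, 3 axes
print(gait_features(window))
```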
APA, Harvard, Vancouver, ISO, and other styles
40

van, Schaik Sebastiaan Johannes. "A framework for processing correlated probabilistic data." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:91aa418d-536e-472d-9089-39bef5f62e62.

Full text
Abstract:
The amount of digitally-born data has surged in recent years. In many scenarios, this data is inherently uncertain (or: probabilistic), such as data originating from sensor networks, image and voice recognition, location detection, and automated web data extraction. Probabilistic data requires novel and different approaches to data mining and analysis, which explicitly account for the uncertainty and the correlations therein. This thesis introduces ENFrame: a framework for processing and mining correlated probabilistic data. Using this framework, it is possible to express both traditional and novel algorithms for data analysis in a special user language, without having to explicitly address the uncertainty of the data on which the algorithms operate. The framework will subsequently execute the algorithm on the probabilistic input, and perform exact or approximate parallel probability computation. During the probability computation, correlations and provenance are succinctly encoded using probabilistic events. This thesis contains novel contributions in several directions. An expressive user language – a subset of Python – is introduced, which allows a programmer to implement algorithms for probabilistic data without requiring knowledge of the underlying probabilistic model. Furthermore, an event language is presented, which is used for the probabilistic interpretation of the user program. The event language can succinctly encode arbitrary correlations using events, which are the probabilistic counterparts of deterministic user program variables. These highly interconnected events are stored in an event network, a probabilistic interpretation of the original user program. Multiple techniques for exact and approximate probability computation (with error guarantees) of such event networks are presented, as well as techniques for parallel computation. Adaptations of multiple existing data mining algorithms are shown to work in the framework, and are subsequently subjected to an extensive experimental evaluation. Additionally, a use-case is presented in which a probabilistic adaptation of a clustering algorithm is used to predict faults in energy distribution networks. Lastly, this thesis presents techniques for integrating a number of different probabilistic data formalisms for use in this framework and in other applications.
APA, Harvard, Vancouver, ISO, and other styles
41

Bergh, Adrienne. "A Machine Learning Approach to Predicting Alcohol Consumption in Adolescents From Historical Text Messaging Data." Chapman University Digital Commons, 2019. https://digitalcommons.chapman.edu/cads_theses/2.

Full text
Abstract:
Techniques based on artificial neural networks represent the current state-of-the-art in machine learning due to the availability of improved hardware and large data sets. Here we employ doc2vec, an unsupervised neural network, to capture the semantic content of text messages sent by adolescents during high school, and encode this semantic content as numeric vectors. These vectors effectively condense the text message data into highly leverageable inputs to a logistic regression classifier in a matter of hours, as compared to the tedious and often quite lengthy task of manually coding data. Using our machine learning approach, we are able to train a logistic regression model to predict adolescents' engagement in substance abuse during distinct life phases with accuracy ranging from 76.5% to 88.1%. We show the effects of grade level and text message aggregation strategy on the efficacy of document embedding generation with doc2vec. Additional examination of the vectorizations for specific terms extracted from the text message data adds quantitative depth to this analysis. We demonstrate the ability of the method used herein to overcome traditional natural language processing concerns related to unconventional orthography. These results suggest that the approach described in this thesis is a competitive and efficient alternative to existing methodologies for predicting substance abuse behaviors. This work reveals the potential for the application of machine learning-based manipulation of text messaging data to development of automatic intervention strategies against substance abuse and other adolescent challenges.
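A minimal sketch of the doc2vec-then-logistic-regression pipeline using gensim, with invented message snippets and labels; note that doc2vec treats unconventional spellings such as "nite" as ordinary tokens, which is part of what lets the method cope with unconventional orthography.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Hypothetical per-student message aggregates with substance-use labels.
texts  = ["lets meet after practice", "party tonight bring drinks",
          "studying for the chem test", "got so wasted last nite lol"]
labels = [0, 1, 0, 1]

corpus = [TaggedDocument(t.split(), [i]) for i, t in enumerate(texts)]
model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=50, seed=0)

# The learned document vectors become features for a downstream classifier.
X = [model.dv[i] for i in range(len(texts))]
clf = LogisticRegression().fit(X, labels)
print(clf.predict([model.infer_vector("free drinks at the party".split())]))
```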
APA, Harvard, Vancouver, ISO, and other styles
42

Tempfli, Peter. "Preprocessing method comparison and model tuning for natural language data." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-34438.

Full text
Abstract:
Twitter and other microblogging services are a valuable source for almost real-time marketing, public opinion and brand-related consumer information mining. As such, the collection and analysis of user-generated natural language content is the focus of research on automated sentiment analysis. The most successful approach in the field is supervised machine learning, where the three key problems are data cleaning and transformation, feature generation, and model choice and training parameter selection. Papers in recent years have thoroughly examined the field, and there is agreement that relatively simple techniques, such as a bag-of-words transformation of the text and a naive Bayes model, can generate acceptable results (F1-scores between 75% and 85% for an average dataset), while fine-tuning can be difficult and yields relatively small gains. However, a few percent in performance, even on a middle-sized dataset, can mean thousands of documents classified differently, which can mean thousands of missed sales or angry customers in any business domain. This work therefore presents and demonstrates a framework for better tailored, fine-tuned models for analysing Twitter data. The experiments show that Naive Bayes classifiers with domain-specific stopword selection work best (up to an 88% F1-score); however, the performance decreases dramatically if the data is unbalanced or the classes are not binary. Filtering stopwords is crucial to increasing prediction performance, and the experiments show that a stopword set should be domain-specific. The conclusion is that there is no single best way to train models and select stopwords in sentiment analysis. The work thus suggests that there is room for a comparison framework to fine-tune prediction models to a given problem: such a framework should compare different training settings on the same dataset, so the best trained models can be found for a given real-life problem.
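A small sketch of the best-performing configuration, naive Bayes with domain-specific stopwords layered on top of a generic English list; the domain words and tweets below are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical domain-specific stopwords for airline tweets: frequent terms
# that carry no sentiment in this domain, added on top of generic English ones.
domain_stopwords = ["flight", "airline", "gate", "http", "rt"]

pipe = make_pipeline(
    CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS) + domain_stopwords),
    MultinomialNB(),
)
tweets = ["the flight was great, crew friendly",
          "flight delayed again, awful airline",
          "smooth boarding at the gate, loved it",
          "lost my bag, terrible service"]
y = ["pos", "neg", "pos", "neg"]
pipe.fit(tweets, y)
print(pipe.predict(["great crew but the gate changed twice"]))
```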
APA, Harvard, Vancouver, ISO, and other styles
43

Yang, Baoyao. "Distribution alignment for unsupervised domain adaptation: cross-domain feature learning and synthesis." HKBU Institutional Repository, 2018. https://repository.hkbu.edu.hk/etd_oa/556.

Full text
Abstract:
In recent years, many machine learning algorithms have been developed and widely applied. However, most of them assume that the data distributions of the training and test datasets are similar. This thesis concerns the decrease in generalization ability on a test dataset when its data distribution differs from that of the training dataset. As labels may be unavailable in the test dataset in practical applications, we follow the effective approach of unsupervised domain adaptation and propose distribution alignment methods to improve, on the test dataset, the generalization ability of models learned from the training dataset. To solve the problem of joint distribution alignment without target labels, we propose a new criterion of domain-shared group sparsity that is an equivalent condition for equal conditional distributions. A domain-shared group-sparse dictionary learning model is built with the proposed criterion, and a cross-domain label propagation method is developed to learn a target-domain classifier using the domain-shared group-sparse representations and the target-specific information from the target data. Experimental results show that the proposed method achieves good performance on cross-domain face and object recognition. Moreover, most distribution alignment methods have not considered differences in distribution structure, which results in insufficient alignment across domains. Therefore, a novel graph alignment method is proposed, which aligns both data representations and distribution structural information across the source and target domains. An adversarial network is developed for graph alignment by mapping both source and target data to a feature space where the data are distributed with unified structure criteria. Promising results have been obtained in experiments on cross-dataset digit and object recognition. The problem of dataset bias also exists in human pose estimation across datasets with different image qualities. This thesis therefore proposes to synthesize target body parts for cross-domain distribution alignment, to address the problem of cross-quality pose estimation. A translative dictionary is learned to associate the source and target domains, and a cross-quality adaptation model is developed to refine the source pose estimator using the synthesized target body parts. We perform cross-quality experiments on three datasets with different image quality using two state-of-the-art pose estimators, and compare the proposed method with five unsupervised domain adaptation methods. Our experimental results show that the proposed method outperforms not only the source pose estimators, but also the other unsupervised domain adaptation methods.
APA, Harvard, Vancouver, ISO, and other styles
44

Sonnert, Adrian. "Predicting inter-frequency measurements in an LTE network using supervised machine learning : a comparative study of learning algorithms and data processing techniques." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-148553.

Full text
Abstract:
With increasing demands on network reliability and speed, network suppliers need to make their communication algorithms more efficient. Frequency measurements are a core part of mobile network communications; making them more effective would improve many network processes such as handovers, load balancing, and carrier aggregation. This study examines the possibility of using supervised learning to predict the signal of inter-frequency measurements by investigating various learning algorithms and pre-processing techniques. We found that random forests have the highest predictive performance on this data set, at 90.7% accuracy. In addition, we have shown that undersampling and varying the discriminator are effective techniques for increasing the performance on the positive class on frequencies where the negative class is prevalent. Finally, we present hybrid algorithms in which the learning algorithm for each model depends on attributes of the training data set. These algorithms are far more efficient in terms of memory and run time without heavily sacrificing predictive performance.
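The two highlighted techniques, undersampling the majority class and varying the discriminator (decision threshold), are easy to sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = (rng.random(1000) < 0.1).astype(int)      # imbalanced: roughly 10% positives

# Undersample the majority class to match the minority class size.
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[keep], y[keep])

# Varying the discriminator: lower the threshold below the default 0.5
# to favour the positive class further.
proba = clf.predict_proba(X)[:, 1]
preds = (proba > 0.3).astype(int)
print(preds.sum(), "predicted positives")
```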
APA, Harvard, Vancouver, ISO, and other styles
45

Alvarado, Mantecon Jesus Gerardo. "Towards the Automatic Classification of Student Answers to Open-ended Questions." Thesis, Université d'Ottawa / University of Ottawa, 2019. http://hdl.handle.net/10393/39093.

Full text
Abstract:
One of the main research challenges nowadays in the context of Massive Open Online Courses (MOOCs) is the automation of the evaluation process of text-based assessments effectively. Text-based assessments, such as essay writing, have been proved to be better indicators of higher level of understanding than machine-scored assessments (E.g. Multiple Choice Questions). Nonetheless, due to the rapid growth of MOOCs, text-based evaluation has become a difficult task for human markers, creating the need of automated systems for grading. In this thesis, we focus on the automated short answer grading task (ASAG), which automatically assesses natural language answers to open-ended questions into correct and incorrect classes. We propose an ensemble supervised machine learning approach that relies on two types of classifiers: a response-based classifier, which centers around feature extraction from available responses, and a reference-based classifier which considers the relationships between responses, model answers and questions. For each classifier, we explored a set of features based on words and entities. For the response-based classifier, we tested and compared 5 features: traditional n-gram models, entity URIs (Uniform Resource Identifier) and entity mentions both extracted using a semantic annotation API, entity mention embeddings based on GloVe and entity URI embeddings extracted from Wikipedia. For the reference-based classifier, we explored fourteen features: cosine similarity between sentence embeddings from student answers and model answers, number of overlapping elements (words, entity URI, entity mention) between student answers and model answers or question text, Jaccard similarity coefficient between student answers and model answers or question text (based on words, entity URI or entity mentions) and a sentence embedding representation. We evaluated our classifiers on three datasets, two of which belong to the SemEval ASAG competition (Dzikovska et al., 2013). Our results show that, in general, reference-based features perform much better than response-based features in terms of accuracy and macro average f1-score. Within the reference-based approach, we observe that the use of S6 embedding representation, which considers question text, student and model answer, generated the best performing models. Nonetheless, their combination with other similarity features helped build more accurate classifiers. As for response-based classifiers, models based on traditional n-gram features remained the best models. Finally, we combined our best reference-based and response-based classifiers using an ensemble learning model. Our ensemble classifiers combining both approaches achieved the best results for one of the evaluation datasets, but underperformed on the remaining two. We also compared the best two classifiers with some of the main state-of-the-art results on the SemEval competition. Our final embedded meta-classifier outperformed the top-ranking result on the SemEval Beetle dataset and our top classifier on SemEval SciEntBank, trained on reference-based features, obtained the 2nd position. In conclusion, the reference-based approach, powered mainly by sentence level embeddings and other similarity features, proved to generate the most efficient models in two out of three datasets and the ensemble model was the best on the SemEval Beetle dataset.
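Two of the reference-based features are easy to sketch; here TF-IDF cosine similarity stands in for the sentence embeddings the thesis uses, which is an assumption, and the answers and question are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(a, b):
    """Jaccard coefficient over word sets, one of the overlap features."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def reference_features(student, model_answer, question):
    """Similarity of the student answer to the model answer and to the question."""
    tfidf = TfidfVectorizer().fit([student, model_answer, question])
    M = tfidf.transform([student, model_answer, question])
    return {
        "cos_student_model": float(cosine_similarity(M[0], M[1])[0, 0]),
        "jaccard_student_question": jaccard(student, question),
    }

print(reference_features(
    student="the circuit is closed so current flows",
    model_answer="a closed circuit allows current to flow",
    question="why does the bulb light up"))
```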
APA, Harvard, Vancouver, ISO, and other styles
46

Snyders, Sean. "Inductive machine learning bias in knowledge-based neurocomputing." Thesis, Stellenbosch : Stellenbosch University, 2003. http://hdl.handle.net/10019.1/53463.

Full text
Abstract:
The integration of symbolic knowledge with artificial neural networks is becoming an increasingly popular paradigm for solving real-world problems. This paradigm, named knowledge-based neurocomputing, provides means for using prior knowledge to determine the network architecture, to program a subset of weights to induce a learning bias which guides network training, and to extract refined knowledge from trained neural networks. The role of neural networks then becomes that of knowledge refinement, providing a methodology for dealing with uncertainty in the initial domain theory. In this thesis, we address several advantages of this paradigm and propose a solution for the open question of determining the strength of this learning, or inductive, bias. We develop a heuristic for determining the strength of the inductive bias that takes the network architecture, the prior knowledge, the learning method, and the training data into consideration. We apply this heuristic to well-known synthetic problems as well as to published, difficult real-world problems in the domains of molecular biology and medical diagnosis. We found that networks trained with this adaptive inductive bias not only show superior performance over networks trained with the standard method of determining the bias strength, but also that the refined knowledge extracted from these trained networks delivers more concise and accurate domain theories.
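The central mechanism the abstract names, programming a subset of weights from prior knowledge so that it acts as an inductive bias of adjustable strength, can be sketched as follows. This is a minimal KBANN-style illustration: the rule encoding, the bias-strength parameter `omega`, and the layer shapes are all assumptions for exposition, not the thesis's actual heuristic, which additionally weighs the architecture, learning method, and training data.

```python
import numpy as np

def program_weights(n_inputs, n_hidden, rules, omega=4.0, noise=0.05, seed=0):
    """Initialize a hidden layer from symbolic rules.

    rules: list of (positive_indices, negated_indices) pairs; each rule
    becomes one hidden unit whose weights encode the rule, scaled by omega
    (the strength of the inductive bias). Remaining units keep small random
    weights so training can still refine the domain theory.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, noise, size=(n_hidden, n_inputs))  # unprogrammed weights
    b = np.zeros(n_hidden)
    for unit, (pos, neg) in enumerate(rules):
        W[unit, pos] = omega                   # evidence for each antecedent
        W[unit, neg] = -omega                  # evidence against negated ones
        b[unit] = -omega * (len(pos) - 0.5)    # unit fires only if the rule holds
    return W, b

# Example: the rule "x0 AND x2 AND NOT x4" programmed into a 6-unit layer.
W, b = program_weights(n_inputs=5, n_hidden=6, rules=[([0, 2], [4])])
```

A larger `omega` makes the prior knowledge harder for gradient descent to overturn; choosing it adaptively is exactly the open question the thesis addresses.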
APA, Harvard, Vancouver, ISO, and other styles
47

Qahwaji, Rami S. R., and Tufan Colak. "Automatic Short-Term Solar Flare Prediction Using Machine Learning and Sunspot Associations." Springer, 2007. http://hdl.handle.net/10454/4092.

Full text
Abstract:
In this paper, a machine-learning-based system that can provide automated short-term solar flare prediction is presented. The system accepts two sets of inputs: McIntosh classifications of sunspot groups and solar cycle data. To establish a correlation between solar flares and sunspot groups, the system explores the publicly available solar catalogues from the National Geophysical Data Center and associates sunspots with their corresponding flares based on their timing and NOAA numbers. The McIntosh classification of every relevant sunspot is extracted and converted to a numerical format suitable for machine learning algorithms. Using this system we aim to predict whether a certain sunspot class at a certain time is likely to produce a significant flare within six hours and, if so, whether that flare will be an X- or M-class flare. Machine learning algorithms such as Cascade-Correlation Neural Networks (CCNNs), Support Vector Machines (SVMs) and Radial Basis Function Networks (RBFNs) are optimised and then compared to determine the learning algorithm that provides the best prediction performance. It is concluded that SVMs provide the best performance for predicting whether a McIntosh-classified sunspot group is going to flare, but CCNNs are more capable of predicting the class of the flare to erupt. A hybrid system that combines an SVM and a CCNN is suggested for future use.
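The pipeline shape described here, converting a McIntosh classification to numbers and training an SVM on the result, might look like the sketch below. The ordinal letter encoding and the toy labels are assumptions for illustration only; the abstract does not specify the authors' exact encoding.

```python
# Minimal sketch: encode McIntosh sunspot classes numerically, train an SVM.
from sklearn.svm import SVC

# A McIntosh class is three letters: modified Zurich class, penumbra type,
# and sunspot distribution (e.g. "Dkc").
Z = {c: i for i, c in enumerate("ABCDEFH")}   # Zurich class
P = {c: i for i, c in enumerate("XRSAHK")}    # penumbra of largest spot
C = {c: i for i, c in enumerate("XOIC")}      # spot distribution

def encode(mcintosh: str) -> list:
    z, p, c = mcintosh.upper()
    return [Z[z], P[p], C[c]]

X = [encode(s) for s in ["DKC", "EKC", "BXO", "CRO", "FKC", "AXX"]]
y = [1, 1, 0, 0, 1, 0]  # 1 = produced a significant flare (toy labels)

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([encode("DKI")]))
```

In the paper's setting, the labels would come from the flare-sunspot associations mined from the NGDC catalogues rather than hand-set values.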
APA, Harvard, Vancouver, ISO, and other styles
48

Yang, Shaojie. "A Data Augmentation Methodology for Class-imbalanced Image Processing in Prognostic and Health Management." University of Cincinnati / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=ucin161375046654683.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Tesfaye, Amare Ketsela, and Amrit Pandey. "Empirical Evaluation of Machine Learning Algorithms based on EMG, ECG and GSR Data to Classify Emotional States." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-3673.

Full text
Abstract:
The peripheral psychophysiological signals (EMG, ECG and GSR) of 13 participants were recorded in the Cognition and Robotics lab at BTH University, and data from nine participants were retained for further processing. Thirty pictures from the IAPS were shown to each participant individually as stimuli, each displayed for five seconds. Signal preprocessing, feature extraction and selection, model building, dataset formation, and data analysis and interpretation were then performed. The correlation between a combination of EMG, ECG and GSR signals and emotional states was investigated, with a two-dimensional valence-arousal model used to represent emotional states. Finally, accuracy comparisons among selected machine learning classification algorithms were performed.

Context: Psychophysiological measurement is one of the recent and popular ways to identify emotions when using computers or robots. It can be done using peripheral signals: electromyography (EMG), electrocardiography (ECG) and galvanic skin response (GSR). These signals are considered reliable and can produce the required data, which is then preprocessed, reduced through feature selection, and classified. Classifying EMG, ECG and GSR data with appropriate machine learning algorithms yields better accuracy.

Objectives: In this study, we analyze psychophysiological (EMG, ECG and GSR) data to find the best-performing classification algorithm. Our main objective is to classify these data with appropriate machine learning techniques. Classification of psychophysiological data is useful in emotion recognition; therefore, our ultimate goal is to provide validated, classified psychophysiological measures for the automated adaptation of human-robot interaction.

Methods: We conducted a literature review in order to answer RQ1, using Inspec/Compendex, IEEE, the ACM Digital Library, Google Scholar and SpringerLink as sources. This helped us identify, from peer-reviewed papers relevant to the area, suitable features for classification, and likewise to select appropriate machine learning algorithms. We conducted an experiment in order to answer RQ2 and RQ3: a pilot experiment and then the main experiment were carried out in the Cognition and Robotics lab at the university, taking measures from the EMG, ECG and GSR signals.

Results: We obtained different accuracy results using different datasets. The best classification accuracy was achieved by the Support Vector Machine algorithm, which classified up to 59% of emotional states correctly.

Conclusions: Psychophysiological signals are very inconsistent across participants for a specific emotion; hence, the accuracy obtained in the experiment was higher for a single participant than for all participants together, although a large number of instances helps train the classifier well.

In summary, the thesis focuses on classifying emotional states from physiological signals. Features were extracted and selected from the physiological signals, used to form datasets, and the emotional states were then classified. IAPS pictures were used to elicit emotional/affective states, and the experiment was conducted with 13 participants in the Cognition and Robotics lab at BTH University using EMG, ECG and GSR biosensors; data from nine participants were taken for further preprocessing. We examined the classification of emotions from combinations of psychophysiological signals, organized as Model A and Model B. Since subjects' signals differ for the same emotional state, accuracy was better for a single participant than for all participants together. Classification of emotional states is useful in HCI and HRI for building emotionally intelligent robots, so it is essential to identify the best classifier algorithms for detecting emotions. Our contribution lies in identifying the best algorithms for emotion recognition on psychophysiological data with the selected features. Most results showed that SVM performed best, with classification accuracy up to 59% for a single participant and 48.05% for all participants together. For a single dataset and a single participant, MLP reached 60.17% accuracy, but it consumed more time and memory than the other algorithms during classification. The remaining algorithms (BNT, Naive Bayes, KNN and J48) also gave accuracy competitive with SVM. We conclude that SVM is capable of handling emotion recognition from a combination of EMG, ECG and GSR signals and gives better classification accuracy than the alternatives. Tallying IAPS pictures against SAM ratings helped remove weakly correlated signals and obtain better accuracies. Still, the obtained accuracies are modest; more participants are probably needed to achieve a better result over the whole dataset.
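As a rough illustration of the classification step, the sketch below extracts simple statistical features from EMG/ECG/GSR windows and cross-validates an SVM, the best-performing algorithm in the study. The specific features, the windowing, and the toy data are assumptions, not the thesis's selected feature set.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def window_features(emg, ecg, gsr):
    """Mean, std and range per channel for one five-second stimulus window."""
    feats = []
    for sig in (emg, ecg, gsr):
        feats += [np.mean(sig), np.std(sig), np.ptp(sig)]
    return feats

rng = np.random.default_rng(42)
# Toy data: 90 windows (nine participants x ten pictures), 3 channels each.
X = np.array([window_features(*rng.normal(size=(3, 500))) for _ in range(90)])
y = rng.integers(0, 4, size=90)  # valence-arousal quadrant labels (toy)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
print(cross_val_score(clf, X, y, cv=5).mean())
```

Standardizing the features before the SVM matters here because EMG, ECG and GSR channels live on very different scales.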
APA, Harvard, Vancouver, ISO, and other styles
50

Domínguez, Samamés Christian Andrés. "Machine Learning: Recomendaciones en base a los gustos y preferencias de los estudiantes de la UPC usuarios de Netflix." Bachelor's thesis, Universidad Peruana de Ciencias Aplicadas (UPC), 2020. http://hdl.handle.net/10757/655135.

Full text
Abstract:
In recent years, streaming platforms such as Netflix have rapidly become popular worldwide. This success is due to many factors; however, the key has been the platform's recommendation system. Users not only receive recommendations based on their tastes and preferences, but also a personalized Netflix home screen prepared especially for each of them. This system promises to be efficient not only for retaining current users, but also for attracting new potential users. In response, the entertainment industry has made streaming platforms its main ally: not only have original series premiered on these platforms, but so have films that, at the time, could not be released in cinemas due to distribution problems. While various studies on the effectiveness of Netflix's recommendation system have been carried out in different countries, in Latin American countries such as Peru it has not been verified whether this mechanism is as effective as claimed. For this reason, research on this topic is needed in order to understand the most effective way to connect with the Latin American public and embrace this important market.
APA, Harvard, Vancouver, ISO, and other styles