Dissertations / Theses on the topic 'Bayes Algorithm'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations / theses for your research on the topic 'Bayes Algorithm.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Bissmark, Johan, and Oscar Wärnling. "The Sparse Data Problem Within Classification Algorithms : The Effect of Sparse Data on the Naïve Bayes Algorithm." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-209227.
In today's society, machine-learning-based applications and software, together with predictions, are highly topical. Machine learning has given us the ability to predict likely outcomes based on previously collected data, thereby saving time and resources. A common problem in machine learning is sparse data, since it affects the performance of machine learning algorithms and their ability to compute accurate predictions. Data is considered sparse when some expected values in a dataset are missing, which is generally common in large-scale datasets. This report focuses mainly on the Naïve Bayes classification algorithm and how it is affected by sparse data compared to other frequently used classification algorithms. The extent of the performance drop caused by sparse data is studied and analyzed in order to measure how large an effect sparse data has on the ability to compute accurate predictions. In conclusion, the results of this report support the conclusion that the Naïve Bayes algorithm is less affected by sparse data than other common classification algorithms, a conclusion that is also supported by earlier research.
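As context for this entry, here is a minimal sketch (not from the thesis; the dataset and drop rates are arbitrary stand-ins) of the kind of experiment the abstract describes: zero out a growing fraction of feature values and compare how naive Bayes and a k-NN baseline degrade.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    rng = np.random.default_rng(0)
    for rate in (0.0, 0.3, 0.6, 0.9):            # fraction of entries dropped
        mask = rng.random(X_tr.shape) < rate     # simulate sparse (missing) values
        X_sparse = np.where(mask, 0, X_tr)
        for clf in (MultinomialNB(), KNeighborsClassifier()):
            acc = clf.fit(X_sparse, y_tr).score(X_te, y_te)
            print(f"{type(clf).__name__:22s} drop={rate:.1f} acc={acc:.3f}")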
Volfson, Alexander. "Exploring the optimal Transformation for Volatility." Digital WPI, 2010. https://digitalcommons.wpi.edu/etd-theses/472.
Nguyen, Huu Du. "System Reliability : Inference for Common Cause Failure Model in Contexts of Missing Information." Thesis, Lorient, 2019. http://www.theses.fr/2019LORIS530.
The effective operation of an entire industrial system is sometimes strongly dependent on the reliability of its components. A failure of one of these components can lead to the failure of the system, with consequences that can be catastrophic, especially in the nuclear or aeronautics industries. To reduce this risk of catastrophic failures, a redundancy policy, consisting in duplicating the sensitive components in the system, is often applied. When one of these components fails, another takes over and the normal operation of the system can be maintained. However, some situations lead to simultaneous failures of components in the system; these are called common cause failures (CCF). Analyzing, modeling and predicting this type of failure event is therefore an important issue and is the subject of the work presented in this thesis. We investigate several methods for the statistical analysis of CCF events. Different algorithms to estimate the parameters of the models and to make predictive inference based on various types of missing data are proposed. We treat confounded data using a BFR (Binomial Failure Rate) model. An EM algorithm is developed to obtain the maximum likelihood estimates (MLE) of the parameters of the model. We introduce the modified-Beta distribution to develop a Bayesian approach. The alpha-factor model is considered to analyze uncertainties in CCF. We suggest a new formalism to describe uncertainty and consider Dirichlet distributions (nested, grouped) to make a Bayesian analysis. Recording of CCF cause data leads to incomplete contingency tables; for a Bayesian analysis of this type of table, we propose an algorithm relying on the inverse Bayes formula (IBF) and the Metropolis-Hastings algorithm. We compare our results with those obtained with the alpha-decomposition method, a method recently proposed in the literature. Prediction of catastrophic events is addressed, and mapping strategies are described to suggest upper bounds of prediction intervals with the pivotal method and Bayesian techniques. Recent events have highlighted the importance of the reliability of redundant systems, and we hope that our work will contribute to a better understanding and prediction of the risks of major CCF events.
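The abstract mentions Metropolis-Hastings sampling for Bayesian inference. As a generic illustration only (a toy Beta-Binomial failure-probability model with invented counts, not the thesis's BFR or IBF machinery), a random-walk sampler looks like this:

    import numpy as np

    # Hypothetical toy data: k common-cause failures observed in n demands.
    k, n = 3, 120
    rng = np.random.default_rng(1)

    def log_post(p):                      # Beta(1,1) prior + binomial likelihood
        if not 0.0 < p < 1.0:
            return -np.inf
        return k * np.log(p) + (n - k) * np.log(1.0 - p)

    samples, p = [], 0.05                 # start from an arbitrary point
    for _ in range(20000):
        prop = p + rng.normal(scale=0.02) # random-walk proposal
        if np.log(rng.random()) < log_post(prop) - log_post(p):
            p = prop                      # accept the proposal
        samples.append(p)
    print("posterior mean of p:", np.mean(samples[5000:]))  # discard burn-in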
Agarwal, Akrita. "Exploring the Noise Resilience of Combined Sturges Algorithm." University of Cincinnati / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1447070335.
Schmidt, Samuel. "A Massively Parallel Algorithm for Cell Classification Using CUDA." University of Cincinnati / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1448873851.
Full textChao, Yang, and Peng Zhang. "One General Approach For Analysing Compositional Structure Of Terms In Biomedical Field." Thesis, Tekniska Högskolan, Högskolan i Jönköping, JTH. Forskningsmiljö Informationsteknik, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-20913.
Harrington, Edward. "Aspects of Online Learning." The Australian National University. Research School of Information Sciences and Engineering, 2004. http://thesis.anu.edu.au./public/adt-ANU20060328.160810.
Full textSandberg, Sebastian. "Identifying Hateful Text on Social Media with Machine Learning Classifiers and Normalization Methods - Using Support Vector Machines and Naive Bayes Algorithm." Thesis, Umeå universitet, Institutionen för datavetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-155353.
Ramos, Gustavo da Mota. "Seleção entre estratégias de geração automática de dados de teste por meio de métricas estáticas de softwares orientados a objetos." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/100/100131/tde-05122018-202315/.
Software products of varying complexity are created daily from the analysis of complex and varied demands under tight deadlines. High levels of quality are expected of them, but as products become more complex, the quality level may not be acceptable when the time available for testing does not keep up with complexity. Software testing and automatic test data generation therefore arise in order to deliver high-quality products through fast, low-cost test activities. In this context, however, software developers depend on automatic test generation strategies, and especially on the selection of the most adequate technique for obtaining the greatest possible code coverage; this is an important factor given that each test data generation technique has peculiarities and problems that make it better suited to certain types of software. From this scenario, the present work proposes selecting the appropriate technique for each software class based on its characteristics, expressed through object-oriented software metrics, using the naive Bayes classification algorithm. Initially, a literature review of the two generation algorithms, random search and genetic search, was carried out to understand their advantages and disadvantages in both implementation and execution. The CK metrics were also studied in order to understand how they can better describe the characteristics of a class. The acquired knowledge allowed collecting, for each class, the test generation data of each technique (code coverage and generation time) together with the CK metrics, allowing these data to be analyzed together and the classification algorithm finally executed. The results of this analysis demonstrated that a reduced, selected set of metrics is more efficient and better describes the characteristics of a class, besides demonstrating that the CK metrics have little or no influence on test data generation time and on the random search algorithm. However, the CK metrics showed a medium correlation and influence in the selection of the genetic algorithm, thus participating in its selection by the naive Bayes algorithm.
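A hedged sketch of the selection step the abstract describes, with invented CK metric values and labels (the thesis's actual data and model setup may differ): train naive Bayes on per-class metric vectors labeled with the technique that performed better.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    # columns: WMC, DIT, NOC, CBO, RFC, LCOM (CK metrics); values are invented
    X = np.array([[12, 2, 0,  5,  30,  8],
                  [45, 4, 3, 19,  88, 40],
                  [ 7, 1, 0,  3,  12,  2],
                  [60, 5, 6, 25, 120, 55]])
    y = ["random", "genetic", "random", "genetic"]   # better technique per class

    model = GaussianNB().fit(X, y)
    print(model.predict([[50, 4, 2, 20, 95, 42]]))   # -> likely "genetic"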
Lee, Jun won. "Relationships Among Learning Algorithms and Tasks." BYU ScholarsArchive, 2011. https://scholarsarchive.byu.edu/etd/2478.
Cheeti, Srilaxmi. "Cross-domain sentiment classification using grams derived from syntax trees and an adapted naive Bayes approach." Thesis, Kansas State University, 2012. http://hdl.handle.net/2097/13733.
Full textDepartment of Computing and Information Sciences
Doina Caragea
There is an increasing amount of user-generated information in online documents, including user opinions on various topics and products such as movies, DVDs, kitchen appliances, etc. To make use of such opinions, it is useful to identify the polarity of the opinion, in other words, to perform sentiment classification. The goal of sentiment classification is to classify a given text/document as either positive, negative or neutral based on the words present in the document. Supervised learning approaches have been successfully used for sentiment classification in domains that are rich in labeled data. Some of these approaches make use of features such as unigrams, bigrams, sentiment words, adjective words, syntax trees (or variations of trees obtained using pruning strategies), etc. However, for some domains the amount of labeled data can be relatively small and we cannot train an accurate classifier using the supervised learning approach. Therefore, it is useful to study domain adaptation techniques that can transfer knowledge from a source domain that has labeled data to a target domain that has little or no labeled data, but a large amount of unlabeled data. We address this problem in the context of product reviews, specifically reviews of movies, DVDs and kitchen appliances. Our approach uses an Adapted Naive Bayes classifier (ANB) on top of the Expectation Maximization (EM) algorithm to predict the sentiment of a sentence. We use grams derived from complete syntax trees or from syntax subtrees as features when training the ANB classifier. More precisely, we extract grams from syntax trees corresponding to sentences in either the source or target domains. To be able to transfer knowledge from source to target, we identify generalized features (grams) using the frequently co-occurring entropy (FCE) method, and represent the source instances using these generalized features. The target instances are represented with all grams occurring in the target, or with a reduced gram set obtained by removing infrequent grams. We experiment with different types of grams in a supervised framework in order to identify the most predictive types of gram, and further use those grams in the domain adaptation framework. Experimental results on several cross-domain tasks show that domain adaptation approaches that combine source and target data (a small amount of labeled and some unlabeled data) can help learn classifiers for the target that are better than those learned from the labeled target data alone.
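A minimal self-training sketch in the spirit of EM over naive Bayes, with an invented toy corpus (this is not the thesis's ANB/FCE implementation): fit on labeled source reviews, then alternate between labeling the target data and refitting.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    source = ["great movie", "terrible plot", "loved this film", "awful acting"]
    src_y = [1, 0, 1, 0]                       # 1 = positive, 0 = negative
    target = ["this blender is great", "awful kitchen appliance", "loved it"]

    vec = CountVectorizer().fit(source + target)
    nb = MultinomialNB().fit(vec.transform(source), src_y)
    for _ in range(5):                         # EM-style iterations
        pseudo = nb.predict(vec.transform(target))                   # E-step
        nb.fit(vec.transform(source + target), src_y + list(pseudo)) # M-step
    print(nb.predict(vec.transform(["great appliance"])))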
Lindberg, Jennifer, and Henrik Siren. "An implementation analysis of a machine learning algorithm on eye-tracking data in order to detect early signs of dementia." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-281298.
This study aims to investigate whether it is possible to apply a machine learning algorithm to eye-tracking data in order to detect early signs of Alzheimer's disease, a type of dementia. Early signs of Alzheimer's are characterized by mild cognitive impairment, and patients with mild cognitive impairment fixate more when reading. Eye-tracking data was collected in examinations conducted by specialist physicians at a hospital, in which 24 patients read a text. The data was then preprocessed by extracting various features, such as fixations and the difficulty levels of specific passages in the text. These features were then fed into a naïve Bayes machine learning algorithm using leave-one-out cross-validation, in two separate settings: using only fixation features, and using both fixation features and passage difficulty features. The same result was obtained in both cases, an accuracy of 64%. It was therefore concluded that, even though the amount of data (the number of patients) was small, the machine learning algorithm could to some extent predict whether a patient was in an early stage of Alzheimer's disease based on eye-tracking data. In addition, the implementation is further analyzed using a stakeholder analysis, a SWOT analysis, and from an innovation perspective.
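A small sketch, assuming Gaussian naive Bayes on per-patient feature vectors (the features below are random stand-ins, not the actual eye-tracking data), of the leave-one-out cross-validation setup the study describes:

    import numpy as np
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X = rng.normal(size=(24, 4))        # 24 patients, 4 invented features
    y = rng.integers(0, 2, size=24)     # 1 = early signs, 0 = control

    acc = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())
    print("LOO accuracy:", acc.mean())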
Wedenberg, Kim, and Alexander Sjöberg. "Online inference of topics : Implementation of the topic model Latent Dirichlet Allocation using an online variational bayes inference algorithm to sort news articles." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-222429.
Jaradat, Shatha. "OLLDA: Dynamic and Scalable Topic Modelling for Twitter : AN ONLINE SUPERVISED LATENT DIRICHLET ALLOCATION ALGORITHM." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177535.
Providing high-quality topic inference in today's large and dynamic corpora, such as Twitter, is a challenging task. This is particularly challenging given that the content in this environment consists of short texts and many abbreviations. The project proposes an improvement of a popular online topic modelling algorithm for Latent Dirichlet Allocation (LDA) by incorporating supervision to make it suitable for the Twitter context. This improvement is motivated by the need for a single algorithm that achieves both objectives: analyzing large amounts of documents, including new documents arriving in a stream, while at the same time achieving high quality of the topics discovered in special-case environments such as Twitter. The proposed algorithm is a combination of an online algorithm for LDA and a supervised variant of LDA, Labeled LDA. The performance and quality of the proposed algorithm are compared with these two algorithms. The results show that the proposed algorithm outperformed the supervised variant of LDA in both performance and quality, and achieved better results in terms of quality than the online algorithm. These improvements make our algorithm an attractive option when applied to dynamic environments such as Twitter. An environment for analyzing and labelling data was designed to prepare the dataset before performing the experiments. Possible uses of the proposed algorithm are tweet recommendation and trend detection.
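For context, a hedged sketch of online (mini-batch) variational Bayes LDA using scikit-learn's implementation; the theses above build on this family of algorithms but use neither this library nor this exact code, and the documents are invented.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["stock markets fall", "team wins final", "markets rally again",
            "coach praises team", "investors fear losses", "final goes to extra time"]
    vec = CountVectorizer()
    X = vec.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                    random_state=0)
    for start in range(0, X.shape[0], 2):       # stream documents in mini-batches
        lda.partial_fit(X[start:start + 2])
    terms = vec.get_feature_names_out()
    for k, comp in enumerate(lda.components_):
        top = comp.argsort()[-3:][::-1]         # three highest-weight words
        print("topic", k, [terms[i] for i in top])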
Mathema, Najma. "Predicting Plans and Actions in Two-Player Repeated Games." BYU ScholarsArchive, 2020. https://scholarsarchive.byu.edu/etd/8683.
Nieuwenhoff, Nathalia. "Uma comparação da aplicação de métodos computacionais de classificação de dados aplicados ao consumo de cinema no Brasil." Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/100/100131/tde-01062017-085136/.
Machine learning techniques for data classification or categorization are increasingly being used to extract information or patterns from voluminous databases in various application areas. At the same time, applying these computational methods to identify patterns and to classify data related to the consumption of information goods is considered a complex task, since such consumption decision patterns are related to the preferences of individuals and depend on a composition of individual, cultural, economic and social characteristics, segregated and grouped; furthermore, this is not yet an explored topic in the Brazilian market. In this context, this work performed an experimental study applying the Knowledge Discovery in Databases (KDD) process, which includes data selection and data mining steps, to a binary classification problem: Brazilian individuals who do or do not consume an information good, films at movie theaters in Brazil, using the microdata of the Brazilian Household Budget Survey (POF), 2008-2009, carried out by the Brazilian Institute of Geography and Statistics (IBGE). The experimental study resulted in a comparative analysis of two supervised machine learning techniques for data classification, Naïve Bayes (NB) and Support Vector Machine (SVM). Initially, a systematic review aimed at identifying studies related to the application of machine learning techniques to the classification and identification of consumption patterns indicated that their use in this context is not a mature and developed research topic, since it was not studied in any of the papers analyzed. The results obtained from the comparative analysis between the algorithms suggest that the choice of machine learning algorithm for data classification is directly related to factors such as: (i) the importance of the classes for the problem under study; (ii) the balance between classes; (iii) the universe of attributes to be considered, in terms of their number and degree of importance to the classifiers. In addition, the attributes selected by the Information Gain feature selection algorithm suggest that the decision to consume culture, more specifically the information good film at movie theaters, is directly related to aspects of individuals regarding income and educational level, as well as preferences for cultural goods.
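Attribute ranking by information gain, estimated here with scikit-learn's mutual information on a stock dataset (an illustration only, not the POF microdata), can be sketched as:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import mutual_info_classif

    X, y = load_breast_cancer(return_X_y=True)
    gain = mutual_info_classif(X, y, random_state=0)  # information-gain-style scores
    order = np.argsort(gain)[::-1]
    print("top 5 attributes:", order[:5], gain[order[:5]])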
Mauuary, Didier. "Détection, estimation et identification pour la tomographie acoustique océanique : étude théorique et expérimentale." Grenoble INPG, 1994. http://www.theses.fr/1994INPG0033.
Borgos, Hilde Grude. "Stochastic Modeling and Statistical Inference of Geological Fault Populations and Patterns." Doctoral thesis, Norwegian University of Science and Technology, Department of Mathematical Sciences, 2000. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-503.
The focus of this work is on faults, and the main issue is statistical analysis and stochastic modeling of faults and fault patterns in petroleum reservoirs. The thesis consists of Parts I-V and Appendices A-C, and the units can be read independently. Part III is written for a geophysical audience, and its topic is fault and fracture size-frequency distributions. The remaining parts are written for a statistical audience, but can also be read by people with an interest in quantitative geology. The topic of Parts I and II is statistical model choice for fault size distributions, with a sampling algorithm for estimating Bayes factors. Part IV describes work on spatial modeling of fault geometry, and Part V is a short note on line partitioning. Parts I, II and III constitute the main part of the thesis. The appendices are conference abstracts and papers based on Parts I and IV.
Paper III: reprinted with kind permission of the American Geophysical Union. An edited version of this paper was published by AGU. Copyright [2000] American Geophysical Union
Petřík, Patrik. "Predikce vývoje akciového trhu prostřednictvím technické a psychologické analýzy." Master's thesis, Vysoké učení technické v Brně. Fakulta podnikatelská, 2010. http://www.nusl.cz/ntk/nusl-222507.
Sengupta, Aritra. "Empirical Hierarchical Modeling and Predictive Inference for Big, Spatial, Discrete, and Continuous Data." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1350660056.
Lacasse, Alexandre. "Bornes PAC-Bayes et algorithmes d'apprentissage." Thesis, Université Laval, 2010. http://www.theses.ulaval.ca/2010/27635/27635.pdf.
The main purpose of this thesis is the theoretical study and the design of learning algorithms returning majority-vote classifiers. In particular, we present a PAC-Bayes theorem allowing us to bound the variance of the Gibbs loss (not only its expectation). We deduce from this theorem a bound on the risk of a majority vote tighter than the famous bound based on the Gibbs risk. We also present a theorem that allows us to bound the risk associated with general loss functions. From this theorem, we design learning algorithms building weighted majority-vote classifiers minimizing a bound on the risk associated with the following loss functions: linear, quadratic and exponential. We also present algorithms based on the randomized majority vote. Some of these algorithms compare favorably with AdaBoost.
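For context, the classical PAC-Bayes bound this line of work builds on can be stated as follows (the Langford-Seeger form; this is standard background, not the thesis's new theorem). With probability at least 1 - \delta over a sample S of size m, for a prior P and any posterior Q:

    \[
      \mathrm{kl}\!\left( R_S(G_Q) \,\|\, R_D(G_Q) \right)
      \;\le\; \frac{\mathrm{KL}(Q \,\|\, P) + \ln \frac{2\sqrt{m}}{\delta}}{m},
    \]

where R_S(G_Q) and R_D(G_Q) are the empirical and true risks of the Gibbs classifier G_Q, and kl(q || p) denotes the binary relative entropy.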
Vau, Bernard. "Algorithmes d’identification par régression pseudo-linéaire avec prédicteurs paramétrisés sur des bases généralisées de fonctions de transfert orthonormales." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLN062.
This thesis deals with the identification of linear time-invariant systems described by discrete-time transfer functions. For a given order, contrary to identification methods that explicitly minimize the prediction error variance, algorithms based on pseudo-linear regression produce models whose bias distribution depends on the predictor parametrization. This has been demonstrated through the innovative concept of the equivalent prediction error, a signal that is in general non-measurable and whose variance is effectively minimized by the pseudo-linear regression. In a second step, revisited versions of recursive algorithms are proposed (output error, extended least squares, and their closed-loop equivalents), whose predictors are expressed on generalized bases of orthonormal transfer functions introduced by Heuberger et al. in the 1990s and 2000s. Selecting the basis poles is equivalent to defining the reproducing kernel of the Hilbert space associated with these functions, and to imposing how approximation is achieved by the algorithms. A particular expression of this reproducing kernel is used to introduce an indicator of the effect of the basis poles on the model fit in the frequency domain; this indicator plays a great role from a heuristic point of view. Finally, a validation test consistent with these algorithms is proposed, and its statistical properties are given. This set of algorithms provides the user with simple tuning parameters (the basis poles) that can be selected according to the implicit purpose assigned to the identification procedure. Obtaining reduced-order models is made easier, while identification of stiff systems, until now impossible in discrete time, becomes accessible.
Chang, Eun R. "Grobner Bases and Ideals of Points." VCU Scholars Compass, 2006. http://scholarscompass.vcu.edu/etd/1214.
Robert, Valérie. "Classification croisée pour l'analyse de bases de données de grandes dimensions de pharmacovigilance." Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLS111/document.
This thesis gathers methodological contributions to the statistical analysis of large datasets in pharmacovigilance. Pharmacovigilance datasets produce sparse and large matrices, and these two characteristics are the main statistical challenges for modelling them. The first part of the thesis is dedicated to the coclustering of the pharmacovigilance contingency table using the normalized Poisson latent block model. The objective is, on the one hand, to provide pharmacologists with interesting, reduced areas to explore more precisely; on the other hand, this coclustering remains useful background information for dealing with the individual database. Within this framework, a parameter estimation procedure for this model is detailed and objective model selection criteria are developed to choose the best-fitting model. The datasets are so large that we propose a procedure to explore the model space in coclustering in a non-exhaustive but relevant way. Additionally, to assess the performance of the methods, a convenient coclustering index is developed to compare partitions with high numbers of clusters. The development of these statistical tools is not specific to pharmacovigilance, and they can be used for any coclustering issue. The second part of the thesis is devoted to the statistical analysis of the large individual data, which are more numerous but also provide even more valuable information. The aim is to produce clusters of individuals according to their drug profiles, and subgroups of drugs and adverse effects with possible links, which overcomes the coprescription and masking phenomena, common contingency table issues in pharmacovigilance. Moreover, the interaction between several adverse effects is taken into account. For this purpose, we propose a new model, the multiple latent block model, which enables two binary tables to be coclustered by imposing the same row ranking. Assertions inherent to the model are discussed and sufficient identifiability conditions for the model are presented. Then a parameter estimation algorithm is studied and objective model selection criteria are developed. Moreover, a numerical simulation model of the individual data is proposed to compare existing methods and study their limits. Finally, the proposed methodology for dealing with individual pharmacovigilance data is presented and applied to a sample of the French pharmacovigilance database between 2002 and 2010.
Savulionienė, Loreta. "Association rules search in large data bases." Doctoral thesis, Lithuanian Academic Libraries Network (LABT), 2014. http://vddb.library.lt/obj/LT-eLABa-0001:E.02~2014~D_20140519_102242-45613.
The influence of information technology is inseparable from modern life. Every field of activity involves the accumulation and storage of information and data. Today, traditional data processing and the generation of various reports are no longer sufficient. Applying data mining technologies makes it possible to extract new facts or knowledge from the available data, allowing activity to be forecast, for example customer behaviour or financial trends, diseases to be diagnosed, and so on. The dissertation examines data mining algorithms for discovering frequent subsequences and association rules. It proposes a new stochastic frequent subsequence search algorithm, its modifications SDPA1 and SDPA2, and a stochastic association rule discovery algorithm, together with an evaluation of the errors of these algorithms. The algorithms are approximate, but they make it possible to balance two important criteria, time and accuracy. They were tested using real and simulated databases.
Ben, Mohamed Khalil. "Traitement de requêtes conjonctives avec négation : algorithmes et expérimentations." Phd thesis, Université Montpellier II - Sciences et Techniques du Languedoc, 2010. http://tel.archives-ouvertes.fr/tel-00563217.
Full textSTEINBRUCH, DAVID. "A STUDY OF MULTILABEL TEXT CLASSIFICATION ALGORITHMS USING NAIVE-BAYES." PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2006. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=9637@1.
Full textA quantidade de informação eletrônica vem crescendo de forma acelerada, motivada principalmente pela facilidade de publicação e divulgação que a Internet proporciona. Desta forma, é necessária a organização da informação de forma a facilitar a sua aquisição. Muitos trabalhos propuseram resolver este problema através da classificação automática de textos associando a eles vários rótulos (classificação multirótulo). No entanto, estes trabalhos transformam este problema em subproblemas de classificação binária, considerando que existe independência entre as categorias. Além disso, utilizam limiares (thresholds), que são muito específicos para o conjunto de treinamento utilizado, não possuindo grande capacidade de generalização na aprendizagem. Esta dissertação propõe dois algoritmos de classificação automática de textos baseados no algoritmo multinomial naive Bayes e sua utilização em um ambiente on-line de classificação automática de textos com realimentação de relevância pelo usuário. Para testar a eficiência dos algoritmos propostos, foram realizados experimentos na base de notícias Reuters 21758 e na base de documentos médicos Ohsumed.
The amount of electronic information has been growing fast, mainly due to the ease of publication and dissemination that the Internet provides. It is therefore necessary to organise information so as to facilitate its retrieval. Many works have addressed this problem through automatic text classification, associating several labels with each text (multilabel classification). However, those works transform the problem into binary classification subproblems, assuming there is no dependence among categories. Moreover, they use thresholds that are very specific to the training document base and so do not generalize well in the learning process. This thesis proposes two text classifiers based on the multinomial naive Bayes algorithm and their use in an online text classification environment with user relevance feedback. In order to test the efficiency of the proposed algorithms, experiments were performed on the Reuters-21578 news base and on the Ohsumed medical document base.
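A hedged sketch of multilabel text classification with multinomial naive Bayes via a one-vs-rest reduction in scikit-learn (the thesis's own two algorithms differ; the corpus below is invented):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.preprocessing import MultiLabelBinarizer

    docs = ["oil prices rise", "wheat harvest grows", "oil and wheat exports up"]
    labels = [{"crude"}, {"grain"}, {"crude", "grain"}]   # multiple labels per doc

    vec = CountVectorizer().fit(docs)
    mlb = MultiLabelBinarizer().fit(labels)
    Y = mlb.transform(labels)                             # label indicator matrix
    clf = OneVsRestClassifier(MultinomialNB()).fit(vec.transform(docs), Y)
    pred = clf.predict(vec.transform(["wheat prices rise"]))
    print(mlb.inverse_transform(pred))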
Dunn, Will-Matthis III. "Algorithms and applications of comprehensive Groebner bases." Diss., The University of Arizona, 1995. http://hdl.handle.net/10150/187274.
Enkosky, Thomas. "Grobner Bases and an Algorithm to Find the Monomials of an Ideal." Fogler Library, University of Maine, 2004. http://www.library.umaine.edu/theses/pdf/EnkoskyT2004.pdf.
Full textGermain, Pascal. "Algorithmes d'apprentissage automatique inspirés de la théorie PAC-Bayes." Thesis, Université Laval, 2009. http://www.theses.ulaval.ca/2009/26191/26191.pdf.
At first, this master's thesis presents a general PAC-Bayes theorem, from which we can easily obtain some well-known PAC-Bayes bounds. Those bounds allow us to compute a guarantee on the risk of a classifier from its performance on the training set. We analyze the behavior of two PAC-Bayes bounds and determine the peculiar characteristics of the classifiers favoured by those bounds. Then, we conceive three new machine learning algorithms based on the minimization, by conjugate gradient descent, of various mathematical expressions of the PAC-Bayes bounds. The last algorithm uses a part of the training set to capture a priori knowledge. One can use those algorithms to construct majority-vote classifiers as well as linear classifiers implicitly represented by the kernel trick. Finally, an extensive empirical study compares the three algorithms and shows that some versions of them are competitive with both AdaBoost and the SVM.
Inscribed on the Tableau d'honneur (honour roll) of the Faculté des études supérieures.
Cabarcas, Daniel. "An Implementation of Faugère's F4 Algorithm for Computing Gröbner Bases." University of Cincinnati / OhioLINK, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1277120935.
Dieng, Cheikh Tidiane. "Etude et implantation de l'extraction de requetes frequentes dans les bases de donnees multidimensionnelles." Thesis, Cergy-Pontoise, 2011. http://www.theses.fr/2011CERG0530.
The problem of mining frequent queries in a database has motivated many research efforts during the last two decades. This is so because many interesting patterns, such as association rules, exact or approximate functional dependencies and exact or approximate conditional functional dependencies, can easily be retrieved from them, which is not possible using standard techniques. However, mining frequent queries in a relational database is not easy because, on the one hand, the size of the search space is huge (it encompasses all possible queries that can be addressed to a given database), and on the other hand, testing whether two queries are equivalent (which entails redundant support computations) is NP-complete. In this thesis, we focus on projection-selection-join queries, assuming that the database is defined over a star schema. In this setting, we define a pre-ordering (≼) between queries and we prove the following basic properties: 1. the support measure is anti-monotonic with respect to ≼, and 2. defining q ≡ q′ if and only if q ≼ q′ and q′ ≼ q, all equivalent queries have the same support. The main contributions of the thesis are, on the one hand, to formally study properties of the pre-ordering and the equivalence relation mentioned above, and on the other hand, to propose a levelwise, Apriori-like algorithm for the computation of all frequent queries in a relational database defined over a star schema. Moreover, this algorithm has been implemented, and the reported experiments show that, in our approach, runtime is acceptable, even in the case of large fact tables.
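As an analogy for the levelwise mining described above, here is a compact Apriori-style frequent itemset miner (generic illustration code, not the thesis's frequent-query algorithm):

    from itertools import combinations

    def apriori(transactions, minsup):
        items = {i for t in transactions for i in t}
        level = [frozenset([i]) for i in items]
        frequent = {}
        while level:
            # count the support of each candidate at this level
            counts = {c: sum(c <= t for t in transactions) for c in level}
            kept = {c: n for c, n in counts.items() if n >= minsup}
            frequent.update(kept)
            # next level: unions of kept sets that are exactly one item larger,
            # pruned so that every subset one size smaller is also frequent
            cands = {a | b for a, b in combinations(list(kept), 2)
                     if len(a | b) == len(a) + 1}
            level = [c for c in cands
                     if all(frozenset(s) in kept
                            for s in combinations(c, len(c) - 1))]
        return frequent

    ts = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
    print(apriori(ts, minsup=2))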
Keller, Benjamin J. "Algorithms and Orders for Finding Noncummutative Gröbner Bases." Diss., Virginia Tech, 1997. http://hdl.handle.net/10919/30506.
Ph. D.
Larrieu, Robin. "Arithmétique rapide pour des corps finis." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLX073/document.
The multiplication of polynomials is a fundamental operation in complexity theory. Indeed, for many arithmetic problems, the complexity of algorithms is expressed in terms of the complexity of polynomial multiplication; for example, the complexity of Euclidean division or of multi-point evaluation/interpolation (among others) is often expressed this way. This shows that a better multiplication algorithm allows the corresponding operations to be performed faster. A 2016 result gave an improved asymptotic complexity for the multiplication of polynomials over finite fields. This article is the starting point of the thesis; the present work aims to study the implications of the new complexity bound, from a theoretical and practical point of view. The first part focuses on the multiplication of univariate polynomials. It presents two new algorithms that should make the computation faster in practice (rather than asymptotically speaking). While it is difficult in general to observe the predicted speed-up, some specific cases are particularly favorable. In fact, the second proposed algorithm, which is specific to finite fields, leads to a better implementation of multiplication in F_2[X], about twice as fast as state-of-the-art software. The second part deals with the arithmetic of multivariate polynomials modulo an ideal, as considered typically for polynomial system solving. This work assumes a simplified situation, with only two variables and under certain regularity assumptions. In this particular case, there are algorithms whose complexity is asymptotically optimal (up to logarithmic factors) with respect to input/output size. The implementation for this specific case is then significantly faster than general-purpose software, the speed-up becoming larger and larger as the input size increases.
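A toy sketch of arithmetic in F_2[X] (not the thesis's algorithm): pack the coefficients into the bits of a Python integer and multiply by carry-less shift-and-XOR.

    def mul_f2x(a: int, b: int) -> int:
        """Carry-less product of two F_2[X] polynomials given as bitmasks."""
        res = 0
        while b:
            if b & 1:
                res ^= a        # add (XOR) a shifted copy of a
            a <<= 1
            b >>= 1
        return res

    # (x^2 + x + 1) * (x + 1) = x^3 + 1  ->  0b111 * 0b11 = 0b1001
    print(bin(mul_f2x(0b111, 0b11)))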
Gillet, Noel. "Optimisation de requêtes sur des données massives dans un environnement distribué." Thesis, Bordeaux, 2017. http://www.theses.fr/2017BORD0553/document.
Distributed data stores are massively used in the current Big Data context. In addition to providing data management features, those systems have to deal with an increasing amount of queries sent by distant users in order to perform data mining or data visualization operations. One of the main challenges is to evenly distribute the query workload between the nodes composing the system in order to minimize treatment time. In this thesis, we tackle the problem of query allocation in a distributed environment. We consider that data are replicated and that a query can be handled only by a node storing the concerned data. First, near-optimal algorithmic proposals are given when communications between nodes are asynchronous; we also consider that some nodes can be faulty. Second, we study more deeply the impact of data replication on query treatment. In particular, we present an algorithm which manages data replication based on the demand for these data; combined with our allocation algorithm, it guarantees a near-optimal allocation. Finally, we focus on the impact of data replication when queries are received as a stream by the system. We conduct an experimental evaluation using the distributed database Apache Cassandra. The experiments confirm the interest of our algorithmic proposals for improving query treatment compared to the native allocation scheme in Cassandra.
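A hedged sketch of the allocation problem described above, with an invented replica placement (this greedy rule is a baseline, not the thesis's near-optimal algorithm): each query may only go to a node holding a replica of its data, and we pick the least-loaded eligible node.

    # nodes eligible for each query (i.e., holding a replica of its data)
    replicas = {"q1": ["n1", "n2"], "q2": ["n2", "n3"], "q3": ["n1", "n3"],
                "q4": ["n2", "n3"], "q5": ["n1", "n2"]}
    load = {"n1": 0, "n2": 0, "n3": 0}

    for query, nodes in replicas.items():
        target = min(nodes, key=load.__getitem__)  # least-loaded eligible node
        load[target] += 1
        print(query, "->", target)
    print("final load:", load)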
Collobert, Ronan. "Algorithmes d'Apprentissage pour grandes bases de données." Paris 6, 2004. http://www.theses.fr/2004PA066063.
Bender, Matias Rafael. "Algorithms for sparse polynomial systems : Gröbner bases and resultants." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS029.
Solving polynomial systems is one of the oldest and most important problems in computational mathematics and has many applications in several domains of science and engineering. It is an intrinsically hard problem with complexity at least single exponential in the number of variables. However, in most cases, the polynomial systems coming from applications have some kind of structure. In this thesis we focus on exploiting the structure related to the sparsity of the supports of the polynomials; that is, we exploit the fact that the polynomials only have a few monomials with non-zero coefficients. Our objective is to solve the systems faster than the worst-case estimates that assume all terms are present. We say that a sparse system is unmixed if all its polynomials have the same Newton polytope, and mixed otherwise. Most of the work on solving sparse systems concerns the unmixed case, with the exceptions of mixed sparse resultants and homotopy methods. In this thesis, we develop algorithms for mixed systems. We use two prominent tools in nonlinear algebra: sparse resultants and Gröbner bases. We work on each theory independently, but we also combine them to introduce new algorithms: we take advantage of the algebraic properties of the systems associated with a non-vanishing resultant to improve the complexity of computing their Gröbner bases; for example, we exploit the exactness of some strands of the associated Koszul complex to deduce an early stopping criterion for our Gröbner basis algorithms and to avoid every redundant computation (reductions to zero). In addition, we introduce quasi-optimal algorithms to decompose binary forms.
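For readers who want to experiment, here is a small Gröbner basis computation with SymPy (illustration only; the thesis develops specialized algorithms for sparse systems, not this general-purpose routine):

    from sympy import groebner, symbols

    x, y = symbols("x y")
    # lexicographic Gröbner basis of the ideal <x^2 + y^2 - 1, x*y - 2>
    G = groebner([x**2 + y**2 - 1, x*y - 2], x, y, order="lex")
    print(G)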
O'Brien, Killian Michael. "Seifert's algorithm, Châtelet bases and the Alexander ideals of classical knots." Thesis, Durham University, 2002. http://etheses.dur.ac.uk/4192/.
Shanian, Sara. "Sample Compressed PAC-Bayesian Bounds and Learning Algorithms." Thesis, Université Laval, 2012. http://www.theses.ulaval.ca/2012/29037/29037.pdf.
In classification, sample compression algorithms are algorithms that make use of the available training data to construct the set of possible predictors. If the data belong to only a small subspace of the space of all "possible" data, such algorithms have the interesting ability of considering only the predictors that distinguish examples in our areas of interest. This is in contrast with non-sample-compressed algorithms, which have to consider the set of predictors before seeing the training data. The Support Vector Machine (SVM) is a very successful learning algorithm that can be considered a sample-compression learning algorithm. Despite its success, the SVM is currently limited by the fact that its similarity function must be a symmetric positive semi-definite kernel. This limitation by design makes the SVM hardly applicable in cases where one would like to use an arbitrary similarity measure between input examples. PAC-Bayesian theory has been shown to be a good starting point for designing learning algorithms. In this thesis, we propose a PAC-Bayes sample-compression approach to kernel methods that can accommodate any bounded similarity function. We show that the support vector classifier is actually a particular case of sample-compressed classifiers known as majority votes of sample-compressed classifiers. We propose two different groups of PAC-Bayesian risk bounds for majority votes of sample-compressed classifiers. The first group of proposed bounds depends on the KL divergence between the prior and the posterior over the set of sample-compressed classifiers. The second group of proposed bounds has the unusual property of having no KL divergence when the posterior is aligned with the prior in some precise way that we define later in this thesis. Finally, for each bound, we provide a new learning algorithm that consists of finding the predictor that minimizes the bound. The computation times of these algorithms are comparable with those of algorithms like the SVM. We also empirically show that the proposed algorithms are very competitive with the SVM.
Bost, Raphaël. "Algorithmes de recherche sur bases de données chiffrées." Thesis, Rennes 1, 2018. http://www.theses.fr/2018REN1S001/document.
Searchable encryption aims at making efficient a seemingly easy task: outsourcing the storage of a database to an untrusted server while keeping search features. With the development of cloud storage services, for both private individuals and businesses, the efficiency of searchable encryption has become crucial: inefficient constructions would not be deployed on a large scale because they would not be usable. The key problem with searchable encryption is that any construction achieving "perfect security" induces a computational or communication overhead that is unacceptable for the providers or for the users, at least with current techniques and by today's standards. This thesis proposes and studies new security notions and new constructions of searchable encryption, aiming at making it more efficient and more secure. In particular, we start by considering the forward and backward privacy of searchable encryption schemes, what they imply in terms of security and efficiency, and how they can be realized. Then, we show how to protect an encrypted database user against active attacks by the cloud provider, and that such protections have an inherent efficiency cost. Finally, we take a look at existing attacks against searchable encryption and explain how we might thwart them.
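A toy illustration of the core idea (not a construction from the thesis): the server stores an index keyed by deterministic HMAC tokens, so it can answer searches without learning the keywords. The determinism that makes this efficient is exactly the kind of leakage the security/efficiency trade-off above refers to.

    import hmac, hashlib

    key = b"client-secret-key"                 # held by the client only
    def token(word: str) -> bytes:
        return hmac.new(key, word.encode(), hashlib.sha256).digest()

    # the client builds the encrypted index and uploads it to the server
    index = {token("bayes"): ["doc1", "doc3"], token("kernel"): ["doc2"]}

    # later, the client sends token("bayes"); the server just looks it up
    print(index.get(token("bayes"), []))       # -> ['doc1', 'doc3']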
Park, Chee-Hang. "Algorithmes de jointure parallèle et n-aire : application aux réseaux locaux et aux machines bases de données multiprocesseurs." Paris 6, 1987. http://www.theses.fr/1987PA066569.
Full textVial-Montpellier, Pierre. "Sur la construction des bases d'ondelettes." Aix-Marseille 3, 1993. http://www.theses.fr/1993AIX30063.
Svärd, Mikael, and Philip Rumman. "COMBATING DISINFORMATION : Detecting fake news with linguistic models and classification algorithms." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-209755.
The purpose of this study is to investigate the possibility of reliably distinguishing fabricated news from authentic news using Naive Bayes algorithms; this involves a comparative study between two different types of algorithm. The work also contains an overview of how linguistic text analysis can be used for detection, and an attempt was made to extract information using word frequencies. There is also a discussion of how the various actors and parties in business and government are affected by, and how they handle, fraud connected to fake news. The study further tries to examine what steps can be taken towards a working solution for counteracting fake news. In the end, the algorithms gave unsatisfactory results, and the word frequencies alone could not provide enough information. However, they seem potentially useful as part of a larger machinery of algorithms and strategies intended to handle disinformation.
Linfoot, Andy James. "A Case Study of A Multithreaded Buchberger Normal Form Algorithm." Diss., The University of Arizona, 2006. http://hdl.handle.net/10150/305141.
Winkleblack, Scott Kenneth. "ReGen: Optimizing Genetic Selection Algorithms for Heterogeneous Computing." DigitalCommons@CalPoly, 2014. https://digitalcommons.calpoly.edu/theses/1236.
Full textBrum, Vinicius Campista. "Análise de agrupamento e estabilidade para aquisição e validação de conhecimento em bases de dados de alta dimensionalidade." Universidade Federal de Juiz de Fora (UFJF), 2015. https://repositorio.ufjf.br/jspui/handle/ufjf/4826.
Full textApproved for entry into archive by Adriana Oliveira (adriana.oliveira@ufjf.edu.br) on 2017-06-06T14:06:19Z (GMT) No. of bitstreams: 1 viniciuscampistabrum.pdf: 846002 bytes, checksum: 5ac93812c3739c70741f6052b77b22c8 (MD5)
Made available in DSpace on 2017-06-06T14:06:19Z (GMT). No. of bitstreams: 1 viniciuscampistabrum.pdf: 846002 bytes, checksum: 5ac93812c3739c70741f6052b77b22c8 (MD5) Previous issue date: 2015-08-28
CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Clustering analysis is a descriptive and unsupervised data mining task which uses non-labeled samples in order to find natural groups, i.e. groups of closely related samples such that samples within the same cluster are more similar to each other than to samples in other clusters. Evaluation and validation are considered essential tasks within clustering analysis. These tasks comprise techniques of two kinds: unsupervised (internal) validation techniques and supervised (external) validation techniques. Recent works introduced an internal clustering validation approach to evaluate and improve clustering algorithm stability by identifying and removing samples that are considered harmful and should therefore be studied separately. Through experimentation, it was identified that this approach has two undesirable characteristics: it can remove an entire cluster from the dataset and can still decrease clustering stability. Taking these issues into account, in this work a broader approach was developed using a genetic algorithm for clustering and data stability analysis. This approach aims to increase stability, to reduce the number of samples removed, and to allow the user to control the stability analysis process, which gives the process greater applicability and reliability. The approach was evaluated using different kinds of clustering algorithms and datasets. A genotype dataset was also used for knowledge acquisition and validation. The results show the approach proposed in this work is able to increase stability and to reduce the number of samples removed. The results also suggest the use of this approach as a promising tool for knowledge acquisition and validation in genome-wide association studies (GWAS). This work presents an approach that contributes to knowledge acquisition and validation through clustering and data stability analysis.
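A hedged sketch of one common way to measure clustering stability (not the thesis's genetic-algorithm method): cluster two random subsamples, extend each clustering to the full dataset, and compare the labelings with the adjusted Rand index.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    rng = np.random.default_rng(0)
    i1 = rng.choice(len(X), size=200, replace=False)   # two random subsamples
    i2 = rng.choice(len(X), size=200, replace=False)
    a = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X[i1]).predict(X)
    b = KMeans(n_clusters=3, n_init=10, random_state=2).fit(X[i2]).predict(X)
    print("stability (ARI):", adjusted_rand_score(a, b))  # near 1 = stable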
Mahfoudi, Abdelwahab. "Contribution a l'algorithmique pour l'analyse des bases de données statistiques hétérogènes." Dijon, 1995. http://www.theses.fr/1995DIJOS009.
Cabarcas, Daniel. "Gröbner Bases Computation and Mutant Polynomials." University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1307321300.
Full textDjenderedjian, Ivan. "Bases de Gröbner minimales sur un anneau principal." Aix-Marseille 1, 2000. http://www.theses.fr/2000AIX11040.
Full textMedeiros, Camila Alves de [UNESP]. "Extração de conhecimento em bases de dados espaciais: algoritmo CHSMST+." Universidade Estadual Paulista (UNESP), 2014. http://hdl.handle.net/11449/110986.
Full textO desenvolvimento de tecnologias de coleta de informações espaciais resultou no armazenamento de um grande volume de dados, que devido à complexidade dos dados e dos respectivos relacionamentos torna-se impraticável a aplicação de técnicas tradicionais para prospecção em bases de dados espaciais. Nesse sentido, diversos algoritmos vêm sendo propostos, sendo que os algoritmos de agrupamento de dados espaciais são os que mais se destacam devido a sua alta aplicabilidade em diversas áreas. No entanto, tais algoritmos ainda necessitam superar vários desafios para que encontrem resultados satisfatórios em tempo hábil. Com o propósito de contribuir neste sentido, neste trabalho é apresentado um novo algoritmo, denominado CHSMST+, que realiza o agrupamento de dados considerando tanto a distância quanto a similaridade, o que possibilita correlacionar atributos espaciais e não espaciais. Tais tarefas são realizadas sem parâmetros de entrada e interação com usuário, o que elimina a dependência da interpretação do usuário para geração dos agrupamentos, bem como possibilita a obtenção de agrupamentos mais eficientes uma vez que os cálculos realizados pelo algoritmo são mais precisos que uma análise visual dos mesmos. Além destas técnicas, é utilizada a abordagem multithreading, que possibilitou uma redução média de 38,52% no tempo de processamento. O algoritmo CHSMST+ foi aplicado em bases de dados espaciais da área da saúde e meio ambiente, mostrando a capacidade de utilizá-lo em diferentes contextos, o que torna ainda mais relevante o trabalho realizado
The development of technologies for collecting spatial information has resulted in a large volume of stored data, and the high complexity of these data and their relationships makes conventional data mining techniques inappropriate for knowledge extraction in spatial databases. Therefore, several algorithms have been proposed, and spatial clustering algorithms stand out due to their high applicability in many fields. However, these algorithms still need to overcome many challenges to reach satisfactory results in a timely manner. In this work, we present a new algorithm, namely CHSMST+, which performs spatial clustering considering both distance and similarity, allowing spatial and non-spatial attributes to be correlated. These tasks are performed without input parameters or user interaction, eliminating the dependence on user interpretation for cluster generation and enabling clusters to be obtained more efficiently, since the calculations performed by the algorithm are more accurate than a visual analysis of them. Together with these techniques, we use a multithreading approach, which allowed an average reduction of 38.52% in processing time. The CHSMST+ algorithm was applied to spatial databases of health and environment, showing the ability to apply it in different contexts, which makes this work even more relevant.