Academic literature on the topic 'PREDICTION DATASET'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'PREDICTION DATASET.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Dissertations / Theses on the topic "PREDICTION DATASET"

1. Klus, Petr. "'The Clever Machine': a computational tool for dataset exploration and prediction." Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/482051.

Abstract:
The purpose of my doctoral studies was to develop an algorithm for large-scale analysis of protein sets. This thesis outlines the methodology and technical work performed, as well as the relevant biological cases involved in the creation of the core algorithm, the cleverMachine (CM), and its extensions multiCleverMachine (mCM) and cleverGO. The CM and mCM provide characterisation and classification of protein groups based on physico-chemical features, along with protein abundance and Gene Ontology annotation information, to perform accurate data exploration. My method provides both computational and experimental scientists with a comprehensive, easy-to-use interface for high-throughput protein sequence screening and classification.
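As a rough illustration of the kind of physico-chemical characterisation the CM performs, here is a minimal sketch that scores two protein groups on a single feature (Kyte-Doolittle hydrophobicity). The scale is a standard published one, but the helper functions and toy sequences are assumptions for illustration, not code from the thesis.

```python
# Sketch: compare two protein groups on mean hydrophobicity.
import numpy as np

# Kyte-Doolittle hydrophobicity scale (standard published values).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def mean_hydrophobicity(seq: str) -> float:
    """Average Kyte-Doolittle score over the residues of one sequence."""
    vals = [KD[a] for a in seq.upper() if a in KD]
    return float(np.mean(vals)) if vals else 0.0

def profile(group: list[str]) -> tuple[float, float]:
    """Mean and standard deviation of the feature across a protein group."""
    scores = np.array([mean_hydrophobicity(s) for s in group])
    return float(scores.mean()), float(scores.std())

# Toy groups; real input would be FASTA-parsed sequence sets.
group_a = ["MKTAYIAKQR", "MDEQSHQKLA"]
group_b = ["MVVLLAIIVV", "MFFIVLLLGA"]
print("group A:", profile(group_a))
print("group B:", profile(group_b))
```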
2. Clayberg, Lauren (Lauren W.). "Web element role prediction from visual information using a novel dataset." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/132734.

Abstract:
Machine learning has enhanced many existing tech industries, including end-to-end test automation for web applications. One of the many goals that mabl and other companies have in this new tech initiative is to automatically gain insight into how web applications work. The task of web element role prediction is vital for the advancement of this newly emerging product category. I applied supervised visual machine learning techniques to the task. In addition, I created a novel dataset and present detailed attribute distribution and bias information. The dataset is used to provide updated baselines for performance using current-day web applications, and a novel metric is provided to better quantify the performance of these models. The top-performing model achieves an F1-score of 0.45 on ten web element classes. Additional findings include color distributions for different web element roles, and how some color spaces are more intuitive to humans than others.
3. Oppon, Ekow CruickShank. "Synergistic use of promoter prediction algorithms: a choice of small training dataset?" Thesis, University of the Western Cape, 2000. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_8222_1185436339.

Abstract:
Promoter detection, especially in prokaryotes, has always been an uphill task and may remain so, because of the many varieties of sigma factors employed by various organisms in transcription. The situation is made more complex by the fact that any seemingly unimportant sequence segment may be turned into a promoter sequence by an activator or repressor (if the actual promoter sequence is made unavailable). Nevertheless, a computational approach to promoter detection has to be pursued for a number of reasons. The obvious one that comes to mind is the long and tedious process involved in elucidating promoters in the 'wet' laboratories, not to mention the financial aspect of such endeavours. Promoter detection/prediction for an organism with few characterised promoters (M. tuberculosis), as envisaged at the beginning of this work, was never going to be easy. Even for the few known Mycobacterial promoters, most of the respective sigma factors associated with their transcription were not known. If that promoter-sigma information were available, the research would have been focused on categorising the promoters according to sigma factors and training the methods on the respective categories, assuming there would be enough training data for each category. Most promoter detection/prediction studies have been carried out on E. coli because of the availability of a number of experimentally characterised promoters (approximately 310). Even then, no researcher to date has extended the research to the entire E. coli genome.
4. Vandehei, Bailey R. "Leveraging Defects Life-Cycle for Labeling Defective Classes." DigitalCommons@CalPoly, 2019. https://digitalcommons.calpoly.edu/theses/2111.

Abstract:
Data from software repositories are a very useful asset for building different kinds of models and recommender systems aimed at supporting software developers. Specifically, the identification of likely defect-prone files (i.e., classes in object-oriented systems) helps in prioritizing testing and analysis activities. This work focuses on automated methods for labeling a class in a version as defective or not. The most used methods for automated class labeling belong to the SZZ family and fail in various circumstances. Thus, recent studies suggest the use of the affected version (AV) as provided by developers and available in issue trackers such as JIRA. However, in many circumstances the AV might not be usable because it is unavailable or inconsistent. The aim of this study is twofold: 1) to measure AV availability and consistency in open-source projects, and 2) to propose, evaluate, and compare to SZZ a new method for labeling defective classes, based on the idea that defects have a stable life-cycle in terms of the proportion of versions needed to discover and to fix the defect. Results related to 212 open-source projects from the Apache ecosystem, featuring a total of about 125,000 defects, show that the AV cannot be used in the majority (51%) of defects; therefore, it is important to investigate automated methods for labeling defective classes. Results related to 76 open-source projects from the Apache ecosystem, featuring a total of about 6,250,000 classes that are affected by 60,000 defects and spread over 4,000 versions and 760,000 commits, show that the proposed method for labeling defective classes is, on average among projects and defects, more accurate in terms of Precision, Kappa, F1, and MCC than all previously proposed SZZ methods. Moreover, the improvement in accuracy from combining SZZ with defects' life-cycle information is statistically significant but practically irrelevant; overall and on average, labeling via defects' life-cycle is more accurate than any SZZ method.
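The life-cycle idea above lends itself to a simple sketch: if past defects were, on average, introduced at a stable proportion of the way between the first release and the fixing release, the earliest affected version of a new defect can be estimated from that proportion. The data layout and formula below are hedged assumptions for illustration, not the thesis's exact method.

```python
# Sketch: estimate the earliest affected version of a new defect from the
# average injected/fixed proportion observed in past defects.

def estimate_proportion(past_defects):
    """Average relative position of the injected version.

    Each past defect is an (injected_idx, fixed_idx) pair over an ordered
    list of release indexes; returns the mean injected/fixed ratio."""
    ratios = [inj / fix for inj, fix in past_defects if fix > 0]
    return sum(ratios) / len(ratios)

def estimate_affected_version(fixed_idx, p):
    """Estimated index of the earliest affected version for a new defect."""
    return max(0, round(p * fixed_idx))

past = [(2, 8), (1, 5), (3, 9)]   # toy (injected, fixed) release indexes
p = estimate_proportion(past)
print(estimate_affected_version(fixed_idx=10, p=p))
```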
5. Sousa, Massáine Bandeira e. "Improving accuracy of genomic prediction in maize single-crosses through different kernels and reducing the marker dataset." Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/11/11137/tde-07032018-163203/.

Abstract:
In plant breeding, genomic prediction (GP) can be an efficient tool to increase the accuracy of selecting genotypes, particularly in multi-environment trials. This approach has the advantage of increasing genetic gains for complex traits and reducing costs. However, strategies are needed to increase the accuracy and reduce the bias of genomic estimated breeding values. In this context, the objectives were: i) to compare two strategies for obtaining marker subsets based on marker effects, regarding their impact on the prediction accuracy of genomic selection; and ii) to compare the accuracy of four GP methods including genotype × environment (G×E) interaction and two kernels (GBLUP and Gaussian). We used a rice diversity panel (RICE) and two maize datasets (HEL and USP), evaluated for grain yield and plant height. Overall, prediction accuracy and the relative efficiency of genomic selection increased when marker subsets were used, which has the potential to support building fixed arrays and to reduce genotyping costs. Furthermore, using the Gaussian kernel and including the G×E effect increased the accuracy of the genomic prediction models.
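The two kernels compared in this thesis can be sketched compactly: a linear GBLUP genomic relationship matrix versus a Gaussian kernel over marker distances, either of which can be plugged into kernel ridge regression. The marker coding, bandwidth heuristic, and toy data below are assumptions, not the thesis pipeline.

```python
# Sketch: GBLUP vs. Gaussian kernel for genomic prediction.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(50, 200)).astype(float)  # toy SNPs coded 0/1/2
X -= X.mean(axis=0)                                   # center marker columns

G = X @ X.T / X.shape[1]                              # linear GBLUP kernel

D2 = squareform(pdist(X, "sqeuclidean"))
theta = np.median(D2[D2 > 0])                         # common bandwidth heuristic
K = np.exp(-D2 / theta)                               # Gaussian kernel

def kernel_ridge_fit(Kmat, y, lam=1.0):
    """Solve (K + lam*I) alpha = y; in-sample predictions are K @ alpha."""
    return np.linalg.solve(Kmat + lam * np.eye(len(y)), y)

y = rng.normal(size=50)                               # toy phenotype
alpha = kernel_ridge_fit(G, y)                        # swap G for K to compare
print((G @ alpha)[:5])
```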
6. Johansson, David. "Price Prediction of Vinyl Records Using Machine Learning Algorithms." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-96464.

Abstract:
Machine learning algorithms have been used for price prediction in several application areas. Examples include real estate, the stock market, tourist accommodation, electricity, art, cryptocurrencies, and fine wine. Common approaches in prior studies are to evaluate the accuracy of predictions and compare different algorithms, such as Linear Regression or Neural Networks. There is a thriving global second-hand market for vinyl records, but research on price prediction in this area is very limited. The purpose of this project was to build on existing knowledge of price prediction in general and evaluate some aspects of price prediction for vinyl records, including the achievable level of accuracy and the relative efficiency of algorithms. A dataset of 37,000 vinyl-record samples was created with data from the Discogs website, and multiple machine learning algorithms were evaluated in a controlled experiment. Among the conclusions drawn from the results were that the Random Forest algorithm generally produced the strongest results, that results can vary substantially between different artists or genres, and that a large share of the predictions had a good accuracy level, but a relatively small number of large errors had a considerable effect on the overall results.
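A minimal sketch of the kind of experiment described: a Random Forest regressor fitted to tabular record features and scored on held-out data. The column names and hyperparameters are assumptions; the thesis used features scraped from Discogs.

```python
# Sketch: Random Forest price prediction on toy vinyl-record features.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.DataFrame({                      # toy stand-in for the Discogs dataset
    "year": [1978, 1985, 1992, 2001, 1969, 1983],
    "want_have_ratio": [2.1, 0.4, 1.0, 0.2, 3.5, 0.8],
    "condition": [5, 3, 4, 2, 5, 3],     # graded sleeve/media condition
    "price": [45.0, 8.0, 15.0, 5.0, 120.0, 10.0],
})
X, y = df.drop(columns="price"), df["price"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, model.predict(X_te)))
```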
7. Baveye, Yoann. "Automatic prediction of emotions induced by movies." Thesis, Ecully, Ecole centrale de Lyon, 2015. http://www.theses.fr/2015ECDL0035/document.

Abstract:
Never before have movies been as easily accessible to viewers, who can enjoy anywhere the almost unlimited potential of movies for inducing emotions. Knowing in advance the emotions that a movie is likely to elicit in its viewers could thus help improve the accuracy of content delivery, video indexing, or even summarization. However, transferring this expertise to computers is a complex task, due in part to the subjective nature of emotions. The present thesis work is dedicated to the automatic prediction of emotions induced by movies, based on the intrinsic properties of the audiovisual signal. To computationally deal with this problem, a video dataset annotated along the emotions induced in viewers is needed. However, existing datasets are not public due to copyright issues or are of very limited size and content diversity. To answer this specific need, this thesis addresses the development of the LIRIS-ACCEDE dataset. The advantages of this dataset are threefold: (1) it is based on movies under Creative Commons licenses and thus can be shared without infringing copyright; (2) it is composed of 9,800 good-quality video excerpts with a large content diversity, extracted from 160 feature films and short films; and (3) the 9,800 excerpts have been ranked through a pair-wise video comparison protocol along the induced valence and arousal axes using crowdsourcing. The high inter-annotator agreement reflects that the annotations are fully consistent, despite the large diversity of raters' cultural backgrounds. Three other experiments are also introduced in this thesis. First, affective ratings were collected for a subset of the LIRIS-ACCEDE dataset in order to cross-validate the crowdsourced annotations. The affective ratings also made possible the learning of Gaussian Processes for Regression, modeling the noisiness of the measurements, to map the whole ranked LIRIS-ACCEDE dataset into the 2D valence-arousal affective space. Second, continuous ratings for 30 movies were collected in order to develop temporally relevant computational models. Finally, a last experiment was performed in order to collect continuous physiological measurements for the 30 movies used in the second experiment. The correlation between both modalities strengthens the validity of the results of the experiments. Armed with a dataset, this thesis presents a computational model to infer the emotions induced by movies. The framework builds on recent advances in deep learning and takes into account the relationship between consecutive scenes. It is composed of two fine-tuned Convolutional Neural Networks. One is dedicated to the visual modality and uses as input crops of key frames extracted from video segments, while the second is dedicated to the audio modality through the use of audio spectrograms. The activations of the last fully connected layer of both networks are concatenated to feed a Long Short-Term Memory Recurrent Neural Network, which learns the dependencies between consecutive video segments. The performance obtained by the model is compared to that of a baseline similar to previous work and shows very promising results, while reflecting the complexity of such tasks. Indeed, the automatic prediction of emotions induced by movies remains a very challenging task that is far from being solved.
8. Lamichhane, Niraj. "Prediction of Travel Time and Development of Flood Inundation Maps for Flood Warning System Including Ice Jam Scenario. A Case Study of the Grand River, Ohio." Youngstown State University / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=ysu1463789508.

9. Rai, Manisha. "Topographic Effects in Strong Ground Motion." Diss., Virginia Tech, 2015. http://hdl.handle.net/10919/56593.

Abstract:
Ground motions from earthquakes are known to be affected by the earth's surface topography. Topographic effects are the result of several physical phenomena, such as the focusing or defocusing of seismic waves reflected from a topographic feature and the interference between direct and diffracted seismic waves. This typically causes an amplification of ground motion on convex features such as hills and ridges and a de-amplification on concave features such as valleys and canyons. Topographic effects are known to be frequency dependent, and the spectral accelerations can sometimes reach high values, causing significant damage to structures located on the feature. Topographically correlated damage patterns have been observed in several earthquakes, and topographic amplifications have also been observed in several recorded ground motions. This phenomenon has also been studied extensively through numerical analyses. Even though different studies agree on the nature of topographic effects, quantifying these effects has been challenging, and the current literature has no consensus on how to predict topographic effects at a site. With population centers growing around regions of high seismicity and prominent topographic relief, such as California and Japan, quantitative estimation of these effects has become very important. In this dissertation, we address this shortcoming by developing empirical models that predict topographic effects at a site. These models are developed through an extensive empirical study of recorded ground motions from two large strong-motion datasets, namely the California small-to-medium-magnitude earthquake dataset and the global NGA-West2 dataset, and we propose topographic modification factors that quantify the expected amplification or de-amplification at a site. Developing these models required a parameterization of topography, so we developed two types of topographic parameters at each recording station. The first type is computed from the elevation data around the station and comprises parameters such as smoothed slope, smoothed curvature, and relative elevation. The second type is derived from a series of simplified 2D numerical analyses that estimate the expected 2D topographic amplification of a simple wave at the site in several different directions; these 2D amplifications are used to develop a family of parameters at each site. We study the trends in ground motion model residuals with respect to these topographic parameters to determine whether the parameters can capture topographic effects in the recorded data. We use statistical tests to determine whether the trends are significant, and perform mixed-effects regression on the residuals to develop functional forms that can be used to predict topographic effects at a site. Finally, we compare the two types of parameters and their topographic predictive power.
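The first family of topographic parameters described above can be sketched directly from gridded elevation data: smooth the DEM over a window, then take the gradient magnitude (slope), the Laplacian (curvature), and the height above the local average (relative elevation). The smoothing radius and finite differences are illustrative assumptions, not the dissertation's exact definitions.

```python
# Sketch: smoothed slope, curvature, and relative elevation from a DEM grid.
import numpy as np
from scipy.ndimage import uniform_filter

def topo_params(dem, cell=30.0, radius_cells=3):
    """Return smoothed slope, curvature, and relative elevation arrays."""
    smoothed = uniform_filter(dem, size=2 * radius_cells + 1)
    gy, gx = np.gradient(smoothed, cell)
    slope = np.hypot(gx, gy)                 # gradient magnitude (rise/run)
    gyy = np.gradient(gy, cell, axis=0)
    gxx = np.gradient(gx, cell, axis=1)
    curvature = gxx + gyy                    # Laplacian: negative on hilltops
    relative_elev = dem - smoothed           # height above the local average
    return slope, curvature, relative_elev

i, j = np.indices((64, 64))
dem = 100.0 * np.exp(-((i - 32) ** 2 + (j - 32) ** 2) / 200.0)  # toy hill
slope, curv, rel = topo_params(dem)
print(slope.max(), curv[32, 32], rel[32, 32])
```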
10. Cooper, Heather. "Comparison of Classification Algorithms and Undersampling Methods on Employee Churn Prediction: A Case Study of a Tech Company." DigitalCommons@CalPoly, 2020. https://digitalcommons.calpoly.edu/theses/2260.

Abstract:
Churn prediction is a common data mining problem that many companies face across industries. Customer churn in particular has been studied extensively within the telecommunications industry, where customer retention is low due to high market competition. Like customer churn, employee churn is very costly to a company, and without proper risk mitigation strategies profits cannot be maximized and valuable employees may leave. Because the cost of replacing an employee is far higher than the cost of retaining one, it is in any company's best interest to prioritize employee retention. This research combines machine learning techniques with undersampling to identify employees at risk of churn so that retention strategies can be implemented before it is too late. Four different classification algorithms are tested on a variety of undersampled datasets in order to find the most effective undersampling and classification method for predicting employee churn. Statistical analysis is conducted on the appropriate evaluation metrics to find the most significant methods. The results of this study can be used by the company to target individuals at risk of churn so that risk mitigation strategies can be effective in retaining valuable employees. The methods and results can be tested and applied across different industries and companies.
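A minimal sketch of the experimental design described above: undersample the majority class in the training split, fit several classifiers, and compare them on a held-out test set with F1. The specific models and imbalanced-learn's RandomUnderSampler are assumptions standing in for the thesis's four algorithms and undersampling variants.

```python
# Sketch: compare classifiers trained on an undersampled, imbalanced dataset.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data: ~90% "stay" vs ~10% "churn".
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Balance only the training split; the test split keeps real-world imbalance.
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_tr, y_tr)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_bal, y_bal)
    print(type(model).__name__, f1_score(y_te, model.predict(X_te)))
```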
